A method of prognosticating metastasis in a breast cancer patient
involves identifying differential modulation of each gene (relative
to the expression of the same genes in a normal population) in a
combination of genes selected from a group consisting of genes.
Gene expression portfolios and kits for employing the method are
further aspects of the invention.
1. A method of prognosticating metastasis in a breast cancer patient
comprising identifying differential modulation of each gene (relative
to the expression of the same genes in a normal population) in a
combination of genes selected from the group consisting of Seq.
2. The method of claim 1 wherein there is at least a 2 fold difference
in the expression of the modulated genes.
3. The method of claim 1 wherein the p-value indicating differential
modulation is less than 0.05.
4. A method of prognosticating the absence of metastasis in a breast
cancer patient comprising identifying a lack of differential modulation
of each gene (relative to the expression of the same genes in a
normal population) in a combination of genes selected from the group
consisting of Seq. ID. No. 70-97.
5. The method of claim 4 wherein there is less than a 2 fold difference
in the expression of the genes used to prognosticate relative to
the expression of same genes in a normal population.
6. The method of claim 4 wherein the p-value indicating a lack
of differential modulation is less than 0.05.
7. The method of claim 4 wherein said prognosis of the absence
of metastasis is for a five year period.
8. A diagnostic portfolio comprising isolated nucleic acid sequences,
their complements, or portions thereof of a combination of genes
selected from the group consisting of Seq. ID. No. 70-97.
9. The diagnostic portfolio of claim 8 in a matrix suitable for
identifying the differential expression of the genes contained therein.
10. The diagnostic portfolio of claim 8 wherein said matrix is
employed in a microarray.
11. The diagnostic portfolio of claim 10 wherein said microarray
is a cDNA microarray.
12. The diagnostic portfolio of claim 10 wherein said microarray
is an oligonucleotide microarray.
13. A kit for prognosticating metastasis in a breast cancer patient
comprising reagents for detecting nucleic acid sequences, their
complements, or portions thereof in a combination of genes selected
from the group consisting of Seq. ID. No. 70-97.
14. The kit of claim 13 further comprising reagents for conducting
a microarray analysis.
15. The kit of claim 14 further comprising a medium through which
said nucleic acid sequences, their complements, or portions thereof
16. The kit of claim 15 wherein said medium is a microarray.
17. The kit of claim 13 further comprising instructions.
 This application claims the benefit of U.S. Provisional
Application No. 60/368,789 filed on Mar. 29, 2002.
 The invention relates to the selection of portfolios of
 A few single gene diagnostic markers such as her-2-neu are
currently in use. Usually, however, diseases are not easily diagnosed
with molecular diagnostics for one particular gene. Multiple markers
are often required and the number of such markers that may be included
in an assay based on differential gene modulation can be large, even
in the hundreds of genes. It is desirable to group markers into
portfolios so that the most reliable results are obtained using
the smallest number of markers necessary to obtain such a result.
This is particularly true in assays that contain multiple steps
such as nucleic acid amplification steps.
SUMMARY OF THE INVENTION
 The invention is a method of prognosticating metastasis
in a breast cancer patient by identifying differential modulation
of each gene (relative to the expression of the same genes in a
normal population) in a combination of genes selected from the group
consisting of Seq. ID. No. 70-97.
 Gene expression portfolios and kits for employing the method
are further aspects of the invention.
 The methods of this invention can be used in conjunction
with any method for determining the gene expression patterns of
relevant cells as well as protein based methods of determining gene
expression. Preferred methods for establishing gene expression profiles
include determining the amount of RNA that is produced by a gene
that can code for a protein or peptide. This is accomplished by
reverse transcriptase PCR (RT-PCR), competitive RT-PCR, real time
RT-PCR, differential display RT-PCR, Northern Blot analysis and
other related tests. While it is possible to conduct these techniques
using individual PCR reactions, it is best to amplify copy DNA (cDNA)
or copy RNA (cRNA) produced from mRNA and analyze it via microarray.
A number of different array configurations and methods for their
production are known to those of skill in the art and are described
in patents such as U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752; 5,242,974;
5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807; 5,436,327;
5,472,672; 5,527,681; 5,529,756; 5,545,531; 5,554,501; 5,561,071;
5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734; and 5,700,637;
the disclosures of which are incorporated herein by reference.
 Microarray technology allows for the measurement of the
steady-state mRNA level of thousands of genes simultaneously thereby
presenting a powerful tool for identifying effects such as the onset,
arrest, or modulation of uncontrolled cell proliferation. Two microarray
technologies are currently in wide use. The first are cDNA arrays
and the second are oligonucleotide arrays. Although differences
exist in the construction of these chips, essentially all downstream
data analysis and output are the same. The products of these analyses
are typically measurements of the intensity of the signal received
from a labeled probe used to detect a cDNA sequence from the sample
that hybridizes to a nucleic acid sequence at a known location on
the microarray. Typically, the intensity of the signal is proportional
to the quantity of cDNA, and thus mRNA, expressed in the sample
cells. A large number of such techniques are available and useful.
Preferred methods for determining gene expression can be found in
U.S. Pat. Nos. 6,271,002 to Linsley, et al.; 6,218,122 to Friend,
et al.; 6,218,114 to Peck, et al.; and 6,004,755 to Wang, et al.,
the disclosure of each of which is incorporated herein by reference.
 Analysis of the expression levels is conducted by comparing
such intensities. This is best done by generating a ratio matrix
of the expression intensities of genes in a test sample versus those
in a control sample. For instance, the gene expression intensities
from a diseased tissue can be compared with the expression intensities
generated from normal tissue of the same type (e.g., diseased colon
tissue sample vs. normal colon tissue sample). A ratio of these
expression intensities indicates the fold-change in gene expression
between the test and control samples.
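The ratio comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `ratio_matrix`, the gene names, and the intensity values are all hypothetical, and the samples are assumed to be paired one-to-one between test and control.

```python
def ratio_matrix(test, control):
    """Return per-gene intensity ratios (test / control).

    `test` and `control` map a gene name to a list of intensities,
    one per sample pair. All names here are illustrative.
    """
    ratios = {}
    for gene, test_values in test.items():
        control_values = control[gene]
        ratios[gene] = [t / c for t, c in zip(test_values, control_values)]
    return ratios


# Hypothetical intensities for two genes across three paired samples.
test = {"gene_a": [400.0, 300.0, 500.0], "gene_b": [50.0, 40.0, 60.0]}
control = {"gene_a": [100.0, 150.0, 250.0], "gene_b": [100.0, 80.0, 120.0]}

ratios = ratio_matrix(test, control)
```

Each entry of `ratios` is the fold-change for one gene in one sample pair; values above 1 indicate higher intensity in the test sample and values below 1 indicate lower intensity.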
 Modulated genes are those that are differentially expressed
as up regulated or down regulated in non-normal cells. Up regulation
and down regulation are relative terms meaning that a detectable
difference (beyond the contribution of noise in the system used
to measure it) is found in the amount of expression of the genes
relative to some baseline. In this case, the baseline is the measured
gene expression of a normal cell. The genes of interest in the non-normal
cells are then either up regulated or down regulated relative to
the baseline level using the same measurement method.
 Preferably, levels of up and down regulation are distinguished
based on fold changes of the intensity measurements of hybridized
microarray probes. For example, in the case in which a 1.5 fold
or more difference is used to make such distinctions, the diseased
cell is found to yield at least 1.5 times more, or 1.5 times less
intensity than the normal cells.
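A fold-change rule of this kind might be expressed as follows. The sketch assumes that "1.5 times less" means the diseased intensity is at most 1/1.5 of the normal intensity; the function name and labels are hypothetical.

```python
def classify_modulation(diseased, normal, fold=1.5):
    """Label a gene 'up', 'down', or 'unchanged' from two intensities.

    Assumption: down regulation is taken to mean that the ratio
    diseased / normal is at most 1 / fold.
    """
    ratio = diseased / normal
    if ratio >= fold:
        return "up"
    if ratio <= 1.0 / fold:
        return "down"
    return "unchanged"
```

For instance, `classify_modulation(300.0, 100.0)` gives `"up"`, while a 10% difference in either direction is left as `"unchanged"` at the default threshold.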
 Other methods of making distinctions are available. For
example, statistical tests can be used to find the genes most significantly
different between diverse groups of samples. The Student's t-test
is an example of a robust statistical test that can be used to find
significant differences between two groups. The lower the p-value,
the more compelling the evidence that the gene is showing a difference
between the different groups. Nevertheless, since microarrays measure
more than one gene at a time, tens of thousands of statistical tests
may be performed at one time. Because of this, small p-values are
likely to be seen just by chance, and adjustments for this can be
made using a Sidak correction as well as a randomization/permutation
experiment.
 A p-value less than 0.05 by the t-test is evidence that
the gene is significantly different. More compelling evidence is
a p-value less than 0.05 after the Sidak correction is factored in.
For a large number of samples in each group, a p-value less than
0.05 after the randomization/permutation test is the most compelling
evidence of a significant difference.
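The two adjustments mentioned above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the Sidak adjustment uses the usual form 1 - (1 - p)^m for m independent tests, and the permutation test compares group means by random relabeling. The function names, the number of permutations, and the sample values are hypothetical.

```python
import random
from statistics import mean


def sidak_adjust(p_value, n_tests):
    """Sidak-adjusted p-value for one of `n_tests` independent tests."""
    return 1.0 - (1.0 - p_value) ** n_tests


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign samples to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical intensities for one gene in two groups of six samples.
diseased = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
normal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p_perm = permutation_p_value(diseased, normal)
```

A raw p-value that looks small can become unremarkable once the number of genes tested is taken into account, which is why the adjusted value is the more compelling evidence.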
 Genes can be grouped so that information obtained about
the set of genes in the group provides a sound basis for making
clinically relevant judgments such as a diagnosis, prognosis, or
treatment choice. These sets of genes make up the portfolios of
the invention. As with most diagnostic markers, it is often desirable
to use the fewest number of markers sufficient to make a correct
medical judgment. This prevents a delay in treatment pending further
analysis as well as inappropriate use of time and resources. A preferred
optimal portfolio is one that employs the fewest number of markers
for making such judgments while meeting conditions that maximize
the probability that such judgments are indeed correct. These conditions
will generally include sensitivity and specificity requirements.
In the context of microarray based detection methods, the sensitivity
of the portfolio can be reflected in the fold differences exhibited
by a gene's expression in the diseased or aberrant state relative
to the normal state. The detection of the differential expression
of a gene is sensitive if it exhibits a large fold change relative
to the expression of the gene in another state. Another aspect of
sensitivity is the ability to distinguish signal from noise. For
example, while the expression of a set of genes may show adequate
sensitivity for defining a given disease state, if the signal that
is generated by one of those genes (e.g., intensity measurements in
microarrays) is below a level that is easily distinguished from noise
in a given
setting (e.g., a clinical laboratory) then that gene should be excluded
from the optimal portfolio. A procedure for setting conditions such
as these that define the optimal portfolio can be incorporated into
the inventive methods.
 Specificity can be reflected in statistical measurements
of the correlation of the signaling of gene expression with the
condition of interest. If the differential expression of a set of
genes is observed to produce a large fold change but they do so
for a number of conditions other than the condition of interest
(e.g. multiple disease states) then the gene expression profile
for that set of genes is non-specific. Statistical measurements
of correlation of data or the degree of consistency of data such
as standard deviation, correlation coefficients, and the like can
be used as such measurements. In considering a group of genes
for inclusion in a portfolio, a small standard deviation in expression
measurements correlates with greater specificity. Genes that display
similar expression patterns may be co-regulated by an identical
factor that pushes the genes in the same direction. If this factor
is sufficient but not necessary for classifying a sample, then these
genes will fail to correctly identify a sample if the markers are
all related to this single factor. Diversification then results
in selecting as few markers as possible while covering as many of
the different optimal expression patterns contained in the data set
as possible.
 In the method of the invention, a group of genetic markers
is selected for use in diagnostic applications. These groups of
markers are "portfolios". Diagnostic applications include
the detection or identification of a disease state or condition
of a subject, determining the likelihood that a subject will contract
a given disease or condition, determining the likelihood that a
subject with a disease or condition will respond to therapy, determining
the prognosis of a subject with a disease or condition (or its likely
progression or regression), and determining the effect of a treatment
on a subject with a disease or condition. For example, the method
can be used to establish portfolios for detecting the presence or
likelihood of a subject contracting colon cancer or the likelihood
that such a subject will respond favorably to cytotoxic drugs.
 The portfolios selected by the method of the invention contain
a number and type of markers that assure accurate and precise results
and are economized in terms of the number of genes that comprise
the portfolio. The method of the invention can be used to establish
optimal gene expression portfolios for any disease, condition, or
state that is concomitant with the expression of multiple genes.
An optimal portfolio in the context of the instant invention refers
to a gene expression profile that provides an assessment of the
condition of a subject (based upon the condition for which the analysis
was undertaken) according to predetermined standards of at least
two of the following parameters: accuracy, precision, and number
of genes comprising the portfolio.
 Most preferably, the markers employed in the portfolio are
nucleic acid sequences that express mRNA ("genes"). Expression
of the markers may occur ordinarily in a healthy subject and be
more highly expressed or less highly expressed when an event that
is the object of the diagnostic application occurs. Alternatively,
expression may not occur except when the event that is the object
of the diagnostic application occurs.
 Marker attributes, features, indicia, or measurements that
can be compared to make diagnostic judgments are diagnostic parameters
used in the method. Indicators of gene expression levels are the
most preferred diagnostic parameters. Such indicators include intensity
measurements read from microarrays, as described above. Other diagnostic
parameters are also possible such as indicators of the relative
degree of methylation of the markers.
 Distinctions are made among the diagnostic parameters through
the use of mathematical/statistical values that are related to each
other. The preferred distinctions are mean signal readings indicative
of gene expression and measurements of the variance of such readings.
The most preferred distinctions are made by use of the mean of signal
ratios between different group readings (e.g., microarray intensity
measurements) and the standard deviations of the signal ratio measurements.
A great number of such mathematical/statistical values can be used
in their place such as return at a given percentile.
 A relationship among diagnostic parameter distinctions is
used to optimize the selection of markers useful for the diagnostic
application. Typically, this is done through the use of linear
or quadratic programming algorithms. However, heuristic approaches
can also be applied or can be used to supplement input data selection
or data output. The most preferred relationship is a mean-variance
relationship such as that described in Mean-Variance Analysis in
Portfolio Choice and Capital Markets by Harry M. Markowitz (Frank
J. Fabozzi Associates, New Hope, PA: 2000, ISBN: 1-883249-75-9)
which is incorporated herein by reference. The relationship is best
understood in the context of the selection of stocks for a financial
investment portfolio. This is the context for which the relationship
was developed and elucidated.
 The investor looking to optimize a portfolio of stocks can
select from a large number of possible stocks, each having a historical
rate of return and a risk factor. The mean variance method uses
a critical line algorithm of linear programming or quadratic programming
to identify all feasible portfolios that minimize risk (as measured
by variance or standard deviation) for a given level of expected
return and maximize expected return for a given level of risk. When
standard deviation is plotted against expected return an efficient
frontier is generated. Selection of stocks along the efficient frontier
results in a diversified stock portfolio optimized in terms of return
and risk.
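The idea of an efficient frontier can be illustrated with a small brute-force search. Note that this sketch is not the critical line algorithm named above: it simply enumerates long-only weight vectors on a coarse grid and keeps those that no other candidate beats on both expected return and variance. The asset returns, covariance values, grid size, and function names are all hypothetical.

```python
def portfolio_stats(weights, returns, cov):
    """Expected return and variance of a weighted portfolio."""
    expected = sum(w * r for w, r in zip(weights, returns))
    variance = sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(len(weights))
        for j in range(len(weights))
    )
    return expected, variance


def grid_weights(n_assets, steps):
    """Yield all long-only weight vectors on a grid that sum to 1."""
    def helper(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in helper(remaining - k, slots - 1):
                yield (k,) + rest
    for combo in helper(steps, n_assets):
        yield tuple(k / steps for k in combo)


def efficient_frontier(returns, cov, steps=10):
    """Return the non-dominated (return, variance, weights) portfolios."""
    candidates = [
        portfolio_stats(w, returns, cov) + (w,)
        for w in grid_weights(len(returns), steps)
    ]
    frontier = []
    for exp_i, var_i, w_i in candidates:
        # A portfolio is dominated if another one is at least as good on
        # both criteria and strictly better on one of them.
        dominated = any(
            exp_j >= exp_i and var_j <= var_i
            and (exp_j > exp_i or var_j < var_i)
            for exp_j, var_j, _ in candidates
        )
        if not dominated:
            frontier.append((exp_i, var_i, w_i))
    return sorted(frontier)


# Hypothetical expected returns and an uncorrelated covariance matrix.
returns = [0.05, 0.10, 0.15]
cov = [[0.01, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.09]]
frontier = efficient_frontier(returns, cov)
```

In this toy data the all-in position on the lowest-return asset is not on the frontier, because mixing in a second asset raises the expected return while lowering the variance. That is the diversification effect the mean variance method exploits.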
 When the mean variance relationship is used in the method
of the instant invention, diagnostic parameters such as microarray
signal intensity and standard deviation replace the return and risk
factor values used in the selection of financial portfolios. Most
preferably, when the mean variance relationship is applied, a commercial
computer software application is used, such as the "Wagner Associates
Mean-Variance Optimization Application", referred to as "Wagner
Software" throughout this specification. This software uses
functions from the "Wagner Associates Mean-Variance Optimization
Library" to determine an efficient frontier and optimal portfolios
in the Markowitz sense. Since such applications are made for financial
applications, it may be necessary to preprocess input data so that
it can conform to conventions required by the software. For example,
when Wagner Software is employed in conjunction with microarray
intensity measurements the following data transformation method
 A relationship between each gene's baseline and experimental
value must first be established. The preferred process is conducted
as follows. A baseline class is selected. Typically, this will comprise
genes from a population that does not have the condition of interest.
For example, if one were interested in selecting a portfolio of
genes that are diagnostic for breast cancer, samples from patients
without breast cancer can be used to make the baseline class. Once
the baseline class is selected, the arithmetic mean and standard
deviation are calculated for the indicator of gene expression of
each gene for baseline class samples. This indicator is typically
the fluorescent intensity of a microarray reading. The statistical
data computed is then used to calculate a baseline value of (X*Standard
Deviation+Mean) for each gene. This is the baseline reading for
the gene against which all other samples will be compared. X is a stringency
variable selected by the person formulating the portfolio. Higher
values of X are more stringent than lower. Preferably, X is in the
range of 0.5 to 3 with 2 to 3 being more preferred and 3 being most
preferred.
 Ratios between each experimental sample (those displaying
the condition of interest) versus baseline readings are then calculated.
The ratios are then transformed to base 10 logarithmic values for
ease of data handling by the software. This enables down regulated
genes to display the negative values necessary for optimization according
to the Markowitz mean-variance algorithm using the Wagner Software.
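The baseline calculation and the logarithmic transformation can be sketched as follows. This is an illustration under assumptions, not the Wagner Software's own preprocessing: the function names are hypothetical, the sample standard deviation is assumed, and intensities are assumed to be positive so that the logarithm is defined.

```python
import math
from statistics import mean, stdev


def baseline_value(baseline_intensities, x=3.0):
    """Baseline reading for one gene: X * standard deviation + mean.

    Assumption: the sample standard deviation is used.
    """
    return x * stdev(baseline_intensities) + mean(baseline_intensities)


def log_ratios(experimental_intensities, baseline):
    """Base-10 log of each experimental intensity over the baseline."""
    return [math.log10(value / baseline) for value in experimental_intensities]


# Hypothetical baseline-class intensities for a single gene.
baseline = baseline_value([90.0, 100.0, 110.0], x=3.0)
transformed = log_ratios([1300.0, 13.0], baseline)
```

With a mean of 100 and a standard deviation of 10, the baseline reading at X = 3 is 130. An experimental intensity ten times the baseline transforms to about 1, and one a tenth of the baseline transforms to about -1, so down regulated genes carry negative values.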
 The preprocessed data comprising these transformed ratios
are used as inputs in place of the asset return values that are
normally used in the Wagner Software when it is used for financial
 Once an efficient frontier is formulated, an optimized portfolio
is selected for a given input level (return) or variance that corresponds
to a point on the frontier. These inputs or variances are the predetermined
standards set by the person formulating the portfolio. Stated differently,
one seeking the optimum portfolio determines an acceptable input
level (indicative of sensitivity) or a given level of variance (indicative
of specificity) and selects the genes that lie along the efficient
frontier that correspond to that input level or variance. The Wagner
Software can select such genes when an input level or variance is
selected. It can also assign a weight to each gene in the portfolio
as it would for a stock in a stock portfolio.
 Determining whether a sample has the condition for which
the portfolio is diagnostic can be conducted by comparing the expression
of the genes in the portfolio for the patient sample with calculated
values of differentially expressed genes used to establish the portfolio.
Preferably, a portfolio value is first generated by summing the
products of the intensity value of each gene in the portfolio and
the weight assigned to that gene in the portfolio selection process.
A boundary value is then calculated by (Y*standard deviation+mean
of the portfolio value for baseline groups) where Y is a stringency
value having the same meaning as X described above. A sample having
a portfolio value greater than the boundary value of the baseline
class is then classified as having the condition. If desired, this
process can be conducted iteratively in accordance with well known
statistical methods for improving confidence levels.
 Optionally one can reiterate this process until best prediction
accuracy is obtained.
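The classification step can be sketched as a weighted sum compared against a boundary. The gene names, weights, intensities, and function names are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def portfolio_value(intensities, weights):
    """Weighted sum of a sample's gene intensities."""
    return sum(intensities[gene] * weight for gene, weight in weights.items())


def boundary_value(baseline_samples, weights, y=2.0):
    """Y * standard deviation + mean of the baseline portfolio values."""
    values = [portfolio_value(sample, weights) for sample in baseline_samples]
    return y * stdev(values) + mean(values)


def has_condition(sample, baseline_samples, weights, y=2.0):
    """True when the sample's portfolio value exceeds the boundary value."""
    return portfolio_value(sample, weights) > boundary_value(
        baseline_samples, weights, y
    )


# Hypothetical two-gene portfolio and three baseline-class samples.
weights = {"gene_a": 0.6, "gene_b": 0.4}
baseline_samples = [
    {"gene_a": 100.0, "gene_b": 100.0},
    {"gene_a": 110.0, "gene_b": 90.0},
    {"gene_a": 90.0, "gene_b": 110.0},
]
boundary = boundary_value(baseline_samples, weights, y=2.0)
```

Here the baseline portfolio values are 100, 102, and 98, so the boundary at Y = 2 is approximately 104. A sample with a portfolio value of 160 would be classified as having the condition, and one with a value of 102 would not.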
 The process of portfolio selection and characterization
of an unknown is summarized as follows:
 1. Choose baseline class
 2. Calculate the mean and standard deviation of each gene for
baseline class samples
 3. Calculate (X*Standard Deviation+Mean) for each gene.
This is the baseline reading against which all other samples will be
compared. X is a stringency variable with higher values of X being
more stringent than lower.
 4. Calculate ratio between each Experimental sample versus
baseline reading calculated in step 3.
 5. Transform ratios such that ratios less than 1 are negative
(e.g., using log base 10). (Down regulated genes now correctly have
negative values necessary for MV optimization).
 6. These transformed ratios are used as inputs in place
of the asset returns that are normally used in the software application.
 7. The software will plot the efficient frontier and return
an optimized portfolio at any point along the efficient frontier.
 8. Choose a desired return or variance on the efficient
frontier.
 9. Calculate the Portfolio's Value for each sample by summing
the products of each gene's intensity value and the weight generated
by the portfolio selection algorithm.
 10. Calculate a boundary value by adding the mean Portfolio
Value for Baseline groups to the product of Y and the Standard
Deviation of the Baseline's Portfolio Values. Samples with values
greater than this boundary value are classified as the Experimental
Class.
 11. Optionally one can reiterate this process until best
prediction accuracy is obtained.
 A second portfolio can optionally be created by reversing
the baseline and experimental calculation. This creates a new portfolio
of genes which are up-regulated in the original baseline class.
This second portfolio's value can be subtracted from the first to
create a new classification value based on multiple portfolios.
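The combined classification value can be sketched as a simple difference of two weighted sums. The weights, gene names, and function names here are hypothetical; in practice each set of weights would come from its own portfolio selection run.

```python
def weighted_value(intensities, weights):
    """Weighted sum of gene intensities for one sample."""
    return sum(intensities[gene] * w for gene, w in weights.items())


def combined_value(sample, first_weights, second_weights):
    """First portfolio's value minus the second (reversed) portfolio's value."""
    return weighted_value(sample, first_weights) - weighted_value(
        sample, second_weights
    )


# Hypothetical weights: gene_a is up-regulated in the experimental class,
# gene_c is up-regulated in the original baseline class.
first_weights = {"gene_a": 1.0}
second_weights = {"gene_c": 1.0}
sample = {"gene_a": 300.0, "gene_c": 120.0}

value = combined_value(sample, first_weights, second_weights)
```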
 Another useful method of pre-selecting genes from gene expression
data so that it can be used as input for a process for selecting
a portfolio is based on a threshold given by
$1 \le \frac{\mu_t - \mu_n}{\sigma_t + \sigma_n}$,
where $\mu_t$ is the mean of the subset known to possess
the disease or condition, $\mu_n$ is the mean of the subset of
normal samples, and $\sigma_t + \sigma_n$ represents the combined
standard deviations. A signal to noise cutoff can also be used by
pre-selecting the data according to a relationship such as
$0.5 \le \frac{\mu_t - \mathrm{MAX}_n}{\sigma_t + \sigma_n}$.
This ensures that genes that are pre-selected based on
their differential modulation are differentiated in a clinically
significant way. That is, above the noise level of instrumentation
appropriate to the task of measuring the diagnostic parameters.
For each marker pre-selected according to these criteria, a matrix
is established in which columns represent samples, rows represent
markers and each element is a normalized intensity measurement for
the expression of that marker according to the relationship:
$\frac{\mu_t - I}{\sigma_t}$
 where $I$ is the intensity measurement.
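The pre-selection and normalization steps might be sketched as follows. This assumes the thresholds are read as (mu_t - mu_n) / (sigma_t + sigma_n) >= 1 and (mu_t - MAX_n) / (sigma_t + sigma_n) >= 0.5, that MAX_n is the largest intensity among the normal samples, and that the normalized value is (mu_t - I) / sigma_t. Those readings, the function names, and the data are assumptions made for illustration.

```python
from statistics import mean, stdev


def separation_score(tumor, normal):
    """(mu_t - mu_n) / (sigma_t + sigma_n) for one gene."""
    return (mean(tumor) - mean(normal)) / (stdev(tumor) + stdev(normal))


def signal_to_noise_score(tumor, normal):
    """(mu_t - MAX_n) / (sigma_t + sigma_n) for one gene."""
    return (mean(tumor) - max(normal)) / (stdev(tumor) + stdev(normal))


def preselect(data, score, threshold):
    """Keep genes whose score meets the threshold.

    `data` maps a gene name to (tumor intensities, normal intensities).
    """
    return [g for g, (t, n) in data.items() if score(t, n) >= threshold]


def normalized_row(intensities, tumor):
    """Normalized intensities (mu_t - I) / sigma_t for one gene."""
    mu_t, sigma_t = mean(tumor), stdev(tumor)
    return [(mu_t - i) / sigma_t for i in intensities]


# Hypothetical intensities: gene_a separates the groups, gene_b does not.
data = {
    "gene_a": ([200.0, 220.0, 240.0], [100.0, 110.0, 120.0]),
    "gene_b": ([100.0, 120.0, 140.0], [100.0, 110.0, 120.0]),
}
kept = preselect(data, separation_score, 1.0)
kept_snr = preselect(data, signal_to_noise_score, 0.5)
row = normalized_row(data["gene_a"][0], data["gene_a"][0])
```

Raising the threshold passed to `preselect` makes the pre-selection more stringent and so shrinks the set of genes passed on to the optimization.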
 Using this process of creating input for financial portfolio
software also allows one to set additional boundary conditions
to define the optimal portfolios. For example, portfolio size can
be limited to a fixed range or number of markers. This can be done
either by making data pre-selection criteria more stringent (e.g.,
$0.8 \le \frac{\mu_t - \mathrm{MAX}_n}{\sigma_t + \sigma_n}$
 instead of $0.5 \le \frac{\mu_t - \mathrm{MAX}_n}{\sigma_t + \sigma_n}$)
 or by using programming features such as restricting portfolio
size. One could, for example, set the boundary condition that the
efficient frontier is to be selected from among only the optimal
10 genes. One could also use all of the genes pre-selected for determining
the efficient frontier and then limit the number of genes selected
(e.g., no more than 10).
 The process of selecting a portfolio can also include the
application of heuristic rules. Preferably, such rules are formulated
based on biology and an understanding of the technology used to
produce clinical results. More preferably, they are applied to output
from the optimization method. For example, the mean variance method
of portfolio selection can be applied to microarray data for a number
of genes differentially expressed in subjects with breast cancer.
Output from the method would be an optimized set of genes that could
include some genes that are expressed in peripheral blood as well
as in diseased breast tissue. If samples used in the testing method
are obtained from peripheral blood and certain genes differentially
expressed in instances of breast cancer could also be differentially
expressed in peripheral blood, then a heuristic rule can be applied
in which a portfolio is selected from the efficient frontier excluding
those that are differentially expressed in peripheral blood. Of
course, the rule can be applied prior to the formation of the efficient
frontier by, for example, applying the rule during data pre-selection.
 Other heuristic rules can be applied that are not necessarily
related to the biology in question. For example, one can apply the
rule that only a given percentage of the portfolio can be represented
by a particular gene or genes. Commercially available software such
as the Wagner Software readily accommodates these types of heuristics.
This can be useful, for example, when factors other than accuracy
and precision (e.g., anticipated licensing fees) have an impact
on the desirability of including one or more genes.
 Other relationships aside from the mean-variance relationship
can be used in the method of the invention provided that they optimize
the portfolio according to predetermined attributes such as assay
accuracy and precision. Two examples are the Martin simultaneous
equation approach (Elton, Edwin J. and Martin J. Gruber (1987),
Modern Portfolio Theory and Investment Analysis, Third Edition, John
Wiley, New York, 1987) and Genetic Algorithms (Davis, L., (1989),
Adapting Operator Probabilities in Genetic Algorithms, in Proceedings
of the Third International Conference on Genetic Algorithms, Morgan
Kaufmann: San Mateo, pp. 61-69). There are also many ways to adapt
the mean-variance relationship to handle skewed data such as where
a marker detection technology exhibits a known bias. These include,
for example, the Semi-Deviation method, which uses the square root
of the average squared (negative) deviation from a reference signal
and includes only those signal values that fall below the reference