An under-appreciated aspect of the genetic analysis of gene expression is the impact of post-probe level normalization on biological inference. Here we contrast nine different methods for normalization of an Illumina bead-array gene expression profiling dataset consisting of peripheral blood samples from individual participants in the Center for Health Discovery and Well Being study in Atlanta, quantifying differences in the inference of global variance components and covariance of gene expression, as well as the detection of variants that affect transcript abundance (eSNPs). The normalization strategies, all relative to raw log2 measures, include simple mean centering, two modes of transcript-level linear adjustment for technical factors and one for differential immune cell counts, variance normalization by interquartile range and by quantile, fitting the first 16 Principal Components, and supervised normalization using the SNM procedure with adjustment for cell counts. Robustness of genetic associations to the choice of Pearson or Spearman rank correlation is also reported for each method, and it is shown that the normalization strategy has a far greater impact than the correlation method. We describe similarities among methods, discuss the impact on biological interpretation, and make recommendations regarding appropriate strategies.
Normalization is one of the most vexing issues associated with the analysis of functional genomic datasets such as gene expression, metabolomic, and methylation profiles.
Much consideration has been given to methods for extracting appropriate probe summary measures from raw microarray, Affymetrix, or Illumina fluorescence intensities, which is the first step in normalization. Even so, given an appropriately pre-processed dataset Schmid et al.
It is however less well appreciated just how large the impact of these initial post-probe level data processing steps can be, and these are the subject of this study. This is particularly important where the desire exists to make adjustments for covariates that are thought a priori to globally impact a large proportion of the measurements Qiu et al. The most commonly utilized normalization methods treat all of the measurements jointly, and are generally variations on approaches to centering the data distributions or equilibrating the variances.
Centering approaches most simply include mean or median centering to adjust for overall differences in concentration (perhaps due to slight variation in the amount of sample or in efficiency), but ANOVA approaches can also be used if it is suspected that certain groups of samples are likely to have different distributions (Dabney and Storey; Mason et al.). In all cases, hypothesis testing evaluates differential abundance, usually on a log scale.
Variance normalization, by contrast, effectively evaluates differences in rank order (Durbin et al.). The simplest approaches convert the measures to z-scores by dividing through by the sample standard deviation following centering (Colantuoni et al.).
Interquartile range (IQR) normalization forces the distributions to have the same values for the 25th and 75th percentiles (Geller et al.).
Quantile normalization (QNM) has become the standard method in many circumstances, and is certainly appropriate where the assumption is that only a small number of measures differ among samples and hence that variation in the distributions is mostly technical noise that should be removed.
However, in many biological circumstances a large fraction of measures vary systematically due to regulatory mechanisms that create extensive covariance, and we demonstrate herein how QNM can alter the biological signal (see Leek et al.). Recently, attention has turned to methods that treat different measures unequally, recognizing that technical and biological factors are both likely to impact only a subset of the measures in the samples.
Intensity-dependent effects, for example, are often removed by lowess transformation (Yang et al.). More generally, it should be recognized that technical factors such as RNA quality may not affect all transcripts equally, and that biological factors such as sex, or cell counts in complex tissues, will impact thousands of measures but by no means all. Such global influences can be identified by principal component or similar analysis (Quackenbush; James et al.).
This is an intuitively appealing approach that is just beginning to gain traction following the development of open source algorithms, including Supervised Normalization of Microarrays (SNM; see also Stegle et al.). The objective of this study was to quantitatively evaluate the impact of nine different normalization approaches on a new dataset that we are analyzing for the purpose of measuring the impact of clinical covariates on peripheral blood gene expression in healthy adults.
A full description of the study will be published elsewhere, as we only describe the influence of four biological variables (gender, ethnicity, age, and BMI) as well as blood cell counts, and two technical variables that commonly impact gene expression studies, namely date of hybridization and RNA quality.
The nine methods are RAW, MEA, dr3, DRM, LMN, IQR, QNM, SNM, and PCA, each defined below. We document the widespread impact of these methods, draw conclusions regarding similar aspects of their performance, and discuss the implications for interpretation of hypothesis testing. We consider here the abundance of transcripts measured with 14, probes that are consistently detected across multiple datasets of peripheral blood samples, which for this study were obtained from Tempus tubes (Applied Biosystems, Foster City, CA, USA) that preserve whole blood RNA.
Whole genome genotypes were measured using Illumina OmniQuad arrays. Normalization was performed as follows. RAW refers to the average bead intensity for each probe obtained directly from Bead Studio without background subtraction, with log base 2 transformation but no adjustment across arrays.
MEA refers to mean centering of the RAW profiles for each sample, namely an additive shift on the log base 2 scale that ensures that the mean value is the same for each individual, while the shape and variance of each profile are not adjusted. Technical batch and RNA quality effects were adjusted, giving rise to the dr3 profiles, by fitting an ANOVA to each probe with fixed effects of hybridization date and Bioanalyzer RNA Integrity Number (RIN) and then standardizing the residuals to yield z-score gene expression measures (that is, each gene has a mean of zero and variance of 1 across the individuals).
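The two elementary operations in this paragraph can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, the ANOVA step that precedes standardization in dr3 is omitted, and the population standard deviation is assumed for the z-scores (the text does not say which estimator was used).

```python
from statistics import mean, pstdev

def mean_center(profiles):
    """MEA-style additive shift: every sample (row) ends up with the same mean.

    profiles: list of samples, each a list of log2 intensities per probe.
    The shared target is assumed to be the mean of the sample means.
    """
    sample_means = [mean(p) for p in profiles]
    grand_mean = mean(sample_means)
    return [[x - m + grand_mean for x in p]
            for p, m in zip(profiles, sample_means)]

def standardize_probes(profiles):
    """z-score each probe (column) across samples: mean 0, variance 1.

    Assumes every probe varies across samples (non-zero standard deviation).
    """
    n_samples, n_probes = len(profiles), len(profiles[0])
    z_columns = []
    for j in range(n_probes):
        column = [p[j] for p in profiles]
        mu, sd = mean(column), pstdev(column)
        z_columns.append([(x - mu) / sd for x in column])
    # Transpose back to samples x probes.
    return [[z_columns[j][i] for j in range(n_probes)]
            for i in range(n_samples)]
```

Mean centering leaves within-sample differences between probes untouched, whereas probe-wise standardization discards each probe's original scale.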
DRM refers to profiles obtained by mean centering of the dr3 profiles, which ensures that there is no bias in the overall distribution of transcripts with relatively low or high expression in each individual, as expected biologically. The dr3 profiles were subject to an alternate transformation adjusting for blood cell counts, giving rise to the LMN profiles, by fitting a probe-specific multiple linear regression with counts of Lymphocytes, Monocytes, Neutrophils, Erythrocytes, and Platelets (all measured directly using a standard CBC panel on each sample) and retaining the residuals.
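A probe-specific regression on cell counts with retention of residuals, as described for LMN, might look like the sketch below. It assumes NumPy is available and an ordinary least squares fit with an intercept; the function name `regress_out` and the array layout are illustrative choices, not details taken from the study.

```python
import numpy as np

def regress_out(expression, covariates):
    """Fit one linear model per probe and return the residuals.

    expression: samples x probes array of expression values.
    covariates: samples x k array (for example, five blood cell counts).
    An intercept column is added, so residuals have zero mean per probe.
    """
    expression = np.asarray(expression, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # lstsq solves all probes at once when the right-hand side is 2-D.
    coefficients, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return expression - design @ coefficients
```

By construction the residuals are uncorrelated with every fitted covariate, which is why any trait that tracks cell counts loses signal after this step.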
Two types of variance transformation were performed. IQR refers to the InterQuartile Range, namely the distribution of each RAW log base 2 profile adjusted to ensure that the 25th and 75th percentile values, and hence the range between them, are the same for each sample.
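One way to realize an interquartile range adjustment of this kind is a per-sample linear rescaling that maps each profile's quartiles onto common targets. The sketch below makes two assumptions not stated in the text: the targets are the averages of the per-sample quartiles, and quartiles are computed by linear interpolation (the `inclusive` method of Python's `statistics.quantiles`).

```python
from statistics import mean, quantiles

def quartile_pair(values):
    """Return the 25th and 75th percentiles of one sample's profile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

def iqr_normalize(profiles):
    """Shift and scale each sample so all share the same quartiles.

    profiles: list of samples, each a list of log2 intensities.
    Assumes each sample has a non-zero interquartile range.
    """
    per_sample = [quartile_pair(p) for p in profiles]
    target_q1 = mean(q[0] for q in per_sample)
    target_q3 = mean(q[1] for q in per_sample)
    normalized = []
    for profile, (q1, q3) in zip(profiles, per_sample):
        scale = (target_q3 - target_q1) / (q3 - q1)
        normalized.append([target_q1 + (x - q1) * scale for x in profile])
    return normalized
```

Because the map is linear and increasing within each sample, the rank order of probes within a sample is preserved.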
This produces a more similar variance structure than the MEA transform, while also ensuring that all arrays have similar means. QNM refers to quantile normalization, which is a density-adjusted rank ordering. For each sample, each probe is ranked according to intensity and then the average intensity of each rank is computed.
The probe is assigned that average value, resulting in identical overall distributions. SNM refers to supervised normalization of microarrays and was performed using the R package of that name from Bioconductor (Mecham et al.). PCA refers to the profiles obtained after fitting a multiple linear regression with each of the first 16 principal components of expression of all 14, probes across the samples.
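The quantile normalization procedure described above (rank within each sample, average across samples at each rank, assign the average back) can be sketched in a few lines. This is a generic illustration, not the implementation used in the study; ties are broken by probe order here, which is an assumption, since common implementations handle ties in different ways.

```python
def quantile_normalize(profiles):
    """Force every sample to share the same overall distribution.

    profiles: list of samples, each a list of intensities with one value
    per probe (all samples must have the same number of probes).
    """
    n_samples, n_probes = len(profiles), len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Average intensity at each rank, taken across samples.
    rank_means = [sum(sp[r] for sp in sorted_profiles) / n_samples
                  for r in range(n_probes)]
    normalized = []
    for profile in profiles:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda j: profile[j])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

example = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
# example == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After the transform each sample contains exactly the same set of values, so only the rank of a probe within its sample carries information.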
These 16 PC explain The first five principal components of the total gene expression dataset were computed, and then an average of the proportion of variation for each trait explained by each of these five PC, weighted by their contributions to the total gene expression, was computed. Immuno-informative axes of variation are defined as the first PC of the definitive genes for each of seven axes described in Preininger et al.
Volcano plots are simply x-y plots of significance, as the negative logarithm of the p-value, against fold-change in gene expression between the indicated groups. Log transformation of the RAW data results in approximately normal distributions of all samples, albeit with a left-shifted peak due to most transcripts having low to moderate abundance, with a long tail of higher abundance transcripts. The mean and variance of these distributions may or may not be correlated with biological and technical covariates.
Since the three colors here represent normal weight, overweight, and obesity, there is evidently no clear overall impact of the BMI classes on the gene expression profiles.
MEA ensures that overall abundance effects are removed, while IQR further squeezes the distributions into more similar profiles. Similarly, LMN produces mean-centered z-score distributions after fitting the number (similar results are obtained after fitting the proportion, not shown) of the five major blood cell types that a priori are likely to impact global gene expression profiles. The bottom row shows the effect of the more aggressive normalization procedures.
PCA objectively removes most of the covariance without regard to its source; the resultant standardized distributions are further mean-centered for all subsequent analyses. The QNM plot shows how QNM forces all samples to the same overall distribution, and clearly the SNM procedure is almost as effective at adjusting both the mean and variance of the distributions, but in a more experimentally (and less statistically) motivated manner.
Profile distributions after nine modes of normalization. Each plot shows the frequency distribution of transcripts at increasing levels of expression along the x-axis (the units are removed, since these are not comparable between methods).
Colors represent normal weight (blue), heavy (green), or obese (red) individuals. In light of the dramatic impact of normalization on the data distributions, it is to be expected that patterns of covariance of gene expression might also be affected. We visualize this in two ways. Note that the precise ordering of arrays is not the same for each normalization.
As expected, the RAW and MEA correlation clusters are identical, since the correlation coefficients are not affected by additive adjustment of the grand mean, and these are similar to the two straight variance transforms IQR and QNM since there has been no change in the ranking of transcript abundance. PCA almost completely removes the covariance, certainly removing some shared biological regulation in the process.
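The invariance invoked here is easy to check numerically. The sketch below uses hand-written Pearson and Spearman functions (hypothetical helper names, toy values rather than study data) to show that an additive shift of one array leaves the Pearson correlation unchanged, and that a rank-preserving transform of both arrays leaves the Spearman correlation unchanged.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy log2 profiles for two arrays (illustrative values only).
array_a = [7.1, 8.4, 6.0, 9.3, 7.7]
array_b = [6.9, 8.8, 7.5, 9.0, 7.2]
shifted_b = [v + 0.5 for v in array_b]      # additive shift, as in MEA
unlogged_a = [2 ** v for v in array_a]      # monotone, rank-preserving
unlogged_b = [2 ** v for v in array_b]
```

Here `pearson(array_a, array_b)` equals `pearson(array_a, shifted_b)` up to rounding, and `spearman` returns the same value for the log2 and unlogged pairs.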
SNM leaves a novel pattern of covariance that in theory improves on LMN by also adjusting for other sources of biological variation such as gender and ethnicity.
The two clusters of individuals at the bottom right represent the extremes for PC1 after SNM normalization, but no single trait that was included in the normalization model explains this separation of expression profile types.
Heatmaps showing pair-wise similarity of arrays. Each plot shows the correlation coefficient of gene expression in each array with that in the paired array. Blocks of color indicate that arrays in those sectors are less or more similar to one another. Each plot is symmetrical about the diagonal. There are two groups with almost identical PC1 eigenvalues: This is as expected, and shows that cell counts have very little impact on the major axis of variation in peripheral blood.
Similarity of principal components (A) and immuno-informative axis scores (B). The heat maps show the correlation coefficient across all samples for each PC axis, where the order of the rows is the same as the order of the columns. (A) Comparison of the first five PC shows that PC1 is generally highly correlated across normalization strategies, as is PC2, but that the lower PC fall into different clusters.
(B) By contrast, the primary axis of covariance of genes representing seven common axes of immunologically informative variation Risso et al. The next three PC are correlated to varying degrees with neutrophil, monocyte, and lymphocyte counts in particular. It is also striking that although PC are by definition orthogonal within a normalization, across normalizations they generally pick up overlapping components of covariance, so that a single PC in one analysis can significantly correlate with multiple PC in another analysis. It is also important to note that the impact of QNM is biologically difficult to interpret, as its PC show the least similarity with those derived from the other methods.
By contrast, SNM removes the cell abundance effects as expected and generally shows a covariance structure that is a composite of cell types and technical factors.
Mirroring the changes in correlation structure, normalization can have a dramatic impact on the covariance of the principal components of variation with traits of interest. Here we consider just four: gender, ethnicity, age, and BMI. Most of the methods suggest that similar amounts of gene expression variation are explained by three of the traits (BMI has little effect overall), although SNM apportions almost twice as much variance to Gender and Ethnicity as do the other methods, while fitting the blood cell counts removes the Gender component, since the blood counts differ slightly between men and women.
PCA has essentially removed all of the biological contributions that result in covariance. The table reports the weighted average of the percentage of variation explained by the first five principal components of gene expression, for the indicated variables. Similarly, gene-by-gene modeling of the association between transcript abundance and continuous trait measures is a strong function of normalization. There is a threefold range in the total number of highly significant associations detected, with the fewest observed for analysis of the RAW data, and the most overall for QNM.
Age associations that are partially correlated with the technical covariates in this sample are not detected after QNM, while SNM facilitates enhancement of the BMI effect, possibly at the expense of Gender and Ethnicity effects.
The simple expedient of mean centering is at least as effective as the mild IQR variance adjustment. Also indicated is the impact of failing to adjust for array effects after fitting the technical and cell number covariates (compare the dr3 and DRM rows), since overall profile differences are mildly correlated with the biological factors.
Once again, fitting the first 16 PC removes most trait associations. Similar trends are observed at less stringent significance thresholds. The table reports the total number of associations detected between probe-level expression and the indicated traits. The good news in this analysis is that there is extensive overlap in the most significant transcripts identified after each normalization approach.
There is wide variation in the shapes of the plots, with IQR and LMN showing poor and strong separation of up- and down-regulated genes, respectively. It should be noted that the fold-difference is presented on the log2 scale in the former, and on z-scores for the latter.
All points in the dr3 plot align on a simple curve, since all genes have the same variance after standardization; re-centering in DRM adds variance back, resulting in the more typical volcano plot. The correlation in the p-values for both Ethnicity and Gender is high across all eight methods (excluding PCA; bottom left panel), but note that there is a strong over-estimation of the significance of the Gender effect in QNM relative to SNM (right panel, blue circles) and that many genes are only called significant for Ethnicity with either procedure (red and blue circles).
Given the wide variety of significance thresholds adopted for taking genes forward for downstream processing steps such as Gene Ontology analysis, it is not clear whether normalization has a larger effect than simply setting the threshold for inclusion of genes in downstream analysis.
The impact will largely be study-specific, but these analyses indicate that it will rarely be negligible.