1. Introduction

A major obstacle in global liquid chromatography mass spectrometry (LC-MS) based metabolomics is drawing comparisons between samples processed on different runs of the same instrument or on runs from different instruments. There are several reasons for wanting to compare samples across instrument runs. A single mass spectrometry run is limited to a certain number of samples: samples are prepared and placed on a plate containing a defined number of wells, with each well housing an individual sample. The number of wells available depends on the type and size of the plate, but is generally some multiple of 24 [1] . Even instrumentation that can accommodate large plates or multiple small plates is generally restricted to, at most, a few hundred wells [2] . Large epidemiological studies with thousands of samples can easily surpass this capacity. In another example, a designed time course experiment may not have all samples available for analysis at one time. The clinical environment presents an analogous situation, as new patients are regularly being admitted and evaluated. However, mass spectrometry itself is inherently semi-quantitative; the observed value returned by the instrument is the ion count associated with the feature, i.e. the “ion peak”, which depends not just on the concentration in the sample but also on metabolite and instrument characteristics.

Exact concentrations can be derived through calibration curves, i.e. standard curves, in which known concentrations of the target metabolite are included as a way to anchor the ion counts and estimate the levels in samples of interest according to their position on the curve. For a thorough review of standard curves see the five-part series by Dolan [3] [4] [5] [6] [7] . This targeted approach is clearly infeasible for global metabolomic profiling because 1) the metabolites to be captured are not known a priori, 2) it is a significant challenge to obtain a labeled standard for each metabolite, and 3) there are a limited number of wells available to house the standards along with the experimental samples being profiled.

Lacking full quantitation, one must find some way to adjust the ion counts in different batches toward each other. Batch effects are typically removed through normalization. The goal of normalization is to reduce the systematic variation while preserving the biological variation. Many normalization techniques used in the field adjust each sample; others instead make adjustments for each metabolite. Normalization techniques are often deemed successful if the variance has decreased; however, some of the important biological variation may have been removed as well. Since the ideal measurement for a metabolite would come from a targeted assay or clinical measurement, we compare the normalized values to values from a panel of targeted assays in which the concentrations have been measured.

2. Materials and Methods

2.1. Traditional Normalization

For this discussion, it will be assumed that the data sets are organized so that the rows correspond to samples and the columns to features (metabolites). The most common normalization is total ion count (or total ion current) normalization (TIC), in which every metabolite intensity in a sample is divided by the total number of ions observed in that sample [8] . Although very popular, TIC is susceptible to being overly influenced by a small number of features with very high ion counts. This normalization also assumes that most metabolites are not changing under the conditions being tested and that roughly equal numbers of metabolites are up- and down-regulated. This assumption is clearly violated in some sample sets, such as comparisons between cell lines or between normal and cancerous tissues.

Various adjustments to this basic premise include median normalization, MS-total useful signal (MSTUS) [9] , median absolute deviation (MAD) [10] , probabilistic quotient normalization (PQN) [11] and cyclic locally weighted regression (Cyclic LOWESS) [12] among others [13] [14] ; however, in general, these methods rely on the assumption that, absent instrument batch effects, the “average” ion count of each sample should be more or less equal. In this paper normalizations are separated into three classes depending on the mechanism of action. The first class divides ion intensities by a function of the sample’s spectrum. The second class relies on Minus-Average (MA) plots. The third class comprises normalizers that fit into neither of the first two.

2.1.1. Class I―Spectral Division

Normalizers of the first class are defined as being a ratio of the sample’s raw intensity values and a function of the sample vector. Let ${X}_{i}=\left\{{x}_{i1}\cdots {x}_{im}\right\}$ be the vector of observed ion counts for metabolites $1,2,\cdots ,m,$ for sample i, and let ${X}_{i}^{\text{N}}$ represent the resulting vector of normalized metabolites. Normalizers of the first class are defined as

${X}_{i}^{\text{N}}=\frac{{X}_{i}}{{f}_{i}\left({X}_{i}\right)}$

where ${f}_{i}\left(\cdot \right)$ is some function. For example, for TIC ${f}_{i}$ is the sum of all the raw peak areas in sample i, and thus ${X}_{i}^{\text{N}}$ is a vector where the original values have been scaled by this sum.
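As a concrete illustration, a Class I normalizer is simply a division of each sample's row by a scalar summary of that row. The sketch below (in Python with NumPy, not code from the original work) uses TIC and the median as two example choices of ${f}_{i}$:

```python
import numpy as np

def class_one_normalize(X, f):
    """Class I normalization: divide each sample (row) by a scalar
    function f of that sample's own spectrum."""
    X = np.asarray(X, dtype=float)
    scale = np.apply_along_axis(f, 1, X)  # one scalar per sample
    return X / scale[:, None]

# TIC: divide each sample by its total ion count.
X = np.array([[1.0, 2.0, 7.0],
              [2.0, 4.0, 14.0]])
X_tic = class_one_normalize(X, np.sum)     # both rows become [0.1, 0.2, 0.7]
X_med = class_one_normalize(X, np.median)  # "Median" normalization
```

Note how the two samples, which differ only by a constant dilution factor, become identical after TIC normalization.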

Table 1 summarizes ${f}_{i}$ for the first class of normalizations. Several are variations on TIC, such as MS Total Useful Signal (MSTUS), which restricts the sum to those features common to all samples. Vector Normalization (VECT) extends TIC by measuring the Euclidean distance of the observed vector from the origin, and for this reason is sometimes referred to as the “Euclidean norm”. Both TIC and VECT are specific versions of the more general form $\sqrt[p]{\Sigma {x}_{ij}^{p}}$ . “Mean” is simply TIC adjusted for the number of features, while “Median” uses the median of the sample’s spectrum. Median Absolute Deviation (MAD) takes “Median” a step further by finding the absolute deviations from the median within a sample and using the median of these to normalize. Some methods normalize to a baseline or control spectrum; such a spectrum can be determined a priori or chosen from the available samples, for example the sample with the median TIC. Linear Baseline scaling (LB) and Probabilistic Quotient Normalization (PQN) are examples of this. In LB, each sample is normalized so that the

Table 1. Class I Normalizers.

^{a,b}Baseline/control spectrum may be taken from a designated sample or calculated from the available data, such as the sample with the median TIC.

TIC of the resulting normalized sample is equal to that of the “baseline”. LB assumes a constant linear relationship between the sample and the baseline; non-linear extensions are available. Although the name includes “scaling”, the intent is consistent with normalization, which seeks to adjust the spectra of all samples to the same level in some sense, and the computation is consistent with the Class I definition. PQN, which involves a four-step process, is the most computationally intensive of the Class I normalizers listed here. In the first step, TIC normalization is performed. Second, a control spectrum is calculated; this may be based upon a designated sample, or the median spectrum over all samples may be used. Third, for each feature the ratio, i.e. quotient, of the TIC-normalized intensity of the sample to the control spectrum is found. The final normalizer is then the median of all quotients. Most of the other Class I normalizers are straightforward and computationally inexpensive to calculate; hence, these are popular and common choices for normalizing.
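The four PQN steps can be sketched as follows; this is an illustrative implementation (using the median spectrum over all samples as the control, one of the options mentioned above), not the authors' code:

```python
import numpy as np

def pqn(X):
    """Probabilistic quotient normalization (sketch).
    Rows are samples, columns are metabolites."""
    X = np.asarray(X, dtype=float)
    # Step 1: TIC normalization.
    X_tic = X / X.sum(axis=1, keepdims=True)
    # Step 2: control spectrum = per-feature median over all samples.
    ref = np.median(X_tic, axis=0)
    # Step 3: per-feature quotients against the control spectrum.
    quotients = X_tic / ref
    # Step 4: divide each sample by the median of its quotients.
    return X_tic / np.median(quotients, axis=1, keepdims=True)

# Two rows differing only by dilution normalize to the same spectrum.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 2.0, 4.0]])
X_pqn = pqn(X)
```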

2.1.2. Class II―MA Normalizers

The second class of normalizers involves MA plots, which are Bland-Altman plots on the log scale [15] [16] . For any two samples $j$ and ${j}^{\prime}$ the MA plot is a scatter plot in which each metabolite, $i$, has coordinates $\left({\text{minus}}_{ij},av{g}_{ij}\right)$ given by

${\text{minus}}_{ij}={\mathrm{log}}_{2}\left({x}_{ij}\right)-{\mathrm{log}}_{2}\left({x}_{i{j}^{\prime}}\right)$

$av{g}_{ij}=\frac{{\mathrm{log}}_{2}\left({x}_{ij}\right)+{\mathrm{log}}_{2}\left({x}_{i{j}^{\prime}}\right)}{2}$ .

The “M” can be viewed as the log of the ratio, while “A” is half the log of the product. Orienting the two spectra in this way is intended to magnify trends, both linear and non-linear, related to systematic variation such as batch effects. A curve is then fitted to this plot so that the difference between the two samples due to systematic variation can be removed. Under Cyclic LOWESS, a non-linear local regression curve (LOWESS) is fitted to the MA plot for a given pair of samples. The process is then repeated for all possible pairwise combinations of samples in the data set. Following a complete iteration over all pairs, the cycle is repeated until some tolerance is achieved between the latest cycle and the preceding one.
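The MA coordinates for a pair of spectra can be computed directly from the definitions above; a minimal sketch (the LOWESS fitting step itself is omitted):

```python
import numpy as np

def ma_coordinates(x_j, x_jprime):
    """M ("minus") and A ("average") coordinates for two spectra
    on the log2 scale, one point per metabolite."""
    lj, ljp = np.log2(x_j), np.log2(x_jprime)
    minus = lj - ljp          # M: log of the ratio
    avg = (lj + ljp) / 2.0    # A: half the log of the product
    return minus, avg

# Two metabolites measured in two samples.
m, a = ma_coordinates(np.array([2.0, 8.0]), np.array([4.0, 4.0]))
```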

Another variation on this is contrast normalization [17] . Under contrast normalization the complete set of all ion features for all samples $X={\left[{X}_{1}\cdots {X}_{n}\right]}^{\prime}$ is log transformed and then linearly transformed using a $k$ by $k$ orthonormal matrix $M$ to produce a new set of orthogonal vectors:

${X}^{O}=\mathrm{log}\left(X\right)\cdot M$ .

The first row of $M$ is the repetition of the constant $1/\sqrt{k}$ . The other rows of $M$ are not uniquely defined, except in the case of $k=2$ which gives

${M}_{2}=\frac{1}{\sqrt{2}}\left[\begin{array}{cc}1& 1\\ 1& -1\end{array}\right]$ .

For $k>2$ , $M$ is not unique, which requires some consideration for the next step, in which ${X}_{1}^{O}$ , the first row of ${X}^{O}$ , is used to predict the remaining rows ${X}_{i}^{O}$ for $i=2,\cdots ,n$ . Referring to these predictions as ${\widehat{X}}_{i}^{O}$ , using LOWESS regression with weighted least squares produces predictions that are invariant to the choice of $M$ . Estimation of the ${\widehat{X}}_{i}^{O}$ is iterated until some tolerance between the previous and newest estimate is achieved. The final normalized matrix is then given by

${X}^{\text{N}}={\left[{X}_{1}^{O}\text{}\left({X}_{2}^{O}-{\widehat{X}}_{2}^{O}\right)\cdots \left({X}_{n}^{O}-{\widehat{X}}_{n}^{O}\right)\right]}^{\prime}$ .

From this point the data set may be analyzed or mapped back to the original space via the reverse transformation

$\mathrm{exp}\left({X}^{\text{N}}\cdot {M}^{\prime}\right)$ ,

where ${M}^{\prime}={M}^{-1}$ since $M$ is orthonormal.

The similarity to cyclic LOWESS may not be immediately obvious; however, notice that when $k=2$ the contrast matrix $M$ coupled with the log transformation is analogous to the orientation of the MA plot. Contrast normalization essentially generalizes the MA concept to higher dimensions.
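The $k=2$ equivalence can be checked numerically. The sketch below uses $\log_2$ so the result lines up with the MA definitions (the method itself is stated with a generic log, so the scale factors here are illustrative):

```python
import numpy as np

# Contrast matrix for k = 2 (symmetric and orthonormal).
M2 = np.array([[1.0, 1.0],
               [1.0, -1.0]]) / np.sqrt(2.0)

# Values of one feature in two samples.
x = np.array([8.0, 2.0])
contrast = M2 @ np.log2(x)

m = np.log2(x[0]) - np.log2(x[1])         # "M" (minus) coordinate
a = (np.log2(x[0]) + np.log2(x[1])) / 2   # "A" (average) coordinate
# contrast[0] equals sqrt(2)*a and contrast[1] equals m/sqrt(2):
# the same orientation as the MA plot, up to scale.
```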

2.1.3. Class III―Other

Normalizations that fit the criteria of neither Class I nor Class II are collected here. One example is Quantile Normalization (Quant) [18] . This method rescales so that the distribution of intensities within each sample is the same across all samples. Let ${X}_{i}$ be the ordered set of intensities for sample $i$ :

${X}_{i}=\left\{{x}_{i[1]}\cdots {x}_{i[m]}\right\}$ ,

and consider the vector of average order statistics across all ${X}_{i}$

$\overline{X}=\left\{{\overline{x}}_{[1]}\cdots {\overline{x}}_{[m]}\right\}=\left\{\frac{{\displaystyle {\sum}_{i=1}^{n}{x}_{i[1]}}}{n}\cdots \frac{{\displaystyle {\sum}_{i=1}^{n}{x}_{i[m]}}}{n}\right\}$ .

This essentially orders each row of the data set and then takes the average of each column. The normalized vector for a sample is then given by these averages, arranged in the order corresponding to the ranks of the un-normalized vector:

${X}_{i}^{\text{N}}=\left\{{\overline{x}}_{\left[rank\left({x}_{i1}\right)\right]}\cdots {\overline{x}}_{\left[rank\left({x}_{im}\right)\right]}\right\}$ .
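A compact sketch of quantile normalization following the steps above (ties between intensities are not handled specially here, and missing values are assumed to have been removed or imputed already):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: every sample (row) receives the same
    distribution -- the average of the order statistics across samples."""
    X = np.asarray(X, dtype=float)
    # Rank of each feature within its sample (ties broken by position).
    ranks = X.argsort(axis=1).argsort(axis=1)
    # Average order statistics: sort each row, then average each column.
    mean_sorted = np.sort(X, axis=1).mean(axis=0)
    # Place the averages back according to each sample's ranks.
    return mean_sorted[ranks]

X = np.array([[5.0, 1.0, 3.0],
              [4.0, 2.0, 6.0]])
X_q = quantile_normalize(X)
```

After normalization both rows contain exactly the values {1.5, 3.5, 5.5}, illustrating both the advantage (identical scales) and the drawback (the highest feature in every sample maps to the same value) discussed below.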

An advantage of Quant is that it directly puts the intensities of each sample on the same scale, making sample-to-sample comparisons easier. One drawback is that features with missing values must be removed or imputed. Second, metabolites that are significantly more abundant may be normalized to a near static state. In fact, in the data set used in Section 2.4, oleic acid had the highest peak area in every sample, so all of its values would normalize to the same value. The same issue could apply to metabolites that are significantly lower in abundance than all other metabolites, because metabolites near the limit of detection often drop out, i.e., no peak is detected.

2.2. Bridge Normalization (BRDG)

Mass spectrometry returns an ion count that is proportional to the true concentration but also dependent on the instrumentation. Rocke and Lorenzato [19] proposed a model for this ion count meant to account for the different performance behavior observed at low and high concentrations. Using this model, consider two separate instrument runs in which k technical replicates of the same sample are run in both batches. The ion intensity of replicate j of metabolite i in batch b is then

${x}_{ijb}={\beta}_{ib}{y}_{i}{\text{e}}^{{\eta}_{ijb}}+{\epsilon}_{ijb}.$

The subscript of the sample concentration, ${y}_{i}$ , depends only on the metabolite, since the k samples are technical replicates. ${\beta}_{ib}$ relates to the ionization efficiency of the instrument and will vary by metabolite and batch. ${\eta}_{ijb}\sim N\left(0,{\sigma}_{\eta ib}^{2}\right)$ and ${\epsilon}_{ijb}\sim N\left(0,{\sigma}_{\epsilon ib}^{2}\right)$ are both normal random errors, with the former dominating at higher concentrations and the latter at lower concentrations. Note that the intercept term, which is related to the background level of the instrument, has been removed, as it is generally regarded as a nuisance parameter and is in fact ignored in single-point calibration curves [20] . The expected value of any such replicate is then:

${\mu}_{ib}=E\left[{x}_{ijb}\right]={\beta}_{ib}{y}_{i}{\text{e}}^{{\sigma}_{\eta ib}^{2}/2}.$

Rearranging these terms gives

${\mu}_{ib}=E\left[{x}_{ijb}\right]=\left({\beta}_{ib}{\mathrm{e}}^{{\sigma}_{\eta ib}^{2}/2}\right){y}_{i}.$

As both ${\beta}_{ib}$ and ${\mathrm{e}}^{{\sigma}_{\eta ib}^{2}/2}$ are fixed, but unknown, parameters depending only on the metabolite and batch, these terms may be combined into a single unknown variable. Letting ${\beta}_{ib}^{\ast}={\beta}_{ib}{\mathrm{e}}^{{\sigma}_{\eta ib}^{2}/2}$ it is easy to see that mean ion count for the batch is proportional to true concentration level:

${\mu}_{ib}={\beta}_{ib}^{*}{y}_{i}$

Hence, the mean ion count for the two batches is proportional:

$\frac{{\mu}_{i1}}{{\beta}_{i1}^{\ast}}=\frac{{\mu}_{i2}}{{\beta}_{i2}^{\ast}}$

By the law of large numbers, there exists a k such that the average of the replicates within a batch

${\overline{x}}_{ib}={\displaystyle \sum _{j=1}^{k}\frac{{\beta}_{ib}{y}_{i}{\text{e}}^{{\eta}_{ijb}}+{\epsilon}_{ijb}}{k}}$

is reasonably close to ${\beta}_{ib}^{\ast}{y}_{i}$ . Scaling each batch against the mean of these replicates would thus eliminate the batch differences.

Metabolomic workflows often include QC samples in order to monitor instrument performance [21] [22] . These samples are aliquots of a pooled material, can be regarded as technical replicates, and provide a convenient source for estimating the scaling factor. These samples will henceforth be referred to as bridge samples. To perform bridge sample normalization (BRDG), for each metabolite in a given instrument batch, divide its values by the median of the bridge samples for that batch. The median rather than the mean is recommended in order to mitigate the influence of outliers.
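BRDG can be sketched as follows; the function and variable names are illustrative, not from an existing package:

```python
import numpy as np

def bridge_normalize(X, batch, is_bridge):
    """BRDG sketch: for each metabolite within each batch, divide by the
    median of that batch's bridge (pooled QC) samples."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    is_bridge = np.asarray(is_bridge, dtype=bool)
    X_norm = np.empty_like(X)
    for b in np.unique(batch):
        rows = batch == b
        # Per-metabolite median over this batch's bridge samples.
        scale = np.median(X[rows & is_bridge], axis=0)
        X_norm[rows] = X[rows] / scale
    return X_norm

# Two batches, each with two bridge samples and one experimental sample;
# the batch effect is a constant multiplier per batch.
X = np.array([[2.0, 10.0], [4.0, 20.0], [3.0, 15.0],
              [4.0, 5.0], [8.0, 10.0], [6.0, 7.5]])
batch = [0, 0, 0, 1, 1, 1]
is_bridge = [True, True, False, True, True, False]
out = bridge_normalize(X, batch, is_bridge)
```

The two experimental samples (rows 2 and 5), which sit at the bridge median in both batches, land on the same normalized values, eliminating the batch difference.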

2.3. Median Run Normalization (MED)

An important part of the experimental protocol should be randomization of the samples across the instrument runs. Under such randomization, for a given metabolite, the expected value of the relative concentration is the same for each instrument run if there are no batch effects. Hence, randomly assigning the samples and dividing the values for each metabolite on each instrument run by the observed median should put each batch on the same scale. This is similar to bridge normalization, only with the experimental samples themselves serving as the bridge.

Theoretically, the sample mean is generally a more efficient estimator than the sample median, but with skewed distributions and small sample sizes this advantage can be lost. Due to the propensity for extreme outliers in metabolomic data, which could adversely affect the sample mean, the median is used instead. This normalization procedure will henceforth be referred to as “MED”.
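MED admits an analogous sketch, with the batch medians of the experimental samples themselves supplying the scaling factors (names are again illustrative):

```python
import numpy as np

def median_run_normalize(X, batch):
    """MED sketch: divide each metabolite within each batch by that
    batch's median across the experimental samples themselves."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    X_norm = np.empty_like(X)
    for b in np.unique(batch):
        rows = batch == b
        X_norm[rows] = X[rows] / np.median(X[rows], axis=0)
    return X_norm

# Batch 0 has three samples, batch 1 has two; each batch is scaled
# by its own per-metabolite median.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [4.0, 2.0], [8.0, 4.0]])
batch = [0, 0, 0, 1, 1]
out = median_run_normalize(X, batch)
```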

2.4. Human Plasma Data Set

The goal is to compare bridge set (BRDG) and median scaling of experimental samples (MED) to standard -omic normalizations that might be considered for a metabolomic data set. Total ion current (TIC), median absolute deviation (MAD), probabilistic quotient normalization (PQN) and cyclic LOWESS (CLOW) were chosen from the available options. This list includes a good mix of popular normalization methods and representatives of Class I and II normalizers.

Plasma samples were obtained from participants in the Insulin Resistance Atherosclerosis Family Study (IRASFS), which was sponsored by the National Heart, Lung and Blood Institute with the goal of examining the genetic epidemiology of insulin resistance and visceral adiposity [23] . The IRASFS consists of subjects from Hispanic and African American families. From this cohort, 1719 samples were sent to Metabolon for global LC-MS metabolomic profiling; for further details on this platform see Long et al. [24] . One sample was blank and two others were duplicates, so these were removed, leaving 1716 samples for analysis. Accommodating this many samples required between 13 and 15 instrument runs per arm of the platform. The resulting analysis measured 1274 metabolites (922 named, 352 unnamed).

Plasma samples from these participants were also run on a separate targeted assay of seven metabolites, which were shown to be markers for impaired glucose tolerance (IGT) [25] . The metabolites measured in this panel are 2-hydroxybutyrate, 3-hydroxybutyrate, 4-methyl-2-oxopentanoate, linoleoyl-GPC, oleic acid, pantothenic acid and serine. Additionally, a targeted sterol panel was run [26] . This panel contained two metabolites that were also present on the untargeted platform: alpha-tocopherol and cholesterol. The bridge samples used were technical replicates of a pool of human plasma obtained from Bioreclamation.

The resulting normalized levels of these nine metabolites in the global panel are compared to the targeted results using Pearson’s correlation, r. All analyses were performed in R version 3.4.3 [27] , using the limma package [28] and the Data Normalization R script by Hochrein et al. [29] .

3. Results and Discussion

For a preliminary analysis, a variance components analysis was performed for the bridge samples in order to assess how much of the variation can be attributed to the instrument batch. Those metabolites present in at least 80% of the bridge samples were used for this analysis (1049 metabolites). The variance components were fitted with JMP v13 [30] . From this analysis, one can obtain the percent of the variance that can be explained by the instrument batch. The median batch variance component is 85%, i.e., for a typical metabolite, the variance from the batch is 85% of the total variance. For the metabolites compared to the targeted assays, their batch variance components are shown in Table 2.

The correlations of the normalized data to the targeted assays are shown in Table 3. This is graphically displayed in Figure 1. In Figure 1, those metabolites above the line y = x have correlations better than no transformation, while those below have worse correlations. “NONE” refers to no normalization, i.e., the raw peak areas. Plots of the normalized values for 2-hydroxybutyrate are shown in Figure 2 as an example of the effects of the normalizations in comparison to the raw (unnormalized) values. From Table 3, one can see that even with no normalization, 7 of the 9 metabolites have correlations of at least 0.5, and some are

Table 2. Variance Components for BATCH.

Table 3. Correlation of normalized values to targeted assays.

Figure 1. Comparisons of Correlations to the Targeted Assays. Values above the line y = x are improved correlations, those below are worse. Each vertical section contains the correlations for the same metabolite.

Figure 2. Comparison of normalized values for 2-hydroxybutyrate. Colors represent individual batches. The values on the y-axis are the normalized values (except for “Raw”, which shows the unnormalized values).

even greater than 0.9. From Table 3 and Figure 1, one can see that in general, the normalization methods that rely on metabolite-specific adjustments (BRDG, MED) significantly outperform the methods that make adjustments across each sample (TIC, MAD, PQN, CLOW). In fact, for many cases the sample-based normalizations performed worse than performing no normalization. For BRDG and MED, many of the resulting correlations are over 0.9. In general, BRDG and MED have similar performance, except for alpha-tocopherol. For alpha-tocopherol, there were two batches that had a significant drop in peak areas for the bridge samples, but not for the experimental samples. Without these two batches, the correlation is 0.79. The bridge samples were obtained from a different source than the experimental samples. In general, it is preferable to run bridge samples from the same source as the experimental samples, as the quantitation of metabolites can be affected by other substances within the sample.

4. Conclusion

When applying normalization to metabolomics data, it is important that the method appropriately corrects for the systematic variation while preserving the biological variation. Various methods were assessed by comparing their output to targeted data, where the actual concentrations of certain metabolites in the samples are known. Many common normalization techniques that make corrections across each sample, such as TIC normalization, often performed worse than performing no normalization at all. The two methods that relied on metabolite-specific corrections (BRDG, MED) performed much better than the sample-based normalizations, and many of the resulting correlations were over 0.9. Correcting by the median batch value from the experimental samples (MED) can work well in a variety of applications. However, if one wants to run a very small sample set and merge it into previous data sets, or to compare the values in two different data sets, it is probably better to normalize by bridge samples (BRDG). The main drawback of BRDG is that metabolites not present in the bridge samples cannot be normalized. Additionally, if the bridge samples are obtained from a different source, some metabolites may have different batch effects in the bridge samples than in the experimental samples. To avoid this issue, making the bridge samples as similar as possible to the experimental samples is recommended.

Acknowledgements

The IRASFS was supported by the National Institutes of Health (HL060944, HL061019, HL060919, and DK085175).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.