mzDB: A File Format Using Multiple Indexing Strategies for the Efficient Analysis of Large LC-MS/MS and SWATH-MS Data Sets *

Article · Open access · Peer reviewed

2014; Elsevier BV; Volume: 14; Issue: 3; Language: English

10.1074/mcp.o114.039115

ISSN

1535-9484

Authors

David Bouyssié, M. Dubois, Sara Nasso, Anne Gonzalez de Peredo, Odile Burlet‐Schiltz, Ruedi Aebersold, Bernard Monsarrat

Topic(s)

Metabolomics and Mass Spectrometry Studies

Abstract

The analysis and management of MS data, especially those generated by data independent MS acquisition, exemplified by SWATH-MS, pose significant challenges for proteomics bioinformatics. The large size and vast amount of information inherent to these data sets need to be properly structured to enable an efficient and straightforward extraction of the signals used to identify specific target peptides. Standard XML based formats are not well suited to large MS data files, for example, those generated by SWATH-MS, and compromise high-throughput data processing and storing. We developed mzDB, an efficient file format for large MS data sets. It relies on the SQLite software library and consists of a standardized and portable server-less single-file database. An optimized 3D indexing approach is adopted, where the LC-MS coordinates (retention time and m/z), along with the precursor m/z for SWATH-MS data, are used to query the database for data extraction. In comparison with XML formats, mzDB saves ∼25% of storage space and improves access times by a factor of 2 to 2000, depending on the particular data access. Similarly, mzDB also shows slightly to significantly lower access times in comparison with other formats like mz5. Both C++ and Java implementations, converting raw or XML formats to mzDB and providing access methods, will be released under a permissive license. mzDB can be easily accessed by the SQLite C library and its drivers for all major languages, and browsed with existing dedicated GUIs. The mzDB described here can boost existing mass spectrometry data analysis pipelines, offering unprecedented performance in terms of efficiency, portability, compactness, and flexibility.

The continuous improvement of mass spectrometers (1 Köcher T. Swart R. Mechtler K. Ultra-high-pressure RPLC hyphenated to an LTQ-Orbitrap Velos reveals a linear relation between peak capacity and number of identified peptides. Anal. Chem. 2011; 83: 2699-2704, 2 Thakur S.S. Geiger T. Chatterjee B. Bandilla P. Fröhlich F. Cox J. Mann M.
Deep and highly sensitive proteome coverage by LC-MS/MS without prefractionation. Mol. Cell. Proteomics. 2011; 10: M110.003699, 3 Nagaraj N. Alexander Kulak N. Cox J. Neuhauser N. Mayr K. Hoerning O. Vorm O. Mann M. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top Orbitrap. Mol. Cell. Proteomics. 2012; 11: M111.013722, 4 Webb K.J. Xu T. Park S.K. Yates J.R. Modified MuDPIT separation identified 4488 proteins in a system-wide analysis of quiescence in yeast. J. Proteome Res. 2013; 12: 2177-2184) and HPLC systems (5 Bantscheff M. Schirle M. Sweetman G. Rick J. Kuster B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 2007; 389: 1017-1031, 6 Bantscheff M. Lemeer S. Savitski M.M. Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal. Bioanal. Chem. 2012; 404: 939-965, 7 Michalski A. Damoc E. Hauschild J.-P. Lange O. Wieghaus A. Makarov A. Nagaraj N. Cox J. Mann M. Horning S. Mass spectrometry-based proteomics using Q Exactive, a high-performance benchtop quadrupole Orbitrap mass spectrometer. Mol. Cell. Proteomics. 2011; 10, 8 Andrews G.L. Simons B.L. Young J.B. Hawkridge A.M. Muddiman D.C. Performance characteristics of a new hybrid quadrupole time-of-flight tandem mass spectrometer (TripleTOF 5600). Anal. Chem. 2011; 83: 5442-5446, 9 Senko M.W. Remes P.M. Canterbury J.D. Mathur R. Song Q. Eliuk S.M. Mullen C. Earley L. Hardman M. Blethrow J.D. Bui H. Specht A. Lange O. Denisov E. Makarov A. Horning S. Zabrouskov V.
Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Anal. Chem. 2013; 85: 11710-11714, 10 Hebert A.S. Richards A.L. Bailey D.J. Ulbrich A. Coughlin E.E. Westphall M.S. Coon J.J. The one hour yeast proteome. Mol. Cell. Proteomics. 2014; 13: 339-347) and the rapidly increasing volumes of data they produce pose a real challenge to software developers, who constantly have to adapt their tools to deal with different types and increasing sizes of raw files. Indeed, the file size of a single MS analysis grew from a few MB to several GB in less than 10 years. The introduction of high throughput, high mass accuracy MS analyses in data dependent acquisitions (DDA) and the adoption of Data Independent Acquisition (DIA) approaches, for example, SWATH-MS (11 Gillet L.C. Navarro P. Tate S. Röst H. Selevsek N. Reiter L. Bonner R. Aebersold R. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics. 2012; 11), were significant factors in this development. The management of these huge data files is a major issue for laboratories and raw file public repositories, which need to regularly upgrade their storage solutions and capacity. The availability of XML (eXtensible Markup Language) standard formats (12 Pedrioli P.G.A. Eng J.K. Hubley R. Vogelzang M. Deutsch E.W. Raught B. Pratt B. Nilsson E. Angeletti R.H. Apweiler R. Cheung K. Costello C.E. Hermjakob H. Huang S. Julian R.K. Kapp E. McComb M.E. Oliver S.G. Omenn G. Paton N.W. Simpson R. Smith R. Taylor C.F. Zhu W. Aebersold R. A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol.
2004; 22: 1459-1466, 13 Martens L. Chambers M. Sturm M. Kessner D. Levander F. Shofstahl J. Tang W.H. Römpp A. Neumann S. Pizarro A.D. Montecchi-Palazzi L. Tasman N. Coleman M. Reisinger F. Souda P. Hermjakob H. Binz P.-A. Deutsch E.W. mzML–a community standard for mass spectrometry data. Mol. Cell. Proteomics. 2011; 10) enhanced data exchange among laboratories. However, XML formats inflate raw file size by a factor of two to three. Vendor files, although lighter, are proprietary formats, often not compatible with operating systems other than Microsoft Windows. They do not generally interface with many open source software tools, and do not offer a viable solution for data exchange. In addition to size inflation, other disadvantages associated with the use of XML for the representation of raw data have been previously described in the literature (14 Shah A.R. Davidson J. Monroe M.E. Mayampurath A.M. Danielson W.F. Shi Y. Robinson A.C. Clowers B.H. Belov M.E. Anderson G.A. Smith R.D. An efficient data format for mass spectrometry-based proteomics. J. Am. Soc. Mass Spectrom. 2010; 21: 1784-1788, 15 Lin S.M. Zhu L. Winter A.Q. Sasinowski M. Kibbe W.A. What is mzXML good for? Expert Rev. Proteomics. 2005; 2: 839-845, 16 Askenazi M. Parikh J.R. Marto J.A. mzAPI: a new strategy for efficiently sharing mass spectrometry data. Nat. Methods. 2009; 6: 240-241, 17 Wilhelm M. Kirchner M. Steen J.A.J. Steen H. mz5: space- and time-efficient storage of mass spectrometry data sets. Mol. Cell. Proteomics. 2012; 11).
These include the verbosity of language syntax, the lack of support for multidimensional chromatographic analyses, and the low performance shown during data processing. Although XML standards were originally conceived as a format for enabling data sharing in the community, they are commonly used as the input for MS data analysis. The latest software tools (18 Kohlbacher O. Reinert K. Gröpl C. Lange E. Pfeifer N. Schulz-Trieglaff O. Sturm M. TOPP–the OpenMS proteomics pipeline. Bioinformatics. 2007; 23: e191-e197, 19 Barsnes H. Vaudel M. Colaert N. Helsens K. Sickmann A. Berven F.S. Martens L. Compomics-utilities: an open-source Java library for computational proteomics. BMC Bioinformatics. 2011; 12: 70) are usually only compatible with mzML files, limiting de facto the throughput of proteomic analyses. To tackle these issues, some independent laboratories developed open formats relying on binary specifications (14, 17, 20 Jaitly N. Mayampurath A. Littlefield K. Adkins J.N. Anderson G.A. Smith R.D. Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics. 2009; 10: 87, 21 Smith C.A. Want E.J. O'Maille G. Abagyan R. Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem.
2006; 78: 779-787), to optimize both file size and data processing performance. Similar efforts started more than ten years ago; among others, NetCDF version 4, first described in 2004, added support for a new data model called HDF5. Because it is particularly well suited to the representation of complex data, HDF5 was used in several scientific projects to store and efficiently access large volumes of data, as for the mz5 format (17). Compared with XML based formats, mz5 is much more efficient in terms of file size, memory footprint, and access time. Thus, after replacing the JCAMP text format more than 10 years ago, netCDF is nowadays a suitable alternative to XML based formats. Nonetheless, solutions for storing and indexing large amounts of data in a binary file are not limited to netCDF. For instance, it has been demonstrated that a relational model can represent raw data, as in the YAFMS format (14), which is based on SQLite, a technology that allows implementing a portable, self-contained, single file database. Like mz5, YAFMS is definitely more efficient than XML in terms of file size and access times.

Despite their improvements, a limitation of these new binary formats is the lack of a multi-indexing model to represent the bi-dimensional structure of LC-MS data. Indexing LC-MS data along both of its dimensions can indeed be very useful when working with LC-MS/MS acquisition files.
In the current state of the art, three main raw data access strategies can be identified across DDA and DIA approaches:

(1) Sequential reading of whole m/z spectra, for a systematic processing of the entire raw file. Use cases: file format conversion, peak picking, analysis of MS/MS spectra, and MS/MS peak list generation.

(2) Systematic processing of the data contained in specific m/z windows, across the entire chromatographic gradient. Use cases: extraction of XICs on the whole chromatographic gradient and MS features detection.

(3) Random access to a small region of the LC-MS map (a few spectra or an m/z window of consecutive spectra). Use cases: data visualization, targeted extraction of XICs on a small time range, and targeted extraction of a subset of spectra.

The adoption of a certain data access strategy depends upon the particular data analysis algorithms, which can perform signal extraction mainly by unsupervised or supervised approaches. Unsupervised approaches (18, 22 Bellew M. Coram M. Fitzgibbon M. Igra M. Randolph T. Wang P. May D. Eng J. Fang R. Lin C. Chen J. Goodlett D. Whiteaker J. Paulovich A. McIntosh M. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics. 2006; 22: 1902-1909, 23 Katajamaa M. Oresic M. Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics. 2005; 6: 179, 24 Cox J. Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008; 26: 1367-1372, 25 Jaffe J.D. Mani D.R. Leptos K.C. Church G.M.
Gillette M.A. Carr S.A. PEPPeR, a platform for experimental proteomic pattern recognition. Mol. Cell. Proteomics. 2006; 5: 1927-1941) recognize LC-MS features on the basis of patterns like the theoretical isotope distribution, the shape of the elution peaks, etc. Conversely, supervised approaches (29 Li X.-J. Zhang H. Ranish J.A. Aebersold R. Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal. Chem. 2003; 75: 6648-6657, 30 Reiter L. Rinner O. Picotti P. Hüttenhain R. Beck M. Brusniak M.-Y. Hengartner M.O. Aebersold R. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat. Methods. 2011, 31 Method of the Year 2012. Nat. Methods. 2012; 10: 1, 32 Michalski A. Cox J. Mann M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. J. Proteome Res. 2011; 10: 1785-1793, 33 Mann M. Kulak N.A. Nagaraj N. Cox J. The coming age of complete, accurate, and ubiquitous proteomes. Mol. Cell. 2013; 49: 583-590) implement peak picking as a targeted data access, using a priori knowledge of the peptide coordinates (m/z, retention time, and precursor m/z for DIA), which are provided by appropriate extraction lists given by the identification search engine or the transition lists in targeted proteomics (34 Roest, H. L., Rosenberger, G., Navarro, P., Schubert, O. T., Wolski, W., Collins, B. C., Malmstroem, J., Malmstroem, L., Aebersold, R., A tool for the automated, targeted analysis of data-independent acquisition (DIA) MS-data: OpenSWATH. Nat. Biotechnol., accepted).
Data access overhead can vary significantly, according to the specific algorithm, data size, and length of the extraction list. In the unsupervised approach, feature detection is based first on the analysis of the full set of MS spectra and then on the grouping of the peaks detected in adjacent MS scans; thus, optimized sequential spectra access is required. In the supervised approach, peptide XICs are extracted using their a priori coordinates, and therefore sequential spectra access is not a suitable solution; for instance, MS spectra shared by different peptides would be loaded multiple times, leading to highly redundant data reloading. Even though sophisticated caching mechanisms can reduce the impact of this issue, they would increase memory consumption. It is thus preferable to perform a targeted access to specific MS spectra by leveraging an index in the time dimension. However, this would still be a sub-optimal solution because full MS spectra are loaded redundantly, whereas only a small spectral window centered on the peptide m/z is of interest. Thus the quantification of tens of thousands of peptides (32, 33) requires appropriate data access methods to cope with the repetitive and high load of MS data.

We therefore deem that an ideal file format should show comparable efficiency regardless of the particular use case. In order to achieve this flexibility and efficiency for any data access, we developed a new solution featuring multiple indexing strategies: the mzDB format (i.e. m/z database).
Like the YAFMS format, mzDB is implemented using SQLite, which is commonly adopted in several computational projects and is compatible with most programming languages. In contrast to the mz5 and YAFMS formats, where each spectrum is referenced by a single index entry, mzDB has an internal data structure allowing multidimensional data indexing, and thus results in efficient queries along both time and m/z dimensions. This makes mzDB specifically suited to the processing of large-scale LC-MS/MS data. In particular, the multidimensional data-indexing model was extended for SWATH-MS data, where a third index is given by the m/z of the precursor ion, in addition to the RT and m/z of the fragment ions. In order to show its efficiency for all described data access strategies, mzDB was compared with the mzML format, which is the official XML standard, and the latest mz5 binary format, which has already been compared with many existing file formats (17). Results show that mzDB outperforms other formats on most comparisons, except in sequential reading benchmarks where mz5 and mzDB are comparable. The access performance, portability, and compactness of mzDB, as well as its compliance with the PSI controlled vocabulary, make it complementary to existing solutions for both the storage and exchange of mass spectrometry data, and it should eventually address the issues related to data access overhead during their processing. mzDB can therefore enhance existing mass spectrometry data analysis pipelines, offering unprecedented performance and, therefore, new possibilities.

To perform the evaluation of the different file formats on DDA data, a total lysate of cultured primary human vascular endothelial cells (ECs) was used.
It was submitted to 1D-SDS-PAGE and fractionated into 12 gel bands, processed as described before (45). Peptides were eluted over an 80 min gradient and analyzed by nanoLC-MS/MS using an Ultimate 3000 system (Thermo Scientific Dionex, Sunnyvale, CA) coupled to an LTQ-Orbitrap Velos mass spectrometer (Thermo Fisher Scientific Inc., Waltham, MA). The LTQ-Orbitrap Velos was operated in data-dependent acquisition mode with the XCalibur software. Survey MS scans were acquired in the Orbitrap on the 300–2000 m/z range with the resolution set to a value of 60,000. The 10 most intense ions per survey scan were selected for CID fragmentation and the resulting fragments were analyzed in the linear trap (LTQ). Dynamic exclusion was employed within 60 s to prevent repetitive selection of the same peptide.

The SWATH-MS data used in this study were part of a recently published data set (34), corresponding to samples in which 422 synthetic peptides were spiked into three different proteomic backgrounds (water, yeast cell lysate, or Hela cell lysate) in a ten-step dilution series to produce a "gold standard" data set. These samples were submitted to SWATH-MS analysis on a TripleTOF 5600 System (AB SCIEX, Framingham, MA), essentially as described in (34). From this data set obtained from samples of different complexity, we selected four files of increasing size, ∼2, 5, 10, and 25 GB (final size after mzXML conversion).
For DDA data, the raw data files were converted into mz5 and mzML using the ProteoWizard (35 Kessner D. Chambers M. Burke R. Agus D. Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008; 24: 2534-2536) msconvert tool with the following settings: default binary encoding (64 bits for m/z and 32 bits for intensities), no data filtering (i.e. profile mode encoding), indexing enabled, and zlib compression disabled. The raw files were converted into mzDB using the in-house software tool "raw2mzDB.exe" (see "Implementations" in the results section) with the default bounding box dimensions: time width of 15 s and m/z width of 5 Da for MS bounding boxes, one bounding box per MS/MS spectrum (time width of 0 s and m/z width of 10,000 Da). The integrity of mzDB data was checked by comparing the MD5 signature of spectra values between the mzDB and the mz5 file formats (data not shown). To evaluate the sequential reading time, the twelve acquired DDA files were used, and from this small MS data set, we created a large and heterogeneous panel of data files using a procedure similar to the one used for mz5 benchmarking (17). Each file was repeatedly truncated with an increasing limit on the number of spectra (step size was set to 800 spectra), until the total size of the original file was reached. This led to the generation of 636 sub-files encompassing a wide range of sizes, and the sequential reading time was measured for each of them. To assess more specifically the reading time along the m/z dimension (run slices) and the performance of random access (range queries), the largest raw file from the twelve fractions was used (file size 1.6 GB).
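To make the default bounding box dimensions concrete, the following minimal Python sketch maps an MS peak to a grid cell using the 15 s time width and 5 Da m/z width quoted above. The function names, the origin at 0 s and 0 Da, and the half-open cell intervals are assumptions made for illustration; the text does not specify how "raw2mzDB.exe" computes cell boundaries.

```python
import math

# Default MS bounding box dimensions quoted in the text.
BB_TIME_WIDTH_S = 15.0
BB_MZ_WIDTH_DA = 5.0


def bounding_box_cell(rt_s, mz, time_width=BB_TIME_WIDTH_S, mz_width=BB_MZ_WIDTH_DA):
    """Return (time_index, mz_index) of the grid cell containing a peak.

    Hypothetical helper: cells are assumed to start at 0 s and 0 Da and to
    cover half-open intervals [start, start + width).
    """
    if rt_s < 0 or mz < 0:
        raise ValueError("retention time and m/z must be non-negative")
    return (math.floor(rt_s / time_width), math.floor(mz / mz_width))


def cell_bounds(time_index, mz_index, time_width=BB_TIME_WIDTH_S, mz_width=BB_MZ_WIDTH_DA):
    """Return (min_time, max_time, min_mz, max_mz) of a grid cell."""
    return (
        time_index * time_width,
        (time_index + 1) * time_width,
        mz_index * mz_width,
        (mz_index + 1) * mz_width,
    )


if __name__ == "__main__":
    cell = bounding_box_cell(100.0, 523.28)
    print(cell, cell_bounds(*cell))
```

Under these assumptions, all peaks that share an m/z index belong to the same m/z window regardless of their retention time, which is the grouping that the run slice benchmark relies on.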
The benchmarks were performed using different tools. In the case of mz5 files, raw files, and mzXML files, sequential reading time was evaluated using an iterative reading of MS spectra and was computed using the "msBenchmark" ProteoWizard tool by specifying the "-binary" command parameter, which is required for enabling the loading of all m/z-intensity pairs contained in the data file. Benchmarks involving the loading of LC-MS regions were assessed using the "msaccess" ProteoWizard tool by providing the appropriate options, specific to the performed reading operation: run slice iterations and whole LC gradient random extractions were executed with the "SIC" option enabled, whereas extraction of small specific regions was performed with the "slice" option. In the case of the mzDB files, all kinds of data access and tests were performed using the "pwiz-mzDB" library, which is built in C++ on the same model as "msaccess," to ensure homogeneous reading methods for all file formats.

The benchmarks based on the SWATH-MS data consisted of targeted data extraction of XICs of different sizes (50 ppm × 60 s and 50 ppm × 200 s) on the four files of increasing size (2, 5, 10, and 25 GB). In addition, the time necessary to establish a connection with the files was also evaluated, as was file size shrinkage. The comparison was run against the open mzXML file format, the standard currently adopted in the ETH lab, by means of in-house developed Java software. In particular, the access to mzXML files was implemented using the Java Proteomic Library (an enhanced version of the Java Random Access Library (JRAP) from the Seattle Proteome Center) to retrieve the spectra (i.e. peak lists) of interest, and the binary search of the Java platform Collections Framework to get the (m/z, intensity) points of interest from each spectrum. Data access to the mzDB files was performed using the "mzDB-swath" library developed in Java.
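The binary search step described for the mzXML reader can be sketched as follows. This is a Python illustration only (the benchmark software itself was written in Java), and the function names, the tuple-based spectrum layout, and the reading of the ppm value as a symmetric tolerance around the target m/z are all assumptions.

```python
from bisect import bisect_left, bisect_right


def extract_window(mzs, intensities, target_mz, ppm_tol):
    """Return the (m/z, intensity) points within +/- ppm_tol of target_mz.

    mzs must be sorted in ascending order, as in a typical peak list.
    """
    delta = target_mz * ppm_tol / 1e6
    lo = bisect_left(mzs, target_mz - delta)   # first point >= lower bound
    hi = bisect_right(mzs, target_mz + delta)  # one past last point <= upper bound
    return list(zip(mzs[lo:hi], intensities[lo:hi]))


def extract_xic(spectra, target_mz, ppm_tol, rt_min, rt_max):
    """Build an XIC as (rt, summed intensity) pairs over [rt_min, rt_max].

    spectra is a list of (rt_seconds, mzs, intensities) tuples.
    """
    xic = []
    for rt, mzs, intensities in spectra:
        if rt_min <= rt <= rt_max:
            points = extract_window(mzs, intensities, target_mz, ppm_tol)
            xic.append((rt, sum(intensity for _, intensity in points)))
    return xic


if __name__ == "__main__":
    demo = [
        (10.0, [499.9, 500.0, 500.01, 500.2], [1.0, 2.0, 3.0, 4.0]),
        (20.0, [500.0], [5.0]),
        (100.0, [500.0], [7.0]),
    ]
    print(extract_xic(demo, 500.0, 25, 0.0, 60.0))
```

Note that this approach still has to load each full spectrum in the time range before searching it, which is the kind of redundant loading that an additional index in the m/z dimension is meant to avoid.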
DDA hardware configuration: Windows 8, 64-bit workstation, Intel Core™ i7 2.93 GHz, 8 GB of RAM, and SATA HDD of 4 TB. DIA hardware configuration: Mac OS X 10.8.3, Intel Core™ i7 3.4 GHz, 32 GB of RAM, and SATA HDD of 1 TB.

The indexing strategy used in mzDB was designed to efficiently tackle the different access cases for LC-MS data. The first access case (sequential reading of spectra) is covered intrinsically by SQL spectrum indexes, which are natively provided by SQLite. Regarding the second access case (systematic loading of m/z windows), the mzDB relational schema (Fig. 1) was designed to have an additional index in the m/z dimension, introducing the "run slice" concept (Fig. 2), that is, a subset of the LC-MS map covering the whole chromatographic gradient but limited to a given m/z scan window. Basically, as shown in Fig. 2, LC-MS data are divided into grid cells of custom m/z and time widths, namely bounding boxes (BBs). Each spectrum is first split into several spectrum slices of a given m/z window. Spectrum slices belonging to the same m/z window and eluting in a given time window are grouped into a BB. A run slice is composed of all BBs having the same m/z window. In the context of quantitative analysis of LC-MS/MS runs, it therefore becomes possible to efficiently extract the signals of all peptides with an m/z falling within a given "run slice" m/z range. Finally, the third access case (random access to a small spectral region) was greatly optimized through the implementation of a multidimensional data indexing model that allows for efficient queries along both time and m/z dimensions. The general performance gain obtained with the multidimensional indexing model was described in previous studies (36 Guttman A. in Proceedings of the 1984 ACM SIGMOD international conference on Management of data. ACM. 1984: 47-57, 37 Vitter J.S. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv.
2001; 33: 209-271). Its application to LC-MS acquisitions was first tested on centroid data (38 Khan Z. Bloom J.S. Garcia B.A. Singh M. Kruglyak L. Protein quantification across hundreds of experimental conditions. Proc. Natl. Acad. Sci. U.S.A. 2009; 106: 15544-15548) and then on profile data (39 Nasso S. Silvestri F. Tisiot F. Di Camillo B. Pietracaprina A. Toffolo G.M. An optimized data structure for high-throughput 3D proteomics data: mzRTree. J. Proteomics. 2010; 73: 1176-1182) by mzRTree, an efficiency oriented data format. Here, the mzRTree structu
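
As a rough illustration of how bounding boxes stored in a single-file SQLite database can serve both run slice and range queries, the sketch below uses Python's built-in sqlite3 module. The table, column, and function names are hypothetical and do not reproduce the mzDB schema of Fig. 1, and an ordinary B-tree index stands in for the multidimensional index discussed in the text.

```python
import sqlite3

# In-memory database standing in for a single-file mzDB-like store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bounding_box (
        id INTEGER PRIMARY KEY,
        run_slice_id INTEGER NOT NULL,  -- shared by all boxes of one m/z window
        min_mz REAL NOT NULL,
        max_mz REAL NOT NULL,
        min_time REAL NOT NULL,
        max_time REAL NOT NULL,
        data BLOB                       -- placeholder for encoded spectrum slices
    )
    """
)
conn.execute("CREATE INDEX bb_run_slice_idx ON bounding_box (run_slice_id, min_time)")

# Toy grid: 3 m/z windows of 5 Da (500-515 Da) x 4 time windows of 15 s (0-60 s).
rows = [
    (s, 500.0 + 5.0 * s, 505.0 + 5.0 * s, 15.0 * t, 15.0 * (t + 1), b"")
    for s in range(3)
    for t in range(4)
]
conn.executemany(
    "INSERT INTO bounding_box (run_slice_id, min_mz, max_mz, min_time, max_time, data)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)


def run_slice_boxes(connection, run_slice_id):
    """Ids of all bounding boxes of one run slice, ordered along the gradient."""
    cursor = connection.execute(
        "SELECT id FROM bounding_box WHERE run_slice_id = ? ORDER BY min_time",
        (run_slice_id,),
    )
    return [row[0] for row in cursor]


def boxes_in_region(connection, mz_lo, mz_hi, t_lo, t_hi):
    """Ids of bounding boxes overlapping the [mz_lo, mz_hi] x [t_lo, t_hi] region."""
    cursor = connection.execute(
        "SELECT id FROM bounding_box"
        " WHERE max_mz >= ? AND min_mz <= ? AND max_time >= ? AND min_time <= ?"
        " ORDER BY id",
        (mz_lo, mz_hi, t_lo, t_hi),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(run_slice_boxes(conn, 1))
    print(boxes_in_region(conn, 506.0, 507.0, 20.0, 25.0))
```

In this toy layout a run slice query touches one row per time window, while a small range query touches only the boxes that overlap the requested region, instead of every spectrum in the run.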
