This vignette illustrates how to read and input your own data to theSIAMCAT
package. We will cover reading in text files from the disk,formatting them and using them to create an object of siamcat-class
.
The siamcat-class
is the centerpiece of the package. All of the input dataand result are stored inside of it. The structure of the object is describedbelow in the siamcat-class object section.
2.1 SIAMCAT input
Generally, there are three types of input for SIAMCAT
:
2.1.1 Features
The features should be a matrix
, a data.frame
, or an otu_table
,organized as follows:
features (in rows) x samples (in columns).
Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | |
---|---|---|---|---|---|
Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.20 |
Feature_4 | 0.34 | 0.00 | 0.13 | 0.07 | 0.00 |
Feature_5 | 0.06 | 0.16 | 0.00 | 0.00 | 0.00 |
Please note that
SIAMCAT
is supposed to work with relative abundances.Other types of data (e.g.counts) will also work, but not all functions of thepackage will result in meaningful outputs.
An example of a typical feature file is attached to the SIAMCAT
package,containing data from a publication investigating the microbiome in colorectalcancer (CRC) patients and controls (the study can be found here:Zeller et al). The metagenomicsdata were processed with the MOCAT pipeline, returningtaxonomic profiles on the species levels (specI
):
library(SIAMCAT)fn.in.feat <- system.file( "extdata", "feat_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT")
One way to load such data into R
could be the use of read.table
(Beware of the defaults in R! They are not always useful…)
feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE)# look at some featuresfeat[110:114, 1:2]
## CCIS27304052ST-3-0 CCIS15794887ST-4-0## Bacteroides caccae [h:1096] 1.557937e-03 1.761949e-03## Bacteroides eggerthii [h:1097] 2.734527e-05 4.146882e-05## Bacteroides stercoris [h:1098] 1.173786e-03 2.475838e-03## Bacteroides clarus [h:1099] 4.830533e-04 4.589747e-06## Methanohalophilus mahii [h:11] 0.000000e+00 0.000000e+00
2.1.2 Metadata
The metadata should be either a matrix or a data.frame
.
samples (in rows) x metadata (in columns)
:
Age | Gender | BMI | |
---|---|---|---|
Sample_1 | 52 | 1 | 20 |
Sample_2 | 37 | 1 | 18 |
Sample_3 | 66 | 2 | 24 |
Sample_4 | 54 | 2 | 26 |
Sample_5 | 65 | 2 | 30 |
The rownames
of the metadata should match the colnames
of the featurematrix.
Again, an example of such a file is attached to the SIAMCAT
package, takenfrom the same study:
fn.in.meta <- system.file( "extdata", "num_metadata_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT")
Also here, the read.table
can be used to load the data into R
.
meta <- read.table(fn.in.meta, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE)head(meta)
## age gender bmi diagnosis localization crc_stage fobt## CCIS27304052ST-3-0 52 1 20 0 NA 0 0## CCIS15794887ST-4-0 37 1 18 0 NA 0 0## CCIS74726977ST-3-0 66 2 24 1 NA 0 0## CCIS16561622ST-4-0 54 2 26 0 NA 0 0## CCIS79210440ST-3-0 65 2 30 0 NA 0 1## CCIS82507866ST-3-0 57 2 24 0 NA 0 0## wif_test## CCIS27304052ST-3-0 0## CCIS15794887ST-4-0 0## CCIS74726977ST-3-0 NA## CCIS16561622ST-4-0 0## CCIS79210440ST-3-0 0## CCIS82507866ST-3-0 0
2.1.3 Label
Finally, the label can come in different three different flavours:
Named vector: A named vector containing information about cases andcontrols. The names of the vector should match the
rownames
of the metadataand thecolnames
of the feature data.The label can contain either the information about cases and controls either- as integers (e.g.
0
and1
), - as characters (e.g.
CTR
andIBD
), or - as factors.
- as integers (e.g.
Metadata column: You can provide the name of a column in the metadata forthe creation of the label. See below for an example.
Label file:
SIAMCAT
has a function calledread.label
, which willcreate a label object from a label file. The file should be organized asfollows:- The first line is supposed to read:
#BINARY:1=[label for cases];-1=[label for controls]
- The second row should contain the sample identifiers as tab-separatedlist (consistent with feature and metadata).
- The third row is then supposed to contain the actual class labels(tab-separated):
1
for each case and-1
for each control.
An example file is attached to the package again, if you want to have alook at it.
- The first line is supposed to read:
For our example dataset, we can create the label object out of the metadatacolumn called diagnosis
:
label <- create.label(meta=meta, label="diagnosis", case = 1, control=0)
When we later plot the results, it might be nicer to have names for thedifferent groups stored in the label object (instead of 1
and 0
). We canalso supply them to the create.label
function:
label <- create.label(meta=meta, label="diagnosis", case = 1, control=0, p.lab = 'cancer', n.lab = 'healthy')
## Label used as case:## 1## Label used as control:## 0
## + finished create.label.from.metadata in 0.001 s
label$info
## healthy cancer ## -1 1
Note:
If you have no label information for your dataset, you can still create aSIAMCAT
object from your features alone. TheSIAMCAT
object without labelinformation will contain aTEST
label that can be used for making holdoutpredictions. Other functions, e.g.model training, will not work on such anobject.
2.2 LEfSe format files
LEfSe is a tool foridentification of associations between micriobial features and up to twometadata. LEfSe uses LDA (linear discriminant analysis).
LEfSe input file is a .tsv
file. The first few rows contain the metadata. Thefollowing row contains sample names and the rest of the rows are occupied byfeatures. The first column holds the row names:
label | healthy | healthy | healthy | cancer | cancer |
---|---|---|---|---|---|
age | 52 | 37 | 66 | 54 | 65 |
gender | 1 | 1 | 2 | 2 | 2 |
Sample_info | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 |
Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
Feature_4 | 0.34 | 0.00 | 0.43 | 0.00 | 0.00 |
Feature_5 | 0.56 | 0.56 | 0.00 | 0.00 | 0.00 |
An example of such a file is attached to the SIAMCAT
package:
fn.in.lefse<- system.file( "extdata", "LEfSe_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT")
SIAMCAT
has a dedicated function to read LEfSe format files. The read.lefse
function will read in the input file and extract metadata and features:
meta.and.features <- read.lefse(fn.in.lefse, rows.meta = 1:6, row.samples = 7)meta <- meta.and.features$metafeat <- meta.and.features$feat
We can then create a label object from one of the columns of the meta object andcreate a siamcat
object:
label <- create.label(meta=meta, label="label", case = "cancer")
## Label used as case:## cancer## Label used as control:## healthy
## + finished create.label.from.metadata in 0.002 s
2.3 metagenomeSeq format files
metagenomeSeq is an Rpackage to determine differentially abundant features between multiple samples.
There are two ways to input data into metagenomeSeq:
- two files, one for metadata and one for features - those can be usedin
SIAMCAT
just like described in SIAMCAT input withread.table
:
fn.in.feat <- system.file( "extdata", "CHK_NAME.otus.count.csv", package = "metagenomeSeq")feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', row.names = 1, stringsAsFactors = FALSE, check.names = FALSE)
BIOM
format file, that can be used inSIAMCAT
as described in thefollowing section
2.4 BIOM format files
The BIOM format files can be added to SIAMCAT
via phyloseq
. First the fileshould be imported using the phyloseq
function import_biom
. Then aphyloseq
object can be imported as a siamcat
object as descibed in thenext section.
2.5 Creating a siamcat object of a phyloseq object
The siamcat
object extends on the phyloseq
object. Therefore, creatinga siamcat
object from a phyloseq
object is really straightforward. Thiscan be done with the siamcat
constructor function. First, however, we needto create a label object:
data("GlobalPatterns") ## phyloseq example datalabel <- create.label(meta=sample_data(GlobalPatterns), label = "SampleType", case = c("Freshwater", "Freshwater (creek)", "Ocean"))
## Label used as case:## Freshwater,Freshwater (creek),Ocean## Label used as control:## rest
## + finished create.label.from.metadata in 0.003 s
# run the constructor functionsiamcat <- siamcat(phyloseq=GlobalPatterns, label=label)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 26 sample(s).
## +++ checking sample number per class
## Data set has a limited number of training examples:## rest 18 ## Case 8 ## Note that a dataset this small/skewed is not necessarily suitable for analysis in this pipeline.
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.103 s
The siamcat-class
is the centerpiece of the package. All of the is storedinside of the object:
In the figure above, rectangles depict slots of the object and the class ofthe object stored in the slot is given in the ovals. There are twoobligatory slots -phyloseq (containing the metadata as sample_data
andthe original features as otu_table
) and label - marked with thick borders.
The siamcat
object is constructed using the siamcat()
function. There aretwo ways to initialize it:
Features: You can provide a feature
matrix
,data.frame
, orotu_table
to the function (together with label and metadata information):siamcat <- siamcat(feat=feat, label=label, meta=meta)
phyloseq: The alternative is to create a
siamcat
object directly outof aphyloseq
object:siamcat <- siamcat(phyloseq=phyloseq, label=label)
Please note that you have to provide either feat
or phyloseq
and thatyou cannot provide both.
In order to explain the siamcat
object better we will show how each of theslots is filled.
3.1 phyloseq, label and orig_feat slots
The phyloseq and label slots are obligatory.
- The phyloseq slot is an object of class
phyloseq
, which is described in thehelp file of thephyloseq
class. Help can be accessed by typing into Rconsole:help('phyloseq-class')
.- The
otu_table
slot inphyloseq
-seehelp('otu_table-class')
-stores the original feature table. ForSIAMCAT
, this slot can beaccessed byorig_feat
.
- The
- The label slot contains a list. This list has a specific set of entries-see
help('label-class')
- that are automatically generated by theread.label
orcreate.label
functions.
The phyloseq
, label and orig_feat are filled when the siamcat
object isfirst created with the constructor function.
3.2 All the other slots
Other slots are filled during the run of the SIAMCAT
workflow:
3.3 Accessing and assigning slots
Each slot in siamcat
can be accessed by typing
slot_name(siamcat)
e.g.for the eval_data
slot you can types
eval_data(siamcat)
There is one notable exception: the phyloseq slot has to be accessed withphyseq(siamcat)
due to technical reasons.
Slots will be filled during the SIAMCAT
workflow by the package’s functions.However, if for any reason a slot needs to be assigned outside of the workflow,the following formula can be used:
slot_name(siamcat) <- object_to_assign
e.g.to assign a new_label
object to the label
slot:
label(siamcat) <- new_label
Please note that this may lead to unforeseen consequences…
3.4 Slots inside the slots
There are two slots that have slots inside of them. First, the model_list
slot has a models
slot that contains the actual list ofmlr models-can be accessed via models(siamcat)
- and model.type
which is a characterwith the name of the method used to train the model: model_type(siamcat)
.
The phyloseq slot has a complex structure. However, unless the phyloseqobject is created outside of the SIAMCAT
workflow, only two slots of phyloseqslot will be occupied: the otu_table
slot containing the features table andthe sam_data
slot containing metadata information. Both can be accessed bytyping either features(siamcat)
or meta(siamcat)
.
Additional slots inside the phyloseq slots do not have dedicated accessors,but can easily be reached once the phyloseq object is exported from thesiamcat
object:
phyloseq <- physeq(siamcat)tax_tab <- tax_table(phyloseq)head(tax_tab)
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:## Kingdom Phylum Class Order Family ## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA ## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA ## 951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales" "Sulfolobaceae"## 244423 "Archaea" "Crenarchaeota" "Sd-NA" NA NA ## 586076 "Archaea" "Crenarchaeota" "Sd-NA" NA NA ## 246140 "Archaea" "Crenarchaeota" "Sd-NA" NA NA ## Genus Species ## 549322 NA NA ## 522457 NA NA ## 951 "Sulfolobus" "Sulfolobusacidocaldarius"## 244423 NA NA ## 586076 NA NA ## 246140 NA NA
If you want to find out more about the phyloseq data structure, head over tothephyloseqBioConductor page.# Session Info
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)## Platform: x86_64-pc-linux-gnu (64-bit)## Running under: Ubuntu 22.04.2 LTS## ## Matrix products: default## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0## ## locale:## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York## tzcode source: system (glibc)## ## attached base packages:## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages:## [1] ggpubr_0.6.0 SIAMCAT_2.4.0 phyloseq_1.44.0 mlr3_0.15.0 ## [5] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2 ## [9] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ## [13] ggplot2_3.4.2 tidyverse_2.0.0 BiocStyle_2.28.0## ## loaded via a namespace (and not attached):## [1] RColorBrewer_1.1-3 jsonlite_1.8.4 shape_1.4.6 ## [4] magrittr_2.0.3 magick_2.7.4 farver_2.1.1 ## [7] corrplot_0.92 nloptr_2.0.3 rmarkdown_2.21 ## [10] zlibbioc_1.46.0 vctrs_0.6.2 multtest_2.56.0 ## [13] minqa_1.2.5 RCurl_1.98-1.12 PRROC_1.3.1 ## [16] rstatix_0.7.2 htmltools_0.5.5 progress_1.2.2 ## [19] curl_5.0.0 broom_1.0.4 Rhdf5lib_1.22.0 ## [22] rhdf5_2.44.0 pROC_1.18.0 sass_0.4.5 ## [25] parallelly_1.35.0 bslib_0.4.2 plyr_1.8.8 ## [28] palmerpenguins_0.1.1 mlr3tuning_0.18.0 cachem_1.0.7 ## [31] uuid_1.1-0 igraph_1.4.2 lifecycle_1.0.3 ## [34] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.5-4 ## [37] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10## [40] future_1.32.0 digest_0.6.31 numDeriv_2016.8-1.1 ## [43] colorspace_2.1-0 S4Vectors_0.38.0 mlr3misc_0.11.0 ## [46] vegan_2.6-4 labeling_0.4.2 fansi_1.0.4 ## [49] timechange_0.2.0 abind_1.4-5 mgcv_1.8-42 ## [52] compiler_4.3.0 beanplot_1.3.1 bit64_4.0.5 ## [55] withr_2.5.0 backports_1.4.1 carData_3.0-5 ## [58] highr_0.10 ggsignif_0.6.4 LiblineaR_2.10-22 ## [61] MASS_7.3-59 biomformat_1.28.0 permute_0.9-7 ## [64] tools_4.3.0 ape_5.7-1 glue_1.6.2 ## [67] lgr_0.4.4 nlme_3.1-162 rhdf5filters_1.12.0 ## [70] grid_4.3.0 checkmate_2.1.0 gridBase_0.4-7 ## [73] cluster_2.1.4 reshape2_1.4.4 ade4_1.7-22 ## [76] generics_0.1.3 gtable_0.3.3 tzdb_0.3.0 ## [79] data.table_1.14.8 hms_1.1.3 car_3.1-2 ## [82] utf8_1.2.3 XVector_0.40.0 BiocGenerics_0.46.0 ## [85] foreach_1.5.2 pillar_1.9.0 vroom_1.6.1 ## [88] bbotk_0.7.2 splines_4.3.0 lattice_0.21-8 ## [91] survival_3.5-5 bit_4.0.5 tidyselect_1.2.0 ## [94] Biostrings_2.68.0 knitr_1.42 infotheo_1.2.0.1 ## [97] gridExtra_2.3 bookdown_0.33 IRanges_2.34.0 ## [100] stats4_4.3.0 xfun_0.39 Biobase_2.60.0 ## [103] matrixStats_0.63.0 stringi_1.7.12 yaml_2.3.7 ## [106] boot_1.3-28.1 evaluate_0.20 codetools_0.2-19 ## [109] archive_1.1.5 BiocManager_1.30.20 cli_3.6.1 ## [112] munsell_0.5.0 jquerylib_0.1.4 mlr3learners_0.5.6 ## [115] Rcpp_1.0.10 GenomeInfoDb_1.36.0 globals_0.16.2 ## [118] parallel_4.3.0 prettyunits_1.1.1 bitops_1.0-7 ## [121] paradox_0.11.1 lme4_1.1-33 listenv_0.9.0 ## [124] glmnet_4.1-7 lmerTest_3.1-3 scales_1.2.1 ## [127] crayon_1.5.2 rlang_1.1.0 mlr3measures_0.5.0