3 Sample Description
bcbio expects a sample description file that tells it which files map to which samples. The file should be in the .csv format and have at least two columns. An example file for four samples might look like this:
samplename,description
S1.fastq.gz,S1
S2.fastq.gz,S2
S3.fastq.gz,S3
S4.fastq.gz,S4
where file S1.fastq.gz is mapped to the sample named S1, and so on. Additional columns can be included in this file; bcbio preserves them as sample metadata.
Each subsection below shows how to generate a sample description file. The rest of the guide assumes that this sample description file is named alignment.csv.
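The two-column layout described above can be checked before the file is used. The following is a minimal sketch, not part of bcbio: check_sample_csv is a hypothetical helper name, and the demo runs on a throwaway copy of the four-sample example.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a sample description file.
# check_sample_csv is a hypothetical helper, not a bcbio command.
check_sample_csv() {
    local csv="$1"
    # The header must start with the two required columns.
    head -n 1 "$csv" | grep -q '^samplename,description' || {
        echo "ERROR: header must start with samplename,description" >&2
        return 1
    }
    # Every non-empty data row needs a file name and a sample name.
    awk -F, 'NR > 1 && $0 != "" && ($1 == "" || $2 == "") {
                 print "ERROR: incomplete row at line " NR > "/dev/stderr"
                 bad = 1
             }
             END { exit bad }' "$csv"
}

# Demo on a temporary copy of the four-sample example.
demo_dir=$(mktemp -d)
printf '%s\n' 'samplename,description' \
    'S1.fastq.gz,S1' 'S2.fastq.gz,S2' 'S3.fastq.gz,S3' 'S4.fastq.gz,S4' \
    > "$demo_dir/alignment.csv"
check_sample_csv "$demo_dir/alignment.csv" && echo "alignment.csv looks well-formed"
```

The check only looks at the first two columns, so additional metadata columns pass through untouched.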
3.1 Digital Gene Expression
When working with DGE data, you will typically have two matching files with suffixes R1 and R2. Only the R1 file needs to be specified in the sample description file. For example, your alignment.csv might look like
samplename,description
DGE1_XT_S1_R1_001.fastq.gz,MasterPlate
where the matching DGE1_XT_S1_R2_001.fastq.gz is not listed explicitly. Because of the high-throughput nature of DGE, all reads will generally be inside a single FASTQ file. This makes it easy to compose a sample description file by hand, as it will usually contain only two lines. When dealing with multiple FASTQ files, the sample description file can also be generated automatically by running the following command:
(echo 'samplename,description'; for f in fastq/*R1*fastq*.*z*; do readlink -f "$f" | perl -pe 's/(.*?_(S[0-9]+)_.*)/$1,$2/'; done) > alignment.csv
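Because only the R1 file is listed, a missing R2 file will not show up in the sample description file. The sketch below checks that each R1 file has a mate beside it. check_r2_mates is a hypothetical helper, and it assumes that the mate's name differs only by having _R2 in place of the last _R1, as in the example above.

```shell
#!/usr/bin/env bash
# Sketch: confirm that every R1 file has its matching R2 file next to it.
# check_r2_mates is a hypothetical helper, not a bcbio command.
check_r2_mates() {
    local dir="$1" r1 r2 missing=0
    for r1 in "$dir"/*_R1*fastq*; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        # Swap the last "_R1" in the path for "_R2".
        r2="${r1%_R1*}_R2${r1##*_R1}"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo with empty placeholder files named like the example above.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/fastq"
touch "$demo_dir/fastq/DGE1_XT_S1_R1_001.fastq.gz" \
      "$demo_dir/fastq/DGE1_XT_S1_R2_001.fastq.gz"
check_r2_mates "$demo_dir/fastq" && echo "all R1 files have an R2 mate"
```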
3.2 Deep RNAseq
In the case of deep RNAseq data, you may have multiple FASTQ files per sample. Prior to running the aligner, these files need to be merged. bcbio provides a script for doing so, but it needs to know which files map to which samples. Compose a file named toMerge.csv that provides such a mapping. For example,
samplename,description
/n/scratch/abc123/myProject/fastq/TRA00140445_S1_L001_R1.fastq.bz2,S1
/n/scratch/abc123/myProject/fastq/TRA00140445_S1_L002_R1.fastq.bz2,S1
/n/scratch/abc123/myProject/fastq/TRA00140445_S1_L003_R1.fastq.bz2,S1
/n/scratch/abc123/myProject/fastq/TRA00140445_S1_L004_R1.fastq.bz2,S1
/n/scratch/abc123/myProject/fastq/TRA00140445_S2_L001_R1.fastq.bz2,S2
/n/scratch/abc123/myProject/fastq/TRA00140445_S2_L002_R1.fastq.bz2,S2
/n/scratch/abc123/myProject/fastq/TRA00140445_S2_L003_R1.fastq.bz2,S2
/n/scratch/abc123/myProject/fastq/TRA00140445_S2_L004_R1.fastq.bz2,S2
lists the mapping for two samples S1 and S2, each having data collected across four sequencer lanes.
If you are currently in your project directory /n/scratch/abc123/myProject/ and have your data in the fastq/ subdirectory, such a file can be constructed automatically by running the following command:
(echo 'samplename,description'; for f in fastq/*fastq*.*z*; do readlink -f "$f" | perl -pe 's/(.*?_(S[0-9]+)_.*)/$1,$2/'; done) > toMerge.csv
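Whether a merge is needed depends on how many files each sample has in toMerge.csv. The sketch below counts them. files_per_sample and needs_merge are hypothetical helper names, and the demo rebuilds the two-sample example above in a temporary directory.

```shell
#!/usr/bin/env bash
# Sketch: count how many FASTQ files each sample has in a mapping file.
# files_per_sample and needs_merge are hypothetical helpers, not bcbio commands.
files_per_sample() {
    # Skip the header, tally the description column, print "sample,count".
    awk -F, 'NR > 1 && $2 != "" { n[$2]++ }
             END { for (s in n) print s "," n[s] }' "$1" | sort
}

needs_merge() {
    # Succeeds (exit status 0) if any sample has more than one file.
    files_per_sample "$1" | awk -F, '$2 > 1 { found = 1 } END { exit !found }'
}

# Demo: two samples, four lanes each, as in the example above.
demo_dir=$(mktemp -d)
{
    echo 'samplename,description'
    for s in S1 S2; do
        for lane in L001 L002 L003 L004; do
            echo "/n/scratch/abc123/myProject/fastq/TRA00140445_${s}_${lane}_R1.fastq.bz2,${s}"
        done
    done
} > "$demo_dir/toMerge.csv"

files_per_sample "$demo_dir/toMerge.csv"
if needs_merge "$demo_dir/toMerge.csv"; then
    echo "at least one sample has multiple files"
else
    echo "one file per sample"
fi
```

For the demo file this reports four files for each of S1 and S2, so the multiple-files case applies.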
3.2.1 One file per sample
If you don’t have multiple FASTQ files per sample, then the toMerge.csv file that you just constructed becomes your sample description file. Rename it to alignment.csv using
mv toMerge.csv alignment.csv
and proceed to downloading a reference genome.
3.2.2 Multiple files per sample
Once you have prepared the file toMerge.csv, run the following command to execute the merge:
bcbio_prepare_samples.py --out merged --csv toMerge.csv
The merging operation might take a while. When it finishes, your merged files will reside in the merged/ subdirectory. You will also have a new filename-to-samplename mapping file toMerge-merged.csv. For consistency with the remainder of this guide, rename this file to alignment.csv using the mv command:
mv toMerge-merged.csv alignment.csv