4 (Optional) Reference Genome

This step is optional. bcbio maintains current versions of the human and mouse genomes internally. If the standard references suit your needs, proceed to composing the settings YAML file. Otherwise, read on to build a custom reference.

  • Navigate to https://ftp.ensembl.org/pub/ and identify the desired release of the Ensembl reference genome. As of 2024-02-14, the latest is release-111.
  • For the chosen release, identify the cdna and gtf files for your species. For example, for release-111 human data, the files are

https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
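If you work with several releases or species, the file locations can be built from their parts rather than copied by hand. The sketch below is only an illustration: the function names species_prefix and ensembl_urls are hypothetical, and the URL pattern is inferred from the release-111 human example above, so check the output against the Ensembl directory listing before relying on it.

```shell
# Hypothetical helpers; the names are illustrative and not part of bcbio or Ensembl.

# Capitalize the first letter of an Ensembl species directory name,
# e.g. homo_sapiens -> Homo_sapiens (the prefix used in the file names).
species_prefix() {
    first=$(printf '%s' "$1" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$1" | cut -c2-)
    printf '%s%s\n' "$first" "$rest"
}

# Print the cdna, ncrna, and gtf URLs for one release, species, and assembly.
# Assumption: other releases follow the same layout as release-111.
ensembl_urls() {
    release="$1"
    species="$2"
    assembly="$3"
    base="https://ftp.ensembl.org/pub/release-${release}"
    prefix=$(species_prefix "$species")
    echo "${base}/fasta/${species}/cdna/${prefix}.${assembly}.cdna.all.fa.gz"
    echo "${base}/fasta/${species}/ncrna/${prefix}.${assembly}.ncrna.fa.gz"
    echo "${base}/gtf/${species}/${prefix}.${assembly}.${release}.gtf.gz"
}
```

Calling ensembl_urls 111 homo_sapiens GRCh38 prints the three human URLs, one per line. The list can be reviewed by eye or piped to wget -i - to download every file in it.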

  • Ensure that you are in the reference/ subfolder of your project.
  • Download the reference files using the wget command:
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
  • If you also want to include non-coding RNA, download the matching ncrna file. Following the release-111 example above, the download command is
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
  • After downloading both cdna and ncrna, concatenate the two into a single file (concatenated gzip files still form a valid gzip file, so this can be done before unzipping):
cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.111.fa.gz
  • Unzip all files:
gunzip *.gz
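A quick sanity check on the merge is to count FASTA records: the merged file should contain as many header lines as the cdna and ncrna files combined. The helper below is a minimal sketch; count_fasta_records is a hypothetical name, and it assumes every record header starts with ">" at the beginning of a line.

```shell
# Hypothetical helper: count FASTA records (lines starting with '>')
# in either a gzipped (.gz) or a plain-text file.
count_fasta_records() {
    case "$1" in
        *.gz) gzip -dc "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}
```

For example, run count_fasta_records on Homo_sapiens.GRCh38.cdna.all.fa.gz, Homo_sapiens.GRCh38.ncrna.fa.gz, and Homo_sapiens.GRCh38.111.fa.gz before unzipping; the third count should equal the sum of the first two. The same function works on the unzipped .fa files afterwards.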

4.1 Digital Gene Expression

No additional steps are required.

4.2 Deep RNAseq

When working with deep RNAseq data, you will also need to account for the ERCC ExFold RNA Spike-In that is often used for quality control.

  • Download the ERCC92 archive:
wget https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip
  • Unzip the downloaded file to retrieve the FASTA sequences:
unzip ERCC92.zip
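One possible way to account for the spike-ins, offered only as an assumption since this guide does not prescribe it, is to append the ERCC sequences to the transcript FASTA so that both are present in a single reference file. The sketch below uses a hypothetical helper, combine_fasta, and assumes the unzipped archive provides a FASTA file named ERCC92.fa. It adds a newline between inputs when one is missing, so that a header line is never glued onto the preceding sequence.

```shell
# Hypothetical helper: concatenate FASTA files into one output file.
# Usage: combine_fasta OUTPUT INPUT1 [INPUT2 ...]
combine_fasta() {
    out="$1"
    shift
    : > "$out"                      # create or truncate the output file
    for f in "$@"; do
        cat "$f" >> "$out"
        # Command substitution strips a trailing newline, so the result is
        # non-empty only when the file's last byte is NOT a newline.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$out"
        fi
    done
}
```

With the files from the release-111 example, a call could look like combine_fasta Homo_sapiens.GRCh38.111.ERCC.fa Homo_sapiens.GRCh38.111.fa ERCC92.fa, where the output name is chosen here purely for illustration.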