Towards an improved apple reference transcriptome using RNA-seq (Bai et al. 2014)

Overview
Analysis Name: Towards an improved apple reference transcriptome using RNA-seq (Bai et al. 2014)
Method: CLC Genomics Workbench (v6.5)
Source: 180.8 million raw reads from 14 Golden Delicious sequence files (SRX392051)
Date performed: 2013-10-16

Abstract from Bai et al., 2014:

The reference genome of apple (Malus × domestica) has been available since 2010. Despite being a milestone in apple genomics, the reference genome is difficult to be used as a reference in RNA-seq (RNA sequencing) analysis, a widespread technology in transcriptomic studies. One of the major limitations appears to be the low coverage of the reference transcriptome in RNA-seq mapping of reads. To improve the reference transcriptome, we obtained 14 sets of strand-specific RNA-seq data of 168.5 million reads (filter passed) in total from fruit of Golden Delicious (GD, the source of the reference genome) in varying growth and developmental stages. Using a combination of genome-guided assembly and de novo assembly, the apple reference transcriptome was improved to a collection of 71,178 genes or transcripts, which includes 53,654 genes predicted originally (with MDP prefixed in their IDs) and 17,524 novel transcripts. Of these novel transcripts, 8,144 were identified from reads directly mapped to the reference genome while the remaining 9,380 were extracted from de novo assemblies of reads that could not be initially mapped to the reference genome. Evaluating the improved apple reference transcriptome with reads from Golden Delicious and other genotypes used in this and other studies showed that it allowed 62.5 ± 9.3-82.3 ± 2.7 % of reads to be mapped, a marked increase from the low rates of 37.4 ± 7.7-46.6 ± 7.1 % offered by the original reference transcriptome. The improved reference transcriptome therefore represents a step forward towards a complete reference transcriptome in apple.

 

 

Downloads

All assembly and annotation files are available for download by selecting the desired data type in the right-hand sidebar. Each data type page provides a description of the available files and links to download them.

Alternatively, browse the project's FTP directory for all files.

Assembly

RNA-seq reads and contigs that did not map to the original M. x domestica v1.0 genome contigs were assembled into 9,605 novel contigs, on which de novo transcripts were later identified. A FASTA file of these new contigs is available below, along with a file containing all 131,712 contigs (122,107 original contigs + 9,605 new contigs):

File Description | File
New transcriptome-derived contigs (FASTA format) | Malus_x_domestica-CU_RNA_seq_genes-new_contigs.fa.gz
All contigs (FASTA format) | Malus_x_domestica-CU_RNA_seq_genes-contigs_all.fa.gz
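
After downloading, the contig counts stated above can be checked with a few lines of Python. This is a minimal sketch, assuming the files are ordinary gzip-compressed FASTA with one ">" header line per contig; the function name is hypothetical and not part of any GDR tooling.

```python
import gzip

# Contig counts as stated on this page, used only as expected values.
EXPECTED_NEW = 9_605
EXPECTED_ORIGINAL = 122_107
EXPECTED_ALL = EXPECTED_ORIGINAL + EXPECTED_NEW  # 131,712


def count_fasta_records(path):
    """Count records (lines starting with '>') in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records("Malus_x_domestica-CU_RNA_seq_genes-new_contigs.fa.gz")` would be expected to equal `EXPECTED_NEW` if the download is complete and the file layout matches the assumption above.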

 

Procedures

Three rounds of analyses were performed to confirm existing gene models or to identify new transcripts:

  • Round 1
    • Purpose: confirm existing gene models and reveal new transcripts directly from the Malus x domestica v1.0 assembly.
    • Details:
      • Created a de novo transcript assembly from all reads.
      • Transcript contigs and singlets (unmapped reads) were mapped to the genome assembly, yielding 8,144 new transcripts not originally present, along with 53,654 original gene models (including 45,426 supported by RNA-seq data and 8,213 not supported by RNA-seq data but located on the same genomic contigs as the 45,426 supported genes).
  • Round 2
    • Purpose: discover new transcripts from reads that could not be mapped to the original genomic assembly.
    • Details:
      • Created a de novo transcript assembly from reads that did not map to the genome assembly at all.
      • The unmapped reads were aligned to this new transcript assembly, and contigs with >= 50 reads were retained after contaminant filtering.
      • Transcript discovery was performed on contigs of >= 500 bp with >= 1,000 mapped reads, yielding 5,361 new transcripts.
  • Round 3
    • Purpose: identify additional transcripts
    • Details:
      • Transcript discovery was performed on contigs of 300-499 bp with 100-999 mapped reads, yielding 4,019 new transcripts.
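
The length and read-count thresholds for Rounds 2 and 3 can be written as a small decision function. This is only an illustrative sketch of the stated cutoffs, not the CLC Genomics Workbench procedure itself; the function name and the example contigs are hypothetical. Combinations the page does not describe (for example, a contig of >= 500 bp with fewer than 1,000 reads) are returned as None rather than guessed.

```python
def discovery_round(length_bp, mapped_reads):
    """Return 2 or 3 for the round whose stated thresholds a contig meets, else None."""
    if length_bp >= 500 and mapped_reads >= 1000:
        return 2  # Round 2: >= 500 bp with >= 1,000 mapped reads
    if 300 <= length_bp <= 499 and 100 <= mapped_reads <= 999:
        return 3  # Round 3: 300-499 bp with 100-999 mapped reads
    return None  # not covered by either stated rule


# Hypothetical contigs: (name, length in bp, mapped reads)
example_contigs = [
    ("contig_a", 1200, 5400),
    ("contig_b", 350, 250),
    ("contig_c", 800, 400),
    ("contig_d", 250, 2000),
]

for name, length_bp, reads in example_contigs:
    print(name, discovery_round(length_bp, reads))
```

The two rounds together account for the 9,380 de novo transcripts reported in the abstract (5,361 + 4,019).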

Genes/Transcripts

Three types of gene models/transcripts are available from this project:

  • Novel transcripts. Consists of 17,479 models, including 8,134 (with IDs G10####, e.g. G101234) discovered by mapping the RNA-seq contigs or reads to the original M. x domestica v1.0 genomic contigs, and 9,345 (with IDs G20#### or G30####) discovered by de novo assembly of unmapped RNA-seq reads. Note that this is 45 fewer than the 17,524 novel transcripts reported in Bai et al. (2014). The 45 transcripts were excluded because they overlap with other novel transcripts (identified on the opposite strand) that were supported by the majority of the mapped directional RNA-seq reads.
  • Confirmed gene models. Consists of 45,426 gene models from the original M. x domestica v1.0 assembly supported with alignments of the RNA-seq assembly and/or individual reads. Some were combined into single models.
  • Non-confirmed gene models. Consists of 18,036 gene models from the original M. x domestica v1.0 assembly that had no alignments with the RNA-seq assembly or individual reads.

Gene models from the original M. x domestica v1.0 assembly have had their names shortened for brevity but the unique numerical identifier is the same (e.g. MDP0000122515 is shortened to M122515).
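
The naming scheme can be illustrated with a short Python sketch. It rests on assumptions inferred from the examples on this page: each "#" stands for one digit, and original IDs are shortened by replacing the "MDP0000" prefix with "M" and keeping the last six digits (the only example given is MDP0000122515 to M122515). Both function names are hypothetical.

```python
import re

# Assumed layout of an original ID: "MDP0000" followed by six digits.
_MDP_PATTERN = re.compile(r"^MDP0000(\d{6})$")


def shorten_mdp_id(mdp_id):
    """Shorten an original gene model ID, e.g. MDP0000122515 -> M122515."""
    match = _MDP_PATTERN.match(mdp_id)
    if match is None:
        raise ValueError(f"ID does not fit the assumed MDP layout: {mdp_id!r}")
    return "M" + match.group(1)


def classify_model_id(model_id):
    """Guess the model category from the ID prefix described on this page."""
    if re.fullmatch(r"G10\d{4}", model_id):
        return "novel transcript (mapped to genomic contigs)"
    if re.fullmatch(r"G[23]0\d{4}", model_id):
        return "novel transcript (de novo assembly)"
    if re.fullmatch(r"M\d{6}", model_id):
        return "original gene model"
    return "unknown"


print(shorten_mdp_id("MDP0000122515"))
print(classify_model_id("G101234"))
```

Note that the ID prefix alone does not distinguish confirmed from non-confirmed original gene models; that information comes from which file a model appears in.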

Each set of gene models or novel transcripts is available below, along with a single file containing all models.

File Description | File
Novel transcripts (GFF format) | Malus_x_domestica-CU_RNA_seq_genes-de_novo.gff
Confirmed gene models (GFF format) | Malus_x_domestica-CU_RNA_seq_genes-confirmed.gff
Non-confirmed gene models (GFF format) | Malus_x_domestica-CU_RNA_seq_genes-non_confirmed.gff
All transcripts: novel, confirmed and non-confirmed (GFF format) | Malus_x_domestica-CU_RNA_seq_genes-all.gff
All transcripts (FASTA format) | Malus_x_domestica-CU_RNA_seq_genes-all.fa.gz
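
A quick way to inspect one of the GFF files is to tally the feature types it contains. The sketch below assumes the usual tab-delimited, nine-column GFF layout with the feature type in the third column; the feature type names actually used in these files are not described on this page, so none are assumed here. The function name is hypothetical.

```python
from collections import Counter


def count_feature_types(gff_path):
    """Tally the third column (feature type) of a tab-delimited GFF file."""
    counts = Counter()
    with open(gff_path) as handle:
        for line in handle:
            # Skip blank lines and comment/directive lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a standard nine-column feature line
            counts[fields[2]] += 1
    return counts
```

Calling `count_feature_types("Malus_x_domestica-CU_RNA_seq_genes-de_novo.gff")` returns a Counter keyed by feature type, which can then be compared with the model counts listed above.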

 

Alignments

There are two alignment files in BAM format that can be accessed using samtools or other BAM file viewers. These files are large, but it may not be necessary to download them. If the desired tool (e.g. samtools) supports remote access to BAM files, simply copy and paste the URLs below into the tool.

Alignment of transcriptome to genomic contigs

Alignment of all RNA-Seq reads (from the 14 samples) to genes
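
As a sketch of remote access, the Python snippet below wraps `samtools view -H`, which prints only the header of a BAM file. It assumes samtools is installed with support for remote URLs, and the URL shown is a placeholder, not one of the actual file locations. The function names are hypothetical.

```python
import shutil
import subprocess

# Placeholder only: substitute the URL copied from the link above.
BAM_URL = "https://example.org/path/to/alignment.bam"


def samtools_header_command(bam_url):
    """Build the command line that prints only the header of a BAM file."""
    return ["samtools", "view", "-H", bam_url]


def read_remote_header(bam_url):
    """Run samtools and return the BAM header text."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    result = subprocess.run(
        samtools_header_command(bam_url),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


print(" ".join(samtools_header_command(BAM_URL)))
```

Reading the header is a cheap way to confirm that a remote file is reachable before querying it; region queries over a URL typically also need the matching index file to be available alongside the BAM.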

Acknowledgements

The data and information for this project were provided to GDR by the Kenong Xu lab of Cornell University and were formatted by the GDR team for public access through GDR.