Fragaria vesca (CFRA 2339) Whole Genome v1.0 Assembly & Annotation

Overview
Analysis Name: Fragaria vesca (CFRA 2339) Whole Genome v1.0 Assembly & Annotation
Method: Canu (v1.9)
Source: Oxford Nanopore reads
Date performed: 2021-06-22

Publication:

Alger EI, Platts AE, Deb SK, Luo X, Ou S, Cao Y, Hummer KE, Xiong Z, Knapp SJ, Liu Z, McKain MR and Edger PP (2021) Chromosome-Scale Genome for a Red-Fruited, Perpetual Flowering and Runnerless Woodland Strawberry (Fragaria vesca). Front. Genet. 12:671371. doi: 10.3389/fgene.2021.671371

Abstract:

Background: Although a high-quality reference genome is available for the diploid woodland strawberry (F. vesca), it is for the 'Hawaii-4' accession that produces runners and yellow fruit. A reference genome for an accession that produces red fruit and is runnerless, key variants for two important traits for the commercial strawberry, is not publicly available. Findings: Here we report a near-complete genome of Fragaria vesca 'CFRA 2339' using Oxford Nanopore long read sequencing. The 'CFRA 2339' genotype produces red fruit, perpetually flowers, and is completely runnerless. The assembly spanned 229.5 Mb and has a contig N50 length of ~24.3 million base pairs (Mb). Three of the chromosomes are captured by a single contig, another three chromosomes are split into two contigs, and the one remaining chromosome is split into three contigs. These contigs were anchored to 7 pseudomolecules using comparisons to the 'Hawaii-4' genome, yielding a final scaffold N50 length of ~29.6 Mb. Furthermore, comparative analyses uncovered previously identified mutations associated with fruit color, runnerless and perpetual flowering phenotypes. Conclusions: We anticipate that this new genomic resource will be utilized to uncover the underlying genetics of many important traits for strawberry.

Strawberry (Fragaria sp.) is emerging as an important model system for the family Rosaceae, which includes several fruit crop species (e.g. cherries, peaches, and strawberries). Fragaria research has largely utilized the woodland strawberry species, Fragaria vesca, due to its small size, short generation time, and ease of transformation. In addition to its ease of use, F. vesca is also the closest extant relative of the diploid progenitor for the dominant subgenome in the allo-octoploid cultivated garden strawberry (F. ⨉ ananassa), making it a useful organism for studying polyploidy in cultivated strawberry [1–3]. The F. vesca accession ‘Hawaii-4’ has been widely used in previous genetic studies. The genome was first sequenced in 2011 using short read sequencing [3] and then improved in 2014 using linkage maps [4]. A chromosome-scale version of the ‘Hawaii-4’ genome was assembled using a combination of long read and short read sequencing data in 2018 [5] and a revised annotation was released in 2019 [6]. However, this single accession fails to capture the agriculturally valuable genetic and phenotypic diversity of F. vesca [1,3,4]. The diverse accessions of F. vesca exhibit a wide range of phenotypes, including several traits relevant to breeding programs such as flowering time, fruit color, and runner production.

In strawberry, the axillary meristem can develop into either daughter-plant producing runners or into fruit-bearing shoots [7-9]. Strawberry cultivars are maintained and grown from these runners rather than from seeds, as runners allow for easy clonal propagation [7-10]. However, runnering is also associated with a decrease in fruit yield as increased runner production reduces the number of fruit-bearing shoots [7,11,12]. Therefore, while the fruit is the consumer product, runners are also essential for strawberry propagation. Beyond influencing fruit yields, runners can also grow quickly and abundantly, resulting in the need for frequent trimming for plant maintenance. With these considerations, better understanding genetic factors influencing the switch between inflorescence and runner growth would be a valuable resource for the strawberry community.

F. vesca accessions range from completely runnerless to extremely high runnering, allowing for further investigation into this important trait in strawberry [12,13]. Here we present a high quality genome of an F. vesca accession, CFRA 2339, which phenotypically differs from Hawaii-4 in two important traits: fruit color and runner production (Figure 1A and 1B). Hawaii-4 has yellow fruit and produces runners while CFRA 2339 produces red fruit and is runnerless [12–15]. Both accessions are perpetual flowering instead of the seasonal flowering displayed by other accessions [16]. This genome will serve as a valuable new resource for the strawberry and the larger Rosaceae community, allowing for the identification of the genetics controlling runner production as well as fruit color in a diploid model organism [13,14].

The aim of this paper is to expand the resources of the model species F. vesca by providing a second high quality genome from an accession with different phenotypic traits compared to the current high quality F. vesca Hawaii-4 genome [5]. To accomplish this, we combined long-read Oxford Nanopore sequencing and high coverage short read Illumina sequencing. We generated ~2.3 million Nanopore reads collectively, totalling over 30 Gb, providing >120x coverage for the CFRA 2339 genome, and exhibiting an N50 length of 34.1kb and a maximum length of 311kb. The raw Nanopore reads were then corrected and assembled using the Canu assembler [17]. Fragaria vesca has seven chromosomes. Chromosomes 1, 5, and 7 were captured by a single contig, 2, 4 and 6 were split between two contigs, and chromosome 3 was split among three contigs, thus requiring minimal further scaffolding (Supplemental Figure 1; Supplemental Table 1). RagTag was then used to correct misassemblies and merge scaffolds into pseudomolecules using Hawaii-4 as a reference and then polished with two rounds of Pilon with over 35.5 Gb Illumina data [18]. The final assembly spanned 229.5Mb across 311 contigs with an N50 length of 29.6 Mb (Figure 1C; Supplemental Table 2). Seven pseudomolecules were obtained for the CFRA 2339 genome (Figure 1C).
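The contig and scaffold N50 values quoted above follow the usual definition: the length L such that sequences of length L or longer together hold at least half of the assembled bases. The sketch below illustrates that calculation in Python. The function name `n50` and the toy lengths are illustrative assumptions, not the CFRA 2339 contig lengths or the authors' tooling.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy lengths (in Mb), not the real assembly.
toy_lengths = [30, 25, 20, 10, 10, 5]
print(n50(toy_lengths))
```

Applied to the sequence lengths of the released FASTA file, the same function should give a value comparable to the reported N50, provided the same set of sequences is counted.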

The heterozygosity of CFRA 2339 was estimated using Illumina genomic data and Jellyfish [19] with K=31, and the histogram file was processed with GenomeScope [20]. The results suggest that CFRA 2339 has a heterozygosity level of roughly 0.096% (Supplemental Figure 2). There is an additional ~10.2 Mb of unanchored sequences not present in the seven pseudomolecules of CFRA 2339, which is equivalent to ~4.6% of anchored sequences. Given the syntenic coverage across the Hawaii-4 genome (Figure 1C), some subset of these unanchored sequences is likely to represent haplotype variants. This needs to be further investigated in future studies, including with other current and future long read sequencing and assembly approaches. Furthermore, there is similar coverage across the putative centromere and ribosomal DNA regions, and this coverage may also extend to the telomeric ends of the chromosomes, as previously reported for the Hawaii-4 genome [5,21].
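As a quick consistency check on the figures above, the unanchored fraction and the Nanopore read depth can be recomputed from the reported totals. This is a back-of-the-envelope sketch. It assumes that the 229.5 Mb assembly span includes the ~10.2 Mb of unanchored sequence, and it treats "over 30 Gb" as a lower bound of exactly 30 Gb.

```python
assembly_mb = 229.5    # final assembly span reported in the text
unanchored_mb = 10.2   # sequence outside the seven pseudomolecules
nanopore_gb = 30.0     # "over 30 Gb" of Nanopore reads, used as a lower bound

# Assumption: the 229.5 Mb total includes the unanchored contigs.
anchored_mb = assembly_mb - unanchored_mb
unanchored_pct = 100 * unanchored_mb / anchored_mb
coverage = nanopore_gb * 1000 / assembly_mb

print(f"anchored: {anchored_mb:.1f} Mb")
print(f"unanchored relative to anchored: {unanchored_pct:.2f}%")
print(f"Nanopore coverage: at least {coverage:.0f}x")
```

Under these assumptions the unanchored fraction lands close to the ~4.6% quoted above, and the read depth exceeds the >120x stated for the Nanopore data.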

The genome was annotated using the MAKER annotation pipeline using gene evidence from a broad SRA F. vesca dataset, RNAseq datasets generated from CFRA 2339 tissues, as well as gene and protein evidence from the F. vesca Hawaii-4 v4 annotation and the UniprotKB database [22,23]. We identified 30,349 gene models with 64% having a known Pfam domain (Supplemental Table 3). The overall number of annotated genes is similar to the Hawaii-4 genome annotation (Supplemental Table 3). The most recent version of the Hawaii-4 genome annotated 34,009 genes [6]. The Benchmarking Universal Single-Copy Orthologs (BUSCO) with the eudicot database (eudicot_odb10) was used to estimate the completeness of the genome assembly and the CFRA 2339 genome annotation quality [24] (Supplemental Table 3). The genome was found to have 96% of the core genes in the BUSCO eudicots dataset, supporting a high quality genome assembly and annotation. Transposable elements (TEs) were annotated using the Extensive de novo TE Annotator (EDTA) [25] (Supplemental Table 4). EDTA found that TEs comprise ~29.7% of the F. vesca CFRA 2339 genome, with long-terminal-repeat retrotransposons (LTR-RT) being the most abundant and accounting for ~16% of the overall genome. The amount of annotated TEs in the CFRA 2339 genome is similar to the ~29.3% of TE content previously annotated in the Hawaii-4 genome [5] (Supplemental Table 5).

Based on comparative genomic analyses (Figure 1C), there is consistent coverage of the CFRA 2339 genome across the Hawaii-4 genome. We were able to identify syntelogs for roughly 94.7% of genes or all but 1501 genes encoded on the seven pseudomolecules of Hawaii-4. We investigated these 1501 missing genes and found that the vast majority (1166 total) had syntenic flanking genes. In other words, these missing genes are single loci with flanking genes present in both genomes. We also found 134 instances of two adjacent genes missing (268 genes total) and 13 instances of three adjacent genes missing (39 genes total). This suggests that these are likely either genes exhibiting presence-absence variation and/or were possibly not assembled or annotated. The remaining missing genes were one instance of four adjacent missing genes, two instances of five adjacent missing genes, one instance of six adjacent missing genes and one instance of eight adjacent missing genes. None of these missing gene blocks occurred at the boundaries of contigs. The amount of PAV identified between these two accessions is within the range previously reported for other species (e.g. [26-28]). The verification of these PAV loci is something that needs to be explored in greater detail as part of future strawberry pangenome studies. It's worth noting that missing gene models may be due to remnant sequencing errors from nanopore that were unable to be removed with Illumina polishing.
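The block counts reported above can be tallied to confirm that they account for all 1501 missing genes. The short Python check below uses only the numbers given in the paragraph. The dictionary layout is simply a convenient representation chosen for this sketch.

```python
# Block size (number of adjacent missing genes) -> number of such blocks,
# taken from the counts reported in the text.
missing_blocks = {1: 1166, 2: 134, 3: 13, 4: 1, 5: 2, 6: 1, 8: 1}

total_missing = sum(size * count for size, count in missing_blocks.items())
single_locus_share = missing_blocks[1] / total_missing

print(total_missing)
print(f"single-gene losses: {100 * single_locus_share:.1f}% of missing genes")
```

The tally reproduces the 1501 total, and single-gene losses with syntenic flanking genes make up a little over three quarters of it.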

The woodland strawberry Fragaria vesca exhibits natural variation for several important agronomic traits including fruit color, flowering, and runnering. Previous molecular genetic studies have shown that variation in each of these traits can be controlled by single loci. For example, a G-to-C SNP in the FveMYB10 coding region was responsible for the yellow fruit color in several F. vesca accessions [15] (Table 1; Supplemental Figure 3). Also, as an important hormone promoting runner formation, GA is synthesized by several enzymes including gibberellin 20-oxidase encoded by FveGA20ox genes; a 9-bp deletion in the FveGA20ox4 gene was found to be responsible for the runnerless phenotype of F. vesca [12] (Table 1; Figure 1). Caruana et al. found a nonsense mutation in the DELLA protein coded by FveRGA1, which was responsible for constitutive runnering even when the FveGA20ox4 gene is mutated [13]. Finally, some F. vesca varieties are perpetual flowering; this phenotype was shown to be caused by a 2-bp deletion in the first exon of FveTFL1, which codes for a repressor of flowering [16,29] (Table 1).

The CFRA 2339 accession produces red fruit, flowers perpetually, and is runnerless. We analyzed the sequences of MYB10 [15], GA20ox4 [12], TFL1 [16,29] and RGA1 [13] in the newly assembled genome of CFRA 2339 by aligning their CDS with the orthologs from the Hawaii-4 reference genome [5]. We found that CFRA 2339 has the MYB10 variant for red berries and carries the 2-bp mutation in the TFL1 gene responsible for perpetual flowering (Table 1). Further, it carries the 9-bp deletion in GA20ox4 together with the functional full-length variant of RGA1, which explains the runnerless phenotype (Table 1). The genotypes at these loci are consistent with the red fruit color, runnerless, and perpetual flowering phenotypes of CFRA 2339.

The genome described here for F. vesca CFRA 2339 will be a valuable new resource for the strawberry community. Furthermore, being runnerless, CFRA 2339 does not require the frequent trimming that is necessary for most other accessions, but can still be propagated easily by splitting and replanting the crown. The accession does share the perpetual flowering trait with Hawaii-4, leading to high flower and fruit production.

Homology Analysis

Homology of the Fragaria vesca CFRA 2339 genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr (Release 2018-05) database and less than 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
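The cutoff-and-best-hit logic described above can be sketched as follows. This is an illustration only: it assumes BLAST tabular output with the 12 standard columns (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The page does not state which output format was used or how ties were broken. The function name `best_hits`, the choice of bit score as the ranking key, and the toy rows are hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue):
    """Keep the highest bit-score hit per query among hits with E-value below max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= max_evalue:
            continue  # fails the "less than cutoff" rule
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical toy hits, not taken from the real reports.
example = "\n".join([
    "geneA\tsubj1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350",
    "geneA\tsubj2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-12\t80",
    "geneB\tsubj3\t35.0\t90\t58\t1\t1\t90\t3\t92\t1e-7\t45",
])

print(best_hits(example, 1e-9))  # stricter cutoff, as used for NCBI nr
print(best_hits(example, 1e-6))  # looser cutoff, as used for the other databases
```

With the stricter cutoff the weak geneB hit is dropped, which mirrors how a protein can appear in a "without homologs" file for one database and still have a hit in another.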

 

Protein Homologs

Fragaria vesca CFRA 2339 v1.0 proteins with NCBI nr homologs (EXCEL file) fvesca2339_v1.0_vs_nr.xlsx.gz
Fragaria vesca CFRA 2339 v1.0 proteins with NCBI nr (FASTA file) fvesca2339_v1.0_vs_nr_hit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins without NCBI nr (FASTA file) fvesca2339_v1.0_vs_nr_noHit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins with Arabidopsis (Araport11) homologs (EXCEL file) fvesca2339_v1.0_vs_arabidopsis.xlsx.gz
Fragaria vesca CFRA 2339 v1.0 proteins with Arabidopsis (Araport11) (FASTA file) fvesca2339_v1.0_vs_arabidopsis_hit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins without Arabidopsis (Araport11) (FASTA file) fvesca2339_v1.0_vs_arabidopsis_noHit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins with SwissProt homologs (EXCEL file) fvesca2339_v1.0_vs_swissprot.xlsx.gz
Fragaria vesca CFRA 2339 v1.0 proteins with SwissProt (FASTA file) fvesca2339_v1.0_vs_swissprot_hit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins without SwissProt (FASTA file) fvesca2339_v1.0_vs_swissprot_noHit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins with TrEMBL homologs (EXCEL file) fvesca2339_v1.0_vs_trembl.xlsx.gz
Fragaria vesca CFRA 2339 v1.0 proteins with TrEMBL (FASTA file) fvesca2339_v1.0_vs_trembl_hit.fasta.gz
Fragaria vesca CFRA 2339 v1.0 proteins without TrEMBL (FASTA file) fvesca2339_v1.0_vs_trembl_noHit.fasta.gz

 

Assembly

The Fragaria vesca CFRA 2339 Genome v1.0 assembly file is available in FASTA format.

Downloads

Chromosomes (FASTA file) fvesca2339_v1.0.fasta.gz
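Once downloaded, the assembly FASTA can be summarised with a few lines of standard-library Python. The sketch below is a minimal example rather than a recommended pipeline. The function name `fasta_lengths` and the toy records are assumptions, and the sequence names in the real file are not shown on this page.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA text stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Toy records with hypothetical sequence names.
toy = io.StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGG\n")
print(fasta_lengths(toy))

# For the real download, something along these lines should work:
# import gzip
# with gzip.open("fvesca2339_v1.0.fasta.gz", "rt") as handle:
#     lengths = fasta_lengths(handle)
#     print(len(lengths), sum(lengths.values()))
```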

 

Gene Predictions

The Fragaria vesca CFRA 2339 v1.0 genome gene prediction files are available in FASTA and GFF3 formats.

Downloads

Protein sequences  (FASTA file) fvesca2339_v1.0.proteins.fasta.gz
CDS  (FASTA file) fvesca2339_v1.0.cds.fasta.gz
Genes (GFF3 file) fvesca2339_v1.0.genes.gff3.gz
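A simple way to inspect the GFF3 file is to count feature types in its third column. The sketch below does this with the standard library. It assumes that gene models are stored under the feature type `gene`, which is conventional for GFF3 but not confirmed on this page, and the toy records and IDs are hypothetical.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count GFF3 feature types (column 3), skipping comments and malformed lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # GFF3 records have exactly nine tab-separated columns
        counts[columns[2]] += 1
    return counts


# Hypothetical toy records, not taken from the real annotation.
toy_gff3 = "\n".join([
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tdemo\tCDS\t150\t850\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr2\tdemo\tgene\t50\t400\t.\t-\t.\tID=gene2",
]) + "\n"

counts = count_feature_types(io.StringIO(toy_gff3))
print(counts["gene"], counts["mRNA"], counts["CDS"])
```

Run on the decompressed fvesca2339_v1.0.genes.gff3 file, the gene count can be compared with the 30,349 gene models reported in the publication.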

 

Functional Analysis

Functional annotations for the Fragaria vesca CFRA 2339 Genome v1.0 are available for download below. The Fragaria vesca CFRA 2339 Genome v1.0 proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan fvesca2339_v1.0_genes2GO.xlsx.gz
IPR assignments from InterProScan fvesca2339_v1.0_genes2IPR.xlsx.gz
Proteins mapped to KEGG Pathways fvesca2339_v1.0_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Orthologs fvesca2339_v1.0_KEGG-pathways.xlsx.gz

 

Transcript Alignments
Transcript alignments were performed by the GDR Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the Fragaria vesca CFRA 2339 genome assembly. Alignments with an alignment length of 97% and 97% identity were preserved. The available files are in GFF3 format.
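The retention rule can be illustrated with a small filter. This is a sketch under explicit assumptions: the page does not define how alignment length and identity were computed. Here coverage is taken as aligned length divided by transcript length, identity as matching bases divided by aligned length, and both thresholds are treated as "at least 97%". The function and parameter names are hypothetical.

```python
def passes_filter(matches, aligned_length, transcript_length, min_fraction=0.97):
    """Return True if an alignment meets both the coverage and identity thresholds."""
    coverage = aligned_length / transcript_length  # assumed definition of "alignment length"
    identity = matches / aligned_length            # assumed definition of "identity"
    return coverage >= min_fraction and identity >= min_fraction


# Hypothetical alignments: (matches, aligned_length, transcript_length)
examples = [(980, 990, 1000), (900, 990, 1000), (950, 950, 1000)]
kept = [aln for aln in examples if passes_filter(*aln)]
print(kept)
```

Only the first toy alignment passes: the second fails on identity and the third on coverage.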

 

Fragaria x ananassa GDR RefTrans v1 Fragaria persica 2339_v1.0_f.x.ananassa_GDR_reftransV1
Malus_x_domestica GDR RefTrans v1 Fragaria persica 2339_v1.0_m.x.domestica_GDR_reftransV1
Prunus avium GDR RefTrans v1 Fragaria persica 2339_v1.0_p.avium_GDR_reftransV1
Prunus persica GDR RefTrans v1 Fragaria persica 2339_v1.0_p.persica_GDR_reftransV1
Pyrus GDR RefTrans v1 Fragaria persica 2339_v1.0_p.persica_GDR_reftransV1
Rosa GDR RefTrans v1 Fragaria persica 2339_v1.0_rosa_GDR_reftransV1
Rubus GDR RefTrans v2 Fragaria persica 2339_v1.0_rubus_GDR_reftransV2