Oct 14, 2024

Chromosome-level genome assemblies of Nicotiana tabacum, Nicotiana sylvestris, and Nicotiana tomentosiformis | Scientific Data

Scientific Data volume 11, Article number: 135 (2024)

The Solanaceae species Nicotiana tabacum, an economically important crop plant cultivated worldwide, is an allotetraploid species that appeared about 200,000 years ago as the result of the hybridization of diploid ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis. The previously published genome assemblies for these three species relied primarily on short reads, and the obtained pseudochromosomes only partially covered the genomes. In this study, we generated annotated de novo chromosome-level genomes of N. tabacum, N. sylvestris, and N. tomentosiformis, which contain 3.99 Gb, 2.32 Gb, and 1.74 Gb of sequence data, respectively, with 97.6%, 99.5%, and 95.9% anchored to chromosomes, and represent 99.2%, 98.3%, and 98.5% of the near-universal single-copy orthologous Solanaceae genes. The completion levels of these chromosome-level genomes for N. tabacum, N. sylvestris, and N. tomentosiformis are comparable to other reference Solanaceae genomes, enabling more efficient synteny-based cross-species research.

The Nicotiana genus belongs to the Solanaceae family, which also includes tomato (Solanum lycopersicum), potato (Solanum tuberosum), and eggplant (Solanum melongena) [1,2]. While most of the Solanaceae are diploids with 12 chromosome pairs, tobacco (Nicotiana tabacum L.) is an allotetraploid (2n = 4x = 48) resulting from a hybridization event that likely occurred in the Andes within the last 200,000 years between ancestors of Nicotiana sylvestris (S-genome; 2n = 2x = 24) and Nicotiana tomentosiformis (T-genome; 2n = 2x = 24) [3,4]. In addition to being a modern descendant of the N. tabacum maternal progenitor, N. sylvestris, which is nowadays largely cultivated as an ornamental plant, is also one of the closest descendants of the ancestral species from the Alatae/Sylvestres section that hybridized as the paternal donor with an ancestral species from the Noctiflorae/Petunioides section to give rise to the almost all-Australian clade of allopolyploid species constituting the Nicotiana section Suaveolentes [5].

Similar to other members of the Nicotiana genus, N. sylvestris, N. tomentosiformis, and N. tabacum produce a wide range of alkaloids that are known to be toxic to insects and are a well-established mechanism of defense against herbivores [6]. While N. sylvestris accumulates similar amounts of alkaloids in roots and leaves (3.5 mg/g in roots and 2.1 mg/g in leaves), N. tomentosiformis accumulates more alkaloids in roots (8.8 mg/g in roots and 0.6 mg/g in leaves), and N. tabacum has more in leaves (1.3 mg/g in roots and 12.5 mg/g in leaves) [7]. The composition of the accumulated alkaloids varies between the three species, with N. tabacum benefiting from both of its progenitors’ genetic and regulatory contributions. In N. sylvestris roots, 87% of the alkaloids is nicotine, 11% is anatabine, and 1.9% is anabasine, while in leaves, 100% of the alkaloids is nicotine. In N. tomentosiformis roots, 56% of the alkaloids is nornicotine, 28% is anatabine, 14% is nicotine, 1.6% is anabasine, and 0.57% is cotinine, while in leaves, 73% of the alkaloids is nicotine and 27% is nornicotine. In N. tabacum roots, 87% of the alkaloids is nicotine, and 13% is nornicotine, while in leaves, 92% of the alkaloids is nicotine, 5.1% is nornicotine, and 2.6% is anatabine [7].

The Nicotiana genus is also a rich source of terpenoids, which play a significant role as attractants for several pollinator insects. In N. tabacum, both cembranoid and labdanoid diterpenoids are synthesized in the trichome glands, whereas N. sylvestris produces predominantly cembranoid diterpenoids and N. tomentosiformis predominantly labdanoid diterpenoids [8].

Although several Nicotiana species genomes have been published in the last decade, including those of N. sylvestris [9], N. tomentosiformis [9], and N. tabacum [10,11], these genomes are primarily based on the assembly of second-generation sequencing data and therefore suffer from substantial fragmentation, resulting in only partial anchoring to chromosomes.

In the present study, we integrated Illumina short-read sequencing (Illumina, San Diego, CA, USA) with third-generation Oxford Nanopore long-read sequencing and Oxford Nanopore chromosome conformation capture (PoreC) technology (Oxford Nanopore Technologies, Oxford, UK) to generate high-quality chromosome-level reference genomes for N. tabacum, N. sylvestris, and N. tomentosiformis. These new resources will broaden our understanding of the contributions of both N. tabacum progenitors to the genes and the pathways of tobacco and enable more efficient synteny-based cross-species Solanaceae research.

Young leaves from N. tabacum L. cultivar K326 (PVY resistant derived from USDA ARS GRIN Global NPGS: PI 552505), N. sylvestris Speg. TW136 (USDA ARS GRIN Global NPGS: PI 555569) and N. tomentosiformis Goodsp. TW142 (USDA ARS GRIN Global NPGS: PI 555572) were snap-frozen with liquid nitrogen and finely ground in a mortar. High molecular weight genomic DNA for long-read sequencing was extracted using the Promega Wizard HMW DNA Extraction Kit (Promega AG, Madison, WI, USA).

Short genomic DNA fragments were removed using Circulomics short-read eliminator kits from PacBio (PacBio, Menlo Park, CA, USA), and long-read sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 139 Gb of raw data were collected for N. tabacum, 159 Gb for N. sylvestris, and 76 Gb for N. tomentosiformis.

To conduct chromosome-level assembly, frozen leaves were cut into one square centimeter pieces and treated with formaldehyde to fix the DNA. The fixed genomic DNA was then digested overnight using the NlaIII restriction enzyme, and the 3′ overhangs were re-ligated using T4 ligase before extraction. PoreC sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 40 Gb of raw data were collected for N. tabacum, 66 Gb for N. sylvestris, and 63 Gb for N. tomentosiformis.

To polish and validate the assembled genomes, Illumina short-reads were prepared for N. tabacum using Tecan Celero EZ DNA-Seq Library Preparation Kits (Tecan, Männedorf, Switzerland) and sequenced as 2 × 151 bp paired-end reads on an Illumina NovaSeq 6000 to generate a total of 139 Gb. Illumina short-reads from ERR274527 [12] and ERR274528 [13] for N. sylvestris and from ERR274540 [14] and ERR274542 [15] for N. tomentosiformis were retrieved from the Sequence Read Archive.

For N. tabacum, Oxford Nanopore basecalling was performed with Guppy 6.3.7 using the plant super model. Long-read sequences were filtered using seqkit [16] 2.2.0 to remove short reads (length <5000) and low-quality reads (average qscore <9), resulting in 98 Gb (N50 length: 28.5 kb).

For N. sylvestris and N. tomentosiformis, Oxford Nanopore basecalling was performed with Guppy 6.1.1 using the plant super model. Long-read sequences were filtered using seqkit [16] 2.2.0 to remove short reads (length <2500) and low-quality reads (average qscore <9), resulting in 108 Gb (N50 length: 25.9 kb) and 41 Gb (N50 length: 28.2 kb) for N. sylvestris and N. tomentosiformis, respectively.
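The length and quality thresholds described above can be illustrated with a minimal Python sketch. This is not the seqkit implementation: the function names are hypothetical, and the sketch assumes that the average qscore is derived from the mean per-base error probability, which may differ from how a given tool averages qualities.

```python
import math

def mean_qscore(phred_scores):
    """Average Phred quality computed from the mean per-base error probability (assumption)."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)

def keep_read(sequence, phred_scores, min_length, min_qscore=9):
    """Return True when a read is neither too short nor of too low quality."""
    return len(sequence) >= min_length and mean_qscore(phred_scores) >= min_qscore
```

With these definitions, min_length would be 5000 for N. tabacum and 2500 for the two diploid species, following the thresholds given in the text.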

Genomes were assembled with flye [17] 2.9.1 using the nano-hq input pre-set and a read error rate of 0.03.

The Illumina short-reads were processed for each species using fastp [18] 0.23.2 to trim adapters and low-quality bases, merge pairs, and remove low-complexity and short (length <75) reads. During processing, the reads were split into two sets: one for assembly polishing, which contained 80% of the processed Illumina reads, and one for assembly validation, which contained the remaining 20%.
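How the reads were divided between the two sets is not specified in the text. As one possible approach, the sketch below assigns each read (or read pair) to a subset by hashing its identifier, which gives a reproducible split of approximately 80% and 20%. The function name and the hashing strategy are assumptions made for illustration only.

```python
import hashlib

def assign_subset(read_id, polishing_fraction=0.8):
    """Deterministically assign a read (pair) to 'polishing' or 'validation'."""
    digest = hashlib.sha256(read_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "polishing" if value < polishing_fraction else "validation"
```

Because the assignment depends only on the read identifier, both mates of a pair land in the same subset as long as they share that identifier.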

The assembled genomes were polished with processed Illumina short-reads using fmlrc2 [19] 0.1.7. The remaining haplotig sequences were removed from the assemblies using purge_dups [20] 1.2.6, with cut-offs set to 3, 8, and 1000 for N. tabacum, to 5, 10, and 1000 for N. sylvestris, and to 2, 3, and 1000 for N. tomentosiformis.

Illumina short-reads were mapped to the assembly contigs using minimap2 [21,22] 2.24, duplicates were marked with samblaster [23] 0.1.26, and the alignments were filtered using samtools [24] 1.15.1. The coverage of the assembly contigs by Illumina sequencing was then calculated using samtools [24] 1.15.1, and contigs for which less than 70% of the length had a coverage of at least 5 (N. tabacum) or 15 (N. sylvestris and N. tomentosiformis) were removed.
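The coverage criterion can be sketched as follows, assuming that a list of per-base depths is available for each contig (for example, parsed from a depth table). The function names are hypothetical and the code only illustrates the rule stated in the text.

```python
def fraction_covered(depths, min_depth):
    """Fraction of contig positions whose depth is at least min_depth."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)

def keep_contig(depths, min_depth, min_fraction=0.7):
    """Keep a contig when at least min_fraction of its length reaches min_depth."""
    return fraction_covered(depths, min_depth) >= min_fraction
```

Here min_depth would be 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis.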

Because the biological material used for sequencing originated from inbred plants that can be considered homozygotes, variants were called using freebayes [25] 1.3.6 with the ploidy parameter set to 1 and ignoring sites with coverage higher than 200, and were filtered with vcflib [26] 1.0.3 vcffilter using the parameters --filter-sites–info --filter “QUAL >20 & QUAL/AO >10 & SAF >0 & SAR >0 & RPL >1 & RPR >1”. Variants were then applied to the genomes using bcftools [24] 1.15.1 consensus to generate the polished assembly contigs.
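To make the filter expression easier to read, the sketch below evaluates the same conditions on a single variant record. It assumes a record with one alternate allele whose INFO fields have already been parsed into numbers; the function name and the guard against AO = 0 are additions for illustration and are not part of vcffilter.

```python
def passes_filter(qual, info):
    """Evaluate QUAL > 20 & QUAL/AO > 10 & SAF > 0 & SAR > 0 & RPL > 1 & RPR > 1."""
    ao = info["AO"]  # number of alternate-allele observations
    return (
        qual > 20
        and ao > 0                # guard added here to avoid division by zero
        and qual / ao > 10        # quality per supporting observation
        and info["SAF"] > 0       # alternate observations on the forward strand
        and info["SAR"] > 0       # alternate observations on the reverse strand
        and info["RPL"] > 1       # supporting reads placed to the left
        and info["RPR"] > 1       # supporting reads placed to the right
    )
```

The strand and placement conditions require support from both orientations and from both sides of the variant, which is a common way to discard calls backed by one-sided evidence.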

Assembly contigs from plastid and mitochondrion were removed by mapping the polished assembly contigs to the N. tabacum plastid and mitochondrion sequences (NC_001879.2 [27] and NC_006581.1 [28], respectively) using minimap2 [21,22] 2.24 and filtering out contigs that mapped over more than 50% of their length.
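The 50% criterion can be sketched by merging the aligned intervals of a contig so that overlapping alignments are counted only once. The sketch assumes half-open (start, end) intervals in contig coordinates; the function names are hypothetical.

```python
def merged_length(intervals):
    """Total length covered by half-open (start, end) intervals, overlaps counted once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

def is_organelle_contig(contig_length, aligned_intervals, max_fraction=0.5):
    """Flag a contig whose organelle-aligned fraction exceeds max_fraction."""
    return merged_length(aligned_intervals) / contig_length > max_fraction
```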

Assembly contigs from possible contamination were identified using kraken2 [29] 2.1.2 with the k2_pluspfp_20220908 database [30] and removed by only retaining contigs identified as belonging to Nicotiana or Solanum species.

PoreC reads were mapped to the cleaned assembly contigs using minimap2 [21,22] 2.24. Alignments with a mapping quality lower than 60 for N. tabacum and 30 for N. sylvestris and N. tomentosiformis were discarded, and contact pairs were created from the remaining alignments. The positions on the contigs of each contact pair were recorded as two consecutive lines in a BED file. The scaffolding of the contigs to a chromosome-level assembly was performed using yahs [31] 1.2a1. Contact maps were prepared using PretextMap [32] 0.1.9, manually curated and annotated in PretextView [33] 0.2.5, and the resulting scaffolds were exported as chromosome-level sequences.
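The conversion of multi-way PoreC alignments into pairwise contacts can be sketched as below. The text does not state how pairs were formed; this example assumes that every pairwise combination of the retained alignments of one read is emitted, and it uses a hypothetical function name and a simple five-column BED layout.

```python
from itertools import combinations

def contact_pair_bed_lines(read_name, alignments, min_mapq):
    """Return BED lines in which every two consecutive lines form one contact pair.

    alignments: list of (contig, start, end, mapq) tuples for one PoreC read.
    """
    # Discard alignments whose mapping quality is below the threshold.
    kept = [a for a in alignments if a[3] >= min_mapq]
    lines = []
    # Assumption: all pairwise combinations of retained alignments are contacts.
    for first, second in combinations(kept, 2):
        for contig, start, end, mapq in (first, second):
            lines.append(f"{contig}\t{start}\t{end}\t{read_name}\t{mapq}")
    return lines
```

A read with three retained alignments therefore yields three contact pairs, written as six BED lines.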

To name and orient the N. tabacum chromosome-level sequences, the PT markers, mapped to the sequences using hisat2 [34] 2.2.1, and the tobacco genetic map [35] were used. Similarly, the N. tomentosiformis chromosome-level sequences were named and oriented using the N genetic map [36] combined with the tobacco PT markers [35]. The chromosome-level assembly of the N. tomentosiformis genome was then used as a reference to name and orient the N. sylvestris chromosome-level sequences based on minimap2 [21,22] 2.24 mapping (Fig. 1).

Fig. 1: PoreC contact maps. Intra-chromosomal and inter-chromosomal contacts are shown for the Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum genome assemblies. The black bottom and right edges correspond to unplaced sequences.

The proportion of the assembly anchored to chromosomes reached 99.5%, 95.9%, and 97.6% of the total assembly lengths for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively (Table 1).

When compared to the previously available N. tabacum genome assembly [11], which was generated from short-read sequencing, whole genome profiling, and optical and genetic mapping data, the new N. tabacum genome assembly has fewer contigs (decrease from 1,257,801 to 1,410) with a larger N50 length (increase from 9.1 kb to 11.8 Mb), and the proportion of the assembly anchored to chromosomes consequently improved from 64% to 97.6%.
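For readers less familiar with the N50 statistic used in this comparison, a minimal sketch of its usual definition follows: the length of the shortest sequence in the smallest set of longest sequences that together contain at least half of the total assembly length. The function name is hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```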

Nested retrotransposons were annotated by iteratively running genometools 1.6.2 ltrharvest [37] using the parameters -similar 70 -seed 20 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 3 -vic 10 -overlaps best, retaining the predictions matching the RepeatExplorer Viridiplantae 3.0 dataset [38] using diamond [39] 2.1.6 blastx with the parameters --max-target-seqs 1 --ultra-sensitive --frameshift 15, and excising them from the assembly using samtools [24] 1.17. At most, 20 prediction-filtering-excision iterations were performed.

The predicted retrotransposons were classified by their homology to the RepeatExplorer Viridiplantae 3.0 dataset [38] sequences. Their age (T) was estimated under the assumption that their long terminal repeats (LTRs) were identical at the time of insertion by aligning their 3′ and 5′ LTRs using clustalo [40,41] 1.2.4, calculating their divergence (K) using the Kimura-2-parameter distance, and dividing it by twice the substitution rate (r) of 1.5 × 10^-8 substitutions per site per year [42], that is, T = K / (2r).
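The age estimate can be illustrated with a short sketch that computes the Kimura 2-parameter distance, K = -0.5 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions, and then divides it by 2r. The code assumes two already aligned LTR sequences of equal length and skips gapped or ambiguous positions; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(ltr5, ltr3):
    """Kimura 2-parameter distance between two aligned LTR sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions in the alignment")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the divergence is too high for the model.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def insertion_age_years(k, rate=1.5e-8):
    """Insertion age T = K / (2r), with r in substitutions per site per year."""
    return k / (2 * rate)
```

For example, a divergence of K = 0.03 corresponds to an estimated insertion age of about one million years at the rate used in the text.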

The predicted retrotransposons covered 26.6%, 32.2%, and 29.3% of the N. sylvestris, N. tomentosiformis, and N. tabacum genomes, respectively (Table 2). Regardless of the species, the most frequent element subclass is Ty3/gypsy|chromovirus|Tekay, representing between 40% and 56% of the total predicted retrotransposon length. The only element subclass that shows a marked difference between the three species is Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre, which covers 116,167,517 bp (18.8% of the total predicted retrotransposon length) in N. sylvestris, and only 21,672,795 bp (3.9%) in N. tomentosiformis. In N. tabacum, it covers 135,653,424 bp (11.6%), close to the sum of its coverage in the two progenitor species (137,840,312 bp). The predicted insertion ages indicate a recent expansion of the Alesia and Angela subclasses of Ty1/copia and of the Ogre subclass of Ty3/gypsy retrotransposons in N. sylvestris and N. tabacum, but not in N. tomentosiformis (Fig. 2).

Fig. 2: Predicted retrotransposon insertion ages. (a) Predicted insertion ages in millions of years for retrotransposons of the Ty1/copia superfamily; (b) Predicted insertion ages in millions of years for retrotransposons of the Ty3/gypsy superfamily.

Genomes were masked using blast [43,44] 2.14.0 windowmasker with dusting, and augustus [45] 3.5.0 was used for gene prediction. A training dataset was created by separately mapping S. lycopersicum, S. tuberosum, and Nicotiana attenuata cDNA and CDS from Ensembl 56 using minimap2 [21,22] 2.26 to the N. sylvestris and N. tomentosiformis genomes. Any sequence with an annotation matching ‘hypothetical’, ‘unknown’, ‘polyprotein’, ‘domain-containing’, ‘chloroplast’, or ‘mitochondria’ was omitted from the mapping. Gene models were constructed from the mapped sequences using bedtools [46] 2.30.0 and filtered using gffread [47] 0.12.7 with the parameters -V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs. Training sequences were then extracted from the genomes using the obtained GFF annotation file and adding 1,000 bp flanking regions. One-fourth of the gene models were set aside for testing for each combination of species and dataset. After merging the training and testing datasets, a Nicotiana model was trained using the etraining and optimize_augustus.pl programs bundled with augustus [45] 3.5.0. A total of 10,092 loci were used for training, and 3,362 loci were used for testing.

To provide hints for the augustus predictions, Ensembl 56 proteins from S. lycopersicum, S. tuberosum, and N. attenuata were mapped to the genomes using miniprot [48] 0.11, and aletsch [49] 1.0.3 was used to construct transcripts from Illumina paired-end RNA-Seq reads from SRR11912457 [50], SRR2106531 [51], ERR274387 [52], ERR274388 [53], ERR274389 [54], ERR274390 [55], ERR274391 [56], ERR274392 [57], ERR274393 [58], ERR274394 [59], ERR274395 [60], ERR274396 [61], ERR274397 [62], ERR274398 [63], ERR274399 [64], ERR274400 [65], ERR274401 [66], ERR274402 [67], ERR274403 [68], ERR274404 [69], and ERR274405 [70] mapped using hisat2 [34] 2.2.1, and Oxford Nanopore long cDNA reads from SRR12045991 [71], SRR12045992 [72], SRR12045993 [73], and SRR12045994 [74] mapped with minimap2 [21,22] 2.26.

Augustus [45] 3.5.0 predictions were obtained using the trained Nicotiana model, the extrinsic.MPE.cfg extrinsic configuration file, and hints derived from the miniprot [48] 0.11 and aletsch [49] 1.0.3 output with priorities of 4 and 3, respectively. Other augustus [45] 3.5.0 parameters used were --alternatives-from-evidence=off --alternatives-from-sampling=off --softmasking=1 --strand=both --genemodel=complete --UTR=on. Predicted gene models without supporting hints that did not encode a protein found in a uniprot eudicotyledons proteins dataset filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’, or ‘At.g’ when using diamond [39] 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15 were removed.

To complement the augustus predictions, additional gene models were created by separately mapping the predicted N. sylvestris, N. tomentosiformis, and N. tabacum cDNA and CDS and the S. lycopersicum, S. tuberosum, and N. attenuata cDNA and CDS from Ensembl 56 to the genomes using minimap2 [21,22] 2.26. Models that overlapped augustus predictions by 25% or more according to bedtools [46] 2.30.0 intersect were then filtered out by IDs using gffread [47] 0.12.7 with the parameters -P -M -K -Q -Y -Z -F, and the remaining gene models were added to those predicted with augustus [45] 3.5.0.
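The overlap criterion can be sketched as follows. The example assumes that the overlap is expressed as a fraction of the mapped model's length and that both features lie on the same sequence, since the text does not detail these points; the function names are hypothetical and this is not the bedtools implementation.

```python
def overlap_fraction(model, prediction):
    """Fraction of `model` (start, end), half-open, that is covered by `prediction`."""
    start = max(model[0], prediction[0])
    end = min(model[1], prediction[1])
    overlap = max(0, end - start)
    return overlap / (model[1] - model[0])

def keep_additional_model(model, predictions, max_overlap=0.25):
    """Keep a mapped model only if it overlaps every prediction by less than max_overlap."""
    return all(overlap_fraction(model, p) < max_overlap for p in predictions)
```

The same calculation, with a threshold of 0.75 and retrotransposon coordinates instead of augustus predictions, would express an exclusion rule based on overlap with retrotransposons.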

Functional annotation of the gene models was performed using diamond [39] 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15 and uniprot eudicotyledons proteins filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’ or ‘At.g’. Gene models overlapping with retrotransposons by 75% or more according to bedtools [46] 2.30.0 intersect and those with annotations matching ‘transposon’, ‘transposase’, ‘polyprotein’, ‘gagpol’, or ‘gag-pol’ were excluded to yield the final set of annotated gene models.

The genomes and annotations are available from Zenodo under records 8256252 [75], 8256254 [76], and 8256256 [77]. The trained Nicotiana model for augustus gene prediction is available from Zenodo under record 8256280 [78].

The genomes have been deposited at DDBJ/ENA/GenBank under the accessions ASAF00000000 [79], ASAG00000000 [80] and AWOJ00000000 [81].

Raw sequencing data are available from the National Center for Biotechnology Information Sequence Read Archive under accessions SRR25685126 [82], SRR25685127 [83], SRR25685128 [84], SRR25685129 [85], and SRR25685130 [86] in BioProject PRJNA182500, SRR25685034 [87], SRR25685035 [88], SRR25685036 [89], SRR25685037 [90], SRR25685038 [91], SRR25685039 [92], and SRR25685040 [93] in BioProject PRJNA182501, and SRR25685386 [94], SRR25685387 [95], SRR25685388 [96], SRR25685389 [97], SRR25685390 [98], SRR25685391 [99], SRR25685392 [100], SRR25685393 [101], SRR25685394 [102], SRR25685395 [103], and SRR25685396 [104] in BioProject PRJNA208210 for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively.

The quality and completeness of the assemblies were assessed with yak [105] 0.1 using the 20% of the processed Illumina short-reads that were set aside for that purpose. For N. tabacum, a Quality Coverage of 0.982 and a Quality Value of 38.1 were obtained; for N. sylvestris, the values were 0.993 and 41.5; and for N. tomentosiformis, they were 0.991 and 43.2.
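To help interpret these Quality Values, the sketch below converts between a Phred-scaled value and the corresponding per-base error rate, assuming the usual relation QV = -10 log10(error rate). The function names are hypothetical.

```python
import math

def qv_to_error_rate(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate):
    """Phred-scaled quality value implied by a per-base error rate."""
    return -10 * math.log10(error_rate)
```

Under this relation, a Quality Value above 30 corresponds to fewer than one expected error per 1,000 bases, and a value above 40 to fewer than one per 10,000 bases.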

The quality of the gene predictions from the trained Nicotiana model was evaluated using the prepared testing sets and compared with results obtained using the already available arabidopsis, tomato, and coyote_tobacco models (Table 3).

The completeness of the gene model sets was evaluated using BUSCO [106] 5.4.7 with the solanales_odb10 lineage dataset. Completeness of 98.1%, 95.1%, and 96.1% at the transcript level and of 97.0%, 92.8%, and 93.4% at the protein level was obtained for N. tabacum, N. sylvestris, and N. tomentosiformis, respectively (Table 4). These values are similar to those obtained for S. lycopersicum, of 95.0% at the transcript level and 92.3% at the protein level.

All software used in this work is publicly available, with versions and parameters clearly described in Methods. If no detailed parameters were mentioned for a tool, the default parameters suggested by the developer were used. No custom code was used during this study for the curation and/or validation of the datasets.

Knapp, S., Bohs, L., Nee, M. & Spooner, D. M. Solanaceae—A model for linking genomics with biodiversity. Comp. Funct. Genomics 5, 285–291 (2004).

Olmstead, R. G. et al. A molecular phylogeny of the Solanaceae. Taxon 57, 1159–1181 (2008).

Clarkson, J. J. et al. Phylogenetic relationships in Nicotiana (Solanaceae) inferred from multiple plastid DNA regions. Mol. Phylogenet. Evol. 33, 75–90 (2004).

Clarkson, J. J. et al. Long‐term genome diploidization in allopolyploid Nicotiana section Repandae (Solanaceae). New Phytol. 168, 241–252 (2005).

D’Andrea, L. et al. Polyploid Nicotiana section Suaveolentes originated by hybridization of two ancestral Nicotiana clades. Front. Plant Sci. 14 (2023).

Baldwin, I. T. Inducible Nicotine Production in Native Nicotiana as an Example of Adaptive Phenotypic Plasticity. J. Chem. Ecol. 25, 3–30 (1999).

Kaminski, K. P. et al. Alkaloid chemophenetics and transcriptomics of the Nicotiana genus. Phytochemistry 177, 112424 (2020).

Tissier, A. Trichome Specific Expression: Promoters and Their Applications. in Transgenic Plants - Advances and Limitations (InTech, 2012).

Sierro, N. et al. Reference genomes and transcriptomes of Nicotiana sylvestris and Nicotiana tomentosiformis. Genome Biol. 14, R60 (2013).

Sierro, N. et al. The tobacco genome sequence and its comparison with those of tomato and potato. Nat. Commun. 5, (2014).

Edwards, K. D. et al. A reference genome for Nicotiana tabacum enables map-based cloning of homeologous loci implicated in nitrogen utilization efficiency. BMC Genomics 18, (2017).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274527 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274528 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274540 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274542 (2013).

Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11, e0163962 (2016).

Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

Chen, S. Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. Imeta 2, (2023).

Mak, Q. X. C., Wick, R. R., Holt, J. M. & Wang, J. R. Polishing De Novo nanopore assemblies of bacteria and eukaryotes with FMLRC2. Mol. Biol. Evol. 40, (2023).

Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, (2018).

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).

Faust, G. G. & Hall, I. M. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics 30, 2503–2505 (2014).

Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, (2021).

Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing https://doi.org/10.48550/ARXIV.1207.3907 (2012).

Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S. & Prins, P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput. Biol. 18, e1009123 (2022).

NCBI Genome Project. Nicotiana tabacum plastid, complete genome. Nucleotide https://identifiers.org/nucleotide/NC_001879.2 (2000).

NCBI Genome Project. Nicotiana tabacum mitochondrion, complete genome. Nucleotide https://identifiers.org/nucleotide/NC_006581.1 (2004).

Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).

Langmead, B. Kraken 2, KrakenUniq and Bracken indexes https://benlangmead.github.io/aws-indexes/k2 (2022).

Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39, btac808 (2023).

High Performance Algorithms Group. The Wellcome Sanger Institute. Paired REad TEXTure Mapper https://github.com/wtsi-hpag/PretextMap (2022).

High Performance Algorithms Group. The Wellcome Sanger Institute. OpenGL Powered Pretext Contact Map Viewer https://github.com/wtsi-hpag/PretextView (2022).

Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).

Bindler, G. et al. A high density genetic map of tobacco (Nicotiana tabacum L.) obtained from large scale microsatellite marker development. Züchter Genet. Breed. Res. 123, 219–230 (2011).

Wu, F. & Tanksley, S. D. Chromosomal evolution in the plant family Solanaceae. BMC Genomics 11, 182 (2010).

Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, (2008).

Neumann, P., Novák, P., Hoštáková, N. & Macas, J. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob. DNA 10, (2019).

Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences: Clustal Omega for Many Protein Sequences. Protein Sci. 27, 135–145 (2018).

Sievers, F. et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, (2011).

Mokhtar, M. M., Alsamman, A. M. & El Allali, A. PlantLTRdb: An interactive database for 195 plant species LTR-retrotransposons. Front. Plant Sci. 14, (2023).

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 1–9 (2009).

Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).

Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).

Li, H. Protein-to-genome alignment with miniprot. Bioinformatics 39, btad014 (2023).

Shao, M. Assembler for multiple RNA-seq samples https://github.com/Shao-Group/aletsch (2020).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR11912457 (2020).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR2106531 (2016).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274387 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274388 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274389 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274390 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274391 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274392 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274393 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274394 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274395 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274396 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274397 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274398 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274399 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274400 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274401 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274402 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274403 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274404 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274405 (2013).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045991 (2021).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045992 (2021).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045993 (2021).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045994 (2021).

Sierro, N. Nicotiana sylvestris genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256252 (2023).

Sierro, N. Nicotiana tomentosiformis genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256254 (2023).

Sierro, N. Nicotiana tabacum genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256256 (2023).

Sierro, N. Nicotiana model for augustus gene prediction, Zenodo, https://doi.org/10.5281/zenodo.8256280 (2023).

Sierro, N. & Ivanov, N. V. Nicotiana sylvestris, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:ASAF00000000 (2023).

Sierro, N. & Ivanov, N. V. Nicotiana tomentosiformis, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:ASAG00000000 (2023).

Sierro, N. & Ivanov, N. V. Nicotiana tabacum cultivar K326, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:AWOJ00000000 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685126 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685127 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685128 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685129 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685130 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685034 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685035 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685036 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685037 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685038 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685039 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685040 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685386 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685387 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685388 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685389 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685390 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685391 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685392 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685393 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685394 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685395 (2023).

NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685396 (2023).

Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).

We thank Simon Goepfert and Nicolas Bakaher for scientific discussions, and Rebecca Higgins for manuscript editorial revision.

PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, CH-2000, Neuchâtel, Switzerland

Nicolas Sierro, Mehdi Auberson, Rémi Dulize & Nikolai V. Ivanov

N.S. and N.V.I. conceived this project; M.A. and R.D. performed the experiments; N.S. assembled the genomes, generated the annotation sets, and performed the data analysis; N.S. and N.V.I. wrote and revised the manuscript. All authors have read and approved the final manuscript.

Correspondence to Nicolas Sierro.

N.S., M.A., R.D., and N.V.I. are employees of Philip Morris International.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Sierro, N., Auberson, M., Dulize, R. et al. Chromosome-level genome assemblies of Nicotiana tabacum, Nicotiana sylvestris, and Nicotiana tomentosiformis. Sci Data 11, 135 (2024). https://doi.org/10.1038/s41597-024-02965-2

Received: 09 November 2023

Accepted: 12 January 2024

Published: 26 January 2024

DOI: https://doi.org/10.1038/s41597-024-02965-2
