HornwortBase 2022
=================
Updated: 2023-04-5 by PWS

Files
-----
*genome.fasta -- contains scaffolds and unscaffolded contigs, numbered incrementally from longest to shortest. The number is preceded by an 'S' or 'C' to designate a scaffold (i.e. chromosome/pseudomolecule) or an unscaffolded contig, respectively. The first two letters of the genus and the first three of the specific epithet are used as taxon identifiers (with infraspecific ID if applicable). Sequence metadata includes a unique hash ID that will be used to designate different assemblies if necessary.

*gene_annotations.gtf/gff -- these are the main gene annotations. Features include genes, transcripts, CDS, exons, and introns. Some features may differ between GTF and GFF because the conversion software tries to make small corrections. Genes are named in a Phytozome-like style, where each gene name is made of the sequence name + 'G' + a number. The number increments by 100 along each sequence and restarts on each sequence.

*gene_functional_annotations.tsv -- these are functional annotations generated by eggNOG emapper based on the PROT.fasta for each species.

*masked_gene_annotations.gtf/gff -- these are genes annotated in repeat-masked parts of the genome, provided for those studying repetitive elements or in case a gene of interest isn't found in the main gene annotations. They are named as above, except that they include 'REG' (repetitive element gene) instead of 'G'.

*repeat_annotations.gff -- these are annotations of repetitive elements identified in the genome. It is the combined output of EDTA and tandemrepeatsfinder.

*TRANS/CDS/PROT.fasta -- these are intron-containing transcripts/spliced coding sequences/translated amino acid sequences of all transcripts in the main gene annotation file. Sequence metadata includes the hash ID from the genome assembly and the hash ID for this version of the annotations.

*primary_transcripts.fasta -- these are subsets of CDS/PROT files with only one transcript per gene.
See below for details on how primary transcripts were chosen.

MD5.txt -- MD5 hashes of all the above files are provided in this file to detect any changes.

Methods
-------
This directory contains new hornwort genomes and annotations following HiC scaffolding. Genomes were assembled from Oxford Nanopore sequencing data using Flye v2.8, polished with Illumina reads using Pilon v1.24, and scaffolded by Phase Genomics. Gaps in the scaffolds were closed with TGS-Gapcloser v1.1.1, which attempts to bridge adjacent scaffolds using Oxford Nanopore reads, then polishes any inserted sequence with Illumina reads.

Genomes were checked for contamination using a combination of evidence from: 1) different HiC contact patterns, 2) taxonomic annotation of contigs, 3) different levels of sequencing depth, and 4) different GC content of contigs. After removing likely contaminants based on at least two of these criteria, scaffolds/contigs were reordered from longest to shortest and numbered sequentially.

A custom repetitive element (RE) library was constructed for each genome using EDTA v2. LTRs of unknown type were extracted from the RE library and BLASTed against a curated set of transposases (located at /home/fay-wei/bin/Tpases020812). Any LTRs that could not be assigned to a known type were removed from the RE library. The RE library was then BLASTed against a set of identified plant proteins (located at /home/fay-wei/bin/uniprot/uniprot_sprot_plants.fasta) to remove anything erroneously identified as a repetitive element. After any plant proteins were removed, the RE library was used as input to RepeatMasker v4.1.0 to mask the genome. Tandem repeats were annotated from the repeat-masked genome using tandemrepeatsfinder (TRF) (1). TRF output, converted to GFF format using a custom script, was combined with the GFF file output by EDTA to create the final repeat annotation file.

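
The contamination filter described above (remove a sequence when at least two of the four lines of evidence flag it, then reorder the remainder from longest to shortest) can be sketched as follows. This is only an illustration and not the code used to build these files. The class `ContigEvidence`, the function `keep_and_sort`, and the boolean flags are hypothetical names, and how each flag is decided (HiC pattern, taxonomy, depth, GC content) is assumed to have been worked out beforehand.

```python
from dataclasses import dataclass


@dataclass
class ContigEvidence:
    """Per-sequence contamination evidence (hypothetical structure)."""
    name: str
    odd_hic_contacts: bool   # 1) HiC contact pattern differs from the rest
    foreign_taxonomy: bool   # 2) taxonomic annotation points to another taxon
    odd_depth: bool          # 3) sequencing depth differs from the rest
    odd_gc: bool             # 4) GC content differs from the rest

    def flags(self) -> int:
        # Count how many of the four criteria flag this sequence.
        return sum((self.odd_hic_contacts, self.foreign_taxonomy,
                    self.odd_depth, self.odd_gc))


def keep_and_sort(evidence, lengths, min_flags=2):
    """Drop sequences flagged by >= min_flags criteria, then return the
    remaining names ordered from longest to shortest."""
    kept = [ev.name for ev in evidence if ev.flags() < min_flags]
    return sorted(kept, key=lambda name: lengths[name], reverse=True)


if __name__ == "__main__":
    evidence = [
        ContigEvidence("a", False, False, False, False),
        ContigEvidence("b", True, True, False, False),   # two flags: removed
        ContigEvidence("c", False, False, True, False),  # one flag: kept
    ]
    lengths = {"a": 100, "b": 500, "c": 300}
    print(keep_and_sort(evidence, lengths))
```

The `min_flags` default of 2 mirrors the "at least two of these criteria" rule; a sequence flagged by a single criterion is retained.
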

Genes were predicted using BRAKER v2 on the repeat-masked genome, with training data from Illumina RNA reads mapped to the repeat-masked genome with HISAT2, combined with proteins predicted from draft versions of all the hornwort genomes (2). A second run of BRAKER was done on the full genome, without any masked repeats (3). To create an annotation set of only genes in the repeat-masked regions, bedtools was used to identify genes in the full-genome annotations that did not overlap any gene in the masked-genome annotations. All features associated with those genes (whether or not they were found in both annotation sets) were moved to a new GTF file (4).

Genes were renamed to a Phytozome-style format (described above), and those containing internal stop codons were marked with "pseudo=true" in the description field (5). GTF files were converted to GFF using the AGAT toolkit v1. CDS and protein sequences were generated by BRAKER, while unspliced transcript sequences were created from the GTF and genome files with custom scripts. Primary transcripts of the unspliced transcript, CDS, and protein sequences were selected as the longest transcript for each gene and moved to new files using custom scripts (6).

Functional annotations were generated from the amino acid translations of all transcripts (*PROT.fasta) using eggNOG emapper v2.1.9.

For additional details and code snippets see here (private Li lab repo): https://github.com/Li-Lab-BTI/Li-Lab-BTI/blob/master/peter/Hornwort_genomes_post-scaffold.md

Footnotes:
(1) The masked genome was used to avoid double-annotation of the same elements.
(2) Only genes that had at least one ortholog in another hornwort species were used.
(3) This was done to annotate genes associated with REs and to check for any non-RE genes that were accidentally masked despite filtering the RE library. I saw a small decrease in the number of BUSCO genes found in the masked genome annotations.
(4) Retaining all features associated with each gene was necessary to stop some GFF/GTF software from acting unpredictably and creating errors downstream.
(5) CDS and PROT sequences with internal stop codons have regions masked with N or X, respectively, while unspliced transcript sequences do not. Marking these genes as potential pseudogenes follows NCBI procedure.
(6) The longest transcript from each gene (as opposed to the transcript output as "t1" by BRAKER) was used because it greatly improved the BUSCO score.
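
As a rough illustration of the primary-transcript rule in the Methods and footnote (6), the sketch below keeps the longest transcript for each gene. It is not the custom script used for these files. It assumes, hypothetically, that transcript IDs take the form gene ID + '.t' + a number (BRAKER's usual convention); check the FASTA headers before relying on that. The function names `gene_of` and `longest_per_gene` and the example IDs are made up for this sketch.

```python
def gene_of(transcript_id):
    """Return the gene ID for a transcript ID.

    ASSUMPTION: transcript IDs look like '<gene ID>.t<number>'.
    IDs that do not match are treated as their own gene.
    """
    gene, sep, suffix = transcript_id.rpartition(".t")
    return gene if sep and suffix.isdigit() else transcript_id


def longest_per_gene(records):
    """Map each gene ID to the (transcript ID, sequence) pair with the
    longest sequence. On ties, the first transcript seen is kept."""
    best = {}
    for transcript_id, seq in records:
        gene = gene_of(transcript_id)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (transcript_id, seq)
    return best


if __name__ == "__main__":
    # Hypothetical IDs and sequences, for illustration only.
    records = [
        ("XxxxxS1G000100.t1", "ATGAAA"),
        ("XxxxxS1G000100.t2", "ATGAAACCC"),
        ("XxxxxS1G000200.t1", "ATG"),
    ]
    for gene, (tid, seq) in longest_per_gene(records).items():
        print(gene, tid, len(seq))
```

`records` can be any iterable of (ID, sequence) pairs, for example the output of a FASTA parser run on a CDS or PROT file.
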