HornwortBase 2022
=================
Updated: 2023-04-05 by PWS


Files
-----
*genome.fasta -- contains scaffolds and unscaffolded contigs, numbered incrementally
                 from longest to shortest. The number is preceded by an 'S' or 'C' to
                 designate a scaffold (i.e. chromosome/pseudomolecule) or an
                 unscaffolded contig, respectively. The first two letters of the genus
                 and the first three of the specific epithet are used as taxon
                 identifiers (with an infraspecific ID if applicable). Sequence metadata
                 includes a unique hash ID that will be used to distinguish different
                 assemblies if necessary.

*gene_annotations.gtf/gff -- these are the main gene annotations. Features include
                             genes, transcripts, CDS, exons, and introns. Some features
                             may differ between GTF and GFF because the conversion software
                             tries to make small corrections. Genes are named in a 
                             Phytozome-like style where the gene names are made of the 
                             sequence name + 'G' + a number. The number increments by 100
                             along each sequence, and restarts on each sequence.

*gene_functional_annotations.tsv -- these are functional annotations generated by eggNOG
                                    emapper based on the PROT.fasta for each species.

*masked_gene_annotations.gtf/gff -- these are genes annotated in repeat-masked parts of 
                                    the genome, provided for those studying repetitive 
                                    elements or in case a gene of interest isn't found 
                                    in the main gene annotations. They are named as
                                    above, except that they use 'REG' (repetitive
                                    element gene) instead of 'G'.

*repeat_annotations.gff -- these are annotations of repetitive elements identified in
                           the genome, combining the output of EDTA and Tandem Repeats
                           Finder (TRF).

*TRANS/CDS/PROT.fasta -- these are, respectively, the unspliced (intron-containing)
                         transcripts, spliced coding sequences, and translated amino
                         acid sequences of all transcripts in the main gene annotation
                         file. Sequence metadata includes the hash ID from the genome
                         assembly and the hash ID for this version of the annotations.

*primary_transcripts.fasta -- these are subsets of the TRANS/CDS/PROT files with only one
                              transcript per gene. See below for details on how primary
                              transcripts were chosen.

MD5.txt -- contains MD5 hashes of all the above files, so that any changes can be detected.
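
The hashes in MD5.txt can be checked with a few lines of Python. This is a minimal
sketch, not a script shipped with this release. It assumes that MD5.txt uses the usual
md5sum layout (one "<hash>  <filename>" pair per line) and that it sits in the same
directory as the data files; the function names md5_of and check_md5 are made up for
this example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5(listing="MD5.txt"):
    """Map each listed filename to True (match), False (changed) or None (missing).

    Assumption: md5sum-style lines, "<hash>  <filename>".
    """
    base = Path(listing).parent
    results = {}
    for line in Path(listing).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum prefixes '*' in binary mode
        target = base / name
        if target.exists():
            results[name] = md5_of(target) == expected.lower()
        else:
            results[name] = None
    return results


if __name__ == "__main__" and Path("MD5.txt").exists():
    for name, ok in check_md5().items():
        status = "OK" if ok else ("MISSING" if ok is None else "CHANGED")
        print(f"{status}\t{name}")
```

On systems with GNU coreutils, "md5sum -c MD5.txt" does the same job if the file is
in that layout.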
 

Methods
-------
This directory contains new hornwort genomes and annotations following Hi-C scaffolding.
Genomes were assembled from Oxford Nanopore sequencing data using Flye v2.8, polished with
Illumina reads using Pilon v1.24, and scaffolded by Phase Genomics. Gaps in the scaffolds
were closed with TGS-GapCloser v1.1.1, which attempts to bridge each gap using
Oxford Nanopore reads, then polishes any inserted sequence with Illumina reads. Genomes were
checked for contamination using a combination of evidence: 1) different Hi-C contact
patterns, 2) taxonomic annotation of contigs, 3) different levels of sequencing depth, and
4) different GC content of contigs. Sequences flagged as likely contaminants by at least
two of these criteria were removed, and the remaining scaffolds/contigs were reordered from
longest to shortest and numbered sequentially.
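
The "at least two criteria" rule and the renumbering step can be sketched as follows.
This is only an illustration, not the code that was actually run. The Seq class, the
flag names, the taxon ID "Examp" and the exact name layout (taxon ID + S/C + number,
with one running number shared by scaffolds and contigs) are all assumptions made for
the example.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    old_name: str
    length: int
    is_scaffold: bool            # True = Hi-C scaffold, False = unscaffolded contig
    odd_hic: bool = False        # 1) Hi-C contact pattern differs
    foreign_taxon: bool = False  # 2) taxonomic annotation points to another organism
    odd_depth: bool = False      # 3) sequencing depth differs
    odd_gc: bool = False         # 4) GC content differs


def filter_and_rename(seqs, taxon_id, min_flags=2):
    """Drop sequences with >= min_flags contamination flags, then rename the rest."""
    kept = [s for s in seqs
            if s.odd_hic + s.foreign_taxon + s.odd_depth + s.odd_gc < min_flags]
    kept.sort(key=lambda s: s.length, reverse=True)  # longest first
    # Assumption: a single running number is shared by scaffolds and contigs.
    return [(s.old_name, f"{taxon_id}{'S' if s.is_scaffold else 'C'}{i}")
            for i, s in enumerate(kept, start=1)]


seqs = [
    Seq("ctg_a", 5_000, False, foreign_taxon=True, odd_gc=True),  # two flags: removed
    Seq("scf_b", 90_000, True),                                   # no flags: kept
    Seq("ctg_c", 12_000, False, odd_depth=True),                  # one flag: kept
]
for old, new in filter_and_rename(seqs, "Examp"):
    print(old, "->", new)
```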

A custom repetitive element (RE) library was constructed for each genome using EDTA v2. 
LTRs of unknown type were extracted from the RE library and BLASTed against a curated set 
of transposases (located at /home/fay-wei/bin/Tpases020812). Any LTRs that could not be 
assigned to a known type were removed from the RE library. Then the RE library was BLASTed 
against a set of identified plant proteins (located at 
/home/fay-wei/bin/uniprot/uniprot_sprot_plants.fasta) to remove anything erroneously 
identified in the RE library. After removing any plant proteins from the RE library, it 
was used as input to RepeatMasker v4.1.0 to mask the genome. Tandem repeats were annotated
from the repeat-masked genome using Tandem Repeats Finder (TRF) (1). TRF output, converted to
GFF format using a custom script, was combined with the GFF file output by EDTA to create 
the final repeat annotation file.
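
The custom TRF-to-GFF script is not included here, but the conversion could look
roughly like the sketch below. It assumes the standard TRF .dat layout ("Sequence:"
header lines followed by space-separated records whose first fields are start, end,
period size, copy number, consensus size, percent matches, percent indels and score,
with the consensus in the 14th field). The "TRF" source label, the "tandem_repeat"
feature type and the attribute names are choices made for this example, and may not
match the released repeat_annotations.gff.

```python
def trf_dat_to_gff(lines):
    """Convert TRF .dat lines to GFF3 lines (illustrative, not the original script)."""
    out = ["##gff-version 3"]
    seqid = None
    for line in lines:
        fields = line.split()
        if line.startswith("Sequence:") and len(fields) > 1:
            seqid = fields[1]  # first word after "Sequence:" is the sequence name
        elif (seqid and len(fields) >= 15
              and fields[0].isdigit() and fields[1].isdigit()):
            start, end, period, copies = fields[0:4]
            score, consensus = fields[7], fields[13]
            attrs = f"period={period};copies={copies};consensus={consensus}"
            # Tandem repeats have no strand or phase, so both columns are "."
            out.append("\t".join(
                [seqid, "TRF", "tandem_repeat", start, end, score, ".", ".", attrs]))
    return out


example = [
    "Sequence: ExampS1",                 # hypothetical sequence name
    "Parameters: 2 7 7 80 10 50 500",
    "100 150 10 5.1 10 95 2 84 25 25 25 25 2.00 ACGTACGTAC ACGTACGTACACGTACGTAC",
]
print("\n".join(trf_dat_to_gff(example)))
```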

Genes were predicted using BRAKER v2 on the repeat-masked genome with training data from 
Illumina RNA reads mapped to the repeat-masked genome with HISAT2, combined with proteins 
predicted from draft versions of all the hornwort genomes (2). A second run of BRAKER was
done on the full genome, without any masked repeats (3). To create an annotation set of 
only genes in the repeat-masked regions, bedtools was used to identify genes in the 
full-genome annotations that did not overlap any gene in the masked-genome annotations. 
All features associated with those genes (whether or not they were found in both annotation
sets) were moved to a new GTF file (4). Genes were renamed to a Phytozome-style format
(described above), and those containing internal stop codons were marked with
"pseudo=true" in the description field (5). GTF files were converted to GFF using the AGAT
toolkit v1. CDS and protein sequences were generated by BRAKER, while unspliced transcript
sequences were created from the GTF and genome files with custom scripts. For the unspliced
transcripts, CDS, and protein sequences, the longest transcript of each gene was selected
as the primary transcript and written to new files using custom scripts (6).
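
The selection of genes found only in the full-genome run can be illustrated in plain
Python. The actual step used bedtools; the exact command is not recorded in this file,
though "bedtools intersect -v" reports this kind of no-overlap set. The sketch below is
a stand-in for that idea, assuming genes are given as (sequence, start, end, gene ID)
tuples with 1-based inclusive coordinates and that strand is ignored. The function name
and the example IDs are made up.

```python
def genes_without_overlap(full_genes, masked_genes):
    """Return IDs of full-genome genes that overlap no masked-genome gene.

    Each gene is a tuple (seqid, start, end, gene_id), 1-based and inclusive.
    """
    by_seq = {}
    for seqid, start, end, _ in masked_genes:
        by_seq.setdefault(seqid, []).append((start, end))

    unique = []
    for seqid, start, end, gene_id in full_genes:
        overlaps = any(start <= m_end and m_start <= end
                       for m_start, m_end in by_seq.get(seqid, []))
        if not overlaps:
            unique.append(gene_id)
    return unique


masked = [("ExampS1", 100, 500, "m1"), ("ExampS1", 2000, 2600, "m2")]
full = [
    ("ExampS1", 150, 480, "f1"),    # overlaps m1
    ("ExampS1", 900, 1500, "f2"),   # only in the full-genome run
    ("ExampS2", 100, 500, "f3"),    # different sequence, so no overlap
]
print(genes_without_overlap(full, masked))
```

This simple version compares every pair of genes on a sequence, which is fine for a
demonstration but slower than bedtools on whole-genome annotation sets.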

Functional annotations were generated from the amino acid translations of all transcripts 
(*PROT.fasta) using eggNOG emapper v2.1.9.

For additional details and code snippets see here (private Li lab repo): 
https://github.com/Li-Lab-BTI/Li-Lab-BTI/blob/master/peter/Hornwort_genomes_post-scaffold.md


Footnotes:
(1) The masked genome was used to avoid double-annotation of the same elements.
(2) Only genes that had at least one ortholog in another hornwort species were used.
(3) This was done to annotate genes associated with REs and to check for any non-RE genes
    that were accidentally masked despite filtering the RE library. I saw a small decrease 
    in the number of BUSCO genes found in the masked genome annotations.
(4) Retaining all features associated with each gene was necessary to stop some GFF/GTF 
    software from acting unpredictably and creating errors downstream. 
(5) CDS and PROT sequences with internal stop codons have regions masked with N or X,
    respectively, while unspliced transcript sequences do not. Marking these genes as 
    potential pseudogenes follows NCBI procedure.
(6) The longest transcript from each gene (as opposed to the transcript output as "t1" by
    BRAKER) was used because it greatly improved the BUSCO score.