Issues with creating Genome Browsers for Whole Genome Assemblies
G-OnRamp Beta Users Workshop
Wilson Leung
06/2017

Outline
  Obtain genome assemblies from NCBI
  Transfer large genomics datasets to Galaxy
  Obtain RNA-Seq data from NCBI SRA
  Identifying and masking repeats
  Obtain protein sequences for tblastn searches
  Obtain RNA GenBank files for translated BLAT searches

Types of evidence tracks on a Genome Browser
  Protein alignments (SPALN)
  Geneid
  N-SCAN PASA-EST
  Augustus (with RNA-Seq)
  RNA PolII ChIP-Seq (MACS2)
  RNA-Seq Coverage
  TopHat junctions
  StringTie + TransDecoder
  RepeatMasker

Obtaining the genome assembly from NCBI
  BioProject (https://www.ncbi.nlm.nih.gov/bioproject/)
    Entry point to all genomic datasets (e.g., genome assembly, transcriptome) that pertain to a study
    Data from the 1000 Genomes Project available through NCBI BioProject
  SRA = Sequence Read Archive
    Database of high-throughput sequencing data
    Data available through NCBI, EBI, and DDBJ
  Access genome assemblies from the NCBI Assembly database (https://www.ncbi.nlm.nih.gov/assembly)
    Download data files for GenBank and RefSeq whole genome assemblies

Types of genome assemblies

  RefSeq categories (https://www.ncbi.nlm.nih.gov/assembly/help/):
    Reference genome
      High quality assembly
      Standard for comparison
      Example: D. melanogaster
    Representative genome
      Best genome assembly available within a clade

Obtain genome assemblies from the NCBI FTP site
  Download genome sequence, predicted transcript and protein products
  Consistent primary sequence IDs (accession.version) for both GFF and FASTA files
  https://www.ncbi.nlm.nih.gov/books/NBK431016/#

Naming conventions for GenBank assemblies

  [assembly accession.version]_[assembly name]_[content type].[format]

  Content type   Description
  genomic        Genome assembly (repeats identified by WindowMasker are in lower-case)
  rm             Transposons identified by RepeatMasker (Eukaryotes only)

  See the README.txt file within the directory for details

Common data formats used by GenBank assemblies
  Format   Description
  fna      Nucleotide sequence in FASTA format
  faa      Protein sequence in FASTA format
  gbff     GenBank flat file format
  gff      General Feature Format Version 3

  Large data files are compressed by gzip
    File suffix = .gz
    Supported by Galaxy
    Built-in support in macOS
    Use 7-Zip on MS Windows

      http://www.7-zip.org/

Genome assembly in FASTA format: _genomic.fna.gz
  Example: GCA_000269505.2_DroMir_2.2_genomic.fna.gz

DEMO: Access the D. miranda genome assembly from the NCBI FTP site

Benefits of using FTP to transfer large files to Galaxy
  Problems with standard file upload
    Most servers have a 2 GB file upload size limit
    Cannot monitor progress of file upload
    Cannot resume interrupted file upload
  Galaxy Main and G-OnRamp support FTP file upload
    Support transfer of large gzip, bzip2, and zip files
    https://galaxyproject.org/ftp-upload/

Overview of the File Transfer Protocol (FTP)
  Data transfer protocol between a client and a server
  May allow anonymous access
  Insecure connection
  Partial built-in support in most operating systems
    macOS: Go > Connect to Server
    MS Windows: File Explorer
  Other graphical clients
    Cyberduck, FileZilla, Fugu, ...

Use FTP to upload files to Galaxy

  Use an FTP client to initiate an FTP connection to Galaxy
    Galaxy Main FTP server: ftp://usegalaxy.org
    Use your Galaxy account credentials to authenticate
  Transfer files to the Galaxy FTP server
  Use the Upload File tool to import contents of the FTP directory into Galaxy
    Files available through the Choose FTP file button

Directly transfer files from the NCBI FTP site to Galaxy
  Open Connection to Galaxy Main in Cyberduck
    Server: usegalaxy.org
    Enter the username and password for your Galaxy account
  File > New Browser
  Copy the FTP link to the GenBank assembly at NCBI
  Paste link into the Quick Connect textbox and press Enter
  Select and drag files from the NCBI connection window to the Galaxy connection window
  Compatible with version 6.0.0 of

DEMO: Use FTP to upload the D. miranda genome assembly to Galaxy

Transfer high-throughput sequencing data from the SRA to Galaxy
  Second and third generation sequencing data available through the Sequence Read Archive (SRA)

  NCBI SRA stores sequencing data in sra format
    Use the SRA Toolkit to convert files to fastq (fastq-dump)
    Paired-end reads might split at the wrong position: https://www.biostars.org/p/12569/

Goals of repeat analysis
  Improve G-OnRamp workflow:
    Improve performance of tblastn and BLAT searches
    Reduce number of false positives in gene predictions
  Survey of the repetitive contents of a genome:
    Estimate total repeat density
    Types and distributions of transposons
  Develop repeat pipeline to handle genome assemblies with different sizes and quality
    Assembly sizes: 111 Mb - 2.8 Gb
    Number of scaffolds: 54 - 402,501
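
A quick way to see where an assembly falls in the size and scaffold-count ranges above, and to get a first rough estimate of repeat density, is to scan the FASTA file directly. The sketch below is only an illustration, not part of the G-OnRamp workflow: the function names are hypothetical, and it assumes the soft-masking convention of the GenBank genomic FASTA files, where repeats identified by WindowMasker are in lower-case.

```python
import gzip
import io


def open_fasta(path):
    """Open a FASTA file as text, transparently handling gzip (.gz) files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def assembly_stats(handle):
    """Return (num_scaffolds, total_bases, lowercase_bases) for a FASTA stream."""
    num_scaffolds = total = lower = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            num_scaffolds += 1  # one header line per scaffold
        else:
            total += len(line)
            # Assumption: lower-case bases mark soft-masked repeats.
            lower += sum(1 for base in line if base.islower())
    return num_scaffolds, total, lower


def soft_masked_fraction(total, lower):
    """Fraction of bases that are lower-case (0.0 for an empty assembly)."""
    return lower / total if total else 0.0


if __name__ == "__main__":
    # Tiny made-up example; a real call would use open_fasta("..._genomic.fna.gz").
    demo = io.StringIO(">scaffold_1\nACGTacgtNN\n>scaffold_2\nacgtACGT\nAC\n")
    n, total, lower = assembly_stats(demo)
    print(n, total, lower, soft_masked_fraction(total, lower))
```

The lower-case fraction reflects only what the masking tool found, so it is a starting point for the repeat survey rather than a final repeat density.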

Strategies used to identify repeats in five genome assemblies
  k-mer based: WindowMasker, Tallymer
  tRNA derived SINEs: tRNAscan-SE
  Structure based: LTRharvest + LTRdigest, TRF, TanTan
  Conserved domains within transposons: transposonPSI
  Species-specific repeat library:
    RepBase repeats from closely-related species (if available)
    RepeatScout
    MUMmer + PILER
    RepeatModeler
  Repeat classification: RepeatClassifier

Repeat tracks available on the G-OnRamp Assembly Hubs
  WindowMasker
  Tallymer
  TRF
  RepeatMasker
  Nested repeats
  LTRHarvest
  TransposonPSI
  http://old-gep.wustl.edu/~wilson/gonramphubs/

Accurate repeat identification requires the use of multiple techniques
  [Figure: Repeat libraries, Arabidopsis thaliana repeatome]

  Maumus F, Quesneville H. PLoS One. 2014 Apr 7;9(4):e94101.

RepeatScout run time vs. genome size
  [Figure: RepeatScout run time (seconds) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

High memory requirement of k-mer based repeat finders
  [Figure: RepeatScout memory required (Gb) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

Partition genome assembly into smaller batches
  Shuffle scaffolds in genome assembly
    Scaffolds in the original assembly are often ordered by size
  Batch size optimization criteria:
    Avoids memory errors (i.e., segmentation faults)

    Can be processed in a reasonable amount of time
  Batch size for RepeatScout and PILER: 100 Mb per batch
    Compare only within each batch
    Random sample of 600 Mb for the X. laevis genome

Use tandem repeat masked genome assembly to improve performance
  Some genomes (e.g., C. reinhardtii) contain a high density of tandem repeats
    Degrades performance of many repeat finding algorithms
    Results in a large number of spurious matches
  RepeatModeler analysis of C. reinhardtii (111 Mb)
    Requires ~130 hours to process unmasked genome
    Requires ~90 hours to process tandem repeat masked genome
    Requires ~30 hours to process A. vittata genome (1.2 Gb)
  Use tandem repeat masked assembly in the RepeatModeler and PILER analyses

Recent changes to RepeatMasker and RepeatModeler
  New Dfam_consensus database:
    Creative Commons CC0 1.0 public domain license
    http://www.dfam-consensus.org/
  Support searches using profile Hidden Markov Models
    HMMER + Dfam

Obtain protein sequences for tblastn searches
  Species-specific databases
    FlyBase: dmel-all-translation-r6.15.fasta.gz
    http://flybase.org/static_pages/downloads/bulkdata7.html
  Swiss-Prot
    High quality, manually annotated section of UniProtKB
    http://www.uniprot.org/downloads
  NCBI RefSeq
    Use only curated RefSeq records (accession prefix = NP_)

    Protein sequences from RefSeq reference genomes
    https://www.ncbi.nlm.nih.gov/books/NBK50679/

Misannotations in public databases
  [Figure: average % misannotation by # sequences in family]
  Schnoes AM, et al. PLoS Comput Biol. 2009 Dec;5(12):e1000605.

Obtain Swiss-Prot protein sequences
  UniProt download page (http://www.uniprot.org/downloads)
    Entire Swiss-Prot database
    Swiss-Prot sequences separated by taxonomic divisions
      Human, invertebrates, mammals, plants, rodents, vertebrates, ...
      Download files with the uniprot_sprot prefix
      Use the seqret EMBOSS tool in Galaxy to create FASTA file
  Search for reviewed:yes entries in UniProtKB
    http://www.uniprot.org/uniprot/?query=reviewed%3Ayes
    Filter protein sequences by taxonomy, keywords, gene ontology, enzyme class or pathways
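
If the full Swiss-Prot FASTA file has already been downloaded, a taxonomy filter similar to the one on the UniProt website can be approximated offline. The sketch below is a hedged illustration rather than a UniProt tool: the function names are hypothetical, the records in the demo are made up, and it assumes the usual UniProt FASTA header layout in which the organism name follows `OS=` and is terminated by the next `OX=`, `GN=`, `PE=`, or `SV=` field.

```python
import io
import re

# Captures the organism name after "OS=" up to the next two-letter field.
OS_PATTERN = re.compile(r"OS=(.*?)(?=\s(?:OX|GN|PE|SV)=|$)")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_organism(handle, organism_prefix):
    """Keep records whose OS= organism name starts with organism_prefix."""
    kept = []
    for header, seq in read_fasta(handle):
        match = OS_PATTERN.search(header)
        if match and match.group(1).startswith(organism_prefix):
            kept.append((header, seq))
    return kept


if __name__ == "__main__":
    # Made-up records that only mimic the header layout.
    demo = io.StringIO(
        ">sp|X00001|A_DROME Protein A OS=Drosophila melanogaster OX=7227 GN=a PE=1 SV=1\n"
        "MKT\nLLV\n"
        ">sp|X00002|B_HUMAN Protein B OS=Homo sapiens OX=9606 GN=b PE=1 SV=1\n"
        "MAA\n"
    )
    for header, seq in filter_by_organism(demo, "Drosophila"):
        print(header.split()[0], len(seq))
```

Matching on a name prefix such as "Drosophila" keeps a whole genus; a stricter filter would compare the full organism name.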

DEMO: Download Swiss-Prot protein sequences from UniProt

NCBI Reference Sequence database
  More comprehensive than Swiss-Prot
  Two major types of RefSeq records:
    Known RefSeq: NP_
    Model RefSeq: XP_
  Model RefSeq records are based on results from computational pipelines
    More likely to propagate annotation errors
  https://www.ncbi.nlm.nih.gov/refseq/about/

Obtain protein sequences from the NCBI RefSeq database
  Download from the NCBI Genome database
    https://www.ncbi.nlm.nih.gov/genome/
  Search the NCBI Protein database with the RefSeq and reviewed filters
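
When a downloaded RefSeq protein FASTA file mixes known and model records, the accession prefix is enough to separate them before building a tblastn database. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the accessions in the demo are invented, and the accession is assumed to be the first whitespace-delimited token of each header line.

```python
import io


def split_refseq_proteins(handle):
    """Group FASTA records by RefSeq accession prefix.

    Returns a dict with keys "NP_" (known), "XP_" (model), and "other";
    each value is a list of [header, sequence] entries.
    """
    groups = {"NP_": [], "XP_": [], "other": []}
    current = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            accession = parts[0] if parts else ""
            prefix = accession[:3]
            key = prefix if prefix in ("NP_", "XP_") else "other"
            current = [line[1:], ""]
            groups[key].append(current)
        elif line and current is not None:
            current[1] += line  # sequence lines are concatenated per record
    return groups


if __name__ == "__main__":
    # Invented accessions, used only to show the header layout.
    demo = io.StringIO(
        ">NP_000001.1 example protein [Drosophila melanogaster]\nMKV\nLAA\n"
        ">XP_000002.1 predicted protein [Drosophila miranda]\nMSS\n"
    )
    groups = split_refseq_proteins(demo)
    print(len(groups["NP_"]), len(groups["XP_"]), len(groups["other"]))
```

Keeping only the "NP_" group follows the recommendation to use curated records; the "XP_" group can still be inspected separately when no curated proteins exist for a species.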

Obtain RNA GenBank files for translated BLAT searches
  Available through the NCBI FTP server
    File with the _rna.gbff.gz suffix
  Obtain the RNA GenBank file for D. melanogaster

Summary
  Obtain genome assemblies from NCBI
  Use FTP to transfer large genome assemblies to Galaxy
  Use EBI SRA to transfer fastq files to Galaxy
  Use different approaches to identify repetitive sequences in a genome
  Obtain transcript and protein sequences from NCBI and UniProtKB for sequence similarity searches

Questions?
  https://flic.kr/p/bhyT8B
