Issues with creating Genome Browsers for Whole Genome Assemblies
G-OnRamp Beta Users Workshop
Wilson Leung
06/2017

Outline
  Obtain genome assemblies from NCBI
  Transfer large genomics datasets to Galaxy
  Obtain RNA-Seq data from NCBI SRA
  Identifying and masking repeats
  Obtain protein sequences for tblastn searches
  Obtain RNA GenBank files for translated BLAT searches

Types of evidence tracks on a Genome Browser
  Protein alignments (SPALN)
  Geneid
  N-SCAN
  PASA-EST
  Augustus

NCBI BioProject
  Entry point to all genomic datasets (e.g., genome assembly, transcriptome) that pertain to a study
  Data from the 1000 Genomes Project are available through NCBI BioProject

SRA = Sequence Read Archive
  Database of high-throughput sequencing data
  Data available through NCBI, EBI, and DDBJ

Access genome assemblies from the NCBI Assembly database
  https://www.ncbi.nlm.nih.gov/assembly
  Download data files for GenBank and RefSeq whole genome assemblies

Types of genome assemblies
  RefSeq categories:
    Reference genome
      High quality assembly
      Standard for comparison
      Example: D. melanogaster
    Representative genome
      Best genome assembly available within a clade
  https://www.ncbi.nlm.nih.gov/assembly/help/

Obtain genome assemblies from the NCBI FTP site
  Download genome sequence, predicted transcript and protein products
  Consistent primary sequence IDs (accession.version) for both GFF and FASTA files
  https://www.ncbi.nlm.nih.gov/books/NBK431016/#

Naming conventions for GenBank assemblies
  [assembly accession]_[assembly name]_[content type].[format]

  Content type   Description
  genomic        Genome assembly (repeats identified by WindowMasker are in lower-case)
  rm             Transposons identified by RepeatMasker (Eukaryotes only)

  See the README.txt file within the directory for details

Common data formats used by GenBank assemblies
  Format   Description
  fna      Nucleotide sequence in FASTA format
  faa      Protein sequence in FASTA format
  gbff     GenBank flat file format
  gff      General Feature Format Version 3

Large data files are compressed by gzip
  File suffix = .gz
  Supported by Galaxy
  Built-in support in macOS
  Use 7-Zip on MS Windows
    http://www.7-zip.org/
  Genome assembly in FASTA format: _genomic.fna.gz
    Example: GCA_000269505.2_DroMir_2.2_genomic.fna.gz

DEMO: Access the D. miranda genome assembly from the NCBI FTP site

Benefits of using FTP to transfer large files to Galaxy
  Problems with standard file upload
    Most servers have a 2 GB file upload size limit
    Cannot monitor progress of file upload
    Cannot resume interrupted file upload
  Galaxy Main and G-OnRamp support FTP file upload
    Support transfer of large gzip, bzip2, and zip files
  https://galaxyproject.org/ftp-upload/

Overview of the File Transfer Protocol (FTP)
  Data transfer protocol between a client and a server
    May allow anonymous access
    Insecure connection
  Partial built-in support in most operating systems
    macOS: Go > Connect to Server
    MS Windows: File Explorer
  Other graphical clients
    Cyberduck, FileZilla, Fugu

Use FTP to upload files to Galaxy
  Use an FTP client to initiate an FTP connection to Galaxy
    Galaxy Main FTP server: ftp://usegalaxy.org
    Use your Galaxy account credentials to authenticate
  Transfer files to the Galaxy FTP server
  Use the Upload File tool to import contents of the FTP directory into Galaxy
    Files available through the Choose FTP file button

Directly transfer files from the NCBI FTP site to Galaxy
  Open Connection to Galaxy Main in Cyberduck
    Server: usegalaxy.org
    Enter the username and password for your Galaxy account
  File > New Browser
  Copy the FTP link to the GenBank assembly at NCBI
  Paste link into the Quick Connect textbox and press Enter
  Select and drag files from the NCBI connection window to the Galaxy connection window
  Compatible with version 6.0.0 of

DEMO: Use FTP to upload the D. miranda genome assembly to Galaxy

Transfer high-throughput sequencing data from the SRA to Galaxy
  Second and third generation sequencing data available through the Sequence Read Archive (SRA)
  NCBI SRA stores sequencing data in sra format
    Use the SRA Toolkit to convert files to fastq (fastq-dump)
    Paired-end reads might split at the wrong position:
      https://www.biostars.org/p/12569/

Goals of repeat analysis
  Improve G-OnRamp workflow:
    Improve performance of tblastn and BLAT searches
    Reduce number of false positives in gene predictions
  Survey of the repetitive contents of a genome:
    Estimate total repeat density
    Types and distributions of transposons
  Develop repeat pipeline to handle genome assemblies with different sizes and quality
    Assembly sizes: 111 Mb - 2.8 Gb
    Number of scaffolds: 54 - 402,501
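
To see where a given assembly falls within these ranges, one can count its scaffolds and total bases before choosing a repeat-finding strategy. The following is a minimal sketch using only the Python standard library; the function names and the toy sequences are hypothetical and are not part of the G-OnRamp workflow.

```python
import gzip
import io


def assembly_stats(fasta_handle):
    """Return (number_of_scaffolds, total_bases) for a FASTA text stream."""
    n_scaffolds = 0
    total_bases = 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_scaffolds += 1  # each header line starts a new scaffold
        else:
            total_bases += len(line)  # soft-masked (lower-case) bases count too
    return n_scaffolds, total_bases


def assembly_stats_from_gz(path):
    """Same as assembly_stats, but reads a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return assembly_stats(handle)


# Toy input with two made-up scaffolds (illustration only)
demo = io.StringIO(">scaffold_1\nACGTACGT\nACGT\n>scaffold_2\nacgtnn\n")
print(assembly_stats(demo))  # prints (2, 18)

# For a downloaded assembly, assuming the file is in the working directory:
# assembly_stats_from_gz("GCA_000269505.2_DroMir_2.2_genomic.fna.gz")
```

Reading the file line by line keeps memory use low, which matters for assemblies at the upper end of the size range.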

Strategies used to identify repeats in five genome assemblies
  k-mer based: WindowMasker, Tallymer
  tRNA derived SINEs: tRNAscan-SE
  Structure based: LTRharvest + LTRdigest, TRF, TanTan
  Conserved domains within transposons: transposonPSI
  Species-specific repeat library:
    RepBase repeats from closely-related species (if available)
    RepeatScout
    MUMmer + PILER
    RepeatModeler
  Repeat classification: RepeatClassifier

Repeat tracks available on the G-OnRamp Assembly Hubs
  WindowMasker
  Tallymer
  TRF
  RepeatMasker
  Nested repeats
  LTRHarvest
  TransposonPSI
  http://old-gep.wustl.edu/~wilson/gonramphubs/

Accurate repeat identification requires the use of multiple techniques
  [Figure: Repeat libraries, Arabidopsis thaliana repeatome]
  Maumus F, Quesneville H. PLoS One. 2014 Apr 7;9(4):e94101.

RepeatScout run time vs. genome size
  [Figure: RepeatScout run time (seconds) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

High memory requirement of k-mer based repeat finders
  [Figure: RepeatScout memory required (Gb) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

Partition genome assembly into smaller batches
  Shuffle scaffolds in genome assembly
    Scaffolds in the original assembly are often ordered by size
  Batch size optimization criteria:
    Avoids memory errors (i.e., segmentation faults)
    Can be processed in a reasonable amount of time
  Batch size for RepeatScout and PILER: 100 Mb per batch
    Compare only within each batch
  Random sample of 600 Mb for the X. laevis genome

Use tandem repeat masked genome assembly to improve performance
  Some genomes (e.g., C. reinhardtii) contain a high density of tandem repeats
    Degrades performance of many repeat finding algorithms
    Results in large number of spurious matches
  RepeatModeler analysis of C. reinhardtii (111 Mb)
    Requires ~130 hours to process unmasked genome
    Requires ~90 hours to process tandem repeat masked genome
  Requires ~30 hours to process A. vittata genome (1.2 Gb)
  Use tandem repeat masked assembly in the RepeatModeler and PILER analyses

Recent changes to RepeatMasker and RepeatModeler
  New Dfam_consensus database:
    Creative Commons CC0 1.0 public domain license
    http://www.dfam-consensus.org/
  Support searches using profile Hidden Markov Models
    HMMER + Dfam

Obtain protein sequences for tblastn searches
  Species-specific databases
    FlyBase: dmel-all-translation-r6.15.fasta.gz
    http://flybase.org/static_pages/downloads/bulkdata7.html
  Swiss-Prot
    High quality, manually annotated section of UniProtKB
    http://www.uniprot.org/downloads
  NCBI RefSeq
    Use only curated RefSeq records (accession prefix = NP_)
    Protein sequences from RefSeq reference genomes
    https://www.ncbi.nlm.nih.gov/books/NBK50679/

Misannotations in public databases
  [Figure: average % misannotation, grouped by number of sequences in family]
  Schnoes AM, et al. PLoS Comput Biol. 2009 Dec;5(12):e1000605.

Obtain Swiss-Prot protein sequences
  UniProt download page (http://www.uniprot.org/downloads)
    Entire Swiss-Prot database
    Swiss-Prot sequences separated by taxonomic divisions
      Human, invertebrates, mammals, plants, rodents, vertebrates
    Download files with the uniprot_sprot prefix
    Use the seqret EMBOSS tool in Galaxy to create FASTA file
  Search for reviewed:yes entries in UniProtKB
    http://www.uniprot.org/uniprot/?query=reviewed%3Ayes
    Filter protein sequences by taxonomy, keywords, gene ontology, enzyme class or pathways
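
Filtering by taxonomy can also be done locally once a UniProtKB FASTA file has been downloaded. The sketch below assumes the standard UniProtKB FASTA header layout, in which reviewed (Swiss-Prot) entries start with `>sp|` and the NCBI taxonomy identifier appears as `OX=<id>`. The function name and the three demo records are made up for illustration; only the taxonomy IDs (7227 for D. melanogaster, 9606 for human) are real.

```python
import io
import re

# Matches the NCBI taxonomy identifier field in a UniProtKB FASTA header
OX_PATTERN = re.compile(r"\bOX=(\d+)\b")


def filter_swissprot_by_taxon(fasta_handle, taxon_id):
    """Yield (header, sequence) for reviewed records with the given taxon ID."""

    def keep(header):
        # Keep only reviewed (sp|) entries whose OX field matches taxon_id
        if header is None or not header.startswith(">sp|"):
            return False
        match = OX_PATTERN.search(header)
        return match is not None and int(match.group(1)) == taxon_id

    header, seq_lines = None, []
    for line in fasta_handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if keep(header):
                yield header, "".join(seq_lines)
            header, seq_lines = line, []
        elif line:
            seq_lines.append(line)
    if keep(header):  # flush the final record
        yield header, "".join(seq_lines)


# Made-up records (hypothetical accessions and sequences)
demo = io.StringIO(
    ">sp|P00001|TEST1_DROME Made-up protein OS=Drosophila melanogaster OX=7227 GN=test1 PE=1 SV=1\n"
    "MKTAYIAK\n"
    ">tr|A0A000|A0A000_DROME Made-up unreviewed protein OS=Drosophila melanogaster OX=7227 PE=4 SV=1\n"
    "MSSSS\n"
    ">sp|P00002|TEST2_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=TEST2 PE=1 SV=1\n"
    "MAAAA\n"
)
kept = list(filter_swissprot_by_taxon(demo, 7227))
print(len(kept), kept[0][1])  # prints 1 MKTAYIAK
```

Only the reviewed D. melanogaster record survives: the `tr|` entry is unreviewed and the human entry has a different taxon ID.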

DEMO: Download Swiss-Prot protein sequences from UniProt

NCBI Reference Sequence database
  More comprehensive than Swiss-Prot
  Two major types of RefSeq records:
    Known RefSeq: NP_
    Model RefSeq: XP_
  Model RefSeq records are based on results from computational pipelines
    More likely to propagate annotation errors
  https://www.ncbi.nlm.nih.gov/refseq/about/

Obtain protein sequences from the NCBI RefSeq database
  Download from the NCBI Genome database
    https://www.ncbi.nlm.nih.gov/genome/
  Search the NCBI Protein database with the RefSeq and reviewed filters
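
When a downloaded RefSeq protein set mixes both record types, the accession prefix is enough to separate them. The sketch below only covers the two prefixes named above (NP_ and XP_) and labels everything else "other"; the function names and the demo accessions are hypothetical.

```python
def refseq_record_type(accession):
    """Classify a RefSeq protein accession by its prefix."""
    if accession.startswith("NP_"):
        return "known"  # curated record
    if accession.startswith("XP_"):
        return "model"  # produced by a computational pipeline
    return "other"  # any prefix not covered in this sketch


def curated_only(accessions):
    """Keep only accessions of known (NP_) RefSeq records."""
    return [acc for acc in accessions if refseq_record_type(acc) == "known"]


# Made-up accessions for illustration
demo_accessions = ["NP_000001.1", "XP_000002.1", "NP_000003.2"]
print(curated_only(demo_accessions))  # prints ['NP_000001.1', 'NP_000003.2']
```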

Obtain RNA GenBank files for translated BLAT searches
  Available through the NCBI FTP server
  File with the _rna.gbff.gz suffix

Obtain the RNA GenBank file for D. melanogaster

Summary
  Obtain genome assemblies from NCBI
  Use FTP to transfer large genome assemblies to Galaxy
  Use EBI SRA to transfer fastq files to Galaxy
  Use different approaches to identify repetitive sequences in a genome
  Obtain transcript and protein sequences from NCBI and UniProtKB for sequence similarity searches

Questions?
  https://flic.kr/p/bhyT8B