Next Generation Sequencing analysis
June 6th, 2017

Course instructors
Antonio Marco
Stuart Newman
Vladimir Teif

Course plan
11.00-12.00: Introductory lecture
12.00-12.30: Lunch
12.30-14.00: ChIP-seq practical
14.15-16.00: RNA-seq practical
16.15-18.00: Integrative analysis

1st Generation Sequencing

Microarrays
Affymetrix microarrays

2nd (Next) Generation Sequencing
Illumina MiSeq

Microarrays and NGS are used for different purposes
http://www.genengnews.com/Contributor/ShawnCBakerPhD/5687/

NGS METHODS AND THEIR APPLICATIONS
[Figure labels: chromatin domains; Hi-C]
Figure adapted from http://www.scienceinschool.org

NGS data types
RNA-seq, GRO-seq, CAGE, SAGE, CLIP-seq, Drop-seq: gene expression; non-coding RNA
ChIP-seq, MNase-seq, DNase-seq, ATAC-seq, etc.: protein binding; histone modifications; chromatin accessibility; nucleosome positioning
Bisulfite sequencing: DNA methylation
Hi-C, 3C, 4C, ChIA-PET, etc.: chromatin loops in 3D
Amplicon sequencing: targeted regions; phylogenomics; metagenomics
Whole Genome Sequencing (WGS): de novo assembly (new species or new analyses)
A curated bibliography of NGS methods (~100 methods) can be found at https://liorpachter.wordpress.com/seq/

Where to get NGS data?
Do your own experiment
Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo
Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive https://www.ebi.ac.uk/ena
The Cancer Genome Atlas (TCGA) https://tcga-data.nci.nih.gov/tcga
Exome Aggregation Consortium (ExAC) http://exac.broadinstitute.org/
You also have to upload your data!

How to analyze NGS data?
Ask a bioinformatician: you need to explain what you want, and for that you need to understand what can be done and how
Do it yourself:
    Command line > become a bioinformatician
    Online wrappers > simpler, but file size limits
Example of a convenient online tool: Galaxy http://galaxy.essex.ac.uk/

ChIP-seq experiment workflow
1. Crosslink protein-DNA complexes in situ
2. Isolate nuclei and fragment DNA (sonication or digestion)
3. Immunoprecipitate with antibody against target nuclear protein and reverse crosslinks
4. Release DNA, prepare sequencing library and submit for sequencing
Adapted from www.VisiScience.com

ChIP-seq analysis workflow
www.utsouthwestern.edu/labs.bioinformatics-core/analysis/chip-seq.png

NGS output after sequencing: .fastq files (FASTQ format)
NGS data after mapping: .bed files (BED format)
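To make the two file formats above concrete, here is a minimal Python sketch of reading them. It assumes the standard layouts: a FASTQ record is four lines (name starting with "@", sequence, "+" separator, quality string), and a BED line is tab-separated with chromosome, 0-based start and exclusive end in the first three columns. The function and class names are hypothetical, and the reads and the interval are made-up example data, not course data.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per 4 lines of a FASTQ stream (stops at an empty line or EOF)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no information here
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], sequence, quality)  # drop the leading '@'


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three mandatory BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


# Made-up example data (hypothetical reads and mapping position).
example_fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nTTGACCA\n+\nIIIIIII\n"
)
reads = list(read_fastq(example_fastq))
interval = parse_bed_line("chr1\t100\t136\tread1\t0\t+")

print(len(reads), "reads; first sequence:", reads[0].sequence)
print("mapped interval length:", interval.end - interval.start)
```

In practice these files are far too large to load into memory at once, which is why the reader above yields one record at a time.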

Alignment/mapping tools: Bowtie, BWA, ELAND, Novoalign, BLAST, ClustalW
TopHat (for RNA-seq)

Data view in genome browsers
Jung et al., NAR 2014
UCSC Genome Browser (online)
IGV (install on a local computer)

Peak shapes can be different
Park P. J., Nature Genetics, 2009

ChIP-seq: reads to peaks/regions
MACS2 (universal)
HOMER (universal)
SICER (histones)
PeakSeq
edgeR
CisGenome
Park P. J., Nature Genetics, 2009

RNA-seq: reads to genes/regions
DESeq, edgeR, Cuffdiff

DNA methylation data
Bismark
DMRcaller

Intersecting genomic regions
BedTools (command line)
Galaxy (online)

Genomic features are also regions
Is ChIP-seq signal enriched there?
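The idea of intersecting two sets of genomic regions can be sketched in a few lines of Python. This is only an illustration of the concept, not how BedTools or Galaxy implement it: it compares every peak with every feature, which is far too slow for real datasets. The interval coordinates are made up, and the names `overlap_length` and `intersect` are hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end) with BED-style coordinates: start inclusive, end exclusive
Interval = Tuple[str, int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if none or different chromosomes)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def intersect(peaks: List[Interval], features: List[Interval]) -> List[Interval]:
    """Return the peaks that overlap at least one feature by 1 bp or more."""
    return [p for p in peaks if any(overlap_length(p, f) > 0 for f in features)]


# Made-up example: three ChIP-seq peaks and two promoter regions.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
promoters = [("chr1", 150, 400), ("chr2", 300, 400)]

hits = intersect(peaks, promoters)
fraction = len(hits) / len(peaks)
print(hits, "fraction of peaks in promoters:", round(fraction, 2))
```

Whether such a fraction means "enriched" depends on what would be expected by chance, for example from the share of the genome covered by the features; that comparison is not shown here.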

Mattout et al., Genome Biology, 2015

Let's look at many similar regions
deepTools 2.0 https://github.com/fidelram/deepTools/wiki/Visualizations

ChIP-seq heat maps for all genes, scaled with respect to their start (TSS) and end (TES)
deepTools 2.0 https://github.com/fidelram/deepTools/wiki/Visualizations

Cluster heatmaps
deepTools 2.0 https://github.com/fidelram/deepTools/wiki/Visualizations

Comparing cluster heatmaps between two cell conditions
NucTools https://homeveg.github.io/nuctools/
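Heat maps of this kind are built from a matrix with one row per region and one column per position relative to a reference point such as the TSS; averaging the columns gives the aggregate profile. The Python sketch below illustrates that idea under simplifying assumptions: a single chromosome, per-base coverage stored in a dictionary, strand ignored, and no scaling between TSS and TES. It is not the deepTools or NucTools implementation; all names and numbers are hypothetical.

```python
from typing import Dict, List


def signal_matrix(coverage: Dict[int, float], centers: List[int], flank: int) -> List[List[float]]:
    """One row per region: coverage at positions center-flank .. center+flank."""
    return [
        [coverage.get(center + offset, 0.0) for offset in range(-flank, flank + 1)]
        for center in centers
    ]


def average_profile(matrix: List[List[float]]) -> List[float]:
    """Column-wise mean of the matrix: the aggregate profile over all regions."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


# Made-up per-base coverage; positions that are not listed have zero signal.
coverage = {98: 1.0, 99: 2.0, 100: 4.0, 101: 2.0, 199: 2.0, 200: 6.0, 201: 2.0}
tss_positions = [100, 200]  # hypothetical transcription start sites

matrix = signal_matrix(coverage, tss_positions, flank=2)  # the rows of a heat map
profile = average_profile(matrix)                         # the curve usually drawn above it

for row in matrix:
    print(row)
print("average:", profile)
```

Clustering the rows of such a matrix before plotting gives the cluster heatmaps mentioned above.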

Histone modifications around TSS
http://www.ie-freiburg.mpg.de/bioinformaticsfac

NGS data integration
http://determinedtosee.com/wp-content/uploads/2014/08/jigsaw-puzzle.jpg

Different datasets in several tracks of a genome browser
5mC
Gifford et al., Cell 2013

Heat maps again: signal from data 1 around regions in data 2
Here: nucleosome occupancy around bound CTCF in mouse stem cells
Vainshtein et al., BMC Genomics 2017

Correlation analysis:

any 2 datasets can be correlated
http://homer.salk.edu/homer/ngs/quantification.html

Correlation of regulatory protein binding with gene expression
[Figure: expression fold change (y-axis, 0.01 to 100) plotted against CTCF occupancy fold change (x-axis, 0.01 to 10)]
Pavlaki et al., 2016
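As a rough illustration of correlating two datasets, the Python sketch below computes a Pearson correlation coefficient between two lists of fold changes after a log10 transform, which matches the logarithmic axes of a plot like the one above. The numbers are invented for the example and are not taken from Pavlaki et al.; the function name `pearson` is hypothetical, and real analyses would normally use an established statistics library and report significance as well.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one dataset is constant")
    return covariance / math.sqrt(var_x * var_y)


# Invented fold changes for three genes (assumption: one value per gene in each dataset).
occupancy_fold_change = [0.1, 1.0, 10.0]
expression_fold_change = [0.01, 1.0, 100.0]

# Fold changes are ratios, so they are compared on a log scale.
log_occupancy = [math.log10(v) for v in occupancy_fold_change]
log_expression = [math.log10(v) for v in expression_fold_change]

r = pearson(log_occupancy, log_expression)
print("Pearson r on log10 fold changes:", round(r, 3))
```

The two lists must describe the same genes or regions in the same order; otherwise the coefficient is meaningless.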

Gene ontology (GO) analysis
Calo et al. (2015) Nature 518, 249-253
DAVID, GOrilla, GREAT, Enrichr

Motif enrichment analysis
HOMER, MEME, MEME-ChIP
Pavlaki et al., 2016

Summary of typical analyses:
Differential peak calling
Differential gene expression
Intersection of different signals
Correlation of different signals
Motif sequence analysis
Gene Ontology analysis

Questions?

Computer cluster and Linux
NGS data are stored in very large text files

NGS analysis is usually performed on a computer cluster using Linux.
Why Linux? Because it is free, open-source, and very stable. Plus historic reasons.
Linux likes working with large text files :)

WinSCP: Windows file manager
genome.essex.ac.uk

Putty: Linux command line
genome.essex.ac.uk

Learning Linux in 5 minutes

There are two options for your work in Linux:
1) Type your commands one by one in Putty
2) Write all commands in a file, called a bash file, then execute this file; all commands written there will be executed
We have prepared your bash files; you will just need to execute them

5 Linux commands you need
cd DirectoryName    change directory
less FileName       read file FileName
qsub FileName       submit bash file FileName to the cluster queue for execution
qstat               check progress of jobs of all users
wc FileName         count lines, words and bytes in FileName (wc -l FileName counts lines only)
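Counting lines is useful because a FASTQ file stores each read on four lines, so the number of reads is the line count divided by 4. As an illustration of what a line count does, here is a small Python sketch that streams a file in chunks instead of loading it into memory. The temporary file and its contents are made up for the example, and `count_lines` is a hypothetical helper, not part of any tool used in the course.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Count newline characters by reading the file in 1 MB chunks."""
    with open(path, "rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


# Write a tiny made-up FASTQ file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    example_path = tmp.name

n_lines = count_lines(example_path)
n_reads = n_lines // 4  # four lines per FASTQ record
os.remove(example_path)

print(n_lines, "lines,", n_reads, "reads")
```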

Useful shortcuts
To copy/paste from Windows to Putty: copy with [CTRL]+[C], then right-click in Putty to paste
Anywhere in the command line in Putty: [up], [down] keys scroll through command history
Auto completion of file/directory names: [TAB]
When specifying a directory name:
".." (dot dot) refers to the parent directory
"~" (tilde) or "~/" refers to the home directory

Additional Linux hints
All commands, usernames, passwords, file & directory names in Linux are case sensitive.
File paths (locations of files) use /, not \, e.g. /storage/projects/.
Avoid using spaces in filenames

Questions?

