SeqMonk: NGS Analysis on your desktop

Analysing ChIP-Seq Data
Simon Andrews
[email protected]
@simon_andrews
v2019-09

Data Creation and Processing
  Starting DNA
  Fragmented DNA
  ChIPped DNA
  Sequence Library
  FastQ Sequence File
  Mapped BAM File
  Filtered BAM File
  Exploration
  Analysis

Steps in Analysis
  Define enriched regions
    Based around features
    De-novo peak prediction
  Quantitate
    Corrections and Normalisation
  Compare
    Categorical
    Quantitative

Defining Regions - Should I peak call?
  You need a single set of reference positions for analysis
    Peak calling to define solely from the data
    Feature based measurements if your exploration showed linkage to features
  If exploration showed strong and reasonably complete feature association then this is a good option

    No worries about missing weaker peaks
    More complete background (get both enriched and unenriched)
  If no feature linkage then peak call
    Only looking at enriched regions
    More difficult to do functional interpretation later on

How Peak Callers Work (MACS)
  Optimise the starting data
  Build a background model
  Test sliding windows
  Apply per-site adjustment
  Report

Optimise the starting data
  Correct the for/rev offset
  Deduplicate

Build a background model
  Lambda value
  Critical p-value (n=18)
  Figure panels: Observed, Model, Observed + Model
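The background model above can be sketched as a Poisson distribution: a lambda value gives the expected reads per window, and the critical value is the smallest count whose upper-tail probability falls under the chosen p-value. This is a simplified illustration, not the MACS implementation. The function names, the formula used for lambda (total reads spread evenly over the genome) and the example numbers are assumptions made for the sketch.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    # First tail term, computed in log space to avoid overflow for large k.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # pmf(i) from pmf(i - 1)
    return min(total, 1.0)


def background_lambda(total_reads, window_size, genome_size):
    """Expected reads per window if reads were spread evenly (assumed model)."""
    return total_reads * window_size / genome_size


def critical_count(lam, p_threshold):
    """Smallest read count n with P(X >= n) <= p_threshold."""
    n = 0
    while poisson_sf(n, lam) > p_threshold:
        n += 1
    return n


# Hypothetical numbers: 20M reads, 200bp windows, 2Gb genome.
lam = background_lambda(20_000_000, 200, 2_000_000_000)
print(lam, critical_count(lam, 1e-5))
```

A window is then a candidate only if its observed count reaches the critical count for the chosen p-value.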

Test Sliding Windows
  Generally use half of the library fragment size
  Windows whose count exceeds the critical value are kept
  Merge adjacent windows over the critical value to form peaks
  Generates candidate (not final) peak set

Correct for local variation
  Figure label: Critical value
  Generate localised model if input density is higher than the global value
  Most pessimistic p-value is kept

Broad Peaks
  Added in MACS2
  Suitable where larger regions with variable enrichment exist
  Uses two thresholds for enrichment
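The sliding window test and the local correction described above can be sketched as two small steps: merge adjacent windows over the critical value into candidate peaks, then re-score each candidate against the global lambda and any higher local input lambdas, keeping the most pessimistic p-value. This is a minimal sketch under stated assumptions, not the MACS code: windows are treated as consecutive and non-overlapping, and all function and parameter names are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def merge_windows(counts, window_size, critical_value):
    """Merge runs of adjacent windows whose count exceeds critical_value.

    counts holds read counts for consecutive, non-overlapping windows.
    Returns candidate peaks as (start, end) positions in bp.
    """
    peaks = []
    start = None
    for index, count in enumerate(counts):
        if count > critical_value:
            if start is None:
                start = index * window_size
        elif start is not None:
            peaks.append((start, index * window_size))
            start = None
    if start is not None:
        peaks.append((start, len(counts) * window_size))
    return peaks


def pessimistic_p_value(chip_count, global_lambda, local_lambdas):
    """Score a candidate against every applicable model and keep the worst p-value.

    Local models are only used when the input density is higher than the
    global value, as on the slide.
    """
    lambdas = [global_lambda] + [lam for lam in local_lambdas if lam > global_lambda]
    return max(poisson_sf(chip_count, lam) for lam in lambdas)


candidates = merge_windows([1, 9, 12, 2, 15, 1], window_size=100, critical_value=8)
print(candidates)
print(pessimistic_p_value(10, global_lambda=2.0, local_lambdas=[1.0, 5.0]))
```

Taking the maximum over p-values means a region sitting in a locally noisy part of the input has to clear a higher bar before it is reported.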

How should you apply peak callers
  Multiple ChIPs (over multiple conditions)
  Multiple Inputs

Multiple Inputs
  Input variability generally reflects general trends
    Mappability
    Genome Assembly
    Fragmentation biases
  Normally best to merge all inputs to one common reference input

Multiple ChIPs
  BAM Files
    WT ChIP 1
    WT ChIP 2
    KO ChIP 1
    KO ChIP 2
  Peak Sets
    Peaks, called on WT ChIP 1 + WT ChIP 2 + KO ChIP 1 + KO ChIP 2

Multiple ChIPs
  BAM Files and Peak Sets
    WT ChIP 1: WT Peaks 1
    WT ChIP 2: WT Peaks 2
    KO ChIP 1: KO Peaks 1
    KO ChIP 2: KO Peaks 2
  Combined
    (WT Peaks 1 And WT Peaks 2) Or (KO Peaks 1 And KO Peaks 2)

Why isn't a peak called

  Calling a peak is a combination of
    Degree of enrichment
    Behaviour of the background
    Total number of sequences

Why isn't a peak called
  Fewer peaks are called by just sub-sampling the same data
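The effect of total sequence number can be illustrated with a toy Poisson calculation: keep the fold enrichment fixed, sub-sample the reads, and the same region becomes less significant. This is a rough sketch assuming a Poisson background and expected (not randomly drawn) counts after sub-sampling. The function names and the numbers are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def p_value_after_subsampling(peak_reads, background_lambda, fraction):
    """P-value of a region when only `fraction` of all reads are kept.

    Both the peak count and the background expectation shrink by the same
    fraction, so the fold enrichment is unchanged.
    """
    kept_reads = round(peak_reads * fraction)
    kept_lambda = background_lambda * fraction
    return poisson_sf(kept_reads, kept_lambda)


# Same 5-fold enriched region, scored with all reads and with half of them.
full = p_value_after_subsampling(10, 2.0, 1.0)
half = p_value_after_subsampling(10, 2.0, 0.5)
print(full, half)
```

With a fixed p-value cut-off, regions close to the threshold drop out of the peak set as depth falls, even though nothing about the biology has changed.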

Why isn't a peak called
  With no input the region around the peak is used to model the background

Combining Peak sets
  Don't make claims based solely on the number of peaks (there were more WT peaks than KO peaks, for example)
  Don't make claims based on regions being peaks in 1 set but not another (there were 465 peaks which were specific to KO)
  It is OK to make statements about overlap (there were 794 peaks which were common to WT and KO)
  You have to address differential enrichment problems quantitatively

Quantitating ChIP data for analysis
  Quantitation of ChIP is not a simple problem
  Can start with something simple but in many cases you will need to refine this
  Simple linear, globally corrected counts are a good place to start

Normalising to input?
  Do you have significant variability in your input?
    If not then there's no need to normalise
    Check it's not just related to different peak sizes!
    Input outliers should be removed, not corrected against

  Do you have ChIP signal which is correlated with the input level?
    Only if you see correlation between ChIP and input is normalisation to be considered
    Most of the time this isn't the case

See if the input has an influence
  For truly enriched regions the input level is not predictive of the ChIP level. Normalising to input would make things worse here.

Why not always do "fold over input"?
  Inputs are generally poorly measured
    Coverage over enriched areas will be low in the input
    Fold values more influenced by input than ChIP
  Biases in input are smaller than enrichment power of the antibody

Evaluating and Normalising Enrichment
  Figure axes: Normalised Read Count against Percentile through data
  Look for systematic enrichment changes (real biology!!)

Normalising Enrichment
  Simple
    Single point of reference (percentile, size factor etc.)
    Works for small differences, not for large ones
  Enrichment specific
    Two points of reference
      Low percentile to reflect baseline
      High percentile to reflect close to saturation
    Add to match first, Multiply to match second
  Quick and Dirty
    Quantile normalisation to force a common distribution
  Don't normalise the input or use it to calculate distributions!
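The two-point "enrichment specific" normalisation above can be sketched as a shift followed by a scale: the low percentile is moved onto a common baseline (add), then the distance between baseline and the high percentile is stretched to a common value (multiply). This is one way of honouring both reference points, not necessarily how any particular tool implements it. The percentile choices (20 and 90), the target values and the function names are assumptions.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def two_point_normalise(values, target_low, target_high,
                        low_pct=20.0, high_pct=90.0):
    """Match a low and a high percentile of `values` to shared targets.

    The shift aligns the baseline; the scale is applied to the distance
    above the baseline so that the first match is not disturbed.
    """
    low = percentile(values, low_pct)
    high = percentile(values, high_pct)
    if high == low:
        raise ValueError("low and high percentiles are identical")
    scale = (target_high - target_low) / (high - low)
    return [target_low + (value - low) * scale for value in values]


# Hypothetical ChIP quantitations for one sample, matched to shared targets.
sample = [float(v) for v in range(11)]
normalised = two_point_normalise(sample, target_low=1.0, target_high=15.0)
print(percentile(normalised, 20.0), percentile(normalised, 90.0))
```

Each ChIP sample would be passed through the same targets. In line with the slide, the input is left out of this step.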

Checking Normalisation
  Figure panels: Before Normalisation, After Normalisation

Differential enrichment analysis
  Needs to be quantitative
  Needs to operate on non-deduplicated data
  Two statistical options
    Count based stats on raw uncorrected counts
      DESeq
      EdgeR
    Continuous quantitation stats on normalised enrichment values
      LIMMA

Which statistic to pick?
  If enrichment is roughly similar
    Raw counts, then DESeq/EdgeR
  If there are large differences in enrichment
    Enrichment normalisation
    LIMMA statistics

Visualisation of hits
  Map onto scatterplot for simple verification
  Normally makes sense to use log transformed counts
  Look at the data underneath candidates you make specific claims about

Hit validation
  Figure panels: Linear View, Log View
  Look whether hits make sense
  Look at points which change but were not selected
  Log scale can be useful for visualising hits
  Keep the context of non-hits

Hit validation
  Directionality
    Most ChIP enrichments are not strand-specific
    Should expect to see enrichment on both strands

Hit validation
  You should be able to see consistency between replicates

Experimental Design

Experimental Design Considerations
  All normal rules apply
    Think about sources of variation
    Don't confound variables
    Think about what batch effects might exist
  Test your antibody well before starting
    By far the biggest factor in success
    Good performance on Western / in-situ is not a guarantee, but it's a good start

Experimental Design Considerations
  Number of replicates
    Lots of studies use 2 replicates
      Fine for just finding binding sites (motif analysis)
      Not really enough for differential binding
        Huge reliance on 'information sharing'
        No accurate measurement of variance per peak
        Potentially over-predicts differential binding
    Should think about likely levels of variability and make replicates to match

Experimental Design Considerations
  Amount of sequencing
    Can be difficult to predict
    Depends on
      Genome size
      Proportion of genome which is enriched
      Efficiency of enrichment
    ENCODE standard is ~20M reads per sample
      Can get away with fewer (K4me3 for example)
      Will need more for some marks (H3 for example)
    Sequencing depth will affect ability to detect changes

Experimental Design Considerations
  Type of sequencing
    Single end is fine for most applications
      ATAC-Seq can require paired end for some analyses
    Moderate read length is required
      Can map anywhere in the genome
      50bp is probably OK. 100bp would be preferable

Material for the Course
  All
    Slides
    Exercises
    Data
    Virtual Machine Images
  Are available at

Downstream Analyses

Composition / Motif Analysis
  Composition
    Good place to start, can provide either biological or technical insight
    See if hits (up vs down) cluster based on the underlying sequence composition
  Motifs
    Great for defining putative binding sites
    Interesting to do sensitivity check
    Can do differential motif calling (for hit/non-hit)
  Compter - composition analysis
  MEME - Motif Analysis
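As a rough illustration of the composition check above, the sketch below counts base or dinucleotide frequencies across a set of hit sequences so that up and down hits can be compared. It is a stand-in written for this example and does not reproduce what Compter does. The function names and the example sequences are hypothetical.

```python
from collections import Counter


def kmer_composition(sequences, k=1):
    """Fraction of each k-mer across all sequences, ignoring k-mers with non-ACGT bases."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def gc_fraction(sequences):
    """Combined G and C fraction across all sequences."""
    composition = kmer_composition(sequences, k=1)
    return composition.get("G", 0.0) + composition.get("C", 0.0)


# Hypothetical sequences under hits which went up and hits which went down.
up_hits = ["GCGCGGCCGC", "CCGGCGCGTA"]
down_hits = ["ATTATAATGC", "TTAAATATCA"]
print(gc_fraction(up_hits), gc_fraction(down_hits))
print(kmer_composition(up_hits, k=2).get("CG", 0.0))
```

A strong composition difference between the two groups may point to biology (CpG islands, for example) or to a technical bias, which is why it is worth checking before looking for motifs.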

Gene Ontology / Pathway
  Be careful how you relate hits to genes
  Really need to have a global link between peak positions and genes
  Random positions will give significant GO hits if you just use closest/overlapping genes
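The point about random positions can be illustrated with a toy simulation: assign uniformly random positions to their closest gene and count how many each gene collects. Genes surrounded by large gene-poor regions pick up far more random positions, so any category rich in such genes can look enriched. This is a sketch with made-up gene names and coordinates on a single hypothetical chromosome, not a GO analysis.

```python
import random
from collections import Counter


def closest_gene(position, gene_positions):
    """Name of the gene whose position is nearest to `position`."""
    return min(gene_positions, key=lambda name: abs(gene_positions[name] - position))


def random_hits_per_gene(gene_positions, chromosome_length, n_positions, seed=0):
    """Count how many uniformly random positions are assigned to each gene."""
    rng = random.Random(seed)
    hits = Counter({name: 0 for name in gene_positions})
    for _ in range(n_positions):
        position = rng.randrange(chromosome_length)
        hits[closest_gene(position, gene_positions)] += 1
    return hits


# Hypothetical layout: two clustered genes and one isolated gene.
genes = {"geneA": 100_000, "geneB": 150_000, "geneC": 800_000}
hits = random_hits_per_gene(genes, chromosome_length=1_000_000, n_positions=10_000)
for name in genes:
    print(name, hits[name])
```

A background built the same way as the hit list (random or all tested positions, linked to genes by the same rule) gives a fairer reference for enrichment testing than the full gene list.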
