# Computational Genomics - ZhiwuZhangLab

Statistical Genomics Lecture 8: Genetic architecture and simulation of phenotype Zhiwu Zhang Washington State University Outline Accumulation of gene effect Genetic architecture Gene effect distribution Phenotype distribution Heritability R function How phenotypes get normal distributed? x~B(n, p) n trials each with p successful rate. The total number of successes is a random variable, x 1 2 1 3 1 4 1 p=0.5, Left-fail Rightsuccess x=rbinom(10000,5,.0 ) Gene 1 Gene 2 1 1 3 4 6 Gene 3 Gene 4 1 1 5 1 0 1 0 5 1 0 1 2 3 4 5

Gene 5 Even genes have the same effect Outcome quartz() #Mac par(mfrow=c(4,4),mar = c(3,4,1,1)) n=10000 m=10 x=rbinom(n,m,.5) hist(x) plot(density(x)) Ten gens with same effect 8 10 6 0 2 4 6 6 8 4 6 8 N =density.default(x 10000 Bandwidth = = 0.2129 y) Density 4 0.0 0.1 0.2 0.3 0.4 Density 2500 1500 2 2 0.0 0.1 0.2 0.3 0.4 2500 1500 0 500 0 0 500 0 10 0 y Histogram of y 2 4 6 8 0.0

0.2 0.4 N =density.default(x 10000 Bandwidth = = 0.2129 y) Density 1500 8 8 10 y Histogram of y 2500 8 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) Density 6 4 8 0 500 2500 2500 1500 4 2 6 2 4 6 8 10 0 y Histogram of y 2 4 6 8 N =density.default(x 10000 Bandwidth = = 0.2129 y) 0.4 0 4 0.2

10 0 500 2 6 N =density.default(x 10000 Bandwidth = = 0.2129 y) y Histogram of y 0 4 2 0.0 8 2 density.default(x =y) Density 6 0 2500 0 Density 4 8 1500 10 1500 2 6 Frequency 8 4 Frequency 500 6 0 500 Frequency 4 y Histogram of y 0 Frequency 2 2 N =density.default(x 10000 Bandwidth = =

0.2129 y) Density 1500 2500 y Histogram of y 0 Frequency 0 Histogram of y 0 500 10 Frequency 8 0.0 0.1 0.2 0.3 0.4 6 0.0 0.1 0.2 0.3 0.4 4 0 Frequency 2 0.0 0.1 0.2 0.3 0.4 0 0.0 0.1 0.2 0.3 0.4 500 Density 1500 density.default(x = y) 0 Frequency 2500 Histogram of y 0 2 4 6 8 10 0 2 4 6 8 Ten gens with different effect ndividual total effect

Gene effect Assign to individuals Uniform random variable Individual gene effect Ten gens with different effects #x=rep(1,m) x=runif(m) gene=matrix(x,n,m,byrow = T) head(gene) galton=matrix(runif(n*m),n,m) head(galton) galton.binary=galton<.5 head(galton.binary) gene[galton.binary]=0 head(gene) y=rowSums(gene) y[1:6] hist(y) plot(density(y)) Ten gens with different effects 4 4 0.4 0.3 0.2 0.1 0.0 N =density.default(x 10000 Bandwidth = = 0.1319 y) 2 3 4 0 4 5 2 3 4 6 0 y Histogram of y 1 2 3 4 N =density.default(x 10000 Bandwidth = = 0.1186 y) 0.3 0.1 0.0 0 4 4

0.0 1 Density Frequency 2 3 0.4 Density 0 0.30 0.20 0.10 0.00 0 2 0.4 3 1 N =density.default(x 10000 Bandwidth = = 0.1153 y) 0 2 0.2 0.0 1 y Histogram of y N =density.default(x 10000 Bandwidth = = 0.1309 y) Density 500 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Density 0 Frequency 0 1 y Histogram of y 5 0.4 3 0.4 5 4 0.2 2 0.1 4 3

0 1 0.0 3 2 y Histogram of y N =density.default(x 10000 Bandwidth = = 0.1002 y) Density 1500 2 1 1500 Frequency 0.4 0 500 1 Density 0 0.0 3 0 0 1500 5 0.3 2 500 0 3 0.2 1 y Histogram of y 1000 1500 2 0.2 Density 800 400 0 Frequency 1 N =density.default(x 10000 Bandwidth = = 0.1202 y) 0

Frequency y Histogram of y Frequency Frequency 0 0.2 5 500 4 1500 3 density.default(x =y) 500 2 1500 1 Histogram of y 500 0 0.0 0.1 0.2 0.3 0.4 500 Density 1500 density.default(x =y) 0 Frequency Histogram of y 0 1 2 3 4 5 0 1 2 3 4 5 6 60 70 50 60 70 50 60 Same effect 50 60

70 30 40 50 60 70 0.4 0.0 0.1 0.2 Density 0.3 1500 Frequency 500 0.4 50 60 30 40 50 60 70 0.0 0.4 3 4 Density 0.4 0.3 0.2 Density 0.1 0.0 2 3 4 5 0 1 2 3 4 5 6 0.08 Density density.default(x = y) 0.00 50 60 30 40 50 60 70 Density 0.08 N =density.default(x 10000 Bandwidth = = 0.6387

y) 0.00 0.04 40 y Histogram of y 40 50 60 70 30 40 50 60 70 N =density.default(x 10000 Bandwidth = = 0.6387 y) 0.08 y Histogram of y 0.04 70 70 2 35 45 55 65 40 y Histogram of y 50 60 70 N =density.default(x 10000 Bandwidth = = 0.6387 y) 0.08 40 1 0.04 1 Histogram of y 30 0.08

30 0.2 0.0 0 1500 1500 70 N =density.default(x 10000 Bandwidth = = 0.6387 y) 4 N =density.default(x 10000 Bandwidth = = 0.1186 y) Density 60 3 0.00 50 2 0.04 40 0.2 Density 1500 Frequency 500 0 Frequency 1500 4 0 60 1 500 0 3 500 Frequency 2 1000 Frequency 50 0.04 70 0 N =density.default(x 10000 Bandwidth = = 0.1153 y)

Density 60 4 0.00 50 3 500 0.08 0.04 Density 0.00 40 0.00 40 1 0 0.08 Density 30 y Histogram of y 2 y Histogram of y 6 0.08 1500 Frequency 40 4 0.04 30 0 30 2 N =density.default(x 10000 Bandwidth = = 0.6387 y) Density 1000 70 0 0.00 65 Density 60 5 0.00

50 0.08 Density 40 1000 1500 Frequency 40 4 density.default(x = y) 30 0 30 N =density.default(x 10000 Bandwidth = = 0.7302 x) 0.00 30 55 1000 70 45 500 60 3 N =density.default(x 10000 Bandwidth = = 0.6387 y) y Histogram of y 500 Frequency 1500 0.08 Density 50 65 500 35 0.00 40 55 0 30 40 50 60 70 N =density.default(x 10000 Bandwidth = = 0.6387 x) x Histogram of x 45 y Histogram

of y 0.08 0.04 Density 70 2 0.04 35 0.00 60 1000 1500 Frequency 70 1 0 Histogram of y 1 y Histogram of y N =density.default(x 10000 Bandwidth = = 0.1309 y) 0 60 0 Density 0.4 Density 0.0 0 500 0.08 0.00 50 0.04 50 0 40 40 0.04 Frequency Frequency 0.08 30 0.04 Density 1000 1500 Frequency 0 30

0 30 1500 70 0.00 50 40 1000 50 density.default(x = x) N =density.default(x 10000 Bandwidth = = 0.7171 x) x Histogram of x 500 30 N =density.default(x 10000 Bandwidth = = 0.6387 x) Density 1000 500 40 1000 1500 Frequency 30 0.00 70 70 500 70 1500 60 0.04 50 60 1000 50 0.08 Density 2500 40 N =density.default(x 10000 Bandwidth = = 0.6387 x) x Histogram of x 50 500

30 1000 30 40 x Histogram of x 0 0.00 60 0.04 50 0 Frequency 30 0.08 Density 1000 40 500 0.08 30 40 50 60 70 N =density.default(x 10000 Bandwidth = = 0.6387 x) x Histogram of x 1500 Histogram of x 0.00 60 0.04 50 500 30 Frequency 0.04 Density 1000 1500 Frequency 500 0 40 x Histogram of x 0 Frequency 1500 30

0 100 genes density.default(x = x) 0 0.4 0.3 0.2 Density 5 0 1 2 3 4 5 6 Histogram of x 4 0.1 4 0 1 2 3 4 5 6 N =density.default(x 10000 Bandwidth = = 0.1319 y) 0 8 3 1500 6 3 density.default(x = y) 5 1000 4 2 Frequency 2 2 4 500 1500 0 1 y Histogram of y 3 0 10 1 1500 8 1500 0 2

1000 6 0 0.0 8 1 y Histogram of y N =density.default(x 10000 Bandwidth = = 0.1002 y) 0.30 6 0 Frequency 4 3 Density 4 5 500 2 2 500 Frequency 2 4 0 0 1 y Histogram of y 0 3 0.2 800 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) 2 0.20 8 1 0.10 10 0

N =density.default(x 10000 Bandwidth = = 0.1202 y) 0 8 5 1000 6 4 0.0 0.1 0.2 0.3 0.4 1500 500 0 Frequency 6 0.4 4 3 400 Frequency 4 0.2 Density 2 y Histogram of y 2 0 2 0.0 0 8 0.0 0.1 0.2 0.3 0.4 Density Density 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) Frequency 8 0.0 0.1 0.2 0.3 0.4 2500 1500 2500 1500 10 y Histogram of y 1 y

Histogram of y 1500 6 8 0 1000 4 6 8 Frequency 2 4 6 500 0 2 4 0 10 500 Frequency 6 0 500 0 2 Histogram of y 0.00 8 4 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) 500 Density 6 2 10 density.default(x = y) 0 2500 1500 4 0 0 500 2

8 N =density.default(x 10000 Bandwidth = = 0.2129 y) 8 0.4 10 y Histogram of y 0 6 6 0.2 8 4 4 y Histogram of y Density Density 6 2 2 Histogram of y 0.0 2500 1500 4 Frequency 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) 0 Frequency 2 0 2500 10 500 0 8 1500 8 6 Frequency 6 density.default(x = y)

0 500 Density 4 y Histogram of y 0 Frequency 2 4 Frequency 2500 1500 500 0 2 2500 0 N =density.default(x 10000 Bandwidth = = 0.2129 y) 1500 10 Histogram of y 0 500 8 0.0 0.1 0.2 0.3 0.4 Density 6 0.0 0.1 0.2 0.3 0.4 4 0 Frequency 2 y Histogram of y 0.0 0.1 0.2 0.3 0.4 0 0.0 0.1 0.2 0.3 0.4 2500 1500 500 Frequency density.default(x = y) 0 10 genes Histogram of y 30 40

50 60 Different effects 30 40 50 60 70 Complex traits Controlled by multiple genes Influenced by environment Also known as quantitative traits Most traits are continuous, e.g. yield and height, Some are categorical, e.g. node number, score of disease resistance Some binary traits are still quantitative traits, e.g. diabetes Economically important Dissecting phenotype Y= G + E + GxE + Residual G = Additive + Dominance + Epistasis E: Environment, e.g. year and location Residual: e.g. measurement error QTL and QTN Quantitative Trait Loci (QTL) Specific regions in the genome that are associated with quantitative traits The regions were traditionally considered as 1015 cM. The regions evolved to 3-5 cM in fine mapping. Suppose maize genome contains 50,000 genes and has genetic map =1700 cM Each QTL we identify has roughly 100-150 genes Gene cloning: identification the QTL regions at gene level Ultimate goal: functional mutations Quantitative Trait Nucleotide: QTN Gene effects Flowering time is controled by >50 genes Most have small effects The largest one only 1.5 increase days Buckler et al, Science, 2009 Distribution of QTN effect Normal distribution Geometry distribution Theoretical geometric distribution The probability distribution of the number X of Bernoulli trials needed to get one success Prob (X=k)=(1-p)k-1 p Approximated geometric distribution Effect(X=k)=pk

Genotype in Numeric format myGD=read.table(file="http://zzlab.net/GAPIT/data/ Genetic map myGM=read.table(file="http://zzlab.net/GAPIT/data/ mdp_SNP_information.txt",head=T) Sampling QTNs #Sampling QTN NQTN=10 X=myGD[,-1] n=nrow(X) m=ncol(X) QTN.position=sample(m,NQTN,replac e=F) SNPQ=as.matrix(X[,QTN.position]) QTN.position head(SNPQ) plot(myGM[,c(2,3)]) lines(myGM[QTN.position,c(2,3)],type="p",col="red") points(myGM[QTN.position,c(2,3)],type="p",col="blu e",cex = 5) 1.5e+08 0.0e+00 Position 3.0e+08 QTN positions 2 4 6 Chromosome 8 10 Additive genetic effect addeffect=rnorm(NQTN,0,1) addeffect effect=SNPQ%*%addeffect head(effect) genetic effect distribution 60 40 20 2 0 0 -2 0 50 100 200 -2 0 Index 2 4 6 8 effect Fn(x)

0.0 0.2 0.4 0.6 0.8 1.0 0 2 4 6 8 ecdf(effe ct) -2 par(mfrow=c(2,2 )) plot(effect) hist(effect) boxplot(effect) plot(ecdf(effect)) effect 4 Frequency 6 80 8 Histogram of effe ct -4 -2 0 2 4 x 6 8 Residual h2=.7 effectvar=var(effect ) effectvar residualvar=(effectvarh2*effectvar)/h2 residualvar residual=rnorm(n,0,sqrt(residual var)) head(residual) Residual distribution 60 20 40 Frequency 0 0 -2 0 50 100 200 -4 -2 Index

0 2 4 residual 0.0 0.2 0.4 0.6 0.8 1.0 Fn(x) 0 2 4 ecdf(residual) -2 par(mfrow=c(2,2 )) plot(residual) hist(residual) boxplot(residual) plot(ecdf(residual )) residual 2 80 4 Histogram of residual -4 -2 0 x 2 4 Phenotype 30 20 10 0 -2 0 50 100 150 200 250 -4 -2 0 Index 2 4 6 8 10 y

0.6 Fn(x) 0.4 0.2 0 2 4 6 0.8 8 1.0 10 ecdf(y) 0.0 -2 y=effect+residual par(mfrow=c(2,2)) plot(y) hist(y) boxplot(y) plot(ecdf(y)) 0 2 y 4 Frequency 6 40 8 50 10 Histogram of y -5 0 5 x 10 Heritability 5 4 3 1 0.1 2 0.2 0.3 plot(density(y),ylim=c(0,.5)) lines(density(effect),col="blue" ) lines(density(residual),col="re d")

0 0.0 Density 0.4 0.5 density.default(x = y) va=var(effect) ve=var(residual) vp=var(y) v=matrix(c(va,ve,vp),1,3) colnames(v)=c("A", "E","P") barplot(v,col="gray") -5 0 5 10 A E P Correlations #Plot par(mfrow=c(1,3),mar = c(30,4,10,1)) plot(y,effect) plot(y,residual) plot(residual,effect) cor(y,effect) How to do this again and again? Function G2P=function(X,h2,alpha,NQTN,distribution){ n=nrow(X) m=ncol(X) #Sampling QTN QTN.position=sample(m,NQTN,replace=F) SNPQ=as.matrix(X[,QTN.position]) QTN.position #QTN effects if(distribution=="norm") {addeffect=rnorm(NQTN,0,1) }else {addeffect=alpha^(1:NQTN)} #Simulate phenotype effect=SNPQ%*%addeffect effectvar=var(effect) residualvar=(effectvar-h2*effectvar)/h2 residual=rnorm(n,0,sqrt(residualvar)) y=effect+residual return(list(addeffect = addeffect, y=y, add = effect, residual = residual, QTN.position=QTN.position, SNPQ=SNPQ)) } Function to simulate phenotypes Load R function source('~/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo/ G2P.R') 0 1 2 3 4 5

source('~/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo/G2P.R') par(mfrow=c(3,1),mar = c(3,4,1,1)) myG2P=G2P(X,.75,1,10,"norm") str(myG2P) va=var(myG2P\$add) ve=var(myG2P\$residual) vp=var(myG2P\$y) v=matrix(c(va,ve,vp),1,3) colnames(v)=c("A", "E","P") A E barplot(v,col="gray") 15 10 5 0 E P A E P 4 6 8 10 A 2 myG2P=G2P(X,.25,1,10,"norm") va=var(myG2P\$add) ve=var(myG2P\$residual) vp=var(myG2P\$y) v=matrix(c(va,ve,vp),1,3) colnames(v)=c("A", "E","P") barplot(v,col="gray") va decrease s with decreasin g h2 0 myG2P=G2P(X,.5,1,10,"norm") va=var(myG2P\$add) ve=var(myG2P\$residual) vp=var(myG2P\$y) v=matrix(c(va,ve,vp),1,3) colnames(v)=c("A", "E","P") barplot(v,col="gray") P 1.5 1.0 0.5 0.0 Density Effect of Number of Genes density.default(x =myG2P\$y) -6 0 2 0.00 0.10 0.20

0.30 density.default(x =myG2P\$y) -5 myG2P=G2P(X,.5,1,10,"norm") plot(density(myG2P\$y),ylim=c(0,.3)) lines(density(myG2P\$add),col="blue") lines(density(myG2P\$residual),col="red") 0 5 10 15 N = 281 Bandwidth = 0.7383 0.00 0.04 0.08 density.default(x =myG2P\$y) Density myG2P=G2P(X,.5,1,100,"norm") plot(density(myG2P\$y),ylim=c(0,.1)) lines(density(myG2P\$add),col="blue") lines(density(myG2P\$residual),col="red") -2 N = 281 Bandwidth = 0.3934 Density #Desect phenotype par(mfrow=c(3,1)) myG2P=G2P(X,.5,1,2,"norm") plot(density(myG2P\$y),ylim=c(0,1.5)) lines(density(myG2P\$add),col="blue") lines(density(myG2P\$residual),col="red") -4 -20 0 20 40 N = 281 Bandwidth = 3.442 60 Effect of Number of Genes Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y 50 10 0 0 0 10 0 0 4 8 30 Frequency 50 80 60 40

20 Frequency -4 30 Frequency 60 40 20 Frequency 80 40 40 20 0 myG2P=G2P(X,.5,1,10,"norm") hist(myG2P\$add) hist(myG2P\$residual) hist(myG2P\$y) Frequency 60 0 Frequency 120 #Check on distribution par(mfrow=c(3,3),mar = c(3,4,1,1)) myG2P=G2P(X,.5,1,2,"norm") hist(myG2P\$add) hist(myG2P\$residual) -0.5 0.5 1.5 -2 -1 0 1 2 -2 0 1 2 3 4 hist(myG2P\$y) myG2P\$add myG2P\$residual myG2P\$y Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y -10 -5 0 5 10 -5 0 5 10 -30 -10 10 60 40 0 20 Frequency 80

60 40 0 20 Frequency 40 20 0 Frequency myG2P=G2P(X,.5,1,100,"norm") hist(myG2P\$add) hist(myG2P\$residual) hist(myG2P\$y) 60 myG2P\$add myG2P\$residual myG2P\$y Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y -30 -10 10 -60 -20 20 Check on gene effect distribution 4 6 8 12 16 40 20 0 Frequency 60 10 20 30 40 50 0 0 20 40 Frequency 60 Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y -6 -2 2 6 0 5 10 15

4 6 8 10 14 30 20 0 10 Frequency 30 0 10 20 40 Frequency 60 50 40 80 myG2P\$add myG2P\$residual myG2P\$y Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y 0 myG2P=G2P(X,.5,.95,10,"geom") hist(myG2P\$add) hist(myG2P\$residual) hist(myG2P\$y) Frequency Frequency #Check gene effect distribution par(mfrow=c(3,3),mar = c(3,4,1,1)) myG2P=G2P(X,.5,1,10,"geom") hist(myG2P\$add) hist(myG2P\$residual) hist(myG2P\$y) -4 0 2 4 6 5 10 15 0.0 1.0 2.0 80 Frequency 0 40

80 60 40 0 20 Frequency 80 60 40 20 0 Frequency myG2P=G2P(X,.5,.5,10,"geom") hist(myG2P\$add) hist(myG2P\$residual) hist(myG2P\$y) 120 myG2P\$add myG2P\$residual myG2P\$y Histogram of myG2P\$addHistogram of myG2P\$residual Histogram of myG2P\$y -1.0 0.0 1.0 -0.5 0.5 1.5 2.5 Highlight Accumulation of gene effect Genetic architecture Gene effect distribution Phenotype distribution Heritability R function

## Recently Viewed Presentations

• Blackpool Teaching Hospitals NHS Foundation Trust Aidan Kehoe Chief Executive Our Values Policy Context Independent NHS Board - Resource allocation - Guide commissioning Greater transparency Concentration on outcomes - Advancing Quality Communication of change Prevention & public health White paper...
• Scientific foundation of modern biotechnology based on knowledge of DNA, its replication, repair and use of enzymes to carry out in vitro splicing DNA fragments Modern biotechnology Breaking the Genetic Code - Finding the Central Dogma An "RNA Club" organized...
• [email protected] Global Burden of NCDs. NCDs such as cardiovascular, diabetes, cancer, renal, genetic and respiratory conditions are major health threat accounting for : 60%. of deaths . 47%.
• Each of the five banks have a 20 percent market share. What is the four-firm concentration ratio? What is the HHI? Michael R. Baye, Chapter 6 homework Numbers 4, 6, 10, and 14 Managerial Economics & Business Strategy Chapter 7...
• Role Based Access Control Models ... we assumed the presence of a single security officer Normally have a small administrative team to mange RBAC Propagation of rights Management Model Management Model Proposed Administrative roles and permissions are disjoint from regular...
• Must worked for a total of 1250 hours or on paid leave. (Annual, Sick, Comp or Donated Sick Leave) Can a holiday be deleted from timesheets in ETS? No. Please do not try to delete any holidays. This will generate...
• Valence Shell Electron Pair Repulsion Theory Tetrahedral Trigonal pyramidal Bent
• The discharge from the adiabatic, reversible turbine is at 25C. Determine the quality of the outlet steam (%). 12.5 Steam is supplied to a steady state turbine at 10 MPa and 600°C. The discharge from the adiabatic, reversible turbine is...