Population Genetics in a nutshell
Institute of Zoology, Chinese Academy of Sciences, Beijing, China
University of Science and Technology of China
Oct 2018

A short self-introduction
1997-2002 USTC, School of Life Science, BS
2002-2004 Cornell University, Biostat, MS
2004-2008 UC Berkeley, Population Genetics, Ph.D
2009.2-2009.6 UC Berkeley, Postdoc
2009.7-2013.8 Beijing Institute of Genomics, Associate Investigator
2013.9-2018.6 Genome Institute of Singapore, Principal Investigator
2017.6-2018.6 National Cancer Center, Singapore, Joint faculty
2018.7- Institute of Zoology, Chinese Academy of Sciences

Topics covered in this lecture
i) The historical context for the origin of Population Genetics
ii) Key concepts in Population Genetics
1) Hardy Weinberg Equilibrium
2) Natural selection and deterministic theory
3) Finite populations, Wright-Fisher Model, Genetic drift, effective population size
4) Mutation, measures of variability, mutation drift balance, Molecular Clock
5) Other forms of natural selection
6) Recombination, linkage disequilibrium, genome wide association studies
7) Later part of the 20th century for Population Genetics

General philosophy
1) The mathematical treatment of the subject can be a bit foreign to many of you. I will try to explain it as intuitively as possible.
2) The rise of Population Genetics, especially the quantitative treatment of the subject, helped establish the legitimacy of evolutionary biology, a primarily historical science, in a scientific climate that favored experimental methods over historical ones.
3) This lecture is meant to give you a glimpse of this field. If you have any questions, please feel free to stop me.

I) Historical context for the origin of Population Genetics

Darwin and the evolutionary theory
1) Common descent and survival of the fittest. It is an argument based on phenotypes: survival of the fittest operates on a complex phenotype called fitness.

2) It lacks elements related to inheritance: how phenotypes are inherited. It also lacks the power to explain variability within a population.
3) Understanding the relationship between genotype and phenotype is the core question of genetics, which didn't exist in Darwin's time.
4) Darwin himself was a geologist.

Alfred Wallace

Darwin's perspective on inheritance
Darwinian theory is a theory without a mechanism or genetic basis. Darwin had no knowledge of modern genetics. Pangenesis was Darwin's attempt to provide such a mechanism of inheritance. The idea was that each part of the parent's body emitted tiny particles called gemmules, which migrated through the body to contribute to that parent's gametes.

The rediscovery of Mendelian law and the birth of modern genetics
Hugo de Vries (Netherlands), Carl Correns (Germany) and Erich von Tschermak (Austria) independently rediscovered Mendel's work in the same year (1900).

1902-1903: chromosome theory of inheritance (Walter Sutton and Theodor Boveri)

The fundamental conflict between Mendelian Genetics and Biometricians

Mendelians                     Biometricians
Segregation is discrete        Traits are continuous
Traits are binary/discrete     Inheritance is blending/averaging
William Bateson                Francis Galton (founder)
Hugo de Vries                  Raphael Weldon
                               Karl Pearson (protege of Galton)

Origin of Population Genetics
R.A. Fisher, 1890-1962
JBS Haldane, 1892-1964
Sewall Wright, 1889-1988 (fitness landscape)
Fisher showed that the continuous variation measured by the biometricians could be produced by the combined action of many discrete genes, and that natural selection could change gene frequencies in a population, resulting in evolution (1918-1930).

II) Key concepts in Population Genetics

The scope of Population Genetics
Why are the patterns of variation as they are? (mathematical theory)
What are the forces that influence levels of variation?
What is the genetic basis for evolutionary change?
What data can be collected to test hypotheses about the factors that impact allele frequency?
What is the relation between genotypic variation and phenotypic variation?
Evolutionary forces: Mutation, Random genetic drift, Recombination/gene conversion, Migration/Demography, Natural selection

2.1 Hardy Weinberg Equilibrium

Hardy Weinberg equilibrium
The Hardy-Weinberg principle, also known as the Hardy-Weinberg equilibrium, model, theorem, or law, states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences.
G. H. Hardy (English mathematician) and Wilhelm Weinberg (German obstetrician-gynecologist)

A few concepts
1) Locus: a position in the genome. It can be a single base or a genomic segment.
2) Allele: one of the different forms of the sequence at a genomic locus.
3) Genotype frequency: the proportion of each genotype in a population (sample). Given a locus with two alleles, there are three possible genotypes: AA, Aa and aa. The frequencies of these genotypes are called genotype frequencies, often denoted f(AA), f(Aa), f(aa).
4) Allele frequency: the proportion of each allele in a population (sample). For example, the allele frequency of A, often denoted f(A), is f(A) = f(AA) + 1/2 * f(Aa). Likewise, f(a) = f(aa) + 1/2 * f(Aa).
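As a rough illustration of definitions 3) and 4), the sketch below turns genotype counts into genotype and allele frequencies. The function name and the counts (20 AA, 50 Aa, 30 aa) are made up for this example and do not come from the lecture.

```python
def genotype_and_allele_frequencies(n_AA, n_Aa, n_aa):
    """Return ((f_AA, f_Aa, f_aa), (f_A, f_a)) from genotype counts."""
    total = n_AA + n_Aa + n_aa
    if total == 0:
        raise ValueError("need at least one sampled individual")
    # Genotype frequencies: each count divided by the sample size.
    f_AA, f_Aa, f_aa = n_AA / total, n_Aa / total, n_aa / total
    # Each heterozygote carries one copy of A and one copy of a,
    # so it contributes half of its frequency to each allele.
    f_A = f_AA + 0.5 * f_Aa
    f_a = f_aa + 0.5 * f_Aa
    return (f_AA, f_Aa, f_aa), (f_A, f_a)


# Hypothetical sample, not taken from the lecture.
genotype_freqs, allele_freqs = genotype_and_allele_frequencies(20, 50, 30)
print("genotype frequencies:", genotype_freqs)
print("allele frequencies:", allele_freqs)
```

The two allele frequencies always sum to 1, which is a quick sanity check on any hand calculation.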

Exercise in class I
Suppose you take a sample of 100 humans and genotype their hemoglobin gene. Let's assume only two alleles are observed, which you denote F (fast) and S (slow). Let's assume there are 50 FF individuals, 40 FS and 10 SS. What are the estimated genotype and allele frequencies?

Mathematical form of HWE
Let p = f(A) and q = f(a), with p + q = 1. When the population is random mating, the expected genotype frequencies will be
f(AA) = p^2, f(Aa) = 2pq, f(aa) = q^2,
with f(AA) + f(Aa) + f(aa) = 1.

Assumptions underlying HWE
organisms are diploid
only sexual reproduction occurs
generations are nonoverlapping
mating is random
population size is infinitely large
allele frequencies are equal in the sexes
there is no migration, mutation or selection
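To make the HWE prediction concrete, here is a minimal sketch that applies one round of random mating to a population that starts far from HWE. The starting genotype frequencies (0.6, 0.0, 0.4) and the function names are assumptions chosen for illustration, and all of the assumptions listed above are taken to hold.

```python
def hwe_genotype_frequencies(p):
    """Expected (f_AA, f_Aa, f_aa) under random mating, given p = f(A)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def allele_frequency_of_A(f_AA, f_Aa, f_aa):
    """f(A) = f(AA) + 1/2 * f(Aa)."""
    return f_AA + 0.5 * f_Aa


# Hypothetical starting population with no heterozygotes at all.
start = (0.6, 0.0, 0.4)
p0 = allele_frequency_of_A(*start)

# One generation of random mating gives the HWE proportions.
next_gen = hwe_genotype_frequencies(p0)
p1 = allele_frequency_of_A(*next_gen)

print("genotypes after random mating:", next_gen)
print("f(A) before and after:", p0, p1)
```

The genotype frequencies change in one generation, while the allele frequency stays the same, which is the sense in which frequencies "remain constant" under HWE.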

Another interpretation of HWE
There are two degrees of freedom in genotype space (three variables f(AA), f(Aa) and f(aa), constrained by f(AA) + f(Aa) + f(aa) = 1), but only one degree of freedom in allelic space (two variables f(A) and f(a), constrained by f(A) + f(a) = 1). Hardy Weinberg equilibrium creates a one-to-one map between allele frequencies and the genotype frequencies that satisfy HWE. In other words, HWE allows you to predict genotype frequencies from allele frequencies (and vice versa).

Testing HWE and goodness of fit test
A null model makes predictions about the expected values in each category. We can then compare the predictions of the model against the actual observations and calculate a test statistic (TS):
chi^2 = sum over categories of (Observed - Expected)^2 / Expected
It can be proven that, when the null model is true, this TS approximately follows a chi-square distribution.

Exercise in class
Testing for Hardy Weinberg Equilibrium

How to test a hypothesis in general?
When you have a hypothesis in mind (in this case, HWE), you want to test whether it is true or not. A typical procedure is to assume the null hypothesis is true, derive its predictions about the system, and then compare your observed values against the expected values to decide whether the two match. There are many statistical approaches for testing the match between observed and expected values. Today, we will use one of them.

Remind ourselves: HWE
When the population is random mating, the expected genotype frequencies will be f(AA) = p^2, f(Aa) = 2pq, f(aa) = q^2, with f(AA) + f(Aa) + f(aa) = 1.

Example

You sampled a population with 30 AA, 45 Aa and 25 aa individuals. Is HWE true for this sample?
Step 1) Estimate the allele frequency of the sample. There are 100 individuals, and the frequency of A = f(AA) + 1/2 * f(Aa). Since f(AA) = 0.30 and f(Aa) = 0.45, the frequency of A is 0.525.
Step 2) Predict the expected genotype frequencies from the allele frequencies under HWE. Since f(A) = 0.525, f(a) = 0.475. The predicted counts of AA, Aa and aa are 100*p^2 = 27.6, 100*2pq = 49.9 and 100*q^2 = 22.6, taken here as approximately 28, 50 and 22 respectively.

Example continued
Step 3) Compare the observed counts against the expected counts with the test statistic chi^2 = sum over categories of (Observed - Expected)^2 / Expected.

Category   Observed   Expected   (O-E)^2/E
AA         30         28         0.14
Aa         45         50         0.50
aa         25         22         0.41
SUM                              1.05

The 5% cutoff for the chi-square test is 3.84. The calculated value is smaller than 3.84, so there is not enough evidence to reject HWE in this case.

Side note
The chi-square distribution has a parameter called the degrees of freedom. In this case, we use chisq(df=1).
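The goodness-of-fit calculation for this example can be sketched in a few lines of Python. This is only an illustration: the function name is invented here, the 3.84 cutoff is hard-coded from the slide rather than computed, and the expected counts are left unrounded, so the statistic comes out slightly below the 1.05 obtained with rounded counts (roughly 0.96).

```python
CHI2_CUTOFF_DF1_5PCT = 3.84  # 5% cutoff for df = 1, as quoted in the lecture


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p_hat, expected_counts, chi_square) for a two-allele locus."""
    n = n_AA + n_Aa + n_aa
    # Allele frequency of A: count gene copies among the 2n sampled genes.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        raise ValueError("only one allele observed; the test is undefined")
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, n * 2 * p * q, n * q * q)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi_square


p_hat, expected_counts, chi2 = hwe_chi_square(30, 45, 25)
reject_hwe = chi2 > CHI2_CUTOFF_DF1_5PCT

print("estimated f(A):", p_hat)
print("expected counts:", expected_counts)
print("chi-square:", chi2)
print("reject HWE at the 5% level:", reject_hwe)
```

Either way the statistic is well below the cutoff, so the conclusion of the worked example does not depend on the rounding.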

The number of degrees of freedom is calculated as: (# of categories) - (# of estimated parameters) - 1. Here that is 3 - 1 - 1 = 1.

Key points
1) How to calculate allele and genotype frequencies in a population (sample).
2) How to predict genotype frequencies from allele frequencies, assuming HWE.

2.2 Natural selection and deterministic theory

A few concepts
1) Fitness: describes individual reproductive success. It is equal to the average contribution to the gene pool of the next generation made by individuals of the specified genotype or phenotype.
2) Viability selection: selection on whether individual organisms survive until they are able to reproduce.
3) Absolute/relative fitness. Absolute: the number of offspring (or reproductive success) of a genotype. Relative: fitness rescaled against one genotype. For example, we sometimes rescale fitness against the genotype with the largest (or smallest) fitness.

Deterministic theory (viability selection)
When the population size is infinitely large, how does the population evolve from generation to generation (i.e. how do allele and genotype frequencies change)?

                        AA      Aa      aa
Fitness                 1+s     1+hs    1
Freq before selection   p^2     2pq     q^2

Population mean fitness: w_bar = p^2*(1+s) + 2pq*(1+hs) + q^2
Freq after selection: f(AA) = p^2*(1+s)/w_bar, f(Aa) = 2pq*(1+hs)/w_bar and f(aa) = q^2*1/w_bar

Exercise in class 2
Given h=1/2 and s=0.01, starting from a frequency of A of 0.5, what is the frequency of A after one generation?

Example given in class
The same problem worked step by step: h=1/2, s=0.01, starting frequency of A = 0.5.

                                AA          Aa            aa
Frequency at zygotic stage      p^2         2pq           q^2
                                0.25        0.5           0.25
Fitness value                   1+s=1.01    1+hs=1.005    1
After selection (unnormalized)  p^2*(1+s)   2pq*(1+hs)    q^2*1
                                0.2525      0.5025        0.25
Genotype freq post selection    0.2512      0.5           0.2488

Normalizing factor: w_bar = p^2*(1+s) + 2pq*(1+hs) + q^2*1 = 1.005
Allele freq: f(A)=0.5012, f(a)=0.4988

A computer exercise
1) Python code for simulating allele frequencies
2) R code to plot the trajectories

Observations from the deterministic theory
1) Alleles with a fitness advantage will increase in frequency until they reach equilibrium (often fixation).
2) The mean fitness of the population will increase, due to the higher prevalence of the better alleles in each generation. This increase in mean fitness is described by Fisher's fundamental theorem of natural selection.
3) Fisher predicted not only that mean fitness will increase, but also the amount of increase in each generation (not covered here).

Key concepts
1) Understand the operational details of how to calculate allele and genotype frequencies using deterministic theory.
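One possible version of the Python part of the computer exercise is sketched below. It is not the code used in class: the function names are assumptions, and it simply iterates the viability-selection recursion from this section with the fitness values 1+s, 1+hs and 1.

```python
def next_generation(p, s, h):
    """One generation of viability selection; returns (new f(A), w_bar)."""
    q = 1.0 - p
    # Population mean fitness, weighting each genotype by its HWE frequency.
    w_bar = p * p * (1.0 + s) + 2.0 * p * q * (1.0 + h * s) + q * q
    f_AA = p * p * (1.0 + s) / w_bar
    f_Aa = 2.0 * p * q * (1.0 + h * s) / w_bar
    # New allele frequency: f(AA) + 1/2 * f(Aa) after selection.
    return f_AA + 0.5 * f_Aa, w_bar


def trajectory(p0, s, h, generations):
    """Allele frequency of A at generations 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        p_next, _w_bar = next_generation(freqs[-1], s, h)
        freqs.append(p_next)
    return freqs


# The worked example: h = 1/2, s = 0.01, starting at f(A) = 0.5.
p1, w_bar = next_generation(0.5, s=0.01, h=0.5)
print("w_bar:", w_bar)
print("f(A) after one generation:", round(p1, 4))

# A longer run shows the advantageous allele heading toward fixation.
freqs = trajectory(0.5, s=0.01, h=0.5, generations=1000)
print("f(A) after 1000 generations:", round(freqs[-1], 4))
```

The list returned by trajectory can be written to a file and plotted in R or any other tool to see the shape of the curve.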

2.3 Finite populations, Wright-Fisher Model, Genetic drift, effective population size

R.A. Fisher and Sewall Wright
Historical background: Fisher's fundamental theorem of natural selection states that the rate of increase in mean fitness equals the additive genetic variance in fitness. Fisher tended to think in terms of large populations, in which changes in allele frequencies are deterministic.

Wright's view on small populations
How does a population move from one adaptive peak to another on the fitness landscape? Genetic drift (together with gene flow/migration) allows the population to move from one peak to another through the valleys.

How populations evolve: The Wright-Fisher model
[Figure: a population followed across Generation 1 to Generation 5]
We need a mathematical framework to describe the change in allele frequency from generation to generation.

Mechanics of the Wright-Fisher model
There are N individuals and therefore 2N genes. To form the new generation, we sample 2N new genes with replacement from the previous generation. This is the Wright-Fisher model. The number of copies of an allele in the new generation follows a binomial distribution (multinomial for more than two alleles), conditional on the allele frequency in the previous generation.

A simple exercise
A population of size 100 has a current allele frequency of 0.5. If the population follows the Wright-Fisher model, what is the allele frequency in the next generation? The frequency in the next generation is a random variable (r.v.), not a fixed value. The mean of the r.v. is the current frequency, p = 0.5, and its variance is p(1-p)/(2N) = 0.00125.

Predictions and properties of the Wright-Fisher model
Genetic drift: the random fluctuation of allele frequencies across generations.
Drift is measured by the variance in allele frequencies across generations.
The effect of drift is larger in small populations and much smaller in bigger populations.
The expectation of the allele frequency is constant from generation to generation.
The long term fate of an allele is either fixation or extinction. In other words, drift will reduce the level of genetic variability.

Illustration of random genetic drift
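A small simulation can stand in for the illustration. The sketch below is a hedged example using only the Python standard library: the seed, replicate count and population sizes are arbitrary choices, and the comparison value p(1-p)/(2N) is the standard binomial variance for one generation of Wright-Fisher sampling.

```python
import random
import statistics


def wright_fisher_next(p, n_individuals, rng):
    """Sample the next-generation allele frequency: Binomial(2N, p) / 2N."""
    n_genes = 2 * n_individuals
    copies = sum(1 for _ in range(n_genes) if rng.random() < p)
    return copies / n_genes


def simulate(p0, n_individuals, generations, rng):
    """One drift trajectory: frequencies at generations 0..generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(wright_fisher_next(freqs[-1], n_individuals, rng))
    return freqs


rng = random.Random(2018)  # arbitrary seed so the run is repeatable

# One generation from p = 0.5 with N = 100, repeated many times.
one_step = [wright_fisher_next(0.5, 100, rng) for _ in range(5000)]
mean_next = statistics.mean(one_step)
var_next = statistics.pvariance(one_step)
print("mean of next-generation frequency:", mean_next)
print("variance:", var_next, "theory:", 0.5 * 0.5 / 200)

# Drift is much stronger in the small population than in the large one.
small = simulate(0.5, 10, 50, rng)
large = simulate(0.5, 1000, 50, rng)
print("N=10   :", [round(f, 2) for f in small[::10]])
print("N=1000 :", [round(f, 2) for f in large[::10]])
```

Running it with different seeds gives different trajectories, while the mean of the one-generation frequencies stays close to the starting value.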

A few properties of the Wright-Fisher Model
The long term fate of an allele is either loss or fixation.
The probability of fixation of an allele of frequency p is: ?
An allele that has just entered the population has frequency 1/2N. If it eventually fixes, the expected time to fixation is about 4N generations.
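The fixation-probability question can be explored by simulation. The sketch below is an illustration under stated assumptions: the allele is neutral, the population follows the Wright-Fisher model, and N = 10, p = 0.3, the seed and the replicate count are arbitrary. Standard neutral theory, which is not derived in these slides, says the fixation probability equals the starting frequency, and the simulated fraction can be compared against that.

```python
import random


def run_until_absorbed(p0, n_individuals, rng):
    """Run Wright-Fisher sampling until the allele is lost or fixed.

    Returns (fixed, generations), where fixed is True if the allele
    reached frequency 1 and False if it was lost.
    """
    n_genes = 2 * n_individuals
    p = p0
    generations = 0
    while 0.0 < p < 1.0:
        copies = sum(1 for _ in range(n_genes) if rng.random() < p)
        p = copies / n_genes
        generations += 1
    return p == 1.0, generations


rng = random.Random(7)  # arbitrary seed so the run is repeatable
N = 10
p0 = 0.3
replicates = 2000

outcomes = [run_until_absorbed(p0, N, rng) for _ in range(replicates)]
fixation_fraction = sum(1 for fixed, _ in outcomes if fixed) / replicates
mean_absorption_time = sum(g for _, g in outcomes) / replicates

print("starting frequency:", p0)
print("fraction of runs ending in fixation:", fixation_fraction)
print("mean generations until loss or fixation:", mean_absorption_time)
```

Every run ends in loss or fixation, which is the first property listed above; changing p0 shows how the fixation fraction tracks the starting frequency.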

Effective population size and the Wright Fisher model
Census population size (N): the actual number of individuals in a population/species.
Effective population size (Ne): a concept that maps an actual population onto a Wright-Fisher population (much like an ideal gas). Ne is the size of the Wright-Fisher population that shows the same properties as the real population (e.g. the variance in the change in allele frequency).

Real population (size N)  <->  properties of the population  <->  Wright-Fisher population (size Ne)

Key Concepts
1) Understand the Wright-Fisher model (including fixation probability).
2) Understand that genetic drift reduces genetic variability.