Differential Privacy SIGMOD 2012 Tutorial, Part 1: Motivation

Marianne Winslett
University of Illinois at Urbana-Champaign / Advanced Digital Sciences Center, Singapore
Including slides from Anupam Datta, Yufei Tao, Tiancheng Li, Vitaly Shmatikov, Avrim Blum, Johannes Gehrke, Gerome Miklau, and more!

Official outline:
- Overview of privacy concerns
- Case study: Netflix data
- Case study: genomic data (GWAS)
- Case study: social network data
- Limits of k-anonymity, l-diversity, t-closeness

Part 1A: WHY SHOULD WE CARE?

Many important applications involve publishing sensitive data about individuals.

Medical research
- What treatments have the best outcomes?
- How can we recognize the onset of disease earlier?
- Are certain drugs better for certain phenotypes?

Web search
- What are people really looking for when they search?

- How can we give them the most authoritative answers?

Public health
- Where are our outbreaks of unpleasant diseases?
- What behavior patterns or patient characteristics are correlated with these diseases?

Urban planning
- Where do people live, work, and play, e.g., as tracked by their phones' GPS?
- How do they travel between those places, e.g., via foot, bus, train, taxi, car?
- What is their travel experience, e.g., how much time do they spend waiting for transport or stuck in traffic?

Energy conservation
- How much energy do households/offices use at each time of day?
- How could the peak loads associated with this behavior be changed, e.g., by smart appliances and demand response, so that we can build fewer power plants?

Social and computer networking
- What is the pattern of phone/data/multimedia network usage? How can we better use existing (or plan new) infrastructure to handle this traffic?
- How do people relate to one another, e.g., as mediated by Facebook? How is society evolving (Census data)?

Industrial data (individual = company; need SMC if no TTP)
- What were the total sales, over all companies, in a sector last year/quarter/month?
- What were the characteristics of those sales: who were the buyers, how large were the purchases, etc.?

Today, access to these data sets is usually strictly controlled. Only available:
- Inside the company/agency that collected the data
- Or after signing a legal contract (click streams, taxi data)
- Or in very coarse-grained summaries (public health)
- Or after a very long wait (US Census data details)
- Or with IRB (Institutional Review Board) approval (dbGaP summary tables)
- Or with definite privacy issues (US Census reports, the AOL click stream, NIH dbGaP summary tables, Enron email)

Society would benefit if we could publish some useful form of the data, without having to worry about privacy.

Why is access so strictly controlled? No one should learn who had which disease. What if we de-identify the microdata records by removing names before publishing? We can still re-identify people, absolutely or probabilistically, by linking the published table with external data, such as a voter registration list, on the quasi-identifier (QI) attributes, plus background knowledge. 87% of Americans can be uniquely identified by {zip code, gender, date of birth} (actually 63% [Golle 06]). Latanya Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] used this approach to re-identify the medical record of an ex-governor of Massachusetts.
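To make the linkage step concrete, here is a minimal Python sketch of a join on {zip code, gender, date of birth}; the records, names, and diagnoses below are invented for illustration, not Sweeney's actual data.

# Hypothetical linkage attack: join "de-identified" medical microdata
# with a public voter list on the quasi-identifier {zip, gender, dob}.

medical = [  # names removed, but QI attributes kept
    {"zip": "02138", "gender": "F", "dob": "1945-07-31", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "M", "dob": "1962-01-15", "diagnosis": "flu"},
]
voters = [  # public record: names present
    {"name": "Jane Roe", "zip": "02138", "gender": "F", "dob": "1945-07-31"},
    {"name": "John Doe", "zip": "02139", "gender": "M", "dob": "1962-01-15"},
]

QI = ("zip", "gender", "dob")
by_qi = {}
for v in voters:
    by_qi.setdefault(tuple(v[a] for a in QI), []).append(v["name"])

for rec in medical:
    matches = by_qi.get(tuple(rec[a] for a in QI), [])
    if len(matches) == 1:  # a unique QI combination re-identifies the record
        print(f"{matches[0]} -> {rec['diagnosis']}")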

Real query logs can be very useful to CS researchers, but click history can uniquely identify a person. What the New York Times did:
- Find all log entries for AOL user 4417749
- Multiple queries for businesses and services in Lilburn, GA (population 11K)
- Several queries for Jarrett Arnold; Lilburn has 14 people with the last name Arnold
- NYT contacts them, finds out AOL user 4417749 is Thelma Arnold

Just because data looks hard to re-identify doesn't mean it is. [Narayanan and Shmatikov, Oakland 08] In 2006, the Netflix movie rental service offered a $1,000,000 prize for improving their movie recommendation service.

Customer   High School Musical 1   High School Musical 2   High School Musical 3   Twilight
#1         4                       5                       5                       ?

Training data: ~100M ratings of 18K movies from ~500K randomly selected customers, plus dates. Only 10% of their data; slightly perturbed.

We can re-identify a Netflix rater if we know just a little bit about her (from life, IMDB ratings, blogs, ...):
- 8 movie ratings (up to 2 of them wrong, dates within 2 weeks) re-identify 99% of raters

- 2 ratings with dates within 3 days re-identify 68% of raters
- Relatively few candidates for the other 32% (especially with movies outside the top 100)
- Even a handful of IMDB comments allows Netflix re-identification, in many cases: of 50 IMDB users tried, 2 were re-identified with very high probability, one from ratings, one from dates

The Netflix attack works because the data are sparse and dissimilar, with a long tail: considering just the movies rated, for 90% of records there isn't a single other record that is more than 30% similar.

Why should we care about this innocuous data set?
- All movie ratings: political and religious opinions, sexual orientation, ...
- Everything bought in a store: private life details
- Every doctor visit: private life details
One customer sued Netflix, saying she thought her rental history could reveal that she was a lesbian before she was ready to tell everyone.

Social networks also describe relationships between people, which can be sensitive too.
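The following toy sketch captures the flavor of the Narayanan-Shmatikov matching: score an auxiliary sample of ratings against every released record, and re-identify only when the best match clearly stands out from the runner-up. The similarity function, threshold, and data are simplifications invented here, not the paper's actual scoring.

# Toy de-anonymization in the spirit of Narayanan-Shmatikov:
# an auxiliary sample of (movie, rating) pairs is matched against
# every released record; we re-identify only when the best match
# clearly stands out from the second best.

def similarity(aux, record):
    # Fraction of auxiliary (movie, rating) pairs that agree with the record.
    hits = sum(1 for movie, rating in aux.items()
               if record.get(movie) == rating)
    return hits / len(aux)

def deanonymize(aux, released, gap=0.3):
    scores = sorted(((similarity(aux, rec), rid)
                     for rid, rec in released.items()), reverse=True)
    (best, rid), (second, _) = scores[0], scores[1]
    return rid if best - second >= gap else None  # abstain if ambiguous

released = {  # hypothetical "anonymized" ratings
    "user_1": {"HSM1": 4, "HSM2": 5, "HSM3": 5},
    "user_2": {"HSM1": 2, "Twilight": 5},
}
aux = {"HSM1": 4, "HSM2": 5}       # what we know from a blog or IMDB
print(deanonymize(aux, released))  # -> "user_1"

Abstaining when the top two scores are close is what keeps false positives rare even on a long-tailed data set.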

Nodes and edges, e.g., from call & email logs: what did they know and when did they know it?

Nodes:
ID     Age  HIV
Alice  25   Pos
Bob    19   Neg
Carol  34   Pos
Dave   45   Pos
Ed     32   Neg
Fred   28   Neg
Greg   54   Pos
Harry  49   Neg

Edges:
ID1    ID2
Alice  Bob
Bob    Carol
Bob    Dave
Bob    Ed
Dave   Ed
Dave   Fred
Dave   Greg
Ed     Greg
Ed     Harry
Fred   Greg
Greg   Harry

We could learn so much about how society works if we could freely analyze these data sets. J. Onnela et al., "Structure and tie strengths in mobile communication networks,"

Proceedings of the National Academy of Sciences, 2007.

But de-identified people (nodes) can still be re-identified using background info: match a naively anonymized network against external information. In fact, local structure is highly identifying. In the Friendster network (~4.5 million nodes), knowing a node's degree and its neighbors' degrees leaves some nodes well-protected but uniquely identifies others [Hay, VLDB 08].
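A minimal sketch of this structural attack, run on the toy HIV graph from the earlier slide: a node's signature is its degree plus the sorted degrees of its neighbors, and any node with a unique signature is re-identified. The signature choice follows the degree/neighbor-degree idea of Hay et al.; the code itself is illustrative.

# Structural re-identification sketch: signature = (degree, sorted
# degrees of neighbors). If the attacker knows a target's signature,
# a unique match in the "anonymized" graph re-identifies the node.

from collections import defaultdict

edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Bob", "Dave"),
         ("Bob", "Ed"), ("Dave", "Ed"), ("Dave", "Fred"),
         ("Dave", "Greg"), ("Ed", "Greg"), ("Ed", "Harry"),
         ("Fred", "Greg"), ("Greg", "Harry")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def signature(node):
    return (len(adj[node]),
            tuple(sorted(len(adj[n]) for n in adj[node])))

sigs = defaultdict(list)
for node in adj:
    sigs[signature(node)].append(node)

for sig, nodes in sigs.items():
    status = "UNIQUELY IDENTIFIED" if len(nodes) == 1 else "protected"
    print(sig, nodes, status)

On this graph, Bob and Greg have unique signatures and are re-identified; Alice/Carol, Dave/Ed, and Fred/Harry are pairwise indistinguishable.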

The re-identification attack can also be active:
- Active attack: embed a small random graph prior to anonymization [Backstrom, WWW 07]
- Auxiliary network attack: use an unanonymized public network with overlapping membership [Narayanan, Oakland 09]

It is becoming routine for medical studies to include a genetic component. Genome-wide association studies (GWAS) aim to identify the correlation between diseases, e.g., diabetes, and the patients' DNA, by comparing people with and without the disease. GWAS papers usually include detailed correlation statistics. Our attack: uncover the identities of the patients in a GWAS; for studies of up to moderate size, determine whether a specific person participated in a particular study, for a significant fraction of people, within 10 seconds, with high confidence! Example target: "A genome-wide

association study identifies novel risk loci for type 2 diabetes," Nature 445, 881-885 (22 February 2007).

Example of the published correlation statistics: SNPs 2 and 3 are linked, and so are SNPs 4 and 5; publish the linkage disequilibrium (LD) between these SNP pairs. [Figure: a strand of human DNA with SNPs 1-5 marked, and their associations with diabetes.]

SNPs 1, 3, and 4 are associated with diabetes; publish the p-values of these SNP-disease pairs.

Privacy attacks can use the SNP-disease associations. Idea [Homer et al., PLoS Genet. 08; Jacobs et al., Nature 09]:
(1) Obtain aggregate SNP info of the study's patients from the published p-values
(2) Obtain a sample DNA of the target individual (background knowledge)
(3) Obtain the aggregate SNP info of a reference population
Then compare (1), (2), and (3).

[Figure: per-SNP frequencies across SNPs 1-5 for the three quantities being compared: the aggregate DNA of patients in the study, the DNA of the target individual, and the aggregate DNA of a reference population.]
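A schematic Python version of the comparison, under invented allele frequencies: the test statistic simply asks whether the target's DNA is systematically closer to the study aggregate than to the reference aggregate, which is the essence of the Homer et al. statistic (their actual test is more refined).

# Schematic Homer-style membership attack: for each SNP, compare the
# target's allele value y to the study aggregate (mu_study) and to a
# reference-population aggregate (mu_ref). If y is systematically
# closer to the study aggregate, the target was probably IN the study.

# Hypothetical per-SNP allele frequencies (SNPs 1..5).
mu_study = [0.10, 0.70, 0.60, 0.80, 0.40]   # aggregate of study patients
mu_ref   = [0.20, 0.50, 0.30, 0.50, 0.30]   # aggregate of reference pop.
y        = [0.0,  1.0,  1.0,  1.0,  0.0]    # target's DNA sample

# Positive contributions mean y is closer to the study than to the reference.
T = sum(abs(yj - r) - abs(yj - s)
        for yj, s, r in zip(y, mu_study, mu_ref))
print(f"T = {T:.2f}:", "likely in study" if T > 0 else "likely not in study")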

Privacy attacks can use both SNP-disease and SNP-SNP associations. Idea [Wang et al., CCS 09]:
- Model the patients' SNPs as a matrix of unknowns
- Obtain column sums from the published p-values
- Obtain pair-wise column dot-products from the published LDs
- Solve for the matrix using integer programming; each SNP can only be 0 or 1 (with a dominance model)

Patient  SNP1  SNP2  SNP3  SNP4  SNP5
1        x11   x12   x13   x14   x15
2        x21   x22   x23   x24   x25
3        x31   x32   x33   x34   x35

Example constraints: x11 + x21 + x31 = 2 (a column sum) and x13*x14 + x23*x24 + x33*x34 = 1 (a column dot-product).
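A toy version of the reconstruction step, with brute-force search standing in for the integer programming of the real attack; the constraint values below are invented.

# Toy version of the [Wang et al., CCS 09] reconstruction: recover a
# 0/1 patient-by-SNP matrix from published column sums (from p-values)
# and column dot-products (from LDs). Brute force stands in for the
# integer programming used in the real attack.

from itertools import product

N_PATIENTS, N_SNPS = 3, 5
col_sums = [2, 1, 2, 1, 3]        # hypothetical, derived from p-values
dot = {(2, 3): 1, (3, 4): 1}      # hypothetical, derived from LDs (0-indexed)

def consistent(M):
    if any(sum(M[i][j] for i in range(N_PATIENTS)) != col_sums[j]
           for j in range(N_SNPS)):
        return False
    return all(sum(M[i][a] * M[i][b] for i in range(N_PATIENTS)) == v
               for (a, b), v in dot.items())

solutions = [M for M in product(product((0, 1), repeat=N_SNPS),
                                repeat=N_PATIENTS) if consistent(M)]
print(f"{len(solutions)} consistent matrices")
if len(solutions) == 1:
    print("Unique solution: every patient's SNPs are revealed!")

With only a handful of published statistics, many matrices remain consistent; each additional p-value or LD constraint shrinks the solution set, until only the patients' true SNPs survive.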

A successful attack reveals the DNA of all patients!

What are the goals of the work on differential privacy and its antecedents? Publish a distorted version of the data set or analysis result so that:
- Privacy: the privacy of all individuals is adequately protected;
- Utility: the published information is useful for its intended purpose.
Paradox: as privacy protection goes up, utility goes down.

Issues:
- Privacy principle: what is adequate privacy protection?
- Distortion approach: how can we achieve the privacy principle, while maximizing the utility of the data?

Different applications may have different privacy protection needs.
- Membership disclosure: the attacker cannot tell that a given person is/was in the data set (e.g., a set of AIDS patient records, or the summary data from a data set like dbGaP).

  Examples: δ-presence [Nergiz et al., 2007]; differential privacy [Dwork, 2007].
- Sensitive attribute disclosure: the attacker cannot tell that a given person has a certain sensitive attribute. Examples: l-diversity [Machanavajjhala et al., 2006]; t-closeness [Li et al., 2007].
- Identity disclosure: the attacker cannot tell which record corresponds to a given person. Example: k-anonymity [Sweeney, 2002].

Part 1B: WHAT HAVE RESEARCHERS ALREADY TRIED?

Privacy principle 1: k-anonymity. Your quasi-identifier values are indistinguishable from those of at least k-1 other people. [Sweeney, Intl. J. on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]

[Figure: a 2-anonymous generalization with 4 QI groups, built by generalizing the QI attributes that also appear in a voter registration list; the table also carries a sensitive attribute.]
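Operationally, the k-anonymity condition is easy to check; a minimal sketch over a hypothetical table:

# k-anonymity check: every combination of quasi-identifier values must
# appear in at least k records. Table and attribute names are hypothetical.

from collections import Counter

def is_k_anonymous(table, qi_attrs, k):
    counts = Counter(tuple(row[a] for a in qi_attrs) for row in table)
    return all(c >= k for c in counts.values())

table = [
    {"zip": "476**", "age": "2*",   "disease": "Heart Disease"},
    {"zip": "476**", "age": "2*",   "disease": "Heart Disease"},
    {"zip": "4790*", "age": ">=40", "disease": "Flu"},
    {"zip": "4790*", "age": ">=40", "disease": "Cancer"},
]
print(is_k_anonymous(table, ("zip", "age"), k=2))  # True
print(is_k_anonymous(table, ("zip", "age"), k=3))  # False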

The biggest advantage of k-anonymity is that people can understand it, and often it can be computed fast. But in general, it is easy to attack. "k-anonymity ... or how not to define privacy" [Shmatikov]:
- Does not say anything about the computations to be done on the data (utility).
- Assumes that the attacker will be able to join only on quasi-identifiers.
- Intuitive reasoning: k-anonymity prevents the attacker from telling which record corresponds to which person; therefore, the attacker cannot tell that a certain person has a particular value of a sensitive attribute. This reasoning is fallacious!

k-anonymity does not provide privacy if the sensitive values in an equivalence class lack diversity, or the attacker has certain background knowledge. Example, starting from a voter registration list:

Homogeneity attack:

Bob (from the voter list): Zipcode 47678, Age 27.

A 3-anonymous patient table:
Zipcode  Age   Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
4790*    >=40  Flu
4790*    >=40  Heart Disease
4790*    >=40  Cancer
476**    3*    Heart Disease
476**    3*    Cancer
476**    3*    Cancer

Bob falls in the first QI group, which has no diversity: every record says Heart Disease.

Background knowledge attack: Carl (from the voter list): Zipcode 47673, Age 36. Carl falls in the last QI group; an attacker who knows Carl is unlikely to have heart disease can conclude he almost certainly has cancer.

Updates can also destroy k-anonymity. What is Joe's disease? Wait for his birthday: a voter registration list plus dates of birth (not shown) lets us compare the releases before and after.

Principle 2: l-diversity [Machanavajjhala et al., ICDE, 2006]. Each QI group should have at least l well-represented sensitive values. Maybe each QI group must have l different sensitive values?

A 2-diverse table:
Age      Sex  Zipcode         Disease
[1,5]    M    [10001,15000]   gastric ulcer
[1,5]    M    [10001,15000]   dyspepsia
[6,10]   M    [15001,20000]   pneumonia
[6,10]   M    [15001,20000]   bronchitis
[11,20]  F    [20001,25000]   flu
[11,20]  F    [20001,25000]   pneumonia
[21,60]  F    [30001,60000]   gastritis
[21,60]  F    [30001,60000]   gastritis
[21,60]  F    [30001,60000]   flu
[21,60]  F    [30001,60000]   flu

We can attack this probabilistically: if we know Joe's QI group, what is the probability he has HIV?

[Figure: a QI group with 100 tuples, 98 of them HIV, plus one pneumonia and one bronchitis. Joe almost certainly has HIV.]

The conclusion researchers drew: the most frequent sensitive value in a QI group cannot be too frequent. Even then, we can still attack using background knowledge. Suppose Joe has HIV, and Sally knows Joe does not have pneumonia. Then Sally can guess that Joe has HIV.

[Figure: a QI group with 100 tuples: 50 HIV, 49 pneumonia, 1 bronchitis. Ruling out pneumonia makes HIV nearly certain.]

l-diversity variants have been proposed to address these weaknesses:
- Probabilistic l-diversity: the frequency of the most frequent value in an equivalence class is bounded by 1/l.
- Entropy l-diversity: the entropy of the distribution of sensitive values in each equivalence class is at least log(l).
- Recursive (c,l)-diversity: the most frequent value does not appear too frequently: r1 < c(rl + rl+1 + ... + rm), where ri is the frequency of the i-th most frequent value.
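Sketches of the first two checks, over equivalence classes given as plain lists of sensitive values; the class below is the skewed 98%-HIV group from the probabilistic attack.

# Checks for two l-diversity variants.

import math
from collections import Counter

def probabilistic_l_diverse(eq_class, l):
    # Most frequent sensitive value has frequency at most 1/l.
    counts = Counter(eq_class)
    return max(counts.values()) / len(eq_class) <= 1.0 / l

def entropy_l_diverse(eq_class, l):
    # Entropy of the sensitive-value distribution is at least log(l).
    counts = Counter(eq_class)
    n = len(eq_class)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

skewed = ["HIV"] * 98 + ["pneumonia", "bronchitis"]
print(probabilistic_l_diverse(skewed, 2))  # False: HIV has frequency 0.98
print(entropy_l_diverse(skewed, 2))        # False: entropy ~0.11 < ln 2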

When you see patches upon patches upon patches, it is a sign that a completely different approach is needed, like Jim Gray's theory of transactions versus how concurrency was previously handled.

l-diversity can be overkill or underkill. Consider original data in which 99% of the individuals have cancer:
- Anonymization A: every quasi-identifier group is 99% cancer. The groups are not diverse, yet the anonymized database does not leak much new info, because the attacker already knew that 99% have cancer.
- Anonymization B: a quasi-identifier group that is 50% cancer is diverse, yet it leaks a ton of new information about its members.
Diversity does not inherently benefit privacy.

Principle 3: t-closeness [Li et al., ICDE 07]. The distribution of sensitive attributes within each quasi-identifier group should be close to their distribution in the entire original DB. For example:

Race         Zipcode  Disease
Caucas       787XX    Flu
Caucas       787XX    Shingles
Caucas       787XX    Acne
Caucas       787XX    Flu
Caucas       787XX    Acne
Caucas       787XX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Shingles
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Flu

Then we can bound the knowledge that the attacker gains by seeing a particular anonymization.

[Figure: the attacker's belief evolves from B0 (based only on external knowledge), to B1 (after learning the overall distribution of sensitive values), to B2 (after learning the distribution of sensitive values in the target's QI group, from the released table).]

Age  Zip code  Gender  Disease
2*   479**     Male    Flu
2*   479**     Male    Heart Disease
2*   479**     Male    Cancer
...  ...       ...     ...

t-closeness is only applicable when we can define the distance between values, e.g., using a hierarchy of diagnoses.
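As a simplified illustration of the distance requirement, the sketch below uses plain total variation distance between each QI group's sensitive-value distribution and the whole table's; the actual t-closeness definition uses Earth Mover's Distance, which also accounts for semantic distance between values. The groups are the two QI groups from the table above.

# Simplified t-closeness check: compare each QI group's distribution of
# sensitive values to the whole table's. Total variation distance stands
# in for the Earth Mover's Distance of the real definition.

from collections import Counter

def distribution(values):
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

def is_t_close(groups, t):
    overall = distribution([v for g in groups for v in g])
    return all(total_variation(distribution(g), overall) <= t
               for g in groups)

groups = [["flu", "shingles", "acne", "flu", "acne", "flu"],
          ["flu", "flu", "acne", "shingles", "acne", "flu"]]
print(is_t_close(groups, t=0.1))  # True: both groups mirror the overall mix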

How anonymous is this 4-anonymous, 3-diverse, and perfectly-t-close data?

Race         Zipcode  HIV   Disease
Asian/AfrAm  787XX    HIV-  Acne
Asian/AfrAm  787XX    HIV-  Acne
Asian/AfrAm  787XX    HIV-  Flu
Asian/AfrAm  787XX    HIV+  Shingles
Caucasian    787XX    HIV+  Flu
Caucasian    787XX    HIV-  Acne
Caucasian    787XX    HIV-  Shingles
Caucasian    787XX    HIV-  Acne

That depends on the attacker's background knowledge. "My coworker Bob's shingles got so bad that he is in the hospital. He looks Asian to me..." This is against the rules, because flu is not a quasi-identifier. But in the real world, almost anything could be personally identifying (as we saw with Netflix).

There are probably 100 other related proposed privacy principles: k-gather, (a,k)-anonymity, personalized anonymity, positive disclosure-recursive (c,l)-diversity, non-positive-disclosure (c1,c2,l)-diversity, m-invariance, (c,t)-isolation, ... And for other data models, e.g., graphs: k-degree anonymity, k-neighborhood anonymity, k-sized grouping, (k,l) grouping, ... And they suffer from related problems. [Shmatikov] Trying to achieve privacy by syntactic transformation of the data (scrubbing of PII, k-anonymity, l-diversity) is fatally flawed:
- Insecure against attackers with arbitrary background info

- Do not compose (anonymize twice, reveal data)
- No meaningful notion of privacy
- No meaningful notion of utility
(Does he go too far?)

And there is an impossibility result that applies to all of them [Dwork, Naor 2006]: for any reasonable definition of privacy breach and sanitization, with high probability some adversary can breach some sanitized DB. Example:
- Private fact: my exact height
- Background knowledge: I'm 5 inches taller than the average American woman
- San(DB) allows computing the average height of US women
- This breaks my privacy even if my record is not in the database!

How to do it right (according to Shmatikov): Privacy is not a property of the data; syntactic definitions are doomed to fail! Privacy is a property of the analysis carried out on the data. (Marianne says: don't forget the identity transformation.) The definition of privacy must be robust in the presence of arbitrary background information.

Principle 4: differential privacy. An analysis result should not change much, whether or not any single individual's record is in the DB; C should be no worse off for having her record in the analysis.

Differential privacy definition [Dwork 2006]: A query-answering method A is ε-differentially private if for all data sets D1 and D2 differing by at most one element, and all (one-time) queries Q and possible answers R,

    Pr[A(Q, D1) = R] <= e^ε · Pr[A(Q, D2) = R].

Since e^(-ε) ≈ 1 - ε and e^ε ≈ 1 + ε when ε <= 0.1, this bounds the ratio Pr[A(Q, D1) = R] / Pr[A(Q, D2) = R] between roughly 1 - ε and 1 + ε.

To attain DP, add noise to the analysis result: the user asks q and receives q(DB) + noise(q). Intuitively, the more sensitive q is to DB's content, the more noise you need. Sensitivity is a worst-case measure (global sensitivity), independent of DB's current content. Background knowledge could be everything but this one record, and DP will still protect it.

Examples:
- Sensitivity of "how many people aged 21 have diabetes" is 1.
- Sensitivity of the sum, avg, max, or min medical bill is very high.

Usually, set noise(q) ~ Lap(λ): take a random sample from the Laplace distribution with parameter λ = sensitivity(q)/ε, which has mean 0, variance 2λ², and Pr(x) ∝ e^(-|x|/λ) = e^(-ε·|x|/sensitivity(q)).

[Figure: the Laplace density, a symmetric double-exponential peak centered at 0.]
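A minimal sketch of the Laplace mechanism for a counting query (sensitivity 1), with hypothetical data; numpy's Laplace sampler is used for convenience.

# Laplace mechanism sketch: answer a count query with noise drawn from
# Lap(b), where b = sensitivity / epsilon. A counting query has
# sensitivity 1: adding or removing one person changes it by at most 1.

import numpy as np

def dp_count(records, predicate, epsilon, sensitivity=1.0):
    true_answer = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise

# Hypothetical data: "how many people aged 21 have diabetes?"
people = [{"age": 21, "diabetes": True},
          {"age": 21, "diabetes": False},
          {"age": 35, "diabetes": True}]
print(dp_count(people, lambda p: p["age"] == 21 and p["diabetes"], epsilon=0.1))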

Microsoft's Windows Live example report [Lee, 200X], from the Microsoft Web Analytics Group's feedback: the rollup total mismatch looked like dirty data.

Per-gender report:
Country        Gender  Visitors
United States  Female  128
United States  Male    75
United States  Total   203
Canada         Female  15
Canada         Male    15
Canada         Total   30

Rollup report:
Country        Visitors
United States  202
Canada         31

Because each count is noised independently, the rollup (202, 31) disagrees with the per-gender totals (203, 30).

It is easy to reconcile the totals and reduce noise [Hay 2010, Ding 2011]. Results of a differentially private analysis can be freely shared or published, and the guarantee still holds (but epsilon should have been set at a level intended for this).
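One simple way to reconcile, in the spirit of the constrained-inference work cited above (not their exact algorithm): least-squares adjust the noisy parts and the noisy total so the parts sum to the total. Being post-processing of already-released answers, it consumes no additional privacy budget.

# Post-processing consistency: given k noisy parts and an independently
# noised total with equal noise variance, the least-squares estimates are
# total* = (k*total + sum(parts)) / (k+1), parts_i* = parts_i + gap, and
# the results are consistent (parts sum to the total) with lower variance.

def reconcile(noisy_total, noisy_parts):
    k = len(noisy_parts)
    gap = (noisy_total - sum(noisy_parts)) / (k + 1)
    total = noisy_total - gap
    parts = [z + gap for z in noisy_parts]
    return total, parts

# The mismatched Windows Live-style numbers from the example above:
total, parts = reconcile(202, [128, 75])   # US: rollup 202 vs 128 + 75 = 203
print(round(total, 2), [round(p, 2) for p in parts])
# -> 202.33 [127.67, 74.67]; now the parts sum to the total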
