HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Protecting Privacy with MATH (Collab with the Census)

minutephysics · Youtube · 4 HN points · 3 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention minutephysics's video "Protecting Privacy with MATH (Collab with the Census)".
Youtube Summary
This video was made in collaboration with the US Census Bureau and fact-checked by Census Bureau scientists. Any opinions and errors are my own. For more information, visit https://census.gov/about/policies/privacy/statistical_safeguards.html or search "differential privacy" at http://census.gov.

REFERENCES
Differential Privacy in the Wild: http://www.vldb.org/pvldb/vol9/p1611-machanavajjhala.pdf

Harvard University Privacy Tools Project: https://privacytools.seas.harvard.edu/differential-privacy

Simons Institute Workshop Video Recordings and Articles Archive: https://simons.berkeley.edu/workshops/schedule/6281

Cynthia Dwork (key inventor of Differential Privacy), giving a great intro talk about differential privacy: https://www.youtube.com/watch?v=lg-VhHlztqo

Shiva Prasad Kasiviswanathan and Adam Smith. On the 'semantics' of differential privacy: A Bayesian formulation. Journal of Privacy and Confidentiality, 6(1):1–16, 2014. https://arxiv.org/abs/0803.3946

Daniel Kifer and Ashwin Machanavajjhala. A rigorous and customizable framework for privacy. In ACM Symposium on Principles of Database Systems (PODS), 2012.

Daniel Kifer and Ashwin Machanavajjhala. Pufferfish: A framework for mathematical privacy definitions. ACM Trans. Database Syst., 39(1):3, 2014.


Support MinutePhysics on Patreon! http://www.patreon.com/minutephysics
Link to Patreon Supporters: http://www.minutephysics.com/supporters/

MinutePhysics is on twitter - @minutephysics
And facebook - http://facebook.com/minutephysics
And Google+ (does anyone use this any more?) - http://bit.ly/qzEwc6

Minute Physics provides an energetic and entertaining view of old and new problems in physics -- all in a minute!

Created by Henry Reich

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Technical explanation by minutephysics: https://www.youtube.com/watch?v=pT19VwBAqKA

The amount of privacy is configurable, and researchers who would like to tune it (in either direction) can make their case, so the public can decide how much privacy loss they're willing to accept in order to aid research.

Each culture is different and has different tolerances for privacy and amount of intrusiveness.

Your comment reminded me of this minutephysics video about the census: https://youtu.be/pT19VwBAqKA It's a great walkthrough of the "privacy doom principle."
> That would make the existence of anonymous data practically speaking impossible to have on the web

For almost every type of data that is true. Transforming or substituting data doesn't make it anonymous; the patterns in the data are still present. To produce actually anonymous data, you have to do what the GDPR instructs: corrupt the data ("rendered anonymous") severely enough that the "data subject is ... no longer identifiable". You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy" let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.
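As a rough illustration of that "carefully crafted noise" idea, here is a minimal Python sketch of the standard Laplace mechanism for a single counting query (the function name and epsilon values are invented for the example; this is not the Census Bureau's actual mechanism):

import numpy as np

def laplace_count(true_count, epsilon):
    # A counting query changes by at most 1 when one person is added or
    # removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    # epsilon-differential privacy for this one query.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> stronger privacy, less accurate answers.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_count(8192, eps), 1))

Tuning epsilon is the "control precisely" knob: smaller values buy stronger privacy at the cost of noisier published statistics.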

> 8192 bucket

While others have pointed out that this isn't actually limited to 13 bits of entropy for most people, there are at least two reasons that field is still very personally identifying. First, "x-client-data on its own" never happens. Google isn't wasting time and money implementing this feature to make an isolated database with a single column. At no point will the x-client-data value (or any other type of data they capture) ever sit in isolation. I used the IPv4 Source Address as an example because it will necessarily be present in the header of the packets that transport the x-client-data header over the internet. Suggesting that Google would ever use this value in isolation is almost insulting to Google; why would they waste their expensive developer time to create, capture, and manage data that is obviously useless?

However, let's say they did build an isolated system that only ever received 13-bit integers stripped of all other data. Surely that wouldn't be personally identifiable? If they store each value with a locally generated high-resolution timestamp, they can re-associate the data with personal accounts by correlating the timestamps with their other timestamped databases (web server access logs, GA, reCAPTCHA, etc.).
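To make the timestamp-correlation point concrete, here is a toy Python sketch (all data, names, and the tolerance are hypothetical) of joining an "isolated" table of 13-bit buckets back to an account-labelled access log using nothing but nearby timestamps:

from datetime import datetime, timedelta

# Hypothetical data, purely to illustrate the correlation step.
bucket_records = [   # (high-resolution timestamp, 13-bit bucket value)
    (datetime(2019, 9, 16, 12, 0, 0, 120000), 4711),
    (datetime(2019, 9, 16, 12, 0, 3, 450000), 802),
]
access_log = [       # (timestamp, account id) from some other timestamped store
    (datetime(2019, 9, 16, 12, 0, 0, 118000), "account_A"),
    (datetime(2019, 9, 16, 12, 0, 3, 452000), "account_B"),
]

def reassociate(buckets, log, tolerance=timedelta(milliseconds=10)):
    # Join two nominally separate datasets on nothing but timestamp proximity.
    joined = []
    for ts, bucket in buckets:
        nearest = min(log, key=lambda row: abs(row[0] - ts))
        if abs(nearest[0] - ts) <= tolerance:
            joined.append((nearest[1], bucket))
    return joined

print(reassociate(bucket_records, access_log))
# -> [('account_A', 4711), ('account_B', 802)]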

> you'd need to describe a scheme such that, given an x-client-data header, and only an x-client-data header, you could identify one (and only one) unique person to whom that header corresponds

You should first describe why Google would ever use that header and only that header. Even if they aren't currently using x-client-data as an identifier or as additional fingerprintable entropy, simply saving the data gives Google the option to use it as an identifier in the future.

[1] https://www.youtube.com/watch?v=pT19VwBAqKA https://en.wikipedia.org/wiki/Differential_privacy

joshuamorton
> You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy" let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.

Correct, and another anonymization technique (in place of differential privacy) is k-anonymity. In k-anonymity schemes, you ensure that in any given table no row corresponds to any fewer than k individuals. Why is this useful? Well, let's say you have some 10-15 bit identifier. You can take a request from a user that contains information that might, when combined, be identifying. Say: coarse-ish location (state/country), device metadata (browser version, OS version), and coarse access time (the hour and day of week). Combining all 3 (or 4 if you include the pseudonymous ID) is enough to uniquely identify at least some users. Then let's say you also track some performance statistics about the browser itself.

But any single piece of data (plus the pseudonymous ID) is not enough to identify any specific user. So if you use the pseudonymous ID as a shared foreign key, you can join across the tables and get approximate crosstabs without uniquely identifying any specific user. Essentially, if you want to ask if there are performance differences between version N and version N+1, you can check the aggregate performance vs. the aggregate count of new vs. old browser versions, and with 8K samples you're able to draw reasonable conclusions. And in general you can do this across dimensions or combinations of dimensions that might normally contain enough pieces of info to identify a single user.
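A rough Python sketch of that k-anonymity idea (the column names, threshold, and data layout are invented for illustration; this is not a description of Google's actual pipeline):

from collections import defaultdict

K = 20  # every released group must cover at least K individuals/requests

# Hypothetical rows: (pseudonymous bucket, coarse location, browser version, page-load ms)
rows = [
    (4711, "US-CA", "79.0", 312),
    (4711, "US-CA", "79.0", 298),
    # ... thousands more ...
]

def k_anonymous_crosstab(rows, k=K):
    # Group by the quasi-identifiers, then suppress any group smaller than k,
    # so no released aggregate corresponds to fewer than k rows.
    groups = defaultdict(list)
    for bucket, location, version, load_ms in rows:
        groups[(bucket, location, version)].append(load_ms)
    return {key: sum(vals) / len(vals)   # e.g. mean page-load time per group
            for key, vals in groups.items()
            if len(vals) >= k}

The released table can still answer "is version N+1 slower than version N?" in aggregate, while no published row maps back to a single user.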

This is essentially the same idea as differential privacy, although without the same mathematical precision that differential privacy can provide. (By this I don't mean that the data can be re-identified, just that differential privacy can be used to provide tighter bounds on the anonymization, such that the statistical inferences you can gather are more precise. k-anonymity is, perhaps, a less mathematically elegant tool).

Specifically, I'm describing k-anonymity using x-client-data as a Quasi-identifier in place of something like IP or MAC address. You can find those terms in the "See Also" section of the differential privacy wiki page you linked. Google is mentioned in those pages as a known user of both differential privacy and k-anonymization in other tools.

Hopefully that answers your question of why Google would want such a thing.

> simply saving the data gives Google the option to use it as an identifier in the future.

Yes, but that doesn't mean that they're currently in violation of the GDPR, which is what a number of people keep insisting. I'm not claiming that it's impossible for Google to be doing something nefarious with data (although I will say that in general I think that's an unreasonably high bar). Just that the collection of something like this isn't an indication of nefarious actions, and is in fact likely the opposite.

Sep 16, 2019 · 2 points, 0 comments · submitted by weinzierl
Sep 12, 2019 · 2 points, 0 comments · submitted by sohkamyung
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.