HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
DEF CON 26 - Greenstadt and Dr Caliskan - De-anonymizing Programmers from Source Code

DEFCONConference · Youtube · 1 HN comment
HN Theater has aggregated all Hacker News stories and comments that mention DEFCONConference's video "DEF CON 26 - Greenstadt and Dr Caliskan - De-anonymizing Programmers from Source Code".
Youtube Summary
Many hackers like to contribute code, binaries, and exploits under pseudonyms, but how anonymous are these contributions really? In this talk, we will discuss our work on programmer de-anonymization from the standpoint of machine learning. We will show how abstract syntax trees contain stylistic fingerprints and how these can be used to potentially identify programmers from code and binaries. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found in single-author GitHub repositories and the leaked Nulled.IO hacker forum.
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
I agree I have nothing to fear from the military. I would have hoped that example would show not that I am afraid of cruise missiles, but that very serious decisions at the highest level of government are being made from this information. You can bet your bottom dollar that every other high level of management, government, religious group, etc. has the same mindset towards the "privacy of metadata". I'm sure you'll plead ignorance of direct evidence of offenses happening right now, as opposed to historical ones, as reason enough to ignore practical truth.

Being de-anommed from logs is ez pz. https://www.youtube.com/watch?v=Fj6wfGcKFlI https://www.youtube.com/watch?v=7EaCnhC0mc0 https://www.youtube.com/watch?v=MQL1jrm7Vzk

There was another talk by a German lady where they de-anonymized everybody in their data set, including a German judge who was looking at porn during his chambers sessions, almost as if he was addicted. They brought this information to him and it was kept under wraps.

Nobody is safe from this stuff. It only takes 3-4 pieces of unique information (no matter how tiny) to statistically add up to a probable identity with a high degree of confidence.
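As a rough illustration of the point above, here is a minimal sketch of how each extra attribute shrinks the set of people a record could belong to. The records and attribute names are entirely hypothetical and are not taken from any real data set or from the talks linked above.

```python
# Hypothetical records: (timezone, os, editor_theme, keyboard_layout).
# Each tuple stands for one anonymous user in a made-up data set.
records = [
    ("UTC+1", "linux", "dark", "de"),
    ("UTC+1", "linux", "dark", "us"),
    ("UTC+1", "windows", "dark", "de"),
    ("UTC+1", "windows", "light", "us"),
    ("UTC-5", "linux", "dark", "us"),
    ("UTC-5", "windows", "light", "us"),
    ("UTC-5", "windows", "dark", "us"),
    ("UTC-5", "macos", "light", "us"),
]


def anonymity_set_sizes(records, target):
    """Count the records matching the target on its first k attributes, for k = 1..n."""
    sizes = []
    for k in range(1, len(target) + 1):
        sizes.append(sum(1 for r in records if r[:k] == target[:k]))
    return sizes


target = records[0]
sizes = anonymity_set_sizes(records, target)
print(sizes)  # one entry per number of attributes known
```

In this toy data set the candidate pool drops to a single record once all four attributes are known. Real populations are far larger, so how many attributes are needed depends on how unusual each value is.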

nindalf
I’m not saying de-anonymisation is impossible. Obviously it’s possible.

I’m asking you to

1. Tell us what VSCode collects that’s so sensitive

2. Show us how you’d de-anonymise a person with that information (but I’m willing to concede that this is possible)

3. Demonstrate harm from knowing that a specific person is tied to a set of VSCode logs.

Don’t hand wave #3 away. Don’t assume that it’s self-evident that the harm is high. You’re talking in such vague terms that it’s impossible to respond. So be clear and concrete. What specific harm comes from the specific data that VSCode collects, assuming that it’s de-anonymised? Don’t talk about a random German judge. Talk about me, a developer using VSCode.

barrysteve
I rewrote this comment a couple times before posting, as getting through to your perspective rules out what I consider logical common sense. You're still asking the wrong questions because you don't get the process.

VSCode could collect anything at all, but let's be hypothetical yet specific, so you understand the process. I'm going to use these two pages as reference: [0] a RoboLeary blog post that describes what VSCode collects (more than just performance, but we'll leave that for now), and [1] a GitHub gist that shows an example log file of just the telemetry data you can voluntarily opt out of.

[0] https://www.roboleary.net/tools/2022/04/20/vscode-telemetry....

[1] https://gist.github.com/robole/97f6e1c4dc888ae0a49a7683e5494...

In the GitHub example log file there are 9k lines, with thousands of XML-style entries, each with properties. Those properties can hold any value at all for the purposes of de-anonymization. It doesn't matter what they are.

Your database back at Microsoft HQ (or wherever this stuff goes) will have a huge ocean of telemetry logs, so how do we find out which one is yours? See below.

Among those thousands of entries, some will differ statistically from the average of all log reports. It could be the time it takes for a menu to load in milliseconds, it could be the filepath of an extension, it could be a specific hardware report, it could be a collection of exceptions that keep being thrown. It doesn't matter what it is; it matters that it differs from everyone else.

Some of those reports that differ from the herd (statistical average) can be collected together to make a fingerprint of an individual machine. Maybe your machine always takes slightly longer than average to open a menu, or has a different extension set, whatever. The tiniest difference is still a difference.

This fingerprint of your machine can be correlated statistically, so that it is likely the same machine showing up again and again across telemetry logs, across time. If your fingerprint keeps showing up across multiple logs, across multiple days, we know the times the computer is in use. The fingerprint does not need to be perfectly separated from every other fingerprint in the bell curve; we can solve that later on. As long as it's narrow enough (only so many conflicting fingerprint reports), we can narrow it down further with added metadata.
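The process described in the last few paragraphs can be sketched roughly as follows. This is a toy illustration under stated assumptions, not how any vendor actually processes telemetry: the feature names (menu_open_ms, extension_count, exceptions_per_hour), the z-score threshold and the data are all hypothetical, and none of them are real VSCode telemetry fields.

```python
from statistics import mean, pstdev


def zscores(population, machine):
    """Per-feature z-score of one machine's report relative to the whole population."""
    scores = {}
    for feature, value in machine.items():
        values = [report[feature] for report in population]
        mu, sigma = mean(values), pstdev(values)
        scores[feature] = 0.0 if sigma == 0 else (value - mu) / sigma
    return scores


def fingerprint(population, machine, threshold=1.5):
    """Set of (feature, direction) pairs where the machine differs strongly from the herd."""
    return frozenset(
        (feature, "high" if z > 0 else "low")
        for feature, z in zscores(population, machine).items()
        if abs(z) >= threshold
    )


def similarity(a, b):
    """Jaccard similarity between two fingerprints (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(known, population):
    """Return (score, index) of the report whose fingerprint best matches a known one."""
    return max(
        (similarity(known, fingerprint(population, report)), index)
        for index, report in enumerate(population)
    )


def report(menu, ext, exc):
    return {"menu_open_ms": menu, "extension_count": ext, "exceptions_per_hour": exc}


# Hypothetical day-1 logs; the last report is the machine we want to follow.
day1 = [report(120, 10, 1), report(118, 12, 0), report(122, 11, 1),
        report(119, 9, 2), report(121, 10, 1), report(180, 40, 1)]
# Hypothetical day-2 logs: same machines, shuffled, with slightly different readings.
day2 = [report(121, 10, 1), report(178, 41, 0), report(117, 12, 1),
        report(123, 11, 1), report(120, 9, 1), report(119, 10, 2)]

known = fingerprint(day1, day1[-1])
score, index = best_match(known, day2)
print(sorted(known), index, round(score, 2))
```

The match on day 2 is not perfect, since the readings drift a little between days, but the slow menu and the unusual extension count still single out the same machine. That is the "narrow enough" case the comment describes; linking the fingerprint to a person would take the additional metadata discussed next.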

How do we know who is running VSCode and doing this work on this fingerprinted machine? From all the other 'harmless' metadata like IP address, VS licenses, Microsoft accounts, linked email addresses, and most importantly all the data harvesting Microsoft will not let you turn off.

So this isn't a problem, right? Who cares if Microsoft can trivially (and likely automatically) de-anonymize me? No junior engineers have access to the data, only Edward Snowden et al. can see it, so who cares?

Management can easily get access to the database of fingerprints and build analysis tools and programs, and neither the end user nor the junior engineer can see or understand what decisions are being made from it. So even if you and I (assuming we're both grunts in software) implement a perfect telemetry system that collects "no useful category of information", and the database is kept from public eyes, we can both still lose out from what management does with that information, as there is more to be revealed in it than we want.

So if the military is happy to kill people based on metadata, our advertising data is being sold left, right and centre, and insurance and health data is going haywire, what assurance does any computer user on a Windows machine, or any VSCode user, have that their data is not going to be used to make decisions arbitrarily (from the end user's perspective) against their will?

If I don't want VSCode to know that I'm working on a project at 3am, that's my business. Revealing what Harm that causes, and why I might want to keep that private, reveals a clue about what is being kept private, and thus it cannot be shared. Steve Jobs would have completely removed the collection of metadata from secret product development teams and no one would have batted an eye. When a random joe does it, there are all these questions about what Harm it does, as if nobody had a reason to keep a secret after Steve died.

If you want transparency from the rank-and-file, get it from what decisions management is making with metadata too and tell me what Harm Does It Cause Microsoft Management To Reveal Their Truth. Heaven forbid I suggest reality is a two-way street.

nindalf
Ok so what I’m getting from your comment is that since this data could be linked to you … bad things could happen. Unclear what those bad things are, but it could be anything, even a cruise missile. Unclear why someone would do this thing, but sure.

I get that this seems common sense and logical to you, but it doesn’t to me. Just because it’s possible doesn’t mean it’s going to happen. And arguably, 60%+ of all developers don’t seem convinced either.

If you’re concerned about this and want industry level change, you need to work on this pitch. It’s not very convincing.

crazyjustin
Thank you for taking the time to type this out. I've been uncomfortable about telemetry data for a while, but this makes it clear to me that it can be used to do a lot more than just drive UI changes. Google has been fingerprinting in this way for decades. Of course Microsoft can do the same.

I've been thinking about trying VSCode. I'm the only dev I know at my org that doesn't use it and I guess it's going to stay that way.

I really don't understand the mindset of some folks. In this dynamic Microsoft has all the power, and therefore the burden of transparency and proof of good will should be on them. Why would you give away information about yourself freely to an entity with no oversight? There is no reason to trust a corporation with your personal data.

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.