HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
PLSE Seminar Series: Emery Berger, "Saving the World from Spreadsheets"

UW PLSE · Youtube · 167 HN points · 2 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention UW PLSE's video "PLSE Seminar Series: Emery Berger, "Saving the World from Spreadsheets"".
Youtube Summary
Abstract
Spreadsheets are one of the most widely used programming environments, with roughly 1 billion users of Microsoft Excel alone. Unfortunately, spreadsheets make it all too easy to make errors that go unnoticed. These errors can have catastrophic consequences because spreadsheets are widely deployed in domains like finance and government. For instance, the infamous “London Whale” incident in 2012 cost JP Morgan approximately $2 billion; this was due in part to a spreadsheet programming error. A Harvard economic analysis was used to support austerity measures imposed on Greece after the 2008 worldwide financial crisis. These austerity measures led to widespread protests and economic dislocation. The analysis was based on a single large spreadsheet, which was later found to contain numerous errors; when fixed, its conclusions were reversed.

Our research aims to dramatically reduce the risk of spreadsheet errors by developing algorithms that can effectively and accurately find them. This is challenging because traditional analyses for conventional programming languages do not apply in the spreadsheet domain (for example, spreadsheets don’t segfault). In this talk, I will present two systems we have developed that effectively find errors in spreadsheets: CheckCell uses a combination of program analysis and statistical analysis to automatically find likely data errors, while ExceLint combines program analysis with an information-theoretic approach to find likely formula errors. We implemented both of these as plugins for Microsoft Excel; both are principled, fast, and accurate (e.g., ExceLint’s median precision and recall are 1).

This work is joint with Dan Barowy (now a professor at Williams College) and Ben Zorn (Microsoft Research).

Bio
Emery Berger is a Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger’s research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He and his collaborators have created a number of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Crédit Suisse, Reuters, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, the Distinguished Artifact Award for PLDI 2014, the Most Influential Paper Award at OOPSLA 2012, the Most Influential Paper Award at PLDI 2016, three CACM Research Highlights, a Google Research Award, a Microsoft SEIF Award, and Best Paper Awards at FAST, OOPSLA, and SOSP; he was named an ACM Senior Member in 2010. Professor Berger is currently serving as an elected member of the SIGPLAN Executive Committee; he served for a decade (2007-2017) as Associate Editor of the ACM Transactions on Programming Languages and Systems, and was Program Chair for PLDI 2016.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Obligatory link to excelint/checkcell threads.

https://news.ycombinator.com/item?id=20491091

https://news.ycombinator.com/item?id=22431500

https://arxiv.org/abs/1901.11100

One of the interesting takeaways from Emery Berger's talk is that he was able to replicate, to two decimal places, a published spreadsheet transcription error rate: 5.26% of cells have errors.

I highly recommend the talk. https://www.youtube.com/watch?v=GyWKxFxyyrQ

Dec 23, 2019 · divan on Ask HN: Best Talks of 2019?
"Performance matters" by Emery Berger on StrangeLoop'19 is a great talk - https://www.youtube.com/watch?v=r-TLSBdHe1A

The rest of his talks are also fascinating:

- https://www.youtube.com/watch?v=XRAP3lBivYM

- https://www.youtube.com/watch?v=GyWKxFxyyrQ

alexitosrv
Thank you! That was a fantastic use of 42:14! Loved the style and what great research by him and his team.
omegabravo
Performance Matters was a sensational presentation. I don't think the title does it justice.
screye
Love Emery.

Took 2 courses under him @UMass. An extremely demanding professor, but each of his classes is an absolute delight.

Oct 31, 2019 · 24 points, 2 comments · submitted by bingden
abakker
Nice Abom79 t-shirt.
rspoerri
i love the views into the audience :-)

more on topic, without having seen the whole youtube movie: https://datavizproject.com/

Jul 21, 2019 · 142 points, 25 comments · submitted by matt_d
wwarner
ExceLint: https://github.com/plasma-umass/ExceLint-addin CheckCell: https://github.com/plasma-umass/DataDebug
DonHopkins
I'd love to have something like ExceLint for Google Sheets, called Google ShizNits.
phonon
What's the difference between https://github.com/plasma-umass/ExceLint-addin and https://github.com/ExceLint/ExceLint ?
emeryberger
(ExceLint co-author and the speaker on this video here) The first one is the latest version which works in all modern Excel versions (Windows, Mac, Online). It’s a rewrite in TypeScript I did while on sabbatical this year at Microsoft Research (also it has faster and improved algorithms and more features to further improve its precision and usability). It is actively under development. Installation instructions are in the README. We are hoping to have it posted on the Microsoft store at some point. If you would like to see something like this in Excel, say so!
phonon
Cool! Does it include all the capabilities of https://github.com/plasma-umass/DataDebug or should/could both be installed?
emeryberger
Right now, ExceLint and CheckCell are not integrated. But this is something we would like to do!
vogtb
I have a habit of taking notes on these types of talks when I stumble upon them, so for anyone interested, here's a quick write-up of the video. Some of it is paraphrased.

* "[Microsoft estimates that there are 750M users of Excel (7% of the world population).]"

* Spreadsheet errors basically ruined the economy of Greece.

* "State of the art" is manually double checking your formulas. This is what the experts suggest…

* There's apparently an article out there from Forbes titled "Sorry, Your Spreadsheet Has Errors (Almost 90% Do)".

* Thomas Herndon is the guy that did manual spreadsheet verification to prove that there were errors.

* Talks about CheckCell, ExceLint.

* Input errors are a huge problem: "Roughly 1% of characters people mistype."

* "[1 out of 20 cells manually typed probably has an error.]" (Woah.)

* "[Users often add a digit or remove a digit, changing the order of magnitude]"

* 1) Manual data entry is hard to do correctly, 2) Writing formulas/code/Excel that uses that data is also hard to do correctly.

* One takeaway: Like code, if you're not testing it manually, and no one is testing it for you manually, and you're not writing tests, and the results aren't "gut-checked" or the results aren't used, why would it be correct? If a tree falls…

* "[A lot of public posted Excel sheets are filled with errors, or fudging.]" Look at the grades one that he shows around 19:36 to see what I mean.

* "The Bootstrap" - stats analysis using simulations. "[Resample samples]... random sample with replacement, repeatedly, to get distribution of output of calculation." Requires a homogenous range. Allows you to find "outliers" that drastically change the output. "What is the likelihood of observing one of the simulations under the null hypothesis, and if it's below [X] then we say it's unusual." Dude in audience at 30:29 describes it well.

* Formulas are easier to audit because they're usually named w/ column, etc. Data is hard.

* Goes through a long process of describing how they gather data, etc. Good stuff, but the short and the long of it is CheckCell is good.

* Loops back around to the global finance sheet that had a lot of errors: CheckCell worked on it.

* ExceLint - static analysis, ranks errors and their fixes. Can find off-by-one-like errors: formulas using off-by-one ranges, etc. Excel has its own error finder, but it gives a lot of false positives and false negatives. "Most errors are reference errors" - wrong row, wrong column, too short a range, too long a range, etc. "Looks for disruptions in rectangular regions." Not just outliers. Looking for irregularity, where regularity is basically low entropy. "[Capture the relationship of cells/ranges and their relationship to one another.]" Looking for relationships that minimize entropy ("[Because users aren't insane and they're putting things in a rectangular grid.]"). Looks for every rectangle (i.e. range) that, when merged with a neighbor, would remain rectangular. That is considered a potential fix. Then you can simulate the fix as if you already did it, and check the entropy on that.

* A lot of the errors and their origins have to do with basic Excel features. Some of these features were outlined as best practices in Joel's "You suck at Excel" talk, which is kinda funny. Great power, Uncle Ben, etc.

* Dropped this one: "SUM is [something like 45% of formulas]".
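
The two input-error figures in the notes above ("roughly 1% of characters" and "1 out of 20 cells") look consistent under a very simple model. This is only a back-of-the-envelope sketch: the independence of per-character errors and the five-characters-per-cell figure are assumptions made here for illustration, not numbers from the talk.

```python
def cell_error_rate(char_error_rate: float, chars_per_cell: int) -> float:
    """Probability that a cell holds at least one mistyped character,
    assuming each character is mistyped independently (illustrative assumption)."""
    return 1.0 - (1.0 - char_error_rate) ** chars_per_cell


# Assumed: 1% per-character error rate, about five characters typed per cell.
rate = cell_error_rate(0.01, 5)
print(f"{rate:.3f}")  # prints 0.049, i.e. roughly 1 cell in 20
```

Longer cells would push the rate higher under the same assumptions, so the "1 in 20" figure corresponds to fairly short entries.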

(edit: spacing out list)
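
The entropy idea in the ExceLint note above can be illustrated with a toy version. This is not ExceLint's actual algorithm (which works on rectangular regions and richer formula fingerprints); it is a simplified sketch that treats each formula cell as a label, tries copying an adjacent cell's label into it, and ranks the changes by how much they lower Shannon entropy. The labels `row_sum`, `row_sum_short`, and `col_sum` and the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]  # None marks a cell without a formula


def entropy(grid: Grid) -> float:
    """Shannon entropy (in bits) of the formula labels in the grid."""
    counts = Counter(cell for row in grid for cell in row if cell is not None)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_fixes(grid: Grid) -> List[Tuple[float, Tuple[int, int], str]]:
    """Simulate copying each adjacent label into each formula cell and
    return the changes that reduce entropy, largest reduction first."""
    base = entropy(grid)
    fixes = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is None:
                continue
            candidates = set()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]):
                    other = grid[rr][cc]
                    if other is not None and other != cell:
                        candidates.add(other)
            for candidate in sorted(candidates):
                trial = [list(line) for line in grid]
                trial[r][c] = candidate  # simulate the fix without applying it
                gain = base - entropy(trial)
                if gain > 1e-9:
                    fixes.append((gain, (r, c), candidate))
    return sorted(fixes, key=lambda fix: fix[0], reverse=True)


# A column of row totals with one formula covering too short a range,
# above a row of column totals.
sheet: Grid = [
    [None, None, None, "row_sum"],
    [None, None, None, "row_sum"],
    [None, None, None, "row_sum_short"],
    [None, None, None, "row_sum"],
    ["col_sum", "col_sum", "col_sum", "col_sum"],
]

for gain, position, label in rank_fixes(sheet):
    print(f"{position} -> {label}: entropy drops by {gain:.3f} bits")
```

On this toy sheet the top-ranked suggestion is to make cell (2, 3) match the `row_sum` cells around it, which mirrors the "simulate the fix and check the entropy" step described in the notes.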

freqshow
Thank you for posting this. I know that your taking a few moments to offer this bit of help to the community is appreciated by many more people than will ever reply to you.
Mvandenbergh
I think this stuff is great. I'm a big fan of the real world and meeting people where they are. If your plan for saving the world from spreadsheets is to convince people to use Pandas instead (and there are certainly people on HN who think that way), you're not really serious about fixing things. Developing these kinds of checking tools, which could realistically be rolled out to many Excel users is a great step forward.
wwarner
I don't know of anything similar for Jupyter notebooks or sql either. I guess testing with simulated data designed to yield expected results would be the way I test my stuff.
hjk05
Seems that bootstrapping and outlier detection is the way to go, independent of what language or “editor” (e.g. notebook) you are using.

The really cool thing here is that they parse the excel formula in order to automatically “figure out” how they can perform the bootstrapping.
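
For anyone who wants to try the bootstrapping idea outside Excel, here is a minimal sketch. It is not CheckCell's actual algorithm, just the general technique as described in the thread: resample the inputs with replacement to get a distribution of the formula's output, then ask how likely the observed output is under the null hypothesis that a given cell is "just like the others". The function names, the leave-one-out design, and the `alpha` threshold are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def bootstrap_outputs(values: Sequence[float],
                      formula: Callable[[Sequence[float]], float],
                      size: int,
                      n_resamples: int,
                      rng: random.Random) -> List[float]:
    """Apply `formula` to `n_resamples` samples drawn with replacement."""
    return [formula(rng.choices(values, k=size)) for _ in range(n_resamples)]


def unusual_inputs(values: Sequence[float],
                   formula: Callable[[Sequence[float]], float],
                   alpha: float = 0.05,
                   n_resamples: int = 2000,
                   seed: int = 0) -> List[int]:
    """Return indices of inputs with an unusually large effect on the output.

    For each cell, build the output distribution from resamples of the
    *other* cells; if the observed output sits in the tails of that
    distribution, the cell is flagged."""
    rng = random.Random(seed)
    observed = formula(values)
    flagged = []
    for i in range(len(values)):
        rest = list(values[:i]) + list(values[i + 1:])
        outputs = bootstrap_outputs(rest, formula, len(values), n_resamples, rng)
        at_or_below = sum(o <= observed for o in outputs) / n_resamples
        at_or_above = sum(o >= observed for o in outputs) / n_resamples
        p_value = 2 * min(at_or_below, at_or_above)  # two-sided
        if p_value < alpha:
            flagged.append(i)
    return flagged


column = [10, 12, 11, 9, 10, 11, 100, 12]  # 100 is a plausible typo for 10
print(unusual_inputs(column, sum))  # prints [6]
```

Because the check only needs a range of inputs and a callable, the same sketch works for `sum`, `statistics.mean`, or any other aggregate, which is the language-independent part of the idea.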

jnbiche
> I don't know of anything similar for Jupyter notebooks or sql either

Exactly. It feels like we're falling into the exact same problem with Jupyter notebooks as with spreadsheets: they become increasingly used by professionals who code (but who aren't software devs) to create bug-ridden, unmaintainable, large-scale software because they become familiar with the tool they have at hand.

analog31
Yes, definitely. I'm guilty as charged to some extent. I'm a heavy user of Jupyter, and introduced Python to my workplace, where it is now used by a handful of scientists in R&D.

I'm hesitant to blame the tool. Instead, I think it's a matter of our exuberance and interest in producing new results that causes us to get ahead of our software engineering skills and build things that get out of hand. Also, the professional developers are simply not available to help us improve things. We're on our own.

baxtr
and then people just love AirTable... especially the dev community I feel
verdverm
I tried Airtable and became frustrated with their UX. Went back to the spreadsheet, also people prefer to receive them
chopete
I tried using AirTable twice. It has a steep learning curve. The UI often gets in the way.

If they really have day-to-day users, they must come from a top-down approach (somebody up the chain selected it), be forcefully committed, or have a perfect use case.

It is certainly not for regular/most excel users.

larrydag
I'm an R advocate. I call R the "Excel Buster". I think reproducible research is very important and tools should follow. Excel has its place in fast mathematics prototyping but for reproducible research it is quite lacking.
educationdata
I actually think you are more likely to make mistakes in R than in Excel. Because in Excel you always see the results directly, and you can even easily catch an anomaly in a single cell. But in R it takes one more step to see the results, and you probably won't see results of all the rows directly.
_Wintermute
Especially when R really likes to carry on chugging along with your analysis spitting out nonsense values when it should have failed on something 50 lines ago.

    sum(1, 2, 3, 10)
    [1] 16 # great

    mean(1, 2, 3, 10)
    [1] 1 # wait, what?
mycall
Vast majority of Excel users would implode learning R.
FromHoiPolloi
Can confirm. Source: am an excel user.
DonHopkins
Check out the guy with his face planted in his keyboard at 9:05!

Is that what spreadsheet debugging looks like?

Edit: at 13:40, maybe he's just using facebook.

Edit2: I'm sorry for putting it that way, I didn't mean to shame him about his vision. I only thought it seemed like he wasn't paying much attention to the lecture when he kept switching between using his phone and laptop, while wearing a "facebook" t-shirt.

bagrow
He could have a vision impairment.
sbr464
Really? He clearly has vision trouble.
emeryberger
Please delete this. He is a grad student with a serious vision impairment.
bschne
Here are the corresponding papers:

CheckCell - https://people.cs.umass.edu/~emery/pubs/CheckCell-preprint-O...

ExceLint - https://arxiv.org/pdf/1901.11100.pdf

Jul 10, 2019 · 1 points, 0 comments · submitted by matt_d
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.