HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Using R to detect fraud at 1M transactions per second

blog.revolutionanalytics.com · 142 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention blog.revolutionanalytics.com's video "Using R to detect fraud at 1M transactions per second".
Watch on blog.revolutionanalytics.com
blog.revolutionanalytics.com Summary
In Joseph Sirosh's keynote presentation at the Data Science Summit on Monday, Wee Hyong Tok demonstrated using R in SQL Server 2016 to detect fraud in real-time credit card transactions at a rate of 1 million transactions per second. The demo (which starts at the 17:00 minute mark) used a gradient-boosted tree model to predict the probability of a credit card transaction being fraudulent, based on attributes like the charge amount and the country of origin. Then, a stored procedure in SQL Server 2016 was used to score transactions streaming into the database at a rate of 3.6 billion per...
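For a feel for the modelling side, a minimal sketch of gradient-boosted fraud scoring in R might look like the following. This is not the demo code: the package (xgboost) and the file and column names are assumptions, since the talk does not say which implementation was used.

    # Rough sketch only: train a gradient-boosted tree on labelled historical
    # transactions, then score new ones. File and column names are invented.
    library(xgboost)

    train <- read.csv("transactions_labelled.csv")
    X     <- model.matrix(~ amount + country - 1, data = train)   # one-hot encode country

    model <- xgboost(data = X, label = train$is_fraud,
                     nrounds = 100, objective = "binary:logistic", verbose = 0)

    new_tx <- read.csv("transactions_new.csv")
    X_new  <- model.matrix(~ amount + country - 1, data = new_tx)
    fraud_probability <- predict(model, X_new)   # probability each transaction is fraudulent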
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Sep 30, 2016 · 142 points, 69 comments · submitted by sndean
dhd415
The presentation is a little light on the technical details of how the demo was run. What I could get from the presentation was 1M fraud predictions/sec via R stored procedures, applied to data streaming into SQL Server 2016 and stored in in-memory columnstore tables on a 4-socket "commodity" server.
baldfat
> PROS has been using R for a while in development, but found running R within SQL Server 2016 to be 100 times (not 100%, 100x!) faster for price optimization. "This really woke us up that we can use R in a production setting ... it's truly amazing," he says.

WOW, if this is even half true we have a new era for R.

madenine
Microsoft has gone all in on R, excited to see where things go. Some cool stuff out of Ignite this week.
nerdponx
What does it mean to run R "within" SQL Server here?
vmarsy
I found this blog post a good introduction to R within SQL:

https://blogs.msdn.microsoft.com/sqlcat/2016/06/16/early-cus...

larrydag
SQL Server R Services https://msdn.microsoft.com/en-us/library/mt604845.aspx

New for SQL Server 2016

apohn
That statement is overly vague and sounds like marketing BS.

I worked on a project where we scored streaming data in R. The biggest bottleneck was getting the data into and out of the R session. We started out using disk-based I/O and ended up using rJava so our streaming system could communicate with R. In that case we did get a 100X speed up between our first iteration and the final version which used rJava to serialize the data.

So basically, the major bottleneck was not R. It was the communication with R. In the article R is installed on the same hardware as SQL Server, which should automatically give it a speedup with streaming data.

If Microsoft also has an optimized way to get data from SQL Server to R I can see how they got a 100X speedup. In certain cases using the MKL libraries can give you that as well via faster scoring, but I suspect the speedup just comes from improving the data transfer method.

buro9
> If Microsoft also has an optimized way to get data from SQL Server to R I can see how they got a 100X speedup. In certain cases using the MKL libraries can give you that as well, but I suspect the speedup just comes from improving the data transfer method.

The optimized method is that you can run R inside the database in the latest version of SQL Server.

I've actually installed Windows again, just to play with this feature (though I cannot make claims to actually putting it to good use yet).

apohn
This is what the docs say

"When you select this feature, extensions are installed in the database engine to support execution of R scripts, and a new service is created, the SQL Server Trusted Launchpad, to manage communications between the R runtime and the SQL Server instance."

So basically SQL Server is talking to the R session. The speedup is coming from R being installed locally and the "communication" which I've yet to figure out.
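From the R side, the documented way to target that in-database runtime is RevoScaleR's SQL Server compute context. A rough sketch (connection string, table and column names are placeholders, not taken from the talk):

    # Sketch using RevoScaleR, which ships with SQL Server R Services.
    library(RevoScaleR)

    conn <- "Driver=SQL Server;Server=myserver;Database=frauddb;Trusted_Connection=True"

    # Run subsequent rx* calls inside SQL Server, next to the data,
    # instead of pulling every row into the local R session.
    rxSetComputeContext(RxInSqlServer(connectionString = conn))

    transactions <- RxSqlServerData(table = "dbo.Transactions", connectionString = conn)

    # Training and scoring now execute in-database via the Launchpad service.
    model <- rxLogit(is_fraud ~ amount + country, data = transactions)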

baldfat
> I've actually installed Windows again, just to play with this feature (though I cannot make claims to actually putting it to good use yet).

I heard the Linux SQL Server is surprisingly decent.

https://blogs.microsoft.com/blog/2016/03/07/announcing-sql-server-on-linux/

apathy
PostgreSQL used to have a way to embed R too:

https://github.com/jconway/plr

I guess it still does, though I haven't used it in years. You can, of course, do the same thing with Python.

Jweb_Guru
No need to install Windows to get R in a database, you can run with PL/R on Postgres (unless you have a particular desire to run it within SQL Server, of course, but doesn't it run on Linux now?).

http://www.joeconway.com/plr/
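To give a flavour: the body of a PL/R function is plain R that can query Postgres directly through the SPI interface. A toy sketch (only the R body is shown, the CREATE FUNCTION ... LANGUAGE plr wrapper is omitted, and the table and scoring rule are invented):

    # Pull recent rows via SPI and score them; pg.spi.exec() is provided by
    # PL/R inside the Postgres backend. The "model" here is a placeholder rule.
    tx <- pg.spi.exec(
      "SELECT id, amount, country FROM transactions ORDER BY ts DESC LIMIT 1000")
    tx$score <- ifelse(tx$amount > 1000, 0.9, 0.1)
    tx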

blahi
That's a lot less than what Microsoft is offering.
Jweb_Guru
That may be, but it's not clear to me from the article. What's the feature I'm missing that makes the SQL Server offering so much more compelling?
blahi
Out of core algorithms, attachment of data straight into R for a couple of very important things.
Jweb_Guru
> Out of core algorithms.

Does this just mean third-party modules? If so, doesn't http://www.joeconway.com/plr/doc/plr-module-funcs.html suffice?

> attachment of data straight into R for a couple of very important things.

Isn't this doable through http://www.joeconway.com/plr/doc/plr-global-data.html?

Apologies if these are dumb questions, as I'm not very familiar with R.

blahi
No. They have their own highly optimized algorithms. They also have their own distributed data structures.

I am not sure about that 2nd link. Seems like it's just UDFs written in R.

Jweb_Guru
Ah, okay. It wasn't clear to me that Microsoft had written its own algorithms. That does seem very useful, then (though those algorithms and data structures could presumably be used outside of the SQL Server environment, I presume Microsoft is using them to encourage people to use SQL Server rather than another solution).

I believe the second link is referring to being able to initialize and share data between functions within the R runtime (rather than having to transfer back and forth between Postgres and the runtime). Is that not what you were referring to?

blahi
>being able to initialize and share data between functions within the R runtime

That's right.

>rather than having to transfer back and forth between Postgres and the runtime

But that's not. There's a difference between being able to use data outside of the database (from the R runtime) in my UDFs (executed in Postgres) on one hand and being able to attach 2TBs of data straight from an SQL table in the R runtime on the other. I don't even care that much about the algorithms. Moving the data is the bottleneck most of the time. And Microsoft is actually late to the party (but better than never). Oracle, Netezza, Vertica and Hana have been able to do it for quite a while now.

You are spot on about being able to use the algorithms outside of SQL Server. You can use them on Teradata or Hadoop or rent your own VMs on Azure to use them or you can buy standalone licenses too.

Jweb_Guru
So by "attach 2 TB of data straight from an SQL table into the R runtime" you mean that Microsoft taught R to interact directly with SQL Server's storage engine? If so, I agree, data movement is almost always the bottleneck for large data sets, and I don't think PL/R can do that (though I am not sure if that's a necessity due to the way Postgres's language plugins work, or something that could be done with enough effort).

However, if all you mean is that SQL Server can transfer the data a tuple at a time to R on the same server (in memory), I believe that PL/R and Postgres interact like that already (again, maybe I'm wrong). And I don't know how much extra overhead that provides over talking directly to the storage engine, anyway.

blahi
>Microsoft taught R to interact directly with SQL Server's storage engine

They have created two new services for SQL Server 2016, BxlServer and SQL Satellite, which facilitate the communication and data exchange. They obviously have additional speedups for the proprietary runtime (that was one of the main selling points of the company they acquired - fast data access to several RDBMS), but it's plenty fast for regular R too.

https://msdn.microsoft.com/en-us/library/mt709082.aspx

contingencies
"Using <technology of the day> to <contribute to some exciting high level business sounding goal> with <impressive statistic>". Video, reportedly without technical specifics.

They say if you can't communicate something succinctly, then you don't truly understand it. They are right.

vasaulys
Does anybody use R in production services or just for exploratory work?

It seems that once you figure out a good model in R, it's almost always rewritten into either Scala or Java for real production work.

vegabook
I have 20k lines of (my own) R code running in production (used intensively by a salesforce of up to 20 people who price bonds with it) and it's an unmitigated nightmare to manage. Slow as crazy. No threading to manage concurrency so constant batch jobs everywhere. Memory hog. On Windows (this is finance), unfortunate fairly frequent crashes. No real time feeds due to the horrible architecture of the interpreter. That said, beautiful charts!

Just Say No. It'll sap your mojo. Am moving the whole thing to a blend of C, Python, and a distributed computing framework (thinking of Flink or Concord.io).

sandGorgon
have you considered spark instead of flink/concord ?
vegabook
I have, thanks for asking. I must admit that I have a very real priority on (soft) real time. Flink appears attractive but I also have a slight bias to non JVM which is where Concord appears interesting. I also just love Concord's "hot" DAG capability. Agreed (I think) though that I must include Spark micro-batching as a potential candidate. Any experience you have on this...I welcome links/tips. As you can probably tell I am at the very initial exploratory stage on stack choice.
md2be
R is just S which is just C
dandermotj
You're not wrong and absolutely totally wrong at the same time. R is the furthest thing from C you could find in paradigm, syntax and performance, but yes much of the underlying code is C or Fortran.

But really you're missing the point. R's purpose is interactive, exploratory and scientific computing and that's what it is incredibly good at. It wasn't intended for high performance computing, but there are ways of getting it there. Look out for Rho in the future.

vegabook
So well put. But what is Rho? Intrigued...
dandermotj
https://github.com/rho-devel/rho
blahi
That sounds like bad coders, not that R is bad.

Evidenced by:

>No threading to manage concurrency

R is used in production at EA, Activision, Ebay, Trulia, Google, Microsoft and many, many more. Those are just the ones I've seen give talks about scoring >1TBs regularly with R.

Every time somebody says R can't be used for large data sets or is slow, I ask for more details and almost universally the programmer's complete lack of initiative is the weak link.

vegabook
R just does not have robust software engineering tools for anything that even begins to resemble scale, and anybody who says otherwise is denying reality. R can certainly be used in production but the skeleton framework cannot be R. RPC only, in my experience, with all the structure in something else. R is intrinsically single user / batch with maybe a shared database, but say goodbye to anything that even starts to approach real time, or multi-node dependent. In my experience the only people who insist that R is robust for production inevitably have a vested interest. Any objective programmer can see its greatness but also its glaring flaws.
blahi
Riiight. Everybody else is a bad engineer and you are the good one. With the single threaded R code...

edit: The comment above has been extended quite a bit. Initially it was a single (abrasive) sentence. I still stand by my answer however. Somebody who did not turn on multi-threading does not get to criticize R. It is the first thing you learn in any book about R. You have to be almost actively avoiding learning about it. It's in every 3rd blog post and SO question.
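For reference, the baseline version of turning that on is a few lines with the base parallel package. A minimal sketch ('transactions' and score_chunk() are invented placeholders for real data and a real model):

    library(parallel)

    transactions <- data.frame(amount = runif(1e6, 1, 500))      # stand-in data
    score_chunk  <- function(df) df$amount * runif(nrow(df))     # stand-in model

    # Split the rows into one chunk per core and score the chunks in parallel.
    chunks <- split(transactions, seq_len(nrow(transactions)) %% detectCores())

    cl <- makeCluster(detectCores())
    results <- parLapply(cl, chunks, score_chunk)
    stopCluster(cl)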

vegabook
perhaps you might not have started your own comment with the erroneous view that 'bad coders' are to blame when R proves to be deficient at extra-design tasks.

Oh I further note your R consulting vocation. There you go. Vested interest.

BTW, I love R. But my love is not blind.

None
None
kgwgk
Excel is used in production very widely, but I'm sure we all agree it has its limitations.
nerdponx
Do you have personal experience with this kind of hyper-performant R code?
blahi
I have experience scoring ~ 1TB daily. And a lot of smaller data sets spanning a few hundred gigs.

It's not "hyper performant". Obviously doing things in scala or C++ will be faster. However rewriting the models would take months and an entirely different set of skills. That means separate people.

But if somebody says that they use Python instead of R for the speed... that's just bull. For example, one of the fundamental building blocks, pandas, is slower than its counterpart in R.

sandGorgon
could you talk about some of the learnings you had around scoring 1TB daily in R?

How do you even load the data into memory? Is it read from a database or S3 files?

blahi
In that particular case, I used Vertica, which loads data into R really, really fast, and straight up used a very big machine.

That's not how I approach it most of the time though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft). They process things in chunks. You can see biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High Performance Computing CRAN Task View.

I have also used Netezza and Hana and both worked well for the purpose. There's also Teradata Aster but I don't have experience with it. There's also the open-source MonetDB which has in-database R and also an R package similar to RSQLite.

There are also map/reduce packages for Hadoop.
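The chunked pattern looks roughly like this with biglm, for example (file name and formula are placeholders):

    # Fit a linear model on a file too big for RAM by updating the fit chunk
    # by chunk; biglm keeps only the running summary statistics in memory.
    library(biglm)

    con   <- file("transactions.csv", open = "r")
    chunk <- read.csv(con, nrows = 100000)                   # first chunk, with header
    fit   <- biglm(amount ~ n_items + hour_of_day, data = chunk)

    repeat {
      chunk <- tryCatch(
        read.csv(con, nrows = 100000, header = FALSE, col.names = names(chunk)),
        error = function(e) NULL)                            # read.csv errors at end of file
      if (is.null(chunk) || nrow(chunk) == 0) break
      fit <- update(fit, chunk)                              # fold the new rows into the fit
    }
    close(con)
    summary(fit)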

apathy
I never would have tried MonetDB (ok, MonetDBLite) if not for this great little tutorial on how to load all of SEER into it:

http://www.asdfree.com/2013/07/analyze-surveillance-epidemio...

Yeah, the presentation and code aren't beautiful, but it does avoid the need to WRITE THE DAMNED THING YOURSELF, which some people apparently will never understand (although they will once they are unemployed). More importantly, it turns out you don't necessarily need Vertica for fast out-of-core loading and processing.

Granted, there are plenty of other ways to work out of core (hdf5, bigMatrix, any random database, blah blah) but this was one that was new to me. And I like it.

vegabook
this is not software engineering or production. It is batch jobs / exploratory analysis. It requires little or no structure apart from the analysis itself.

also in anything that has not been coded in C directly underneath, Python is 20x faster and C is 500x faster. R is literally the slowest mainstream language today by a long shot. That's a key consideration for production.

ignasl
Where did you get those numbers from? They are most definitely wrong unless you don't vectorize your code and run loops all around. A lot of R is actually written in C so you can squeeze really good performance if you know what you are doing. I would recommend reading Hadley's Advanced R and profile your code, I think you might be pleasantly surprised.
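A quick way to see the gap (and to start profiling) on your own machine, using nothing but base R:

    # Compare an element-by-element loop with the equivalent vectorized call.
    x <- runif(1e7)

    f_loop <- function(x) {
      out <- numeric(length(x))
      for (i in seq_along(x)) out[i] <- x[i] * 2 + 1
      out
    }

    system.time(f_loop(x))    # interpreted loop over 10 million elements
    system.time(x * 2 + 1)    # same result, handled by vectorized C code

    # Rprof() / summaryRprof() show where a longer script actually spends its time.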
blahi
I would also suggest The R Inferno.
vegabook
I make extensive use of vectorization and use as many calls as I possibly can to the built-ins and/or C-based libraries. However, as you well know, part of the fun in R is applying your own functions, and unless you write these in C, you're back to native R and that's tediously slow. ggplot is another culprit -> amazing library, but if you're chucking out large amounts of custom charts with it, it takes ages. Base graphics is an order of magnitude faster (if less pretty and convenient for axis training).
ignasl
That definitely sounds like bad coders. However, I would say that if someone comes from a more classic programming language background, they will write bad and slow R code by default, especially if they are pressured into delivering fast and don't have time to search for/learn the best solution. I was amazed how often you can solve something with one or two lines in R, and those two lines will have so much better performance, readability, maintainability and reliability than something you would do without thinking. But you have to know those two lines and which libraries to use, etc. R actually is an extremely elegant language, and probably the most productive, if you know what you are doing; however, it's not very beginner friendly (especially coming from other languages).
baldfat
> It seems that once you figure out a good model in R, its almost always rewritten into either Scala or Java for real production work.

I wouldn't say even 1% of programs written in R need that speed. I personally use it for small projects (besides a few Spark side projects) and I am outputting reports.

I really would like someone to show an actual example of this happening in 2016.

nerdponx
I do it at my company. I prototype in R, and then end up having to rewrite chunks of it in Python so it can be worked into our application, which right now is exclusively Python.

It's not a matter of performance, it's just that it would be an enormous amount of engineering overhead to start calling R from inside the Python app.

baldfat
That seems like you could simply use http://jupyter.org/ and just run the script with R code inline.

http://blog.revolutionanalytics.com/2016/01/pipelining-r-pyt...

Also, why not just switch to Pandas? It really is a pretty close R clone.

kgwgk
"Pretty close" as long as you stay within the region of common functionality. I wouldn't say it's a clone.
baldfat
That is true. I actually started my journey with Pandas and then switched to R for the ecosystem; also, zero-based indexing for data science drove me nuts.

But I do feel that the goal is a clone.

"Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R." http://pandas.pydata.org/

blahi
How much experience do you have in statistical computing, out of curiosity?
nerdponx
It has nothing to do with interoperability on my machine. I use notebooks (and Pandas) all the time, and I consider myself fluent in both R and Python.

It's because R is a substantial engineering dependency. As I said, our entire stack is Python and Node. Yes, you can call R from Python using Rpy2, but that's a pro-bono project maintained largely by one person. It's great for casual use, but there is far too much risk to start talking about building critical business code around it.

baldfat
So why not Pandas?
nerdponx
Personal preference. I switch back-and-forth based on the project.

R data frames are native and feel native. Pandas data frames are non-native and can be a pain in the ass to work with.

That, and there is a lot more to the decision than just which data frame implementation I like better.

RA_Fisher
Check out opencpu.org, it's an R web API. Really cool stuff.
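A minimal sketch of what that looks like, assuming the current opencpu package's single-user server (older versions used a slightly different entry point):

    # Start a local OpenCPU server; installed R functions then become HTTP
    # endpoints, e.g. POST /ocpu/library/stats/R/rnorm/json with body {"n": 5}.
    library(opencpu)
    ocpu_start_server(port = 5656)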
0x001E84EE
Part of that may stem from R and most (all?) of its libraries being licensed under GPL.
nerdponx
Afaik Bloomberg uses it extensively for internal data visualization tools.
madenine
doesn't Bloomberg have a custom, in-house R IDE?
apohn
I used to work in the consulting arm of a software firm and we wrote and deployed R code in production at many Fortune 500 companies. We worked in almost every industry.

I spent quite a bit of time refactoring bad R code so it could run reliably in a production environment. There is a ton of bad R code out there that barely works for exploratory analysis, let alone a production environment.

So yes, R is used in production environments in a lot of places.

0x001E84EE
R and its libraries are GPL licensed. Is there some corporate license available to prevent companies from being required to publish proprietary code that interacts with R? Or was the usage limited to internal systems?
apohn
Thanks to certain popular technologies like Hadoop, a lot of big companies have their legal teams looking at open source licenses as alternatives to the big vendors like IBM. Using R and CRAN is getting easier because of this.

A lot of customers we worked with only provided outputs to external parties via reports, extracts, dashboards, etc. I don't recall a situation where an external person could run an R script (e.g. some of the companies I worked for provided their customers with BI reports). Don't ask me about the legality of that - even if I had an answer I wouldn't say it.

We used to run into all sorts of annoying issues with regards to licensing. For example, I worked at a customer where their scientists were blocked from downloading stuff from CRAN in an ad-hoc way (e.g. install.packages()). And nobody from our team was allowed to send them packages due to fear that they'd blame us for any issues with packages or package licensing.

The end result was a convoluted process for installing R, upgrading R, or anything to do with packages. During one project I was involved in a ridiculously long winded email chain discussing licensing on a particular library, with the lawyers acting like I had some sort of insight into the mind of the library author. That's the kind of resistance some organizations face when thinking about open-source tools.

ginger_beer_m
What are the hallmarks of bad R code to watch out for and avoid?
ignasl
For the majority of use cases R code should be vectorized, so I would say if you see a loop in the code that's a red flag and you should check it out.
apohn
What makes R code good for production is basically the same for what makes code in any language good for production. Use functions, local variables, tests, check for nulls, type issues, etc.

I wouldn't say having loops is always a bad thing. Sometimes writing loops is the only way to solve a particular problem and code loops can be easier to read and debug. Sometimes people say use the apply family of functions instead of loops, but my experience is that in many cases apply will not give you any significant speedup over a loop. I use apply because it's easier to write cleaner code with better flow than loops, not because I expect an automatic speedup.

However, if there are loops to do everything, that's a sign of bad R code. For example, if you are using a loop to add numbers in a vector together, that's bad code. That needs to be fixed.

A lot of R is also written for exploratory analysis. So it's written without much thought to structure, scope, flow, or much of anything. It's basically like a first draft of a paper. Making this code production ready should not just be putting that code in a function - you need to step back and architect it properly.

There's also a practical matter of how fast it needs to be. I've been involved in projects where a loop based R script was run in batch once a day at 1AM. And the run time for the script was 20 minutes. If we vectorized it, maybe it would run in <1 minute. But why bother if it's run once a day?
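To make the vector-sum example above concrete (base R only):

    # Summing a vector with a loop versus the built-in sum().
    x <- runif(1e7)

    slow_sum <- function(x) {
      total <- 0
      for (v in x) total <- total + v
      total
    }

    system.time(slow_sum(x))    # interpreted loop
    system.time(sum(x))         # one call into compiled code

    # sapply() over the same elements is not meaningfully faster than the loop;
    # the apply family buys readability, not an automatic speedup.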

vijucat
Did you guys separate out the R process (or multiple processes?) from the rest of the transaction-processing / other server infrastructure or embed the REngine (which sounds like a bad idea to me; incorrect data serialization can easily crash the whole process)?

What is a stable way to connect (and reconnect!) to R, assuming it was a separate process? I would think that an indirect communication path, such as Server <--> Database <--> R would work best, but I'd love to hear your battle hardened take on it.

sandGorgon
I have the same question - how do you use R in production?
apohn
We used separate workflows depending on whether the data was streaming or batch oriented (e.g. on-demand or triggered by a user). First I'll talk about batch oriented jobs.

The company I worked for had a tomcat based product that exposed R via a RESTful API. It was similar to what you get from AzureML now, except it was on-premise. So basically we would call out to this and configure it to restart R sessions if they crashed or timed out.

In an ideal situation we would isolate this server from the rest of the processing as much as possible. To be honest our server was pretty basic - it basically served to queue jobs (if needed) and manage RSessions if the server was configured to run multiple sessions. For serious failover we had a second server.

We did try to do as much as possible outside of R, such as data pipelining and ETL. That was done for the obvious reasons, but also because many customers had SQL and Data people, but not R people. So if one of their Data people understood the data ETL, they could fix it without calling us.

Many customers would never let R connect to a database directly. So they'd have a separate process pull data and write it to disk. Then an R script would be triggered and would pick this data up.

I never saw major crashing issues with R in production with batch oriented jobs unless there was something unexpected with the size or type of data. Typically as long as there was time between jobs, R's garbage collector would sort things out and be ready for the next job. Also by the time something made it into production we'd hardened the script, frozen the CRAN package versions, etc. So some small issue wouldn't cause a major issue.

Streaming data presented its own adventure. To get data into/out of R as quickly as possible, you need to embed the REngine and talk to it via rJava. If we streamed data through R very quickly it would do fine for a while - then you'd see the memory usage go up and the time for each transaction started to vary greatly. Then it would crash.

The solution to this was multiple Rsessions and a lot of telemetry. We would track how long each transaction took through R. As soon as we started seeing a lot of variance in the time we'd restart the engine. By running the multiple Rsessions in round-robin we'd delay the onset of this instability, and it didn't matter when R sessions needed to be restarted.

Another trick we used was to cache data in an in-memory database so if something crashed the whole service would restart and pull from the in-memory database instead of trying to fetch old data from the server.
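The R-side half of that telemetry can be as simple as timing every scoring call and handing the number back to whatever manages the sessions. A sketch (score_one() is a stand-in for the real model call):

    # Wrap the scoring call so each transaction reports its own latency.
    # The host process (in our setup, Java talking to R through rJava) watches
    # the variance of these timings and recycles the R session when it drifts.
    score_one <- function(tx) runif(1)                 # placeholder model

    timed_score <- function(tx) {
      elapsed <- system.time(score <- score_one(tx))[["elapsed"]]
      list(score = score, elapsed_ms = elapsed * 1000)
    }

    timed_score(list(amount = 42, country = "US"))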

vijucat
Thanks, this is all quite useful! I faced crashes with REngine + rJava, too, and thought of a DB as an intermediary, but your in-memory DB idea adds an interesting twist that improves performance, too.
gearhart
That's half of the population of earth buying something on a credit card.

I'm assuming this wasn't real-time, real-world data (although I didn't watch the whole 1.5hr video to confirm), but the implication is that this system could process the peak load of global credit card transactions as they happened. That's pretty impressive.

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.