Hacker News Comments on
University of Washington
Data Science at Scale
All the comments and stories posted to Hacker News that reference this URL.
Bill Howe did a solid intro course for the University of Washington. Videos and other materials are available on Coursera.
The one thing I'd really change is to tighten up the range of tools used. It seems helpful to show students a range of tools, but it usually ends up being a major distraction for students and a lot of extra effort for course staff. Any such course is already going to be a blitz of new concepts and technology.
Go full Python, plus interactive tools as helpful (Weka, Tableau). Let them pick up R or D3.js or whatever later, after they have a better appreciation for the concepts that make them useful.
So looking through this 'track', I see one course which seems like it might be more central to the discipline, "Intro to Data Science". Has anybody had a chance to compare this one against Bill Howe's "Introduction to Data Science" on Coursera?
For an introduction to the broader realm of data input, normalization, modeling, and visualization -- in which ML plays but a part -- you can "preview" Bill Howe's "Introduction to Data Science" class on Coursera; I'm working through the lectures, and I find he gives compelling explanations of what all these parts are, why they're important, and how it all fits together in a larger context.
⬐ ghaff I took Prof. Howe's course on Coursera and it's a bit of a mixed bag. I can actually see it being better in some respects to go through the content after the fact than to take the course as it was run, as there were a number of issues with auto-grading of assignments and with some of the specific tool choices (like Tableau, which only runs on Windows).
That said, the course covered a lot of ground and touched on a number of different interesting/important topics. Some of the lecture material was a bit disorganized/had errors and didn't flow all that well from one topic to another but there was a lot of good material there, especially if you had enough background to appreciate it. I was comfortable enough but it was obvious that the expectations set by the prereqs were off.
Hopefully the course will run again with most of the kinks worked out and, perhaps, a better level-setting of what's needed to get the most out of the course.
A large telco has a 600-node cluster of powerful hardware. They barely use it. Moving Big Data around is hard. Managing it is harder.
A lot of people fail to understand the overheads and limitations of this kind of architecture, or how hard it is to program, especially considering that salaries for this skill have skyrocketed. More often than not, a couple of large 1 TB PCIe SSDs and a lot of RAM can handle your "big" data problem.
Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci
⬐ jfxberns "A lot of people fail to understand the overheads and limitations of this kind of architecture. Or how hard it is to program, especially considering salaries for this skyrocketed. More often than not a couple of large 1TB SSD PCIe and a lot of RAM can handle your "big" data problem."
It's not that hard to program... it does take a shift in how you attack problems.
If your data set fits on a few SSDs, then you probably don't have a real big data problem.
"Moving Big Data around is hard. Managing is harder."
Moving big data around is hard--that's why you have Hadoop--you send the compute to where the data is, which requires a new way of thinking about how you do computations.
"Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci"
Data science does not solve the big data problem. Here's my favorite definition of a big data problem: "a big data problem is when the size of the data becomes part of the problem." You can't use traditional, serial programming models to handle a true big data problem; you have to have some strategy to parallelize the compute. Hadoop is great for that.
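The split-map-shuffle-reduce pattern behind that parallelization strategy can be sketched in plain Python. This is only the programming model, not Hadoop itself, and the word-count example and names here are made up:

```python
from collections import defaultdict

# map phase: each input chunk independently emits (key, value) pairs
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# shuffle phase: group all emitted values by key across the mappers' output
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

# reduce phase: combine the values for each key
def reduce_group(key, values):
    return key, sum(values)

chunks = ["big data is big", "data is data"]  # stand-in for input splits
mapped = [map_chunk(c) for c in chunks]       # on a cluster, these run in parallel
counts = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

The point of the shape: the map and reduce steps have no shared state, so a framework can scatter them across machines and ship the code to wherever the data lives.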
"A large telco has a 600 node cluster of powerful hardware. They barely use it."
Sounds more like organizational issues, poor planning and execution than a criticism of Hadoop!
⬐ petar Consider using gocircuit.org.
It was expressly designed to avoid the non-essential technical problems that get in the way of cloud application developers when they try to orchestrate algorithms across multiple machines.
Identical HBase SQL queries and their respective counterparts implemented in the Circuit Language are approximately equally long in code, and orders of magnitude faster than HBase. In both cases, the data is pulled out of the same HBase cluster for a fair comparison.
⬐ continuations What does Go Circuit do differently that makes it orders of magnitude faster than HBase SQL?
⬐ AsymetricCom > Moving Big Data around is hard.
I never had any issues with Hadoop. Took about 2 days for me to familiarize myself with it and adhoc a script to do the staging and setup the local functions processing the data.
I really would like to understand what you consider "hard" about Hadoop or managing a cluster. It's a pretty straightforward idea; the architecture is dead simple, requiring no specialized hardware at any level. Anyone who is familiar with the Linux CLI and running a dynamic website should be able to grok it easily, imho.
Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.
⬐ alecco ⬐ thrownaway2424 Is this serious? Have you ported a program to Hadoop? Unless you use Pig or one of those helping layers, it is quite hard for non-trivial problems. And those helping layers usually come with some overhead cost for non-trivial cases, too.
Edit: no downvote from me.
⬐ AsymetricCom ⬐ nl It was a pretty easy problem, parsing logs for performance statistics. But moving the data is the easy part, and that's why I was incredulous of the OP's statement.
I'm starting to wonder if this is really "Hacker News" or if it's "we want free advice and comments from engineers on our startups so let's start a forum with technical articles".
⬐ alecco Big Data should be on the peta+ level. Even with 10G Ethernet it takes a lot of bandwidth and time to move things around (and it's very hard to keep 10G Ethernet full at a constant rate from storage). This is hard even for telcos. Note that terabyte-level data today fits on SSD.
⬐ oceanplexian Not really, "Big Data" has nothing to do with how many bytes you're pushing around.
Some types of data analytics are CPU-heavy and require distributed resources. Your comment about 10G isn't true: you can move around a TB every 10 minutes or so, and SSDs or a medium-sized SAN could easily keep up with that bandwidth.
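A quick sanity check on that transfer-time figure, assuming a full 10 Gbit/s line rate and ignoring protocol overhead:

```python
link_gbps = 10                        # 10G Ethernet, best case
bytes_per_sec = link_gbps * 1e9 / 8   # 1.25 GB/s
terabyte = 1e12                       # 1 TB in bytes

seconds = terabyte / bytes_per_sec    # 800 seconds
print(f"{seconds / 60:.1f} minutes per TB")  # 13.3 minutes per TB
```

So "every 10 minutes or so" is in the right ballpark at line rate; in practice, sustaining that from storage is the harder part, as the earlier comment notes.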
If your data isn't latency sensitive and runs in batches, building a Hadoop cluster is a great solution to a lot of problems.
⬐ schrodinger Of course big data is about the number of bytes. That's what something like map reduce helps with: it depends on breaking your input down into smaller chunks, and the number of chunks is certainly related to the number of bytes.
> Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.
WTF does this even mean? I genuinely do not understand what point you are trying to make.
I have used Slashdot longer than you have (ok, possibly not.. but username registered in 1998 here...).
I find HN has generally much more experienced people on it, who understand more about the tradeoffs different solutions provide.
The old Slashdot hidden forums like wahiscool etc. were good like this too, but I don't think they exist anymore, do they?
⬐ skrebbel > Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.
You sound like a snob.
⬐ AsymetricCom See what I mean?
I strongly agree. Although there are clearly uses for map/reduce at large scale, there is also a tendency to use it for small problems where the overhead is objectionable. At work I've taken multiple Mao/reduce systems and converted them to run on my desktop, in one case taking a job that used to take 5 minutes just to start up down to a few seconds total.
Right tool for the job and all that. If you need to process a 50 PB input though, map/reduce is the way to go.
⬐ twic > Mao/reduce systems
Well, that certainly sounds like ...
puts on sunglasses
... a Great Leap Forward.
⬐ collyw I completely agree as well, but I don't consider myself much of an expert in NoSQL technologies (which is why I read up on threads like this to find out).
Does anyone have a use case where data is on a single machine and map reduce is still relevant?
(I am involved in a project at work where the other guys seem to have enthusiastically jumped on MongoDB without great reasons, in my opinion.)
⬐ amenod I completely agree with the parents about Map Reduce. However, I would justify using MongoDB for totally different reasons, not scalability: it is easy to set up, easy to manage, and above all easy to use, which are all important factors if you are developing something new. However, it does have a few "less than perfect" solutions to some problems (removing data does not always free disk space, no support for big decimals, ...) and it definitely takes some getting used to. But it is a quite acceptable solution for "small data" too.
Edit: ...and I wouldn't dream of using MongoDB's implementation of MapReduce.
⬐ markuskobler On modern hardware with many CPU cores you can use a similar process of fork and join to maximise throughput on large datasets.
⬐ collyw ⬐ jfxberns That sounds more like parallelisation rather than a use case for NoSQL.
Why is that any better than having all of your data on one database server, and each cluster node querying for part of the data to process it? Obviously there will be a bottleneck if all nodes try to access the database at the same time, but I see no benefit otherwise, and depending on the data organisation, I don't even see NoSQL solving that problem (you are going to have to separate the data onto different servers for the NoSQL solution; why is that any better than a cached query from a central server?).
⬐ Roboprog Forks/threads on (e.g.) 12-core CPUs work up to a point. But that point probably does cover many problems without further complication :-)
⬐ virtuabhi A single machine with many cores does not give the same performance as multiple machines. For example, consider disk throughput: if the data is striped across multiple nodes, then read requests can be executed in parallel, resulting in linear speed-up. On a single machine you have issues of cache misses, inefficient scatter-gather operations in main memory, etc.
And it is much easier to let the MapReduce framework handle parallelism than to write error-prone code with locks/threads/MPI/architecture-dependent parallelism etc.
"Does anyone have a use case where data is on a single machine and map reduce is still relevant?"
No! MapReduce is a programming pattern for massively parallelizing computational tasks. If you are doing it on one machine, you are not massively parallelizing your compute and you don't need MapReduce.
⬐ msellout You can imagine cases where map-reduce is useful without any starting data. If you are analyzing combinations or permutations, you can create a massive amount of data in an intermediate step, even if the initial and final data sets are small.
⬐ collyw ⬐ alecco Have you got any links on how to do that? It sounds very like a problem I am trying to solve just now - combinations of DNA sequences that work together on a sequencing machine.
At the moment I am self-joining a table of the sequences to itself in MySQL, but after a certain number of self-joins the table gets massive. Time to compute is more the problem than storage space though, as I am only storing the ones that work (> 2 mismatches in the sequence). Would Map Reduce help in this scenario?
⬐ memracom If I had your problem, the first thing that I would do is try PostgreSQL to see if it does the joins fast enough. The second thing that I would try is to put the data in a SOLR db and translate the queries to a SOLR base query (q=) plus filter queries (fq=) on top.
Only if both of these fail to provide sufficient performance would I look at a map reduce solution based on the Hadoop ecosystem. Actually, I wouldn't necessarily use the Hadoop ecosystem; it has a lot of parts/layers, and the newer and generally better parts are not as well known, so it is a bit more leading edge than lots of folks like. I'd also look at something like Riak http://docs.basho.com/riak/latest/dev/using/mapreduce/ because then you have your data storage and clustering issues solved in a bulletproof way (unlike Mongo) but you can do MapReduce as well.
> Does anyone have a use case where data is on a single machine and map reduce is still relevant?
What matters is the running data structure. For example, you can have Petabytes of logs but you need a map/table of some kind to do aggregations/transformations. Or a sparse-matrix based model. There are types of problems that can partition the data structure and work in parallel in the RAM of many servers.
Related: it's extremely common to mistake for a CPU bottleneck what is actually a problem of TLB misses, cache misses, or bandwidth limits (RAM, disk, network). I'd rather invest time in improving the basic DS/algorithm than in porting to Map/Reduce.
⬐ collyw OK, maybe I am not understanding you correctly, but what you describe seems to be: if the data is on one machine, connect to a cluster of machines and run the processing in parallel there.
That doesn't imply a NoSQL solution to me, just parallel processing on different parts of the data. If I am wrong, can you point me to a clearer example?
⬐ alecco Maybe I misunderstood what you were asking.
Note that both MapReduce and NoSQL are overhyped solutions. They are useful in a handful of cases, but are often applied to problems they are not a good fit for.
⬐ aidos ⬐ Roboprog I'm not sure that the two concepts are related at all. Obviously Mongo has map reduce baked in - but that's not that relevant. Map/reduce is a reasonable paradigm for crunching information. I have a heavily CPU-bound system that I parallelise by running on different machines and aggregating the results. I probably wouldn't call it map reduce - but really it's the same thing.
How do you parallelise your long-running tasks otherwise?
⬐ alecco I can't say without more information on the problem to solve. As I said above, there are cases where MapReduce is a good tool.
And even if you improve the DS/algorithm first, usually that work is reusable in the MapReduce port, and you save a lot of time/costs.
It sounds to me like the poster above restructured the input data to exploit locality of reference better.
⬐ alecco It's one of the issues, yes. But I wanted to be more general on purpose.
⬐ collyw So assuming the data is on one machine (as I asked), why would an index not solve this problem? And why does Map Reduce solve it?
⬐ virtuabhi Indexes do not solve the locality problem (see non-clustered indexes). Even for in-memory databases, it is non-trivial to minimize cache misses in irregular data structures like B-trees.
Now, why might MapReduce be a better fit even for a problem where the data fits on one disk? Consider a program which is embarrassingly parallel: it just reads a tuple and writes a new tuple back to disk. The parallel I/O provided by map/reduce can offer a significant benefit in this simple case as well.
Also NoSQL != parallel processing.
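For the DNA-combinations question upthread, the all-pairs mismatch count itself is easy to express without SQL self-joins. A toy sketch: the fixed-length sequences here are made up, and the "> 2 mismatches" rule is taken from the comment:

```python
from itertools import combinations

def mismatches(a, b):
    # Hamming distance between two equal-length sequences
    return sum(x != y for x, y in zip(a, b))

seqs = ["ACGTACGT", "ACGTACGA", "TGCATGCA", "ACCTACGT"]  # hypothetical sequences

# keep only the pairs that "work" (> 2 mismatches between them)
compatible = [(a, b) for a, b in combinations(seqs, 2) if mismatches(a, b) > 2]
for a, b in compatible:
    print(a, b, mismatches(a, b))
```

The n·(n-1)/2 pair generation is what blows up, and it is also trivially partitionable across workers, which is exactly the shape that would make a map/reduce port straightforward if it ever outgrows one machine.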
Seems to work great over here, and the installation was pretty easy, too. You can even choose not to download certain types of files using the -n option. For example, if you have a large hard drive and a smaller one, you can download the whole course to the large HD:
coursera-dl -u username -p password -d pathToLargeHD course_name
and only download pdf lecture notes to the smaller one
coursera-dl -u username -p password -d pathToSmallHD -n mp4,pptx course_name
I tried that over here, worked great.
Some schools prefer that students don't download course materials. I successfully downloaded the Machine Learning and Algorithms courses from Stanford but could not download this one; it says "no downloadable content found":
⬐ carlosgg After upgrading to the latest version of the script, I was able to download this one, too.
This seems to fit the bill: https://www.coursera.org/course/datasci
⬐ muraiki I did the first two weeks of this course and found it quite accessible, although the second week's question of implementing matrix algebra in SQL didn't seem to have much preparatory material in the lectures. Unfortunately I've had to drop out due to a concussion, but I think that most HN'ers would be able to take this course.
Both in timing and in content this could be a good lead-in to the University of Washington's Intro to Data Science class that looks like it will have more of a focus on 'big data', NoSQL, Hadoop, data mining, etc.