Hacker News Comments on
University of California, Berkeley
Distributed Machine Learning with Apache Spark
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this url.
It's one of the assignments. (The collaborative filtering one)
⬐ bertomartin
cool, thanks
⬐ minimaxir
Whoops, it was actually the other Big Data class.
The assignments are here: https://github.com/spark-mooc/mooc-setup
There is an edX course running that covers Machine Learning with Python, though it does require "...familiarity with basic machine learning concepts".
"All exercises will use PySpark, but previous experience with Spark or distributed computing is NOT required."
I am doing this course and find it really good: https://www.edx.org/course/scalable-machine-learning-uc-berk...
It is about building linear and logistic regression plus PCA using Spark (Python API).
Anyone who wants to pick up Spark basics: Berkeley (Spark was developed at Berkeley's AMPLab), in collaboration with Databricks (the commercial company started by Spark's creators), just started a free MOOC on edX: https://www.edx.org/course/introduction-big-data-apache-spar...
(If you wonder what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine on top of Hadoop that usually performs much better and has an arguably much easier API in Python, Scala, Java and now R)
It has more than 5000 students so far, and the professor seems to answer every single question on Piazza (a popular student / teacher message board).
So far it looks really good. (It started a week ago, so you can still catch up: the 2nd lab isn't due until Friday 6/12 EOD, you have a 3-day "grace" period, and there is not too much to catch up on.)
I use Spark for work (Scala API) and still learned one or two new things.
It uses the PySpark API, so there is no need to learn Scala. All homework labs are done in an IPython notebook. Very high quality so far IMHO.
It is followed by a more advanced Spark course (Scalable Machine Learning), also by Berkeley & Databricks.
(Not affiliated with edX, Berkeley or Databricks; I just thought this was a good place for a PSA to those interested.)
The original academic work on Spark by Matei Zaharia (creator of Spark) earned him the ACM's 2014 Doctoral Dissertation Award (http://www.acm.org/press-room/news-releases/2015/dissertatio...)
Spark also set a new record in large-scale sorting (beating Hadoop by far): https://databricks.com/blog/2014/11/05/spark-officially-sets...
* EDIT: typo in "Berkeley", thanks gboss for noticing :)
⬐ spacko
> It is followed by a more advanced spark course (Scalable Machine Learning)
Is it really more advanced regarding Spark? The requirements state explicitly that no prior Spark knowledge is required.
⬐ eranation
Cool, I stand corrected. Thanks
⬐ tomnipotent
"... on top of Hadoop".
Can safely remove this part. Hadoop is not required.
⬐ digitalzombie
Hadoop isn't required, and it only runs better if you can fit the data in memory.
Spark does micro-batch processing, whereas Hadoop traditionally does batch processing. Hadoop YARN is different now, and even with old Hadoop, if you can fit the data into memory, it can supposedly be as fast, according to a meetup I attended.
There's also Apache Flink by data Artisans.
⬐ gtt
I've been struggling to set it up correctly on my Debian machine. Are there Debian packages or a concise tutorial? I've found some things on the web, but certain things do not match my setup and I'm lost...
⬐ annapurna
Thanks for the detailed info and context. Just signed up for my first edX course.
⬐ yzh
Thanks! I've been following the course and so far it's been awesome!
⬐ julnepht
Thanks for the plug. I have signed up for the class as well and it's great!
⬐ 0xFFC
I would love to learn about Spark, but as someone who lives in a third-world country I hate edX; instead I am in love with Udacity and Coursera. Where I live, we don't have much monthly traffic; instead we can download everything we want between 1am and 6am, so there is no way to simply download a course from edX and use it later. I wish it were on Udacity or Coursera. Is there any torrent for the course material?
⬐ sidmitra
I'm doing the Spark course. edX has a download button on the videos, and you can download PDF files for the lectures. The rest, like the embedded quizzes, I just screenshot or save as PDF for posterity.
Are you sure you can't download? Or maybe they've changed it recently.
⬐ 0xFFC
Yes, I am aware of the download button, but consider that every course has ~50 distinct videos, and given our download window you will agree that downloading is extremely painful. Why don't they just put up the whole material (or at least the videos) the way Udacity does?
⬐ jm0
You can download the lectures using the edx-downloader: https://github.com/shk3/edx-downloader
Do they really get better? I'm either going to jump straight to the (R) Statistical Inference course from JHU, or switch to the Berkeley/EdX Spark course.
I use a lot more Spark in my day job than R, but I really should learn statistics more formally.
⬐ rz2k
I thought they got better compared to the first few classes, but they do really revolve around R. For a rigorous treatment of the subject matter, the MITx course on Probability is really good. You could also take a look at the two JHU "Mathematical Biostatistics Bootcamp" courses. Those are also quick compared to the MITx course, but a little more careful about the math than the courses in the data science specialization are.
I haven't ever used Spark, and I like R, but I am going to take the Berkeley/EdX course.