HN Academy

The best online courses of Hacker News.

Hacker News Comments on
Distributed Machine Learning with Apache Spark

edX · University of California, Berkeley · 5 HN comments

HN Academy has aggregated all Hacker News stories and comments that mention edX's "Distributed Machine Learning with Apache Spark" from University of California, Berkeley.
Course Description

Learn the underlying principles required to develop scalable machine learning pipelines and gain hands-on experience using Apache Spark.

Provider Info
This course is offered by University of California, Berkeley on the edX platform.
HN Academy may receive a referral commission when you make purchases on sites after clicking through links on this page. Most courses are available for free with the option to purchase a completion certificate.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this url.
https://www.edx.org/course/scalable-machine-learning-uc-berk...!

It's one of the assignments. (The collaborative filtering one)

bertomartin
Cool, thanks.
minimaxir
Whoops, it was actually the other Big Data class.

The assignments are here: https://github.com/spark-mooc/mooc-setup

There is an edX course running that covers machine learning with Python, though it does require "...familiarity with basic machine learning concepts".

"All exercises will use PySpark, but previous experience with Spark or distributed computing is NOT required."

https://www.edx.org/course/scalable-machine-learning-uc-berk...

Jul 17, 2015 · gbersac on How to learn data science
I am doing this course and find it really good: https://www.edx.org/course/scalable-machine-learning-uc-berk...

It covers building linear regression, logistic regression, and PCA using Spark (Python API).

For anyone who wants to pick up Spark basics: Berkeley (Spark was developed at Berkeley's AMPLab), in collaboration with Databricks (the commercial company started by Spark's creators), just started a free MOOC on edX: https://www.edx.org/course/introduction-big-data-apache-spar...

(If you wonder what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine on top of Hadoop that usually performs much better and has an arguably much easier API in Python, Scala, Java and now R)

It has more than 5000 students so far, and the professor seems to answer every single question on Piazza (a popular student / teacher message board).

So far it looks really good. (It started a week ago, so you can still catch up: the 2nd lab is not due until Friday 6/12 EOD, you have a 3-day "grace" period, and there is not too much to catch up on.)

I use Spark for work (Scala API) and still learned one or two new things.

It uses the PySpark API, so there is no need to learn Scala. All homework labs are done in an IPython notebook. Very high quality so far IMHO.

It is followed by a more advanced spark course (Scalable Machine Learning) also by Berkeley & Databricks.

https://www.edx.org/course/scalable-machine-learning-uc-berk...

(not affiliated with edX, Berkeley or Databricks, just thought it's a good place for a PSA to those interested)

The academic work that originated Spark earned Matei Zaharia (creator of Spark) a PhD dissertation award in 2014 from the ACM (http://www.acm.org/press-room/news-releases/2015/dissertatio...)

Spark also set a new record in large scale sorting (Beating Hadoop by far): https://databricks.com/blog/2014/11/05/spark-officially-sets...

* EDIT: typo in "Berkeley", thanks gboss for noticing :)

spacko
> It is followed by a more advanced spark course (Scalable Machine Learning)

Is it really more advanced regarding Spark? The requirements state explicitly that no prior Spark knowledge is required.

eranation
Cool, I stand corrected. Thanks.
tomnipotent
"... on top of Hadoop".

You can safely remove this part. Hadoop is not required.

digitalzombie
Hadoop isn't required, and Spark only runs better if you can fit the data in memory.

Spark does micro-batch processing, whereas Hadoop traditionally does batch processing. Hadoop YARN is different now, and even with old Hadoop, if you can fit the data into memory it can supposedly be as fast, according to a meetup I attended.

There's also Apache Flink by data Artisans.

gtt
I've been struggling to set it up correctly on my Debian machine. Are there Debian packages or a concise tutorial? I've found some things on the web, but certain details do not match my setup and I'm lost...
annapurna
Thanks for the detailed info and context. Just signed up for my first edX course.
yzh
Thanks! I've been following the course and so far it's been awesome!
julnepht
Thanks for the plug. I have signed up for the class as well, and it's great!
0xFFC
I would love to learn about Spark, but as someone who lives in a third-world country I hate edX; instead I am in love with Udacity and Coursera. Where I live we don't get much monthly traffic, but we can download everything we want between 1am and 6am, so there is no way to simply download a course from edX and use it later. I wish it was on Udacity or Coursera. Is there any torrent for the course material?
sidmitra
I'm doing the Spark course. edX has a download button on the videos, and you can download PDF files for the lectures. The rest, like the embedded quizzes, I just screenshot or save as PDF for posterity.

Are you sure you can't download them? Or maybe they've changed it recently.

0xFFC
Yes, I am aware of the download button, but consider that every course has ~50 distinct videos, and also consider our limited download window; you will agree with me that downloading is extremely painful. Why don't they just put up the whole material (or at least the videos) the way Udacity does?
jm0
You can download the lectures using the edx-downloader: https://github.com/shk3/edx-downloader

Do they really get better? I'm either going to jump straight to the (R) Statistical Inference[1] course from JHU, or switch to the Berkeley/EdX Spark course[2].

I use a lot more Spark in my day job than R, but I really should learn statistics more formally.

[1] https://www.coursera.org/course/statinference

[2] https://www.edx.org/course/scalable-machine-learning-uc-berk...

rz2k
I thought they got better compared to the first few classes, but they do really revolve around R. For a rigorous treatment of the subject matter, the MITx course on Probability is really good. [1] You could also take a look at the two JHU "Mathematical Biostatistics Bootcamp"[2] courses. Those are also quick compared to the MITx course, but a little more careful about the math than the courses in the data science specialization are.

I haven't ever used Spark, and I like R, but I am going to take the Berkeley/EdX course.

[1] https://www.edx.org/course/introduction-probability-science-...

[2] https://www.coursera.org/course/biostats & https://www.coursera.org/course/biostats2

HN Academy is an independent project and is not operated by Y Combinator, Coursera, edX, or any of the universities and other institutions providing courses.