HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Andrew Montalenti - streamparse: real-time streams with Python and Apache Storm - PyCon 2015

PyCon 2015 · Youtube · 3 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention PyCon 2015's video "Andrew Montalenti - streamparse: real-time streams with Python and Apache Storm - PyCon 2015".
Youtube Summary
"Speaker: Andrew Montalenti

Real-time streams are everywhere, but does Python have a good way of processing them? Until recently, there were no good options. A new open source project, streamparse, makes working with real-time data streams easy for Pythonistas. If you have ever wondered how to process 10,000 data tuples per second with Python -- while maintaining high availability and low latency -- this talk is for you.

Slides can be found at: https://speakerdeck.com/pycon2015 and https://github.com/PyCon/2015-slides"
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
If you're interested in Python + Storm, we are making it happen via our open source project streamparse. See the project on Github[1] and also my PyCon US 2015 talk[2].

Your point about documentation is well-taken. We are trying to document streamparse + Storm usage from a Pythonista standpoint via our online documentation[3], e.g. here is our detailed Python API documentation[4].

[1]: https://github.com/Parsely/streamparse

[2]: https://www.youtube.com/watch?v=ja4Qj9-l6WQ

[3]: http://streamparse.readthedocs.org/en/stable/

[4]: http://streamparse.readthedocs.org/en/stable/api.html

Apache Storm[1] isn't exactly like Erlang/OTP, but it definitely aims to achieve similar goals. Though Storm is written in Java and Clojure, the infrastructure it provides is usable across languages thanks to something it calls the "multi-lang protocol".

I work on a Storm library for Python that's called streamparse[2], and the goal of the project is to allow us to easily achieve Erlang OTP-style reliable processing atop open source infrastructure while still writing pure Python code.

I also gave a PyCon talk about streamparse which you can find on YouTube[3]. It describes the motivation for the project -- which is to solve a large-scale real-time analytics problem with Python and do so in a reliable way, while beating Python's multi-core limitations (the GIL, etc.)

[1]: https://storm.apache.org/

[2]: https://github.com/Parsely/streamparse

[3]: https://www.youtube.com/watch?v=ja4Qj9-l6WQ
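
To make the "multi-lang protocol" mentioned above a little more concrete, here is a minimal sketch of the idea: a non-JVM component exchanges JSON messages with Storm over stdin/stdout, each message followed by a line containing only "end". The field names used here ("id", "tuple", "command", "anchors") follow my reading of the Storm multi-lang documentation and should be treated as assumptions. The function names are hypothetical, and the initial handshake and heartbeats are left out entirely. This is not how streamparse itself is implemented; it only illustrates the framing that such a library hides from you.

```python
import io
import json


def write_message(stream, msg):
    """Write one JSON message followed by the 'end' delimiter line."""
    stream.write(json.dumps(msg))
    stream.write("\nend\n")


def read_message(stream):
    """Read lines up to the next 'end' line and decode them as JSON.

    Returns None when the stream runs out before an 'end' line is seen.
    """
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None


def run_word_count_bolt(stdin, stdout):
    """Hypothetical bolt loop: count words, emit (word, count), ack the input."""
    counts = {}
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        word = msg["tuple"][0]
        counts[word] = counts.get(word, 0) + 1
        # Anchor the emitted tuple to its input so a failure can be replayed.
        write_message(stdout, {
            "command": "emit",
            "anchors": [msg["id"]],
            "tuple": [word, counts[word]],
        })
        # Tell Storm the input tuple was fully processed.
        write_message(stdout, {"command": "ack", "id": msg["id"]})
    return counts


if __name__ == "__main__":
    # Simulate Storm's side of the pipe with in-memory streams.
    incoming = io.StringIO()
    for i, word in enumerate(["storm", "python", "storm"]):
        write_message(incoming, {"id": str(i), "stream": "default", "tuple": [word]})
    incoming.seek(0)
    outgoing = io.StringIO()
    print(run_word_count_bolt(incoming, outgoing))
```

In a real topology, Storm would launch the script as a subprocess and own both ends of the pipe; the in-memory streams above only stand in for that so the sketch can be run on its own.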

MCRed
None of those are equivalent. In fact they don't actually address the issue.
pixelmonkey
OTP is described thusly by "Learn You Some Erlang":

"OTP contains functions to safely spawn and initialize processes, send messages to them in a fault-tolerant manner and many other things."

Also, "supervisors are one of the most useful parts of OTP you'll get to use." And, "an OTP application specifically uses OTP behaviours for its processes, and then wraps them in a very specific structure that tells the VM how to set everything up and then tear it down."

Given that, I don't see how you could think Storm isn't at all related. (I never said it was equivalent, just very similar goals.)

fernandotakai
not sure if this exists right now, but i would love to see a good tutorial/article on storm showing how to use it to solve real life problems instead of "hey here's a way to split words and count them".
patrickmay
We're using Storm to handle high volume Kafka feeds to process advertising bid requests and other user data in machine learning models. Most of it is proprietary, unfortunately, so I can't write the article you'd like (and that I would have liked before starting this).

So far Storm is working as advertised. Frankly, what it does isn't that terribly difficult, but it's good to have a well-tested implementation of it so we can focus on our business logic.

pixelmonkey
It is a fair criticism of the state of Storm documentation and examples. The book Storm Applied by Manning details tons of Storm usage examples, ranging from the technically advanced to the mundane. I recommend that book to learn more about real world usage (disclosure: I wrote the foreword).

Also, my team is working on an open source example project called birding that uses streamparse and pykafka together to build filtered firehose tweet streams from Twitter's public API. We think this will illustrate both Storm and Kafka really well, and be beyond the typical word count examples. See https://github.com/Parsely/birding to track that effort.

virtualwhys
> to easily achieve Erlang OTP-style reliable processing

that reliability wasn't the case for at least one of the internet heavyweights, quite the opposite [1]

[1] http://blog.acolyer.org/2015/06/15/twitter-heron-stream-proc...

pixelmonkey
I have a more cynical view of the Heron announcement. But let's remember that Twitter ran Storm in production for over 4 years at massive scale, and whatever improvements Heron represents, they thought the abstraction was important enough to reimplement the entire Storm API in the new system.
BTW, I'm one of the co-authors of streamparse, one of the DARPA-supported projects that is being developed by my company, Parse.ly. It lets you integrate Apache Storm cleanly with Python.

I just gave a talk about streamparse at PyCon US (https://www.youtube.com/watch?v=ja4Qj9-l6WQ) a few days ago; it was entitled "streamparse: defeat the Python GIL with Apache Storm". I'm glad to answer any questions about it.

look_lookatme
How does one get their commercial project supported by DARPA?
phy6
Send your proposal in response to the Broad Agency Announcements (BAA) that the agency puts out.
pixelmonkey
Hmm, not sure I could answer that question, as in this case, DARPA is supporting our open source projects, not our commercial projects. Or is that what you are asking?

That said, FastCompany covered the story of how we got involved with DARPA here:

http://www.fastcompany.com/3040363/the-future-of-search-brou...

look_lookatme
I looked at the MEMEX page and saw a bunch of companies represented and was genuinely curious... thanks for sharing the link.
anon3_
What's the GitHub URL?
pixelmonkey
https://github.com/Parsely/streamparse
anon3_
Solid. Apache licensed. You're inside tmux too.

This is legit.

Docs: http://streamparse.readthedocs.org/en/latest/

How did you make that screenshot / animated preview?

pixelmonkey
I used a Linux program called byzanz. The bash alias I use to record gif screencasts is here:

https://github.com/amontalenti/home/blob/master/.bash_aliase...

strgrd
With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

> ...DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing...”

Doesn't this sound like the same "protect the kids" line embedded in every press release for not-so-subtle government spy programs? $1 million is a lot of money, and I'm sure being able to name drop DARPA in any conversation about your company carries its own cachet -- surely you feel pressured to design your optimizations to fit DARPA's needs. Does it feel weird to write code that's being used to track people? Or is that off base?

meowface
It's a means of searching public-facing (albeit cloaked) content, not a means of tracking people specifically.

If anything, it's ethically much superior to what the NSA is doing: law enforcement searches for content that is clearly criminal (child pornography, actual terroristic threats, murder-for-hire services), then requests a warrant after showing the content to a judge. That's how the process should work; identify something illegal at the front, identify probable cause, then go in through the back with court approval. These search engines can only find content that is already accessible to other users.

The NSA is already in the back, looking for justification for already being there, then after finding something, lying and saying they went in through the front.

Of course, this software could theoretically be used to search a database of data unethically exfiltrated without a warrant, but that's not what the stated goal is and there doesn't seem to be any evidence of that.

makomk
They're using it to search for "human trafficking", by which they seem to mean adult women having sex in exchange for money. Oh, sorry, adult women who describe themselves as "latina" having sex for money - mustn't forget that part. (Seriously. Look at the pictures in the article.) Minor details like whether the women in question are actually trafficked, or whether they should be deporting them right back into the hands of the people who trafficked them if they are, have never been terribly important to the police in the US. This will be used to hurt vulnerable women.
mpyne
> Does it feel weird to write code that's being used to track people?

Does it feel weird to design mechanical implements designed for the sole purpose of destroying human life?

I'm not speaking of drones and missiles, mind you; I'm speaking of small arms, the very same tools so staunchly defended by libertarian lovers of the Second Amendment everywhere.

There are plenty of valid reasons to want to track someone over a network like Tor, just as there are insidious reasons. E.g. all the reasons that make legal, warrant-protected wiretaps a legitimate function of governments worldwide.

But even if there weren't valid reasons, other countries will develop (or already have) similar capabilities, so making DARPA your line in the sand for this is missing the point anyways.

def_illiterate
>With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

You're not adding to the conversation by pointing this out. We can all clearly see this for what it is.

pixelmonkey
To be clear, our projects (at Parse.ly) don't have anything to do with Tor. In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

But, I'll address your general question, which is, do I have a moral/ethical problem with DARPA funding some of our open source work, such as streamparse and pykafka?

The answer is a resounding "no". There are very few funding sources for open source work. Part of DARPA's funding supports fundamental tech advancements (famously, the Internet itself and GPS) and, recently, important open source projects (such as Apache Spark and the Julia language).

Now, there is no doubt in my mind that open source software is used for intelligence purposes, regardless of its funding source. To restrict one's contribution to F/OSS based on the fear that some government or entity may use it toward an end you disagree with seems a battle you can only win through willful ignorance.

The nature of open source software is that people can use it however they please (within legal limits, of course). This is a trade-off I accept with eyes wide open, and in my mind, the benefit to the community for F/OSS always wins out.

saurik
> In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

This reminds me of the movie Cube :(.

api
Brilliant little unknown film. First time I've seen it mentioned, ever.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.