HN Theater

The best talks and videos of Hacker News.

Hacker News Comments on
Enough Machine Learning to Make Hacker News Readable Again

pyvideo.org · 169 HN points · 1 HN comment
HN Theater has aggregated all Hacker News stories and comments that mention pyvideo.org's video "Enough Machine Learning to Make Hacker News Readable Again".
Watch on pyvideo.org

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
I like the serendipity of Hacker News, so I don't want to filter out or in specific topics. But if you filter for quality, I love that. Using machine learning to pick out the "best" items was the subject of a discussion recently:

http://pyvideo.org/video/2612/enough-machine-learning-to-mak...

https://news.ycombinator.com/item?id=7712297

The author is nice enough to keep that app going here:

http://hn.njl.us/

It is an opinionated filter, but with ADD I can get caught reading most things on the front page. This guides me to some important links that don't initially seem all that interesting.

I asked the author if he would share his training set, or open source the app, but apparently not. If you can turn that web app into an RSS feed or a digest, that would help.

As for your idea to filter on keywords, you can get that working as an RSS feed using Yahoo Pipes without writing any code.
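If you'd rather script it than use Pipes, the same keyword filter is a few lines of Python. A minimal sketch, assuming the third-party feedparser package and HN's official RSS feed:

    import feedparser

    # Hypothetical keyword list; edit to taste.
    KEYWORDS = {"python", "machine learning", "compilers"}

    feed = feedparser.parse("https://news.ycombinator.com/rss")
    for entry in feed.entries:
        if any(kw in entry.title.lower() for kw in KEYWORDS):
            print(entry.title, "->", entry.link)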

jchmura
Interesting. I had not seen that discussion. I'll take a look.
May 07, 2014 · 169 points, 85 comments · submitted by Wingman4l7
nostromo
Cool project!

But HN to me is a way to keep current on what people in tech are talking about. I don't want to live in a bubble. I want to discover new things that smart people think are cool.

pekk
You are still living in a bubble, it's just the valley-tinged HN bubble and you have encoded this as "smart people".
pjmlp
HN is a bubble. I came here to get to know what is being discussed in the SV startup scene.

The European enterprise consulting world is another galaxy.

zo1
Any idea where we can keep a pulse/view on that other galaxy?
pjmlp
As answered in another thread,

https://news.ycombinator.com/item?id=7715048

zhte415
I would be quite interested in similar (or not so similar) discussion forums for the European Enterprise galaxy. Would you be able to either submit a list of what you find notable, or post some starting points?
pjmlp
Not sure. I get most of my information via Heise, Skillsmatter webcasts and occasional meetings at local JUG and .NET user groups.
hershel
Maybe some version of this tool should be used as a tool to filter past discussions instead of search? For example, for creating a portal about easy-to-use development tools, including their discussions?
tannerc
Something that keeps bringing me back to HN specifically (over the likes of Reddit, Twitter, etc.) is the sheer intelligence of conversations.

More often than not I'll find myself skimming through the discussion here before exploring the linked material. The reasoning behind why people feel something deserves to be "front paged" and the insights that domain experts offer in the discussion is what (I feel) makes HN valuable. Taking away the brainy aspect behind how the community works would be an interesting experiment, alas one I wouldn't want to see _replace_ what we have today.

"Things smart people think are cool" is nearly an understatement.

zo1
Agreed. This used to be the case with Slashdot, but that became a shithole with the new management. Sorry for the language, but as a long-time reader and commenter on Slashdot, I was very bitter when stuff started going downhill.

I've found HN to be a very good alternative. The comments/discussions were a bit iffy to get into at first with some trigger-happy downmods, but overall I think the conversation is very constructive. And there is a dearth of information and helpful things being said.

hessenwolf
Where's the dearth?
zo1
[..]being said/posted on HN. Sorry, should have elaborated that sentence.
ams6110
Did you mean wealth? Otherwise I'm confused by your statement.
zo1
Wow, dearth does NOT mean what I thought it does. Anyways, I meant to say "lots", which is the complete opposite of dearth.

Thanks for being confused :)

hessenwolf
Just a subtle prod in the right direction :)

I actually went and looked up 'dearth' just to be sure it meant 'lack of'.

gabemart
This is exactly what people on reddit used to say about reddit six or seven years ago. I have to believe the decline in general quality on reddit, and the quality of the front page most of all, could happen in some form to HN without vigilance on the part of users and mods.
mey
The community will drift toward a common form which is shaped by the community managers (mods and power users). I believe @pg gets that, and the people taking over also get it. Additionally, some users may simply outgrow the community even if the community stays exactly the same.

I think HN gets a lot of things right to foster what I enjoy, which is intelligent conversation about a wide variety of forward-looking topics. I think the biggest risk at the moment is the site being spammed with what are essentially targeted ads/click bait because of who the user base is, but for the most part the front page is decently curated.

tannerc
Interestingly, as long as the discussions continue to contain valuable — and often educated — insights, I personally think there's still enough value to keep coming back to HN there alone. The subjective quality of links themselves may shift as the community ages, but it's the well-thought-out discussions and insights within them which are arguably often more worthwhile than the linked content itself.

It's worth mentioning that comments which don't seem to add much to the conversation are often downvoted lately, which is promising.

krapp
Ironically, dang himself just got accused of posting something "not worthy" of HN: https://news.ycombinator.com/item?id=7712692

There seems to be a remarkable difference between what people feel HN should be, and what it actually is. "Quality" seems to be more of a sliding scale which correlates to personal bias, and perhaps, a feeling of alienation brought about by a diverse community, than anything objective.

pygy_
Here's the result of the presentation: http://hn.njl.us/

The classifier rejects this very submission... not sure what to think of it.

Maybe the "[video]" label killed it? The fact that it references HN?

CatMtKing
Probably just doesn't want to hear himself talk ;)
GhotiFish
His algorithm rates this article as bad.

That is very funny.

ColinWright
More likely the retrieved text from the page itself contains very little of technical interest. If there were a transcript, I expect it would fare better.

Great reminder - material in a video is undiscoverable.

egwor
Just my 2c; I can often get data faster by reading than waiting for someone to explain it verbally, so I usually prefer not to watch videos.

If someone could automatically transcribe videos with key frames from the video... (google??) ... then that would be cool.

001sky
^^^agree with this
philsnow
Idea: browser extension that notices two things: when you follow links from the HN front page, and whether you upvote that story. If you read an article and don't upvote it, it labels that article "dreck". If you do upvote it, it labels that article non-"dreck". Maybe it has some subtle reminder that you should remember to upvote the article once you're done.

People who use this extension make HN better for themselves (because they're classifying articles according to their tastes as they go along) and they're also making HN better for others (by incentivizing people to upvote good material when they may otherwise have not upvoted).

If you have enough HN karma to downvote, maybe only downvotes count as dreck. Then you're still improving both your own and others' experience.

Yo dawg, I heard you like HN, so I proposed a browser extension that lets you improve HN while you improve HN.
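Sketched in Python, the labeling rule itself is tiny (everything below is hypothetical; the extension would have to supply the click and vote events):

    # Read-but-not-upvoted becomes "dreck"; upvoted becomes "not-dreck";
    # a story you never opened carries no signal either way.
    def label(clicked, upvoted):
        if upvoted:
            return "not-dreck"
        if clicked:
            return "dreck"
        return None

    events = [("story1", True, False), ("story2", True, True),
              ("story3", False, False)]
    training_set = [(sid, label(c, u)) for sid, c, u in events
                    if label(c, u) is not None]
    print(training_set)  # [('story1', 'dreck'), ('story2', 'not-dreck')]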

dredmorbius
Two general problems with this, and they're common to many content-recommendation / filtering systems.

• Explicit rating actions are only a small part of interactions with a site. Other implicit actions are often far richer in quantity and quality -- time spent on an article, interactions and discussion, the quality of that discussion (see pg's hierarchy of disagreement, for example), and other measures. As Robert Pirsig noted, defining quality is hard.

• Whose ratings you consider matters. The problem of topic and quality drift happens as general interests tend to subvert the initial focus of a site or venue. Those which can retain their initial focus will preserve their nature for a longer period of time, but even that is difficult. Increasingly, my sense is that you want to be somewhat judicious in who you provide an effective moderating voice to, but those who get that voice should be encouraged to use it copiously. Policing the moderators (to avoid collusion and other abuse) becomes a growing concern (see reddit and its recent downvote brigades against /r/technology and /r/worldnews).

philsnow
regarding the first part, granted.

regarding the second part, the proposed scheme uses hn's built-in control of making users earn a bunch of karma before letting them downvote. I agree that topic drift happens; witness all of the bitcoin-related discussion over the past year or so.

ColinWright

  > ... hn's built in control of making users
  > earn a bunch of karma before letting them
  > downvote.
Since all of this is talking about the classification of submissions, this is irrelevant, because you can't downvote submissions, only comments.

At least, I don't yet have enough karma to downvote submissions.

dredmorbius
Submissions can be flagged.

I'd argue they should be downvotable as well, though you're right, they're not.

Incidentally, comments can also be flagged (on the comment link view only, not in the forum view).

dredmorbius
So, there are two basic approaches you can make to this:

1. Delegate moderation powers only to a select group of individuals who know and will uphold the site's standards. Effectively: an editorial board.

2. Allow all users to moderate. But score each on how well the result of their moderation adheres to a specified goal -- that is, for a given submission, was it 1) appropriate to the site and 2) did it create high-level engagement? Users might correlate positively or negatively, strongly or weakly. That is: some people will vote up content that's not desirable, and downvote content that is. Others simply can't tell ass from teakettle. In the first case, you simply reverse the sign; in the second, you set a low correlation value. And of course, those who are good and accurate predictors get a higher correlation value.

With the 2nd approach, everyone's "vote" counts, though it may not be as they expected. You've also got to re-normalize moderation against a target goal.

It's more computationally intensive, but I think it might actually be a better way of doing things.
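A toy sketch of the 2nd approach in Python (the data and scoring scheme are made up for illustration):

    import numpy as np

    # votes: a user's past votes (+1/-1) over some submissions;
    # outcomes: +1/-1 for whether each submission met the site's goal.
    def user_weight(votes, outcomes):
        # Pearson correlation: accurate predictors get weight near +1,
        # reliably-wrong voters get a negative weight (their sign flips),
        # and noisy voters are muted by a weight near 0.
        return float(np.corrcoef(votes, outcomes)[0, 1])

    weights = {
        "alice": user_weight([1, 1, -1, 1], [1, 1, -1, 1]),    # +1.0
        "bob":   user_weight([1, -1, 1, -1], [-1, 1, -1, 1]),  # -1.0
    }

    def score(submission_votes):
        # Re-normalized, correlation-weighted tally for a new submission.
        total = sum(weights[u] * v for u, v in submission_votes.items())
        return total / max(1, len(submission_votes))

    print(score({"alice": 1, "bob": 1}))  # bob's upvote counts against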

ff_
And the most beautiful thing here is that his classification algorithm marks this entry as "Probably I shouldn't read this"

I love it: http://hn.njl.us/

harrystone
Are people really serious when they talk about the fabled Hacker News of old? What could it possibly have been like? I'm imagining something like zombo.com, only with lower contrast text.
onedognight
See for yourself.

    https://news.ycombinator.com/classic
Often the content on the main page is quite different.
0x0
Is there an explanation of what's going on for that page somewhere?
Terretta
Articles voted up by old timers.
riffraff
is there a definition of "old timers"? An account that's N years old? Signed up before year X?
icebraining
That's biased by the fact that many people who were upvoting back then have since left the site. If you want to see for yourself, the Wayback Machine is a better sample: https://web.archive.org/web/20071115044647/http://news.ycomb...
baddox
To my recollection, it was pretty similar, but with a lot fewer general news stories and a lot more stories about specific YC companies.
protomyth
Slashdot was fun back in the day (ID 64578), but most sites go through that progression. I think HN has changed in that general social stories show up more. I do think the weekends are a bit weirder now.
drblast
The way I remember it, it was like this site but more frequently updated and focused mostly on Haskell:

http://lambda-the-ultimate.org/

If you read a few rows down on that page, you'll see this:

"For the debate about MS being evil, you can head directly to HN where you'll also find an explanation of what bootstrapping a compiler means."

And that about sums it up. For a while I didn't even create an account because I didn't think I could add anything without sounding stupid compared to everyone else. Now I try to refrain from commenting for...different reasons.

icebraining
People keep saying how much better HN was, but I'm just not seeing it: https://web.archive.org/web/20071115044647/http://news.ycomb...
GhotiFish
Good catch, articles are saved as well.

I don't notice a shift in tenor between the crowd of old and the one we have now.

saraid216
> And that about sums it up. For a while I didn't even create an account because I didn't think I could add anything without sounding stupid compared to everyone else. Now I try to refrain from commenting for...different reasons.

Same here. Though I refrain less.

At some point, I'd like to go and find my first comment on here just to see what got me to make an account.

walrus
https://hn.algolia.com/#!/all/forever/prefix/0/by%3Asaraid21...
saraid216
That is the most disappointing thing I have seen all day. Oh well.
dkarapetyan
I actually did something like this at some point. I took all the high ranking items, tokenized them to extract features, and ran them through a bayesian classifier to do some filtering. I was just using whatever information was available on the front page and did not do any further analysis with the actual content.

The results were OK. Maybe with a bit more power it could be more useful, but the results were still hit and miss, and I didn't have a long-term strategy for avoiding filtering myself into a bubble other than continuously re-training the model.
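In scikit-learn terms that pipeline is roughly the following; the titles and labels are made up:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tokenize titles into bag-of-words features, then fit a
    # naive Bayes classifier. 1 = worth reading, 0 = dreck.
    titles = ["Show HN: a tiny Lisp in 200 lines",
              "Celebrity gossip roundup",
              "Why our startup switched databases",
              "You won't believe this one trick"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(titles, labels)
    print(clf.predict(["Show HN: a tiny Forth in 100 lines"]))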

zmk_
As an econometrician I cannot believe how many times he said 'magic'. There is something very wrong when you put things in your model 'because, who knows, it might be helpful' (like he did with host names). Variable selection is a very hard problem and using 'magic' is asking for problems. It is so disappointing to see machine learning, statistics, econometrics deal with similar problems and fail to learn from one another.
BWStearns
To be fair, he did explain that he thought the host name might be indicative of whether or not it would be dreck. If he knew exactly how that was the case (and if it was already known that it did have an effect), why bother with machine learning? Just write an explicit scoring mechanism.
zmk_
It did seem to me that this was an interpretation he came up with after he tried many different pipelines and "flipped all the switches". There are many sources of randomness that warrant using statistical methods. But it feels strange to me to see people use these tools without giving much thought to parameter stability, parameter significance, causality, and model selection in general.
rsingla
I do agree.

The presentation did a wonderful job of providing a high level introduction into the idea of machine learning but anyone that's strongly interested in ML should pick up some of the books he mentioned.

mikecb
That's exactly the recent criticism of 'big data': engineers and others getting correlations they don't understand from all the data they can collect, and attempting to use them for who knows what.
asdfologist
It's a completely harmless toy project, so who cares if he chose his variables non-scientifically? He's not creating a cancer diagnosis tool here.
zmk_
I understand this is a toy project, but he is in a position of educating people in how to use these methods, and it gives the wrong impression. The next guy might use this flawed logic while creating a tool for disease prevalence prediction.
soundoflight
Interestingly enough... the greens are all ones I clicked on earlier today and read. A few false negatives, but not bad!
randyrand
Just because I think it's worth mentioning, I find it ironic that the link to this video got marked as bad using his algorithm =)

Saw this at the link he provided.

vixin
Excellent, first-rate presentation. Those giving technical talks might want to take note.
mjfl
Very block-boxy ("and it does a whole bunch of math and voila!")
tdicola
Isn't that the point of tools like scikit-learn? You don't need to know how to code, optimize, etc. all the algorithms, just understand how to use them.
mjfl
Perhaps, but I feel like if you are trying to use a statistical tool, it would be best to know how it works. Imagine if every scientist claimed a discovery when they found a result with a 90% confidence interval. Machine learning (at least in this application) is different because the consequences are often testable and verifiable, but I still think it's better to know how it works than to treat it like a black box.
andreasvc
What makes you assert he is treating it like a black box? There isn't time in the presentation to go into detail, but actually linear models are inspectable, namely, you can obtain a list of features and how they are weighted. Also, as he said, the scikit-learn documentation is of high quality and explains how the models work. BTW you give an example of scientists, but like he stressed, machine learning as he applied it is a form of engineering.
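For example, a minimal sketch of that inspection (made-up data; the method is spelled get_feature_names_out in recent scikit-learn releases, get_feature_names in older ones):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    titles = ["great compiler writeup", "celebrity gossip",
              "gossip roundup", "compiler internals explained"]
    labels = [1, 0, 0, 1]

    vec = CountVectorizer()
    model = LogisticRegression().fit(vec.fit_transform(titles), labels)

    # Pair each token with its learned weight; large |weight| = influential.
    for word, w in sorted(zip(vec.get_feature_names_out(), model.coef_[0]),
                          key=lambda t: -abs(t[1])):
        print(f"{word:12s} {w:+.2f}")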
3rd3
There is probably a large group that lacks the advanced linear algebra and statistics needed to learn the theory but would still be able to build useful applications using an ML library. I think the video is mainly directed at that group.
skj
black box, not block box.
_archon_
I didn't even see that at first, although now that I look at it... I kind of like the term "block box". It takes a black box, and defines it in terms of how it's used, not what it does. It is a block that can be implemented in a certain way. How does it work? Doesn't matter at this level. It's a building block for a differently-focused project. A block box.
ColinWright
Yes. So are compilers. And web frameworks. And editors. And memory-management tools. Progress is made by no longer re-implementing and re-inventing the things that many, many people have invented and implemented in the past, and building on their work.

This doesn't mean that there is no value in learning about these things for yourself, but the packaging of knowledge in reusable tools is the only way programming progresses.

Dewie
Nicely encapsulated doesn't have to imply a black-box implementation, though. I for one would like it if compilers were less black-boxy; ideally, I want to find out why my compiler does a particular thing by investigating its output, querying the API, going through the compilation steps, etc., rather than having to google some StackOverflow answer.
swalsh
This guy right here made watching the video worth it: http://i.imgur.com/twr2j8Y.png
mey
The interactive map http://scikit-learn.org/dev/tutorial/machine_learning_map/in...
asdfologist
I find it odd that Ordinary Least Squares is missing from the map, even though it's probably more popular than all the other methods in that entire map combined.

However, it is mentioned at the top here: http://scikit-learn.org/stable/supervised_learning.html.

gone35
To add to simonster's comment [1]: confusingly, OLS is also morally equivalent to what the map calls "SGD regressor" with a squared loss function[2]. It is also nearly equivalent, with lots of caveats and many details aside, to SVR with a linear kernel and practically no regularization.

So yeah, it is confusing. There is a lot of overlap between several disciplines and it's still an emerging field.

[1] https://news.ycombinator.com/item?id=7713940

[2] http://scikit-learn.org/dev/modules/sgd.html#regression
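A quick check of that first equivalence (a sketch; the spellings of the loss and penalty arguments have shifted across scikit-learn versions):

    import numpy as np
    from sklearn.linear_model import LinearRegression, SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(2000, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.randn(2000)

    ols = LinearRegression().fit(X, y)
    # Squared loss and no penalty: SGD converges toward the OLS solution.
    sgd = SGDRegressor(loss="squared_error", penalty=None,
                       max_iter=10000, tol=1e-8, random_state=0).fit(X, y)
    print(ols.coef_)
    print(sgd.coef_)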

blahzay
It's also odd there's no mention of Logistic Regression.
gone35
Yeah, the nomenclature is not very rigorous and there is some overlap depending on how you look at it, but roughly, and without being pedantic, the closest in that map would be SGD with a logistic loss function[1].

[1] http://scikit-learn.org/dev/modules/sgd.html#classification
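A sketch of that correspondence (the loss was spelled "log" in older scikit-learn, "log_loss" in newer releases):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, SGDClassifier

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    lr = LogisticRegression().fit(X, y)
    # Same model family (linear + logistic loss), different optimizer.
    sgd = SGDClassifier(loss="log_loss", max_iter=5000,
                        random_state=0).fit(X, y)
    print(lr.score(X, y), sgd.score(X, y))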

simonster
OLS is a special case of ElasticNet, Lasso, and ridge regression with the regularization parameters set to zero. (The latter two are also special cases of ElasticNet with one of the two regularization parameters set to zero.) In the presence of many predictors or multicollinearity among the predictors, OLS tends to overfit the data and regularized models usually provide better predictions, although OLS still has its place in exploratory data analysis.
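Concretely, a minimal check of that special case, using ridge (Lasso and ElasticNet behave the same in the limit, though their coordinate-descent solver warns at exactly zero):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(1)
    X = rng.randn(500, 2)
    y = X @ np.array([3.0, -1.0]) + rng.randn(500)

    # With the regularization parameter at zero, ridge is plain OLS.
    print(LinearRegression().fit(X, y).coef_)
    print(Ridge(alpha=0.0).fit(X, y).coef_)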
nathanathan
It would be nice if scikit-learn included an "autolearn" function based on this flow-chart.
platz
Why is it that genetic algorithms never seem to be mentioned any more? Are they sub-standard, or just at a "higher level" than is typically talked about, i.e. you must implement them yourself?
mendicantB
This is not a good generalization. I've usually only seen this issue with optimization problems when:

1) You haven't played with the parameters

2) The implementation is not correct (usually the case with genetic algos, since they require a reasonable amount of domain expertise vs., say, GD)

mendicantB
Copped out when called out. Deleted comment said GAs were bad search algos and tend to get stuck at local minima.
bmh100
What evidence is informing your opinion that genetic algorithms are a bad search algorithm? What makes you say that they are very prone to getting stuck in local minima? Do you think they suffer from local minima more than, say, gradient descent?
erokar
> They are very prone to getting stuck in local minima.

That's quite a generalization. A GA's tendency to get stuck in local minima can be mitigated by adjusting population size, selection method/size and rate of mutation -- i.e. increase the randomness of the search.
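A toy GA in Python that exposes exactly those knobs, minimizing a function riddled with local minima (everything here is illustrative):

    import math
    import random

    # 1-D Rastrigin: global minimum at x = 0, local minima everywhere else.
    def fitness(x):
        return x * x - 10 * math.cos(2 * math.pi * x) + 10

    POP_SIZE = 100    # bigger population = more diverse search
    TOURNAMENT = 5    # bigger tournament = stronger selection pressure
    MUT_RATE = 0.3    # how often a child is mutated
    MUT_SIGMA = 0.5   # how far a mutation can jump

    random.seed(0)
    pop = [random.uniform(-5.0, 5.0) for _ in range(POP_SIZE)]
    for _ in range(200):
        new_pop = []
        for _ in range(POP_SIZE):
            # Tournament selection: best of a random sample reproduces.
            child = min(random.sample(pop, TOURNAMENT), key=fitness)
            # Mutation injects the randomness that escapes local minima.
            if random.random() < MUT_RATE:
                child += random.gauss(0.0, MUT_SIGMA)
            new_pop.append(child)
        pop = new_pop

    best = min(pop, key=fitness)
    print(best, fitness(best))  # should land near x = 0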

Russell91
Genetic algorithms are not really an off-the-shelf black box that you can just plug your data into and get results. They take a domain expert to use efficiently, and even then they aren't guaranteed to perform that well. The area that I've encountered where they are most effective is in approximation heuristics for NP-hard problems where you slowly assemble a solution from smaller pieces.
jules
Could you give a concrete example where a genetic algorithm performs well? I have never been able to find any such example.
platz
I'm not sure this in an example of a GA performing well, but it is an interesting write up https://web.archive.org/web/20130526010327/http://www.aelag....
mendicantB
+1. I'd also add that genetic algorithms are for optimization, and can't really be compared with most of the algorithms in that chart. They'd sit at a sub-level where different optimization techniques for finding model weights are compared, for each type of approach (classification, clustering, etc.).
simonster
Most (all?) of the algorithms on the chart iteratively optimize an objective. However, most of the objectives are convex or otherwise admit an optimization strategy that performs better than a genetic algorithm.
mendicantB
I believe you are repeating what I said (?). All of the algorithms have different methods of arriving at an objective function and leveraging its results. Yet most share the same problem in terms of optimizing it, and yes, most choose other routes.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.