HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
SVM with polynomial kernel visualization

udiprod · Youtube · 60 HN points · 1 HN comment
HN Theater has aggregated all Hacker News stories and comments that mention udiprod's video "SVM with polynomial kernel visualization".
Youtube Summary
A visual demonstration of the kernel trick in SVM.

This short video demonstrates how vectors from two classes that cannot be linearly separated in 2-D space can become linearly separable after a transformation into a higher-dimensional space.

The transformation used is:
f([x y]) = [x y (x^2+y^2)]
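
As a rough illustration (not part of the video itself), here is a minimal numpy sketch that applies this transformation to points inside and outside a circle and checks that a flat plane in the lifted 3-D space now separates the two classes:

    import numpy as np

    rng = np.random.default_rng(0)

    # 2-D points: class A inside the unit circle, class B in an outer ring.
    angles = rng.uniform(0.0, 2 * np.pi, 400)
    r = np.concatenate([rng.uniform(0.0, 0.8, 200), rng.uniform(1.2, 2.0, 200)])
    X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])

    # The transformation from the video: f([x, y]) = [x, y, x^2 + y^2].
    Z = np.column_stack([X, (X ** 2).sum(axis=1)])

    # In the lifted space the flat plane z = 1 separates the two classes.
    print((Z[:200, 2] < 1.0).all(), (Z[200:, 2] > 1.0).all())  # True True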

If you would like a stand-alone file with a high-res version of this movie for academic purposes please contact me.

Visit my homepage https://www.udiprod.com/, or read about my latest book "Zuto: The Adventures of a Computer Virus", http://www.zutopedia.com
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
The video you linked "explains" it very well to visual learners such as myself: http://www.youtube.com/watch?v=3liCbRZPrZA

If a picture is worth a thousand words, a second of video is truly worth 29,970 words.

snippyhollow
This video helped me too. Understanding something with one explanation is very good, but making the link with a second and a third one is even better.
Apr 27, 2010 · 60 points, 19 comments · submitted by jashmenn
bprater
Cool video. I have no idea what it's describing though. Can anyone generalize it?
lotharbot
It's describing a technique for classifying data into one of two categories (A or B) statistically, based on existing observations of data from categories A and B.

If we could just draw a straight line or plane that separated all of the A observations from the B observations (that is, A and B are "linearly separable"), keeping as far away from both as possible, we could say that any observation that hits the line is 50/50, and anything on one side or the other has a particular statistical chance of being either A or B depending on how far away it is. (The fact that it's straight makes the statistics easy.)

But not all data sets can be separated easily with a straight line or plane; sometimes they're best separated with some kind of curve ("non-linearly separable"). The "trick" here is to create a higher-dimensional curved surface -- basically twisting our original space where A and B are -- in such a way that now we can make a straight cut between A and B on the twisted surface. The way our cut intersects the new curved surface corresponds to the best separating curve in the original space. (The hard part is picking the mathematical formula or "kernel" to create the twisted surface.)

So, for example, in the video all of the A points were outside of a particular circle and all of the B points were inside the circle. We create the curved surface above and copy all of the A and B points over. This allows us to separate A from B with a straight plane; its intersection with our curved space corresponds to a circle in the original space.

See also: http://en.wikipedia.org/wiki/Support_vector_machine
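
A small sketch of that correspondence, assuming scikit-learn is available (the constants are arbitrary): fit a linear SVM on the lifted points, then read the separating circle back out of the plane's coefficients.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(1)

    # Class B inside a circle of radius ~1, class A outside it, as in the video.
    angles = rng.uniform(0.0, 2 * np.pi, 400)
    r = np.concatenate([rng.uniform(0.0, 0.7, 200), rng.uniform(1.3, 2.0, 200)])
    X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
    y = np.array([0] * 200 + [1] * 200)

    # Lift with f([x, y]) = [x, y, x^2 + y^2] and make a straight cut.
    Z = np.column_stack([X, (X ** 2).sum(axis=1)])
    clf = LinearSVC(C=10.0).fit(Z, y)

    # The plane w0*x + w1*y + w2*(x^2 + y^2) + b = 0 is a circle in 2-D.
    w0, w1, w2 = clf.coef_[0]
    b = clf.intercept_[0]
    center = (-w0 / (2 * w2), -w1 / (2 * w2))
    radius = np.sqrt(center[0] ** 2 + center[1] ** 2 - b / w2)
    print("separating circle: center", center, "radius", radius)  # roughly (0, 0) and 1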

seiji
Hand-wavey non-math details:

Imagine a wall with illuminated colored shadows positioned randomly on it. The colored shadows are created by semi-transparent colored balls some distance from the wall. Each set of balls of the same color is the same distance from the wall (e.g. the red ones are 1m back, the blue ones are 2m back, etc.).

Now, you are just looking at the wall. You don't know there is anything not on the wall (i.e. the higher dimensional objects projecting the pattern on the two dimensional surface). You just see colors interspersed seemingly randomly.

In order to make sense of all the two dimensional randomness, you can up-project the points to a higher dimensional space (i.e. try to re-create what projected them down in the first place). That's the kernel trick. You're projecting lower dimensional data into a higher dimensional space.

The goal of all this is to make your data linearly separable (i.e. so you can draw a line (2d) or plane (3d) or hyperplane (4d+) between your classes (colors) of data). You can't separate the colors when they are random points on the wall, but you sure as heck can separate the colors when you see the colored balls grouped by individual color at specific offsets from the wall.

That's a very simple 2d to 3d projection where you would draw a plane to separate your colors. For more interesting classes of data, you'll probably end up in much higher dimensional spaces.

j-g-faustus
Classification. The goal is to find the simplest description that will separate two sets of data (here: the points inside the circle from the ones outside).

Linear classifiers - in 2D, a straight line that separates the two groups - are preferred when possible since linear is as simple as it gets. http://en.wikipedia.org/wiki/Linear_classifier

The video shows how a classification task that cannot be handled by a linear classifier in 2D can still be described by a linear equation (a flat plane) by transforming the data to 3D.
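
For instance (a rough sketch with made-up data, assuming scikit-learn), the same linear classifier that fails on the raw 2-D points succeeds after the 3-D lift:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(2)
    angles = rng.uniform(0.0, 2 * np.pi, 400)
    r = np.concatenate([rng.uniform(0.0, 0.7, 200), rng.uniform(1.3, 2.0, 200)])
    X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
    y = np.array([0] * 200 + [1] * 200)
    lifted = np.column_stack([X, (X ** 2).sum(axis=1)])

    print("linear in 2-D:", LinearSVC().fit(X, y).score(X, y))            # ~0.5, chance level
    print("linear in 3-D:", LinearSVC().fit(lifted, y).score(lifted, y))  # ~1.0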

nas
How about something more concrete? It could be applied to a pattern recognition problem, for example classifying email messages as spam or not-spam. The dots in 2-D space represent some characteristics of the message (a poor example would be the number of words and the average number of letters per word).
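
A toy sketch of that setup, assuming scikit-learn (the two features are the deliberately poor ones named above, and the tiny dataset is made up purely for illustration):

    import numpy as np
    from sklearn.svm import SVC

    messages = ["win cash now now now", "meeting moved to friday afternoon",
                "cheap pills click here now", "lunch tomorrow with the team"]
    labels = np.array([1, 0, 1, 0])  # 1 = spam, 0 = not-spam

    def features(text):
        # The two characteristics from the comment: word count, mean word length.
        words = text.split()
        return [len(words), np.mean([len(w) for w in words])]

    X = np.array([features(m) for m in messages])
    clf = SVC(kernel="rbf").fit(X, labels)
    print(clf.predict([features("free cash click now now")]))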
smanek
It's been a few years, but I'll take a shot.

1. SVMs are useful for categorizing items.

2. SVMs can only separate 'linearly separable' sets. That is, if you were to plot the items from the two sets on a plane, you could distinguish the two sets by drawing a line on the graph such that everything above the line is in one set, while everything below is in the other.

3. The kernel trick is adding dimensions to a problem, so even though the problem isn't linearly separable in two dimensions, it might be in 3 by setting the right values in your extra dimension.

tlipcon
Neat visualization, but I wouldn't say this is the kernel trick. The kernel trick is a trick because it replaces dot products in the higher-dimensional space with a kernel function evaluated in the lower-dimensional one. The neat thing is specifically that a kernel function on low-dimensional points can actually correspond to a dot product in an infinite-dimensional space. In fact this is the case for RBF kernels like the Gaussian, which is probably the most popular SVM kernel choice.
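
A quick numeric check of that correspondence for a simple degree-2 polynomial kernel (this particular feature map is a standard textbook choice, not something from the video):

    import numpy as np

    def poly_kernel(x, y):
        # Degree-2 polynomial kernel, evaluated directly in the low (2-D) space.
        return np.dot(x, y) ** 2

    def phi(v):
        # Explicit feature map whose dot product reproduces the kernel above.
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x, y = np.array([1.5, -0.3]), np.array([2.0, 0.4])
    print(poly_kernel(x, y))   # 8.2944, computed in 2-D
    print(phi(x) @ phi(y))     # the same number (up to rounding), via a 3-D dot product
    # For RBF/Gaussian kernels the corresponding feature space is
    # infinite-dimensional, so no such explicit check is possible.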
jules
Can SVM fit a circle with variable origin and variable radius to a set?
ajj
It can in principle, and this is exactly where the kernel trick is useful.

The standard SVM formulation can only give linear classifiers. But if you project your data into feature space (a higher, possibly infinite-dimensional space), a linear separator in that space can be a circle in your original space. Since you cannot do explicit computations in an infinite-dimensional space, the kernel trick lets you get away without doing them at all: you can get the inner product of two infinite-dimensional vectors using a kernel function. So classifiers that only require inner product values, and never the explicit vectors, can exploit the kernel trick, e.g. SVM, logistic regression, etc.

That being said, choosing the appropriate kernel function is not always straightforward for your data.

jules
Can you give a kernel function that would let you fit a circle with variable position and radius? The examples always use a fixed position.
ajj
The Radial Basis Function kernel K(x,y) = exp(-||x-y||^2 / gamma) is a very general kernel function that can get you this.

Finally, whether you actually get such a classifier depends on your data, and as I mentioned, setting the parameters (like gamma) is not always straightforward. Typically you may have to try various values and see what works for your data.

Train your SVM (or any kernel method) and see what classifier it gives (run the classifier on data points near the boundary you want to detect to see where the exact boundary lies). This would only be to see what the classification function looks like.
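
A minimal sketch of that procedure, assuming scikit-learn (gamma and C are picked by hand here rather than tuned): fit an RBF-kernel SVM to points inside and outside a circle that is not centered at the origin, then probe predictions just inside and just outside the true boundary.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    center, radius = np.array([2.0, -1.0]), 1.5   # arbitrary position and radius

    angles = rng.uniform(0.0, 2 * np.pi, 600)
    r = np.concatenate([rng.uniform(0.0, 0.8 * radius, 300),
                        rng.uniform(1.2 * radius, 2.0 * radius, 300)])
    X = center + np.column_stack([r * np.cos(angles), r * np.sin(angles)])
    y = np.array([1] * 300 + [0] * 300)           # 1 = inside, 0 = outside

    clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

    # Probe just inside and just outside the true circle, along the x-axis.
    probes = np.array([center + [0.9 * radius, 0.0], center + [1.1 * radius, 0.0]])
    print(clf.predict(probes))                    # typically [1 0]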

olefoo
Can you recommend a good book that would get mathematically literate laymen a functional understanding of how to pick classifiers for data sets?
lliiffee
Technically what is visualized here isn't "the kernel trick". This is the general idea of how nonlinearly projecting some points into a higher-dimensional feature space makes linear classifiers more powerful. You can do this without SVMs. Just compute the high-dimensional features corresponding to your data, then use logistic regression or whatever. Trouble is, if the higher-dimensional space is really big, this could be expensive. The "kernel trick" is a computational trick that SVMs use to compute the inner product between the high-dimensional features corresponding to two points without explicitly computing the high-dimensional features. (For certain special feature spaces.)

But this is definitely a cool visualization of the value of feature spaces!
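
A sketch of that non-kernel route, assuming scikit-learn: expand the features explicitly and hand them to plain logistic regression.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(4)
    angles = rng.uniform(0.0, 2 * np.pi, 400)
    r = np.concatenate([rng.uniform(0.0, 0.7, 200), rng.uniform(1.3, 2.0, 200)])
    X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
    y = np.array([0] * 200 + [1] * 200)

    # Explicitly build the degree-2 features (1, x, y, x^2, xy, y^2), then use a
    # plain linear classifier on them; no kernel trick anywhere.
    model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
    print(model.fit(X, y).score(X, y))   # ~1.0 on this circular toy data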

moultano
Can you explain more about how the kernel trick specifically works?
lliiffee
Basically, the idea of feature spaces is to blow up the data into high dimensions. So, we use

x' = f(x)

as our data, instead of x. It turns out that in lots of machine learning algorithms (notably SVMs), you end up only needing inner products between different data elements. That is, we need to compute x^T y for two data elements x and y. In feature space, we could compute this by doing f(x)^T f(y). However, it turns out that for certain feature spaces (like polynomials) one can compute the number f(x)^T f(y) quite quickly without ever explicitly forming the big vectors x' or y'.
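
A concrete check of the "only inner products" point, assuming scikit-learn and a hand-picked gamma so the kernel can be re-evaluated outside the model: a trained SVM's decision values can be reproduced from kernel evaluations against its support vectors alone, without ever materializing f(x).

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 2))
    y = ((X ** 2).sum(axis=1) > 1.0).astype(int)   # inside vs. outside the unit circle

    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

    # The decision function is a weighted sum of kernel values against the
    # support vectors: only inner products (via the kernel) ever appear.
    X_new = rng.normal(size=(5, 2))
    K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)   # shape (5, n_SV)
    manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
    print(np.allclose(manual, clf.decision_function(X_new)))   # True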

moultano
You stopped just short of the explanation I was hoping for. :)
gaika
see http://videolectures.net/mlss09uk_schoelkopf_km/ - kernel methods in general, not limited to SVM
lliiffee
Try section 7 of these notes:

http://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf

iskander
While the computational benefits are certainly nice, I think that Mercer's theorem is much deeper than that. It says that any positive definite function (http://en.wikipedia.org/wiki/Positive-definite) is an inner product, though the space of that inner product might be infinite-dimensional. The consequences of this go much further than just speeding up algorithms -- we can take plain old linear methods like SVMs and apply them to any data type on which we can define a kernel. This is huge, because it allows machine learning to break free of the vector world in which the algorithms are classically defined. Want to classify gene sequences? You don't have to hack them into a fixed-width vector, just define a string kernel for genes. Want to classify fragments of lambda calculus? Just define an appropriate tree kernel. It's wonderfully powerful.
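
A toy sketch of that idea, assuming scikit-learn's precomputed-kernel interface (the 3-mer "spectrum" kernel and the short sequences are made up for illustration): define a kernel directly on strings and feed the resulting Gram matrix to an SVM.

    import numpy as np
    from collections import Counter
    from sklearn.svm import SVC

    def spectrum_kernel(a, b, k=3):
        # Count shared k-mers: an inner product in the space of k-mer counts.
        ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
        cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
        return sum(ca[kmer] * cb[kmer] for kmer in ca)

    train = ["ATATATGC", "ATATGCGC", "GGCCGGCC", "GGGGCCCC"]
    labels = np.array([0, 0, 1, 1])   # AT-rich vs. GC-rich sequences

    gram = np.array([[spectrum_kernel(a, b) for b in train] for a in train])
    clf = SVC(kernel="precomputed").fit(gram, labels)

    test = ["ATATATAT", "GGCCCCGG"]
    k_test = np.array([[spectrum_kernel(t, b) for b in train] for t in test])
    print(clf.predict(k_test))        # expected [0 1]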
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.