Hacker News Comments on
SVM with polynomial kernel visualization
udiprod · YouTube · 60 HN points · 1 HN comment
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.

The video you linked "explains" it very well to visual learners such as myself: http://www.youtube.com/watch?v=3liCbRZPrZA

If a picture is worth a thousand words, a second of video is truly worth 29,970 words.
⬐ snippyhollow
This video helped me too. Understanding something with one explanation is very good, but making the link with a second and a third one is even better.
⬐ bprater
Cool video. I have no idea what it's describing, though. Can anyone generalize it?
⬐ lotharbot
⬐ tlipcon
It's describing a technique for classifying data into one of two categories (A or B) statistically, based on existing observations of data from categories A and B.
If we could just draw a straight line or plane that separated all of the A observations from the B observations (that is, A and B are "linearly separable"), keeping as far away from both as possible, we could say that any observation that hits the line is 50/50, and anything on one side or the other has a particular statistical chance of being either A or B, depending on how far away it is. (The fact that it's straight makes the statistics easy.)
But not all data sets can be separated easily with a straight line or plane; sometimes they're best separated with some kind of curve ("non-linearly separable"). The "trick" here is to create a higher-dimensional curved surface -- basically twisting our original space where A and B are -- in such a way that now we can make a straight cut between A and B on the twisted surface. The way our cut intersects the new curved surface corresponds to the best separating curve in the original space. (The hard part is picking the mathematical formula or "kernel" to create the twisted surface.)
So, for example, in the video all of the A points were outside of a particular circle and all of the B points were inside the circle. We create the curved surface above and copy all of the A and B points over. This allows us to separate A from B with a straight plane; its intersection with our curved space corresponds to a circle in the original space.
See also: http://en.wikipedia.org/wiki/Support_vector_machine
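The circle example above can be sketched in a few lines of Python. The paraboloid lift z = x^2 + y^2 is an assumed choice of surface for this sketch; it happens to turn the circular boundary into the flat plane z = 1:

```python
import random

random.seed(0)

# Class A: points outside the unit circle; class B: points inside it.
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(200)]
labels = ["A" if x * x + y * y > 1.0 else "B" for x, y in points]

# Lift each 2-D point to 3-D by adding z = x^2 + y^2 (a paraboloid).
lifted = [(x, y, x * x + y * y) for x, y in points]

# In the lifted space, the flat plane z = 1 separates the classes exactly;
# its intersection with the paraboloid projects back to the unit circle.
predicted = ["A" if z > 1.0 else "B" for _, _, z in lifted]
assert predicted == labels
```

The straight cut in 3-D (a plane) is what corresponds to the curved cut (a circle) back in 2-D.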
⬐ seiji
Hand-wavey non-math details: Imagine a wall with illuminated colored shadows positioned randomly on it. The colored shadows are created by semi-transparent colored balls some distance from the wall. Each set of balls of the same color is the same distance from the wall (e.g. the red ones are 1m back, the blue ones are 2m back, etc.).
Now, you are just looking at the wall. You don't know there is anything not on the wall (i.e. the higher dimensional objects projecting the pattern on the two dimensional surface). You just see colors interspersed seemingly randomly.
In order to make sense of all the two dimensional randomness, you can up-project the points to a higher dimensional space (i.e. try to re-create what projected them down in the first place). That's the kernel trick. You're projecting lower dimensional data into a higher dimensional space.
The goal of all this is to make your data linearly separable (i.e. so you can draw a line (2d) or plane (3d) or hyperplane (4d+) between your classes (colors) of data). You can't separate the colors when they are random points on the wall, but you sure as heck can separate the colors when you see the colored balls grouped by individual color at specific offsets from the wall.
That's a very simple 2d to 3d projection where you would draw a plane to separate your colors. For more interesting classes of data, you'll probably end up in much higher dimensional spaces.
⬐ j-g-faustus
Classification. The goal is to find the simplest description that will separate two sets of data (here: the points inside the circle from the ones outside).
Linear classifiers - in 2D, a straight line that separates the two groups - are preferred when possible, since linear is as simple as it gets. http://en.wikipedia.org/wiki/Linear_classifier
The video shows how a classification task that cannot be handled by a linear classifier in 2D can still be described by a linear equation (a flat plane) by transforming the data to 3D.
⬐ nas
How about something more concrete? It could be applied to a pattern recognition problem: for example, classifying email messages as spam and not-spam. The dots in 2-D space represent some characteristics of the message (a poor example would be number of words and average number of letters per word).
⬐ smanek
It's been a few years, but I'll take a shot.
1. SVMs are useful for categorizing items.
2. SVMs can only categorize on 'linearly separable' sets. That is, if you were to plot the items from the two sets on a plane, it's possible to distinguish the two sets by drawing a line on the graph and everything above the line is in one set, while everything below is in the other.
3. The kernel trick is adding dimensions to a problem, so even though the problem isn't linearly separable in two dimensions, it might be in 3 by setting the right values in your extra dimension.
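Point 3 can be seen in miniature with XOR-style data, which no straight line in 2-D separates. The extra coordinate z = x1 * x2 is a hand-picked choice for this sketch, but it makes a flat plane suffice:

```python
# XOR-style data: not linearly separable in 2-D.
data = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, -1, -1, 1]  # label = sign(x1 * x2)

# Add a third coordinate z = x1 * x2; the flat plane z = 0 now separates them.
augmented = [(x1, x2, x1 * x2) for x1, x2 in data]
predicted = [1 if z > 0 else -1 for _, _, z in augmented]
assert predicted == labels
```

"Setting the right values in your extra dimension" is exactly what the kernel's feature map does, just without computing the extra coordinates explicitly.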
Neat visualization, but I wouldn't say this is the kernel trick. The kernel trick is a trick because it maps the higher-dimension dot products into a kernel function in the lower dimension. The neat thing is specifically that a kernel function on low-dimension points can actually correspond to a dot product in an infinite-dimensional space. In fact, this is the case for RBF kernels like the Gaussian, which is probably the most popular SVM kernel choice.
⬐ jules
Can SVM fit a circle with variable origin and variable radius to a set?
⬐ ajj
⬐ lliiffee
It can in principle, and this is exactly where the kernel trick is useful.
The standard SVM formulation can only give linear classifiers. But if you project your data into feature space (a higher, possibly infinite-dimensional space), a linear separator in that space can be a circle in your original space. Since you cannot do explicit computations in an infinite-dimensional space, the kernel trick lets you get away without doing them at all. You can get the inner product of two infinite-dimensional vectors using a kernel function. So classifiers that only require inner product values, and never the explicit vectors, can exploit the kernel trick: e.g. SVMs, logistic regression, etc.
That being said, choosing the appropriate kernel function is not always straightforward for your data.
⬐ jules
Can you give a kernel function that would let you fit a circle with variable position and radius? The examples always use a fixed position.
⬐ ajj
⬐ olefoo
The radial basis function (RBF) kernel K(x, y) = exp(-||x - y||^2 / gamma) is a very general kernel function that can get you this.
Finally, whether you actually get such a classifier depends on your data, and, as I mentioned, setting the parameters (like gamma) is not very straightforward. Typically you have to try various values and see what works for your data.
Train your SVM (or any kernel method) and see what classifier it gives (run the classifier on data points near the boundary you want to detect to see where the exact boundary lies). This would only be to see what the classification function looks like.
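A minimal sketch of this idea, using a kernel perceptron rather than a full SVM (a simpler kernel method that also only ever needs kernel values): with an RBF kernel it learns a circular boundary whose center is nowhere near the origin. The data layout, gamma = 1.0, and the epoch budget are all assumptions for the sketch:

```python
import math
import random

random.seed(1)

def rbf(x, y, gamma=1.0):
    # Gaussian RBF kernel: exp(-gamma * ||x - y||^2)
    d2 = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    return math.exp(-gamma * d2)

# A circle with a non-origin center (1.5, 0.5) and radius 1; leave a gap
# between the classes so training converges quickly.
cx, cy = 1.5, 0.5
points, labels = [], []
for _ in range(100):
    angle = random.uniform(0, 2 * math.pi)
    r = random.choice([random.uniform(0.0, 0.7), random.uniform(1.3, 2.0)])
    points.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    labels.append(1 if r < 1.0 else -1)  # +1 inside the circle, -1 outside

def predict(p, alpha):
    # Classifier = weighted sum of kernel values; no explicit feature-space
    # coordinates are ever computed.
    s = sum(a * yl * rbf(q, p) for a, yl, q in zip(alpha, labels, points))
    return 1 if s > 0 else -1

# Kernel perceptron training: bump a point's weight whenever it is misclassified.
alpha = [0.0] * len(points)
for _ in range(50):
    mistakes = 0
    for i, (p, y) in enumerate(zip(points, labels)):
        if predict(p, alpha) != y:
            alpha[i] += 1.0
            mistakes += 1
    if mistakes == 0:
        break

correct = sum(1 for p, y in zip(points, labels) if predict(p, alpha) == y)
```

Probing `predict` on a grid of test points near the circle would show where the learned boundary actually lies, which is the diagnostic suggested above.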
Can you recommend a good book that would get mathematically literate laymen a functional understanding of how to pick classifiers for data sets?
Technically, what is visualized here isn't "the kernel trick". This is the general idea of how nonlinearly projecting some points into a higher-dimensional feature space makes linear classifiers more powerful. You can do this without SVMs: just compute the high-dimensional features corresponding to your data, then use logistic regression or whatever. The trouble is, if the higher-dimensional space is really big, this could be expensive. The "kernel trick" is a computational trick that SVMs use to compute the inner product between the high-dimensional features corresponding to two points without explicitly computing the high-dimensional features. (For certain special feature spaces.)
But this is definitely a cool visualization of the value of feature spaces!
⬐ moultano
Can you explain more about how the kernel trick specifically works?
⬐ lliiffee
⬐ iskander
Basically, the idea of feature spaces is to blow up the data into high dimensions. So, we use
x' = f(x)
as our data, instead of x. It turns out that in lots of machine learning algorithms (notably SVMs), you end up only needing inner products between different data elements. That is, we need to compute x^T y for two data elements x and y. In feature space, we could compute this by doing f(x)^T f(y). However, it turns out that for certain feature spaces (like polynomials), one can compute the number f(x)^T f(y) quite quickly without ever explicitly forming the big vectors x' or y'.
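The polynomial case can be checked numerically. For the degree-2 homogeneous kernel (x^T y)^2 in 2-D, the matching explicit feature map is f(x) = (x1^2, x2^2, sqrt(2) * x1 * x2); the specific x and y below are arbitrary test values:

```python
import math

def poly_kernel(x, y):
    # Homogeneous degree-2 polynomial kernel: (x . y)^2, computed in 2-D.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def features(x):
    # Explicit degree-2 feature map whose inner product matches the kernel.
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
fx, fy = features(x), features(y)
explicit = sum(a * b for a, b in zip(fx, fy))
assert abs(poly_kernel(x, y) - explicit) < 1e-9
```

The kernel side is one dot product and a squaring; the explicit side had to build 3-D vectors first, and the gap widens rapidly with dimension and degree.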
⬐ moultano
You stopped just short of the explanation I was hoping for. :)
⬐ gaika
See http://videolectures.net/mlss09uk_schoelkopf_km/ - kernel methods in general, not limited to SVM.
⬐ lliiffee
Try section 7 of these notes: http://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
While the computational benefits are certainly nice, I think that Mercer's theorem is much deeper than that. It says that any positive definite function (http://en.wikipedia.org/wiki/Positive-definite) is an inner product, though the space of that inner product might be infinite-dimensional. The consequences of this go much further than just speeding up algorithms -- we can take plain old linear methods like SVMs and apply them to any data type on which we can define a kernel. This is huge, because it allows machine learning to break free of the vector world in which the algorithms are classically defined. Want to classify gene sequences? You don't have to hack them into a fixed-width vector; just define a string kernel for genes. Want to classify fragments of lambda calculus? Just define an appropriate tree kernel. It's wonderfully powerful.
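As a toy instance of the string-kernel idea, the k-mer "spectrum" kernel (a standard string kernel, chosen here as an assumed example) is just the inner product of k-mer count vectors, and it works directly on strings of different lengths:

```python
from collections import Counter

def spectrum_features(s, k=2):
    # Explicit (sparse) feature vector: counts of every k-mer in s.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=2):
    # Inner product of the two count vectors, summed only over k-mers that
    # actually occur, never over the full |alphabet|^k-dimensional space.
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(fs[m] * ft[m] for m in fs)

# Shared 2-mers of GATTACA and ATTAC: AT, TT, TA, AC -> kernel value 4.
value = spectrum_kernel("GATTACA", "ATTAC")
assert value == 4
```

Because this is a valid inner product, any inner-product-only algorithm (SVM, kernel perceptron, logistic regression) can classify raw sequences with it, with no fixed-width vector encoding needed.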