Summary: This article is the transcript of lecture 112, "Choosing the Number of Clusters", from Chapter 14, "Unsupervised Learning", of Andrew Ng's Machine Learning course. I took these notes down while watching the videos and edited them to be more concise and readable, for later reference, and I'm sharing them here. If there are any errors, corrections are sincerely welcome and appreciated. I hope these notes are helpful to you.
————————————————
In this video, I'd like to talk about one last detail of K-means clustering, which is how to choose the number of clusters, or how to choose the value of the parameter capital K. To be honest, there actually isn't a great way of answering this or doing this automatically. By far the most common way of choosing the number of clusters is still choosing it manually, by looking at visualizations, or by looking at the output of the clustering algorithm, or something else. But I do get asked this question quite a lot: how do you choose the number of clusters? So, I just want to tell you what people's current thinking on it is, although the most common thing is actually to choose the number of clusters by hand.
A large part of why it might not always be easy to choose the number of clusters is that it's often genuinely ambiguous how many clusters there are in the data. Looking at this data set, some of you may see four clusters, and that would suggest using K=4. Or some of you may see two clusters, and that would suggest K=2. And yet others may see three clusters. So, looking at a data set like this, the true number of clusters actually seems genuinely ambiguous to me, and I don't think there is one right answer. And this is part of unsupervised learning. We aren't given labels, so there isn't always a clear-cut answer. And this is one of the things that makes it more difficult to, say, have an automatic algorithm for choosing how many clusters to have.
When people talk about ways of choosing the number of clusters, one method that people sometimes talk about is something called the Elbow Method. Let me just tell you a little bit about that, and then mention some of its advantages but also its shortcomings. In the Elbow Method, what we're going to do is vary K, which is the total number of clusters. So, we're going to run K-means with one cluster, which means really everything gets grouped into a single cluster, and compute the cost function, or compute the distortion, and plot that here. And then we're going to run K-means with two clusters, maybe with multiple random initializations, maybe not. But with two clusters we should hopefully get a smaller distortion, and so plot that there. And then run K-means with three clusters, and hopefully you get an even smaller distortion, and plot that there. Then run K-means with four, five clusters and so on. So we end up with a curve showing how the distortion goes down as we increase the number of clusters. And we get a curve that maybe looks like this. If you look at this curve, what the Elbow Method does is it says, "Well, let's look at this plot. It looks like there is a clear elbow there." This is an analogy to the human arm where, if you imagine that you reach out your arm, then this is your shoulder joint, this is your elbow joint, and I guess your hand is at the end over here. So that's the Elbow Method. You find this sort of pattern where the distortion goes down rapidly from 1 to 2 and from 2 to 3, then you reach an elbow at 3, and the distortion goes down very slowly after that. And then it looks like maybe using 3 clusters is the right number of clusters, because that's the elbow of this curve, right? If you apply the Elbow Method and you get a plot that actually looks like this, then that's pretty good, and this would be a reasonable way of choosing the number of clusters. It turns out that the Elbow Method isn't used that often, and one reason is that, if you actually use this on a clustering problem, it turns out that fairly often you end up with a curve that looks much more ambiguous, maybe something like this. If you look at this, I don't know, maybe there's no clear elbow, right? It looks like the distortion continuously goes down: maybe 3 is a good number, maybe 4 is a good number, maybe 5 is also not bad. So, if you actually do this in practice and your plot looks like the one on the left, then that's great. It gives you a clear answer. But just as often, you end up with a plot that looks like the one on the right, and it's not clear where the elbow really is, which makes it harder to choose the number of clusters using this method. So maybe the quick summary of the Elbow Method is that it's worth a shot, but I wouldn't necessarily have a very high expectation of it working for any particular problem.
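To make the procedure above concrete, here is a minimal sketch of the Elbow Method in Python. The use of scikit-learn's KMeans and the synthetic toy data are my own assumptions for illustration; they are not part of the course materials, which use Octave/MATLAB.

```python
# Minimal sketch of the Elbow Method: run K-means for K = 1..10,
# record the distortion for each K, and plot distortion against K.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy 2-D data with a few loose groups (a stand-in for the ambiguous
# data set discussed in the lecture).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(50, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

ks = range(1, 11)
distortions = []
for k in ks:
    # n_init random initializations per K, keeping the best run,
    # as suggested in the earlier video on random initialization.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid;
    # it differs from the course's distortion only by the 1/m factor,
    # so the shape of the curve (and any elbow) is the same.
    distortions.append(km.inertia_)

plt.plot(list(ks), distortions, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.show()
```

Whether the resulting curve shows a clear elbow or just keeps sliding down depends entirely on the data, exactly as discussed above.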
Finally, here's one other way of thinking about how you choose the value of K. Very often people are running K-means in order to get clusters for some later purpose, or some sort of downstream purpose. Maybe you want to use K-means in order to do market segmentation, like in the T-shirt sizing example that we've talked about. Maybe you want K-means to organize a computer cluster better, or maybe you're learning clusters for some different purpose. And so if that later downstream purpose, such as market segmentation, gives you an evaluation metric, then often a better way to determine the number of clusters is to see how well different numbers of clusters serve that later downstream purpose. Let me step through a specific example. Let me go through the T-shirt sizing example again. I'm trying to decide: do I want three T-shirt sizes? So, I choose K=3, and then I might have small (S), medium (M) and large (L) T-shirt sizes. Or maybe I want to choose K=5, and then I have extra small (XS), small (S), medium (M), large (L) and extra large (XL) T-shirt sizes. So, you can have 3 T-shirt sizes or 5 T-shirt sizes. We could also have four T-shirt sizes. So, if I run K-means with K=3, maybe I end up with: that's my small, that's my medium and that's my large. Whereas, if I run K-means with 5 clusters, maybe I end up with: those are my extra small T-shirts, these are my small, these are my medium, these are my large and these are my extra large. And the nice thing about this example is that this then maybe gives us another way to choose whether we want 3 or 4 or 5 clusters. In particular, what you can do is think about this from the perspective of the T-shirt business and ask: "Well, if I have five segments, then how well will my T-shirts fit my customers? How many T-shirts can I sell? How happy will my customers be?" What really makes sense from the perspective of the T-shirt business? Do I want more T-shirt sizes so that my T-shirts fit my customers better, or do I want fewer T-shirt sizes so that I manufacture fewer sizes of T-shirts and can sell them to the customers more cheaply? And so, the T-shirt selling business, that might give you a way to decide between three clusters versus five clusters. So, that gives you an example of how a later downstream purpose, like the problem of deciding what T-shirts to manufacture, can give you an evaluation metric for choosing the number of clusters. For those of you that are doing the programming exercises, if you look at this week's programming exercise associated with K-means, there's an example there of using K-means for image compression. And so if you were trying to choose how many clusters to use for that problem, you could also, again, use the evaluation metric of image compression to choose the number of clusters K. So, how good do you want the image to look versus how much do you want to compress the file size of the image? And if you do the programming exercise, what I've just said will make more sense at that time.
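As a rough sketch of this downstream-purpose idea, the snippet below compresses an image with K-means for a few values of K and reports the reconstruction error, so you can weigh image quality against how many colors are kept. The scikit-learn and Pillow usage and the file name bird_small.png are assumptions made here for illustration; this is not the course's own Octave implementation.

```python
# Rough illustration of choosing K by a downstream purpose: compress an
# image by replacing each pixel with its nearest cluster centroid, then
# compare quality (reconstruction error) against palette size K.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Hypothetical input file; any small RGB image will do.
img = np.asarray(Image.open('bird_small.png').convert('RGB'), dtype=float) / 255.0
pixels = img.reshape(-1, 3)          # one row per pixel, 3 color channels

for k in [4, 8, 16]:
    km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(pixels)
    # Reconstruct the image using only the K centroid colors.
    compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
    mse = np.mean((img - compressed) ** 2)   # downstream metric: how good it looks
    print(f'K={k:2d}: reconstruction MSE = {mse:.5f} '
          f'({k} colors stored instead of the full palette)')
```

Under this view, the "right" K is whichever value gives an acceptable-looking image at the file size you're willing to pay for, rather than whatever the elbow of a distortion curve happens to suggest.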
So, just to summarize: for the most part, the number of clusters K is still chosen by hand, by human input or human insight. One way to try to do so is to use the Elbow Method, but I wouldn't always expect that to work well. I think the better way to think about how to choose the number of clusters is to ask, for what purpose are you running K-means? And then to think, what is the number of clusters K that serves whatever later purpose you actually run K-means for?
<end>