This is where I initialize it. The number of clusters here: even though there looked to be three natural clusters, we are going with four, so we are going to build four clusters here. Look at this. I have built four clusters. Look at the output. This is cluster 0's centroid, cluster 1's centroid, cluster 2's centroid, cluster 3's centroid. Are you comfortable with this, all of you? What are these values?
The position of the centroid in the various dimensions. That is what this is. Okay, so each row is one cluster: 0, 1, 2, 3. When I built these centroids, I put all these dimensions into a data frame so that it becomes easy to understand. So cluster 0's centroid is located at these values. What I'm doing here is this: I'm taking the data set back, all the original data, and I'm sending it to cluster.predict. What does this predict do? It looks deceptively like predictive modeling. It is not modeling. It's not predicting anything.
All the records are being assigned to different cluster IDs. K-means clustering has assigned them. Now I want to store the cluster IDs back into the records. That's what is happening here. In the data frame we did not have any column for the centroids. What this is going to do is create a column called GROUP, and in GROUP it's going to store the cluster IDs. So I'm creating a new column called GROUP, and the GROUP column will tell me which cluster this particular record belongs to. That is what the predict function is doing.
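The steps described so far can be sketched roughly as follows. This is a minimal illustration with scikit-learn and pandas, not the notebook used in the lecture: the synthetic data, the column names `hp_z` and `mpg_z`, and the variable names `cluster` and `centroid_df` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the z-scored car data (hypothetical columns and values).
rng = np.random.default_rng(0)
centers = np.array([[-1.5, 1.5], [-0.5, 0.5], [0.5, -0.5], [1.5, -1.5]])
X = np.vstack([c + 0.2 * rng.standard_normal((50, 2)) for c in centers])
features = ["hp_z", "mpg_z"]
df = pd.DataFrame(X, columns=features)

# Build four clusters.
cluster = KMeans(n_clusters=4, n_init=10, random_state=1)
cluster.fit(df[features])

# One row per cluster, one column per dimension: the centroid positions.
centroid_df = pd.DataFrame(cluster.cluster_centers_, columns=features)
print(centroid_df)

# predict() is not modeling anything: it only returns, for each record,
# the index of the nearest centroid. Store that back as a new column.
df["GROUP"] = cluster.predict(df[features])
print(df["GROUP"].value_counts().sort_index())
```

Each value in `GROUP` is just a row index into `centroid_df`, which is why the centroid table and the labelled records can be read side by side.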
It takes the records as input and tells you which cluster each one belongs to. This is the elbow plot. I should have actually labeled this. This is my inertia. Inertia means the within-cluster variances, the squared distances of the points from their own centroid, all summed up across the clusters. Okay, so this is my objective function. I want to minimize this. What is on the bottom? The bottom is the number of clusters, K equal to 2, 3 and so on. You can't have fractions. The axis shows them like this only because of how the scale has been set, but they will be whole numbers. It can start from one, but one will be the highest. Okay, I see that the shift is too much. It's a significant shift going to four, even though in the pair panel we saw three natural clusters. So four seems to be the likely number of clusters. Yes. Okay, now look at this. I have got my clusters.
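A plot like this is usually produced by fitting K-means once per candidate K and recording the inertia each time. The sketch below assumes scikit-learn's `KMeans` and uses synthetic data with four well-separated groups; the range of K and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with four groups (hypothetical, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[-3, 3], [-1, 1], [1, -1], [3, -3]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# Fit once per whole-number K and keep the objective value (inertia).
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))

# With K = 1 the single centroid is the overall mean, so the inertia is the
# total sum of squared distances to the mean, the largest value on the curve.
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
print(round(total_ss, 2))
```

Plotting `inertias` against K gives the elbow curve; the K after which the drop flattens out is the candidate number of clusters.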
The cluster IDs are stored inside the data frame. Now how do I make use of this? Okay, so what I did here is I stored the cluster labels back into the records: which record belongs to which cluster label. That's what I'm doing at this point. Once that's done, I make use of a box plot. In the box plot, what we do is this: on every attribute, so if there are eight attributes in this data set, on every attribute I do a box plot cluster-wise. When I do this, look at this. On the first dimension, acceleration, there are four clusters but many outliers. Now whenever box plots are overlapping, the bodies of the box plots are overlapping, that's not a very good sign. You will not be able to interpret these two clusters separately. They will not come out as distinct clusters. They will have similar properties. Because when you look at this dimension from this angle, this is acceleration, z-scored.
You'll see that there are two humps here: this is a separate hump and this is a separate hump, out of four clusters. Okay, so not a good situation to be in. But let's see more. This is cylinders. It's an integer column, so the box plots are coming out as simple lines. This is displacement. Once again the clusters are overlapping. They are not coming out very strongly as distinct clusters. Maybe it's not four clusters. If I restrict myself to three clusters, these two might get clubbed into one, these might get clubbed into one, and I might have distinct clusters, even though the K-means elbow tells you four is the right number of clusters. Let's see. We don't know. But there are many outliers here. Outliers indicate loose clusters. Outliers indicate data points lying on the edge of the clusters. They're making the clusters very large and loose.
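What the eye reads off the box plots can also be checked numerically. The body of a box runs from the first to the third quartile, so two boxes overlap when those ranges intersect. The sketch below is a stand-in for the plot rather than the plotting code itself; the column name `acc_z`, the toy values, and the helper `bodies_overlap` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical z-scored values for one attribute, with cluster labels in GROUP.
df = pd.DataFrame({
    "acc_z": [-1.2, -1.0, -0.6, -1.1, -0.9, -0.5, 0.9, 1.1, 1.3, 1.0, 1.2, 1.6],
    "GROUP": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# First and third quartile per cluster: the bottom and top of each box body.
boxes = df.groupby("GROUP")["acc_z"].quantile([0.25, 0.75]).unstack()
boxes.columns = ["q1", "q3"]
print(boxes)

def bodies_overlap(a, b):
    """Return True when the interquartile boxes of clusters a and b intersect."""
    return bool(boxes.loc[a, "q1"] <= boxes.loc[b, "q3"]
                and boxes.loc[b, "q1"] <= boxes.loc[a, "q3"])

print(bodies_overlap(0, 1))  # clusters 0 and 1 on this attribute
print(bodies_overlap(0, 2))  # clusters 0 and 2 on this attribute
```

In this toy data, clusters 0 and 1 share a box body and so do 2 and 3, while the two pairs are distinct from each other, which is the pattern described for the four clusters.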
So, handling the outliers. I have removed the outliers here already. Any data point which is lying beyond two standard deviations has got to go; I replaced it with the median. Going to your data frame, look at the records cluster-wise, sort them cluster-wise. Any data point which is lying more than two standard deviations out within the cluster, replace it with the center of the cluster. That's what I'm doing here. When I do this, and I form the box plot again, you see the outliers are almost gone. But, if you're not already familiar with this, I think I told you this in the class: you have a distribution, and in the distribution you take the data points in the outlier regions and replace them with the median, so your distribution becomes sharper. Your standard deviation falls. As a result of that, data points which were earlier not outliers start appearing as outliers.
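The cluster-wise replacement can be sketched as below. It assumes "two standard deviations" is measured from the cluster mean and that the replacement value is the cluster median; the helper name `replace_outliers`, the column `hp_z`, and the toy values are hypothetical.

```python
import pandas as pd

def replace_outliers(col):
    """Replace values more than two standard deviations from the mean with the median."""
    median, mean, std = col.median(), col.mean(), col.std()
    is_outlier = (col - mean).abs() > 2 * std
    return col.mask(is_outlier, median)

# Two hypothetical clusters, each with one extreme value.
df = pd.DataFrame({
    "hp_z": [0.1, 0.2, 0.0, -0.1, 0.15, 0.05, 5.0,
             1.0, 1.1, 0.9, 1.05, 0.95, 1.0, -4.0],
    "GROUP": [0] * 7 + [1] * 7,
})

# Apply the rule separately inside each cluster.
cleaned = df.copy()
cleaned["hp_z"] = df.groupby("GROUP")["hp_z"].transform(replace_outliers)

# The standard deviation of each cluster falls after the replacement.
print(df.groupby("GROUP")["hp_z"].std())
print(cleaned.groupby("GROUP")["hp_z"].std())
```

Because the standard deviation shrinks, running the same rule a second time uses a tighter threshold, which is why points that were inside the limits before can show up as new outliers.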
You will never be able to completely eliminate the outliers. They will always be there. As long as I've handled the first-iteration outliers, I'm okay, so I'm going to ignore the second-iteration outliers. Now I'm going to do this. Look at your boxes again. The boxes are almost overlapping. These two are overlapping. These two are distinct from these two. Right? So my box plot was not really helping me choose between three and four clusters. Even three clusters had the same problem, which means the dimensions are not helping me separate all the clusters properly. So what I did is I went in and did the analysis using a scatter plot. Under this, I took one of the dimensions, say for example horsepower, and the target variable which I have to predict, that is, mpg. So let's do a plot between mpg and horsepower and see what the scatter plot looks like in these four clusters.
Now this is where the information will come out. Okay, look at this particular plot. What do you think this green cluster is? This is horsepower. So what do you think this green cluster is? What do you think this red cluster is? Green is large cars, heavy cars, very high horsepower. These are probably the Maruti 800s, the small cars: very low horsepower. Here the mileage is very high; there the mileage is very low.
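One way to put numbers on a scatter plot like this is to fit a straight line of mpg on horsepower separately within each cluster and compare the slopes. The sketch below uses NumPy's `polyfit` on made-up data: the two clusters, their values, and the variable names are assumptions, chosen so that one cluster has a clear negative relationship and the other an almost flat one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical z-scored data. Cluster 0 plays the small cars (low horsepower,
# high mpg); cluster 1 plays the large cars (high horsepower, low mpg).
hp_small = rng.uniform(-1.5, -0.5, 60)
mpg_small = -1.2 * hp_small + 0.1 * rng.standard_normal(60)
hp_large = rng.uniform(0.5, 2.0, 60)
mpg_large = -0.02 * hp_large - 1.0 + 0.1 * rng.standard_normal(60)

hp = np.concatenate([hp_small, hp_large])
mpg = np.concatenate([mpg_small, mpg_large])
group = np.array([0] * 60 + [1] * 60)

# Fit mpg = slope * hp + intercept inside each cluster.
slopes = {}
for g in np.unique(group):
    mask = group == g
    slope, intercept = np.polyfit(hp[mask], mpg[mask], deg=1)
    slopes[int(g)] = slope
    print(f"cluster {g}: slope = {slope:.2f}")
```

A slope near zero means horsepower says little about mpg inside that cluster, while a clearly negative slope means it is a useful predictor there.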
Okay, now look at this. What is coming out of this analysis is this: if I look at the green cluster, horsepower is not a very good predictor of mpg there. The line is almost horizontal. The slope is almost zero. In the case of orange also, the slope is weak. So horsepower is not going to be a good predictor of mileage for the orange-colored cars; orange is like the sedans. Whereas if I go for small cars, between horsepower and mpg there seems to be the kind of relationship I expect, a negative relationship, and I am able to get very strong coefficients. You understand how we're able to extract information out of clusters? Let's see this plot. Look at this. This is mpg versus displacement. Not as bad as the previous case, but once again for large cars