When presented with a new collection of data, one of the first questions you may ask is how the variables in it are related to each other. This can involve deep study of how one parameter is likely to vary as you change another, but the simplest start is to look at the linear correlation between them.

Correlation is usually taught as being the degree to which two variables are linearly related, that is, as one increases, how much the other one increases on average. This is a useful measure because it's easy to calculate and because much real-world data has either a roughly linear relationship or no relationship at all.

However, correlation is a much broader idea than that, and when doing machine learning it's worth understanding the bigger picture. At its core, correlation is a measure of how related two data sets are. The way I like to think of it is: if I know the value of one of the two variables, how much information do I have about the value of the other? To highlight this, consider two variables, $x$ and $y$, that form a clear visible pattern when plotted against each other but have no overall upward or downward trend. They have a linear correlation of zero (on average, as $x$ increases, $y$ stays the same), but if you know the value of $x$, you clearly have information about what the value of $y$ is likely to be. The other way to think about it is in terms of mutual information: $y$ is clearly sharing information with $x$, otherwise there would be no visible pattern.

Pandas also provides a quick method of looking at a large number of data parameters at once and seeing visually which might be worth investigating. If you pass any pandas DataFrame to the scatter_matrix() function (found in pandas.plotting), it will plot all the pairs of parameters in the data. The produced graph has a lot of information in it, so it's worth taking some time to make sure you understand these plots. The plot is arranged with all the variables of interest from top to bottom and then repeated from left to right, so that any one square in the grid is defined by the intersection of two variables.

Each box that is an intersection of a variable with another (e.g. row three, column one is the intersection between "AveRooms" and "MedInc") shows the scatter plot of how the values of those variables relate to each other. If you see a strong diagonal line, it means that those variables are correlated in this data set. If it's more of a blob or a flat horizontal or vertical line, that suggests a low correlation. The top-right triangle of the plot is a repeat of the bottom-left triangle, just with the items in the pair reversed (i.e. row one, column three is the intersection between "MedInc" and "AveRooms").
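The idea that two variables can have a linear correlation of zero while still sharing information can be checked numerically. The sketch below is only an illustration under stated assumptions: the data ($y = x^2$ on a symmetric range of $x$) is made up for the purpose, and `binned_mutual_information` is a hypothetical helper that gives a crude histogram-based estimate, not a function from any library.

```python
import numpy as np

# Assumed example data: x is symmetric around zero and y depends on x exactly.
x = np.linspace(-1.0, 1.0, 5001)
y = x**2

# Pearson (linear) correlation: close to zero because the rising and
# falling halves of the parabola cancel out.
pearson_r = np.corrcoef(x, y)[0, 1]


def binned_mutual_information(a, b, bins=10):
    """Crude mutual information estimate (in nats) from a 2D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()             # joint probabilities
    p_a = p_ab.sum(axis=1, keepdims=True)  # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)  # marginal of b, shape (1, bins)
    nonzero = p_ab > 0                     # skip empty cells to avoid log(0)
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float(np.sum(p_ab[nonzero] * np.log(ratio)))


# Shuffling y destroys the relationship and gives a baseline for comparison.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)

mi_xy = binned_mutual_information(x, y)
mi_shuffled = binned_mutual_information(x, y_shuffled)

print(f"Pearson r:                  {pearson_r:.3f}")
print(f"Mutual information (x, y):  {mi_xy:.3f} nats")
print(f"Mutual information shuffled: {mi_shuffled:.3f} nats")
```

The linear correlation comes out at essentially zero, while the mutual information estimate is far larger for the real pairing than for the shuffled one. The exact numbers depend on the number of bins, so treat them as a rough comparison rather than a precise measurement.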
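To get a feel for reading a scatter matrix, here is a minimal sketch of calling `scatter_matrix()` from `pandas.plotting`. The DataFrame is synthetic and purely an assumption for illustration: the column names "MedInc" and "AveRooms" are borrowed from the text, "Noise" is a made-up unrelated column, and none of the values come from a real data set.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line for interactive use

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): AveRooms is built to rise with MedInc,
# while Noise is unrelated to both.
med_inc = rng.normal(loc=4.0, scale=1.5, size=n)
df = pd.DataFrame(
    {
        "MedInc": med_inc,
        "Noise": rng.normal(size=n),
        "AveRooms": 1.2 * med_inc + rng.normal(scale=0.5, size=n),
    }
)

# One scatter plot per pair of columns; by default the diagonal shows
# a histogram of each column on its own.
axes = scatter_matrix(df, figsize=(8, 8), alpha=0.5)

# axes[2, 0] is row three, column one: AveRooms (y) against MedInc (x).
# axes[0, 2] is the mirrored panel in the top-right triangle.
axes[0, 0].figure.savefig("scatter_matrix.png")

# The matching table of linear correlations, for comparison with the plot.
corr = df.corr()
print(corr.round(2))
```

In the saved figure, the "MedInc"/"AveRooms" panels should show a clear diagonal band, while any panel involving "Noise" should look like a blob. Comparing the picture against the `df.corr()` table is a quick way to check that you are reading the grid correctly.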