Now comes the fun part! Is it possible to guess a person's nationality and gender, knowing only how long he/she is willing to wait for someone, or to make someone wait?
This becomes a simple classification problem in machine learning.
First let’s try modeling the problem using the k-nearest neighbors (KNN) algorithm, a non-parametric method, meaning it makes no assumption about the mathematical relationship between the input predictors and the output response. When given an unknown point, the KNN classifier first identifies the K points* closest to it, then estimates the conditional probability of each class from those neighbors.
*K points - the exact number of neighbors is chosen by the user, and needs to be tuned to minimize prediction error. For a much better explanation, see chapter two of ‘An Introduction to Statistical Learning’ by James, Witten, Hastie, and Tibshirani.
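To make the idea concrete, here is a minimal sketch of what a KNN classifier does under the hood; the arrays X_train, y_train, and x_new are placeholders, not the actual wait-time data:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # The estimated conditional probability of each class is its fraction among
    # the k neighbors; the prediction is the class with the highest fraction
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]
```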
Here is the raw data file: JpHWWaittimeDataPublic.csv
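The scores below could be obtained with something along these lines, a sketch using scikit-learn's KNeighborsClassifier with its default settings. The column names ('wait_time', 'make_wait_time', 'japanese', 'gender') are placeholders for whatever the actual CSV headers are, and the score reported here is the accuracy on the same data the classifier was fit on:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Load the survey data; column names are placeholders for the actual headers
df = pd.read_csv('JpHWWaittimeDataPublic.csv')
X = df[['wait_time', 'make_wait_time']]   # how long they wait / make others wait
y_nationality = df['japanese']            # 1 = Japanese, 0 = non-Japanese
y_gender = df['gender']

# Fit a KNN classifier (default n_neighbors=5) and report accuracy on the same data
knn = KNeighborsClassifier()
print('Nationality prediction score:', knn.fit(X, y_nationality).score(X, y_nationality))
print('Gender prediction score:', knn.fit(X, y_gender).score(X, y_gender))
```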
Nationality prediction score: 0.676470588235
Gender prediction score: 0.764705882353
We could divide the dataset into a training set and a test set, and use the test-set accuracy to gauge how the classifier would perform on new data. However, since the number of subjects here is small, instead of splitting the data we can use cross-validation to estimate how the classifier should perform in practice:
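A sketch of how that could be done with scikit-learn's cross_val_score, using 4 folds to match the four scores reported below; X, y_nationality, and y_gender are the same placeholder names as above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()  # default n_neighbors=5

# 4-fold cross-validation: fit on 3 folds, score on the held-out fold, repeat
scores_jp = cross_val_score(knn, X, y_nationality, cv=4)
print('Predicting whether Japanese:', scores_jp, 'Mean:', scores_jp.mean())

scores_gender = cross_val_score(knn, X, y_gender, cv=4)
print('Predicting gender:', scores_gender, 'Mean:', scores_gender.mean())
```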
Predicting whether Japanese: [ 0.77777778 0.55555556 0.625 0.5 ]
Mean: 0.614583333333
Predicting gender: [ 0.66666667 0.66666667 0.5 0.75 ]
Mean: 0.645833333333
The above is only a very rough first estimate of our ability to predict nationality (Japanese or non-Japanese) and gender. To improve predictive ability, we can try:
1) using the same KNN classifier, but optimizing n_neighbors
2) selecting only the important predictors using lasso regression
3) using derived features
4) switching to a parametric classifier, such as logistic regression
1) Optimizing n_neighbors:
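One way to pick n_neighbors is to sweep over a range of values and compare the mean cross-validation accuracy for each; a sketch, again reusing the placeholder X and y_nationality from above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 4-fold CV accuracy for each candidate number of neighbors
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_acc = cross_val_score(knn, X, y_nationality, cv=4).mean()
    print(f'n_neighbors={k}: mean CV accuracy = {mean_acc:.3f}')
```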
It looks like n_neighbors = 3 should give better accuracy. Using the same cross-validation code as above but with n_neighbors = 3, the accuracy is as follows:
Predicting whether Japanese: [ 0.88888889 0.66666667 0.625 0.5 ]
Mean: 0.670138888889
Predicting gender: [ 0.77777778 0.77777778 0.375 0.625 ]
Mean: 0.638888888889
Changing to n_neighbors = 3 helps with predicting whether the test subject is Japanese; however, it does not improve the accuracy of predicting gender.
Therefore, next I will try including only the features that have the biggest impact on the output. This can be done by using lasso regression:
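A sketch of how lasso could be used to see which features matter, here with scikit-learn's Lasso on standardized features; the alpha value and the name X_all (a frame holding all candidate predictors) are illustrative assumptions, not the exact choices used:

```python
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Standardize the predictors so the lasso penalty treats them on an equal footing
X_scaled = StandardScaler().fit_transform(X_all)   # X_all: all candidate predictors

# Fit lasso against the 0/1 nationality label; alpha controls the strength of shrinkage
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y_nationality)

# Coefficients shrunk exactly to zero mark the features lasso considers unimportant
for name, coef in zip(X_all.columns, lasso.coef_):
    print(f'{name}: {coef:.3f}')
```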
The lasso regression shows that some features definitely contribute more to the prediction than others. So now we include only those features:
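Keeping only the features with non-zero lasso coefficients might look like this; the selected column names are placeholders, since the actual ones depend on the lasso output above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Keep only the predictors lasso assigned non-zero coefficients (placeholder names)
selected = ['wait_time', 'make_wait_time']
X_selected = X_all[selected]

knn3 = KNeighborsClassifier(n_neighbors=3)
scores_jp = cross_val_score(knn3, X_selected, y_nationality, cv=4)
print('Predicting whether Japanese (n=3):', scores_jp, 'Mean:', scores_jp.mean())

scores_gender = cross_val_score(knn3, X_selected, y_gender, cv=4)
print('Predicting Gender (n=3):', scores_gender, 'Mean:', scores_gender.mean())
```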
Using the knn classifier with n_neighbors = 3, the cross validation accuracy scores are:
Predicting whether Japanese (n=3): [ 0.88888889 0.77777778 0.75 0.625 ]
Mean: 0.760416666667
Predicting Gender (n=3): [ 0.77777778 0.77777778 0.625 0.625 ]
Mean: 0.701388888889
The accuracies are getting better. Next time we shall see if adding derived features and switching to logistic regression will further improve accuracy.