Models

In this section, we give an overview of the modelling techniques and their performance for tag prediction. We used both classification and topic modelling approaches.

The chord diagrams represent the misclassification patterns for the different models.

SVM:

This model classified most CSS questions as HTML and most jQuery questions as JavaScript. Most HTML misclassifications were predicted as JavaScript. These technologies are frequently used together and probably often appear in the same question.

Multinomial Naive Bayes:

Similarly, the misclassifications in Naive Bayes can be attributed to technologies that are frequently used together being confused with one another.

LDA:

The same holds for LDA: it mixes words from HTML and CSS into topic 4, while topic 9 consists of jQuery words and a small share of HTML words.

The misclassification patterns are similar across all the models. This could be addressed by identifying vocabularies that better discriminate between co-occurring technologies.
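One possible way to look for discriminating vocabulary between two co-occurring tags is to rank terms by a smoothed log-odds ratio of their frequencies in each tag's posts. The sketch below is only an illustration under that assumption: the function name `discriminating_terms`, the smoothing constant `alpha`, and the toy posts are hypothetical and do not come from the experiments described here.

```python
import math
from collections import Counter

def discriminating_terms(docs_a, docs_b, top_n=2, alpha=1.0):
    """Rank terms by smoothed log-odds of appearing in class A versus class B.

    docs_a, docs_b: lists of token lists. alpha is an additive smoothing
    constant (an assumption of this sketch) that avoids log(0).
    Returns (terms leaning towards A, terms leaning towards B).
    """
    counts_a = Counter(t for doc in docs_a for t in doc)
    counts_b = Counter(t for doc in docs_b for t in doc)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {
        t: math.log((counts_a[t] + alpha) / total_a)
           - math.log((counts_b[t] + alpha) / total_b)
        for t in vocab
    }
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:top_n], ranked[-top_n:][::-1]

# Hypothetical noun lists for two tags that are often confused.
css_posts = [["selector", "margin", "div"],
             ["margin", "padding", "div"],
             ["stylesheet", "margin"]]
html_posts = [["div", "tag", "form"],
              ["tag", "attribute", "div"],
              ["form", "tag"]]

towards_css, towards_html = discriminating_terms(css_posts, html_posts)
print("CSS-leaning terms:", towards_css)
print("HTML-leaning terms:", towards_html)
```

In this toy data, a shared term such as "div" scores zero and drops out of both lists, which is the behaviour one would want from a discriminating vocabulary.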

Performances

We used an 80-20 train-test split. All the classifiers were trained on the nouns extracted from the posts, which were converted into TF-IDF vectors and used as the training and testing instances.
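The split-then-vectorise step can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline actually used: the helper names, the smoothed IDF formula, the L2 normalisation, and the toy posts are all assumptions. The one design point it is meant to show is that the IDF weights are fitted on the training portion only and then reused for the test portion.

```python
import math
import random
from collections import Counter

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_idf(train_docs):
    """Smoothed inverse document frequency, computed from training docs only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Sparse {term: weight} vector, L2-normalised; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Hypothetical posts, already reduced to their nouns, with their tags.
posts = [
    (["div", "stylesheet", "margin"], "css"),
    (["div", "tag", "page"], "html"),
    (["selector", "plugin", "page"], "jquery"),
    (["activity", "intent", "layout"], "android"),
    (["dataframe", "column", "index"], "python"),
]

train, test = train_test_split(posts, test_fraction=0.2, seed=42)
idf = fit_idf([tokens for tokens, _ in train])
train_vectors = [tfidf_vector(tokens, idf) for tokens, _ in train]
test_vectors = [tfidf_vector(tokens, idf) for tokens, _ in test]
print(len(train_vectors), "training vectors,", len(test_vectors), "test vectors")
```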

Parameters:

SVM: default parameters with the kernel set to linear
Naive Bayes: defaults
Nearest Mean: defaults

Training times (longest to shortest):

SVM
Nearest Mean
Naive Bayes

Best accuracies:

SVM: ~73%
Naive Bayes: ~69%
Nearest Mean: ~63%

The detailed tabulation of the classifiers’ performance is given below, showing how each classifier performed on each class.
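The per-class columns in the tables follow the usual one-vs-rest definitions, with Precision equal to Pos Pred Value and Recall equal to Sensitivity. As a reading aid, the sketch below computes them from a confusion matrix. The function name `per_class_metrics`, the row/column orientation, and the three-class toy matrix are assumptions made for illustration; the counts are not taken from the experiments.

```python
def per_class_metrics(matrix, labels):
    """One-vs-rest metrics for each class.

    Assumed orientation: matrix[i][j] is the number of posts predicted as
    labels[i] whose true tag is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    nan = float("nan")
    results = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                  # predicted k, true tag differs
        fn = sum(row[k] for row in matrix) - tp   # true tag k, predicted otherwise
        tn = total - tp - fp - fn
        sensitivity = tp / (tp + fn) if tp + fn else nan
        specificity = tn / (tn + fp) if tn + fp else nan
        precision = tp / (tp + fp) if tp + fp else nan
        results[label] = {
            "sensitivity": sensitivity,            # same as recall
            "specificity": specificity,
            "precision": precision,                # same as pos pred value
            "neg_pred_value": tn / (tn + fn) if tn + fn else nan,
            "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan,
            "prevalence": (tp + fn) / total,
            "detection_rate": tp / total,
            "detection_prevalence": (tp + fp) / total,
            "balanced_accuracy": (sensitivity + specificity) / 2,
        }
    return results

# Hypothetical counts for three tags (rows: predicted, columns: true).
labels = ["css", "html", "javascript"]
matrix = [
    [6, 2, 0],
    [3, 10, 4],
    [1, 3, 21],
]

metrics = per_class_metrics(matrix, labels)
for label in labels:
    print(label, {name: round(value, 2) for name, value in metrics[label].items()})
```

The same relationships can be checked against the reported rows. For android under the linear SVM, balanced accuracy is (0.89 + 0.99) / 2 = 0.94, and F1 is 2 × 0.91 × 0.89 / (0.91 + 0.89) ≈ 0.90, matching the table.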

SVM Linear

##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##          0.73          0.70          0.72          0.74

	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy
Class: android	0.89	0.99	0.91	0.98	0.91	0.89	0.90	0.14	0.12	0.13	0.94
Class: c	0.61	0.99	0.41	1.00	0.41	0.61	0.49	0.01	0.01	0.01	0.80
Class: c#	0.87	0.98	0.77	0.99	0.77	0.87	0.82	0.08	0.07	0.09	0.93
Class: c++	0.59	0.99	0.75	0.98	0.75	0.59	0.66	0.05	0.03	0.04	0.79
Class: css	0.75	0.97	0.03	1.00	0.03	0.75	0.05	0.00	0.00	0.03	0.86
Class: html	0.48	0.96	0.57	0.95	0.57	0.48	0.52	0.10	0.05	0.08	0.72
Class: ios	0.93	0.99	0.78	1.00	0.78	0.93	0.85	0.04	0.04	0.05	0.96
Class: java	0.87	0.98	0.79	0.99	0.79	0.87	0.83	0.09	0.07	0.09	0.92
Class: javascript	0.60	0.95	0.72	0.93	0.72	0.60	0.66	0.16	0.10	0.13	0.78
Class: jquery	1.00	0.98	0.01	1.00	0.01	1.00	0.01	0.00	0.00	0.02	0.99
Class: mysql	0.64	0.98	0.29	1.00	0.29	0.64	0.40	0.01	0.01	0.03	0.81
Class: php	0.64	0.98	0.77	0.97	0.77	0.64	0.70	0.07	0.05	0.06	0.81
Class: python	0.83	0.97	0.87	0.96	0.87	0.83	0.85	0.18	0.15	0.17	0.90
Class: r	0.77	0.99	0.73	0.99	0.73	0.77	0.75	0.04	0.03	0.04	0.88
Class: sql	0.49	0.99	0.78	0.98	0.78	0.49	0.60	0.04	0.02	0.02	0.74

Multinomial Naive Bayes

##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##          0.70          0.67          0.70          0.70

	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy
Class: android	0.89	0.98	0.88	0.98	0.88	0.89	0.88	0.13	0.12	0.13	0.93
Class: c	0.30	1.00	0.82	0.97	0.82	0.30	0.44	0.04	0.01	0.01	0.65
Class: c#	0.88	0.98	0.74	0.99	0.74	0.88	0.80	0.07	0.06	0.09	0.93
Class: c++	0.75	0.99	0.69	0.99	0.69	0.75	0.72	0.04	0.03	0.04	0.87
Class: css	0.30	0.99	0.69	0.96	0.69	0.30	0.42	0.06	0.02	0.03	0.65
Class: html	0.47	0.94	0.34	0.97	0.34	0.47	0.40	0.06	0.03	0.08	0.71
Class: ios	0.91	0.99	0.83	1.00	0.83	0.91	0.86	0.04	0.04	0.05	0.95
Class: java	0.82	0.98	0.79	0.98	0.79	0.82	0.80	0.09	0.07	0.09	0.90
Class: javascript	0.77	0.93	0.50	0.98	0.50	0.77	0.61	0.09	0.07	0.14	0.85
Class: jquery	0.26	0.99	0.72	0.96	0.72	0.26	0.38	0.05	0.01	0.02	0.63
Class: mysql	0.46	0.98	0.40	0.99	0.40	0.46	0.43	0.02	0.01	0.03	0.72
Class: php	0.66	0.98	0.76	0.97	0.76	0.66	0.70	0.07	0.05	0.06	0.82
Class: python	0.94	0.95	0.74	0.99	0.74	0.94	0.83	0.14	0.13	0.17	0.94
Class: r	0.64	1.00	0.91	0.98	0.91	0.64	0.75	0.06	0.04	0.04	0.82
Class: sql	0.47	1.00	0.85	0.98	0.85	0.47	0.61	0.04	0.02	0.02	0.73
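
The Accuracy and Kappa values printed above each table summarise the whole confusion matrix. Kappa corrects the observed accuracy for the agreement expected by chance given the row and column totals. The sketch below shows the standard computation of Cohen's kappa; the function name `accuracy_and_kappa` and the toy counts are hypothetical and are not the experimental data.

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[k][k] for k in range(n)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[k] for row in matrix) for k in range(n)]
    # Chance agreement: probability that prediction and truth coincide
    # if they were independent with these marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1 - expected)

# Hypothetical counts for three tags (rows: predicted, columns: true).
matrix = [
    [6, 2, 0],
    [3, 10, 4],
    [1, 3, 21],
]

accuracy, kappa = accuracy_and_kappa(matrix)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```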