Python ELI5 Baseline model

First, some data is needed. Let's load the 20 Newsgroups dataset, keeping only 4 categories:

from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', '', '']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

A basic text processing pipeline uses bag-of-words features and logistic regression as the classifier:

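from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
# Fit on the training split; the predict and eli5 calls further down need a fitted pipeline.
pipe.fit(twenty_train.data, twenty_train.target)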

LogisticRegressionCV is used here so that the regularization parameter C is tuned automatically by cross-validation. This keeps comparisons between vectorizers fair, because the optimal C value can differ for different input features (e.g. for bigrams or for character-level input). An alternative is to use GridSearchCV or RandomizedSearchCV.
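
The GridSearchCV alternative could look roughly like the sketch below. It is only an illustration under assumptions: the tiny corpus, the labels, the grid of C values, `cv=2`, and the names `texts`, `labels`, `search_pipe` and `search` are all made up here so the example runs without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 0 = religion-like, label 1 = graphics-like.
texts = [
    "god faith church", "prayer faith bible",
    "church bible god", "faith prayer church",
    "pixel image render", "graphics image pixel",
    "render graphics shader", "shader pixel image",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Plain LogisticRegression here; the search tunes C instead of LogisticRegressionCV.
search_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the classifier's C is addressed as "logisticregression__C".
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(search_pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Because the search wraps the whole pipeline, vectorizer options (for example `countvectorizer__ngram_range`) can be added to the same grid and tuned together with C.
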
Now the quality of the pipeline is checked:

from sklearn import metrics
def print_report(pipe):
    # Evaluate on the held-out test split.
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(y_test, y_pred, target_names=twenty_test.target_names)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
print_report(pipe)

Other classifiers and preprocessing methods could be tried. Before that, let's check what the model learned, using the eli5.show_weights function:

import eli5
eli5.show_weights(clf, top=10)

The table above doesn't make any sense: eli5 could not get feature and class names from the classifier object alone. Feature and target names can be provided explicitly:

# eli5.show_weights(clf,
#                   feature_names=vec.get_feature_names(),
#                   target_names=twenty_test.target_names)

The code above works, but it is better to pass the vectorizer instead and let eli5 figure out the details by itself:

eli5.show_weights(clf, vec=vec, top=10, target_names=twenty_test.target_names)

Columns are target classes; each column lists features and their weights. The intercept (bias) feature is shown as <BIAS> in the same table. Features and weights can be inspected because a bag-of-words vectorizer and a linear classifier are used, so each word maps directly to a classifier coefficient. For other classifiers, features can be more difficult to inspect.

Some features look good, while some don't. It seems the model learned some names specific to the dataset (email parts, etc.). It's time to check prediction results on an example:

eli5.show_prediction(clf, twenty_test.data[0], vec=vec, target_names=twenty_test.target_names)