The data needs to be cleaned first; improving the model at this point, e.g. by trying different classifiers, doesn't make sense - it may just learn to exploit these email addresses better.
So we have to clean the data ourselves. In this case, the 20 newsgroups dataset provides an option to remove footers and headers from the messages. Time to clean up the data and re-train the classifier:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

# remove= strips the header and footer blocks the old model was exploiting
twenty_train = fetch_20newsgroups(subset='train',
                                  remove=['headers', 'footers'])
twenty_test = fetch_20newsgroups(subset='test',
                                 remove=['headers', 'footers'])

vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
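The next paragraph talks about quality numbers; as a minimal sketch, the retrained pipeline could be scored on the held-out messages like this (plain accuracy is an assumption here - the original report may use a different metric):

from sklearn import metrics

# Hedged sketch: score the retrained pipeline on unseen test messages
pred = pipe.predict(twenty_test.data)
print(metrics.accuracy_score(twenty_test.target, pred))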
We just made the task harder and more realistic for the classifier.
A great result - measured quality went down! But that is fine: the pipeline most likely has better quality on unseen messages, and the evaluation is fairer now. Inspecting the features used by the classifier allowed us to notice a problem with the data and make a good change, despite the numbers telling us not to do it.
Instead of removing headers and footers, we could have improved the evaluation setup directly, using e.g. GroupKFold from scikit-learn. Then the quality of the old model would have dropped; after removing headers/footers we would have seen the accuracy increase, so the numbers would have told us to remove them. It is not obvious how to split the data, though - which groups to use with GroupKFold; one possible grouping is sketched below.
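As a hedged illustration (not part of the original pipeline), one plausible group key is the message sender: with headers kept, the same author should never end up in both a training and a test fold. The From: regex and the choice of sender as the group are assumptions.

import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline

raw = fetch_20newsgroups(subset='train')  # keep headers: they carry the groups

def sender(text):
    # crude heuristic (an assumption): take the first "From:" header line
    m = re.search(r'^From:\s*(\S+)', text, flags=re.MULTILINE)
    return m.group(1) if m else '<unknown>'

groups = [sender(doc) for doc in raw.data]
old_pipe = make_pipeline(CountVectorizer(), LogisticRegressionCV())
scores = cross_val_score(old_pipe, raw.data, raw.target,
                         groups=groups, cv=GroupKFold(n_splits=5))
print(scores.mean())  # expected to be lower than with a naive split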
So, what has the updated classifier learned? (The output is less verbose because only a subset of classes is shown - see the "targets" argument.)

import eli5

# explain a single test document; 'sci.med' is shown as an example target
eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     targets=['sci.med'])
It no longer uses email addresses. Still, it doesn't look good: the classifier assigns high weight to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so the classifier may be using them as a proxy for bias; or maybe some of them really are more common in some of the classes.
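One way to check this suspicion (a sketch, not a step from the original text) is to look at the global weights rather than a single prediction; eli5.show_weights takes the same vec, and top limits the output:

import eli5

# Inspect the largest learned weights per class; words like 'do' or 'my'
# showing up near the top would confirm the problem.
eli5.show_weights(clf, vec=vec, top=10)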