Python ELI5 Baseline model, improved data



The data needs to be cleaned first; improving the model by trying various classifiers and so on doesn't make sense at this point - the model may just learn to leverage those email addresses even better. Therefore, we need to do the cleaning ourselves. Conveniently, the 20 newsgroups loader provides an option to remove headers and footers from the messages. It is now time to clean up the data and re-train a classifier.


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)

vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);


We just made the task harder and more realistic for the classifier.


print_report(pipe)
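print_report is a helper defined earlier in this series. If you are following along from this article only, a minimal self-contained stand-in might look like the sketch below; note that it takes the evaluation data explicitly (an assumption on my part - the series version may instead close over twenty_test):

```python
from sklearn import metrics

def print_report(pipe, docs, y_true, target_names):
    """Print per-class precision/recall/F1 plus overall accuracy
    for a fitted text-classification pipeline."""
    y_pred = pipe.predict(docs)
    print(metrics.classification_report(y_true, y_pred,
                                        target_names=target_names))
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_true, y_pred)))
```

With that variant you would call print_report(pipe, twenty_test.data, twenty_test.target, twenty_test.target_names).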



This is a great result. The pipeline now appears to have better quality on unseen messages, and the evaluation is fairer. Inspecting the features used by the classifier let us notice a problem with the data and make a good change, even though the numbers told us not to do it.
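An alternative to editing the data is to make the evaluation itself fairer with grouped cross-validation. A toy sketch of scikit-learn's GroupKFold follows; grouping by a hypothetical "sender" id is my own illustrative choice, not something this dataset provides out of the box:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 messages, each tagged with a hypothetical sender id.
# GroupKFold keeps all messages from one sender in the same fold,
# so sender-specific quirks (like email addresses) cannot leak
# from the training folds into the test fold.
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
senders = np.array([0, 0, 1, 1, 2, 2, 3, 3])

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=senders):
    # No sender appears on both sides of the split.
    assert set(senders[train_idx]).isdisjoint(senders[test_idx])
```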


Rather than removing headers and footers, we could have improved the evaluation setup directly, for example using GroupKFold from scikit-learn. The quality of the old model would then have dropped; after separating out headers/footers we would have seen improved accuracy, so the numbers themselves would have told us to eliminate them. It is not obvious how to split the data that way, though - which groups to use with GroupKFold. So, what has the updated classifier learned? (The output is less tedious because only a subset of classes is shown - see the "targets" argument):


eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])




It does not use email addresses now. Nevertheless, it still doesn't look good: the classifier assigns high importance to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so the classifier may be using them as a proxy for some bias, or some of them may simply be more frequent in certain classes.


More articles by SUVODEEP DAS are available in this Python ELI5 series.