Python ELI5 Baseline model, improved data



The data needs to be cleaned first; improving the model by trying various
classifiers, etc. doesn't make sense at this point - the model may just learn
to leverage the email addresses even better.

Fortunately, we don't have to do the cleaning ourselves: the 20 newsgroups
dataset loader provides an option to remove footers and headers from the
messages. It is now time to clean up the data and re-train a classifier.


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)

vec = CountVectorizer()
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)


We just made the task harder and more realistic for the classifier.


print_report(pipe)
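`print_report` is the evaluation helper used throughout this series. A minimal sketch of what it could look like, built on scikit-learn's `classification_report` (here on a tiny synthetic dataset so the sketch runs on its own; the real call above evaluates on `twenty_test`):

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the newsgroups data, so the sketch is self-contained.
train_docs = ["aspirin helps headaches", "ibuprofen treats pain",
              "rockets reach orbit", "satellites circle earth"]
train_y = [0, 0, 1, 1]
test_docs = ["aspirin treats pain", "rockets circle orbit"]
test_y = [0, 1]
target_names = ["sci.med", "sci.space"]

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(train_docs, train_y)

def print_report(pipe):
    # Per-class precision/recall/F1 plus overall accuracy on held-out data.
    y_pred = pipe.predict(test_docs)
    print(metrics.classification_report(test_y, y_pred,
                                        target_names=target_names))
    print("accuracy: {:0.3f}".format(
        metrics.accuracy_score(test_y, y_pred)))

print_report(pipe)
```

The exact helper defined earlier in the series closes over `twenty_test` instead of these toy lists; the structure is the same.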



The reported numbers are lower now, but the evaluation is fairer, and the pipeline likely has better quality on truly unseen messages. Inspecting the features used by the classifier enabled us to notice a problem with the data and make a good change, despite the numbers telling us not to make that change.


Instead of removing headers and footers, we could have improved the evaluation setup directly, using GroupKFold from scikit-learn. Under that setup the quality of the old model would have dropped; we could then have removed headers/footers and seen accuracy improve, so the numbers would have told us to remove headers and footers. It is not obvious how to split the data though - which groups to use with GroupKFold. So, what has the updated classifier learned? (The output is less verbose because only a subset of classes is shown - see the "targets" argument):
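A sketch of what that grouped evaluation could look like. The group ids below are hypothetical sender ids (in practice they would be extracted from the message headers); GroupKFold guarantees that no group appears in both the training and the test fold, so the classifier cannot score well just by memorizing a sender's email address:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy messages: each sender (group) posted to both newsgroups.
docs = ["doctor prescribed aspirin for my headache",
        "the rocket launch succeeded",
        "ibuprofen treats pain",
        "the satellite reached orbit",
        "aspirin dosage for headache",
        "orbit insertion burn complete",
        "pain relief with ibuprofen",
        "rocket engines and satellite"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]          # 0 = sci.med, 1 = sci.space
groups = [1, 1, 2, 2, 3, 3, 4, 4]          # hypothetical sender ids

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
# Each fold holds out whole senders, never individual messages.
scores = cross_val_score(pipe, docs, labels,
                         groups=groups, cv=GroupKFold(n_splits=2))
print(scores)
```

With the real dataset, the hard part is exactly what the text notes: deciding what counts as a group (sender, thread, quoted chains, ...).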


import eli5

eli5.show_prediction(clf, twenty_test.data[0], vec=vec,
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])




It does not use email addresses now. Nevertheless, it still doesn't look good: the classifier assigns high importance to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so the classifier may be using them as a proxy for some bias in the data - or maybe some of them really are more frequent in some of the classes.

