In unsupervised machine learning, the "n" hyperparameter, i.e. the number of clusters in clustering experiments, the fraction of outliers in anomaly detection, and the number of topics in topic modeling, is of fundamental importance.
When the eventual objective of the experiment is to predict an outcome (classification or regression) using the results of the unsupervised experiment, the tune_model() function in the pycaret.clustering, pycaret.anomaly, and pycaret.nlp modules comes in very handy.
To understand this, let's see an example using the "Kiva" dataset.
This is a micro-banking loan dataset where each row represents a borrower with their relevant information. Column 'en' captures the loan application text of each borrower, and the column 'status' represents whether the borrower defaulted or not (default = 1 or no default = 0).
You can use the tune_model() function in pycaret.nlp to optimize the num_topics parameter based on the target variable of a supervised experiment (i.e. finding the optimum number of topics that best improves prediction of the final target variable). You define the model used for training with the estimator parameter ('xgboost' in this case). This function returns a trained topic model and a visual showing the supervised metrics at each iteration.
tuned_lda = tune_model(model='lda', supervised_target='status', estimator='xgboost')
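Under the hood, this kind of tuning amounts to sweeping the number of topics, extracting topic distributions as features, and scoring a supervised model on the target at each step. The sketch below illustrates that idea with scikit-learn (LatentDirichletAllocation plus LogisticRegression) on a tiny made-up corpus standing in for the Kiva 'en' column and 'status' label; it is a conceptual stand-in, not PyCaret's actual implementation, and the documents and labels are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the Kiva data: loan application texts ('en')
# and a binary default label ('status'). Invented for illustration.
docs = [
    "loan to buy farming seeds and fertilizer for the next harvest",
    "requesting funds to expand a small grocery shop in the village",
    "need capital to purchase a sewing machine for tailoring work",
    "loan for livestock feed and veterinary care for my cattle",
    "funds to stock my market stall with fresh vegetables",
    "borrowing to repair the roof of my family home",
    "loan to buy a motorcycle for a delivery business",
    "capital for buying fabric and thread for my tailoring shop",
    "funds to pay school fees and buy books for my children",
    "loan to purchase fishing nets and a small boat",
    "requesting money to open a food stall near the bus station",
    "borrowing to buy fertilizer and tools for the planting season",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Bag-of-words counts are the standard input for LDA.
X_counts = CountVectorizer().fit_transform(docs)

best_k, best_score = None, -np.inf
for k in (2, 3, 4):  # candidate values of num_topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    topic_feats = lda.fit_transform(X_counts)  # per-doc topic distributions
    clf = LogisticRegression(max_iter=1000)
    # Score the supervised target using only the topic features.
    score = cross_val_score(clf, topic_feats, labels, cv=3).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best num_topics={best_k}, cv accuracy={best_score:.3f}")
```

On real data you would sweep a wider range of topic counts and use the same estimator you plan to deploy (e.g. xgboost, as in the tune_model() call above); the selection logic is unchanged.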