PyCaret is a machine learning library created by data scientist Moez Ali. It addresses the machine learning workflow from the point of loading data to the point of deployment.
What makes PyCaret unique and sets it apart from other machine learning libraries is that it is a "low-code" library: with a few lines of code you can carry out all the processes and steps of a traditional machine learning lifecycle. From loading the dataset, to data preparation, to splitting data, to feature selection and engineering, to selecting models, to tuning hyperparameters, PyCaret addresses each stage of the machine learning lifecycle in an easy, automated fashion.
PyCaret currently addresses the following machine learning tasks: classification, regression, clustering, anomaly detection, natural language processing (NLP), and association rules mining. Association rules mining uses the Apriori algorithm to find patterns in data expressed as antecedents and consequents, an approach well suited to projects such as market basket analysis.
What PyCaret affords the data scientist or machine learning engineer is the ability to implement machine learning quickly and easily. Freed from writing copious lines of code, they can iterate on and adjust the project rapidly and focus in greater detail on the business question.
In this article I’ll take you through an overview of using PyCaret to implement a simple classification project.
The first step is to install PyCaret. For the purpose of this article, I'll be using Anaconda's Jupyter Notebook. It is advised to install PyCaret in a virtual environment, so that PyCaret's libraries and dependencies don't clash with any dependencies you may already have installed on your computer. To create the environment:
conda create --name yourenvname python=3.6
To activate said environment:
conda activate yourenvname
To deactivate said environment:

conda deactivate
To install PyCaret (within the environment) run:
pip install pycaret
Because PyCaret bundles its own dependencies, there is no need to separately import libraries such as pandas or matplotlib.
Importing PyCaret into Your Notebook Environment
In three lines of code we import PyCaret's dependencies for a classification project, fetch a dataset from the PyCaret dataset library, and load it, defining our target variable as **class variable**.
Running setup() produces the following result, an information grid about your dataset:
The Target Type states whether the target variable is in a binary classification format or a multiclass format.
Label Encoded applies when the target variable is categorical, i.e., in a format such as "Yes"/"No". PyCaret encodes such a variable into 1s and 0s so that whatever algorithms work on the data have a proper understanding of it.
Missing Values indicates whether any of the features contain missing values.
Numerical and Categorical Features refer to the number of numerical and categorical features in the dataset.
Ordinal Features refer to features that have values that are ranked, values that present themselves in a particular order or hierarchy within the data, e.g. low, medium, high.
High Cardinality refers to features with a very large number of unique values, such as email addresses, user IDs, product keys, etc.
Transformed Train & Test Set refer to the data being automatically split 70/30 for training and testing.
Shuffle & Stratify Train-Test indicate whether the rows are shuffled and whether class proportions are preserved (stratified) when the data is divided into training and testing sets.
The next step is to make use of the compare_models() function in PyCaret. This trains all the relevant classification models in the PyCaret library and ranks them on metrics such as Accuracy, AUC, Recall, and Precision.
From the resulting comparison table it is easy to evaluate the models on these metrics and decide which is most suitable for the use case being worked on. In this example, the catboost model has the highest Area Under the Curve (AUC), a metric useful for evaluating how well a model can discriminate between classes.
Metrics For Assessing Your Champion Model
We can pick out the catboost model as our champion model to look at its performance like so:
We can plot the confusion matrix, classification report, and ROC curve of this model with three simple lines of code:
You can also plot the features of your champion model (catboost) to see which features the model deems most important.
The next step is to make predictions on the hold-out sample of data that was set aside during the setup/preprocessing phase. Using the information and patterns it learned during training, the model makes predictions on this blind hold-out sample it hasn't been exposed to before.
After scoring the hold-out set, the model is fit on the entire dataset using the finalize_model() function. This includes the blind hold-out data that the model made predictions on in the previous step.
The final step is to save your model. This is useful in case the model is to be reused later or more data needs to be added to it. It's a simple call to the save_model() function: save_model(catboost, 'Catboost Model 2021').
And that’s it! You’re done. With these simple steps, you can create your own ML project.