Data Science part II - Model Development
This blog post focuses on the second phase of the Data Science Process: Model Development. We use the data gathered and analyzed in the first phase as input to the development process. In this phase, we focus on developing a model that answers the question of interest at the fidelity we aim for. Terms used in this post are defined in our Introduction to this series.
Model Build
The model build/development process is an iterative, trial-and-error process. In most cases, more than one model algorithm will fit a use case and provide an answer to the question of interest.
There is a direct relationship between the size of the data and the time it takes to train a model. Larger datasets generally require more computational resources and time for training because the model needs to process more information to learn patterns and relationships. As the data size increases, the number of computations increases, which can significantly extend the training duration, especially for complex models or algorithms. Efficient data handling, optimized algorithms, and powerful hardware can help mitigate some of these challenges.
The most common place to start developing a model is in a sandboxed environment with a small subset of the data. When ready, use a much larger subset of the data to train the model on the cluster.
Sandboxed Model Development
The data scientists will select a small representative sample of the data and try out several different approaches. The sandbox should be specific to the data scientist developing the machine learning model.
The data scientist will train models quickly and iterate through feature selection, model algorithms, and hyperparameter tuning. This sandboxed development enables the data scientist to quickly weed out model algorithms that are not fit for purpose, identify the most relevant data elements (aka features), and determine which model metrics to use.
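As a minimal sketch of what one of these sandbox iterations might look like (assuming scikit-learn, a pandas DataFrame named `df` with a binary "label" column and numeric features; all names are illustrative, not a prescription):

```python
# Sandbox sketch: compare a few candidate algorithms on a small sample of the data.
# Assumes a pandas DataFrame `df` with a binary "label" column; names are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sample = df.sample(n=10_000, random_state=42)            # small, representative slice
X, y = sample.drop(columns=["label"]), sample["label"]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Fast loops like this make it cheap to discard algorithms that are not fit for purpose before committing cluster time to them.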
Model Training
When the data scientist is ready, they will select a larger subset of the data and train the models they identified on a cluster. Depending on the data size and complexity, it is not uncommon for larger models to take hours or days to train.
The model training process usually includes feature transformation via a pipeline, training the model, and then generating metrics of the model's performance. It is very common for models trained on a larger set of data to underperform. If this is the case, the data scientist will need to return to their sandboxed environment and iterate on the model algorithm.
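A hedged sketch of such a training run, using scikit-learn's Pipeline to bundle feature transformation with the estimator (the column names, the estimator choice, and the `df` DataFrame are assumptions for illustration only):

```python
# Sketch of a training run: transform features, fit the model, report metrics.
# Column names and the estimator are illustrative assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "price"]
categorical = ["region"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"], test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```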
Evaluate Models
The data scientists will pick one or more models based on the initial metrics identified during the model build process. Three key elements to evaluating a model to determine if it is fit for use are: Model Metrics, Model Validation, and Evaluating its Robustness and Interpretability.
Model Metrics
Digging deeply into model metrics is outside the scope of this blog post. The key takeaway is that several options always exist for determining a model's efficacy depending on the model type and use case. For example, a model predicting inflation numbers could use a Root Mean Squared Error (RMSE) metric, and a model categorizing pictures could use the F1 Score metric.
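For illustration, both of the metrics mentioned above are readily available in scikit-learn (the arrays below are placeholder values, not real results):

```python
# Computing the two example metrics; input arrays are placeholders for illustration.
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

y_true_inflation = np.array([2.1, 2.4, 3.0])
y_pred_inflation = np.array([2.0, 2.6, 2.8])
y_true_labels = ["cat", "dog", "cat", "dog"]
y_pred_labels = ["cat", "dog", "dog", "dog"]

# Regression: Root Mean Squared Error for predicted inflation numbers
rmse = np.sqrt(mean_squared_error(y_true_inflation, y_pred_inflation))

# Classification: F1 Score for predicted picture categories
f1 = f1_score(y_true_labels, y_pred_labels, average="macro")

print(f"RMSE = {rmse:.3f}, F1 = {f1:.3f}")
```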
The goal is to determine which metrics are best for measuring a model's ability to answer the question of interest. The data scientists will need to work with the subject matter experts and stakeholders to assess the pros and cons of the metrics and pick the best ones.
Model Validation
Evaluating if a model overfits or underfits involves comparing its performance on training and validation datasets. Overfitting occurs when the model performs well on training data but poorly on validation data, indicating it has memorized rather than generalized patterns. Underfitting is when the model performs poorly on both datasets, failing to capture underlying patterns.
The key here is to ensure the model performs well on data it has never seen before. Whenever possible, model validation should use live data to test for under/over-fitting. In many cases, this is not possible (e.g., the cycle time for live data is days or months). If live data is not available, always set aside a portion of the training data as a holdout for model validation.
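A minimal sketch of this check, reusing the `pipeline`, feature matrix `X`, and labels `y` assumed in the earlier sketches:

```python
# Rough overfitting/underfitting check: compare training performance against a holdout set.
# `pipeline`, `X`, and `y` are assumed from the earlier sketches.
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
train_f1 = f1_score(y_train, pipeline.predict(X_train), average="macro")
holdout_f1 = f1_score(y_holdout, pipeline.predict(X_holdout), average="macro")

# A large gap (high train score, low holdout score) suggests overfitting;
# low scores on both suggest underfitting.
print(f"train F1 = {train_f1:.3f}, holdout F1 = {holdout_f1:.3f}")
```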
Robustness and Interpretability
Evaluating a model for robustness and interpretability involves assessing more than its ability to generalize well to new data. It involves validating that the model performs well when the unexpected occurs.
A robust model maintains consistent performance across different datasets and conditions, indicating it can handle variations in input data without significant loss in accuracy or reliability. Robust models are less sensitive to noise or outliers in the data, demonstrating resilience in real-world applications.
An interpretable model provides clear explanations or insights into how it makes predictions or classifications. This transparency is crucial for understanding the reasoning behind the model's decisions, which enhances trust and enables stakeholders to comprehend and act upon the model's outputs effectively. Interpretability allows domain experts to validate the model's relevance and correctness in practical scenarios, facilitating adjustments or improvements as needed.
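As a rough sketch of both checks, reusing the fitted `pipeline` and holdout data from the validation example (the perturbed columns are illustrative assumptions):

```python
# Simple robustness and interpretability probes; names carried over from earlier sketches.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.metrics import f1_score

# Robustness: perturb numeric inputs with small Gaussian noise and re-score.
X_noisy = X_holdout.copy()
for col in ["age", "price"]:                      # illustrative numeric columns
    noise = np.random.normal(0, X_noisy[col].std() * 0.05, len(X_noisy))
    X_noisy[col] = X_noisy[col] + noise
noisy_f1 = f1_score(y_holdout, pipeline.predict(X_noisy), average="macro")
print(f"F1 on perturbed data: {noisy_f1:.3f} (compare with {holdout_f1:.3f})")

# Interpretability: permutation importance shows which features drive predictions.
result = permutation_importance(pipeline, X_holdout, y_holdout,
                                n_repeats=10, random_state=42)
for name, importance in zip(X_holdout.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```

Permutation importance is only one option; techniques such as partial dependence plots or SHAP values serve the same purpose of exposing how the model arrives at its outputs.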
Defining the Metrics
Before a model is ready for use, data scientists need to determine and formally define how the model will be monitored. The first part is documenting or automating the metrics that were determined during the model build phase. These metrics focus on the model output (its predictions, categorizations, etc.). The data scientists must ensure that the models they develop emit the data and metrics needed for that monitoring.
The second part of defining the metrics is determining what production data requires monitoring. Over time, the distribution of the data captured and used in production will drift from the distributions of data used to train the model. This drift impacts a model's effectiveness. Common approaches are to track the data distribution of key features using statistical and divergence measures. It's imperative at this stage to establish a baseline from the training data to measure against.
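One way to sketch this, assuming the training features from the earlier examples and a hypothetical `production_df` batch, is to snapshot the training distribution of each key feature and compare incoming data against it with a statistical test such as Kolmogorov–Smirnov:

```python
# Sketch of drift detection: baseline captured from training data, compared against production data.
# `X_train` comes from the earlier sketches; `production_df` and the feature names are placeholders.
from scipy.stats import ks_2samp

key_features = ["age", "price"]
baseline = {col: X_train[col].to_numpy() for col in key_features}   # captured at training time

def check_drift(production_df, baseline, threshold=0.05):
    """Flag features whose production distribution has drifted from the training baseline."""
    drifted = {}
    for col, train_values in baseline.items():
        statistic, p_value = ks_2samp(train_values, production_df[col].to_numpy())
        if p_value < threshold:
            drifted[col] = {"ks_statistic": statistic, "p_value": p_value}
    return drifted

print(check_drift(production_df, baseline))
```

Divergence measures such as the Population Stability Index or KL divergence are common alternatives to a hypothesis test here; the important part is that the baseline is captured and versioned at training time.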
Finally, the models, the training data, and the metrics need to be versioned. A model's version can be related back to a commit in your source code management system. The data used to train the model also needs to be versioned. A new version of a model can be generated from new data with no changes to the source code. This new model will have different metrics for both model performance and data distributions. All three need to be tracked together.
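A minimal, tool-agnostic sketch of tying those three together in one versioned record (the field values are placeholders; a model registry would typically capture the same information):

```python
# Tool-agnostic sketch: one record linking model, code, data, and metric versions.
import json
import subprocess
from datetime import datetime, timezone

model_record = {
    "model_version": "beer-categorizer-1.3.0",                     # placeholder name
    "code_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip(),            # commit that produced it
    "training_data_version": "sales_snapshot_2024_06",             # placeholder dataset tag
    "metrics": {"f1_macro": 0.87},                                  # placeholder evaluation result
    "feature_baselines": "baselines/sales_snapshot_2024_06.json",   # for drift monitoring
    "trained_at": datetime.now(timezone.utc).isoformat(),
}

with open("model_record.json", "w") as f:
    json.dump(model_record, f, indent=2)
```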
Continuing our Example
We’ll continue our example of Karsten Brewing LLC's quest to craft new recipes and increase sales by 20%. Armed with data from the first phase, the data science team initially built models that would categorize beers by type based on region, age bracket, and restaurant/establishment. As they iterated through their process, they found that the models did not always predict well. With additional analysis, the data science team determined that time of year was a missing feature.
Armed with this new information, the team collected additional data and selected a new approach. Instead of using a single model to categorize the types of beers, the teams used a time series approach along with categorization to determine which types of beers, by region, age bracket, and time of year, would increase sales.
Interestingly, the teams realized that they had access to the recipes for all of the beers under analysis, as well as customer and critic reviews, thanks to a data mesh being implemented in the enterprise. Using behavioral analysis techniques and behavioral models, they believed they could develop machine learning models that could aid their master brewers in crafting new recipes.
Definition of Done
Oftentimes, it's hard to know what “done” or “good” looks like. When leading a project like this, done and good often feel squishy. The following are questions you can use as a checklist to determine if your model is ready for delivery.
Good Enough
Does the Model Answer the Question of Interest Sufficiently? The most important element here is that the model delivers at least the value of the investment. The best way to know this is to engage with the subject matter experts in the domain where the question is being asked, as well as the stakeholders. Metrics are meaningless if we answer the wrong question.
Will the Teams Be Able to Determine When the Model Performs Poorly? A model developed today is not guaranteed to work in 2 months. Decisions based on a poorly performing model can have significant negative consequences for the business, including financial losses and missed opportunities.
Can You Repeatedly Rebuild the Model and Reproduce the Results? Repeatability in model development is a requirement in many industries. If you are able to repeatably recreate the model and generate the same inferences and predictions, that means you have sufficiently managed the versioning of your source code and data.
Better
Will You Be Able To Detect Shifts In Your Data? Knowing that your model is performing poorly clears the “good enough” bar. Detecting shifts in your data lets you proactively update or retrain models so they remain accurate and reliable, maintaining their effectiveness and relevance in real-world applications.
Is Your Model Explainable? Model explainability is crucial for building trust, ensuring transparency, and facilitating debugging and improvement. It helps meet regulatory compliance, enhances user acceptance, and addresses ethical concerns by providing clear insights into how and why a model makes decisions, ensuring responsible and reliable use of machine learning models.
Is your Model Robust? A robust model is essential because it maintains consistent performance across diverse and unforeseen data variations. This reliability ensures that the model can handle real-world complexities, reduces the risk of failures, and enhances trust, leading to better decision-making and more effective application in practical scenarios.
Do You Capture All Relevant Versioning Information in the Metadata for the Model? Versioning in machine learning is more complicated than in traditional software development. You must track the model version and be able to relate it to the training data, the source code that created it, and the metrics from the model when it was trained. All of these elements can be stored as metadata in the model registry.
Do You Track The Lineage of your Data? Data lineage is crucial for machine learning models as it ensures traceability, improves data quality, facilitates compliance, aids in debugging, and enhances reproducibility. By tracking data from source to destination, it helps maintain accurate, reliable, and transparent models, essential for robust and trustworthy machine learning applications.
Conclusion
Developing machine learning models is not for the faint of heart. In many cases, it involves a lot of trial, error, and iteration. Explainability, transparency, and metrics are as important to the process as the work to train a model.