Model Release & Assessment Phase

This blog post focuses on the third phase of the Data Science Process: Model Release and Assessment. In the Question Formation and Data Analysis Phase, we gathered and analyzed data to feed into the development process. In the Model Development Phase, we developed the model. In this phase, we release the model and manage its use by monitoring it and assessing its impact. 

Model Release & Assessment

Model Release

The term Model Release is used here intentionally instead of terms like model deployment or model delivery. Machine learning models can be used in various ways, including deploying a model for use in an application. There is no universal pro/con ranking of the release methods; the use case dictates how a model is released. For example, if sub-millisecond inference is a requirement, an edge deployment is the right choice. If inference needs to run over billions of data points, batch processing is your go-to. 

API, Edge Deployments, and Batch Processing are the most common release mechanisms. 

API

The most common use of Machine Learning models is through an Application Programming Interface (API). The model runs as a service or as part of a model-serving platform (like Kubeflow). External applications or systems can communicate with the API to get real-time predictions.

This flexible approach allows the model to be integrated into various applications regardless of the programming language or platform. It is also scalable, as the model can be hosted on cloud infrastructure that automatically adjusts resources based on demand.

This release mechanism's most common use cases are real-time predictions, integration with web or mobile applications, and/or integration with microservice architectures.
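
To make the API route concrete, here is a minimal sketch of serving a scikit-learn style model behind a REST endpoint with FastAPI. The framework, the model.pkl artifact, and the /predict endpoint are illustrative assumptions, not anything prescribed by a particular platform.

```python
# A minimal serving sketch, assuming a pickled scikit-learn style model.
# Run with an ASGI server, e.g. `uvicorn serve:app` if saved as serve.py (name assumed).
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # hypothetical model artifact
    model = pickle.load(f)


class PredictionRequest(BaseModel):
    features: list[float]  # one row of input features


@app.post("/predict")
def predict(request: PredictionRequest):
    # Make a single real-time prediction and return it to the calling application.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```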

Edge Deployments

Models are deployed directly onto edge devices like smartphones, IoT devices, or other embedded systems. The model runs locally on the device, making predictions without needing constant connectivity to a central server. The model is not abstracted behind an external service; in most cases, it is integrated directly into the device's software. 

This method reduces latency since the predictions are made on the device itself. It also improves privacy because data doesn't need to be sent to a server. Additionally, it reduces reliance on network connectivity.

This release mechanism's most common use cases are applications requiring real-time or sub-millisecond responses, environments with limited or unreliable connectivity, and privacy-sensitive applications.
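
As an illustration, here is a minimal sketch of local, on-device inference with ONNX Runtime. The model file, input shape, and runtime choice are assumptions for the example; other embedded runtimes (TensorFlow Lite, Core ML, and so on) follow the same pattern of loading the model once and predicting locally.

```python
# A minimal on-device inference sketch with ONNX Runtime, assuming the model
# has already been exported to model.onnx. Path, shape, and dtype are
# illustrative; no network connection is required at prediction time.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # hypothetical exported model
input_name = session.get_inputs()[0].name

# A single input sample; shape and dtype must match the exported model.
sample = np.random.rand(1, 4).astype(np.float32)

outputs = session.run(None, {input_name: sample})
print("local prediction:", outputs[0])
```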

Batch Processing

Models are delivered to an artifact or model repository for use in data pipelines, which process large volumes of data in batches rather than in real time. Data is collected and stored, and the model is run periodically to generate predictions based on the accumulated data.

Batch processing efficiently handles large datasets and can be scheduled during off-peak hours to optimize resource usage. It’s ideal for use cases where real-time predictions are unnecessary, and results can be delivered after processing.

This release mechanism's most common use cases are predictive maintenance, risk assessment, and large-scale data analysis. 
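
For a concrete picture, here is a minimal sketch of a scheduled batch scoring job: it loads a model from an artifact path, scores accumulated records in chunks, and writes the predictions out. The file paths, column names, and chunk size are illustrative assumptions.

```python
# A minimal batch scoring sketch: load a model from an artifact path, score
# accumulated records in chunks, and write the predictions out.
import pickle

import pandas as pd

with open("models/model.pkl", "rb") as f:  # hypothetical artifact path
    model = pickle.load(f)

results = []
# Process the accumulated data in manageable chunks rather than all at once.
for chunk in pd.read_csv("data/accumulated_records.csv", chunksize=100_000):
    features = chunk.drop(columns=["record_id"])  # hypothetical ID column
    chunk["prediction"] = model.predict(features)
    results.append(chunk[["record_id", "prediction"]])

pd.concat(results).to_csv("output/batch_predictions.csv", index=False)
```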

Monitoring

When monitoring a model, three key items should be tracked: the drift of production data away from the data the model was trained on, the model metrics we identified during model development, and the health of the systems that host or use the models. 

Data Drift

Monitoring data drift is essential for maintaining the performance and reliability of machine learning models in production. Data drift refers to changes in the input data distribution over time, which can degrade model accuracy if not addressed. There are two main types of data drift: covariate drift and concept drift. Covariate drift occurs when the distribution of input features changes, but the relationship between the features and the target variable remains the same. 

To make covariate drift more tangible, let's look at a simple example. Let's say you train a model to predict the color of a ball picked from a toy box. The sampled toy boxes contain equal numbers of red and blue balls, and this mix is what the model learns to predict from. Over time, the boxes the model sees in production contain far more blue balls than red ones, but the model is still predicting based on a color distribution that has since changed. The solution is to retrain the model on data that represents the current mix; the model will then predict blue balls more accurately. 

The other common type of drift is concept drift. This happens when the underlying relationship between features and the target variable shifts, leading to changes in the model's predictive efficacy.

To make concept drift more tangible, let's look at another simple example. Let's say you have developed a model that identifies an animal based on the sounds it hears. In your training data, the model learns that if there is a roar, it's a lion, because that is the only animal at the zoo that roars. Over time, two new animals that also roar (a tiger and a bear) are added to the zoo. The model can no longer predict accurately because it only knows about lions. As with the covariate drift example, all that may be needed is to retrain the model on more current data. Then the model will be able to predict lions and tigers and bears (oh my!). 

Monitoring data drift is crucial because it helps detect when a model's performance might be compromised due to shifts in data patterns. If left unchecked, data drift can lead to poor predictions, increasing risks and costs associated with incorrect decisions. Regular monitoring allows for timely interventions, such as model retraining or adjusting for new data trends.

It’s important to establish thresholds based on historical performance and business requirements to assess tolerances for data drift. Tolerances can be determined by analyzing the impact of various degrees of drift on model accuracy. Statistical tests (e.g., Kolmogorov-Smirnov test for covariate drift) or model performance tracking can be used to measure drift. Once a threshold is exceeded, automated alerts or triggers can initiate a deeper investigation or prompt model retraining, ensuring the model remains robust and reliable.
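
As a concrete example of the statistical tests mentioned above, here is a minimal sketch of a covariate drift check using the two-sample Kolmogorov-Smirnov test from SciPy. The p-value threshold and the synthetic samples are illustrative assumptions; real tolerances should come from the historical analysis described earlier.

```python
# A minimal covariate drift check using the two-sample Kolmogorov-Smirnov test.
# The p-value threshold and synthetic samples are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.05  # assumed tolerance; tune per feature and use case


def feature_has_drifted(training_values, production_values) -> bool:
    """Return True if the production distribution differs from training."""
    statistic, p_value = ks_2samp(training_values, production_values)
    return p_value < P_VALUE_THRESHOLD


# Example: compare a feature's training sample against a recent production window.
rng = np.random.default_rng(42)
train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_sample = rng.normal(loc=0.5, scale=1.0, size=5_000)  # shifted distribution

if feature_has_drifted(train_sample, prod_sample):
    print("Drift detected: investigate or trigger retraining.")
```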

Most model-serving platforms, such as Kubeflow, SageMaker, and Azure Machine Learning, provide drift monitoring as part of the platform, which greatly simplifies monitoring the data that released models receive. 

Model Metrics

Monitoring machine learning models using the metrics selected during development ensures the model delivers accurate and reliable predictions. These metrics are either gathered from the results of predictions or emitted by the models themselves. 

To effectively monitor performance drift using these metrics, it is important to establish baselines and track the metrics over time; a short sketch follows the list below. This involves:

  • Setting Benchmarks: Define acceptable performance levels for metrics based on historical data and business requirements.

  • Regular Evaluation: Continuously evaluate the model using a validation set or real-time data to detect deviations from the established benchmarks.

  • Tolerance Assessment: Determine thresholds for acceptable levels of drift. These thresholds should be based on historical performance and the acceptable margin of error for the specific application.
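
Here is a minimal sketch of the benchmark-and-tolerance idea above: compare a tracked metric against a baseline captured during development and flag a breach. The metric (F1), baseline, and allowed drop are illustrative assumptions.

```python
# A minimal benchmark-and-tolerance check: compare a tracked metric against a
# baseline set during development. The F1 baseline and allowed drop are assumed.
from sklearn.metrics import f1_score

BASELINE_F1 = 0.88   # benchmark captured during model development (assumed)
ALLOWED_DROP = 0.05  # acceptable margin of error for this application (assumed)


def meets_benchmark(y_true, y_pred) -> bool:
    """Return True if the model still meets the agreed performance benchmark."""
    current_f1 = f1_score(y_true, y_pred)
    print(f"current F1: {current_f1:.3f} (baseline {BASELINE_F1:.2f})")
    return current_f1 >= BASELINE_F1 - ALLOWED_DROP


# Example usage with labelled samples collected after release.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
if not meets_benchmark(y_true, y_pred):
    print("Performance below tolerance: alert the team or trigger retraining.")
```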

Systems and Infrastructure

Monitoring systems and infrastructure is a fundamental aspect of platform engineering. We need to monitor the systems that use machine learning models and those that serve/run machine learning models. Common approaches include monitoring system performance, application health, inference/prediction latency, and availability. TechTarget has a great reference article to get you started.
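
As one way to approach this, here is a minimal sketch of exposing inference latency and error counts with the Prometheus Python client so a platform monitoring stack can scrape them. The metric names, port, and placeholder model call are illustrative assumptions.

```python
# A minimal sketch of exposing inference latency and error counts so the
# platform's monitoring stack can scrape them from a /metrics endpoint.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "model_inference_latency_seconds", "Time spent producing a prediction"
)
INFERENCE_ERRORS = Counter(
    "model_inference_errors_total", "Number of failed prediction requests"
)


def predict(features):
    time.sleep(random.uniform(0.01, 0.05))  # placeholder for the real model call
    return 0.0


@INFERENCE_LATENCY.time()  # records how long each prediction takes
def handle_request(features):
    try:
        return predict(features)
    except Exception:
        INFERENCE_ERRORS.inc()
        raise


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the monitoring system
    while True:
        handle_request([0.1, 0.2, 0.3])
```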

Assess Impact

Assessing the impact of machine learning models is crucial to ensure they are fair, transparent, and beneficial to society. It helps identify and mitigate biases, ensures ethical decision-making, and evaluates broader social and economic effects, thus preventing harm and ensuring that models contribute positively without exacerbating inequalities or causing unintended consequences.

Bias and Fairness Audits

The objective of this type of assessment is to ensure that the machine learning model does not disproportionately impact specific groups based on race, gender, or socioeconomic status. This assessment aims to identify and mitigate systemic biases embedded in the model’s predictions, thereby promoting fairness and equity in its outcomes.

You can do this by conducting comprehensive audits that analyze model predictions across different demographic groups. Utilize fairness metrics such as the disparate impact ratio or equal opportunity difference to evaluate potential biases. Engage external reviewers or stakeholders to provide an unbiased perspective on fairness. Implement regular checks and updates to address any detected biases, adjusting the model or retraining with more balanced data if necessary.
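
To make one of these metrics concrete, here is a minimal sketch of computing the disparate impact ratio: the rate of favorable outcomes for a protected group divided by the rate for the reference group. The column names, sample data, and the 0.8 rule-of-thumb threshold are illustrative assumptions.

```python
# A minimal disparate impact ratio check. Column names, sample data, and the
# 0.8 rule-of-thumb threshold are illustrative assumptions.
import pandas as pd


def disparate_impact_ratio(df, group_col, protected_value, outcome_col):
    protected = df[df[group_col] == protected_value]
    reference = df[df[group_col] != protected_value]
    return protected[outcome_col].mean() / reference[outcome_col].mean()


# Hypothetical audit data: 1 = favorable model outcome.
audit = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "favorable": [1, 1, 0, 1, 1, 0, 0, 0],
})

ratio = disparate_impact_ratio(audit, "group", "B", "favorable")
print(f"disparate impact ratio: {ratio:.2f}")
if ratio < 0.8:  # common rule of thumb, not a legal or universal standard
    print("Potential disparate impact: investigate and consider mitigation.")
```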

Transparency and Explainability Reviews

This assessment aims to ensure that the machine learning model's decisions are understandable and justifiable, particularly to those affected by its predictions. It also aims to enhance transparency, foster trust, and ensure accountability in the model’s decision-making process.

This is done by implementing explainable AI techniques such as SHAP values or LIME to interpret individual predictions and feature importance. Create user-friendly visualizations and explanations to communicate how decisions are made. Regularly review the explanations with stakeholders and affected individuals to ensure clarity and alignment with governance standards. Update the model’s documentation to include detailed explanations of its decision-making process and rationale.
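
As a small example of the techniques mentioned above, here is a minimal sketch of generating SHAP feature attributions for a tree-based model. The dataset and model are stand-ins; a real review would explain the production model's predictions on production inputs.

```python
# A minimal SHAP sketch for a tree-based model. The dataset and model are
# stand-ins (assumptions), not the production system described in the post.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes each feature's contribution to individual predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

# A summary plot is one user-friendly way to communicate feature importance:
# shap.summary_plot(shap_values, X[:5], feature_names=data.feature_names)
print("Generated SHAP explanations for", X[:5].shape[0], "predictions")
```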

Social and Economic Impact Assessment

This assessment helps to evaluate the broader societal and economic consequences of deploying the machine learning model, ensuring it does not cause unintended harm or exacerbate existing inequalities. It also helps to understand the model’s impact on community well-being and economic factors.

The governance and data teams need to conduct impact assessments by analyzing changes in key metrics such as employment rates, income distribution, or access to services that result from the model’s deployment. Leaders should engage with community groups, policymakers, and experts to gather qualitative feedback on the model’s impact. Use this feedback to identify and address any negative externalities or unintended consequences. Regularly update the assessment to reflect ongoing changes and ensure the model continues to have a positive social and economic impact.

Continuing our Example

The models created during development produced 20 recipe recommendations, which the data scientists, in partnership with the brewmasters, used to create 20 new summer recipes for different regions. Combining sales numbers with direct feedback from customers and buyers, it was determined that 5 of the 20 recipes were very successful. Overall, sales increased by about 16%. For the target age group, 25- to 35-year-olds, sales increased by about 25%. 

Overall, using models to help craft beer recipes and increase sales was effective. Audits and reviews found that no groups were disenfranchised. However, the methods used to gather input focused on areas near colleges, which does not provide a holistic understanding of the consumer landscape. 

When machine learning is used to craft new recipes for winter brews, the feedback channels will include more age groups and larger demographic areas. 

Definition of Done

Often, it's hard to know what “done” or “good” looks like. When leading a project like this, being done and good often feels squishy. The following are questions you can use as a checklist to determine if your model is ready for delivery.

Good Enough:

  • Do you capture and monitor model metrics? Your observability systems should capture metrics from the models. This enables the establishment of baselines and tolerances. 

  • Do you have observability for the systems and platforms that host and use the models? Monitoring the models themselves is not enough. You must also monitor the rest of the platform and systems that host your applications.  

  • Do you store data used by the models? Storing the data used with production models allows for periodic drift analysis. This enables the data science teams and MLOps engineers to determine whether a model’s efficacy may be impacted. 

  • Do you perform audits and reviews of model predictions and inferences? You should perform transparency and explainability reviews and bias and fairness audits as part of your governance models. This ensures that your models do not inadvertently disenfranchise vulnerable portions of the population. 

Better:

  • Do you have automatic measures for transparency and traceability of your model behavior? Automatically generating things like SHAP values or LIME explanations for features allows you to automate tolerances for transparency and traceability. 

  • Do you automate the monitoring of data drift? Monitoring data drift is necessary to ensure model accuracy over time, as shifts in data distribution can degrade performance and lead to incorrect predictions or decisions. Doing this via automated means enables the establishment of baselines and tolerances. 

  • Do you automate retraining and testing of machine learning models? If you are tracking model metrics and data drift, you can set tolerances for model retraining. Using tolerances from those metrics enables automated pipelines that retrain, test, and deliver models when they are no longer effective; a sketch of such a trigger follows this list. 
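
Here is a minimal sketch of what such an automated trigger could look like, tying the drift and metric checks together. The pipeline call is a placeholder assumption; a real setup would submit a run to Kubeflow, SageMaker, Azure Machine Learning, or a similar platform.

```python
# A minimal sketch of an automated retraining trigger that ties the earlier
# drift and metric checks together. The pipeline call is a placeholder.
def trigger_retraining_pipeline(reason: str) -> None:
    # Placeholder: in practice, submit a pipeline run via your MLOps platform.
    print(f"Retraining pipeline triggered: {reason}")


def monitoring_cycle(drift_detected: bool, metrics_within_tolerance: bool) -> None:
    """Run once per monitoring interval (e.g., nightly)."""
    if drift_detected:
        trigger_retraining_pipeline("data drift exceeded tolerance")
    elif not metrics_within_tolerance:
        trigger_retraining_pipeline("model metrics fell below baseline")
    else:
        print("Model within tolerances; no action needed.")


# Example: feed in the results of the drift and metric checks sketched earlier.
monitoring_cycle(drift_detected=True, metrics_within_tolerance=True)
```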

Conclusion

Releasing models for use and tracking model efficacy can be very complicated. It includes tracking model metrics, understanding and monitoring the data used by models in production, and evaluating models for efficacy and fairness. This step feeds back into model development; if simply retraining models isn't enough, they must be redeveloped. It also feeds back into question formation, because understanding the relationships in the data brings clarity to the question of interest. 

We have a blog post if you want to know more about operating models and the MLOps process. 

Next

Practical Business Reasons to Resist the Allure of AI