Introducing a Data Science Process for AI/ML

This is an introduction to a series of blog posts describing the process of creating and operating a Machine Learning (ML) model to deliver true business value.

Senior leaders frequently encounter difficulties steering and supporting Artificial Intelligence (AI) and ML initiatives in today’s frenzied rush to participate in (or at least appear to participate in) the rapidly evolving AI race. Much of this difficulty comes from a lack of clarity about what “good” looks like, starting with defining success metrics and benchmarks for the effort. Conversations about ML projects often become mired in implementation minutiae rather than a candid assessment of the business value and return on investment of AI/ML endeavors. What is needed is a lightweight, repeatable approach for managing these initiatives.

A Modern Approach to the Data Science Process

In an enterprise of any size, there is often someone who advocates for a specific framework that will “revolutionize” how a given kind of work is done, and data science and machine learning are no different. Common frameworks used in data science, business intelligence, and machine learning include CRISP-DM, TDSP, KDD, and SEMMA.

The biggest challenge with these frameworks is that they originated in the data mining community or as project management frameworks. As a result, they struggle to adapt to the rapidly changing state of data we see today, and by and large they lack checkpoints for explainability and impact assessment.

Modern software development frameworks emphasize agility and rapid feedback loops. To be effective, data science processes must also adopt these principles while addressing privacy and governance concerns.

Modern Data Science Process

This data science process is logically broken into three distinct phases:

  • Question Formation and Data Analysis - In this phase, we identify the business value to be delivered and determine what data are needed to support it.

  • Model Development - The machine learning model is then iteratively developed, tested, and evaluated. At this point, metrics specific to the model’s efficacy are defined.

  • Model Release and Assessment - The model is “delivered” to the end users. Delivery here can take several forms depending on the use case (e.g., inference/analysis versus prediction). Mechanisms to assess the model’s efficacy and impact are implemented as part of delivery.

Series Posts

The posts that follow will explore each of these phases in more detail and show how to leverage this process to drive business value from your AI/ML experiments.

Question Formation and Data Analysis - This blog post discusses how, from a leadership perspective, to approach the first phase of developing a machine learning model.

Developing the ML Model - This blog post will discuss the complexities of developing machine learning models and what leaders need to understand about the process to coach teams successfully.

Delivering the Machine Learning Model - This blog post provides a framework for delivering the machine learning model. The type of model and its use case drive how models are delivered, deployed, and monitored.

Explainable AI in the Data Science Process - In this final blog post, we wrap up the series with a discussion of how transparency in model development and the ability to explain “why” a model made its inference or prediction will soon be a requirement for all businesses. Laws regarding the use of AI and requirements for Explainable AI (XAI) are being drafted now, and understanding their impact on model development and use will be key to the long-term viability of AI/ML in the enterprise.

Defining Common Terms and Concepts For Model Development

Like all formalized disciplines, data science has its own technical terms and concepts. This section defines some of the key ones before we explore the model development process; understanding them will provide helpful context as we discuss how models should be developed.

Common Terms

Labeled Data

Labeled data is data annotated with target outcomes or categories. This data is considered “ground truth” because it is the authoritative reference data used to train a model. 
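To make this concrete, here is a minimal sketch of what a small labeled dataset might look like. The spam/not-spam scenario, the field names, and the use of plain Python structures are illustrative assumptions, not part of any particular project.

```python
# A tiny, hypothetical labeled dataset for a spam filter.
# Each record pairs input attributes with a known, human-assigned label.
labeled_emails = [
    {"word_count": 120, "num_links": 0, "has_attachment": False, "label": "not_spam"},
    {"word_count": 35,  "num_links": 7, "has_attachment": False, "label": "spam"},
    {"word_count": 410, "num_links": 1, "has_attachment": True,  "label": "not_spam"},
]

# The "label" field is the ground truth the model will learn to reproduce;
# the remaining fields are candidate features.
```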

Model Training/Learning

Model training is the process where a machine learning model learns the relationships between input data (features) and their corresponding labels (outputs) from labeled datasets. Through iteration, the model improves its ability to accurately predict or classify new, unseen data. This learning process enhances the model's capability to generalize from training examples to make reliable predictions or classifications in real-world applications.
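As a rough sketch of what training looks like in code, the snippet below fits a model to a handful of labeled examples. The choice of scikit-learn, the logistic regression algorithm, and the toy numbers are assumptions made purely for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is [word_count, num_links],
# and y holds the corresponding ground-truth labels (1 = spam, 0 = not spam).
X_train = [[120, 0], [35, 7], [410, 1], [50, 9]]
y_train = [0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)  # "training": the model learns feature-to-label relationships

# After training, the model can be asked about data it has never seen.
print(model.predict([[40, 6]]))  # predicted label for an unseen email
```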

Model Features

Model features are individual measurable properties or characteristics of data used as input variables in machine learning models. They are data elements that the machine learning model can create relationships between during training. These features enable the model to make predictions or classifications. Elements of the data to be used as features are selected by a process known as feature engineering.

Feature Engineering

Feature engineering is the process of selecting and, when needed, transforming parts of the raw data so it can be used as features for the model during training. The process of transforming the data is similar to a traditional Extract, Transform, and Load (ETL) process.
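The sketch below illustrates one common kind of feature-engineering transformation: turning a raw date and a raw categorical field into numeric features a model can use. The pandas library and the specific columns are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw records, as they might arrive from a source system.
raw = pd.DataFrame({
    "signup_date": ["2024-01-03", "2024-02-14"],
    "plan": ["basic", "premium"],
    "monthly_visits": [4, 27],
})

# Transform raw fields into model-ready features (an ETL-like step):
features = pd.DataFrame({
    # Derive a numeric feature from the raw date.
    "signup_month": pd.to_datetime(raw["signup_date"]).dt.month,
    # Keep an already-numeric field as-is.
    "monthly_visits": raw["monthly_visits"],
})
# One-hot encode the categorical "plan" column into separate 0/1 features.
features = features.join(pd.get_dummies(raw["plan"], prefix="plan"))

print(features)
```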

Hyperparameters

Hyperparameters are external configurations of a machine learning model, set before training, that control its learning process and affect performance. You can think of them as arguments passed to a function: they shape how the model learns during training.
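For example, in the sketch below the number of trees and the tree depth are hyperparameters: they are chosen before training starts and shape how the model learns. The scikit-learn random forest is just one illustrative model, and the specific values are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are set up front, before any data is seen.
model = RandomForestClassifier(
    n_estimators=200,  # how many trees to build
    max_depth=5,       # how deep each tree may grow
    random_state=42,   # fixed seed for reproducibility
)

# In contrast, the model's internal parameters (the trees themselves)
# are learned later, when model.fit(...) is called on training data.
```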

Model Algorithms

A machine learning model algorithm is a set of rules or procedures used to identify patterns in data, enabling the model to make predictions or decisions based on new input data. Examples include linear regression, k-nearest neighbors, neural networks, etc. We will not dig into any of the algorithms in this blog post. It is important to know that model algorithms align with the types of learning and use cases. For example, linear regression is a supervised algorithm primarily used for prediction. 
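One practical consequence is that different algorithms can often be swapped behind the same train-and-predict interface. The sketch below, using scikit-learn as an illustrative library and made-up housing numbers, fits two different regression algorithms to the same toy data.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy supervised data: square footage -> sale price (purely illustrative numbers).
X = [[800], [1200], [1500], [2000]]
y = [150_000, 210_000, 260_000, 340_000]

# Two different algorithms, same interface, same prediction task.
for algorithm in (LinearRegression(), KNeighborsRegressor(n_neighbors=2)):
    algorithm.fit(X, y)
    print(type(algorithm).__name__, algorithm.predict([[1700]]))
```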

Supervised Learning

Supervised learning is a type of machine learning where the model is trained on labeled data. Each training example consists of an input-output pair, with the input being the features and the output being the known label. The model learns to map inputs to outputs by finding patterns in the training data. After training, the model can predict labels for new, unseen data. The big takeaway for supervised learning is that the data used during training has labels the model can use to create the relationships (i.e., ground truth data). You will often use supervised learning when developing inference, predictive, and, in some cases, categorative models.
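Bringing the previous terms together, the sketch below shows a typical supervised-learning flow: split the labeled data, train on one portion, and check how well the model predicts labels for examples it has never seen. scikit-learn and the tiny synthetic dataset are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Tiny synthetic labeled dataset: [feature_1, feature_2] -> class label.
X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9], [2, 2], [8, 8]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

# Hold back some labeled examples so we can test generalization.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)          # learn the input-to-label mapping
predictions = model.predict(X_test)  # predict labels for unseen data
print("accuracy:", accuracy_score(y_test, predictions))
```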

Unsupervised Learning

Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. The goal is to identify patterns or structures within the data without prior knowledge of outcomes. Unsupervised learning is used for tasks like grouping similar items, anomaly detection, and data compression, revealing insights and hidden structures in the data. In most cases, unsupervised learning is used for clustering (e.g., grouping things that have never been grouped together before), association rule learning (e.g., how closely related “things” are), and generative models (e.g., ChatGPT).
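In contrast to the supervised example above, the sketch below clusters unlabeled points: no ground-truth labels are supplied, and the algorithm simply groups points that sit close together. scikit-learn's k-means and the toy two-dimensional data are illustrative assumptions.

```python
from sklearn.cluster import KMeans

# Unlabeled data: just feature values, no target column at all.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
     [7.9, 8.1], [8.2, 7.8], [8.0, 8.0]]   # another natural group

# Ask the algorithm to find 2 clusters; it has no notion of what they "mean".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids)  # group membership per point, not business labels
```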

Functional Model Use Cases

Models can be generally broken down into four functional use cases: Inference, Predictive, Categorative, and Generative.  

Inference Models

Models used for inference are concerned with explaining “why” something is the way it is. Examples are understanding housing price fluctuations, the causes of food deserts in cities, the impact of laws on a portion of the population, etc. Inference is closely tied to the next type of model use case: prediction.

Predictive Models

As the name implies, predictive models attempt to predict something in the future based on data from past events. Examples of this are stock prices, items you are likely to buy, the outcome of an upcoming election, the next word in a sentence, etc.

Categorative Models

Categorative models are used to group “things” together. We put quotes around “things” because these models can group data across many modalities, based on data the model has already seen. Examples of this are grouping photos by animal type, grouping music into categories, grouping customers by the products they are likely to buy, etc.

Generative Models

Generative models are used to create a “thing” based on data the model has already seen. Again, we use quotes because generative models can create things across many modalities. Examples of this are creating music, pictures, essays for homework, etc. There is a crossover between generative and predictive models: in some cases, the generative model predicts the likelihood of the next word or phrase and then uses that prediction to generate the “thing”.
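To give a feel for that crossover, here is a deliberately tiny, purely illustrative sketch of next-word prediction: it counts which word follows which in a small text sample and then “generates” by repeatedly picking the most likely next word. Real generative models are vastly more sophisticated; this only shows the predict-then-generate loop described above.

```python
from collections import Counter, defaultdict

# A tiny corpus, purely for illustration.
corpus = "the model predicts the next word and the next word follows the model".split()

# Count which word tends to follow which (a simple bigram table).
next_words = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_words[current][following] += 1

# "Generate" by repeatedly predicting the most likely next word.
word, generated = "the", ["the"]
for _ in range(5):
    word = next_words[word].most_common(1)[0][0]
    generated.append(word)
print(" ".join(generated))
```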
