Question Formation and Data Analysis in Data Science

This blog post focuses on the first phase of the Data Science Process: Question Formation and Data Analysis. In this phase, we iterate multiple times through question formation, data collection, and exploration. Initial questions are likely to be of low fidelity. Through the process of data exploration, the questions gain fidelity and drive toward business value.

Question Formation & Data Analysis

Formulating a question for a data science project is the most critical step. It lays the foundation for the entire investigation and encapsulates the business value. Business value is defined in one of four ways: Increase in Revenue, Protection of Revenue, Reduction of Cost, or Avoidance of Cost. The question (sometimes referred to as the question of interest) serves as a guiding beacon, directing the focus of data collection and analysis. To ensure effectiveness, the question must be specific, answerable, and actionable. This combination provides clarity and a clear target for the effort's objectives. A common “thought” structure for the question of interest is, “If I knew X, then I would be able to do Y, which would benefit us in Z way.”

Specificity is key as ambiguity in the question leads to vague outcomes that will not deliver the value the business or the customer expects. A specific question helps define the scope of the project and identify relevant data sources.

The question must be answerable within the constraints of available data and resources. Gathering the necessary data and performing analysis to address the question effectively should be feasible. This requires considering the data's quality, availability, and relevance.

Without action, there is no purpose in asking the question. If knowing the answer does not prompt or enable action, then it does not drive business value. Question formation is an iterative process. Initially, there may not be enough understanding of the business need, customer need, or the available data to formulate a good question. As the teams iterate through this phase, they gain insight into the business through the data. This refines the question, clarifies the business value, and enables action.

Getting the Data

Data collection is a pivotal phase of the Exploratory Data Analysis (EDA) process, laying the groundwork for insightful analysis and decision-making. We consider it a component separate from EDA in this phase because data science teams can encounter significant challenges in acquiring the necessary data for EDA.

Data availability poses a significant hurdle. Teams often struggle to access comprehensive datasets encompassing all relevant variables and time periods. Incomplete or fragmented data can hinder the efficacy of EDA, leading to biased insights or incomplete conclusions.

Data quality issues present a common challenge. Inaccuracies, inconsistencies, and missing values within the dataset can skew analysis results and compromise the reliability of findings. Data cleansing and preprocessing as part of the collection are essential tasks in addressing these issues effectively.
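To make the cleansing step concrete, here is a minimal sketch of the kind of preprocessing described above, using pandas. The column names and imputation choice (median fill) are illustrative assumptions, not a prescription:

```python
import pandas as pd
import numpy as np

# Hypothetical raw sales records exhibiting common quality problems:
# inconsistent casing, exact duplicates, and missing values.
raw = pd.DataFrame({
    "region": ["Midwest", "midwest", "South", "South", None],
    "units":  [120, 120, np.nan, 95, 40],
})

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["region"] = out["region"].str.title()   # normalize casing
    out = out.dropna(subset=["region"])         # drop rows missing the key field
    out = out.drop_duplicates()                 # remove exact duplicates
    # Impute missing measures with the median (one simple strategy of many).
    out["units"] = out["units"].fillna(out["units"].median())
    return out.reset_index(drop=True)

cleaned = clean_sales(raw)
```

In a real project, each of these decisions (what counts as a duplicate, how to impute) should be documented so the analysis remains reproducible.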

Data privacy and compliance considerations add complexity to the data collection process. Ensuring compliance with regulations such as GDPR or HIPAA while accessing and handling sensitive data requires careful navigation of legal and ethical frameworks. Failure to adhere to these regulations can result in legal repercussions and damage to organizational reputation.

Data Governance plays a crucial role in overcoming these challenges and facilitating the data collection process. Effective governance frameworks establish protocols for data acquisition, ensuring data integrity, security, and compliance.

Getting the data is an iterative process. As the data is explored and the question of interest is fine-tuned, the teams may realize they need additional data. In many cases, data is already in the enterprise. In some cases, needed data may be external to the enterprise. Data like weather reports and surveys can enrich the enterprise's data.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) uncovers insights, identifies patterns, and reveals a dataset's underlying structure. It involves visually and statistically exploring the data to uncover potential relationships, anomalies, and trends. EDA serves multiple purposes, including hypothesis generation, data cleaning, and feature selection, laying the groundwork for more advanced analytical techniques.

One key output of EDA is the formal definition of features that can be used in model development. Through EDA, data scientists can identify and select the most relevant variables for predictive modeling based on their correlations, distributions, and predictive power. These relevant variables are collected and sometimes transformed, becoming the key “features” used when developing an ML model.
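A minimal sketch of correlation-based feature screening might look like the following. The dataset is synthetic and the column names and 0.2 threshold are assumptions for illustration; real feature selection would also weigh distributions, domain knowledge, and multicollinearity:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic dataset: two informative candidate features and one noise column.
df = pd.DataFrame({
    "price":     rng.normal(10, 2, n),
    "promo_pct": rng.uniform(0, 0.3, n),
    "noise":     rng.normal(0, 1, n),
})
df["sales"] = 50 - 3 * df["price"] + 80 * df["promo_pct"] + rng.normal(0, 1, n)

# Rank candidate features by absolute correlation with the target.
corr = df.corr()["sales"].drop("sales").abs().sort_values(ascending=False)
selected = corr[corr > 0.2].index.tolist()  # simple threshold; tune in practice
```

The surviving columns become candidate “features,” possibly after transformation (scaling, encoding, binning), for model development.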

As part of the EDA process, documentation is generated. This documentation captures insights, patterns, and underlying structures discovered during the analysis. This documentation is valuable for stakeholders, providing a comprehensive overview of the dataset characteristics and informing decision-making processes. It helps communicate findings, justify analytical approaches, and guide subsequent steps in the data analysis pipeline. This documentation is sometimes referred to as the “Data Book.”

Governance in EDA sets protocols for data access, quality standards, privacy protection, and documentation, all of which are integral for maintaining data integrity and regulatory compliance throughout the exploratory process. Governance also defines how privacy or bias concerns surfaced during discovery should be addressed.

Without continuing to flog the old iteration horse, EDA is, at its core, an iterative process. As insights are uncovered, they are fed back into question formation and data gathering. The learnings gained refine the needed data and provide clarity for the question of interest.

A Simple Example of the Process

For this example, we’ll use everyone's favorite pastime, BEER! Let's say you are a senior leader at a large beer manufacturer (Karsten Brewing, LLC). You are leading a team in defining a strategy for the new beer flavors and types that should be developed. The assumption is that a minor change to the existing flagship sub-brand (Karsten Light) will yield additional sales and profits. The senior leadership team wants to extend the traditional analysis and recommendations with “AI.” The hope is that machine learning practices can yield more targeted results.

The initial question of interest was:

Will a minor formulation change to Karsten Light yield a 20% uptick in sales?

This question is a poor starting point: it is too broad, and knowing the answer alone does not suggest a specific action.

The team sets about gathering the initial sets of data, primarily sales revenue by sub-brand and date range. As part of the analysis, the data scientists begin to see geographic and seasonal trends in the sub-brands' sales data.
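Surfacing those geographic and seasonal patterns can be as simple as a pivot of revenue by region and month. The data below is invented for illustration (the real sales data, regions, and figures are assumptions):

```python
import pandas as pd

# Hypothetical sub-brand sales records (values are illustrative only).
sales = pd.DataFrame({
    "sub_brand": ["Karsten Light"] * 6,
    "region":    ["Midwest", "Midwest", "South", "South", "Midwest", "South"],
    "date":      pd.to_datetime(["2023-01-15", "2023-07-15", "2023-01-20",
                                 "2023-07-20", "2023-07-30", "2023-01-25"]),
    "revenue":   [100.0, 180.0, 90.0, 60.0, 170.0, 95.0],
})

# Aggregate by region and month to surface geographic and seasonal patterns.
sales["month"] = sales["date"].dt.month
trend = (sales.groupby(["region", "month"])["revenue"]
              .sum()
              .unstack("month"))
```

Even this toy pivot shows one region peaking in summer and another in winter, exactly the kind of observation that prompts refining the question toward geography.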

The question is refined to be:

If we tailor sub-brands for a geographic area, will that yield a 20% uptick in sales?

The data science teams collect additional data from outside sources, such as surveys. The surveys contain consumer age, beer type preferences, where the beer is consumed, etc. This yields additional insights, which are fed back to the question-formation process.
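Enriching internal sales data with external survey results is typically a join on a shared key. A minimal sketch, assuming region is the common key and the survey columns are invented for illustration:

```python
import pandas as pd

# Internal sales by region (illustrative figures).
sales = pd.DataFrame({
    "region":  ["Midwest", "South", "West"],
    "revenue": [350.0, 185.0, 240.0],
})

# Hypothetical external survey aggregates keyed by the same regions.
survey = pd.DataFrame({
    "region":         ["Midwest", "South"],
    "median_age":     [31, 44],
    "pref_beer_type": ["IPA", "Lager"],
})

# Left join keeps every sales region; the indicator flags survey coverage gaps.
enriched = sales.merge(survey, on="region", how="left", indicator=True)
missing = enriched[enriched["_merge"] == "left_only"]["region"].tolist()
```

Flagging regions with no survey coverage (here, a gap in the external data) is itself an insight to feed back into data gathering.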

To jump to the end of the process, it's determined that the highest per-person spend on beer occurs in areas with a higher density of micro-breweries. It is also determined that people aged 25 to 35 are the heaviest consumers in these areas, usually at restaurants and bars. The final question of the process was:

“If we knew what kinds of beers were the most popular by geography, restaurant consumption, and age bracket, we would be able to craft recipes that would increase sales by 20% among consumers between the ages of 25 and 35.”

This is a good question because it is specific and actionable. The leadership can take action if the machine learning models can predict the types of beers to craft by market.

It's not so far-fetched that machine learning can be used to make better beer. In fact, it was successfully done earlier this year.

Definition of Done

Oftentimes, it's hard to know what “done” or “good” looks like. When leading a project like this, done and good often feel squishy. The following are questions you can use as a checklist to determine if you are ready to move into model development.

Good Enough

  • Would the answer to the question prompt action? If the question of interest can be answered AND if the answer prompts action, then question formation is done.

  • Do you have a documented understanding of the data? As part of EDA, have the teams (data scientists, business analysts, etc.) generated documentation? Specifically, a document that provides both an overview of the analysis and a detailed, reproducible technical analysis. If so, then EDA is done (enough for now).

  • Do leadership and business SMEs agree on moving forward? Is there agreement between leadership and the business SMEs that the investment is worth pursuing? It takes time (person-hours) and systems (compute) to develop and test machine learning models. If it's determined that there is a return on the investment of the time and systems, you are good to move forward.

Better

  • Do you have a repeatable process for collecting and delivering data? As we discussed, data collection is hard. Doing it repeatedly is even harder. This is not a requirement for moving from Question Formation and EDA to model development, but it will make later iterations simpler and faster. Effective DataOps and a modern Data Platform (data mesh) support this.

  • Are you monitoring your data pipelines? Bad data leads to bad inferences and predictions. It's important that you monitor the health of your data pipelines, not just for the processes but also for the quality.

  • Do you have a process to track the lineage of your data? Data lineage is crucial for machine learning as it provides a transparent trail of data from its origin to its usage in models. Understanding this lineage ensures data quality, aids in debugging models, and enhances model interpretability, fostering trust and confidence in machine learning outcomes.
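The pipeline-monitoring point above can be sketched as a simple batch health report. The required columns and null-fraction threshold are illustrative assumptions; real monitoring would feed these metrics into an alerting system rather than returning a dict:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, required: list, max_null_frac: float = 0.05) -> dict:
    """Return simple health metrics for one batch of pipeline output."""
    report = {
        "row_count": len(df),
        "missing_columns": [c for c in required if c not in df.columns],
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Flag columns whose share of missing values exceeds the threshold.
    null_frac = df.isna().mean()
    report["columns_over_null_threshold"] = (
        null_frac[null_frac > max_null_frac].index.tolist()
    )
    return report

# A small batch with a missing column, a duplicate row, and a null key.
batch = pd.DataFrame({
    "region":  ["Midwest", "South", None, "South"],
    "revenue": [100.0, 90.0, 80.0, 90.0],
})
report = quality_report(batch, required=["region", "revenue", "date"])
```

Checks like these catch quality drift (not just process failures) before bad data reaches model training or inference.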

Conclusion

Question formation (aka business value) and data analysis are the most important phases when developing machine learning models. They require quality data, a structured approach to analysis, and lots of iteration. When done well, this phase's output will ensure that the models developed will provide business value.
