Transformative Data Pipelines for Analytics Using AWS Glue
Practical Considerations
There are many alternatives and approaches to building data pipelines for analytical use. We want to share a recent experience implementing data pipelines and the practical considerations we encountered along the way.
Our implementation was based on AWS's available service offerings, including Glue, DataBrew, and SageMaker.
Outcome Objectives
Create data sets that would originate from multiple data sources (Excel files, APIs from 3rd party applications, data stores such as PostgreSQL)
Capture the raw incoming data and data source for lineage tracking
Assert certain levels of quality checks, validations, and verifications
Transform the data by dropping, augmenting, and enhancing certain data aspects
Make the data sets readily available to data scientists as well as other parts of the business for analytical purposes
AWS Service Evaluations
AWS Glue DataBrew
DataBrew offered an approach that required less technical expertise to achieve a working ETL data pipeline. Unfortunately, at the time of our implementation, the service was not particularly stable; we experienced frequent and frustrating service failures and delays. Building complex solutions was also more difficult with DataBrew than with the other options. Finally, we concluded that the DataBrew setup still requires a fairly proficient level of technical skill to build something effective (knowledge of AWS services such as S3, Glue, and IAM). In other words, it is not a fit for a non-technical person.
SageMaker
We also looked at SageMaker, a rich service targeted at building data products. SageMaker exhibited stability issues of its own that made it difficult to configure and use, but we ultimately did not choose it because it was significantly more expensive than the service we ended up going with: the SageMaker solution would have run thousands of dollars per month, compared to less than two hundred dollars per month with our selected solution.
AWS Glue
Ultimately, we went with the AWS Glue service. It is a serverless solution that allowed us to use Jupyter notebooks and Python to craft the data consumption, transformation, and final data sets in the most efficient and straightforward way.
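To give a sense of what that looks like in practice, here is a minimal sketch of how a Glue notebook session typically starts and reads a cataloged table into a Spark DataFrame. This is an illustrative sketch, not our production code; the database and table names are placeholders.

```python
# Minimal Glue notebook sketch; database/table names are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read a table registered in the Glue Data Catalog into a Spark DataFrame
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",       # placeholder database name
    table_name="orders"      # placeholder table name
).toDF()

orders.printSchema()
```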
Our Approach
Infrastructure as Code
We built our solution to work across multiple environments: development, testing, and production. It was important to ensure, as much as possible, that all the environments were alike, with the exception of capacity. Therefore, we needed a reliable and repeatable way to build the environments.
We chose Terraform for the Infrastructure as Code (IaC) because of its wide adoption. The infrastructure that we applied included:
Storage: S3
Security: IAM - Service Accounts, Roles, Policies
Data Lake: Databases and Tables
The data lake resources utilized Iceberg tables because Iceberg provides the following features we needed (a short usage sketch follows the list):
Schema evolution
Time travel
Partitioning and bucketing
Compression
Data validation
Data lineage
Integration with Glue
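To illustrate a couple of those features, here is a rough sketch of creating and querying an Iceberg table with Spark SQL from a Glue notebook. It assumes the session has Iceberg support enabled and a Spark catalog named glue_catalog configured against the Glue Data Catalog; the database, table, and column names are placeholders, not our actual schema.

```python
# Assumes Iceberg support is enabled for the session and a Spark catalog named
# "glue_catalog" points at the Glue Data Catalog. All names are placeholders.

# Create an Iceberg table with hidden (transform-based) partitioning
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.orders (
        order_id STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE glue_catalog.analytics.orders ADD COLUMN channel STRING")

# Time travel: query the table as it existed at an earlier point in time
spark.sql("""
    SELECT * FROM glue_catalog.analytics.orders
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```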
IaC Challenges
By far, the biggest challenge was getting the right combination of service accounts, roles, and policies to build a least-privilege security posture. The data lakes that we built utilized Iceberg for the features stated above. However, AWS's current implementations of Glue and Iceberg required a hybrid approach to roles and policies: we had to do a significant amount of experimenting to work out which privileges belonged in IAM and which also needed to be granted in Lake Formation.
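To make that hybrid split concrete: the job role needed IAM permissions for the underlying S3 data and the Glue APIs, plus Lake Formation grants on the catalog tables themselves. For illustration only, a table-level Lake Formation grant expressed with boto3 looks roughly like this (the role ARN, database, and table names are placeholders, and this is not our exact setup):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant the Glue job role table-level permissions in Lake Formation.
# Role ARN, database, and table names below are placeholders.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/glue-etl-role"
    },
    Resource={
        "Table": {"DatabaseName": "analytics", "Name": "orders"}
    },
    Permissions=["SELECT", "INSERT", "DESCRIBE"],
)
```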
Common Code Library
As the Jupyter notebooks were being built out, it became apparent that some code, such as the normalization of column headers, would be used repeatedly across notebooks. So we created a shared Python library of common code and utilities that we used across the notebooks. This not only reduced code duplication but also kept the notebooks more understandable and focused on data set transformation.
This common code was also versioned. The versioning allowed us to enhance the common code capabilities without disrupting notebooks that were already in service.
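As an example of the kind of utility that ended up in the shared library, a column-header normalization helper might look roughly like this. It is a sketch rather than our exact implementation; the precise rules we applied differed.

```python
import re

def normalize_column_headers(columns):
    """Return trimmed, lower-cased, snake_cased column names.

    Illustrative sketch of a shared utility; not the exact rules we used.
    """
    normalized = []
    for name in columns:
        name = name.strip().lower()
        name = re.sub(r"[^0-9a-z]+", "_", name)  # collapse non-alphanumerics to "_"
        normalized.append(name.strip("_"))
    return normalized

# Usage inside a notebook with a Spark DataFrame:
# df = df.toDF(*normalize_column_headers(df.columns))
```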
Notebook Configuration
AWS Glue offers three alternatives for building ETL jobs:
Visual ETL
Notebook
Script Editor
If our ETL had been fairly straightforward, the Visual ETL option would have been fine to work with. But we often needed more control over the ETL, and while we might have been able to work through that with Visual ETL, the level of effort it required was more than we wanted to invest.
The Notebook option gave us the right level of flexibility to build the data sets we needed. Being able to build and test in steps proved to be the most effective approach for us and helped with debugging as well.
We built multiple notebooks, one for each stage of the transformation (a brief intake sketch follows the list):
Raw extract intake (consumption of the incoming source data)
Data preparation for a right-shaped data set fit for the business’s analytical needs
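A simplified sketch of the raw-intake step, showing how lineage columns can be captured at load time before appending to a raw Iceberg table. The bucket, path, and table names are placeholders, and the raw table is assumed to already exist in the Glue Data Catalog.

```python
from pyspark.sql import functions as F

# Placeholders: bucket, path, and table names are illustrative only.
raw = (spark.read
       .option("header", "true")
       .csv("s3://example-raw-bucket/vendor_x/orders/"))

raw = (raw
       .withColumn("_source_file", F.input_file_name())     # lineage: originating object
       .withColumn("_ingested_at", F.current_timestamp()))  # lineage: load time

# Append into the raw Iceberg table registered in the Glue Data Catalog
raw.writeTo("glue_catalog.raw.vendor_x_orders").append()
```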
Notebook Challenges
AWS Glue notebooks are supposed to support versioning via source code management systems. CodeCommit had just been deprecated as we were starting our project, so we were not able to use it as our source code management tool. Our alternative was Bitbucket, which we already used for our other source code management needs. We were able to get AWS Glue to connect to the Bitbucket service, but AWS Glue would not allow us to push or pull the notebooks. We did extensive research and experimentation with roles and policies, but we were never able to make this feature work.
Workflows
AWS Glue also provides workflows (orchestration), so we chained the notebooks together using this approach. A single trigger (cron, event, or on-demand) runs the data pipeline jobs, calling the raw extract notebook first and, if that succeeds, following up with the prepare notebook to complete the ETL.
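For illustration, the same chaining can be expressed with boto3. This is a sketch rather than how we necessarily configured ours, and the workflow, trigger, and job names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names throughout.
glue.create_workflow(Name="orders-pipeline")

# Scheduled trigger starts the raw-extract job.
glue.create_trigger(
    Name="orders-pipeline-start",
    WorkflowName="orders-pipeline",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "orders-raw-extract"}],
    StartOnCreation=True,
)

# Conditional trigger runs the prepare job only if the raw extract succeeded.
glue.create_trigger(
    Name="orders-pipeline-prepare",
    WorkflowName="orders-pipeline",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "ANY",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "orders-raw-extract",
            "State": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "orders-prepare"}],
    StartOnCreation=True,
)
```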
Final Thoughts
Building data pipelines that create analytically viable data sets using AWS Glue services is achievable. Getting the security posture correct while keeping to a least-privilege best practice will require work and experimentation. Using notebooks for your ETL jobs will give you a good deal of flexibility but will require talent familiar with Python, Pandas, and Spark.
If you are on a similar journey, we would love to hear what your practical experience has been. If you need help with your journey, please reach out to us and we will be more than happy to speak to you about how we might be able to help accelerate and accomplish your goals.