Thursday, December 1, 2022
HomeBig DataUse SQL queries to outline Amazon Redshift datasets in AWS Glue DataBrew

Use SQL queries to outline Amazon Redshift datasets in AWS Glue DataBrew

Within the publish Knowledge preparation utilizing Amazon Redshift with AWS Glue DataBrew, we noticed create an AWS Glue DataBrew job utilizing a JDBC connection for Amazon Redshift. On this publish, we present you create a DataBrew profile job and a recipe job utilizing an Amazon Redshift reference to customized SQL.

DataBrew is a visible information preparation device that may assist you simplify your extract, rework, and cargo (ETL) course of. Now you can outline a dataset from Amazon Redshift by making use of customized SQL statements. Making use of a customized SQL assertion to a big supply desk lets you choose, be a part of, and filter the info earlier than cleansing, normalizing, and remodeling it in a DataBrew mission. Filtering and becoming a member of the info out of your information supply and solely bringing within the information you wish to rework simplifies the ETL course of.

On this publish, we exhibit use customized SQL queries to outline your Amazon Redshift datasets in DataBrew.

Answer overview

To implement this resolution, you full the next high-level steps:

  1. Create an Amazon Redshift connection.
  2. Create your dataset and use SQL queries to outline your Amazon Redshift supply datasets.
  3. Create a DataBrew profile job to profile the supply information.
  4. Create a DataBrew mission and recipe job to remodel the info and cargo it to Amazon Easy Storage Service (Amazon S3).

The next diagram illustrates the structure for our resolution.


To make use of this resolution, full the next prerequisite steps:

  1. Have an AWS account.
  2. Create an Amazon Redshift cluster in a personal subnet inside a VPC as a safety greatest observe.
  3. As a result of DataBrew instructions require that the cluster has entry to Amazon S3, be sure you create a gateway VPC endpoint to Amazon S3. The gateway endpoint offers dependable connectivity to Amazon S3 with out requiring an web gateway or NAT machine out of your VPC.
  4. Allow the enhanced VPC routing within the Amazon Redshift cluster. Enhanced VPC routing forces all Amazon Redshift instructions to make use of the connectivity to the gateway VPC endpoint to Amazon S3 in the identical AWS Area as your cluster.
  5. Create a database and tables, and cargo the pattern information within the Amazon Redshift cluster.
  6. Put together a SQL question to extract the supply dataset. You utilize this SQL question later on this publish to create an Amazon Redshift supply dataset in DataBrew.
  7. Create an S3 bucket to retailer information from the profile and recipe jobs. The DataBrew connection quickly shops intermediate information in Amazon S3.
  8. For our use case, we use a mock dataset. You may obtain the DDL and information information from GitHub.

Safety greatest practices

Take into account the next greatest practices with a view to mitigate safety threats:

  • Evaluation the shared duty mannequin when utilizing DataBrew.
  • Limit community entry for inbound and outbound site visitors to least privilege. Make the most of the routing site visitors inside the VPC through the use of an Amazon S3 gateway endpoint and enhanced VPC routing in Amazon Redshift.
  • Allow the lifecycle coverage in Amazon S3 to retain solely mandatory information, and delete pointless information.
  • Allow Amazon S3 versioning and cross-Area replication for vital datasets to guard towards unintentional deletes.
  • Allow server-side encryption utilizing AWS KMS (SSE-KMS) or Amazon S3 (SSE-S3).
  • DataBrew makes use of Amazon CloudWatch for logging, so it is best to replace your log retention interval to retain logs for the suitable size of time.

Create an Amazon Redshift connection

On this part, you create a connection in DataBrew to connect with your Amazon Redshift cluster.

  1. On the DataBrew console, select Datasets within the navigation pane.
  2. On the Connections tab, select Create connection.
  3. For Connection identify, enter a reputation, equivalent to order-db-connection.
  4. For Connection sort, choose Amazon Redshift.
  5. Underneath Connection entry, present the Amazon Redshift cluster identify, database identify, database person, and database password.
  6. Select Create connection.

Create your dataset by making use of a customized SQL assertion to filter the supply information

On this part, you create a Amazon Redshift connection, add your customized SQL assertion, and validate it. It’s also possible to validate your SQL assertion straight in your Amazon Redshift cluster through the use of the Amazon Redshift question editor v2. The aim of validating the SQL assertion is that can assist you keep away from failure in loading your dataset right into a mission or job. Additionally, checking the question runtime ensures that it runs in beneath 3 minutes, avoiding timeouts throughout mission loading. To investigate and enhance question efficiency in Amazon Redshift, see Tuning question efficiency.

  1. On the DataBrew console, select Datasets within the navigation pane.
  2. On the Datasets tab, select Join new dataset.
  3. For Dataset identify, enter a reputation, equivalent to order-data.
  4. Within the left pane, select Amazon Redshift beneath Database connections.
  5. Add your Amazon Redshift connection and choose Enter customized SQL.
  6. Enter the SQL question and select Validate SQL.
  7. Underneath Extra configurations, for Enter S3 vacation spot, present an S3 vacation spot to quickly retailer the intermediate outcomes.
  8. Select Create dataset.

Create a DataBrew profile job

On this part, you employ the newly created Amazon Redshift dataset to create a profile job. Knowledge profiling helps you perceive your dataset and plan the info preparation steps wanted in working your recipe jobs.

  1. On the DataBrew console, select Jobs within the navigation pane.
  2. On the Profile jobs tab, select Create job.
  3. For Job identify, enter a reputation, equivalent to order-data-profile-job.
  4. For Job sort¸ choose Create a profile job.
  5. Underneath Job enter, select Browse datasets and select the dataset you created earlier (order-data).
  6. For Knowledge pattern, choose Full dataset.
  7. Underneath Job output settings¸ for S3 location, enter the S3 bucket for the job output information.
  8. For Function identify, select an AWS Id and Entry Administration (IAM) position with permission for DataBrew to connect with the info in your behalf. For extra data, confer with Including an IAM position with information useful resource permissions.
  9. Select Create and run job.

Verify the standing of your profile job. A profile output file is created and saved in Amazon S3 upon completion. You may select View information profile to see extra data.

Along with an output file, DataBrew additionally offers visualizations. On the Dataset profile overview tab, you may see information visualizations that may assist you perceive your information higher. Subsequent, you may see detailed statistics about your information on the Column statistics tab, illustrated with graphics and charts. You may outline information high quality guidelines on the Knowledge high quality guidelines tab, after which see the outcomes from the info high quality ruleset that applies to this dataset.

For instance, within the following screenshot, the quantity column has 2% lacking values, as proven on the Column statistics tab. You may present guidelines that keep away from triggering a recipe job in case of an anomaly. It’s also possible to notify the supply groups to deal with or acknowledge the lacking values. DataBrew customers can even add steps within the recipe job to deal with the anomalies and lacking values.

Create a DataBrew mission and recipe job

On this part, you begin analyzing and remodeling your Amazon Redshift dataset in a DataBrew mission. The customized SQL assertion runs in Amazon Redshift when the mission is loaded. Databrew performs read-only entry to your supply information.

Create a mission

To create your mission, full the next steps:

  1. On the DataBrew console, select Initiatives within the navigation pane.
  2. Select Create mission.
  3. For Venture identify, enter a reputation, equivalent to order-data-proj.
  4. Underneath Recipe particulars¸ select Create new recipe and enter a recipe identify, equivalent to order-data-proj-recipe.
  5. For Choose a dataset, choose My datasets.
  6. Choose the dataset you created earlier (order-data).
  7. Underneath Permissions, for Function identify, select your DataBrew position.
  8. Select Create mission.

DataBrew begins a session, constructs a DataFrame, extracts pattern information, infers primary statistics, and shows the pattern information in a grid view. You may add steps to construct a metamorphosis recipe. As of this writing, DataBrew gives over 350 transformations, with extra on the way in which.

For our instance use case, Firm ABC has set a goal to ship all orders inside 7 days after the order date (inner SLA). They need an inventory of orders that didn’t meet the 7-day SLA for extra investigation. The next pattern recipe accommodates steps to deal with the lacking values, filter the values by quantity, change the date format, calculate the date distinction, and filter the values by transport days. The detailed steps are as follows:

  1. Fill lacking values with 0 for the quantity column.
  2. Filter values by quantity larger than 0.
  3. Change the format of order_timestamp to align with ship_date.
  4. Create a brand new column referred to as days_for_shipping utilizing the dateTime perform DATEDIFF to indicate the distinction between order_timestamp and ship_date in days.
  5. Filter the values by days_for_shipping larger than 7.

Create a recipe job

To create your DataBrew recipe job, full the next steps:

  1. On the DataBrew console, select Jobs within the navigation pane.
  2. Select Create job.
  3. For Job identify¸ enter a reputation, equivalent to SHIPPING-SLA-MISS.
  4. Underneath Job output settings, configure your Amazon S3 output settings.
  5. For S3 location, enter the placement of your output bucket.
  6. For Function identify, select the IAM position that accommodates permissions for DataBrew to attach in your behalf.
  7. Select Create and run job.

You may examine the standing of your job on the Jobs web page.

The output file is in Amazon S3 as specified, and your information transformation is now full.

Clear up

To keep away from incurring future costs, we advocate deleting the sources you created throughout this walkthrough.


On this publish, we walked via making use of customized SQL statements to an Amazon Redshift information supply in your dataset, which you need to use in profiling and transformation jobs. Now you can concentrate on constructing your information transformation steps figuring out that you simply’re engaged on solely the wanted information.

To study extra in regards to the numerous supported information sources for DataBrew, see Connecting to information with AWS Glue DataBrew.

Concerning the authors

Suraj Shivananda is a Options Architect at AWS. He has over a decade of expertise in Software program Engineering, Knowledge and Analytics, DevOps particularly for information options, automating and optimizing cloud based mostly options. He’s a trusted technical advisor and helps clients construct Properly Architected options on the AWS platform.

Marie Yap is a Principal Options Architect for Amazon Net Providers based mostly in Hawaii. On this position, she helps numerous organizations start their journey to the cloud. She additionally focuses on analytics and fashionable information architectures.

Dhiraj Thakur is a Options Architect with Amazon Net Providers. He works with AWS clients and companions to offer steerage on enterprise cloud adoption, migration, and technique. He’s obsessed with know-how and enjoys constructing and experimenting within the analytics and AI/ML house.



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments