Creating a Data Warehouse from S3 to Redshift in AWS

Hitoshi Kumagai
5 min readJul 18, 2021

--

With the development of cloud technology, the environment for handling big data is becoming better and better, and while handling TB-class ‘real’ big data requires investment, small data can easily use the environment for handling big data.Cloud environments are offered by Google, Microsoft, IBM, and Amazon, but which one is best? Let me introduce you to AWS, as I acquired the skills to use AWS by taking a data engineering course at Udacity.

What you can learn from this article

  1. How to build a database on AWS
  2. How to build a database using the CLI
  3. Copying data from S3 to Redshift by CLI

1.How to build a database on AWS

The major steps in this process are:

① You create a master key in AWS KMS.

②You load the data into S3.

③S3 retrieves the key from KMS and encrypts the data.

④You run the COPY command in Amazon Redshift to load the data from S3.

In the cloud, cryptographic keys are used for security. Therefore, the keys need to be issued by KMS. The next step is to load the data into s3. Copy the data from s3 to Redshift based on the key you just issued.
The CLI (Command Line Interface) is useful because it is very complicated to do these tasks manually.

2. How to build a database using the CLI

Let’s create a robot using the Python library, boto3.

# import library
import boto3
import json
import configparser
from botocore.exceptions import ClientError
import pandas as pd

Let’s load the configuration to get the encryption key.

# set config
config = configparser.ConfigParser()
config.read_file(open('dwh_full.cfg'))
KEY = config.get('AWS','KEY')
SECRET = config.get('AWS','SECRET')
DWH_CLUSTER_TYPE = config.get("DWH","DWH_CLUSTER_TYPE")
DWH_NUM_NODES = config.get("DWH","DWH_NUM_NODES")
DWH_NODE_TYPE = config.get("DWH","DWH_NODE_TYPE")
DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB = config.get("DWH","DWH_DB")
DWH_DB_USER = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT = config.get("DWH","DWH_PORT")
DWH_IAM_ROLE_NAME = config.get("DWH", "DWH_IAM_ROLE_NAME")(DWH_DB_USER, DWH_DB_PASSWORD, DWH_DB)pd.DataFrame({"Param":
["DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER", "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME"],
"Value":
[DWH_CLUSTER_TYPE, DWH_NUM_NODES, DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB, DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME]
})

The contents of dwh_full.cfg are as follows

[AWS]
KEY=
SECRET=
[DWH]
DWH_CLUSTER_TYPE=multi-node
DWH_NUM_NODES=4
DWH_NODE_TYPE=dc2.large
DWH_IAM_ROLE_NAME=dwhRole
DWH_CLUSTER_IDENTIFIER=dwhCluster
DWH_DB=dwh
DWH_DB_USER=dwhuser
DWH_DB_PASSWORD=Passw0rd
DWH_PORT=5439

Create a bot to do the work.

# Create bot
ec2 = boto3.resource('ec2',
region_name="us-west-2",
aws_access_key_id=KEY,
aws_secret_access_key=SECRET
)
s3 = boto3.resource('s3',
region_name="us-west-2",
aws_access_key_id=KEY,
aws_secret_access_key=SECRET
)
iam = boto3.client('iam',aws_access_key_id=KEY,
aws_secret_access_key=SECRET,
region_name='us-west-2'
)
redshift = boto3.client('redshift',
region_name="us-west-2",
aws_access_key_id=KEY,
aws_secret_access_key=SECRET
)

This is all you need to access S3. Let’s take a look at the actual data.

# show resource
config = configparser.ConfigParser()
s3 = boto3.resource('s3',
region_name="us-west-2",
aws_access_key_id=KEY,
aws_secret_access_key=SECRET
)
# create object then read data and show in pandas dataframe
content_object = s3.Object('udacity-dend', 'log-data/2018/11/2018-11-30-events.json')
file_content = content_object.get()['Body'].read().decode('utf-8')
# convert to pandas dataframe
events_data = pd.read_json(file_content, lines=True)
song_data.head()

3.Copying data from S3 to Redshift by CLI

Now, let’s do ④ in the figure on the left.This process involves creating a table in the beginning and then copying the data from S3.Hmmm. Is it difficult? Essentially, I end up with the following two codes. Just two! It’s so easy, isn’t it?

Now, let’s define a table on Redshift to copy the data from s3.

# create table to copy
staging_songs_table_create = ("""
CREATE TABLE IF NOT EXISTS staging_songs (
num_songs INTEGER
,artist_id VARCHAR SORTKEY DISTKEY
,artist_latitude FLOAT
,artist_longtude FLOAT
,artist_location VARCHAR
,artist_name VARCHAR
,duration FLOAT
,song_id VARCHAR
,title VARCHAR
,year INTEGER
)

Okay, time to copy from S3 to Redshift!

staging_songs_copy = (“””
copy staging_songs from ‘{}’
credentials ‘aws_iam_role={}’
region ‘us-west-2’
json ‘auto’
“””).format(SONG_DATA,ARN)

Now we just need to define the schema of the data warehouse. For a star schema, you just need to define the fact table and the dimension table.When the above process is pipelined, the following results are obtained and the analysis can be started successfully.

root@ab169576ac59:/home/workspace# python create_tables.py
Start Drop Tables
Start Create Tables
Execute End and Dissconnected
root@ab169576ac59:/home/workspace# python etl.py
Connected to DB
Start Copy to Staging Table
Start Copy Table No1:Run Query
Start Copy Table No2:Run Query
End Copy Process
Start Insert to Table
Start Insert to Table No1:Run Query
Start Insert to Table No2:Run Query
Start Insert to Table No3:Run Query
Start Insert to Table No4:Run Query
Start Insert to Table No5:Run Query
Execute End and Dissconnected
root@ab169576ac59:/home/workspace#

Conclusion

  1. How to build a database on AWS
  • ① :You create a master key in AWS KMS.
  • ②:You load the data into S3.
  • ③:S3 retrieves the key from KMS and encrypts the data.
  • ④:You run the COPY command in Amazon Redshift to load the data from S3.

2. How to build a database using the CLI

  • By using the Python library bot3, you can use AWS via CLI.
  • In the CLI, I created bots for S3, Redshift, and IAM_role.

3. Copying data from S3 to Redshift by CLI

  • Define a table on Redshift to copy the data to S3.
  • Copy the data from S3 to the table on Redshift.

Next

This time, I only explained the flow of the process, but please refer to the following for details.

https://github.com/hitoshikumagai/Udacity/tree/main/data-engineering/project/data%20warehouse

--

--

Hitoshi Kumagai
Hitoshi Kumagai

No responses yet