This is an Amazon Redshift Streaming Ingestion from MSK Serverless project for CDK development with Python.
The `cdk.json` file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the `.venv` directory. To create the virtualenv it assumes that there is a `python3` (or `python` for Windows) executable in your path with access to the `venv` package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on MacOS and Linux:
```
$ python3 -m venv .venv
```
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
```
$ source .venv/bin/activate
```
If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```
Once the virtualenv is activated, you can install the required dependencies.
```
$ pip install -r requirements.txt
```
At this point you can now synthesize the CloudFormation template for this code.
```
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
(.venv) $ cdk synth --all \
              -c vpc_name='your-existing-vpc-name' \
              -c aws_secret_name='your_redshift_secret_name'
```
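The context values passed with `-c` are read inside the CDK app. Below is a minimal sketch of how an `app.py` entry point might pick them up; it is an illustration, not the project's actual code, and the stack name in the comment is an assumption.

```python
# app.py -- illustrative sketch only, not the project's actual entry point.
import os

import aws_cdk as cdk

app = cdk.App()

# Context values supplied on the command line with `-c key=value`
vpc_name = app.node.try_get_context("vpc_name")
aws_secret_name = app.node.try_get_context("aws_secret_name")

# Account/region picked up from the environment variables exported above
env = cdk.Environment(
    account=os.environ["CDK_DEFAULT_ACCOUNT"],
    region=os.environ["CDK_DEFAULT_REGION"],
)

# The project's stacks would be instantiated here with the values above, e.g.:
#   MskServerlessToRedshiftStack(app, "MskServerlessToRedshiftStack", env=env)

app.synth()
```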
ℹ️ Before you deploy this project, you should create an AWS Secret for your Redshift Serverless Admin user. You can create an AWS Secret like this:
```
$ aws secretsmanager create-secret \
    --name "your_redshift_secret_name" \
    --description "(Optional) description of the secret" \
    --secret-string '{"admin_username": "admin", "admin_user_password": "password_of_at_least_8_characters"}'
```
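The same secret can also be created programmatically with boto3; a small sketch (the secret name and credentials are placeholders):

```python
import json

import boto3

# Placeholder values -- substitute your own secret name and credentials.
secretsmanager = boto3.client("secretsmanager")

secretsmanager.create_secret(
    Name="your_redshift_secret_name",
    Description="Redshift Serverless admin credentials",
    SecretString=json.dumps({
        "admin_username": "admin",
        "admin_user_password": "password_of_at_least_8_characters",
    }),
)
```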
Use the `cdk deploy` command to create the stack shown above.
```
(.venv) $ cdk deploy --all \
              -c vpc_name='your-existing-vpc-name' \
              -c aws_secret_name='your_redshift_secret_name'
```
To add additional dependencies, for example other CDK libraries, just add them to your `setup.py` file and rerun the `pip install -r requirements.txt` command.
Delete the CloudFormation stacks by running the command below.
```
(.venv) $ cdk destroy --force --all \
              -c vpc_name='your-existing-vpc-name' \
              -c aws_secret_name='your_redshift_secret_name'
```
After the MSK Serverless cluster is successfully created, you can create a topic and produce messages to it, as in the following example.
- Get cluster information

```
$ export MSK_SERVERLESS_CLUSTER_ARN=$(aws kafka list-clusters-v2 | jq -r '.ClusterInfoList[] | select(.ClusterName == "your-msk-cluster-name") | .ClusterArn')
$ aws kafka describe-cluster-v2 --cluster-arn $MSK_SERVERLESS_CLUSTER_ARN
{
  "ClusterInfo": {
    "ClusterType": "SERVERLESS",
    "ClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/your-msk-cluster-name/813876e5-2023-4882-88c4-58ad8599da5a-s2",
    "ClusterName": "your-msk-cluster-name",
    "CreationTime": "2022-12-16T02:26:31.369000+00:00",
    "CurrentVersion": "K2EUQ1WTGCTBG2",
    "State": "ACTIVE",
    "Tags": {},
    "Serverless": {
      "VpcConfigs": [
        {
          "SubnetIds": [
            "subnet-0113628395a293b98",
            "subnet-090240f6a94a4b5aa",
            "subnet-036e818e577297ddc"
          ],
          "SecurityGroupIds": [
            "sg-0bd8f5ce976b51769",
            "sg-0869c9987c033aaf1"
          ]
        }
      ],
      "ClientAuthentication": {
        "Sasl": {
          "Iam": {
            "Enabled": true
          }
        }
      }
    }
  }
}
```
- Get bootstrap brokers

```
$ aws kafka get-bootstrap-brokers --cluster-arn $MSK_SERVERLESS_CLUSTER_ARN
{
  "BootstrapBrokerStringSaslIam": "boot-deligu0c.c1.kafka-serverless.{region}.amazonaws.com:9098"
}
```
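If you prefer Python, the same lookup can be done with boto3; a small sketch (the cluster name is a placeholder):

```python
import boto3

# Assumes AWS credentials and region are already configured.
kafka = boto3.client("kafka")

# Find the serverless cluster by name (placeholder) and fetch its brokers.
clusters = kafka.list_clusters_v2()["ClusterInfoList"]
cluster_arn = next(
    c["ClusterArn"] for c in clusters if c["ClusterName"] == "your-msk-cluster-name"
)

brokers = kafka.get_bootstrap_brokers(ClusterArn=cluster_arn)
print(brokers["BootstrapBrokerStringSaslIam"])
```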
- Connect to the MSK client EC2 host.

You can connect to an EC2 instance using the EC2 Instance Connect CLI. Install the `ec2instanceconnectcli` Python package and use the `mssh` command with the instance ID as follows.

```
$ sudo pip install ec2instanceconnectcli
$ mssh ec2-user@i-001234a4bf70dec41EXAMPLE
```
- Create an Apache Kafka topic

After connecting to your EC2 host, use the client machine to create a topic on the cluster. Run the following command to create a topic called `ev_stream_data`.

```
[ec2-user@ip-172-31-0-180 ~]$ export PATH=$HOME/opt/kafka/bin:$PATH
[ec2-user@ip-172-31-0-180 ~]$ export BS={BootstrapBrokerString}
[ec2-user@ip-172-31-0-180 ~]$ kafka-topics.sh --bootstrap-server $BS --command-config client.properties --create --topic ev_stream_data --partitions 3 --replication-factor 2
```
`client.properties` is a property file containing configs to be passed to the Admin Client. It is used only with the `--bootstrap-server` option for describing and altering broker configs. For more information, see Getting started using MSK Serverless clusters - Step 3: Create a client machine.

```
[ec2-user@ip-172-31-0-180 ~]$ cat client.properties
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```
- Produce and consume data
(1) To produce messages
Run the following command to generate messages into the topic on the cluster (a sketch of what `gen_fake_data.py` might contain follows these producer/consumer steps).

```
[ec2-user@ip-172-31-0-180 ~]$ python3 gen_fake_data.py | kafka-console-producer.sh --bootstrap-server $BS --producer.config client.properties --topic ev_stream_data
```
(2) To consume messages
Keep the connection to the client machine open, and then open a second, separate connection to that machine in a new window.
```
[ec2-user@ip-172-31-0-180 ~]$ kafka-console-consumer.sh --bootstrap-server $BS --consumer.config client.properties --topic ev_stream_data --from-beginning
```
You start seeing the messages you entered earlier when you used the console producer command. Enter more messages in the producer window, and watch them appear in the consumer window.
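The repository's `gen_fake_data.py` is not reproduced in this README; the following is only an illustrative sketch of such a generator, emitting one JSON record per line with the fields that the materialized view defined later extracts (the value ranges and record count are assumptions):

```python
#!/usr/bin/env python3
# Illustrative sketch only -- not the project's actual gen_fake_data.py.
# Emits newline-delimited JSON records with the fields that the
# ev_station_data_extract materialized view expects.
import json
import random
import sys
import uuid
from datetime import datetime, timedelta

def fake_record() -> dict:
    connection_time = datetime.utcnow() - timedelta(minutes=random.randint(0, 60))
    return {
        "_id": str(uuid.uuid4()),
        "clusterID": f"cluster-{random.randint(1, 5)}",
        "connectionTime": connection_time.strftime("%Y-%m-%d %H:%M:%S"),
        "kWhDelivered": round(random.uniform(0.5, 60.0), 2),
        "stationID": random.randint(1, 100),
        "spaceID": f"space-{random.randint(1, 500)}",
        "timezone": "America/Los_Angeles",
        "userID": f"user-{random.randint(1, 1000)}",
    }

if __name__ == "__main__":
    for _ in range(100):  # number of messages is arbitrary in this sketch
        sys.stdout.write(json.dumps(fake_record()) + "\n")
```

Each line written to stdout is piped into `kafka-console-producer.sh` and becomes one message on the `ev_stream_data` topic.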
The following steps show you how to configure the materialized view to ingest data.
- Connect to the Redshift query editor v2
- Create an external schema to map the data from MSK to a Redshift object.

```
CREATE EXTERNAL SCHEMA evdata
FROM MSK
IAM_ROLE 'arn:aws:iam::{AWS-ACCOUNT-ID}:role/RedshiftStreamingRole'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:{region}:{AWS-ACCOUNT-ID}:cluster/{cluster-name}/b5555d61-dbd3-4dab-b9d4-67921490c072-3';
```
For information about how to configure the IAM role, see Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka.
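As a rough illustration (not this project's actual stack code), such a role could be defined in CDK Python along the lines below; the role name, the exact action list, and the wildcard resource are assumptions that should be checked against the linked documentation and scoped down to your cluster, topic, and consumer-group ARNs.

```python
# Illustrative CDK sketch of an IAM role that lets Redshift read from MSK.
# Names and the exact set of actions are assumptions; verify them against the
# "Getting started with streaming ingestion from Amazon MSK" documentation.
from aws_cdk import aws_iam as iam
from constructs import Construct

def create_redshift_streaming_role(scope: Construct, msk_cluster_arn: str) -> iam.Role:
    role = iam.Role(
        scope, "RedshiftStreamingRole",
        role_name="RedshiftStreamingRole",
        assumed_by=iam.ServicePrincipal("redshift.amazonaws.com"),
    )
    role.add_to_policy(iam.PolicyStatement(
        actions=[
            "kafka-cluster:Connect",
            "kafka-cluster:ReadData",
            "kafka-cluster:DescribeTopic",
            "kafka-cluster:DescribeGroup",
            "kafka-cluster:AlterGroup",
        ],
        resources=["*"],  # scope down to cluster/topic/group ARNs in practice
    ))
    role.add_to_policy(iam.PolicyStatement(
        actions=["kafka:GetBootstrapBrokers"],
        resources=[msk_cluster_arn],
    ))
    return role
```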
- Create a materialized view to consume the stream data.

Note that MSK cluster names are case-sensitive and can contain both uppercase and lowercase letters. To use case-sensitive identifiers, you can set the configuration setting `enable_case_sensitive_identifier` to true at either the session or cluster level.

```
-- To create and use case sensitive identifiers
SET enable_case_sensitive_identifier TO true;

-- To check if enable_case_sensitive_identifier is turned on
SHOW enable_case_sensitive_identifier;
```
The following example defines a materialized view with JSON source data. Create the materialized view so it is distributed on the UUID value from the stream and is sorted by the `refresh_time` value. The `refresh_time` is the start time of the materialized view refresh that loaded the record. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream.

```
CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(6) SORTKEY(1) AUTO REFRESH YES AS
  SELECT refresh_time,
    kafka_timestamp_type,
    kafka_timestamp,
    kafka_key,
    kafka_partition,
    kafka_offset,
    kafka_headers,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), '_id', true)::character(36) AS ID,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'clusterID', true)::varchar(30) AS clusterID,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'connectionTime', true)::varchar(20) AS connectionTime,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'kWhDelivered', true)::DECIMAL(10,2) AS kWhDelivered,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'stationID', true)::INTEGER AS stationID,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'spaceID', true)::varchar(100) AS spaceID,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'timezone', true)::varchar(30) AS timezone,
    json_extract_path_text(from_varbyte(kafka_value, 'utf-8'), 'userID', true)::varchar(30) AS userID
  FROM evdata.ev_stream_data
  WHERE LENGTH(kafka_value) < 65355
    AND CAN_JSON_PARSE(kafka_value);
```
The code above filters out records larger than 65355 bytes because `json_extract_path_text` is limited to the varchar data type. The materialized view should be defined so that there aren't any type conversion errors.

- Refreshing materialized views for streaming ingestion
The materialized view is auto-refreshed as long as there is new data on the MSK stream. You can also disable auto-refresh and run a manual refresh or schedule a manual refresh using the Redshift Console UI.
To update the data in a materialized view, you can use the `REFRESH MATERIALIZED VIEW` statement at any time.

```
REFRESH MATERIALIZED VIEW ev_station_data_extract;
```
- Query data in the materialized view.
```
SELECT * FROM ev_station_data_extract;
```
- Query the refreshed materialized view to get usage statistics.
```
SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') AS connectiontime,
       SUM(kWhDelivered) AS Energy_Consumed,
       COUNT(DISTINCT userID) AS #Users
  FROM ev_station_data_extract
 GROUP BY to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
 ORDER BY 1 DESC;
```
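These queries can also be run programmatically with the Redshift Data API; a minimal boto3 sketch (the workgroup and database names are placeholders you would replace with your own):

```python
import time

import boto3

# Placeholders -- substitute your Redshift Serverless workgroup and database.
WORKGROUP_NAME = "your-redshift-serverless-workgroup"
DATABASE_NAME = "dev"

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName=WORKGROUP_NAME,
    Database=DATABASE_NAME,
    Sql="SELECT COUNT(*) FROM ev_station_data_extract;",
)
statement_id = resp["Id"]

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement_id)
    print(result["Records"])
else:
    print(f"Query did not finish: {desc['Status']}")
```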
- `cdk ls` list all stacks in the app
- `cdk synth` emits the synthesized CloudFormation template
- `cdk deploy` deploy this stack to your default AWS account/region
- `cdk diff` compare deployed stack with current state
- `cdk docs` open CDK documentation
Enjoy!
- Getting started using MSK Serverless clusters
- Configuration for MSK Serverless clusters
- Actions, resources, and condition keys for Apache Kafka APIs for Amazon MSK clusters
- Amazon Redshift - Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka
- Amazon Redshift - Electric vehicle station-data streaming ingestion tutorial, using Kinesis
- Amazon Redshift Configuration Reference - enable_case_sensitive_identifier
- Real-time analytics with Amazon Redshift streaming ingestion (2022-04-27)
- Connect using the EC2 Instance Connect CLI
```
$ sudo pip install ec2instanceconnectcli
$ mssh ec2-user@i-001234a4bf70dec41EXAMPLE # ec2-instance-id
```
- ec2instanceconnectcli: This Python CLI package handles publishing keys through EC2 Instance Connect and using them to connect to EC2 instances.
- Mimesis: Fake Data Generator - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages