Let’s start setting up some backend infrastructure for our analytics pipeline. First, we need to ingest data into the pipeline. To do this, we can configure a Kinesis Data Stream and a Kinesis Data Firehose delivery stream. Kinesis Data Streams lets you ingest streaming data with your own custom producer and consumer applications and fan the stream out to multiple consumers. Kinesis Data Firehose delivers streaming data to one of four built-in destinations: Amazon S3, Splunk, Amazon Redshift, and Amazon Elasticsearch Service. For this workshop, we will use an S3 data lake as the destination.
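As a sketch of what a custom producer might look like, here is a minimal Python example using boto3. This is illustrative and not part of the workshop: the event fields are made up, and the stream name matches the one created below.

```python
import json


def build_record(event: dict, partition_key: str) -> dict:
    """Serialize an event into the shape expected by the
    Kinesis Data Streams PutRecord API."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }


def put_event(event: dict, stream_name: str = "Peculiar-KDS") -> None:
    """Send one event to the stream (requires AWS credentials)."""
    # boto3 is imported lazily so build_record stays usable without it.
    import boto3

    kinesis = boto3.client("kinesis")
    record = build_record(event, partition_key=event["player_id"])
    kinesis.put_record(StreamName=stream_name, **record)
```

Records with the same partition key land on the same shard, so a high-cardinality key (such as a player ID) spreads load evenly.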
In the AWS Management Console, select the
Amazon Kinesis service.
Choose the
Data streams box and select
Create data stream.
Enter a name for the stream. For example, the name Peculiar-KDS is used for this workshop.
Under data stream capacity, enter 1 for the
Number of open shards. This will create a Kinesis Data Stream with a write capacity of 1 MiB/second and 1,000 records/second, and a read capacity of 2 MiB/second.
In a production environment, you’ll want to estimate and monitor your data throughput to know how many shards you need for your stream. Each shard ingests up to 1 MiB/second and 1000 records/second and emits up to 2 MiB/second, so you can scale the shards up and down depending on your throughput.
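The per-shard limits above can be turned into a quick back-of-the-envelope estimate. This Python sketch (illustrative, not part of the workshop) picks the smallest shard count that satisfies the write-throughput, record-count, and read-throughput limits at once:

```python
import math


def shards_needed(write_mib_per_s: float,
                  records_per_s: float,
                  read_mib_per_s: float) -> int:
    """Estimate the shard count for a Kinesis Data Stream.

    Per-shard limits: 1 MiB/s and 1,000 records/s on writes,
    2 MiB/s on reads.
    """
    by_write = write_mib_per_s / 1.0
    by_records = records_per_s / 1000.0
    by_read = read_mib_per_s / 2.0
    # The binding constraint decides; always provision at least 1 shard.
    return max(1, math.ceil(max(by_write, by_records, by_read)))
```

For example, a workload writing 5 MiB/s at 3,000 records/s and reading 4 MiB/s is bound by write throughput and needs 5 shards.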
Hit
Create data stream.
Navigate to
Amazon Kinesis Data Firehose and select
Process with delivery stream. You will create a Kinesis Data Firehose stream to consume data from Kinesis Data Streams and deliver that data to Amazon S3.
Enter a delivery stream name. For example, this lab will use Peculiar-KDF for the name.
Keep the source as default, which should be
Kinesis Data Stream. The stream should automatically be set to the Kinesis Data Stream you just created, Peculiar-KDS.
In a production environment, you can optionally enable server-side encryption for source records in the delivery stream, which is a security best practice.
Here, you can optionally choose to transform source records with an AWS Lambda function. This is useful for doing basic ETL and transforming data before storing it in your data lake. The Game Analytics Pipeline solution does this to transform streaming data from JSON to Parquet before storing it in S3. Parquet is an optimal format for analytics because it is columnar, which can reduce storage cost and improve query performance. We will configure a Lambda function to do basic ETL in the next section of this lab.
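As a sketch of what such a transformation Lambda looks like, the function below follows the Firehose data-transformation contract: records arrive base64-encoded, and each response record must carry recordId, result, and data. The added processed field is purely illustrative, and the JSON-to-Parquet conversion itself is omitted here; this sketch only shows the event and response shape.

```python
import base64
import json


def lambda_handler(event, context):
    """Minimal Kinesis Data Firehose transformation handler:
    decode each record, apply a trivial transformation, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # illustrative transformation only
        output.append({
            "recordId": record["recordId"],   # must echo the input id
            "result": "Ok",                   # or Dropped / ProcessingFailed
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```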
Leave these configurations as default for now, and continue to the next step.
Under
Choose a destination, select
Amazon S3 and choose
Create new bucket.
Give your bucket a
name and hit
Create bucket. This lab uses peculiar-wizards-data-lake. Note that S3 bucket names must be globally unique.
Ignore the other configurations for now, and continue to the next step.
Kinesis Data Firehose buffers incoming records before delivering them to your S3 bucket. Set
Buffer size to 1 MB and
Buffer interval to 60 seconds.
For this workshop, the buffer size and buffer interval are set to the minimum values to speed up data delivery to Amazon S3 during testing, but this results in less optimized batching. In a production environment, you will want to increase the buffer interval (up to the maximum of 15 minutes) to batch records more efficiently and avoid many small objects in S3.
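The two buffer settings interact: Firehose delivers a batch when either the buffer size or the buffer interval is reached, whichever comes first. A small illustrative calculation (not part of the workshop) makes the trade-off concrete:

```python
def seconds_until_flush(incoming_mb_per_s: float,
                        buffer_size_mb: float = 1.0,
                        buffer_interval_s: float = 60.0) -> float:
    """Time until Firehose flushes a batch to S3: whichever of the
    size limit or the interval limit is hit first."""
    if incoming_mb_per_s <= 0:
        # Nothing arriving faster than the interval; the timer wins.
        return buffer_interval_s
    time_to_fill = buffer_size_mb / incoming_mb_per_s
    return min(time_to_fill, buffer_interval_s)
```

At 0.1 MB/s the 1 MB buffer fills in about 10 seconds, so size triggers the flush; at a trickle of 0.001 MB/s, the 60-second interval triggers it instead.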
Leave compression and encryption
disabled for now, and leave error logging at its default setting.
Scroll down to
Permissions. Here you specify the AWS Identity and Access Management (IAM) role that gives Kinesis Data Firehose the permissions it needs to access your S3 bucket and any other resources it may need. Click
Create or update IAM role.
Review your configurations and hit
Create delivery stream.
You’ve successfully created your Kinesis Data Stream, Kinesis Data Firehose delivery stream, and S3 bucket for ingesting and storing data!