Ingesting data with Kinesis

Let’s start setting up some backend infrastructure for our analytics pipeline. First, we need to ingest data into the pipeline. To do this, we will configure a Kinesis Data Stream and a Kinesis Data Firehose delivery stream. Kinesis Data Streams lets you ingest streaming data with your own custom producer and consumer applications and fan the stream out to multiple consumers. Kinesis Data Firehose delivers streaming data to one of four built-in destinations – Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. For this workshop, we will use an S3 data lake as the destination.
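If you prefer to work with the stream programmatically, the snippet below is a minimal sketch of a custom producer and consumer written with boto3 (the AWS SDK for Python). It assumes the stream name Peculiar-KDS that you will create in the steps below, and the event payload is purely hypothetical.

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "Peculiar-KDS"  # assumed name; created in the steps below

    # --- Producer: write one game-telemetry event to the stream ---
    event = {"event_type": "level_completed", "player_id": "1234", "level": 7}  # hypothetical payload
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["player_id"],  # records with the same key land on the same shard
    )

    # --- Consumer: read records back from the stream's single shard ---
    shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
        print(record["Data"])

In this workshop the consumer role is played by Kinesis Data Firehose, so you will not need to write consumer code of your own.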

Configuring Kinesis Data Stream

  • In the AWS Management Console, select the Amazon Kinesis service.

  • Find the Data streams box and select Create data stream.

  • Enter a name for the stream. This workshop uses the name Peculiar-KDS.

  • Under Data stream capacity, enter 1 for the Number of open shards. This creates a Kinesis Data Stream with a write capacity of 1 MiB/second and 1,000 records/second, and a read capacity of 2 MiB/second.

In a production environment, you’ll want to estimate and monitor your data throughput to know how many shards your stream needs. Each shard ingests up to 1 MiB/second and 1,000 records/second and emits up to 2 MiB/second, so you can scale the number of shards up and down to match your throughput. A sketch of the equivalent API calls follows these steps.

  • Click Create data stream.
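If you would rather create the stream with the API than the console, the following is a minimal boto3 sketch of the same configuration (stream name and shard count taken from the steps above); the update_shard_count call shows one way you might scale later.

    import boto3

    kinesis = boto3.client("kinesis")

    # Create the stream with a single shard (1 MiB/s, 1,000 records/s write capacity)
    kinesis.create_stream(StreamName="Peculiar-KDS", ShardCount=1)

    # Wait until the stream is ACTIVE before writing to it
    kinesis.get_waiter("stream_exists").wait(StreamName="Peculiar-KDS")

    # Later, scale the shard count up or down as your throughput changes
    kinesis.update_shard_count(
        StreamName="Peculiar-KDS",
        TargetShardCount=2,          # example target; size this to your measured throughput
        ScalingType="UNIFORM_SCALING",
    )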

Configuring Kinesis Data Firehose

  • On the next page under Consumers, find Amazon Kinesis Data Firehose and select Process with delivery stream. You will create a Kinesis Data Firehose delivery stream to consume data from your Kinesis Data Stream and deliver it to Amazon S3.

  • Enter a delivery stream name. This lab uses Peculiar-KDF.

  • Keep the source as the default, which should be Kinesis Data Stream. The source should auto-select to the Kinesis Data Stream you just created, Peculiar-KDS.

In a production environment, you can optionally enable server-side encryption for source records in the delivery stream, which is a security best practice.
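Because this delivery stream reads from a Kinesis Data Stream, one way to encrypt the source records at rest is to enable server-side encryption on the stream itself. A minimal sketch, assuming the AWS-managed Kinesis KMS key:

    import boto3

    kinesis = boto3.client("kinesis")

    # Enable server-side encryption on the source stream using the AWS-managed key
    kinesis.start_stream_encryption(
        StreamName="Peculiar-KDS",
        EncryptionType="KMS",
        KeyId="alias/aws/kinesis",
    )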

  • Hit Next.

  • Here, you can optionally choose to transform source records with an AWS Lambda function. This is useful for basic ETL, transforming data before it is stored in your data lake. The Game Analytics Pipeline solution does this to convert streaming data from JSON to Parquet before storing it in S3. Parquet is a columnar format well suited to analytics because it can reduce storage and query costs and improve query performance. We will configure a Lambda function to do basic ETL in the next section of this lab.

  • For now, leave these configurations as default and hit Next.

  • Under Choose a destination, select Amazon S3 and Create new bucket.

  • Give your bucket a name and hit Create bucket. This lab uses peculiar-wizards-data-lake; S3 bucket names must be globally unique, so choose your own variation.

  • Your final configurations should look similar to this:

  • Ignore the other configurations for now, and hit Next.

  • Kinesis Data Firehose buffers incoming records before delivering them to your S3 bucket. Set Buffer size to 1 MB and Buffer interval to 60 seconds.

For this workshop, the buffer size and buffer interval are set to the minimum values to speed up data delivery to Amazon S3 during testing, but this results in less optimized batching. In a production environment, you will typically want to increase the buffer interval toward the maximum of 900 seconds (15 minutes) so Firehose writes fewer, larger objects.
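If you later want to adjust these values without the console (for example, raising the interval for production), the sketch below shows one way to do it with boto3; the delivery stream name comes from this lab, and everything else is looked up at run time.

    import boto3

    firehose = boto3.client("firehose")
    DELIVERY_STREAM = "Peculiar-KDF"

    # Look up the current version and destination of the delivery stream
    desc = firehose.describe_delivery_stream(DeliveryStreamName=DELIVERY_STREAM)["DeliveryStreamDescription"]

    # Raise the S3 buffering hints for better batching (interval maximum is 900 seconds)
    firehose.update_destination(
        DeliveryStreamName=DELIVERY_STREAM,
        CurrentDeliveryStreamVersionId=desc["VersionId"],
        DestinationId=desc["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}
        },
    )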

  • Leave compression and encryption disabled for now, and leave error logging enabled.

  • Scroll down to Permissions. Here you specify the AWS Identity and Access Management (IAM) role that gives Kinesis Data Firehose the permissions it needs to access your S3 bucket and any other resources it uses. Click Create or update IAM role.

  • Hit Next.

  • Review your configurations and hit Next.

You’ve successfully created your Kinesis Data Stream, Kinesis Data Firehose delivery stream, and S3 bucket for ingesting and storing data!
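As a quick sanity check, you can confirm that the delivery stream is ACTIVE and, after some test records have flowed through and the 60-second buffer interval has elapsed, list the objects Firehose has written to your bucket. A minimal sketch, assuming the names used in this lab:

    import boto3

    firehose = boto3.client("firehose")
    s3 = boto3.client("s3")

    # The delivery stream should report ACTIVE once creation has finished
    status = firehose.describe_delivery_stream(DeliveryStreamName="Peculiar-KDF")[
        "DeliveryStreamDescription"]["DeliveryStreamStatus"]
    print("Delivery stream status:", status)

    # Firehose writes delivered objects under a yyyy/MM/dd/HH/ prefix in the bucket
    for obj in s3.list_objects_v2(Bucket="peculiar-wizards-data-lake").get("Contents", []):
        print(obj["Key"], obj["Size"])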