Amazon Kinesis Cheat Sheet
Amazon Kinesis is a platform for real-time data streaming on AWS. It makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information. The Kinesis family includes four main services.
1. Kinesis Data Streams
Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources.
Core Concepts
- Data Producer: The source of data that puts records into a stream (e.g., web servers, mobile clients, IoT devices).
- Data Consumer: An application that processes data from a stream. You can have multiple consumers for a single stream.
- Data Stream: A set of shards. It acts as the channel for the data.
- Shard: The base throughput unit of a Kinesis data stream. A single shard provides a capacity of 1 MB/s data input and 2 MB/s data output. One shard also supports up to 1,000 records per second for writes. You specify the number of shards when you create a stream and can scale this number up or down.
- Data Record: The unit of data stored in a Kinesis data stream. A record is composed of a sequence number, a partition key, and a data blob. The maximum size of a data blob is 1 MB.
- Partition Key: Used to group data by shard within a stream. Kinesis uses the partition key to segregate and route records to different shards. Records with the same partition key always go to the same shard.
- Retention Period: By default, records are accessible for 24 hours after they are added to the stream. This can be extended up to 365 days.
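To make the partition-key routing concrete, here is a small sketch of how a key maps to a shard. Kinesis documents that it takes an MD5 hash of the partition key and routes the record to the shard whose hash key range contains the resulting 128-bit value; the helper below (`shard_for_key`, an illustrative name, not an SDK call) assumes the hash key space is split evenly across shards.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, mimicking Kinesis routing.

    Kinesis hashes the partition key with MD5, treats the digest as a
    128-bit integer, and picks the shard whose hash key range contains
    that value. Assumption: the 0 .. 2**128 - 1 space is divided evenly
    across `num_shards` shards (the default after a simple create).
    """
    hash_value = int.from_bytes(
        hashlib.md5(partition_key.encode("utf-8")).digest(), "big"
    )
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)
```

Because the mapping is deterministic, the same partition key always lands on the same shard, which is what preserves per-key ordering.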
Throughput and Scaling
- The total capacity of a stream is the sum of the capacities of its shards.
- You can dynamically scale the number of shards in a stream up or down using shard splitting or shard merging.
- Enhanced Fan-Out: A feature for consumers that provides dedicated throughput of 2 MB/s per shard per consumer and delivers records with lower latency. This is ideal for applications that need fast, parallel processing.
2. Kinesis Data Firehose
Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools. It is a fully managed service that automatically scales to match the throughput of your data.
Core Concepts
- Firehose Stream (or Delivery Stream): The underlying entity of Firehose. You send data records to a delivery stream.
- Source: The source of your streaming data. This can be a Kinesis Data Stream, Amazon MSK, or data sent directly via the AWS SDK, Kinesis Agent, or CloudWatch Logs.
- Destination: The data store where your data will be delivered. Destinations include:
  - Amazon S3
  - Amazon Redshift (data is first staged in S3)
  - Amazon OpenSearch Service (formerly Elasticsearch Service)
  - Generic HTTP endpoints
  - Third-party services like Datadog, New Relic, MongoDB, and Splunk
- Buffering: Firehose buffers incoming data to a certain size (Buffer Size) or for a certain period of time (Buffer Interval) before delivering it to the destination. This is configured on the delivery stream.
- Data Transformation: Firehose can invoke an AWS Lambda function to transform incoming source data before delivering it to destinations. This is useful for cleaning, filtering, or enriching data.
- Record Format Conversion: For S3 destinations, Firehose can convert the format of your input data from JSON to more efficient columnar formats like Apache Parquet or Apache ORC.
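The data-transformation hook has a fixed contract: Firehose invokes the Lambda with a batch of base64-encoded records, and each output record must echo the input's `recordId` and report a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch of such a function, assuming the records are JSON objects with a hypothetical `message` field:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda sketch.

    Each input record carries base64-encoded `data`; each output
    record must echo `recordId` and set `result` to Ok, Dropped, or
    ProcessingFailed. The transform here (uppercasing a `message`
    field) is a placeholder for real cleaning or enrichment logic.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["message"] = payload.get("message", "").upper()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Records marked `Dropped` are silently discarded, and `ProcessingFailed` records are delivered to the configured error prefix, so the function can filter as well as transform.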
Key Differences from Data Streams
- Management: Firehose is fully managed and requires no provisioning of shards or throughput. Data Streams requires you to manage shards.
- Data Consumers: Firehose delivers data to a pre-configured destination. Data Streams requires you to build custom consumer applications.
- Real-time Access: Data Streams provides real-time access (sub-second latency). Firehose has a minimum buffer interval of 60 seconds, resulting in near real-time delivery.
3. Kinesis Data Analytics
Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. It provides the underlying infrastructure to run your Apache Flink applications or standard SQL queries against streaming data.
How It Works
- Input: You specify a streaming data source (a Kinesis data stream or a Kinesis Data Firehose delivery stream).
- Process: You write data processing code using:
  - Kinesis Data Analytics for SQL Applications: Write standard SQL queries to filter, transform, and aggregate data.
  - Kinesis Data Analytics for Apache Flink: Build sophisticated streaming applications in Java, Scala, or Python using the Apache Flink framework.
- Output: You configure one or more outputs, or destinations, where the results of the analysis are sent. This can be another Kinesis stream, a Firehose delivery stream, or a Lambda function.
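The core pattern in most of these applications is a windowed aggregation: grouping events into fixed time buckets and computing a statistic per bucket. As a conceptual illustration (plain Python, not code that runs inside Kinesis Data Analytics itself), this is what a tumbling-window count, the kind of query you would express in streaming SQL or Flink, computes:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count (timestamp, key) events per fixed, non-overlapping window.

    A tumbling window assigns each event to exactly one bucket: the
    window starting at timestamp // window_seconds * window_seconds.
    Returns a dict mapping (window_start, key) -> event count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

In a real streaming application the same logic runs continuously and emits a result row as each window closes, rather than batch-processing a finished list.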
Use Cases
- Generate real-time metrics for application monitoring.
- Perform real-time ETL (Extract, Transform, Load).
- Analyze clickstream data to understand user engagement.
- Detect anomalies and fraud in real time.
4. Kinesis Video Streams
Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing.
Core Concepts
- It durably stores, encrypts, and indexes video data in your streams.
- You can access your data through easy-to-use APIs.
- It enables you to build applications with real-time computer vision and video analytics using popular ML frameworks.
- You can view live or recorded video streams using HTTP Live Streaming (HLS).