Published on

Exploring Options for Unbounded Streams

Part 1 of "Requesting Unbounded Streams In AWS"

Welcome to the Series

Generally, data is pushed into our serverless applications or actively retrieved as messages when needed. This series aims to explore different options for maintaining open, long-running TCP connections to an event provider to listen for updates as they come in.

Table of Contents

Overview

The problem

When prototyping and working with external data providers, it is quite easy to spin up a local server that requests and receives updates from an existing WebSocket.

WebSocket Connection

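The Python example below opens a streaming connection with the requests library. The connection remains open for as long as the server is running and both parties keep it alive. Strictly speaking, this is a long-lived HTTP streaming request rather than the WebSocket protocol, but the problem it poses is the same.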

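import requests

# Placeholder endpoint; requests needs the URL scheme to be included
url = 'https://example.com/api/stream'
headers = {
    # ...
}

stream = requests.Session()

# stream=True keeps the response open so lines can be read as they arrive
with stream.get(url, headers=headers, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # skip empty keep-alive lines
            # ...
            pass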
This becomes quite a bit more complex when the solution needs to be packaged. One option is to dedicate some physical hardware (e.g. an IoT device or a server) to maintaining an active connection, receiving events, and forwarding them to our application for further processing. Another is to use cloud services to maintain the connection. In the cloud, our options fall into two main camps: serverless, where the cloud provider maintains the underlying hardware for us, and whatever the opposite of serverless is. AWS Lambda is the obvious choice for a serverless implementation, and a containerized server is the alternative.

Why Not Kinesis Data Streams?

Or API Gateway WebSockets? AppSync WebSockets? AWS IoT? Or any of the other services AWS offers? Simply put, there aren't any services or tooling dedicated to maintaining long-running, outbound connections, or at least none that I could find. The listed examples all focus on inbound events or data.

This makes sense when considering that the core principles of serverless (short-lived, stateless, event-driven compute) are at odds with holding a connection open indefinitely.

Target architecture

The focus of this series is to explore various solutions for the Connection Manager abstraction in the architecture below. The main goal of this component is to maintain a connection, receive any events, and write those events to a datastore (like DynamoDB). Once the datastore has been written to, an event can be passed on to the Event Processor abstraction.

Simplified Target Architecture
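The responsibilities of the Connection Manager can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the in-memory dictionary standing in for the datastore, the notify callback standing in for the Event Processor trigger, and the "id" field on each event are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class ConnectionManager:
    """Receives events from a source, stores them, then notifies a processor."""

    def __init__(self, store: Dict[str, dict], notify: Callable[[dict], None]) -> None:
        self.store = store      # stand-in for a datastore such as DynamoDB
        self.notify = notify    # stand-in for triggering the Event Processor

    def run(self, events: Iterable[dict]) -> int:
        """Consume events until the source ends; return how many were handled."""
        handled = 0
        for event in events:
            self.store[event["id"]] = event   # 1. write to the datastore
            self.notify(event)                # 2. only then pass the event on
            handled += 1
        return handled


# Usage with in-memory stand-ins instead of real AWS resources
store: Dict[str, dict] = {}
processed: List[str] = []
manager = ConnectionManager(store, lambda event: processed.append(event["id"]))
count = manager.run([{"id": "a", "value": 1}, {"id": "b", "value": 2}])
print(count, sorted(store), processed)
```

Keeping the datastore and the processor behind simple interfaces means the same loop could, in principle, run inside any of the options discussed in this series.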

Possible Solutions

In the examples below, it is assumed that the processing and handling of the data will remain stable and relatively consistent, without large spikes or scaling requirements.

Note that the suggestions below are ideas based on initial research. They will likely change with additional testing and research, which will be covered in future posts.

1. Lambda

Deploy the code as a Lambda function, and let AWS manage the underlying resources needed to run it in the cloud

Cost

*Note: prices are for the ap-southeast-2 region and are correct as of writing

  • min $0.0000000021 / millisecond
  • min $5.4432 / 30 days
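The 30-day figure follows from the per-millisecond rate if one function runs without a break. A quick check, assuming duration charges only (per-request charges and any free tier are ignored here):

```python
PRICE_PER_MS = 0.0000000021                  # USD per millisecond, from the list above
MS_PER_30_DAYS = 30 * 24 * 60 * 60 * 1000    # milliseconds in 30 days

lambda_cost_30_days = PRICE_PER_MS * MS_PER_30_DAYS
print(f"${lambda_cost_30_days:.4f} per 30 days")
```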

Why?

  • Each new request or connection is handled by its own Lambda invocation
  • Serverless implementation
  • Many easy-to-integrate ways to trigger the function
  • Easy to set up and connect with other AWS resources

Why not?

  • 15-minute hard limit on the run time of any Lambda function
  • Additional complexity when re-opening the connection and determining state

2. Lambda + Step Functions

Use AWS Step Functions to manage the Lambda invocations, opening a new connection as the existing Lambda is about to exceed its time limit
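One way the hand-over could work is for the function to watch its own remaining time and return early with enough state for the next invocation to resume. The sketch below is a rough illustration, not a tested design: the listen function, the 30-second safety margin, the "continue" flag, and the "id" field are assumptions, while get_remaining_time_in_millis() is the method Lambda exposes on its context object. A small fake context stands in for Lambda so the example runs locally.

```python
from typing import Any, Dict, Iterable, Optional

SAFETY_MARGIN_MS = 30_000  # assumption: stop 30 seconds before the timeout


def listen(events: Iterable[Dict[str, Any]], context: Any,
           last_event_id: Optional[str] = None) -> Dict[str, Any]:
    """Consume events until the source ends or the time budget runs low."""
    for event in events:
        last_event_id = event["id"]  # real code would store the event here
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Ask the state machine to start a fresh invocation.
            return {"continue": True, "last_event_id": last_event_id}
    return {"continue": False, "last_event_id": last_event_id}


class FakeContext:
    """Local stand-in for the Lambda context object."""

    def __init__(self, remaining_ms):
        self._remaining = list(remaining_ms)

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining.pop(0)


result = listen([{"id": "1"}, {"id": "2"}, {"id": "3"}],
                FakeContext([600_000, 20_000, 10_000]))
print(result)
```

Note that this version only checks the clock when an event arrives, so a quiet stream could still run into the timeout. A real implementation would also need a read timeout on the connection.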

Cost

*Note: prices are for the ap-southeast-2 region and are correct as of writing

  • In addition to the Lambda costing
  • $0.025 per 1,000 state transitions
  • $0.000025 / state transition, with a minimum of 2,880 state transitions in 30 days (one per 15-minute Lambda window)
  • min $0.072 for 30 days
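The transition count and its cost can be checked with a little arithmetic. This assumes a single state transition each time a Lambda reaches its 15-minute limit. A workflow with extra states (a Choice state, for example) would use more. Any free tier is ignored.

```python
WINDOW_MINUTES = 15                  # Lambda's maximum run time
PRICE_PER_TRANSITION = 0.000025      # USD, from the list above

windows_per_30_days = 30 * 24 * 60 // WINDOW_MINUTES
transition_cost = windows_per_30_days * PRICE_PER_TRANSITION
print(windows_per_30_days, f"${transition_cost:.3f}")
```

With these figures, the transitions alone come to about $0.072 per 30 days on top of the Lambda cost.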

Why?

  • Easy to trigger and manage states
  • A Step Functions execution can run for up to 1 year
  • Decouples the Lambda code from the process of spawning new connections

Why not?

  • Additional complexity:
    • Regularly spinning up and closing connections
    • Duplicate messages, potential message loss
  • Additional cost on top of the Lambda pricing
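Duplicates are likely if the new connection is opened before the old one closes, which is also the simplest way to avoid losing messages. A minimal sketch of handling this, assuming the provider attaches a unique ID to every event (the "id" field and the merge_events helper are hypothetical):

```python
from typing import Dict, Iterable, List


def merge_events(store: Dict[str, dict], events: Iterable[dict]) -> List[str]:
    """Store each event once, keyed by its ID; return the IDs newly added."""
    added = []
    for event in events:
        if event["id"] in store:
            continue  # already received through the previous connection
        store[event["id"]] = event
        added.append(event["id"])
    return added


store: Dict[str, dict] = {}
old_connection = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
new_connection = [{"id": "3"}, {"id": "4"}]  # overlaps with the old one

first = merge_events(store, old_connection)
second = merge_events(store, new_connection)
print(first, second)
```

With DynamoDB as the datastore, a conditional write that only succeeds when the key does not exist yet could play the same role as the dictionary lookup here.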

3. EC2 for long running applications

Swap out the Lambda functions for EC2 instances to maintain an active connection. Use the server to receive events and then hand them off to serverless resources for processing. Although this is not a serverless solution, it definitely needs to be explored. It involves hosting code on an EC2 instance, which is effectively a virtual server that can be provisioned and accessed over the cloud.
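The hand-off from the server to the serverless side could be as simple as grouping events and passing each group to a sender. The sketch below assumes a hypothetical send_batch callable (for example a thin wrapper around a queue or a function invocation) and an arbitrary batch size of 10. Neither comes from a specific AWS API.

```python
from typing import Callable, Iterable, List

BATCH_SIZE = 10  # assumption: arbitrary batch size for illustration


def forward_in_batches(events: Iterable[dict],
                       send_batch: Callable[[List[dict]], None],
                       batch_size: int = BATCH_SIZE) -> int:
    """Group incoming events into batches and hand each batch to send_batch."""
    batch: List[dict] = []
    sent = 0
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            send_batch(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush whatever is left when the stream ends
        send_batch(batch)
        sent += len(batch)
    return sent


# Usage: collect the batches in a list instead of sending them anywhere
batches: List[List[dict]] = []
total = forward_in_batches(({"id": i} for i in range(25)), batches.append)
print(total, [len(b) for b in batches])
```

On a truly unbounded stream the final flush never happens, so a real version would also flush on a timer to keep latency bounded.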

Cost

*Note: prices are for the ap-southeast-2 region and are correct as of writing

  • min $0.0042 / hour for a t4g.nano
  • min $3.024 / 30 days

Why?

  • Servers are the natural solution for this type of task
  • Decouples the long running tasks from the rest of the serverless solution
  • Explore pairing with Elastic Container Service (ECS) and AWS Fargate to launch and configure containers

Why not?

  • Additional set-up and management overhead
  • Additional complexity to trigger and stop instances
  • Scaling based on the number of connections needed
    • Scale up or scale out (add more power to handle more connections, or add more servers)
    • If scaling up, may still need to add more servers

4. EC2 for all processing

If we have already done the work to configure the EC2 instances, why not attach a persistent data store and handle all processing on the server, instead of just gathering the events and sending them into our serverless architecture? We will explore some of these options later in the series.