For distributed workflows - tradeoffs having stateless workers operate on central state, versus stateful workers?

I'm working on a problem right now that processes incoming data at a very high rate. Each event that flows in has an association ID, and each group of associated events will affect behaviour over a time window.

For example, assume all the staggered events below share the same association ID:

--->               |
      --->         |
             --->  |  Workflow over some time (t - say, 5 minute window)
  --->             |
          --->     |

TPS is high (100k+) and all associated events need to be processed statefully by a worker that has access to that state.

Actions are based on calculations performed over the previous window of state for each group of associated events. Something like:

// Derive an action from the last 5 minutes of events
// sharing one association ID.
function deriveAction(associatedEvents: Event[]): void {
  // some logic
  // some action
}

At some point, association windows expire and a new group is worked on.
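
To make this concrete, here is a minimal sketch of the per-key state I'd expect a stateful worker to hold. The Event shape, WINDOW_MS, and onEvent are made up for illustration:

// Hypothetical event shape - the real schema isn't shown here.
interface Event {
  associationId: string;
  timestamp: number; // epoch millis
  payload: unknown;
}

const WINDOW_MS = 5 * 60 * 1000; // the 5-minute window from above

// Per-association-ID state held by whichever worker owns the key.
const windows = new Map<string, Event[]>();

function onEvent(event: Event): void {
  const cutoff = event.timestamp - WINDOW_MS;
  const group = windows.get(event.associationId) ?? [];
  // Expire events that have fallen out of the window, then append.
  const live = group.filter((e) => e.timestamp >= cutoff);
  live.push(event);
  windows.set(event.associationId, live);
  deriveAction(live); // act per event, with no batching delay
}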

My question is:

  1. Is it better to send all associated events to the same worker, i.e. shard the load by association ID in a way that doesn't create hot instances (sketched just after this list)?
  2. Or is it better to centralize state in, say, DynamoDB and have workers operate on that central state with some trigger? The workers would not necessarily be pinned to one workflow; the system would instead be event-driven, and each worker would fetch state as required.
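
For option 1, keying the Kafka messages by association ID should give this routing for free, since the default partitioner hashes the key. A rough sketch with kafkajs (client ID, broker, and topic name are placeholders):

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'ingest', brokers: ['broker:9092'] });
const producer = kafka.producer(); // producer.connect() at startup omitted

// Keyed by association ID: Kafka's default partitioner hashes the key,
// so every event in a group lands on the same partition and is consumed
// by the same worker in the consumer group.
async function publish(event: Event): Promise<void> {
  await producer.send({
    topic: 'events', // placeholder topic name
    messages: [{ key: event.associationId, value: JSON.stringify(event) }],
  });
}

The hot-instance risk then becomes hot-partition risk: one very busy association ID still maps to a single partition and a single consumer.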

The sharding approach (1) seems far simpler to model, but harder to manage.
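
For comparison, the central-state approach (2) would mean each worker reading the window back from the store on every trigger. A sketch with the AWS SDK v3 document client, assuming a table keyed by associationId (partition key) and timestamp (sort key) - the table name and key schema are my assumptions:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Read the last 5 minutes of events for one association ID.
// '#ts' aliases 'timestamp', which is a DynamoDB reserved word.
async function fetchWindow(associationId: string, now: number): Promise<Event[]> {
  const result = await ddb.send(new QueryCommand({
    TableName: 'association-events', // placeholder table name
    KeyConditionExpression: 'associationId = :id AND #ts >= :cutoff',
    ExpressionAttributeNames: { '#ts': 'timestamp' },
    ExpressionAttributeValues: { ':id': associationId, ':cutoff': now - WINDOW_MS },
  }));
  return (result.Items ?? []) as Event[];
}

At 100k+ TPS that is a write plus a query per event, and that round trip is exactly where I'd expect the latency budget to go.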

The event stream is Kafka, so I was originally looking at Flink. Another requirement is that actions need to be very low latency: I can't afford to buffer events for, say, 10 seconds and then send them to a worker in batches.
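
To be concrete about the latency constraint: the consumer has to act per message rather than on a collected batch. With kafkajs that would look roughly like this (the group ID is a placeholder, and the topic reuses the one from the producer sketch):

const consumer = kafka.consumer({ groupId: 'action-workers' }); // placeholder group

await consumer.connect();
await consumer.subscribe({ topics: ['events'] });
await consumer.run({
  // eachMessage fires once per record, so deriveAction runs with no
  // artificial batching delay - only consumer lag sits between an
  // event arriving and an action firing.
  eachMessage: async ({ message }) => {
    if (!message.value) return;
    onEvent(JSON.parse(message.value.toString()) as Event);
  },
});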

Any advice would be appreciated.