Workflow

Daniel Rix

Apr 8, 2022 • 11 min read

Business logic is easy; mapping it out and making sure it actually happens in the order and timeframe desired is hard.

This is extremely generalized, except where otherwise noted; those parts are specific to AWS SWF.

Let's define a lot of things from the start.

A workflow is a complete, end-to-end description of actions and events.
The stages of a workflow are determined by the decider. It reads the history of events and makes zero or more choices on what to do next.
Activities are the actual workhorses that do the business logic.
Activity runners call the activities. They interface with the workflow service, usually on a long-poll, looking for ActivityScheduled events. These are kept separate from the Decider, because they do actual work.
History/events are a log of everything that goes on with the workflow, including external events if needed.

Scope of events:

Service-level - The workflow service itself inserts these into the event stream. These are usually timeout events, and the DeciderSchedule event.
Decider-level - The decider can only schedule activities and issue WorkflowSuccess, WorkflowFailed, and DeciderCompleted.
External-level - These are events that are from an external provider, such as calling an API endpoint to make a signal.
Activity-level - These events are from the activity runners. They only consist of ActivitySuccess, ActivityFailed, and ActivityHeartbeat.

Let's get the different events out of the way as well.

WorkflowStart - Service-level. This is always event #1. It contains the input to the workflow, and critical timeout times. It is not possible to specify this event yourself; it is added by the workflow service itself. Right after, the service does a DeciderSchedule.
WorkflowSuccess - Decider-level. The (positive) last event. No other event can be called after this. Workflow completed.
WorkflowFailed - Decider-level (or in very rare circumstances, the service-level.) The (negative) last event. No other event can be called after this. Workflow completed.
WorkflowTimedout - Service-level. The (agnostic) last event. No other event can be called after this. Workflow completed. This is done from the outer-most timeout on the workflow, usually defined by the service, or when the workflow is first started.
DeciderCompleted - Service-level. The service waits until this event is received before processing any of the other events from the decider decision. This is the last event from a decider decision.
DeciderScheduled - Service-level. This is triggered from almost any other event, with the caveat that there can only be one since the last DeciderCompleted event. The idea here is to only allow one decider to be active at any one time. The logic of how we can make that guarantee will be expanded on later.
DeciderTimedOut - Service-level. Fired if the decider took too much time to mark itself complete, or never ran. When this event is added to the history, the service adds another DeciderScheduled task.
ActivityScheduled - Decider-level. This is the core of the scheduling. The decider schedules a task to be done. This is not a call to do it immediately; it is a call to put it on the stack for the activities to pick up and run. It contains the timeouts as well: how long to wait for the task to be started or finished, and/or the interval for heartbeats.
ActivityStarted - Service-level. One of the weirder events. This is triggered once the Activity-runner acknowledges the long-poll from the ActivityScheduled event.
ActivitySuccess - Activity-level. When business logic is done, Activity calls this.
ActivityFailed - Activity-level. If the business logic had a failure, or otherwise could not recover, this is the event the Activity should send.
ActivityTimedout - Service-level. This is usually caused by the ActivityRunner not actually running. There are other causes, such as the Activity taking too much time to complete.
ActivityScheduledTimeout - Service-level. Another one of the weirder events. This is a timeout event for when the Activity-Runner has failed to pick it up. The timeout time is determined from the WorkflowStart.
ActivityTaskCancel - Decider-level. Asks an Activity to stop work. This is hard to enforce without using heartbeats.
TimerScheduled - Decider-level.
TimerScheduleFail - Service-level. Timers are named; if the name already exists, you can't schedule it again.
TimerExpired - Service-level.
TimerCanceled - Decider-level.
TimerCanceledFailed - Service-level. Interesting case where the timer might have already expired, so a cancel of an already tripped timer fails.
Signal - External-level.
Marker - Decider-level.
MarkerFailed - Service-level. I have never seen this in the wild; the assumption is that a service-level issue prevents the marker from entering the history stream, but this event can still enter the stream on its own.
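
As a rough aid, the scopes above can be written down as a lookup table. This is only a sketch: the `Scope` enum, `EVENT_SCOPE`, `may_emit`, and `is_closed` are names made up for illustration and are not part of any workflow service API. The table simply copies the scope given for each event in the definitions above (ActivityHeartbeat comes from the scope list).

```python
from enum import Enum
from typing import List


class Scope(Enum):
    SERVICE = "service"
    DECIDER = "decider"
    ACTIVITY = "activity"
    EXTERNAL = "external"


# Scope of each event, copied from the definitions above.
EVENT_SCOPE = {
    "WorkflowStart": Scope.SERVICE,
    "WorkflowSuccess": Scope.DECIDER,
    "WorkflowFailed": Scope.DECIDER,
    "WorkflowTimedout": Scope.SERVICE,
    "DeciderCompleted": Scope.SERVICE,
    "DeciderScheduled": Scope.SERVICE,
    "DeciderTimedOut": Scope.SERVICE,
    "ActivityScheduled": Scope.DECIDER,
    "ActivityStarted": Scope.SERVICE,
    "ActivitySuccess": Scope.ACTIVITY,
    "ActivityFailed": Scope.ACTIVITY,
    "ActivityHeartbeat": Scope.ACTIVITY,
    "ActivityTimedout": Scope.SERVICE,
    "ActivityScheduledTimeout": Scope.SERVICE,
    "ActivityTaskCancel": Scope.DECIDER,
    "TimerScheduled": Scope.DECIDER,
    "TimerScheduleFail": Scope.SERVICE,
    "TimerExpired": Scope.SERVICE,
    "TimerCanceled": Scope.DECIDER,
    "TimerCanceledFailed": Scope.SERVICE,
    "Signal": Scope.EXTERNAL,
    "Marker": Scope.DECIDER,
    "MarkerFailed": Scope.SERVICE,
}

# The three "last events"; nothing may follow them.
TERMINAL_EVENTS = {"WorkflowSuccess", "WorkflowFailed", "WorkflowTimedout"}


def may_emit(scope: Scope, event: str) -> bool:
    """True if the given scope is the one that adds this event."""
    return EVENT_SCOPE.get(event) is scope


def is_closed(history: List[str]) -> bool:
    """True if the history ends in one of the terminal events."""
    return bool(history) and history[-1] in TERMINAL_EVENTS
```

A service could use a table like this to reject, say, an activity runner trying to add an ActivityStarted event itself.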

So now that the definitions are out of the way...
There are a lot of timed events (and timeouts) that go into making a workflow work.
Let's start with the timeouts that are defined at the Service-level.

WorkflowTimeout - How long a workflow should be 'active' before this fires and just ends it where it stands. AWS sets this limit (by default) at 1 year. It is configurable when you make the API call to start the workflow from the service endpoint.
ActivityTimeout - There are multiple variations of this: start-to-close, schedule-to-start, heartbeat, and schedule-to-close. Start-to-close is how long it actually takes to complete the job. Schedule-to-start is how long it takes the Activity-Runners to pick it up; rarely will you have issues with this timeout type. Heartbeat is usually the odd one out and is not used unless you have long-running, active tasks. Schedule-to-close is the total time from when the decider knew it needed to do something to the time it was actually completed. Most of the time, this one is moot.
DeciderTimeout - How long the decider has to make the choices, schedule them, and mark itself as completed. The decider itself should only take a few seconds at most to make the calls. No logic should be present in the decider; only what is required to make a proper decision.
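
To make the four activity timeouts a bit more concrete, here is a minimal sketch of how a service might work out which one has tripped. Everything here is an assumption for illustration: the `ActivityTimeouts` class, the `expired_timeout` function, the plain-seconds timestamps, and especially the order in which the timeouts are checked. A real service will have its own rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActivityTimeouts:
    schedule_to_start: float            # how long a runner has to pick the task up
    start_to_close: float               # how long the job itself may take
    schedule_to_close: float            # total budget from scheduling to completion
    heartbeat: Optional[float] = None   # max gap between heartbeats, if used


def expired_timeout(
    t: ActivityTimeouts,
    now: float,
    scheduled_at: float,
    started_at: Optional[float] = None,
    last_heartbeat_at: Optional[float] = None,
) -> Optional[str]:
    """Return the name of the timeout that has tripped, or None."""
    if now - scheduled_at > t.schedule_to_close:
        return "schedule-to-close"
    if started_at is None:
        # No runner has picked the task up yet.
        if now - scheduled_at > t.schedule_to_start:
            return "schedule-to-start"
        return None
    if now - started_at > t.start_to_close:
        return "start-to-close"
    if t.heartbeat is not None:
        # Before the first heartbeat, measure from the start time.
        last = last_heartbeat_at if last_heartbeat_at is not None else started_at
        if now - last > t.heartbeat:
            return "heartbeat"
    return None
```

For example, with a 10 second schedule-to-start, a task that is still unclaimed 11 seconds after scheduling would report "schedule-to-start".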

With those definitions out of the way, let's start with the weird part about the decider history flow.

Decider History Flow

Deciders look at the event history, but they only need to look at a subset of the history instead of the whole thing. That subset is from the last DeciderStarted event (or back to event #1) to the current DeciderStarted. The reason is that the last decider doesn't know any history from after it started; that's the decider history range we care about.
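
That window can be sketched as a small slicing function. This is an illustration only: `decider_window` is a made-up name, the history is simplified to a list of event names with 1-based event ids, and the choice to exclude the previous DeciderStarted while including the current one is an assumption.

```python
from typing import List, Tuple


def decider_window(history: List[str]) -> List[Tuple[int, str]]:
    """Return (event_id, name) pairs after the previous DeciderStarted,
    up to and including the most recent DeciderStarted."""
    started = [
        event_id
        for event_id, name in enumerate(history, start=1)
        if name == "DeciderStarted"
    ]
    if not started:
        raise ValueError("no DeciderStarted in history")
    current = started[-1]
    # With no earlier decider run, the window reaches back to event #1.
    previous = started[-2] if len(started) > 1 else 0
    return [(i, history[i - 1]) for i in range(previous + 1, current + 1)]
```

Note that anything recorded after the current DeciderStarted is dropped, since this decider run should not be acting on it.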

That's how a decider operates; how about how the service deals with them?
There are a few events that trigger a DeciderScheduled event, such as Success/Fail/Timeout events, workflow timers, and signals. The caveat is that there should only be one DeciderScheduled at a time unless it has a matched DeciderStarted (or DeciderTimeout).
The point of this is to ensure that only one decider runs at a time, so they aren't stepping on each other's toes; it's hard to have two deciders running at the same time if the next one isn't scheduled until after the last has already started.
To go along with this, the DeciderStarted event is where the 'last known point' was for anything the decider has to do.
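
Here is a minimal sketch of that rule from the service's side. The `TRIGGERS` set and `should_schedule_decider` are hypothetical names, and the exact set of triggering events is an assumption based on the list above; a real service will have more of them.

```python
from typing import List

# Events assumed to trigger a DeciderScheduled (illustrative, not exhaustive).
TRIGGERS = {
    "WorkflowStart",
    "ActivitySuccess",
    "ActivityFailed",
    "ActivityTimedout",
    "TimerExpired",
    "Signal",
}


def should_schedule_decider(history: List[str]) -> bool:
    """True if the newest event is a trigger and no DeciderScheduled is
    still waiting for its DeciderStarted (or DeciderTimedOut)."""
    if not history or history[-1] not in TRIGGERS:
        return False
    for name in reversed(history[:-1]):
        if name == "DeciderScheduled":
            return False  # one is already pending; the next decider will see this event
        if name in ("DeciderStarted", "DeciderTimedOut"):
            return True   # the last schedule was matched, so a new one is allowed
    return True           # no decider has been scheduled yet
```

The walk backwards stops at the first decider-related event it finds, which is enough to tell whether a schedule is still outstanding.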

The decider has a few options on what it can do: it can mark the workflow as done (success/failure), choose to schedule at least one activity (more than one is completely acceptable), or choose to not schedule anything (decider logic has to be built correctly on this, because it is easy to make a decider do nothing, and hit the workflow timeout itself). Once it's chosen what to do, it needs to mark itself as completed.

The service level takes the output of the decider and adds them to the history, and adds the DeciderCompleted event.
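
Those three options can be sketched as a tiny decider for a made-up two-step workflow. The `decide` function, the tuple-shaped decisions, and the activity names StepOne and StepTwo are all hypothetical; the point is only that the decider returns a list of decisions (possibly empty) and leaves the rest to the service.

```python
from typing import List, Set, Tuple


def decide(
    scheduled: Set[str],
    succeeded: Set[str],
    failed: Set[str],
) -> List[Tuple[str, str]]:
    """Return the decisions for one decider run: end the workflow,
    schedule something, or do nothing at all."""
    if failed:
        # Simplest possible policy: any failure ends the workflow.
        return [("WorkflowFailed", sorted(failed)[0])]
    if not scheduled:
        return [("ActivityScheduled", "StepOne")]
    if "StepTwo" in succeeded:
        return [("WorkflowSuccess", "")]
    if "StepOne" in succeeded and "StepTwo" not in scheduled:
        return [("ActivityScheduled", "StepTwo")]
    # Work is still in flight; schedule nothing and wait to be called again.
    return []
```

The empty list is the dangerous branch: if the conditions are wrong, the decider keeps returning nothing until the workflow timeout fires.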

On decider issues...

There are times when a decider may get stuck, not get all the history, or just suffer cosmic bit-flips. When those happen, the service watches the time and issues a DeciderTimeout if needed, followed by another DeciderScheduled, and a DeciderStarted event when another decider picks it up.

Specific to AWS, there is a weird edge-case where the decider gets the notice to start and starts to pull down the history, and then the decider long-poll expires. Any decision that would be made is invalid, because AWS knows that the decider token has expired. This forces the AWS service to wait for a DeciderTimeout, and issue another DeciderSchedule and DeciderStart.

What if you want to wait until a parallel task is fully complete?

Some observers might notice that there is only one DeciderSchedule at any one time, which means a decider might not see all the parallel tasks complete within the same DeciderSchedule cycle. This is a true statement. In the cases where you need to know that, you can go back through the history and do a count. We can't make a decision on anything that is outside of our 'window', but we can use the previous history to count what has been done and what is still pending.

Important Note: The decider should have a hard upper limit on the history event it can see up to, that is the DeciderStarted event. Since the decider is originally tasked to look at stuff from the last Started to the current Started, you can't look at history that, officially, a decider isn't able to see yet.

To do this, you can match up Scheduled, Started, and Completed/Failed/Timedout Activities. If all the parallel tasks are complete, then there is your answer; if they don't match up just yet, then it's best to not schedule anything and wait until the decider gets called when they do match up. We include the Timedout because we might want to rerun the activity that timed out, which increases the Scheduled and Started number.
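
The counting can be sketched like this. It is a simplification: `pending_activities` is a made-up helper, the history is reduced to (event name, activity name) pairs, and it only compares Scheduled counts against finished counts per activity rather than tracking Started as well.

```python
from collections import Counter
from typing import List, Set, Tuple

# Events that close out one scheduled run of an activity.
FINISHED = {"ActivitySuccess", "ActivityFailed", "ActivityTimedout"}


def pending_activities(history: List[Tuple[str, str]]) -> Set[str]:
    """Return the activities with more Scheduled events than finished ones.
    The history should stop at the current DeciderStarted."""
    scheduled: Counter = Counter()
    finished: Counter = Counter()
    for event, activity in history:
        if event == "ActivityScheduled":
            scheduled[activity] += 1
        elif event in FINISHED:
            finished[activity] += 1
    return {name for name, count in scheduled.items() if finished[name] < count}
```

An empty result means nothing is in flight, so the decider can move on; a rerun after a timeout bumps the Scheduled count and puts the activity back in the pending set.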

The perfect empty execution

WorkflowStarted
DeciderScheduled
DeciderStarted
DeciderCompleted
WorkflowSuccess

This is a perfect execution because it showcases the decider choosing to end the workflow as a success. Even though there are no Activities, this is a valid, simple workflow (albeit not useful, but a valid one nonetheless).

As you start building a flow and Activities start increasing, so does the complexity of the decider; not only do you have to do the counts to make sure everything happens at the right time, but you also have to start dealing with failures, timeouts and general shenanigans from the Service, the Decider, and the Activities themselves. We'll touch on how to do all these later. For now, let's look at a perfect flow with multiple activities.

The perfect workflow with multiple activities

The perfect workflow event log, with 3 activities and notes along with it.
TaskA and TaskB should be done in parallel, followed by TaskC when either one is completed.

1. WorkflowStarted - Whatever input
2. DeciderSchedule - First schedule for the decider
3. DeciderStarted - Run of the first schedule of decider
4. DeciderCompleted - First run completed, and scheduled 5 and 6.
5. ActivityScheduled - TaskA
6. ActivityScheduled - TaskB
7. ActivityStarted - TaskB - There is no set reason why things have to start in the order they were scheduled. They were scheduled at the same time, which means they should be running in parallel.
8. ActivityStarted - TaskA
9. ActivitySuccess - TaskB
10. DeciderSchedule - TaskB finished first, the service added this, since there has been no DeciderSchedule since the last DeciderStarted (3).
11. DeciderStarted - The decider has all the history, but only needs to work on stuff from the last DeciderStarted (3). Events 4 - 11 are the ones that need to be looked at.
12. ActivitySuccess - TaskA
13. DeciderSchedule - The service added this, since there has been no DeciderSchedule since the last DeciderStarted (11)
14. DeciderCompleted - This is from the decider started on (11)
15. ActivityScheduled - TaskC
16. DeciderStarted - This is from the schedule (13). It really only has events 11 - 16, which include an ActivitySuccess; but since TaskC was already scheduled, let's not reschedule it.
17. ActivityStarted - TaskC
18. DeciderCompleted - From 16
19. ActivitySuccess - TaskC
20. DeciderSchedule
21. DeciderStarted - Has events 16 - 21
22. DeciderCompleted - Decider determined C was complete, and there is nothing else to do but mark as done.
23. WorkflowSuccess

That's... a lot of stuff for 3 tasks... but... it ensures that it happens in the correct order, and without losing information. Let's take it a step further, with errors.

It should be noted that even for smaller workflows, the history count can be in the hundreds, especially if there is parallel processing involved; each Activity is at a minimum three events (scheduled, started, completed.) Lots of DeciderSchedule/Started/Completed as well. I've worked on workflows that reach into the thousands of history events.
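
With histories that long, it helps to check the decider bookkeeping mechanically. The sketch below replays the 23-event log from the three-task example and verifies the rules described so far: only one DeciderSchedule waiting at a time, no two deciders running at once, and a terminal event at the end. `LOG`, `TERMINAL`, and `check_log` are names invented for this illustration.

```python
from typing import List

# The 23 events of the three-task example, in order.
LOG = [
    "WorkflowStarted", "DeciderSchedule", "DeciderStarted", "DeciderCompleted",
    "ActivityScheduled", "ActivityScheduled", "ActivityStarted", "ActivityStarted",
    "ActivitySuccess", "DeciderSchedule", "DeciderStarted", "ActivitySuccess",
    "DeciderSchedule", "DeciderCompleted", "ActivityScheduled", "DeciderStarted",
    "ActivityStarted", "DeciderCompleted", "ActivitySuccess", "DeciderSchedule",
    "DeciderStarted", "DeciderCompleted", "WorkflowSuccess",
]

TERMINAL = {"WorkflowSuccess", "WorkflowFailed", "WorkflowTimedout"}


def check_log(log: List[str]) -> List[str]:
    """Return a list of problems found in the decider bookkeeping."""
    problems = []
    pending = False  # a DeciderSchedule not yet matched by a DeciderStarted
    running = False  # a DeciderStarted not yet matched by a DeciderCompleted
    for event_id, name in enumerate(log, start=1):
        if name == "DeciderSchedule":
            if pending:
                problems.append(f"{event_id}: second DeciderSchedule while one is pending")
            pending = True
        elif name == "DeciderStarted":
            if not pending:
                problems.append(f"{event_id}: DeciderStarted without a DeciderSchedule")
            if running:
                problems.append(f"{event_id}: two deciders running at once")
            pending, running = False, True
        elif name == "DeciderCompleted":
            if not running:
                problems.append(f"{event_id}: DeciderCompleted without a DeciderStarted")
            running = False
    if not log or log[0] != "WorkflowStarted":
        problems.append("history does not begin with WorkflowStarted")
    if not log or log[-1] not in TERMINAL:
        problems.append("history does not end with a terminal event")
    return problems
```

Running `check_log(LOG)` on the example should come back empty, while a log with two DeciderSchedule events in a row would be flagged.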

Errors

It's a fact of life, shit happens. Machines try their hardest to not mess up, but they do, whether that's human error, or cosmic bit-flips. A workflow understands that things happen that weren't anticipated, and tries to correct for it with another set of constructs. If all else fails, have the decider alert a human that something didn't go right, and needs attention.

The Service-level handles (read: triggers) all the timeout cases; it waits the timeout time, and adds an event (or two if a DeciderScheduled is needed) to the history for a timeout; it's the decider's choice on what to do with that. Some timeouts are okay, for example a metrics or telemetry Activity; if that activity doesn't start/finish on time, it's not the end of the world. Other timeouts mean that the decider needs to schedule the Activity again, usually with an exponential backoff. For other timeouts, such as a DeciderTimeout, there really isn't anything you can do. You might look at how long it is taking the decider to do things.

The Activity-level errors are pretty much restricted to ActivityFailed. On a history event with a Failure, the Service does the same DeciderSchedule dance if needed. It's up to the decider on what to do. With a Failure, you can read a metadata field to find out why it failed and go from there. Usually, that 'fail-code' from the metadata would be used to determine whether to reschedule the task, forget it, or start another task.
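
A minimal sketch of that decision logic follows. The function names, the tuple-shaped result, the `NON_RETRYABLE` fail-codes, and the default limits are all made up for illustration; the delay returned for a retry would presumably be realized with a timer before rescheduling.

```python
from typing import Optional, Tuple

# Hypothetical fail-codes for which a retry cannot help.
NON_RETRYABLE = {"invalid_input"}


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * 2 ** (attempt - 1))


def on_activity_problem(
    event: str,
    attempt: int,
    fail_code: Optional[str] = None,
    max_attempts: int = 5,
    optional: bool = False,
) -> Tuple[str, Optional[float]]:
    """Pick what the decider does about an ActivityFailed or ActivityTimedout."""
    if optional:
        # e.g. a metrics or telemetry activity; not the end of the world
        return ("ignore", None)
    if event == "ActivityFailed" and fail_code in NON_RETRYABLE:
        return ("fail_workflow", None)
    if attempt >= max_attempts:
        # If all else fails, get a human involved.
        return ("alert_human", None)
    return ("retry_after", backoff_seconds(attempt))
```

The attempt number itself can come from counting the earlier Scheduled events for that activity in the history.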

At the end of the day, the workflow alerts when there are problems, and gives the decider complete authority on what to do about them.

External events

Say you have a third-party service you make a call to in an Activity; you don't know how long the third party will take to execute and return a result, or it is an asynchronous service. The result can come back to the workflow as an external event; these are known as signals.

A signal is a history item that triggers the DeciderSchedule dance. On a signal, which has some metadata attached, the decider runs to take an action.
Signals are also commonly used for first-party notices. If there is an issue, a cancelation (where a workflow cancelation wouldn't be enough, such that you have to back out of operations), or otherwise something different needs to happen, a signal can tell the decider so.

Internal events

We'll touch on this quickly. These are known as markers; it's a history item, but it does not trigger a DeciderSchedule. It is used only for some metadata for the decider; usually used for some indexing, or other data that otherwise has some bearing on the decider logic in some fashion.

Timers

Another thing we'll touch on is timers. They are scheduled, and the Service takes responsibility for firing another event when the timer expires. The expired event triggers the DeciderSchedule dance. There is a TimerCancel as well, so that the expired event never fires.
Timers themselves are used for time-gated events, such as waiting a set period of time before the next step, like giving an external cancelation window time to expire. If the external cancelation does come through, that can be either a signal or an actual workflow cancelation. For the signal, a TimerCancelation can ensure that the steps after the time-gate are not executed, and that the signal's set of operations runs instead.
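
The timer events can be modeled with a small in-memory class. This is a toy, not how any real service stores timers: `TimerBook` and its methods are invented names, time is a plain number passed in by the caller, and the assumption that a name becomes reusable once its timer has expired or been canceled is mine.

```python
from typing import Dict, List, Tuple


class TimerBook:
    """Toy model of named timers and the events they add to the history."""

    def __init__(self) -> None:
        self._active: Dict[str, float] = {}   # timer name -> time it fires
        self.history: List[Tuple[str, str]] = []

    def schedule(self, name: str, now: float, delay: float) -> bool:
        if name in self._active:
            # Timers are named; an existing name can't be scheduled again.
            self.history.append(("TimerScheduleFail", name))
            return False
        self._active[name] = now + delay
        self.history.append(("TimerScheduled", name))
        return True

    def cancel(self, name: str) -> bool:
        if name not in self._active:
            # Already expired (or never existed), so the cancel fails.
            self.history.append(("TimerCanceledFailed", name))
            return False
        del self._active[name]
        self.history.append(("TimerCanceled", name))
        return True

    def tick(self, now: float) -> List[str]:
        """Expire every timer that is due and return their names."""
        due = sorted(n for n, fire_at in self._active.items() if fire_at <= now)
        for name in due:
            del self._active[name]
            self.history.append(("TimerExpired", name))
        return due
```

Each TimerExpired returned by `tick` is where the service would go on to do the DeciderSchedule dance.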

Other things that don't fit

ActivityTask Lists

These are a way to filter the ActivityRunners. The best example is credit card processing Activities. Those activities are required (for PCI compliance) to be on separate servers, usually in their own secure racks. The ActivityRunner for the processing would have a list of "PCI", and when the workflow calls for something that needs PCI compliance, the decider can schedule an activity with a task list of "PCI"; that ensures that only the runners located on the PCI servers can pick it up.
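
As a sketch of the idea, task lists behave like separate queues keyed by name. `TaskListQueues` and the activity names here are hypothetical, and a real service would long-poll rather than return immediately.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Optional


class TaskListQueues:
    """One queue of scheduled activities per task list name."""

    def __init__(self) -> None:
        self._queues: DefaultDict[str, Deque[str]] = defaultdict(deque)

    def schedule(self, task_list: str, activity: str) -> None:
        self._queues[task_list].append(activity)

    def poll(self, task_list: str) -> Optional[str]:
        """Hand out the next activity for this task list only, or None."""
        queue = self._queues.get(task_list)
        return queue.popleft() if queue else None
```

A runner polling "default" never sees work scheduled on "PCI", which is the whole point.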

AWS-specific

Child-workflows

These are a little funny. In essence, these start a net-new workflow, another WorkflowStarted event and all; the parent only gets a DeciderSchedule when the child workflow has completed. The primary use-case for these is to abstract away some things that might be used elsewhere, such as Payment Processing. The parent workflow can still accept signals and timers.

Lambda

They are the same as ActivityTasks; there are a few bugs in the implementation with timeout times, since Lambda has its own timeout.

ContinuedAsNew

There is little use for this, because it removes history. I have not run into this in the wild. The only use I can think of would be if the workflow reached a steady state but is getting close to the history item limit (too many items in the history).