Efficiency, Production and a pinch of Slack
“Good companies excel in creative use of slack. And bad ones only obsess about removing it.”
After enjoying the far-reaching insights in Brian Potter's excellent article "Construction, Efficiency, and Production Systems", I needed to know more. Mr Potter demonstrates via simulation that the performance of a system, particularly a linear process like production, is typically most threatened by variability. In other words, unpredictability at any step in the process can lead to wastage and delays in an otherwise very well-ordered system.
Mr Potter's article is motivated by the widespread, but poorly understood, criticism of the construction industry as one of the most persistently inefficient. But as he points out, the same Achilles' heel threatens "any system where a set of inputs is transformed, step by step, into a set of outputs". That means any lean-oriented methodology, including those adopted by the popular scrum and agile frameworks.
Mr Potter concludes:
The upshot is that it’s often possible to improve a system’s performance substantially simply by reducing variability. Even production systems with long process times and significant manual labor can be substantially improved if you can control and restructure them in a way that makes them more predictable.
But why is variability such a threat? Surely, as long as the average process time remains the same, variability only results in short-term perturbations? I needed to see this in action to understand the source of the inefficiency.
Time to fire up me some Jupyter. All the graphs and animations in this article have been generated from a Jupyter Notebook. Thanks to the beauty of the Internet, you can see for yourself, if you like.
Let's start by running the first production line model in the article: Brian's Perfect Pin Factory. It has four production steps, each taking one second.
To start with, let's run the factory model and plot the Out Tray for each Production Step in the Production Line. For the sake of the simulation, each In Tray starts empty, except that of the first Production Step, which is essentially unlimited.
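For the curious, here's a minimal sketch of the kind of discrete-time model the notebook runs. The names (`run_line`, `out_trays` and so on) are my own, not the notebook's:

```python
import numpy as np

N_STEPS = 4      # production steps in the line
STEP_TIME = 1.0  # seconds per job at every step
DT = 0.1         # simulation time resolution, in seconds

def run_line(duration, step_time_fn):
    """Advance the line in DT-sized ticks; return Out Tray counts over time."""
    out_trays = [0] * N_STEPS     # out_trays[-1] is the line's finished output
    remaining = [None] * N_STEPS  # time left on the job at each step (None = idle)
    history = []
    for _ in range(int(duration / DT)):
        for i in range(N_STEPS):
            # An idle step pulls a job from the previous Out Tray;
            # Step 1 (i == 0) has an essentially unlimited In Tray.
            if remaining[i] is None and (i == 0 or out_trays[i - 1] > 0):
                if i > 0:
                    out_trays[i - 1] -= 1
                remaining[i] = step_time_fn()
            if remaining[i] is not None:
                remaining[i] -= DT
                if remaining[i] <= 0:
                    out_trays[i] += 1
                    remaining[i] = None
        history.append(list(out_trays))
    return np.array(history)

# Brian's Perfect Pin Factory: every step takes exactly one second.
perfect = run_line(duration=20, step_time_fn=lambda: STEP_TIME)
```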
No surprises there - each step in the line takes a second to get "primed", and from that point there is a constant output at the end of the line, at a rate of one completed job per second.
Now what happens when we add some randomness to the step time? The article adds 0.5 seconds of normally distributed noise, so let's do the same. Note that, statistically, this means we sometimes end up with a negative step time! A distribution that can't go negative - an exponential, say, as in a Poisson process - might be more applicable, but I'm going to stick with the specification in the original article. To make it mathematically valid, I simply cap the minimum step time at the time resolution of the simulation.
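In the sketch from earlier, that rule might look like this (treating the 0.5 seconds as the standard deviation of the noise - an assumption on my part):

```python
RNG = np.random.default_rng()

def noisy_step_time():
    """One second plus normally distributed noise (0.5 s standard deviation),
    clipped at the simulation's time resolution so it can never go negative."""
    return max(RNG.normal(STEP_TIME, 0.5), DT)

noisy = run_line(duration=20, step_time_fn=noisy_step_time)
```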
Now see how the line performs. Again, we are visualising the Out Tray for each Production Step in the Production Line.
And here we discover our first big surprise: there's nothing particularly surprising here! Sure, progress is jumpy rather than predictable, as you would expect, but the output still proceeds at roughly one completed job per second.
To understand where the inefficiencies the original article discusses arise, we need to dig a little deeper. Let's start by simultaneously plotting the work in process (WIP) and cycle time (CT) as defined in the article, alongside the Out Tray counts.
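Measuring these takes a little more bookkeeping than counting Out Trays: each job has to carry its line-entry time so we can clock it out at the end. A plausible reconstruction (not the notebook's actual code) might look like this:

```python
from collections import deque

def run_line_with_metrics(duration, step_time_fn, dt=DT):
    """Like run_line, but each job carries its entry time through the line,
    so we can read off WIP (jobs inside the line) and per-job cycle time."""
    trays = [deque() for _ in range(N_STEPS)]  # trays[i] feeds step i; trays[0] unused
    remaining = [None] * N_STEPS  # time left on the job at each step
    entry = [None] * N_STEPS      # line-entry time of the job at each step
    wip_hist, cycle_times = [], []
    t = 0.0
    while t < duration:
        for i in range(N_STEPS):
            if remaining[i] is None:
                if i == 0:  # unlimited In Tray at Step 1
                    entry[i], remaining[i] = t, step_time_fn()
                elif trays[i]:
                    entry[i], remaining[i] = trays[i].popleft(), step_time_fn()
            if remaining[i] is not None:
                remaining[i] -= dt
                if remaining[i] <= 0:
                    if i == N_STEPS - 1:
                        cycle_times.append(t + dt - entry[i])  # job leaves the line
                    else:
                        trays[i + 1].append(entry[i])          # pass timestamp along
                    remaining[i] = None
        wip_hist.append(sum(len(q) for q in trays) +
                        sum(r is not None for r in remaining))
        t += dt
    return wip_hist, cycle_times
```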
Hmm, that's strange - WIP and CT both start at expected values of about 3 jobs-in-progress and about 5 seconds of cycle time. But after 20 seconds of production, both metrics seem to start growing. Where are they going?
To understand their trajectory, we need to simulate for far longer. Real-time animations start to lose their appeal at this point, so let's turn to pre-calculated results.
Here's a plot of WIP and CT versus seconds of production time. Since our average step time remains 1 second per completed job, jobs and seconds sit on approximately the same scale, and we can keep the graph simple by using a single y-axis for both metrics. This time, however, we extend production time on the x-axis - all the way out to 1000 seconds.
Ah ha! Our WIP and CT values soon depart from their predictable values (that is, roughly 3 jobs and 5 seconds respectively) and head a long way north. Where do they end up? Let's check with a few more simulations - the same thing, four more times.
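Something like this, perhaps, reusing the instrumented sketch from above:

```python
# Five independent 1000-second runs of the noisy line.
for trial in range(5):
    wip_hist, cycle_times = run_line_with_metrics(1000, noisy_step_time)
    print(f"run {trial + 1}: final WIP = {wip_hist[-1]} jobs, "
          f"final CT = {cycle_times[-1]:.1f} s")
```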
Beyond 30 or 40 it's practically a random walk!
And herein lies the core of the problem. As the original article puts it:
The culprit is the variation.
As soon as the step time has some randomness associated with it, WIP and CT grow unpredictably. Instead of the predictable ideal - each of the first three steps holding one work in progress, and any particular job taking 4 seconds to travel from Step 1 to Step 4 - we end up with WIP and CT values of at least 20, and potentially above 120. If production runs longer than 1000 seconds, even higher WIP and CT values are possible.
The net result is that if you have a process with 4 steps, each taking 1 second, you can expect an input to reach the output of the process in 4 seconds. If you supply inputs as often as needed, you can expect each step to bank up at most one input before it completes the last. If, however, real-world effects result in each step taking more-or-less 1 second, with some statistical variation above and below, the process will gradually become less efficient. After running for some time, you should expect the time taken for an input to reach the output to blow out from 4 seconds to at least 20, and possibly much, much more. Similarly, at any point in time there are likely to be many more than the ideal 3 jobs in progress.
Given that no real-world process ever completes in a precisely predictable time, what are we to do?
As the simulations demonstrate, the root cause of the inefficiency is the gradual accumulation of backlog. With arrivals and completions balanced on average, nothing pulls a step's queue back down once a run of bad luck has built it up.
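A quick aside (my own illustration, not from Mr Potter's article) makes the mechanism plain: a queue that's filled and drained at the same average rate is a random walk reflected at zero, and a reflected random walk wanders arbitrarily far upward.

```python
rng = np.random.default_rng(0)
backlog, peak = 0, 0
for _ in range(100_000):
    # Equal odds of adding or removing a job; the backlog can't go below zero.
    backlog = max(backlog + rng.choice((-1, 1)), 0)
    peak = max(peak, backlog)
print(peak)  # typically in the hundreds: the backlog drifts ever higher
```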
There are various ways to tackle backlog, depending on the constraints surrounding the process. One effective way to ensure backlog never accumulates unbounded is to introduce process slack. Slack, as Tom DeMarco defines it in "Slack: Getting Past Burnout, Busywork, and the Myth of Total Efficiency", is the degree of freedom required to effect change. DeMarco extols its surprising benefits for ensuring productive work results in beneficial outcomes, describing it as an opportunity to grow and to improve effectiveness, not just efficiency.
For a primer on DeMarco's exploration of slack, see https://fs.blog/2021/05/slack/
But, as it turns out, slack also has a direct impact on efficiency, at least as modelled in these simulations. Let's introduce a little slack at each step in the production line, gifting individual steps the ability to dwell for a moment if they find themselves ahead of subsequent steps. More explicitly, we will model slack as time spent doing nothing whenever the Out Tray is well stocked.
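Continuing the earlier sketch, the slack rule amounts to one extra check before a step pulls its next job. Again, this is a plausible reconstruction; `SLACK_LIMIT` is my name for the threshold:

```python
SLACK_LIMIT = 2  # jobs allowed in a step's Out Tray before the step dwells

def run_line_with_slack(duration, step_time_fn, dt=DT):
    """As run_line_with_metrics, but a step dwells for a tick (takes slack)
    whenever its Out Tray already holds SLACK_LIMIT jobs. Also returns the
    accumulated slack across the line at every tick."""
    trays = [deque() for _ in range(N_STEPS)]
    remaining = [None] * N_STEPS
    entry = [None] * N_STEPS
    wip_hist, cycle_times, slack_hist = [], [], []
    slack, t = 0.0, 0.0
    while t < duration:
        for i in range(N_STEPS):
            if remaining[i] is None:
                # The slack rule: a well-stocked Out Tray means do nothing.
                if i < N_STEPS - 1 and len(trays[i + 1]) >= SLACK_LIMIT:
                    slack += dt
                elif i == 0:
                    entry[i], remaining[i] = t, step_time_fn()
                elif trays[i]:
                    entry[i], remaining[i] = trays[i].popleft(), step_time_fn()
            if remaining[i] is not None:
                remaining[i] -= dt
                if remaining[i] <= 0:
                    if i == N_STEPS - 1:
                        cycle_times.append(t + dt - entry[i])
                    else:
                        trays[i + 1].append(entry[i])
                    remaining[i] = None
        wip_hist.append(sum(len(q) for q in trays) +
                        sum(r is not None for r in remaining))
        slack_hist.append(slack)
        t += dt
    return wip_hist, cycle_times, slack_hist
```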
And then we roll the dice again, just as before, only this time the steps have the slack feature.
Once again, no big surprises. Progress ticks along much as before, with the output still proceeding at roughly one completed job per second after an initial priming period of about 4 seconds.
So as before, let's simultaneously plot the work in process and cycle time as defined in the article, alongside the Out Tray counts.
Ah ha! Now for the first time we can see the benefit of slack. Neither WIP nor CT grow much beyond their expected values. They're not always as low as the ideal world of zero-variability (3 jobs-in-progress and 5 seconds of cycle time) but now at least they seem bounded.
To be really sure, let's do just as before and simulate for much longer.
Boom! Clear as day. By introducing slack, our production line's predictable behaviour is restored. Both WIP and CT stay well below 20, hovering at about 10 or below. This is entirely predictable of course - now that each step in the line goes slack if it has at least 2 jobs already queued up for the next step in the line, there can't be any more than 9 jobs in progress. Consequently, any job entering the line will never have more than 9 other jobs ahead of it in the queue to get processed, so the cycle time is also well contained.
So there you have it - if your workflow has enough complexity to exhibit variability in processing time at each step, you can keep your backlog low simply by introducing a little slack into the system.
But that's not the whole story, is it? Sure, our queues are shorter, but let's be honest - we didn't actually improve the output rate. That's stubbornly stuck at one job per average step time: introducing slack isn't going to get more than an average of one job out of the line per second. So is there much point? Well of course, as DeMarco explains, the point is to:
reintroduce enough slack to allow the organization to breathe, reinvent itself, and make necessary change.
Which raises the question: how much slack have we introduced? Let's check!
First, we add a variable to monitor slack. Then we run the whole production line again, this time plotting total accumulated slack (in green) alongside WIP and CT.
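The slack-aware sketch above already tracks this, so plotting it might look something like the following (the matplotlib specifics are mine; per-job cycle times are omitted from this sketch for brevity, since each job's CT lands at its own completion time):

```python
import matplotlib.pyplot as plt

wip_hist, cycle_times, slack_hist = run_line_with_slack(1000, noisy_step_time)
t = np.arange(len(wip_hist)) * DT

plt.plot(t, wip_hist, label="WIP (jobs)")
plt.plot(t, np.array(slack_hist) / 60, color="green",
         label="accumulated slack (minutes)")
plt.xlabel("production time (s)")
plt.ylim(0, 20)  # zoomed in for clarity, as noted below
plt.legend()
plt.show()
```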
Note the y-axis has been zoomed in from a maximum of 120 to a maximum of 20 for clarity. Also note that accumulated slack has been divided by 60, so it's in units of minutes, while cycle time (in orange) and the x-axis (run time) are in seconds.
And there you have it. The sum of the time spent slacking by the first three steps in the production line rises fairly linearly to 15-18 minutes over the 1000-second (16.7-minute) run. That's an average of 5 to 6 minutes per step. In other words, each step can spend roughly one-third of its time slacking without affecting the production rate! Indeed, the time spent slacking actually improves the cycle time and work-in-progress metrics.
But it gets better - each step has one-third of its time "to breathe, reinvent itself, and make necessary change". That time might be spent sharpening tools, improving skills, recuperating or even thinking about how the bigger picture could be reframed.