This is an interesting case I was working on during a side project a few years ago. The graphic shows (with changed time-stamps) the daily timing of a process flow that was seeding information for a variety of recipients.
What can you see?
- The task was automated and scheduled to run at 7:00 and, later on, at 6:38, which is visible in the floor of the time-series.
- The single data point where the task ran before 7:00 was a complete failure: the task hadn't run at all the prior day.
- The volatility of this process was a disaster.
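The "floor" reading above can be sketched in a few lines: the earliest observed start time per period approximates the configured schedule, and a shift in that floor reveals a re-scheduling. The timestamps below are made up for illustration, not the real series behind the graphic.

```python
from datetime import datetime

# Illustrative daily start timestamps of the job (invented data).
runs = [
    datetime(2020, 1, 1, 7, 4), datetime(2020, 1, 2, 7, 41),
    datetime(2020, 1, 3, 7, 2), datetime(2020, 1, 4, 8, 15),
    datetime(2020, 1, 5, 6, 39), datetime(2020, 1, 6, 6, 58),
    datetime(2020, 1, 7, 6, 38),
]

def schedule_floor(runs):
    """Earliest start time observed, i.e. the 'floor' of the
    time-series, which approximates the scheduled start."""
    return min(r.time() for r in runs)

# Split the series where the floor shifts to spot the re-scheduling.
early = [r for r in runs if r < datetime(2020, 1, 5)]
late = [r for r in runs if r >= datetime(2020, 1, 5)]
print(schedule_floor(early))  # floor near 07:00
print(schedule_floor(late))   # floor at 06:38
```

The spread of each day's start time above its floor is the volatility the post describes.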
What you can’t see, but what was the consequence of this task:
- The team involved in this process was badly trained and understaffed, and the resources provided for the system were insufficient (hence the volatility).
- The culture in the organization didn't allow information about failing schedules to reach the person accountable for changing the resources. This issue was twofold. First, direct communication of the issue from the bottom to the top didn't work because of the incentives within the chain of command. Second, the information couldn't percolate through the system of stakeholders either, because gatekeepers above the person in charge didn't allow the accountable person to voice any opinion or analysis on the matter. Top-management attention was too high, failure was not an option, and the political environment was too dangerous to address an obvious issue. Any stakeholder attempting to raise it was bullied down.
- The effects were disastrous: stakeholders could no longer synchronize their activities with one another, because the prerequisites of their tasks, which this flow provided, arrived out of sync.
But what changed the situation?
- The core step forward towards a solution was a clear-cut assessment of the costs and operational risk incurred by the problem. But the teams involved in this process, and their system, were too isolated to produce it.
- The IT department granted access to network drives that laid bare the filesystem of the system. All flows related to this one were analyzed using filesystem metadata on file creation and updates.
- Time-series analysis was used to root-cause the flows that led to the failure of the core system. The conclusion: 40% of failures were due to preceding processes managed by external service providers, and 60% were caused by the core unit not being able to handle issues and disasters, which occurred frequently, in an appropriate amount of time.
- The entire dataset was used to build a dynamic systems model around the flow, spanning the flows before and after the process in question. The point was to fully assess the impact of wait-times and to provide an accurate measure of when an actually very costly disaster would occur. The total cost impact of wait-times was never accepted as relevant, because there was no way to heal the incentive-design issues in the cost-distribution system: the contract design between stakeholders was too poor, and costs incurred could never be attributed to the cost center that caused them, which destroyed the relevance of this finding. What mattered more was a Monte Carlo simulation that indicated how high the probability of a complete disaster was within a time-frame short enough to hurt the careers and positions of the accountable persons.
- The next step was to devise a communication strategy that would inform even higher positions, who would then become accountable for the larger failure, in order to re-design the political dimension of such a failure.
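The filesystem-metadata step above can be sketched as a simple directory walk that collects per-file modification times and sorts them chronologically, reconstructing the order of a flow's steps. The mount point and record layout are hypothetical; note that on most Unix filesystems true creation time isn't exposed, so modification time is used here.

```python
from datetime import datetime, timezone
from pathlib import Path

def scan_flow_files(root):
    """Walk a directory (e.g. a network-drive mount, hypothetical here)
    and collect per-file update timestamps, the raw material for the
    flow analysis described in the post."""
    records = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            records.append({
                "file": str(path),
                # st_mtime = last modification; creation time is not
                # portably available, so mtime stands in for "update".
                "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
            })
    # Sort chronologically to reconstruct the order of the flow's steps.
    records.sort(key=lambda r: r["modified"])
    return records
```

Diffing consecutive `modified` timestamps of known flow artifacts then yields the wait-time series the model was built on.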
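The Monte Carlo step can be illustrated with a minimal sketch. The daily failure probability and the definition of a "complete disaster" (a run of consecutive failed days) are invented assumptions for illustration; the original model was richer than this.

```python
import random

def p_disaster(p_daily_fail, consecutive_needed, horizon_days,
               trials=100_000, seed=42):
    """Monte Carlo estimate of the chance that the flow fails on
    `consecutive_needed` consecutive days (the assumed definition
    of a 'complete disaster') within `horizon_days`.

    All parameters are illustrative assumptions, not the original model.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        streak = 0
        for _ in range(horizon_days):
            if rng.random() < p_daily_fail:
                streak += 1
                if streak >= consecutive_needed:
                    hits += 1
                    break
            else:
                streak = 0
    return hits / trials

# e.g. 10% daily failure rate, disaster = 3 failures in a row,
# over roughly one working year:
print(round(p_disaster(0.10, 3, 250), 3))
```

A number like this, framed as career risk within a concrete horizon, is what finally made the problem communicable upward.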
Ultimately, a tiny portion of additional resources was assigned to the team, which reduced the operational risk and led to higher synchronicity. Luckily, I was an external consultant; I would likely have been fired for causing such a stir.