So, you built yourself the world’s most perfect, IT-based file workflow and you have a state-of-the-art media facility? Congratulations, have a beer, but… at some point it’s going to go wrong! As far as I can tell, the 100% trouble-free media production and distribution facility has not yet been built, and I don’t think it will ever exist. Dinner is on me if you prove me wrong!
At this point, Bruce’s Shorts subscribers are already wondering if I’m starting to ramble, as I discussed this topic in Bruce’s Shorts Season 2, Episode 28. I come back to the topic because I am often asked why software systems that worked last year suddenly stop working, for no apparent reason, a year or so later. This often comes down to an understanding of the word “working.”
Here we are: your beautiful IT-based file workflow has suddenly gone wrong (I told you it would, didn’t I?!) and you just can’t find a proper workaround. While I won’t insult your intelligence by assuming that you forgot the KISS principle, I would suggest taking a quick look at this list.
Given that fewer than 50% of support cases in any company that I have worked at result in changes to software, it helps to have a good analytical approach to finding the causes of problems and communicating them effectively to your suppliers.
What do you do? How do you find those problems?
Obviously, the IT world is more flexible and has many more variables than the tape-based, SDI world. The first approach is to separate symptoms from causes. “It’s doing something weird all by itself” has rarely proven an effective opening line to a support department. “Our MAM system triggers an API refresh cycle at 2 minutes past every hour, which seems to happen 5 minutes before the storage anomaly” would be a much better starting point. This phrase informs support that you are seeing something happen regularly and that there is some correlation between the symptom and another event in the system / workflow.
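If you want to back up a statement like that with numbers, here is a minimal sketch of one way to do it. It assumes you have already pulled two lists of timestamps out of your logs; the function name, the 30-minute window and all of the timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def lags_to_previous(triggers, symptoms, window=timedelta(minutes=30)):
    """For each symptom, return the delay since the most recent trigger.

    Symptoms with no trigger inside `window` are skipped.
    """
    triggers = sorted(triggers)
    lags = []
    for s in sorted(symptoms):
        earlier = [t for t in triggers if t <= s and s - t <= window]
        if earlier:
            lags.append(s - earlier[-1])
    return lags


# Hypothetical timestamps, as if extracted from the MAM and storage logs.
refreshes = [datetime(2024, 1, 8, h, 2) for h in range(9, 13)]
anomalies = [
    datetime(2024, 1, 8, 9, 7),
    datetime(2024, 1, 8, 10, 7),
    datetime(2024, 1, 8, 12, 8),
]

lags = lags_to_previous(refreshes, anomalies)
lag_minutes = Counter(int(lag.total_seconds() // 60) for lag in lags)
print(lag_minutes.most_common())
```

A tight cluster of delays is something concrete to put in a support ticket, far more useful than "it seems to happen a bit after the refresh."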
As we all know, correlation does not imply causation. For example, in the UK there is a strong correlation between seeing lightning and rain. This does not mean that rain causes lightning, nor does it mean that lightning causes rain. All it means is that when you see lightning it is likely to be raining. So how do you find causes and not symptoms?
Many troubleshooting techniques have been developed, but some are more adapted to IT and the broadcast & media industry than others. One of my favourites is “The 5 Whys”! Wikipedia has the following definition: “The 5 Whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question ‘Why?’”
Here is an example of chaining the question “Why?”:
The media file can’t be transcoded (the problem)
Why #1? – The source file can’t be opened.
Why #2? – Permission failure. Checking the logs reveals that the file cannot be opened from the transcode machine with the account associated with the transcode service.
Why #3? – Group settings. The file can be opened from the storage host, but only with users in a particular ACL group. We check the account on the transcode machine and discover it is in the wrong group so we change it. It still won’t open.
Why #4? – Drive mount settings. Digging a bit further reveals that the way in which the drive was mounted in the operating system had an override for the group control. Updating the mount instruction remaps the group to the correct value and the system works. But why did the system go from “working” to “not working”?
Why #5? – System maintenance. Updating the security of the infrastructure is always risky. You can test 99% of the use cases, but unless you know to look for a single machine in a farm that was hand-installed in a rush and has the wrong mount overrides, you’ll never find the cause of the symptoms until something else changes.
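One way to go hunting for that hand-installed machine is to compare mount options across the farm and flag whoever disagrees with the majority. The sketch below assumes you have collected the mount table from each host as text in the /proc/mounts layout (device, mount point, file system type, options, dump, pass). The host names, share path and option values are invented for illustration.

```python
from collections import Counter


def mount_options(mounts_text, mount_point):
    """Return the option set for `mount_point` from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return frozenset(fields[3].split(","))
    return frozenset()  # not mounted on this host


def odd_ones_out(per_host_mounts, mount_point):
    """Map each host whose options differ from the majority to the difference."""
    options = {
        host: mount_options(text, mount_point)
        for host, text in per_host_mounts.items()
    }
    majority, _ = Counter(options.values()).most_common(1)[0]
    return {
        host: {
            "extra": sorted(opts - majority),
            "missing": sorted(majority - opts),
        }
        for host, opts in options.items()
        if opts != majority
    }


# Hypothetical farm: three transcode nodes, one with a different gid override.
farm = {
    "transcode-01": "//nas/media /mnt/media cifs rw,uid=1001,gid=2001 0 0",
    "transcode-02": "//nas/media /mnt/media cifs rw,uid=1001,gid=2001 0 0",
    "transcode-03": "//nas/media /mnt/media cifs rw,uid=1001,gid=2999 0 0",
}

outliers = odd_ones_out(farm, "/mnt/media")
print(outliers)
```

This only tells you which machine is different, not which one is right. That part still needs a human asking “Why?”.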
Okay, that’s a first example. Let’s look now at some other things that you can do when you’re troubleshooting your file-based workflow!
When it goes wrong, the first thing to do is to find out what makes the issue repeatable. Being able to reproduce the symptoms greatly helps to isolate, identify and resolve an issue. When an issue is not repeatable… Well, you’re in trouble, because those are really tricky to find! But if you find the stimuli that make it repeatable, you can then start to vary those stimuli to differentiate between symptoms and causes.
Real-life example: transcoding from XDCAM HD to MP4 changes channel mapping. Try several sources of XDCAM HD. We have had cases where a specific version of editing software was producing bad XDCAM HD files. The problem was not in the transcoder but in the bad source files being presented to it.
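Varying the stimuli gets easier if you write down each test run and let a few lines of code look for the common factor. Here is a minimal sketch, assuming each run is recorded as a set of parameters plus a pass/fail flag. The parameter names and values are hypothetical.

```python
def suspects(runs):
    """Return (parameter, value) pairs present in every failure and no success."""
    failures = [r["params"] for r in runs if not r["passed"]]
    successes = [r["params"] for r in runs if r["passed"]]
    if not failures:
        return []
    # Start with everything the first failure had, then keep only what
    # every other failure shares.
    candidates = set(failures[0].items())
    for params in failures[1:]:
        candidates &= set(params.items())
    # Anything that also appears in a passing run cannot be the trigger alone.
    for params in successes:
        candidates -= set(params.items())
    return sorted(candidates)


# Hypothetical test log for an XDCAM HD to MP4 transcode.
runs = [
    {"params": {"source": "editor v1", "channels": 8}, "passed": True},
    {"params": {"source": "editor v2", "channels": 8}, "passed": False},
    {"params": {"source": "editor v2", "channels": 2}, "passed": False},
    {"params": {"source": "camera", "channels": 8}, "passed": True},
    {"params": {"source": "camera", "channels": 2}, "passed": True},
]

result = suspects(runs)
print(result)
```

With this made-up log, the only value common to every failure and absent from every success is the second editor version, which points at the source files rather than the transcoder. A real investigation needs enough runs to make that conclusion trustworthy.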
Let’s try to switch between SD and HD – do the symptoms still show? Try to look at the dependency of each parameter of the input.
Real-life example: many of our aspect ratio issues seen by customers are dependent upon the signalling in the input file, as well as being able to express what the output should actually be. For example, when creating an HD output, the pixels are square. An SD input may have several pixel shapes depending on the underlying signal standard and display aspect ratio. The term “anamorphic” is usually not specific enough to cover all the cases. Try to be precise with phrases like “720 pixel by 576 line input with a 4:3 display aspect ratio” at the input and “1920 pixel by 1080 line output with a 16:9 display aspect ratio with narrow black bars at the sides and cropping top and bottom giving an active picture with a 14:9 aspect ratio.”
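To show why the precise wording matters, here is the arithmetic behind that example as a short sketch. It makes two simplifying assumptions: the display aspect ratio applies to the full stored raster, and the full width of the source is kept when cropping. Some conventions measure the display aspect ratio against a narrower active width than the full 720 pixels, which gives a different pixel shape, so treat these numbers as an illustration rather than a reference.

```python
from fractions import Fraction


def pixel_aspect_ratio(width, height, display_aspect):
    """Pixel shape implied by stretching the full raster to the display aspect."""
    return display_aspect / Fraction(width, height)


sd_par = pixel_aspect_ratio(720, 576, Fraction(4, 3))
hd_par = pixel_aspect_ratio(1920, 1080, Fraction(16, 9))

# A 14:9 active picture, full height, inside the 1920x1080 output.
active_width = int(1080 * Fraction(14, 9))
bar_width = (1920 - active_width) // 2

# Fraction of the 4:3 source height that survives the crop to 14:9
# (assuming the full source width is kept).
kept_height = Fraction(4, 3) / Fraction(14, 9)

print(sd_par, hd_par, active_width, bar_width, kept_height)
```

Under these assumptions the SD pixels are 16/15 as wide as they are tall, the HD pixels are square, the black bars are 120 pixels wide on each side, and six sevenths of the source height remains after cropping.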
Does the failure occur when there are more people on the system? At a particular time of day? On a specific day of the week? When the playout servers are being updated?
Real-life example: one of the hardest ones that we have found in real life involved finding that files on a particular NAS were being read back as corrupt. To cut a long story short, the cause was simultaneously writing high bitrate and low bitrate data from more than one machine onto that NAS. The peculiar mix of data rates and write pattern was causing a pathological failure in the cache of the NAS. This resulted in data being corrupted between the application writing it onto the Ethernet wire and it hitting the disk. The only way we were able to find it was to be very precise in all of our tests and to continually eliminate symptoms until we were left with the root cause of the problem.
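A basic building block for that kind of test is a write followed by a read-back with a checksum comparison. The sketch below is not the procedure we used, just a minimal illustration. It uses a temporary local directory as a stand-in for the NAS path, and the file name and payload size are arbitrary.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(path, payload):
    """Write payload, ask the OS to flush it, read it back and compare hashes."""
    expected = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return sha256_of(path) == expected


# Stand-in for a NAS path: a temporary directory on local disk.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "test_pattern.bin")
    ok = write_and_verify(target, os.urandom(4 * 1024 * 1024))
    print("read-back matches:", ok)
```

Two caveats. A read-back on the same machine may be served from that machine's own cache, so reading the file from a different host is usually more telling. And a failure like the one above only showed up under a particular mix of concurrent writers, so a single-writer test passing proves very little on its own.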
Does it only fail at a particular time of the day?
Real-life example: it may sound obvious, but when a customer called us and said “Your software is failing at 1pm every day – what are you going to do about it?”, I replied that I didn’t remember instructing our engineering team to write some code that said, in effect, “if the time is 1pm, then fail.”
This reset the anger and frustration of the situation and we got to work, asking “Why?” a lot, tracking down and eliminating symptoms until we found a cause. It took a few days of investigation because we could only see the effects at 1pm every day (obviously). Eventually a network traffic analysis showed that the editing staff went to lunch at 1pm and everyone backed up their projects before doing so. Their real-time requirements from the storage had a higher priority than our software did. Unfortunately, our real-time ingest can only buffer so much until there is no more memory or local resource available. At that time the errors occurred.
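Before you phone anyone to say "it fails at 1pm every day", it is worth checking that the logs agree. Here is a minimal sketch that buckets error timestamps by hour and by weekday. The timestamps are invented, and in practice you would parse them out of your own error log.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def failures_by_hour(timestamps):
    """Count errors per hour of the day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def failures_by_weekday(timestamps):
    """Count errors per day of the week."""
    return Counter(DAY_NAMES[ts.weekday()] for ts in timestamps)


# Hypothetical error timestamps from one working week.
errors = [
    datetime(2024, 1, 8, 13, 2),
    datetime(2024, 1, 9, 13, 5),
    datetime(2024, 1, 10, 9, 40),
    datetime(2024, 1, 10, 13, 1),
    datetime(2024, 1, 11, 13, 4),
]

by_hour = failures_by_hour(errors)
by_day = failures_by_weekday(errors)
print(by_hour.most_common(3))
print(by_day.most_common(3))
```

A spike in one hour across several days tells you when to point the network analyser at the system. It does not tell you why, which is where the lunch menu comes in.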
I won’t pretend that finding issues in big software systems is easy. It requires dedication and a methodical nature as well as extreme calm when all around you are panicking. It is also crucial to differentiate between symptoms and causes. In the final example above, there was a strong correlation between the quality of the food and the failure of the software. One way to fix the problem would be to hire a terrible chef. No-one would leave their desks for lunch, there would be no backups and our software would have been just fine. Thank goodness there’s more to life than software. I’m off to lunch, but before I do, I think that I’ll have a quick review of my recent support tickets and see if I follow my own rules. Why don’t you do the same?
See y’all soon.