The Flash Crash, Systems Failures, and Mission-Critical Engineering

The Flash Crash was caused by a complex interaction between different IT systems. Some of those systems failed ungracefully, and others could not cope with the failures. I wish that, instead of wringing our hands about how computers now manage our trading infrastructure, we would take the time to understand that computers manage systems far more mission-critical than our financial markets.

I have worked extensively on trading systems and appreciate the importance of stable markets. I think that any software connecting to markets should have multiple layers of safety checks that are as independent as possible; in fact, I would argue for much stronger risk checks than we have now. Algorithmic trading companies understand well what can happen when their safety measures fail. But the worst thing that can happen to HFTs and exchanges is generally bankruptcy. [1]
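As a concrete illustration of layered, independent checks, here is a minimal sketch. Every function name and threshold below is hypothetical, not any firm's actual controls; the point is that each check knows nothing about the others, so a bug in one is less likely to disable them all.

```python
# Minimal sketch of independent pre-trade risk checks (hypothetical
# names and thresholds). Each layer is a separate, simple predicate.

def price_collar_ok(order_price, reference_price, max_deviation=0.05):
    """Reject orders priced too far from the last reference price."""
    return abs(order_price - reference_price) <= max_deviation * reference_price

def size_ok(order_qty, max_qty=10_000):
    """Reject 'fat finger' orders above a hard quantity limit."""
    return 0 < order_qty <= max_qty

def exposure_ok(current_position, order_qty, max_position=50_000):
    """Reject orders that would breach the position limit."""
    return abs(current_position + order_qty) <= max_position

def accept_order(order_price, order_qty, reference_price, current_position):
    # Every check must pass independently; any single failure blocks the order.
    checks = [
        price_collar_ok(order_price, reference_price),
        size_ok(order_qty),
        exposure_ok(current_position, order_qty),
    ]
    return all(checks)
```

The deliberate redundancy is the design choice: a failure that slips past one layer should still trip another.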

It’s true that ordinary investors can get caught in IT failures and lose money, though hopefully erroneous trades can be unwound in those instances. It’s also true that traders illegally gaming each other and manipulating markets can defraud innocent investors. Given the spectacular systems failure of the Flash Crash, it’s worth reflecting on the risks associated with shoddy automation in other areas of life. I can’t even begin to detail all of the serious IT failures we’ve had in the last 50 years, but here are a few (in no particular order) that make the Flash Crash look like a triviality:

  1.  The Northeast Blackout of 2003. Complex interactions between multiple events, including a software failure, interrupted electricity delivery for over 50 million people, contributing to many deaths.
  2.  The Therac-25 radiotherapy machine. Buggy software with inadequate safeguards caused at least 6 patients to receive massive, in some cases fatal, overdoses of radiation.
  3.  Problems with software that controls airbag deployment in cars, including certain Cadillacs.
  4.  Problems with electronic voting in the 2014 Belgian elections. Only about 2000 voters seem to have been affected, but I'd guess that fewer traders than that were seriously affected by the Flash Crash.
  5.  The infamous Toyota “unintended acceleration” issue, which some experts attribute to faulty software. The issue has allegedly caused dozens of deaths.
  6.  A flaw in a Soviet early-warning satellite system reportedly triggered alarms that the US had launched 5 ICBMs. Human operators, suspecting a false alarm, fortunately waited for ground-radar confirmation of the launches before reacting.


This is far from a complete list, and often we don’t even know if faulty IT contributed to a fatal accident. I think that many financial professionals suffer from déformation professionnelle. [2] The reality is that, despite the hullabaloo over the Flash Crash, it had few serious consequences in the grand scheme of things.

The Flash Crash very temporarily erased about $1T of the market value of securities. It also triggered a media firestorm that may have convinced some retail investors to stay in cash and miss the post-crisis stock market recovery. For those active traders who lost money that day, there’s no doubt that the Flash Crash was a big deal. But the reality is that the market recovered within minutes, and many of the accidental transactions at absurd prices were cancelled. A Flash Crash is also much less likely today, at least in American equity markets, where we now have circuit breakers that halt trading when prices move too far, too fast.
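The core of such a circuit breaker is simple to state. Here is a toy sketch; the 10% band below is illustrative only, since real limit-up/limit-down bands vary by security and time of day:

```python
# Toy volatility-halt check (illustrative band, not the actual
# limit-up/limit-down rules): pause trading when the move from a
# rolling reference price exceeds some percentage band.

def should_halt(last_price, reference_price, band=0.10):
    """Return True when the price move from the reference exceeds the band."""
    move = abs(last_price - reference_price) / reference_price
    return move > band
```

The real mechanism is far more involved (reference-price windows, reopening auctions), but the principle is the same: stop matching orders and let humans catch up.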

As computerized systems rightfully take on more responsibility, I hope we can learn some lessons from the Flash Crash. Unlike exchanges, designers of life-critical systems don’t have the luxury of shutting down for a few minutes when problems are detected. It’s not great that financial markets’ electronic infrastructure couldn’t handle a little stress, but it’s lucky that a failure like this attracted media attention without resulting in any loss of life. The anniversary of the Flash Crash is a reminder that all critical systems require intelligent regulation and, most importantly, an opportunity to thank the engineers who keep us safe.


[1] It might be worse for an HFT to have some of their traders violate compliance rules and commit crimes. But I wouldn’t really call that an IT glitch, though with compliance being increasingly automated, maybe one day that’ll change.

[2] I have looked for an English equivalent to this term, and the closest I’ve found is “occupational psychosis,” which sounds a bit more extreme than I’d like.

4 thoughts on “The Flash Crash, Systems Failures, and Mission-Critical Engineering”

  1. jh

    There has been a quiet revolution (for those not in the field) over the last 30 years in program verification. Many of the accidents you mentioned can be, and now are being, avoided with techniques like model checking and abstract interpretation. Any part of the software engineering process (specifications, protocols, processes, actual code/implementations) can be checked for bugs or, more powerfully, proved free (!) of certain classes of bugs, or even proved to terminate (which is impossible in the general case but very possible in most real-world cases).
    Many mission-critical systems in aviation, military, healthcare, and software/hardware design are using these tools. Here are some links for the interested:
http://www.di.ens.fr/~cousot/AI/IntroAbsInt.html
https://courses.cs.washington.edu/courses/csep573/11wi/lectures/ashish-satsolvers.pdf
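To make the model-checking idea concrete, here is a toy explicit-state checker (nothing like an industrial tool such as SPIN or TLA+, but the same principle). It exhaustively explores every reachable state of a small, hypothetical system — two processes sharing a non-atomic test-then-set “lock” — and searches for a state that violates mutual exclusion:

```python
# Toy explicit-state model checker: breadth-first search over all
# reachable states of a small system, checking a safety invariant.
# The system is a deliberately buggy lock where "test" and "set"
# are not atomic, so both processes can slip into the critical section.
from collections import deque

# State: (locked, status0, status1), statuses in {'idle', 'checked', 'critical'}.
def successors(state):
    locked, statuses = state[0], [state[1], state[2]]
    nexts = []
    for p in (0, 1):
        s = statuses[p]
        if s == 'idle' and not locked:
            t = statuses[:]; t[p] = 'checked'     # observed the lock free
            nexts.append((locked, t[0], t[1]))
        elif s == 'checked':
            t = statuses[:]; t[p] = 'critical'    # set the lock and enter
            nexts.append((True, t[0], t[1]))
        elif s == 'critical':
            t = statuses[:]; t[p] = 'idle'        # leave and release
            nexts.append((False, t[0], t[1]))
    return nexts

def find_violation(initial, invariant):
    """Explore every reachable state; return one violating the invariant, or None."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

mutual_exclusion = lambda s: not (s[1] == 'critical' and s[2] == 'critical')
bad = find_violation((False, 'idle', 'idle'), mutual_exclusion)
# The search finds a state where both processes are in the critical
# section, exposing the race in this non-atomic test-then-set "lock".
```

Real checkers add symbolic states, abstraction, and counterexample traces, but the essence is the same exhaustive search.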

    Eventually these techniques will be fully automatic (some already are), built into our compilers, and taken for granted just like, e.g., type inference is nowadays (once a research topic itself).

    Btw, those exact tools could easily be applied to trading systems to prove the absence of certain errors in design, specification, or code. The difficulty, obviously, is modelling the interaction of different algos, but that too can be modelled.

    1. Salvatore Sferrazza

      Imandra by Aesthetic Integration seems to be at the forefront of engineering proofs for trading venues: http://aestheticintegration.com/imandra/

      Their whitepaper detailing how Imandra might have forestalled the compliance violations of UBS’s ATS was a very good read. But without guidance and rulings (or a sufficient deterrent via enforcement) from the SEC/CFTC, it’s unclear whether participants would adopt this practice organically.

      I have not yet delved into the CFTC’s proposed Regulation AT (which weighs in at just over 500 pages, more than Regulation SCI’s proposal but less than SCI’s final ruling). It could be an opportunity to introduce a proofs regime into compliance, but it would require the leadership of a regulator (or at least a VERY prominent participant) to step in with some good ideas and then galvanize an entire industry around them.

      I fear that, absent the severe motivation of a massive crisis, a sea change of this sort is unlikely.

  2. Ann Onymous

    From my experience, most HFT shops have risk controls that are designed to be robust (simplistic, stable, testable, often rules-driven with binary trade/no-trade outputs), based on the various ways the firm has gotten screwed over badly in the past (loss aversion bias?). This also makes them stupid. Crossed markets, pull quotes. Spread too wide, pull quotes. Slow confirms and fills, pull quotes. Odd fill prices, pull quotes. Too much risk, pull quotes. Feed gaps, pull quotes and wait for retransmit. Feed slow, pull quotes. Positions don’t reconcile, pull quotes. Enormous price changes, pull quotes. An arb lines up that looks “too good to be true”, don’t you dare try it. Excessive message rate, pause orders. Lose too much money too quickly, pull quotes and liquidate. Etc. Etc…
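    Sketched in code, such binary rules might look like the following (every field name and threshold here is invented for illustration, not any shop's actual control):

```python
# Deliberately simplistic sketch of binary trade/no-trade risk rules
# (all fields and thresholds are illustrative). Any tripped rule pulls quotes.

def should_pull_quotes(market):
    """Return True if any safety rule trips; simple, stable, testable."""
    rules = [
        market['best_bid'] >= market['best_ask'],              # crossed market
        market['spread_ticks'] > market['max_spread'],         # spread too wide
        market['fill_latency_ms'] > market['max_latency'],     # slow confirms/fills
        market['feed_gapped'],                                 # feed gaps
        market['position_break'],                              # positions don't reconcile
        abs(market['price_change_pct']) > market['max_move'],  # enormous price change
        market['pnl'] < -market['max_loss'],                   # losing too much too fast
    ]
    return any(rules)
```

    The "stupidity" is the point: each rule is auditable in isolation and there is no state in which a clever interaction keeps quoting through a failure.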

    From the individual operator’s point of view this is a good but not great approach. Better to give up opportunity in a wild market situation you’ve never seen in backtesting than blow up the firm due to technology errors. There are more elegant ways to handle these situations but most introduce complexity for little visible benefit. Making fancy risk controls might get you a bit of extra PnL but could also get you fired if something breaks. If you’re trying to make a fraction of a tick and your data is lagged by several ticks, you can’t really do it. When a majority of liquidity is provided by agents using similar rules and little tolerance for loss, you can end up with catastrophic interactions as they all pull liquidity and run for the exits simultaneously.

    Coordinated circuit breakers/pauses are the best solution since they allow human judgment to take place and encourage liquidity replenishment by slower traders with a longer time horizon.

