Who’s Afraid Of The Big Bad Algorithm?
Editor’s Note: Like any good geek, I went to see Captain America: The Winter Soldier this weekend. Suffice it to say I LOVED IT, but for reasons I didn’t expect going in. It was a very smart movie in all ways, but perhaps smartest was the integration of smart tech, Big Data, and predictive analytics into the major plot of the story. It was actually called “Project Insight” in the movie! After the mandatory discussion about the flick with my family who went with me, I broadened the discussion on social media. Immediately Eric Swayne started tweeting back about his take on the movie along the same lines as mine, and we both agreed it would make a great blog post. Since I knew there was no way I could fit it into my schedule this week, Eric volunteered to do the honors, and it is an excellent post! I hope it’s the first of many from Eric: he is brilliant. Enjoy!
By Eric Swayne
WARNING: The following post contains copious references to Captain America: The Winter Soldier, some of which may reveal key plot points in the movie. Proceed at your own preferred level of superhero-movie spoiler risk.
Saw the aforementioned movie this past weekend, and I must admit, it was awesome. And I’m not alone: the movie has already set a box office record for an April opening (beating Fast Five), and has received critical acclaim from even the staunchest of fanboys. Sure, there were Easter Eggs and references galore, including one brilliant nod to Samuel L. Jackson’s iconic role in Pulp Fiction. But what makes this movie really pack a punch isn’t your standard fare of pecs and abs – I think it wins with audiences at a deeper level because it deals with many of the very real concerns we have in our very real lives today. Drones and electronically-controlled death from the sky? In the movie. Government being the entity you can’t trust anymore, because of its hidden agendas? Got it. Questions about the nature of privacy and the terrifying power of data mining? Check.
But beyond those, you see discussed what might be modern society’s new unmentionable “A”-word: Algorithm.
In the movie, a mad scientist (and I’ll leave the description there) creates an ultimate algorithm that can predict which individuals will be dangerous to the “bad guys” in the future, thus giving them targets to attack in their nefarious scheme. Within the movie, it’s stated that this algorithm uses the detritus of our digital lives to accomplish its evil machinations: credit card statements, phone calls, text messages, social networks, et cetera. Of course, this level of data collection sounds a lot like some recently-unveiled REAL government programs that are rocking the international community, so you immediately see the monsters in the shadows the film’s creators are implying. Moreover, the concept implies that, given enough data about an individual, an algorithm has almost infinite powers of clairvoyance, with deadly accuracy. Here is where fiction diverges from reality, because while the data may be limitless (tremendous privacy issues aside), algorithms have considerable limits they don’t discuss on the silver screen:
1. Algorithms are based on assumptions.
The most common assumption baked into most algorithms is that past performance predicts future behavior. This is often the case, but isn’t correct in all cases for all time. In the film’s case, the algorithm assumed people were binary – either enemies or not. Pay close attention when assumptions force human data through binary “chokepoints” like these – human data is extremely messy, even in aggregate. In fact, the assumptions one makes about the world can be very correct at the time they are made, but can rapidly become obsolete. One classic example of this is in Natural Language Processing for sentiment analysis: most solutions currently on the market sort their answers into discrete buckets of “positive,” “negative,” or “neutral.” This already presupposes the content has a sentiment, and that it can be categorized for the whole of the analyzed segment. It also assumes certain linguistic patterns are reliable clues for that sentiment, when we all know language evolves rapidly, and words with a negative connotation can shift or even reverse their meaning. Sentiment algorithms are great for high-level directional measurement, but at ground level, can be insufficient.
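To make the “baked-in assumption” concrete, here’s a toy sketch of a keyword-based sentiment bucketer (the word lists are hypothetical, not any vendor’s actual model). Every input is forced into one of three buckets, and the keyword list assumes words keep their dictionary connotation – which slang cheerfully ignores:

```python
# A toy keyword-based sentiment scorer. Two assumptions are baked in:
# 1) every text fits into exactly one of three buckets, and
# 2) each keyword's connotation is fixed for all time.
NEGATIVE = {"bad", "awful", "terrible", "sick"}  # "sick" assumed negative
POSITIVE = {"good", "great", "awesome"}

def bucket_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # everything else gets forced into "neutral"

print(bucket_sentiment("that movie was awesome"))  # positive
print(bucket_sentiment("that stunt was sick"))     # slang praise scored as negative
```

The second call is the assumption failing in miniature: “sick” as praise reverses the word’s value, but the frozen keyword list can’t see it.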
It’s extremely critical to be self-aware when baking these assumptions into any equation, because they necessarily limit the outcomes you can create. After all, if the world is only black and white, it’s extremely hard to see color.
2. Algorithms are probabilistic.
Most behavioral algorithms are based on statistical models, using past events to find patterns that appear with a significant level of consistency, then applying those patterns to newly gathered data to score potential outcomes for the future. The key word there is “score” – very often, algorithms provide a probability score with some confidence attached, not an “answer.” These scores define a range of futures, some more likely than others, but all of them possible. For the ultimate sport of stat nerds – baseball – ESPN often provides probabilities for a given team winning a given game at a given time. Even though these may say my beloved Rangers have a 99% chance of winning, there’s always a chance for the other team to find that 1% – usually with a ball somewhere in the bleachers. I have a lot of certainty before that point, but that certainty isn’t absolute.
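The 99%-isn’t-100% point is easy to demonstrate with a quick simulation (the win probability and game count here are made up for illustration). Even with the favorite at 99%, upsets keep showing up – roughly one game in a hundred:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def count_upsets(win_prob: float, n_games: int) -> int:
    """Count games the favorite loses despite a high win probability."""
    return sum(random.random() > win_prob for _ in range(n_games))

upsets = count_upsets(0.99, 10_000)
print(f"{upsets} upsets in 10,000 games at a 99% win probability")
```

A score of 99% is a statement about frequency across many futures, not a verdict about this one – which is exactly why treating a probabilistic output as an “answer” gets people (and fictional helicarriers) into trouble.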
3. Algorithms are reactionary.
Let’s say you wanted to create an algorithm to predict what time I would get home from work every weekday. How would you start? Like any good behaviorist, you’d want a set of information about my habits to start from. Especially useful would be data with a high correlation to the event you want to know about – things like what time every day I pass the gas station down the street. But the important nuance here is that you have to start from data informative about this event, not just any old data. For example, I could tell you my car is grey, has a V6 engine, and the right-front tire is about 4 psi low. All very personal data points, but totally useless for this purpose. They’re not an effective data set for training an algorithm on my behavior. In fact, it’s very possible to create a bad algorithm using this data, relying on an assumption (there’s that word again) that people who own grey cars consistently arrive home after 5pm. That can be entered into an algorithm, but it’s still totally incorrect. Every algorithm is reactive to the data set available to its creator, and cannot be pulled out of thin air.
Algorithms are tools. Like all tools, they carry no inherent “good” or “bad.” And, like all tools, they carry the flaws of the humans who create them. So the next time you see movies (or real life) treat algorithms as some omniscient source of clairvoyance, just remember they’re only that way in the comics.