Lessons from Automating Social Media Monitoring
By Jeffrey Henning
Social media monitoring can be fragile.
I’ve been recapping the top 5 research links of the week since I coined the #MRX hashtag back in July of 2010. Originally I did it by hand, and then I had one of my sons automate the process for me in June of 2011. He had to make some minor tweaks to it whenever there were changes to the Twitter API (its Application Programming Interface, which is the way other programs are instructed to interact with Twitter as opposed to fetching and parsing its web pages).
As people use social media differently, as programs interact with it differently, and as APIs change, social media monitoring programs can behave strangely. For instance:
- At some point along the way, we stopped giving stories from The New York Times the clout they deserved, as their paywall interfered with how we looked up the URL. (Sadly, this is still a problem.)
- When we wrote the system, Twitter didn’t have embedded images. People used third-party tools for that. Now that Twitter supports embedded images (e.g., https://pbs.twimg.com/media/B1SpTxTCYAAHKin.png:large), the URL of an image would sometimes confuse our system.
- When the system was originally written, emoji was not as prevalent. The initial implementation could not handle emoji and a variety of other obscure characters, so it ignored tweets with these symbols. Once emoji was added to the default keyboards of both iPhone and Android in 2011 and 2013 respectively, these symbols got used more often and our system would ignore more and more tweets.
- Worst of all, at some point our system started producing bad data and I didn’t notice: it occasionally highlighted a top story that wasn’t a top story at all. This wasn’t because of an explicit change to the Twitter API, but because of a change in how data was returned. Sometimes, for no reason that we could tell, the URL would be returned as “t.co/…” and would inflate the count of another story. (Twitter uses its own URL shortener, t.co, even if you’ve already used a third-party URL shortener to get around the 140-character limit.)
Because of the shift to tweets with rare Unicode characters such as emoji, my son ended up rewriting our system from scratch. And the system now outputs additional diagnostics so I can verify its accuracy.
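To make the idea of diagnostics concrete, here is a minimal sketch in Python of how a monitor might pull links out of tweet text, tolerate emoji, and count truncated “t.co/…” links instead of silently crediting them to another story. It is an illustration, not the system described above: the names `extract_links`, `looks_truncated`, and `URL_PATTERN` and the counter keys are all hypothetical.

```python
import re
from collections import Counter

# Matches http(s) URLs and bare t.co links; a match stops at whitespace.
URL_PATTERN = re.compile(r"(?:https?://|t\.co/)\S+")


def looks_truncated(url):
    """Return True for links cut off with an ellipsis, such as 't.co/…'."""
    return url.endswith("…") or url.endswith("...")


def extract_links(tweets):
    """Collect usable links from tweet texts and tally what was skipped.

    Python 3 strings are Unicode, so emoji in the text need no special
    handling here; tweets containing them are processed like any other.
    """
    links = []
    diagnostics = Counter()
    for text in tweets:
        diagnostics["tweets_seen"] += 1
        for url in URL_PATTERN.findall(text):
            if looks_truncated(url):
                # Count the bad link rather than letting it inflate a story.
                diagnostics["truncated_links_skipped"] += 1
            else:
                links.append(url)
    diagnostics["links_kept"] = len(links)
    return links, diagnostics
```

Printing the returned counter alongside each weekly report is one simple way for a human reader to spot a sudden jump in skipped links.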
The algorithm seems to be working well now. Now, algorithm is just a formal word for automating a sometimes arbitrary process. We’ve implemented certain heuristics – another formal word, for rules of thumb! Really, a program is just an embodiment of judgment calls. Some of ours:
- Tracking influence – We give everyone who tweets an influence score, based on some factors. There’s been a lot of research into measuring influence, and we have our own method for estimating it. If @lennyism retweets a link, it counts for more than if a new Twitter user retweets that same link.
- Handling spam – If you retweet the same link three times to #MRX over a week, we’ve always counted it only once. We’ve implemented some new rules to better handle bots and other spammy activity that we’ve seen, including closely-related accounts retweeting a link. Beating spam is a constant battle.
- Determining which link is canonical – We resolve shortened URLs so that we are counting the underlying link, not the different representations of it. For instance, we treat it/1D5E09h, bit.ly/1vpik1H, and lnkd.in/d8EVzD3 all as http://www.greenbookblog.org/2014/12/30/embracing-change-in-mr-a-year-end-perspective/. And we’ve added a few special rules to account for different versions of hyperlinks to the same page. Rediscovering how hard it is simply to track URLs makes me realize how error-prone tracking brand names must be!
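The three heuristics above can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual scoring method: the function names, the `resolved` lookup table (standing in for following each short link's redirect), the normalization rules, and the default influence of 1.0 are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url, resolved):
    """Map a short link to its target, then normalize trivial variations."""
    target = resolved.get(url, url)
    parts = urlsplit(target)
    # Assumed rules: lowercase scheme and host, drop the fragment,
    # and trim a trailing slash so near-identical links compare equal.
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def rank_links(shares, influence, resolved, default_influence=1.0):
    """Rank links by the summed influence of the distinct users sharing them.

    shares    -- iterable of (user, url) pairs, one per tweet or retweet
    influence -- dict mapping user to an influence score
    resolved  -- dict mapping short links to the URLs they redirect to
    """
    seen = set()
    scores = defaultdict(float)
    for user, url in shares:
        link = canonicalize(url, resolved)
        if (user, link) in seen:
            continue  # the same user sharing the same link counts only once
        seen.add((user, link))
        scores[link] += influence.get(user, default_influence)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Deduplicating on the canonical link rather than the raw one matters here: a user who posts the same story through two different shorteners is still counted once.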
Fortunately, in our case, our program produces a report for a human to read and analyze, rather than simply spitting its results out to Twitter. So a human can catch the things that the automation didn’t. For instance, I manually exclude references not related to market research (I am so not looking forward to the February release of the Bollywood film Mr. X!). I skip over any expired links – invitations to webinars now passed, for instance. And I curse any spammers who get by the system.
The lesson? Social media monitoring automation requires vigilance and updates, even for a hobbyist project like tracking the top 5 research stories of the week. Implementing custom brand trackers requires even more diligence – you should schedule regular audits to double-check the results. The myth of social media is that the data is free and therefore the analysis is as well. The data is free, but the analysis can be time-consuming and tricky. If social media is fragile, your monitoring must be robust.