“Analytics is Easy”
By Kevin Gray
Erroneous thinking about analytics continues to hang on in the marketing research community. Often it is tacit, but at times it is articulated candidly. This is worrisome given that marketing research is a research industry and no longer a young one. Some, for example, see analytics as little more than cross tabs and charting that can be done by anyone who has point-and-click software installed on their PC. This is a bit like saying that if you can talk, you can do qualitative research. Others think it’s “just programming.” There are other misperceptions as well, and one consequence of all this confusion is shoddy analytics which, in turn, raises doubts about the value of analytics.1 In this short article, I will demonstrate that analytics, in fact, is not easy and explain why this mistaken belief is potentially costly for any marketing researcher to hold.
Cross tabulations and graphics are an indispensable part of analytics but only part of it, and marketing researchers have long had a vast assortment of sophisticated tools at their disposal. Even basic analyses should not be undertaken in a slapdash fashion, however. Churning out stacks of cross tabs is not unheard of in our business but is very risky because even with big data there always will be fluke results. Instead of placing our bets on shotgun empiricism, as researchers, we should plan cross tabulations and other analyses when designing the research, and interpret the patterns of our findings in the context of other pertinent information, not simply highlight isolated results. The Improbability Principle: Why Coincidences, Miracles, and Rare Events Happen Every Day by David Hand, a past president of the Royal Statistical Society, is a great read and I can recommend it to marketing researchers.
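To make the point about fluke results concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not a recommended procedure: the function names, sample sizes and response rates are all hypothetical, and the two variables in every table are generated independently, so any "significant" cross tab is a coincidence by construction.

```python
import random

# Critical value of the chi-square distribution with 1 degree of freedom
# at the conventional 5% significance level.
CHI2_CRIT_5PCT_1DF = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # a row or column is empty, so no association can be tested
    return n * (a * d - b * c) ** 2 / margins


def count_fluke_results(n_tables=200, n_respondents=500, seed=1):
    """Count 'significant' 2x2 cross tabs when the two variables are unrelated."""
    rng = random.Random(seed)
    flukes = 0
    for _ in range(n_tables):
        table = [[0, 0], [0, 0]]
        for _ in range(n_respondents):
            # Group membership and answer are drawn independently
            # (hypothetical 50% and 30% rates), so any association
            # that shows up in the table is pure chance.
            group = int(rng.random() < 0.5)
            answer = int(rng.random() < 0.3)
            table[group][answer] += 1
        (a, b), (c, d) = table
        if chi_square_2x2(a, b, c, d) > CHI2_CRIT_5PCT_1DF:
            flukes += 1
    return flukes


if __name__ == "__main__":
    flukes = count_fluke_results()
    print(f"{flukes} of 200 cross tabs look 'significant' by chance alone")
```

With a 5% threshold, roughly one table in twenty should be flagged even though nothing real is going on, which is why a stack of unplanned cross tabs will nearly always contain something that looks like a finding.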
Another example of substandard analytics can be found in mapping. Nowadays mapping, in practice, frequently seems to mean junior research execs or even clerical personnel mass producing correspondence analysis maps, usually with the software’s default settings. The maps are nearly always brand maps and user maps and other kinds of mapping are underutilized, in my opinion. Moreover, though correspondence analysis is a wonderful technique, it is just one of many appropriate for mapping, and biplots, MDPREF, MDS, factor analysis, discriminant analysis, canonical mapping or other methods may be better suited to the problem at hand. What’s more, I still see maps being interpreted incorrectly.
A somewhat more elaborate but nonetheless debatable practice is psychographic segmentation with what has been called the tandem approach. Though it began to be seriously questioned many years ago, this method is still quite popular and, put simply, consists of K-means or hierarchical cluster analysis of factor scores derived from attitudinal ratings. Tandem refers to the dual use of factor and cluster analysis in the segmentation. The psychographic statements respondents rate are often improvised, making matters worse. Poor questionnaire design plagues many kinds of marketing research, and items that make little sense to respondents or mean different things to different people will sink a segmentation whatever statistical methods are used. In the tandem approach, segments obtained from the cluster analysis are cross tabulated with demographics and other data in the hope that meaningful and actionable segments will materialize. They often do not and, accordingly, I sometimes call this the “Factor, Cluster & Pray” method.
Regression is perhaps the most widely used statistical method of them all but is also deceptively complex. Many books have been written which detail how regression analysis can be badly abused, and Frank Harrell’s Regression Modeling Strategies is the most comprehensive and hard-hitting I’ve read. Marketing researchers seem to make the same sorts of mistakes people working in other disciplines do, though perhaps more often. Some examples are using highly correlated predictors, neglecting residual analyses, ignoring correlations across time (e.g., in weekly sales data) or space (e.g., regions of a country), categorizing the dependent variable and confusing correlation with causation.
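Two of the mistakes listed above can be screened for with very little arithmetic. The sketch below is a simplified illustration, not a substitute for a proper regression workflow: the function names are my own, the variance inflation factor is computed only for the special case of exactly two predictors, and the Durbin-Watson statistic checks only first-order correlation across time in the residuals.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor for a model with exactly two predictors.

    In that special case each predictor's VIF equals 1 / (1 - r^2),
    where r is the correlation between the two predictors.
    """
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1.0:
        return math.inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time.

    Values near 2 suggest no first-order autocorrelation; values toward 0
    suggest positive autocorrelation and values toward 4 negative.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical predictors that move almost in lockstep.
    ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
    promo_weeks = [1.1, 2.0, 2.9, 4.2, 5.0]
    print("VIF:", vif_two_predictors(ad_spend, promo_weeks))

    # Hypothetical weekly residuals that drift slowly instead of bouncing around zero.
    weekly_residuals = [0.5, 0.6, 0.4, 0.5, -0.4, -0.5, -0.6, -0.5]
    print("Durbin-Watson:", durbin_watson(weekly_residuals))
```

A large VIF signals that the two predictors carry nearly the same information, and a Durbin-Watson value well below 2 signals that neighboring weeks are not independent. Either is a reason to stop and rethink the model, not a number to report and move on.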
Another concern I have, in fact, pertains to causation. Whenever we say things like “This sort of consumer does this because of that,” we are making a statement about causation whether or not we are conscious of it. Causal analysis is a subject even bigger than regression and one bible is Experimental and Quasi-Experimental Designs for Generalized Causal Inference (Shadish et al.). Trying to establish causation can be likened to walking through a minefield, to paraphrase a comment once made to me by a Marketing professor with a PhD in Statistics. We need to tread carefully!
The next time you’re in a very brave mood, ask your senior finance director if they are no better at their job than they were 10 years ago. Common sense should tell us that experience counts, particularly in highly technical professions. Formal education only lays the groundwork for statisticians and even veterans are constantly learning new things and new tricks. The list of viable analytic options continues to grow (for examples see Analytics Revolution) and we’ve reached the point where we now have so many tools that skill levels are becoming diluted. Over-specialization, on the other hand, is also something we need to be wary of and some less-experienced analysts lean on a pet method for nearly any situation…if all you have is a hammer, everything looks like a nail…
Now, here comes the bad news: The math stuff can actually be the easiest part of analytics! Every so often I’m asked questions such as “If I give you 10 million customer records, what technique would you use?” To characterize questions like these as naive would be too diplomatic, as they reveal little grasp of the fundamentals of research. The Cross Industry Standard Process for Data Mining (CRISP-DM), illustrated in the diagram below, will help make clear what I mean by this.
Here are very succinct definitions of each CRISP-DM component, courtesy of Wikipedia.2
Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data Preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation: At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
Bravo! Properly understood, analytics is not just cross tabs, visualization or programming, or even fancy statistical techniques. It is a process intended to enhance decision-making. The first step listed above, Business Understanding, is often the most demanding and, along with Data Understanding and Data Preparation, can absorb the bulk of a project’s time and energy. CRISP-DM was not developed specifically for marketing research but is applicable to our business and drives home the point that analytics is a multifaceted, iterative process which involves more than narrow technical skills…or the ability to use a mouse. Serious errors can occur anywhere, anytime and even simple mistakes can have important consequences.
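The iterative character of CRISP-DM can be sketched as a small data structure. This is only an illustration: the allowed moves between phases reflect my reading of the commonly reproduced CRISP-DM diagram and should be treated as an assumption, and the names Phase, ALLOWED_MOVES and is_valid_path are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Assumed transitions, including the backward steps that make the
# process iterative and not a straight line.
ALLOWED_MOVES = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED_MOVES[cur] for cur, nxt in zip(path, path[1:]))


if __name__ == "__main__":
    # A project that steps back from Modeling to Data Preparation once.
    project = [
        Phase.BUSINESS_UNDERSTANDING,
        Phase.DATA_UNDERSTANDING,
        Phase.DATA_PREPARATION,
        Phase.MODELING,
        Phase.DATA_PREPARATION,
        Phase.MODELING,
        Phase.EVALUATION,
        Phase.DEPLOYMENT,
    ]
    print(is_valid_path(project))
```

Note that there is no allowed move from Business Understanding straight to Modeling, which is exactly the shortcut that a question like “what technique would you use?” tries to take.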
So, the next time someone even suggests that analytics is easy, I’d advise you to be on guard. It just ain’t so.
1 Some other reactions I have come across are that analytics is “too complicated,” or that it isn’t needed or that it doesn’t work.
2 For a brief summary of CRISP-DM see Wikipedia: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining. For a more in-depth look, see Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (Linoff and Berry), a popular, non-technical introduction to Data Mining and Predictive Analytics.