An Analytics Toolkit
By Kevin Gray
“Definitions are hard. Especially when it comes to what they mean.” – Yogi Berra
Okay, I made that up. But definitions are hard. Analytics, for example, means different things to different people, and sometimes different things to the same person on different occasions. Here is my own definition, with no official sanction, though heavily inspired by Wikipedia and other knowledgeable sources:
Analytics is the discovery and communication of meaningful patterns in data. It makes use of information technology, statistics and mathematical algorithms to develop knowledge, to quantify performance or to make predictions. It uses the insights gained from this process to recommend action or to guide decision making. Analytics is best thought of as a research procedure for decision making, not simply as isolated tools or steps in a process.
There are many steps in the analytics process that can come into play in quantitative MR projects:
1. Defining Objectives
2. Data Collection
3. Data Preparation and Cleaning
4. Model Building
5. Model Evaluation
6. Scoring New Data or Simulations Using the Model
7. Communication of Results and Implications to Decision Makers
8. Implementation and Monitoring Effectiveness
In this article I’ll only be able to touch on one of these – Model Building. However, the vast bulk of an analyst’s time is spent on the other components, and each is critical to success. Getting the objectives straight and ensuring the results are used well – the bookends of the process – are usually the most critical. A sophisticated answer to the wrong question is not a good answer, and a sophisticated answer to the right question can be used inappropriately or simply ignored.
An Analytics Toolkit
There are countless methods for analyzing data and, for simplicity, here are some broad ways of categorizing analytics tools, with brief illustrations shown in parentheses:
- Descriptive and Exploratory Analysis (frequencies, means, bar charts)
- Models that Predict (predicting consumption frequency of new customers)
- Causal Models (identifying brand choice drivers)
- Analysis of Cross-Sectional Data (data collected at one period in time)
- Analysis of Longitudinal or Time-Series Data (data collected at several periods in time)
- Models with Quantitative Dependent Variables (monthly spend)
- Models with Categorical Dependent Variables (product user/non user)
- Time-to-Event Models (customer churn analysis)
- Methods that Group Variables (factor analysis of attribute ratings)
- Methods that Group Cases (cluster analysis of consumers)
- Text Mining (analysis of social media conversations)
- Image and Voice Analytics (facial imaging, automatic interpretation)
- Simulations and Forecasts (sales forecasts under various marketing mix scenarios)
Big Data: There has been Big Hype about Big Data in the past few years but the term is still not used consistently to mean any one thing. The size and type of data we analyze obviously has an impact on the analytics tools we use, though size by itself is less important than the hype about Big Data might suggest. Some marketing scientists (with assorted job titles) have worked with huge databases for many years.
Most of the time, to build and evaluate a model, we actually don’t need to use all our data at once. Samples normally are sufficient. Remember that most of our “old” statistical tools were designed to generalize from very small samples to the population, and these methods are widely used in Big Data analytics as well. Many marketing scientists are also accustomed to very wide data – hundreds or thousands of variables – so this is not a new challenge to us either.
Prediction versus Interpretation: Perhaps more important is whether we need to predict or need to explain. A lot of data mining and predictive analytics is mainly concerned with prediction, for example, predicting whether or not a consumer will purchase a product. However, in marketing, understanding why consumers behave one way or another is a big plus. If we also have clues as to Why or Why Not, they can provide important insights for branding, creative and execution as well as for new product development.
Statistics versus Machine Learning: Machine Learning has also been getting a lot of press and, like Big Data, has no single universally-accepted definition. Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data. Confusingly, machine learning is also used to refer to very familiar statistical methods such as regression, cluster analysis and principal components analysis. Better to ask for specifics if you’re not sure how the term is being used.
A moment ago I showed you about a dozen ways to categorize analytics tools. You can also collapse this down into two macro categories, supervised and unsupervised. Supervised methods are used when there is a dependent variable – something you are trying to explain or predict from one or more independent variables. Regression is one example of a supervised method. Unsupervised methods are used when there is no distinction between dependent and independent variables. Factor analysis is an unsupervised method.
Dependence and interdependence are also used to mean supervised and unsupervised, respectively, and label, target variable and criterion variable are sometimes used in place of dependent variable. (Aren’t you glad Yogi wasn’t a statistician?)
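The supervised/unsupervised split can be sketched in a few lines. Below is a minimal numpy illustration using simulated data (the coefficients and sample sizes are made up for the example): ordinary least squares regression as the supervised case, and principal components via SVD as the unsupervised case.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised: a dependent variable y is explained by X ---
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares with an intercept column
Xb = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(Xb, y, rcond=None)[0]   # [intercept, b1, b2]

# --- Unsupervised: no dependent variable, just structure in X ---
Xc = X - X.mean(axis=0)                        # center before PCA
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                # variance share per component

print(beta)       # close to the true values [0, 3, -2]
print(explained)  # components ordered by variance explained
```

Note that only the supervised case has a "right answer" to recover; the unsupervised case simply re-describes the data.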
Popular Supervised Methods
Generalized Linear Models (GLM) are a large family of statistical techniques that are extremely versatile and can be used when the dependent variable is:
– Continuous (OLS regression)
– Categorical (binary logistic and multinomial logistic regression)
– Ordinal (ordered logistic regression)
– Count (Poisson regression)
– Repeated over time (longitudinal analysis)
– Clustered (e.g., departments within divisions of a company)
GLM are widely used in key driver analysis, choice modeling and conjoint analysis. “Linear” does not mean that curvilinear relationships and interactions cannot be handled by GLM – they can and often are. Both Frequentist and Bayesian estimation can be used – yet another “macro” way of classifying analytics.
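As one GLM family member from the list above, here is a bare-bones sketch of binary logistic regression fit by gradient ascent on the log-likelihood. The data are simulated and the learning rate and iteration count are arbitrary choices for this toy example; in practice you would use a statistics package rather than hand-rolled optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: purchase (1/0) driven by one standardized driver score
x = rng.normal(size=500)
true_b0, true_b1 = -0.5, 1.5
p = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.binomial(1, p)

# Fit binary logistic regression by gradient ascent on the log-likelihood
X = np.column_stack([np.ones_like(x), x])
b = np.zeros(2)
for _ in range(3000):
    mu = 1.0 / (1.0 + np.exp(-X @ b))      # predicted purchase probabilities
    b += 0.05 * X.T @ (y - mu) / len(y)    # Bernoulli log-likelihood gradient

print(b)  # roughly recovers [-0.5, 1.5]
```

Swapping the inverse-logit link and Bernoulli likelihood for another link/distribution pair gives the other GLM members (Poisson, ordered logistic, and so on).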
Here are some methods widely-used in data mining and predictive analytics that may be less familiar to you:
– K-nearest neighbors
– Artificial Neural Networks (“neural nets”)
– Support Vector Machines
– Boosting (e.g. AdaBoost)
– Bagging (e.g. Random Forests)
These tools are sometimes easier to use and more accurate for prediction but not always. Their downsides are that they’re more black box than GLM techniques and less informative about how the variables interrelate. This is a drawback in Key Driver Analysis and some other analytics. Fortunately, it’s often possible to use one method for prediction and another to help us understand the why and how.
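To make the first of those methods concrete, here is a from-scratch k-nearest neighbors classifier on invented toy data (the "light"/"heavy" user labels and usage metrics are hypothetical). It also illustrates the black-box point: the prediction works, but it yields no coefficients to interpret.

```python
from collections import Counter
import math

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, point), lbl) for p, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters of consumers described by two usage metrics
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["light", "light", "light", "heavy", "heavy", "heavy"]

print(knn_predict(train, labels, (2, 2)))  # -> "light"
print(knn_predict(train, labels, (7, 9)))  # -> "heavy"
```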
Popular Unsupervised Methods
Here are some familiar faces:
– Principal Components Analysis
– Factor Analysis
– Correspondence Analysis
– Canonical Correlation
These methods are very useful for brand and user mapping. As many of you know, big brands can dominate perceptual maps and when that happens mapping may not be very informative since it merely shows us graphically what we already knew – that the market leaders score highest on most attributes. Correspondence Analysis is one way – but not the only way – to reduce brand size effect when that is a concern.
Principal Components Analysis is also widely used for pre-processing data, for example, prior to cluster analysis or regression. Canonical Correlation, a kind of multi-barreled Principal Components Analysis, and Multiple Correspondence Analysis can also be used this way.
As an aside, when marketing researchers say “Factor Analysis” we usually mean Principal Components. However, the two methods have different origins, Psychology and Statistics, respectively, and are not identical.
The methods shown next are frequently used for “discovered” (post hoc) segmentation.
– K-means Cluster Analysis
– Agglomerative Hierarchical Clustering (AHC)
– Partitioning Around Medoids (PAM)
– Self-Organizing Maps (Kohonen networks)
– Mixture Modelling and Latent Class
– Frequent Pattern Mining (e.g., Apriori, FP-Growth)
Mixture Modeling, which subsumes Latent Class, is highly versatile and also used when there are dependent variables. Frequent Pattern Mining and Association Rules are used for market basket analysis and recommender systems, to name two examples of how they are put into play.
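The core counting step behind Apriori-style market basket analysis fits in a few lines. This sketch uses invented transactions and an arbitrary 40% minimum-support threshold; real implementations prune the candidate space far more cleverly.

```python
from itertools import combinations
from collections import Counter

# Toy transaction data: each row is one shopper's basket
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "bread"},
    {"bread", "butter", "milk"},
]

# One pass of the Apriori idea: count item pairs and keep those meeting
# a minimum support threshold (here, appearing in >= 40% of baskets)
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 0.4 * len(baskets)
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)  # ('beer', 'chips') appears in 3 of 5 baskets
```

Frequent pairs like these are the raw material for association rules ("shoppers who buy beer also buy chips") and recommender systems.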
Structural Equation Modeling (SEM)
SEM is arguably the most multitalented of the bunch. It is certainly the most versatile – uniting GLM and Factor Analysis – and is invaluable in attitudinal research. A Mixture Modeling variant (SEMM) combines SEM with Cluster Analysis, and SEM can also be used with longitudinal and clustered (i.e., multi-level, hierarchical) data.
Some SEM software is very abuser-friendly and, unfortunately, SEM may be the most misunderstood and misused statistical technique of them all. When I think of SEM I sometimes imagine a wonderful all-around athlete who’s injured a lot… Partial Least Squares (PLS) is a competitor to SEM and there are often heated debates between these rival camps but I will not join the fray here.
Time-Series Analysis
Time is an important dimension in some kinds of analytics. This is a very large group of methods that originated in Operations Research, Econometrics, Statistics and other disciplines. It’s used when data have been collected at many points in time (e.g., weekly sales). In marketing research it’s used most often for sales forecasting and Marketing Mix Modeling, which is also known as Market Response Modeling. We employ these methods to try to find out which marketing activities have the biggest payoff and sometimes to forecast sales under various marketing scenarios.
Some of the most commonly-used methods are:
– Exponential Smoothing
– VAR, VEC
– State-Space Modelling
– Dynamic Factor Models
Definitions can be tricky, as I’ve noted. Longitudinal is a generic term used to mean data collected over time. Typically it implies fewer points in time than time-series, which can have dozens or even thousands of data points. I personally would call two years of quarterly tracking figures longitudinal data and weekly sales figures over a two-year period time-series data. Some researchers would maintain that a consumer tracking study is not truly longitudinal since we (normally) are not following the same consumers from wave to wave. We do follow the same brands, however.
Time-to-Event Analysis
This kind of analytics, also known as Survival, Duration or Event History Analysis, comes in handy when analyzing the expected time until one or more events happen. It’s quite complex and is heavily used by medical researchers and also by economists, engineers and in Operations Research. Kaplan-Meier, Cox regression and parametric models are the main methods, and recent variations can include a segmentation component instead of assuming all customers, patients, etc. are the same.
Some of the ways it’s used in marketing research are to find out:
– Which factors drive customer churn
– How long a customer is likely to remain (“survive” as) a customer
– How purchase behavior and website usage unfold over time
These methods are very useful but very easy to get very wrong. Many dense textbooks have been written about this one topic or include it as a major section. The admonition “Don’t try this at home, folks” comes to mind.
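That said, the simplest tool of the bunch, the Kaplan-Meier estimator, can at least be sketched to show the idea. The four-customer dataset below is invented, and this toy version ignores the tied-time and variance subtleties that real survival software handles.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  observed durations (e.g., months until churn or censoring)
    events: 1 if the event (churn) was observed, 0 if censored
    Returns (time, survival probability) pairs at each event time.
    """
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= sum(1 for ti in times if ti == t)  # drop events + censored
    return curve

# Four customers: churned at 2 and 3 months, censored at 3, churned at 5
print(kaplan_meier([2, 3, 3, 5], [1, 1, 0, 1]))
# survival drops to about 0.75, then 0.5, then 0
```

The censored customer still counts in the "at risk" pool up to month 3, which is exactly what naive averaging of churn times gets wrong.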
Why Use Advanced Analytics?
This topic may strike you as complicated and many of you may be wondering “Why bother?” There are many reasons. Advanced analytics adds value to data – it can help data speak to us. For example, in consumer attitude studies we can use psychometric tools to take respondent scale usage patterns and background characteristics into account. Doing so will provide us with a deeper and more accurate understanding of consumer attitudes and behaviors and how they connect with each other.
In any kind of research, totals and crosstabs only show us the surface and are just the first steps in exploratory data analysis. Moreover, running lots of crosstabs increases the risk of fluke “findings”, and clients can make bad decisions based on these chance results.
Looking at variables two-at-a-time also can be very misleading. Older consumers, for instance, may seem to be heavier users of a particular category but, after taking gender, income and other characteristics into account, we may find that category usage actually declines with age!
Appearances can be deceiving!
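This reversal (Simpson's paradox) is easy to demonstrate with arithmetic. The usage rates and income shares below are entirely hypothetical, chosen only so the numbers flip direction once income is taken into account.

```python
# Category usage rates within income strata (hypothetical numbers)
#   each entry: (usage rate, share of the age group in that stratum)
young = {"low_income": (0.30, 0.8), "high_income": (0.60, 0.2)}
older = {"low_income": (0.20, 0.2), "high_income": (0.50, 0.8)}

def overall(group):
    """Aggregate usage rate: stratum rates weighted by stratum shares."""
    return sum(rate * share for rate, share in group.values())

# Aggregated, older consumers LOOK like heavier users...
print(overall(young))  # about 0.36
print(overall(older))  # about 0.44

# ...but within every income stratum, usage is lower for older consumers
for stratum in young:
    assert older[stratum][0] < young[stratum][0]
```

The aggregate comparison is confounded because older consumers are concentrated in the higher-usage income stratum; a multivariate model resolves this where a two-way crosstab cannot.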
What I’ve described above are only some of the tools a Marketing Scientist may wish to include in his or her toolbox. Standardized methodologies also have their place – I used to help design them – but for many projects customized analytics are the better route, and also faster and less expensive when done competently.
More than the tools, though, it is how they are used that’s most important – making advanced analytics work involves much more than math and programming…You need to put the patient before the cure!
IoT, Artificial Intelligence, Quantum Computing and unforeseen innovations will likely have a profound impact on our lives in the future. This means they will also impact MR and analytics. Some kinds of analytics will be largely automated in the not-so-distant future, though human judgment will remain essential.
Further ahead, Marketing and MR may be extensively automated…but in that sort of world will they still be necessary? I wonder.
A version of this article was presented at the Festival of the NewMR February 2nd, 2016.