Publisher: University of California Press
Edition description: First Edition
Product dimensions: 7.00(w) x 9.80(h) x 0.70(d)
Data Mining for the Social Sciences
By Paul Attewell, David B. Monaghan, Darren Kwong
UNIVERSITY OF CALIFORNIA PRESS
Copyright © 2015 Paul Attewell and David B. Monaghan
All rights reserved.
WHAT IS DATA MINING?
Data mining (DM) is the name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data. Using those patterns, DM can create predictive models, or classify things, or identify different groups or clusters of cases within data. Data mining and its close cousins machine learning and predictive analytics are already widely used in business and are starting to spread into social science and other areas of research.
A partial list of current data mining methods includes:
recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests
multi-layer neural network models and "deep learning" methods
naive Bayes classifiers and Bayesian networks
clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering
support vector machines
"soft modeling" or partial least squares latent variable modeling
DM is a young area of scholarship, but it is growing very rapidly. As we speak, new methods are appearing, old ones are being modified, and strategies and skills in using these methods are accumulating. The potential and importance of DM are becoming widely recognized. In just the last two years the National Science Foundation has poured millions of dollars into new research initiatives in this area.
DM methods can be applied to quite different domains, for example to visual data, in reading handwriting or recognizing faces within digital pictures. DM is also being used to analyze texts—for example to classify the content of scientific papers or other documents—hence the term text mining. In addition, DM analytics can be applied to digitized sound, to recognize words in phone conversations, for example. In this book, however, we focus on the most common domain: the use of DM methods to analyze quantitative or numerical data.
Miners look for veins of ore and extract these valuable parts from the surrounding rock. By analogy, data mining looks for patterns or structure in data. But what does it mean to say that we look for structure in data? Think of a computer screen that displays thousands of pixels, points of light or dark. Those points are raw data. But if you scan those pixels by eye and recognize in them the shapes of letters and words, then you are finding structures in the data—or, to use another metaphor, you are turning data into information.
The equivalent to the computer screen for numerical data is a spreadsheet or matrix, where each column represents a single variable and each row contains data for a different case or person. Each cell within the spreadsheet contains a specific value for one person on one particular variable.
How do you recognize patterns or regularities or structures in this kind of raw numerical data? Statistics provides various ways of expressing the relations between the columns and rows of data in a spreadsheet. The most familiar one is a correlation matrix. Instead of repeating the raw data, with its thousands of observations and dozens of variables, a correlation matrix represents just the relations between each variable and each other variable. It is a summary, a simplification of the raw data.
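To make the idea of a correlation matrix concrete, here is a minimal sketch in Python. The three variables and their values are hypothetical, invented purely for illustration, and the helper names (`pearson`, `correlation_matrix`) are our own rather than part of any particular statistics package.

```python
from math import sqrt

# Hypothetical toy data: each row is one case, each column one variable.
columns = ["age", "income", "debt"]
rows = [
    [25, 30, 12],
    [32, 42, 15],
    [41, 55, 11],
    [50, 61, 9],
    [58, 70, 6],
]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(rows):
    """Return the correlations between every pair of columns."""
    cols = list(zip(*rows))  # transpose: one tuple per variable
    return [[pearson(a, b) for b in cols] for a in cols]

corr = correlation_matrix(rows)
for name, line in zip(columns, corr):
    print(name.ljust(8), " ".join(f"{r:6.2f}" for r in line))
```

Fifteen raw cells shrink to a three-by-three table, and with thousands of cases the table would stay the same size. That is the sense in which the matrix is a summary of the data.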
Few of us can read a correlation matrix easily, or recognize a meaningful pattern in it, so we typically go through a second step in looking for structures in numerical data. We create a model that summarizes the relations in the correlation matrix. An ordinary least squares (OLS) regression model is one common example. It translates a correlation matrix into a much smaller regression equation that we can more easily understand and interpret.
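For the simplest case, a regression with a single predictor, the step from correlation to regression equation can be sketched in a few lines. The slope is the correlation rescaled by the ratio of the two standard deviations. The data values and the function name below are hypothetical, and a regression with several predictors would require matrix algebra that this sketch leaves out.

```python
from math import sqrt

def simple_ols_from_summary(x, y):
    """Return (r, slope, intercept) for a one-predictor OLS regression,
    built only from means, standard deviations, and the correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    r = cov / (sd_x * sd_y)
    slope = r * sd_y / sd_x               # correlation rescaled to raw units
    intercept = mean_y - slope * mean_x   # line passes through the two means
    return r, slope, intercept

# Hypothetical toy data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r, slope, intercept = simple_ols_from_summary(x, y)
print(f"r = {r:.3f}; predicted y = {intercept:.2f} + {slope:.2f} * x")
```

The regression equation has only two numbers in it, however many cases went into the summary statistics.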
A statistical model is more than just a summary derived from raw data, though. It is also a tool for prediction, and it is this second property that makes DM especially useful. Banks accumulate huge databases about customers, including records of who defaulted on loans. If bank analysts can turn those data into a model to accurately predict who will default on a loan, then they can reject the riskiest new loan applications and avoid losses. If Amazon.com can accurately assess your tastes in books, based on your previous purchases and your similarity to other customers, and then tempt you with a well-chosen book recommendation, then the company will make more profit. If a physician can obtain an NMR scan of cell tissue and predict from that data whether a tumor is likely to be malignant or benign, then the doctor has a powerful tool at her disposal.
Our world is awash with digital data. By finding patterns in data, especially patterns that can accurately predict important outcomes, DM is providing a very valuable service. Accurate prediction can inform a decision and lead to an action. If that cell tissue is most likely malignant, then one should schedule surgery. If that person's predicted risk of default is high, then don't approve the loan.
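The step from prediction to decision can be sketched as follows. This is not a model from any real bank: the coefficients, the variable names, and the cutoff are all invented for illustration, and the logistic form is chosen only because readers are likely to know it from logistic regression.

```python
from math import exp

# Hypothetical coefficients and cutoff, invented for illustration only.
INTERCEPT = -3.0
COEFS = {"debt_to_income": 4.0, "prior_defaults": 1.5}
THRESHOLD = 0.5  # assumed cutoff for an unacceptable risk of default

def default_probability(applicant):
    """Convert an applicant's characteristics into a predicted probability."""
    score = INTERCEPT + sum(COEFS[name] * applicant[name] for name in COEFS)
    return 1.0 / (1.0 + exp(-score))  # logistic function

def decide(applicant):
    """Turn the predicted probability into an action."""
    if default_probability(applicant) >= THRESHOLD:
        return "reject"
    return "approve"

low_risk = {"debt_to_income": 0.2, "prior_defaults": 0}
high_risk = {"debt_to_income": 0.6, "prior_defaults": 2}
print(decide(low_risk), decide(high_risk))
```

In practice the coefficients would be estimated from the bank's own records, and the cutoff would reflect how costly each kind of mistake is.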
But why do we need DM for this? Wouldn't traditional statistical methods fulfill the same function just as well?
Conventional statistical methods do provide predictive models, but they have significant weaknesses. DM methods offer an alternative to conventional methods, in some cases a superior alternative that is less subject to those problems. We will later enumerate several advantages of DM, but for now we point out just the most obvious one. DM is especially well suited to analyzing very large datasets with many variables and/or many cases—what's known as Big Data.
Conventional statistical methods sometimes break down when applied to very large datasets, either because they cannot handle the computational aspects, or because they face more fundamental barriers to estimation. An example of the latter is when a dataset contains more variables than observations, a combination that conventional regression models cannot handle, but that several DM methods can.
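A tiny, made-up example shows why more variables than observations defeats conventional regression. With two cases and three predictors, very different sets of coefficients reproduce the outcome exactly, so the data cannot tell us which set is "the" estimate. The numbers below are hypothetical, and the sketch shows only the problem, not how any particular DM method resolves it.

```python
def predict(X, coefs):
    """Linear prediction (no intercept): one fitted value per row of X."""
    return [sum(value * c for value, c in zip(row, coefs)) for row in X]

# Hypothetical data: two observations (rows), three predictors (columns).
X = [
    [1, 2, 3],
    [2, 1, 0],
]
y = [6, 3]

coefs_a = [1, 1, 1]
coefs_b = [3, -3, 3]  # very different coefficients, identical fit

print(predict(X, coefs_a) == y)  # both candidate equations
print(predict(X, coefs_b) == y)  # fit the data perfectly
```

Because infinitely many coefficient vectors fit equally well, least squares has no unique solution here. A method that copes with this situation has to bring in some further criterion for choosing among them.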
DM not only overcomes certain limitations of conventional statistical methods, it also helps transcend some human limitations. A researcher faced with a dataset containing hundreds of variables and many thousands of cases is likely to overlook important features of the data because of limited time and attention. It is relatively easy, for example, to inspect a half-dozen variables to decide whether to transform any of them, to make them more closely resemble a bell curve or normal distribution. However, a human analyst will quickly become overwhelmed trying to decide the same thing for hundreds of variables. Similarly, a researcher may wish to examine statistical interactions between predictors in a dataset, but what happens when that person has to consider interactions between dozens of predictors? The number of potential combinations grows so large that any human analyst would be stymied.
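The growth in the number of potential interactions is easy to quantify. The short sketch below counts the distinct two-way and three-way combinations among a given number of predictors. The predictor counts are arbitrary examples and the function name is our own.

```python
from math import comb

def count_interactions(n_predictors, max_order=2):
    """Count distinct interaction terms from order 2 up to max_order."""
    return sum(comb(n_predictors, k) for k in range(2, max_order + 1))

for p in (6, 24, 50):
    two_way = count_interactions(p, 2)
    up_to_three_way = count_interactions(p, 3)
    print(f"{p} predictors: {two_way} two-way, "
          f"{up_to_three_way} two- and three-way interactions")
```

Six predictors yield 15 possible two-way interactions, a number one can inspect by hand. Fifty predictors yield 1,225, and more than 20,000 once three-way interactions are included.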
DM techniques help in this situation because they partly "automate" data analysis by identifying the most important predictors among a large number of independent variables, or by transforming variables automatically into more useful distributions, or by detecting complex interactions among variables, or by discovering what forms of heterogeneity are prevalent in a dataset. The human researcher still makes critical decisions, but DM methods leverage the power of computers to compare numerous alternatives and identify patterns that human analysts might easily overlook (Larose 2005; McKinsey Global Institute 2011; Nisbet, Elder, and Miner 2009).
It follows that DM is very computationally intensive. It uses computer power to scour data for patterns, to search for "hidden" interactions among variables, and to try out alternative methods or combine models to maximize its accuracy in prediction.
THE GOALS OF THIS BOOK
There are many books on DM, so what's special about this one? One can think of the literature on DM as a layer cake. The bottom layer deals with the mathematical concepts and theorems that underlie DM. These are fundamental but are difficult to understand. This book doesn't try to operate at that technically demanding level, but interested readers can get a taste by looking at the online version of the classic text by Hastie, Tibshirani, and Friedman (2009): The Elements of Statistical Learning: Data Mining, Inference, and Prediction (there is a free version at www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf).
Moving upward, the next layer of the DM literature covers computer algorithms that apply those mathematical concepts to data. Critical issues here are how to minimize the time needed to perform various mathematical and matrix operations and choosing efficient computational strategies that can analyze data one case at a time or make the minimum number of passes through a large dataset. Fast, efficient computer strategies are especially critical when analyzing big data containing hundreds of thousands of observations. An inefficient computer program might run for days to accomplish a single analysis. This book doesn't go into the algorithmic level either. Interested readers can consult the books by Tan, Steinbach, and Kumar (2005) and Witten, Frank, and Hall (2011) listed in the bibliography.
At the top layer of the DM literature one finds books about the use of DM. Several are exhortations to managers and employees to revolutionize their firms by embracing DM or "business analytics" as a business strategy. That's not our goal, however. What this book provides is a brief, nontechnical introduction to DM for people who are interested in using it to analyze quantitative data but who don't yet know much about these methods. Our primary goal is to explain what DM does and how it differs from more familiar or established kinds of statistical analysis and modeling, and to provide a sense of DM's strengths and weaknesses. To communicate those ideas, this book begins by discussing DM in general, especially its distinctive perspective on data analysis. Later, it introduces the main methods or tools within DM.
This book mostly avoids math. It does presume a basic knowledge of conventional statistics; at a minimum you should know a little about multiple regression and logistic regression. The second half of this book provides examples of data analyses for each application or DM tool, walks the reader through the interpretation of the software output, and discusses what each example has taught us. It covers several "tricks" that data miners use in analyses, and it highlights some pitfalls to avoid, or suggests ways to get round them.
After reading this book you should understand in general terms what DM is and what a data analyst might use it for. You should be able to pick out appropriate DM tools for particular tasks and be able to interpret their output. After that, using DM tools is mainly a matter of practice, and of keeping up with a field that is advancing at an extraordinarily rapid pace.
SOFTWARE AND HARDWARE FOR DATA MINING
Large corporations use custom-written computer programs for their DM applications, and they run them on fast mainframes or powerful computer clusters. Those are probably the best computer environments for analyzing big data, but they are out of reach for most of us. Fortunately, there are several products that combine multiple DM tools into a single package or software suite that runs under Windows on a personal computer.
JMP Pro (pronounced "jump pro") was developed by the company that sells the SAS suite of statistical software. You can download a free trial version, and the company provides online tutorials and other learning tools. JMP is relatively easy to use, employing a point-and-click approach. However, it lacks some of the more recent DM analytical tools.
SPSS (Statistical Package for the Social Sciences), now owned by IBM, is one of the oldest and most established software products for analyzing data using conventional statistical methods such as regression, cross-tabulation, t-tests, factor analysis, and so on. In its more recent versions (20 and above), the "professional" version of SPSS includes several data mining methods, including neural network models, automated linear models, and clustering. These are easy to use because they are point-and-click programs and their inputs and outputs are well designed. This may be the best place for a beginner to get a taste of some DM methods.
A more advanced data mining package called IBM SPSS Modeler includes a much larger choice of DM methods. This program is more complicated to learn than regular SPSS: one has to arrange various icons into a process and set various options or parameters. However, Modeler provides a full range of DM tools.
There are other commercial software products for PCs that include some DM tools within their general statistics software. Among these, MathWorks MATLAB offers data mining within two specialized "toolboxes": Statistics and Neural Networks. StatSoft's Statistica package includes an array of DM programs; and XLMiner is a commercial add-on for data mining that works with Microsoft's Excel spreadsheet program.
Beyond the commercial software, there are several free data mining packages for PCs.
RapidMiner is an extensive suite of DM programs developed in Germany. It has recently incorporated programs from the Weka DM suite (see below), and also many DM programs written in the R language. As a result, RapidMiner offers by far the largest variety of DM programs currently available in any single software product. It is also free (see http://rapid-i.com for information). The software takes considerable time to master; it uses a flowchart approach that involves dragging icons onto a workspace and linking them into a program or sequence. This idea is familiar to computer programmers, but may take others some time to learn. However, the user does not write commands or code. There is good online documentation, and North (2012) has written an introductory textbook for RapidMiner that has a free online version (http://dl.dropbox.com/u/31779972/DataMiningForTheMasses.pdf).
Weka is one of the oldest DM suites, and it is also free (see www.cs.waikato.ac.nz/ml/weka/). Developed in New Zealand, it is exceptionally well documented, including an encyclopedic textbook (Witten, Frank, and Hall 2011) and online tutorials (www.cs.ccsu.edu/~markov/weka-tutorial.pdf).
Rattle (http://rattle.togaware.com) is a free graphical user interface for a large collection of DM tools available in the R language. (R itself is also a free download.) Rattle is well documented, including a textbook (G. Williams 2011).
TraMineR (http://mephisto.unige.ch/traminer/) is a free suite of specialized programs developed in Switzerland to analyze sequences and longitudinal data. This is not an alternative but a supplement to more general DM software.
No one knows which of these competing software packages will dominate in the years to come, so it is difficult to recommend one that you should invest your time and energy in learning. If ease of use matters most to you, then you might start with SPSS Professional or JMP. On the other hand, if you want access to the full palette of DM techniques, then Modeler or RapidMiner might be a good choice.
A CAUTIONARY NOTE ABOUT HARDWARE
Makers of DM software for PCs tend to understate the hardware configuration needed to use their products effectively. DM software pushes Windows-based PCs to their limits. When using ordinary desktop computers to run DM software, one discovers that some analyses run very slowly, and a few "crash" or "hang up," even when datasets are not large. To avoid those frustrations, it is best to use as powerful a PC as possible, one with at least 8 GB of RAM (more than that is even better) and a fast multicore processor (for example an Intel i7). Even then, you may need to go for a coffee break while some programs are left running.
Big data requires large hard drives, but 1- or 2-terabyte drives have become inexpensive options when buying a new PC. For most datasets, smaller drives will suffice. Reading data does not seem to be a bottleneck when data mining on a PC; memory and CPU processing speed appear to be the limiting factors.
Excerpted from Data Mining for the Social Sciences by Paul Attewell, David B. Monaghan, Darren Kwong. Copyright © 2015 Paul Attewell and David B. Monaghan. Excerpted by permission of UNIVERSITY OF CALIFORNIA PRESS.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.