Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists

by Alice Zheng, Amanda Casari



Feature engineering is a crucial step in the machine-learning pipeline, yet this topic is rarely examined on its own. With this practical book, you’ll learn techniques for extracting and transforming features—the numeric representations of raw data—into formats for machine-learning models. Each chapter guides you through a single data problem, such as how to represent text or image data. Together, these examples illustrate the main principles of feature engineering.

Rather than simply teach these principles, authors Alice Zheng and Amanda Casari focus on practical application with exercises throughout the book. The closing chapter brings everything together by tackling a real-world, structured dataset with several feature-engineering techniques. The code examples use the Python packages NumPy, pandas, scikit-learn, and Matplotlib.

You’ll examine:

  • Feature engineering for numeric data: filtering, binning, scaling, log transforms, and power transforms
  • Natural text techniques: bag-of-words, n-grams, and phrase detection
  • Frequency-based filtering and feature scaling for eliminating uninformative features
  • Encoding techniques for categorical variables, including feature hashing and bin-counting
  • Model-based feature engineering with principal component analysis
  • The concept of model stacking, using k-means as a featurization technique
  • Image feature extraction with manual and deep-learning techniques
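As a taste of a few of the listed techniques, here is a minimal scikit-learn sketch (illustrative only, not code from the book): a log transform and binning for numeric counts, one-hot encoding for a categorical column, and tf-idf scaling of a bag-of-words representation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Numeric data: log transform, then binning (Chapter 2 topics).
counts = np.array([[1.0], [10.0], [100.0], [1000.0]])
log_counts = np.log1p(counts)  # log(1 + x) also handles zero counts

binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(log_counts)  # bin index per row

# Categorical data: one-hot encoding (Chapter 5 topic).
colors = np.array([["red"], ["green"], ["red"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()  # one column per category

# Text data: bag-of-words scaled by tf-idf (Chapters 3-4 topics).
docs = ["the cat sat", "the dog sat", "the cat ran"]
tfidf = TfidfVectorizer().fit_transform(docs)  # rows = docs, cols = vocabulary

print(binned.ravel())  # small counts fall in bin 0, large counts in bin 1
print(onehot.shape)    # 3 rows, 2 categories
print(tfidf.shape)     # 3 docs, 5 vocabulary terms
```

Each transformer follows the same `fit`/`transform` pattern, which is what lets these steps be chained into a single pipeline, as the book's closing chapter demonstrates on a real dataset.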

Product Details

ISBN-13: 9781491953242
Publisher: O'Reilly Media, Incorporated
Publication date: 04/20/2018
Pages: 218
Product dimensions: 6.90(w) x 9.10(h) x 0.60(d)

About the Author

Alice Zheng is a technical leader in the field of machine learning. Her experience spans algorithm and platform development and applications. Currently, she is a Senior Manager in Amazon's Ad Platform. Previous roles include Director of Data Science at GraphLab/Dato/Turi, machine learning researcher at Microsoft Research, Redmond, and postdoctoral fellow at Carnegie Mellon University. She received a Ph.D. in Electrical Engineering and Computer Science and B.A. degrees in Computer Science and Mathematics, all from U.C. Berkeley.

Table of Contents

Preface vii

1 The Machine Learning Pipeline 1
   Data 1
   Tasks 1
   Models 2
   Features 3
   Model Evaluation 3

2 Fancy Tricks with Simple Numbers 5
   Scalars, Vectors, and Spaces 6
   Dealing with Counts 8
   Binarization 8
   Quantization or Binning 10
   Log Transformation 15
   Log Transform in Action 19
   Power Transforms: Generalization of the Log Transform 23
   Feature Scaling or Normalization 29
   Min-Max Scaling 30
   Standardization (Variance Scaling) 31
   L2 Normalization 32
   Interaction Features 35
   Feature Selection 38
   Summary 39
   Bibliography 39

3 Text Data: Flattening, Filtering, and Chunking 41
   Bag-of-X: Turning Natural Text into Flat Vectors 42
   Bag-of-Words 42
   Bag-of-n-Grams 45
   Filtering for Cleaner Features 47
   Stop Words 48
   Frequency-Based Filtering 48
   Stemming 51
   Atoms of Meaning: From Words to n-Grams to Phrases 52
   Parsing and Tokenization 52
   Collocation Extraction for Phrase Detection 52
   Summary 59
   Bibliography 60

4 The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf 61
   Tf-Idf: A Simple Twist on Bag-of-Words 61
   Putting It to the Test 63
   Creating a Classification Dataset 64
   Scaling Bag-of-Words with Tf-Idf Transformation 65
   Classification with Logistic Regression 66
   Tuning Logistic Regression with Regularization 68
   Deep Dive: What Is Happening? 72
   Summary 75
   Bibliography 76

5 Categorical Variables: Counting Eggs in the Age of Robotic Chickens 77
   Encoding Categorical Variables 78
   One-Hot Encoding 78
   Dummy Coding 79
   Effect Coding 82
   Pros and Cons of Categorical Variable Encodings 83
   Dealing with Large Categorical Variables 83
   Feature Hashing 84
   Bin Counting 87
   Summary 94
   Bibliography 96

6 Dimensionality Reduction: Squashing the Data Pancake with PCA 99
   Intuition 99
   Derivation 101
   Linear Projection 102
   Variance and Empirical Variance 103
   Principal Components: First Formulation 104
   Principal Components: Matrix-Vector Formulation 104
   General Solution of the Principal Components 105
   Transforming Features 105
   Implementing PCA 106
   PCA in Action 106
   Whitening and ZCA 108
   Considerations and Limitations of PCA 109
   Use Cases 111
   Summary 112
   Bibliography 113

7 Nonlinear Featurization via K-Means Model Stacking 115
   k-Means Clustering 117
   Clustering as Surface Tiling 119
   k-Means Featurization for Classification 122
   Alternative Dense Featurization 127
   Pros, Cons, and Gotchas 128
   Summary 130
   Bibliography 131

8 Automating the Featurizer: Image Feature Extraction and Deep Learning 133
   The Simplest Image Features (and Why They Don't Work) 134
   Manual Feature Extraction: SIFT and HOG 135
   Image Gradients 135
   Gradient Orientation Histograms 139
   SIFT Architecture 143
   Learning Image Features with Deep Neural Networks 144
   Fully Connected Layers 144
   Convolutional Layers 146
   Rectified Linear Unit (ReLU) Transformation 150
   Response Normalization Layers 151
   Pooling Layers 153
   Structure of AlexNet 153
   Summary 157
   Bibliography 157

9 Back to the Feature: Building an Academic Paper Recommender 159
   Item-Based Collaborative Filtering 159
   First Pass: Data Import, Cleaning, and Feature Parsing 161
   Academic Paper Recommender: Naive Approach 161
   Second Pass: More Engineering and a Smarter Model 167
   Academic Paper Recommender: Take 2 167
   Third Pass: More Features = More Information 173
   Academic Paper Recommender: Take 3 174
   Summary 176
   Bibliography 177

A Linear Modeling and Linear Algebra Basics 179

Index 193