Predictive learning problems
From my previous post How to Teach a Computer to Distinguish Cats from Dogs
Predictive learning problems constitute the majority of tasks machine learning can
be used to solve today, and they apply to a wide array of situations and data types. In
this section we introduce the two major predictive learning problems: regression and
classification.
Suppose we wanted to predict the share price of a company that is about to go public (that is, a company that is offering its shares of stock to the public for the first time). Following the pipeline discussed in Section 1.1.1, we first gather a training set of data consisting of a number of corporations (preferably active in the same domain) with known share prices. Next, we need to design feature(s) that are thought to be relevant to the task at hand.
The top panels of Fig. 1.7 show a toy dataset comprising share price versus revenue
information for ten companies, as well as a linear model fit to this data. Once the model
is trained, the share price of a new company can be predicted based on its revenue, as
depicted in the bottom panels of this figure. Finally, by comparing the predicted price to the
actual price for a testing set of data, we can test the performance of our regression model
and apply changes as needed (e.g., choosing a different feature). This sort of task, fitting
a model to a set of training data so that predictions about a continuous-valued variable
(e.g., share price) can be made, is referred to as regression. We now discuss some further
examples of regression.
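The fit, predict, and test steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the revenue and price numbers are made up (they are not the data behind Fig. 1.7), the helper names fit_line, predict, and mean_squared_error are hypothetical, and an ordinary least-squares fit via NumPy is assumed as the way of training the linear model.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of a line: y is approximated by slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the continuous output (e.g., share price) for input feature x."""
    return slope * np.asarray(x, dtype=float) + intercept

def mean_squared_error(y_true, y_pred):
    """Average squared gap between actual and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

# Made-up training set: revenue (feature) and known share price (output).
train_revenue = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
train_price = np.array([12.1, 13.9, 16.2, 17.8, 20.0])
slope, intercept = fit_line(train_revenue, train_price)

# Made-up testing set, used only to check how well the trained line predicts.
test_revenue = np.array([6.0, 7.0])
test_price = np.array([22.1, 23.8])
test_error = mean_squared_error(test_price, predict(test_revenue, slope, intercept))

print("slope:", slope, "intercept:", intercept, "test MSE:", test_error)
```

If the test error turned out to be large, that would be a cue to revisit the feature (revenue here) or the model, as the text suggests.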
Example 1.1 The rise of student loan debt in the United States
Figure 1.8 shows the total student loan debt, that is, money borrowed by students to pay for college tuition, room and board, etc., held by citizens of the United States from 2006 to 2014, measured quarterly. Over the eight-year period reflected in this plot, total student debt tripled, exceeding one trillion dollars by the end of 2014. The regression line (in magenta) fit to this dataset represents the data quite well and, with its sharp positive slope, emphasizes the point that student debt is rising dangerously fast. Moreover, if this trend continues, we can use the regression line to predict that total student debt will reach two trillion dollars by the year 2026.
Example 1.2 Associating genes with quantitative traits
Genome-wide association (GWA) studies (Fig. 1.9) aim at understanding the connections between tens of thousands of genetic markers, taken from across the human genome of numerous subjects, and diseases like high blood pressure/cholesterol, heart disease, diabetes, various forms of cancer, and many others [26, 76, 80]. These studies are undertaken with the hope of one day producing gene-targeted therapies, like those used to treat diseases caused by a single gene (e.g., cystic fibrosis), that can help individuals with these multifactorial diseases. Regression, a commonly employed tool in GWA studies, is used to understand complex relationships between genetic markers (features) and quantitative traits like the level of cholesterol or glucose (a continuous output variable).
The machine learning task of classification is similar in principle to that of regression. The key difference between the two is that instead of predicting a continuous-valued output (e.g., share price, blood pressure, etc.), with classification what we aim at predicting takes on discrete values or classes. Classification problems arise in a host of forms. For example, object recognition, where different objects from a set of images are distinguished from one another (e.g., handwritten digits for the automatic sorting of mail or street signs for semi-autonomous and self-driving cars), is a very popular classification problem. The toy problem of distinguishing cats from dogs discussed in How to Teach a Computer to Distinguish Cats from Dogs was such a problem. Other common classification problems include speech recognition (recognizing different spoken words for voice recognition systems), determining the general sentiment of a social network like Twitter towards a particular product or service, as well as determining what kind of hand gesture someone is making from a finite set of possibilities (for use in, e.g., controlling a computer without a mouse).
Geometrically speaking, a common way of viewing the task of classification is as one of finding a line (or hyperplane in higher dimensions) that separates the two classes of data from a training set as well as possible. This is precisely the perspective on classification we took in describing the toy example in Section 1.1, where we used a line to separate (features extracted from) images of cats and dogs. New data from a testing set is then automatically classified by simply determining which side of the line/hyperplane the data lies on. Figure 1.10 illustrates the concept of a linear model or classifier used for performing classification on a 2-dimensional toy dataset.
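To make the geometric picture concrete, here is a minimal sketch of learning a separating line on a 2-dimensional toy dataset and then classifying new points by the side of the line they fall on. The text does not say how the line is found, so the classic perceptron update is assumed here purely for illustration. The data points and the names train_perceptron and classify are hypothetical.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Learn weights w and bias b so that sign(w . x + b) matches labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # A point is misclassified (or on the line) when y * score <= 0.
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return w, b

def classify(X, w, b):
    """Assign +1 or -1 depending on which side of the line each point lies."""
    return np.where(X @ w + b >= 0, 1, -1)

# Made-up, linearly separable 2-D training set with two classes.
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                    [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

w, b = train_perceptron(X_train, y_train)

# New (testing) points are classified by checking their side of the line.
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
print(classify(X_test, w, b))
```

The perceptron rule only settles on a separating line when the two classes can in fact be separated by one; for overlapping classes a different training method would be needed.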
Example 1.3 Object detection
Object detection, a common classification problem, is the task of automatically identifying a specific object in a set of images or videos. Popular object detection applications include the detection of faces in images for organizational purposes and camera focusing, pedestrians for autonomous driving vehicles, and faulty components for automated quality control in electronics production. The same kind of machine learning framework, which we highlight here for the case of face detection, can be utilized for solving many such detection problems.
After training a linear classifier on a set of training data consisting of facial and nonfacial images, faces are sought in a new test image by sliding a (typically) square window over the entire image. At each location of the sliding window, the image content inside is tested to see which side of the classifier it lies on (as illustrated in Fig. 1.11). If the (feature representation of the) content lies on the “face side” of the classifier, the content is classified as a face.
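The sliding-window procedure can be sketched as follows. This is an illustration under simplifying assumptions, not a working face detector: the raw pixel values in each window stand in for the feature representation, the classifier weights w and bias b are taken as given (already trained), and the function name sliding_window_detect, the window size, and the tiny test image are all hypothetical.

```python
import numpy as np

def sliding_window_detect(image, w, b, window=2, stride=1):
    """Return the top-left (row, col) of every window on the positive side of the classifier."""
    hits = []
    rows, cols = image.shape
    for r in range(0, rows - window + 1, stride):
        for c in range(0, cols - window + 1, stride):
            patch = image[r:r + window, c:c + window]
            features = patch.ravel()  # assumption: raw pixels serve as features
            # Positive score means the content lies on the "object side" of the classifier.
            if np.dot(w, features) + b > 0:
                hits.append((r, c))
    return hits

# Made-up 6x6 grayscale image containing one bright 2x2 block (the "object").
image = np.zeros((6, 6))
image[2:4, 3:5] = 1.0

# Hand-picked stand-in for a trained linear classifier: fires only on a fully bright window.
w = np.ones(4)
b = -3.5

detections = sliding_window_detect(image, w, b, window=2, stride=1)
print(detections)
```

In practice the window is usually also scanned at several sizes so that objects of different scales can be found, and a designed feature representation replaces the raw pixels used here.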
Next up will be Feature designs.