------------------------------------------------------------------------
BIOWULF'S MATHEMATICAL TOOLS:
-----------------------------
MACHINE LEARNING AND THE "SVM" METHOD
-------------------------------------
(PART 1 OF 3)
-------------
screenplay by
Alan B. Scrivener
26 May 2001
first draft
------------------------------------------------------------------------
1. TITLE CARD
YELLOW LETTERS ON BLUE BACKGROUND:
BIOwulf's Mathematical Tools:
Machine Learning and the "SVM" Method
NARRATOR
BIOwulf's mathematical tools: machine learning and the
"SVM" method.
2. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Hi, I'm _____ _____, and in the next twenty minutes I want to
briefly explain some of the mathematical tools used by BIOwulf to
program machine learning, especially the technique known as
"Support Vector Machines," or "SVM" for short.
We're assuming that you know the concepts of algebra;
maybe you took calculus in college and then forgot it all,
but you can still look at an equation like this...
3. CARD
YELLOW LETTERS ON BLUE BACKGROUND:
a x2 + b x + c = 0
NARRATOR (CONTINUING)
... and you realize that x is the variable and a, b and c
are constants.
4. CLOSE-UP: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Extra credit if you recognize it as "the quadratic equation."
I'd also like to apologize in advance to those of you who are
mathematicians, quantitative scientists or engineers-- for you
this information may seem overly basic, at least at first.
We want to make sure we bring everyone along. But hang
in there and I bet you'll learn something yet!
And just so you are clear on the scope of this video, this is
not to inform you of BIOwulf's business plans, key people,
product offerings or marketing message. That information
is communicated in other ways. This video will cover the
following topics...
5. CARD:
YELLOW LETTERS ON BLUE BACKGROUND:
machine learning
state space
combinatorial explosion
correlation
orthogonal
linear
weights
recursion
recursive feature elimination
support vectors
Does it work?
NARRATOR (CONTINUING)
The concepts of:
machine learning,
a state space,
a combinatorial explosion,
what is correlation,
what does orthogonal mean,
linear functions,
weights,
recursion,
recursive feature elimination, and lastly
support vectors.
Followed by a brief discussion of perhaps the most important
question: "Does it work?"
6. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Now we're not going to explain how you can use these mathematical
tools yourself -- we have some papers you should read if you want
to get into that level. We're just going to explain in broad
outlines how they work. This quote from Richard Feynman's
brilliant little book on quantum electrodynamics...
(HOLDS UP 'QED')
-- QED is the title -- explains well what we are trying
to do:
(READING FROM THE BOOK)
"How am I going to explain to you things I don't explain to
my students until they are third year graduate students?
Let me explain it by analogy. The Maya Indians were very
interested in the rising and setting of Venus as a morning
star and as an evening star -- they were very interested
in when it would appear. After some years of observation
they noted that five cycles of Venus were very nearly equal
to eight of their years of 365 days. To make calculations,
the Maya had invented a system of bars and dots to represent
their numbers -- including zero -- and had rules by which to
calculate and predict not only the risings and settings of
Venus, but of other celestial phenomena, such as Lunar
eclipses.
"In those days, only a few Maya priests could do such elaborate
calculations. Now, suppose we were to ask one of them how to
do just one step in the predicting when Venus will next rise
as a morning star -- subtract two numbers. How would the
priest explain what subtraction is?
"He could either teach us the numbers represented by the bars
and dots and the rules for subtracting them, or he could
tell us what he was really doing: 'Suppose we want to
subtract 236 from 584. First count out 584 beans and put
them in a pot. The take out 236 beans and put them to one
side. Finally, count the beans left in the pot. That number
is the result of subtracting 236 from 584.'
"You might say, 'What tedium -- counting beans, putting
them in, taking them out -- what a job!'
"To which the priest would reply, 'That's why we have the
rules for the bars and dots. The rules are tricky, but
they are a much more efficient way of getting the answer
than by counting beans. The important thing is, it makes
no difference as far as the answer is concerned: we
can predict the appearance of Venus by counting beans (which
is slow, but easy to understand) or by using tricky rules
(which is much faster, but you must spend years in school
to learn them).'"
(LOWERS BOOK)
Got it? Okay, let's get started.
7. CARD: YELLOW LETTERS ON BLUE BACKGROUND:
machine learning
8. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR
I want to make sure you aren't confused by the word 'machine'
here. What we mean is a algorithm, or precise technique,
for solving a problem. The mathematical literature is full
of abstract machines, like the Markhov Machine and the Turing
Machine, which are just mathematical models. There usually
isn't a special physical machine built, it's just simulated
by programming a digital computer. In the case of machine
learning, we're talking about writing a computer program
that can learn rules by studying example data.
Lets really spell this out, because this is the problem
definition. I'm going to give you two made-up examples, just
because they're easy to talk about, and then a third example
that is the kind of work BIOwulf actually does with machine
learning.
First, imagine that you've discovered some unusual kind of
rat nobody knows anything about, and you want to know what
to feed it. Fortunately you have some recent dietary
experiment results for these rats. Some students had a bunch
of rats and for each one they randomly chose a particular
canned food from a grocery store and fed it only that food
daily for a month.
(HOLDS UP A CAN WITH NUTRITIONAL CONTENT LABEL)
They added up the nutritional contents
on the sides of the cans...
9. CLOSE UP: NUTRITIONAL LABEL
NARRATOR (CONTINUING)
...you know, the amount of vitamins, minerals, proteins,
fat and so on...
10. CARD: TABLE T-1
| rat # | Carb. | Fat | Cholest. | Sodium | Protein | Vitamin A | Vitamin C | Calcium | Iron | +/- |
| rat #1 | 0 g | 0.5 g | 30 mg | 200 mg | 13 g | 0% | 0% | 0% | 2% | - |
| rat #2 | 20 g | 0 g | 0 mg | 10 mg | 1 g | 0% | 20% | 0% | 2% | - |
| rat #3 | 25 g | 2 g | 0 mg | 45 mg | 0 g | 0% | 0% | 0% | 2% | - |
| rat #4 | 4 g | 0 g | 0 mg | 0 mg | 1 g | 4% | 8% | 2% | 4% | + |
% = per cent daily human requirements
NARRATOR (CONTINUING)
.. and for each rat, they made a list of all of the nutrients
it received, and a notation as to whether it lived or died.
You can see on the list here, each row is a rat. The leftmost
column gives the rat number, the columns in the middle are the
nutrients, and the last column is a plus if the rat lived and
a minus if it died.
11. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR
The challenge then is to figure out from the list -- and only
from the list -- what is the best diet to feed the rats.
And to be able to take a proposed diet and predict whether
a rat will live or die on it.
Secondly, imagine that you are trying to come up with a formula
for deciding whether or not to grant a credit card application.
A credit card company has historical information on people's credit
applications, and then whether the company made a profit on the
customer over a 10 year period. The idea is to use only the
information on the credit application to predict customer
profitability.
12. CARD: TABLE T-2
| customer # | income | debt | credit | equity | savings | +/- |
| customer #1 | $27,000 | $3,600 | $5,400 | $4,900 | $900 | + |
| customer #2 | $100,000 | $500,400 | $50,500 | $700 | $50 | - |
| customer #3 | $65,000 | $5,600 | $4,400 | $10,500 | $5,000 | + |
NARRATOR (CONTINUING)
For each customer, the customer number is listed, followed by
income, debt, credit, equity and savings. The last column on
the right contains a plus of the customer was profitable -- meaning
they ultimately paid their bills-- and a minus if they weren't
-- if they burned the company.
So once again the goal is to come up with a formula, just
using the information on the list, which will predict the
customer's profitability. And I should emphasize, we're not
talking about a programmer or mathematician coming in and
looking at this list and then producing the correct formula.
13. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
We want to produce the formula automatically from the numbers
in the list. In other words, we need a program that gets
trained by the data and then can classify new data.
The third example is not made up, but represents actual
research done by BIOwulf's mathematicians.
We were fortunate to receive 62 frozen tissue samples from
____ hospital that were biopsies of patients with suspected
colon cancer. Along with the samples was a file on each
patient covering 30 years of follow-up testing, so that it
was possible for each sample to know the "right answer" in
advance -- whether the patient got colon cancer or not.
With modern genetic analysis techniques we were able to
subject each sample to 2000 gene probes, to determine for
each of 2000 genes whether that gene was silent, active or
hyperactive in the sample. In each case this was turned
into a number from zero to one.
14. CARD: TABLE T-3
| patient # | income | gene #1 | gene #2 | gene #3 | gene #4 | *** | +/- |
| patient #1 | 0.56 | 0.45 | 0.55 | 0.21 | 0.98 | *** | - |
| patient #2 | 0.66 | 0.89 | 0.47 | 0.82 | 0.38 | *** | + |
| patient #3 | 0.01 | 0.09 | 0.66 | 0.63 | 0.54 | *** | + |
NARRATOR (CONTINUING)
You can see that for each patient we have a row of data on
which genes were expressed, and by how much, and in the rightmost
column a plus sign if the patient got colon cancer and a minus
if they didn't.
15. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
I don't have to remind you that the once again the goal is
to produce a formula -- what we call the Decision Function
-- to sort patients correctly into those who will and won't
get the cancer, using only the information on the list.
Now you might wonder, why is this so important? Why can't
we use additional information? For example, looking at the
rat diets again...
16. CARD: TABLE T-4
| rat # | Carb. | Fat | Cholest. | Sodium | Protein | Vitamin A | Vitamin C | Calcium | Iron | +/- |
| rat #1 | 0 g | 0.5 g | 30 mg | 200 mg | 13 g | 0% | 0% | 0% | 2% | - |
| rat #2 | 20 g | 0 g | 0 mg | 10 mg | 1 g | 0% | 20% | 0% | 2% | - |
| rat #3 | 25 g | 2 g | 0 mg | 45 mg | 0 g | 0% | 0% | 0% | 2% | - |
| rat #4 | 4 g | 0 g | 0 mg | 0 mg | 1 g | 4% | 8% | 2% | 4% | + |
% = per cent daily human requirements
NARRATOR (CONTINUING)
From this admittedly small sample rat #4 is the only survivor.
It might help to know that rat #4 was fed canned string beans.
It is also tempting to make guesses based on human nutritional
needs, or -- even better -- the needs of other known rats.
17. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
But there is a big problem with this. You will recall that
there are 2000 genes represented in the tissue samples.
For some of the experiments we haven't talked about yet
there can be up to 22,000 columns of data.
It was Lenin who said, "Quantity has a quality all its own."
Human intuition and judgment are very good at operating on
dozens or maybe even sometimes hundreds of factors, but
thousands of them overwhelm us.
So we put all the information we think might be relevant in
the table. Age, weight, number of known relatives with
similar cancers, all could be added to the patient data.
But then we have to let a computer program do the learning
and the deciding, because the problem is simply too big for
the human mind. One of the things we want the computer to do
is tell us which columns we can toss out -- which ones aren't
needed to create the decision function.
18. FIGURE F-1
So here in a nutshell is how machine learning is done.
You have some data like the examples above where you know
the right answer. Typically, you break these into two groups,
a training group and a testing group. You run all the
training data through a machine learning program and it
"learns" by coming up with a formula for classifying the
data, which we call a decision function. Built into that
function is the information about which columns n the table
which to throw out.
Next you run the testing data through the decision function
and see if it gets the right answer. When it all seems to
be working correctly you can real data through -- in other
words, data without the pluses and minuses, without the
right answers supplied. That's when you have to wait 30
days and see if the rat dies, or 10 years to see if the
credit card customer pays their bills, or 30 years to see
if the patient gets colon cancer.
19. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
The only other complication is that usually you really don't just
want pluses and minuses, it would be nice to know how sure
you are, so the decision function usually returns a number. If
the number is positive, that's a plus, and if it's negative,
that's a minus, and the magnitude tells us the confidence.
So there you have machine learning in a nutshell -- at least
the problem definition. The hard work comes in trying to
design a program which can turn the training examples into the
decision function, and that's what the rest of this video is
about.
.
For the last 50 years or so the most popular traditional approach
to the machine learning problem has been to use neural networks...
20. FIGURE F-2: PHOTO OF NEURONS
NARRATOR (CONTINUING)
... This involves simulating the actual neurons in the human
nervous system, and this approach has done a pretty good job.
But as we will see, the "SVM" method does better in many cases.
21. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
But before I can explain the SVM method, I need to introduce a
few more concepts.
22. CARD YELLOW LETTERS ON BLUE BACKGROUND:
state space
NARRATOR (CONTINUING)
State space.
23. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Normally when we use the word space we're talking about the
space that fills the universe we live in. We usually call
it a three-dimensional space, because we can specify a point
in it with three coordinates. This GPS...
24. CLOSE-UP: NARRATOR HOLDING UP GPS
NARRATOR (CONTINUING)
... a Global Positioning System receiver, tells me my latitude
north or south, my longitude east or west, and my elevation.
This completely defines my position in 3D space.
Sometimes people say time is the fourth dimension, so the GPS
also tells time, which locates my in the 4th dimension as well.
25. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
But mathematicians like to think about lots of other
spaces besides physical space. A state space is an imaginary
space where each point represents a possible state of a system.
They have lots of uses, but the way we're going to use them
here is to represent rows in out data lists. To start with,
let's pretend we only have two columns of ____ data for each
row. That will allow us to represent each row -- each rat
or credit card applicant or patient -- as a point in a
two-dimensional state space.
26. CARD: FIGURE F-3
NARRATOR (CONTINUING)
Here you can see three points represented, colored red,
green and yellow just to tell them apart. So this is
a graphical display of three rows of data.
27. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
If we add another column, so we have three columns of data,
now each row can be a point in 3D space...
28. ANIMATION A-1
NARRATOR (CONTINUING)
This animation shows the space turning so we can get views
from different angles. One of the problems of the 3D view
is can be hard to see what's behind. Using time as an added
dimension and turning a model can help solve this problem
if we only want to see a few points in the state space.
But if we want to plot every possible point, we can't see
a whole volume of points directly, any more that we can
see inside a stack of pingpong balls..
A more fundamental problem is that -- remember -- we have up
to 22,000 dimensions in our data. We're already hitting some
problems with 3 dimensions, and we've run out. Clearly we can't
directly visualize higher-dimensional state spaces. But there
are some tricks we can do, as we'll see later.
Still another problem is that we want to be able to display
our decision function, the output of the Support Vector Machine
method. It's just a one-dimensional function, but that does
use a dimension.
So the most common approach is like the one in this figure...
29. CARD: FIGURE F-4
NARRATOR (CONTINUING)
... two dimensions are used for the state space and
one is used for the decision function. Here the horizontal
dimensions represent the state space, like latitude and
longitude on a map, and the third dimension, the altitude,
is the decision function. The best choice is the lowest
in this representation, so we're looking for the bottom
of the bowl.
But there is a problem with seeing the whole state
space and decision function at once. Because paper
is flat and screens are flat, it is most common to display
a 2D state space in a 2D image, with color as the third
dimension.
30. FIGURE F-5
NARRATOR (CONTINUING)
That's how the government of Australia produced this image.
Ranges of altitudes are shown, starting at sea level, as blue,
light blue, green, yellow, orange, red, violet and gray for
the highest.
[PAUSE]
31. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Another technique, which we're not going to show,
is to like a topographic map, using lines of constant
value -- called isolines -- to show height or other values.
So let's see some decision function data plotted this way.
32. FIGURE F-6
NARRATOR (CONTINUING)
Here is a decision function for some 2D data, where
dark blue is for the lowest negative values and dark
red is for the highest positive values. The black
line is where the decision function is zero, and so the
data is "right on the line" between +, live rat or good
credit, and -, dead rat or bad credit.
33. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND
NARRATOR (CONTINUING)
Remember, we're trying to show some analogies here by
boiling 22,000 dimensions of data down to 2, and then
using the third for the decision function. And I want to
emphasize that we do this not because it's a great ideas,
but because we don't know of a better way to display
these fundamental concepts. If we knew how to display
the 22,00 dimensions without boiling anything down we'd
certainly prefer it.
One thing this exercise does illustrate is the
importance of reducing dimensions if you can.
Many techniques in machine learning, including SVMs,
begin by attempting to eliminate as many dimensions
-- in other words as many columns of data -- as possible
without sacrificing quality in the decision function.
In fact, recent research by BIOwulf mathematicians
has shown that coming up with which columns to throw out
is actually more critical than coming up with an otherwise
good decision function. So we'll look at tossing out
columns first.
Assume we have some sort of workable program for creating
decision functions from training data. The most
straightforward approach would be to try every possible
subset of columns and each time, produce the decision
function using the training data and then test it using
the testing data. The find the smallest subset that
gives right answers. Simple!
Unfortunately, it's also impossible in a practical sense,
because of something called the combinatorial explosion.
next page
Last update 26-May-2001 09:41 PM by ABS.