------------------------------------------------------------------------

                      BIOWULF'S MATHEMATICAL TOOLS:
                      -----------------------------

                  MACHINE LEARNING AND THE "SVM" METHOD
                  -------------------------------------

                              (PART 1 OF 3)
                              -------------

                              screenplay by

                            Alan B. Scrivener

                                                             26 May 2001
                                                             first draft

------------------------------------------------------------------------

1. TITLE CARD

   YELLOW LETTERS ON BLUE BACKGROUND:

   BIOwulf's Mathematical Tools:
   Machine Learning and the "SVM" Method


         NARRATOR
      BIOwulf's mathematical tools: machine learning and the
      "SVM" method.

2. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Hi, I'm  _____ _____, and in the next twenty minutes I want to
      briefly explain some of the mathematical tools used by BIOwulf to
      program machine learning, especially the technique known as
      "Support Vector Machines," or "SVM" for short.

      We're assuming that you know the concepts of algebra;
      maybe you took calculus in college and then forgot it all,
      but you can still look at an equation like this...

3. CARD

   YELLOW LETTERS ON BLUE BACKGROUND:

   a x2 + b x + c = 0

         NARRATOR (CONTINUING)
      ... and you realize that x is the variable and a, b and c
      are constants.

4. CLOSE-UP: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Extra credit if you recognize it as "the quadratic equation."
      I'd also like to apologize in advance to those of you who are
      mathematicians, quantitative scientists or engineers-- for you
      this information may seem overly basic, at least at first.
      We want to make sure we bring everyone along.  But hang
      in there and I bet you'll learn something yet!

      And just so you are clear on the scope of this video, this is
      not to inform you of BIOwulf's business plans, key people,
      product offerings or marketing message.  That information
      is communicated in other ways.  This video will cover the
      following topics...

5. CARD:

   YELLOW LETTERS ON BLUE BACKGROUND:

   machine learning
   state space
   combinatorial explosion
   correlation
   orthogonal
   linear
   weights
   recursion
   recursive feature elimination
   support vectors
   Does it work?

         NARRATOR (CONTINUING)
      The concepts of:
      machine learning,
      a state space,
      a combinatorial explosion,
      what is correlation,
      what does orthogonal mean,
      linear functions,
      weights,
      recursion,
      recursive feature elimination, and lastly
      support vectors.
      Followed by a brief discussion of perhaps the most important
      question: "Does it work?"

6. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Now we're not going to explain how you can use these mathematical
      tools yourself -- we have some papers you should read if you want
      to get into that level.  We're just going to explain in broad
      outlines how they work.  This quote from Richard Feynman's
      brilliant little book on quantum electrodynamics...
         (HOLDS UP 'QED')
      -- QED is the title -- explains well what we are trying
      to do:

         (READING FROM THE BOOK)
      "How am I going to explain to you things I don't explain to
       my students until they are third year graduate students?
       Let me explain it by analogy.  The Maya Indians were very
       interested in the rising and setting of Venus as a morning
       star and as an evening star -- they were very interested
       in when it would appear.  After some years of observation
       they noted that five cycles of Venus were very nearly equal
       to eight of their years of 365 days.  To make calculations,
       the Maya had invented a system of bars and dots to represent
       their numbers -- including zero -- and had rules by which to
       calculate and predict not only the risings and settings of
       Venus, but of other celestial phenomena, such as Lunar
       eclipses.

      "In those days, only a few Maya priests could do such elaborate
       calculations.  Now, suppose we were to ask one of them how to
       do just one step in the predicting when Venus will next rise
       as a morning star -- subtract two numbers.  How would the
       priest explain what subtraction is?

      "He could either teach us the numbers represented by the bars
       and dots and the rules for subtracting them, or he could
       tell us what he was really doing: 'Suppose we want to
       subtract 236 from 584.  First count out 584 beans and put
       them in a pot.  The take out 236 beans and put them to one
       side.  Finally, count the beans left in the pot.  That number
       is the result of subtracting 236 from 584.'

      "You might say, 'What tedium -- counting beans, putting
       them in, taking them out -- what a job!'

      "To which the priest would reply, 'That's why we have the
       rules for the bars and dots.  The rules are tricky, but
       they are a much more efficient way of getting the answer
       than by counting beans.  The important thing is, it makes
       no difference as far as the answer is concerned: we
       can predict the appearance of Venus by counting beans (which
       is slow, but easy to understand) or by using tricky rules
       (which is much faster, but you must spend years in school
       to learn them).'"

         (LOWERS BOOK)
      Got it?  Okay, let's get started.

7. CARD: YELLOW LETTERS ON BLUE BACKGROUND:

   machine learning

8. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR
      I want to make sure you aren't confused by the word 'machine'
      here.  What we mean is a algorithm, or precise technique,
      for solving a problem.  The mathematical literature is full
      of abstract machines, like the Markhov Machine and the Turing
      Machine, which are just mathematical models.  There usually
      isn't a special physical machine built, it's just simulated
      by programming a digital computer.  In the case of machine
      learning, we're talking about writing a computer program
      that can learn rules by studying example data.

      Lets really spell this out, because this is the problem
      definition.  I'm going to give you two made-up examples, just
      because they're easy to talk about, and then a third example
      that is the kind of work BIOwulf actually does with machine
      learning.

      First, imagine that you've discovered some unusual kind of
      rat nobody knows anything about, and you want to know what
      to feed it.  Fortunately you have some recent dietary
      experiment results for these rats. Some students had a bunch
      of rats and for each one they randomly chose a particular
      canned food from a grocery store and fed it only that food
      daily for a month.
         (HOLDS UP A CAN WITH NUTRITIONAL CONTENT LABEL)
      They added up the nutritional contents
      on the sides of the cans...

9. CLOSE UP: NUTRITIONAL LABEL

         NARRATOR (CONTINUING)
      ...you know, the amount of vitamins, minerals, proteins,
      fat and so on...

10. CARD: TABLE T-1

rat # Carb. Fat Cholest. Sodium Protein Vitamin A Vitamin C Calcium Iron +/-
rat #1 0 g 0.5 g 30 mg 200 mg 13 g 0% 0% 0% 2% -
rat #2 20 g 0 g 0 mg 10 mg 1 g 0% 20% 0% 2% -
rat #3 25 g 2 g 0 mg 45 mg 0 g 0% 0% 0% 2% -
rat #4 4 g 0 g 0 mg 0 mg 1 g 4% 8% 2% 4% +
% = per cent daily human requirements


         NARRATOR (CONTINUING)
      .. and for each rat, they made a list of all of the nutrients
      it received, and a notation as to whether it lived or died.
      You can see on the list here, each row is a rat.  The leftmost
      column gives the rat number, the columns in the middle are the
      nutrients, and the last column is a plus if the rat lived and
      a minus if it died.

11. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR
      The challenge then is to figure out from the list -- and only
      from the list -- what is the best diet to feed the rats.
      And to be able to take a proposed diet and predict whether
      a rat will live or die on it.

      Secondly, imagine that you are trying to come up with a formula
      for deciding whether or not to grant a credit card application.
      A credit card company has historical information on people's credit
      applications, and then whether the company made a profit on the
      customer over a 10 year period.  The idea is to use only the
      information on the credit application to predict customer
      profitability.

12. CARD: TABLE T-2

customer # income debt credit equity savings +/-
customer #1 $27,000 $3,600 $5,400 $4,900 $900 +
customer #2 $100,000 $500,400 $50,500 $700 $50 -
customer #3 $65,000 $5,600 $4,400 $10,500 $5,000 +


         NARRATOR (CONTINUING)
      For each customer, the customer number is listed, followed by
      income, debt, credit, equity and savings.  The last column on
      the right contains a plus of the customer was profitable -- meaning
      they ultimately paid their bills--  and a minus if they weren't
      -- if they burned the company.

      So once again the goal is to come up with a formula, just
      using the information on the list, which will predict the
      customer's profitability. And I should emphasize, we're not
      talking about a programmer or mathematician coming in and
      looking at this list and then producing the correct formula.

13. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      We want to produce the formula automatically from the numbers
      in the list.  In other words, we need a program that gets
      trained by the data and then can classify new data.

      The third example is not made up, but represents actual
      research done by BIOwulf's mathematicians.

      We were fortunate to receive 62 frozen tissue samples from
      ____ hospital that were biopsies of patients with suspected
      colon cancer.  Along with the samples was a file on each
      patient covering 30 years of follow-up testing, so that it
      was possible for each sample to know the "right answer" in
      advance -- whether the patient got colon cancer or not.

      With modern genetic analysis techniques we were able to
      subject each sample to 2000 gene probes, to determine for
      each of 2000 genes whether that gene was silent, active or
      hyperactive in the sample.  In each case this was turned
      into a number from zero to one.

14. CARD: TABLE T-3

patient # income gene #1 gene #2 gene #3 gene #4 *** +/-
patient #1 0.56 0.45 0.55 0.21 0.98 *** -
patient #2 0.66 0.89 0.47 0.82 0.38 *** +
patient #3 0.01 0.09 0.66 0.63 0.54 *** +


         NARRATOR (CONTINUING)
      You can see that for each patient we have a row of data on
      which genes were expressed, and by how much, and in the rightmost
      column a plus sign if the patient got colon cancer and a minus
      if they didn't.

15. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      I don't have to remind you that the once again the goal is
      to produce a formula -- what we call the Decision Function
      -- to sort patients correctly into  those who will and won't
      get the cancer, using only the information on the list.

      Now you might wonder, why is this so important?  Why can't
      we use additional information?  For example, looking at the
      rat diets again...

16. CARD: TABLE T-4

rat # Carb. Fat Cholest. Sodium Protein Vitamin A Vitamin C Calcium Iron +/-
rat #1 0 g 0.5 g 30 mg 200 mg 13 g 0% 0% 0% 2% -
rat #2 20 g 0 g 0 mg 10 mg 1 g 0% 20% 0% 2% -
rat #3 25 g 2 g 0 mg 45 mg 0 g 0% 0% 0% 2% -
rat #4 4 g 0 g 0 mg 0 mg 1 g 4% 8% 2% 4% +
% = per cent daily human requirements


         NARRATOR (CONTINUING)
      From this admittedly small sample rat #4 is the only survivor.
      It might help to know that rat #4 was fed canned string beans.
      It is also tempting to make guesses based on human nutritional
      needs, or -- even better -- the needs of other known rats.

17. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      But there is a big problem with this.  You will recall that
      there are 2000 genes represented in the tissue samples.
      For some of the experiments we haven't talked about yet
      there can be up to 22,000 columns of data.

      It was Lenin who said, "Quantity has a quality all its own."
      Human intuition and judgment are very good at operating on
      dozens or maybe even sometimes hundreds of factors, but
      thousands of them overwhelm us.

      So we put all the information we think might be relevant in
      the table.  Age, weight, number of known relatives with
      similar cancers, all could be added to the patient data.
      But then we have to let a computer program do the learning
      and the deciding, because the problem is simply too big for
      the human mind.  One of the things we want the computer to do
      is tell us which columns we can toss out -- which ones aren't
      needed to create the decision function.

18. FIGURE F-1

_


      So here in a nutshell is how machine learning is done.
      You have some data like the examples above where you know
      the right answer.  Typically, you break these into two groups,
      a training group and a testing group.  You run all the
      training data through a machine learning program and it
      "learns" by coming up with a formula for classifying the
      data, which we call a decision function. Built into that
      function is the information about which columns n the table
      which to throw out.

      Next you run the testing data through the decision function
      and see if it gets the right answer.  When it all seems to
      be working correctly you can real data through -- in other
      words, data without the pluses and minuses, without the
      right answers supplied.  That's when you have to wait 30
      days and see if the rat dies, or 10 years to see if the
      credit card customer pays their bills, or 30 years to see
      if the patient gets colon cancer.

19. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      The only other complication is that usually you really don't just
      want pluses and minuses, it would be nice to know how sure
      you are, so the decision function usually returns a number.  If
      the number is positive, that's a plus, and if it's negative,
      that's a minus, and the magnitude tells us the confidence.

      So there you have machine learning in a nutshell -- at least
      the problem definition.  The hard work comes in trying to
      design a program which can turn the training examples into the
      decision function, and that's what the rest of this video is
      about.
.
      For the last 50 years or so the most popular traditional approach
      to the machine learning problem has been to use neural networks...

20. FIGURE F-2: PHOTO OF NEURONS

_


         NARRATOR (CONTINUING)
      ... This involves simulating the actual neurons in the human
      nervous system, and this approach has done a pretty good job.
      But as we will see, the "SVM" method does better in many cases.

21. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      But before I can explain the SVM method, I need to introduce a
      few more concepts.

22. CARD YELLOW LETTERS ON BLUE BACKGROUND:

   state space

         NARRATOR (CONTINUING)
      State space.

23. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Normally when we use the word space we're talking about the
      space that fills the universe we live in.  We usually call
      it a three-dimensional space, because we can specify a point
      in it with three coordinates.  This GPS...

24. CLOSE-UP: NARRATOR HOLDING UP GPS

         NARRATOR (CONTINUING)
      ... a Global Positioning System receiver, tells me my latitude
      north or south, my longitude east or west, and my elevation.
      This completely defines my position in 3D space.

      Sometimes people say time is the fourth dimension, so the GPS
      also tells time, which locates my in the 4th dimension as well.

25. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      But mathematicians like to think about lots of other
      spaces besides physical space.  A state space is an imaginary
      space where each point represents a possible state of a system.

      They have lots of uses, but the way we're going to use them
      here is to represent rows in out data lists.  To start with,
      let's pretend we only have two columns of ____ data for each
      row.  That will allow us to represent each row -- each rat
      or credit card applicant or patient -- as a point in a
      two-dimensional state space.

26. CARD: FIGURE F-3

NARRATOR (CONTINUING) Here you can see three points represented, colored red, green and yellow just to tell them apart. So this is a graphical display of three rows of data. 27. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) If we add another column, so we have three columns of data, now each row can be a point in 3D space... 28. ANIMATION A-1





NARRATOR (CONTINUING) This animation shows the space turning so we can get views from different angles. One of the problems of the 3D view is can be hard to see what's behind. Using time as an added dimension and turning a model can help solve this problem if we only want to see a few points in the state space. But if we want to plot every possible point, we can't see a whole volume of points directly, any more that we can see inside a stack of pingpong balls.. A more fundamental problem is that -- remember -- we have up to 22,000 dimensions in our data. We're already hitting some problems with 3 dimensions, and we've run out. Clearly we can't directly visualize higher-dimensional state spaces. But there are some tricks we can do, as we'll see later. Still another problem is that we want to be able to display our decision function, the output of the Support Vector Machine method. It's just a one-dimensional function, but that does use a dimension. So the most common approach is like the one in this figure... 29. CARD: FIGURE F-4

NARRATOR (CONTINUING) ... two dimensions are used for the state space and one is used for the decision function. Here the horizontal dimensions represent the state space, like latitude and longitude on a map, and the third dimension, the altitude, is the decision function. The best choice is the lowest in this representation, so we're looking for the bottom of the bowl. But there is a problem with seeing the whole state space and decision function at once. Because paper is flat and screens are flat, it is most common to display a 2D state space in a 2D image, with color as the third dimension. 30. FIGURE F-5


from http://www.agso.gov.au/geophysics/date050397.html


         NARRATOR (CONTINUING)
      That's how the government of Australia produced this image.
      Ranges of altitudes are shown, starting at sea level, as blue,
      light blue, green, yellow, orange, red, violet and gray for
      the highest.

      [PAUSE]

31. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Another technique, which we're not going to show,
      is to like a topographic map, using lines of constant
      value -- called isolines -- to show height or other values.
      
      So let's see some decision function data plotted this way.

32. FIGURE F-6




         NARRATOR (CONTINUING)
      Here is a decision function for some 2D data, where
      dark blue is for the lowest negative values and dark
      red is for the highest positive values.  The black
      line is where the decision function is zero, and so the
      data is "right on the line" between +, live rat or good
      credit, and -, dead rat or bad credit.

33. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      Remember, we're trying to show some analogies here by
      boiling 22,000 dimensions of data down to 2, and then
      using the third for the decision function.  And I want to
      emphasize that we do this not because it's a great ideas,
      but because we don't know of a better way to display
      these fundamental concepts.  If we knew how to display
      the 22,00 dimensions without boiling anything down we'd
      certainly prefer it.

      One thing this exercise does illustrate is the
      importance of reducing dimensions if you can. 
      Many techniques in machine learning, including SVMs,
      begin by attempting to eliminate as many dimensions
      -- in other words as many columns of data -- as possible
      without sacrificing quality in the decision function.
      In fact, recent research by BIOwulf mathematicians
      has shown that coming up with which columns to throw out
      is actually more critical than coming up with an otherwise
      good decision function.  So we'll look at tossing out
      columns first.

      Assume we have some sort of workable program for creating
      decision functions from training data.  The most
      straightforward approach would be to try every possible
      subset of columns and each time, produce the decision
      function using the training data and then test it using
      the testing data.  The find the smallest subset that
      gives right answers.  Simple!

      Unfortunately, it's also impossible in a practical sense,
      because of something called the combinatorial explosion.

next page

Last update 26-May-2001 09:41 PM by ABS.