------------------------------------------------------------------------

                      BIOWULF'S MATHEMATICAL TOOLS:
                      -----------------------------

                  MACHINE LEARNING AND THE "SVM" METHOD
                  -------------------------------------

                              (PART 3 OF 3)
                              -------------

                              screenplay by

                            Alan B. Scrivener

                                                             26 May 2001
                                                             first draft

------------------------------------------------------------------------

70. CARD YELLOW LETTERS ON BLUE BACKGROUND:

   support vectors

         NARRATOR (CONTINUING)
      Now we get to the fundamental tool used by BIOwulf's
      mathematicians: support vectors.

71. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND

         NARRATOR (CONTINUING)
      And it's only fair to warn you, this is the weird part.
      What's weird about it is that it involves higher dimensions.
      Higher even than the 22,000 we mentioned before.

      But first, let me clear up some possible confusion about
      terminology.  We've been throwing a lot of specialized
      mathematical terms around here, sometimes with subtle
      distinctions.  Just like Eskimos have many words for snow,
      mathematicians have many words for dimensions.

72. CARD: TABLE T-6

rat # Carb. Fat Cholest. Sodium Protein Vitamin A Vitamin C Calcium Iron +/-
rat #1 0 g 0.5 g 30 mg 200 mg 13 g 0% 0% 0% 2% -
rat #2 20 g 0 g 0 mg 10 mg 1 g 0% 20% 0% 2% -
rat #3 25 g 2 g 0 mg 45 mg 0 g 0% 0% 0% 2% -
rat #4 4 g 0 g 0 mg 0 mg 1 g 4% 8% 2% 4% +
% = per cent daily human requirements


         NARRATOR (CONTINUING)
      Recall our table of which nutrients the rats were fed.  We
      called the rows in this table data points, points in the
      state space, vectors, and sometimes test data or training
      data.  We called the columns in this tables variables,
      dimensions and features.  They're dimensions when we want
      to represent them in a state space, and features when we
      want to eliminate them with recursive feature elimination.

      Now we're going to call them dimensions again as we do some
      math tricks with them.

73. ANIMATION A-9

[NOTE: THIS FOOTAGE HAS ALREADY BEEN PRODUCED BY BEOWULF]

NARRATOR (CONTINUING) Here is a 2D state space plot of some data. We'd like to be able to find a linear function decision function that cleanly separates the +'s from the -'s. That would mean finding a line that separates them in this plot, but there isn't one. Now you can see as the animation rotates to show a third dimension, that in a 3D state space plot there is a linear decision function that cleanly separates the space, which means in 3D that a plane can separate them, and you see that one can. 74. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) But what if 3 dimensions doesn't cut it? Well, we can just add more. The problem is, as we've mentioned, we can't you what that looks like. The best we can do is use some toothpicks and some miniature marshmallows to evoke the idea. Here I have one marshmallow. It stands for a one dimensional point. It has -- humor me on this -- no height, width or depth. I move the point in some direction and trace out a line segment. We can build that with a toothpick and another marshmallow. [CUT] 75. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR IS HOLDING MODEL OF LINE SEGMENT MADE OF TWO MARSHMALLOWS AND A TOOTHPICK NARRATOR (CONTINUING) So this represents a 1-dimensional line segment which has length, but no width or height. If I move it at right angles the same distance, it traces out a square. I can build that with two more marshmallows and three more toothpicks. [CUT] 76. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR IS HOLDING MODEL OF SQUARE MADE OF FOUR MARSHMALLOWS AND FOUR TOOTHPICKS NARRATOR (CONTINUING) This represents a 2-dimensional square which has length and width but no height. If I once again move it at right angles the same distance it traces out a cube. I can build a cube by adding four marshmallows and eight toothpicks. [CUT] 77. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR IS HOLDING MODEL OF CUBE MADE OF EIGHT MARSHMALLOWS AND TWELVE TOOTHPICKS NARRATOR (CONTINUING) This represents a three-dimensional cube which has length, width and height. Now we'd like to do this trick again but we can't -- we've run out of dimensions. I can even imagine what the next step would be. There are people who claim to be able to visualize higher dimensions, but other people wonder if they really can. Still, there is something we can do. We can study each transition from one dimension to the next, and then say what would happen by analogy. 78. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) So without a visual model to guide us, we can say: "Imagine moving the cube at right angles again, the same distance, and tracing out a hypercube. We 4 dimensions we could build one of those with eight more marshmallows and 20 more toothpicks." Now, I have no idea what that means, and we don't have that extra fourth dimension to try it in. But the point is, a lot can be figured out by analogy. We know how the math works at any dimension, up to infinite dimensions, believe it or not. And luckily, we don;t need to visualize the problem to solve it -- only to explain it! That's why we're better at solving it than explaining it. Okay, I've been jollying you along here with this higher dimension stuff to prepare you for this next statement. It probably won't make perfect sense, but it's the best we can do. The Support Vector Machine -- SVM -- method, works this way. It keeps adding dimensions until it can find a hyperplane to split the state space, which means there is a linear decision function. Then it finds all the data vectors that are closest to this hyper-plane. These are called the support vectors. Then the support vectors are used to do recursive feature elimination and to create the decision function. Let me run through that again. The SVM method keeps adding dimensions until it can find a hyper-plane -- a higher-dimensional plane -- to split the state space, which means there is a linear decision function. Then it finds all the data vectors that are closest to this hyper-plane. These are called the support vectors. Then the support vectors are used to do recursive feature elimination and to create the decision function. Now one thing that's interesting about this is that we started out wanting to throw away columns, and we did, but we also threw away rows! Any vectors which are not support vectors are thrown away and not used for the calculating the decision function. Okay, I know your mind is swirling. Maybe some pictures will help. 79. CARD: FIGURE F-11

NARRATOR (CONTINUING) The following images are taken from an interactive Java Applet written by Hong Zhang of BIOwulf. Now once again we have the advantages and disadvantages of being back in 2 dimensions. It's good because we can see the whole state space at once, but it's bad because it can't show any of the higher dimensional data. The way this Applet works is the user places data points in a white rectangle, black dots for the pluses and open dots for the minuses. Then the Applet chooses the support vectors and computes the decision function using the SVM method. Next the decision function is displayed as color, red being the low negative values and blue the high positive values. The data points chosen to be support vectors are circled. 80. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) This was for a linear decision function. Here's how it looks if we allow a non-linear decision function. 81. CARD: FIGURE F-12

NARRATOR (CONTINUING) You can see that the data won't permit a line to divide the state space, so a curve is needed. In this cave the curve is defined by a third order polynomial. This is an equation with no more complex terms than x cubed. 82. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) Let's take a look at an animated sequence of these. 83. CARD: FIGURE F-13 (f01.gif)

NARRATOR (CONTINUING) We started out with a grid of nine data points. The three upper rightmost were minuses while the rest were pluses. Three points near the dividing line were chosen by the Applet to be support vectors, and the decision function follows a slight curve down the middle, though it bends more elsewhere. Next we added another "plus" data point, near the center, just up and to the right of the center point. We let the Applet compute again, and... 84. CARD: FIGURE F-14 (f02.gif)

NARRATOR (CONTINUING) ... now you can see the Applet has chosen the new point to replace the center point as a support vector, and moved the decision function a little, but there are still only three support vectors. 85. CARD: FIGURE F-15 (f03.gif)

NARRATOR (CONTINUING) Now we add a few more data points, this time minuses, farther to the upper right of the center. We let the Applet compute once again ... 86. CARD: FIGURE F-16 (f04.gif)

NARRATOR (CONTINUING) ... and now you can see the Applet has added one of the new points as a support vector, and dropped the one in the upper left, but still has only three of them. Of course it has moved the decision function again. 87. CARD: FIGURE F-17 (f05.gif)

NARRATOR (CONTINUING) Here's another positive data point added... 88. CARD: FIGURE F-18 (f06.gif)

NARRATOR (CONTINUING) ... and it gets chosen to be a support vector. The decision function nows jumps more dramatically. 89. CARD: FIGURE F-19 (f07.gif)

NARRATOR (CONTINUING) Here's another positive data point added... 90. CARD: FIGURE F-20 (f08.gif)

NARRATOR (CONTINUING) ... and it gets chosen to be a support vector. The decision function is tightening up as we give it finer and finer distinctions. 91. CARD: FIGURE F-21 (f09.gif)

NARRATOR (CONTINUING) Now we've hopped down to region lower and right of center, and put another negative data point near the dividing line. 92. CARD: FIGURE F-22 (f10.gif)

NARRATOR (CONTINUING) ... and it gets chosen to be a support vector. But note that now there are four circled points -- four support vectors. The support vector machine needs more data points to define the decision function. 93. CARD: FIGURE F-23 (f11.gif)

NARRATOR (CONTINUING) Here's another negative data point added, advancing into what was clearly positive territory. 94. CARD: FIGURE F-24 (f12.gif)

NARRATOR (CONTINUING) ... and it gets chosen to be a support vector, replacing the previous point. But notice that the positive data point in the lower right has also bbeen made a support vector, totaling five. The decision function is really tightening up now. 95. CARD: FIGURE F-25 (f13.gif)

NARRATOR (CONTINUING) Once again we add another negative data point in what was previously positive territory. 96. CARD: FIGURE F-26 (f14.gif)

NARRATOR (CONTINUING) Woah! The decision function just got wider and more curved! And the number of support vectors has increased again -- now there are six. Each I time I add a point in the wrong territory I'm violating its assumptions, so it has to make a subtler decision function. 97. CARD: FIGURE F-27 (f15.gif)

NARRATOR (CONTINUING) Okay, we add one last negative data point the same way. 98. CARD: FIGURE F-28 (f16.gif)

NARRATOR (CONTINUING) ... and it gets chosen to be a support vector, along with another point -- now there are eight!. The decision function has twisted and widened to accommodate the more stringent data. 99. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) Well, we could go on like this all day. It is fascinating to watch, and even more fascinating to interact with the Applet yourself. But now lets look at something you don't see when you play with the Applet. The following animation shows those frames we just went through speeded up, with the compute time removed. 100. ANIMATION A-10

NARRATOR (CONTINUING) It's a thing of beauty. Sometimes the eye can pick out patterns in motion that we don't see any other way. Well, enough of that. 101. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) I hope this is giving you a sense of what the Support Vector Machine method does. Unfortunately, we can show you 2 dimensions but all the real benefits occur with much higher dimensions, which we can't show you. So we've decided to do what the folks at Hollywood special effects houses do -- make something up. The following animation isn't designed to be an exact mathematical representation, but instead is an "artist's conception." 102. ANIMATION A-11

[ARTISTS CONCEPTION OF SVM PROCESS] NARRATOR (CONTINUING) The sort of flowering of the data represents going into higher dimensions... and then you see it compacting down again when the support vectors are found. 103. MEDIUM SHOT: NARRATOR IN FRONT OF NEUTRAL BACKGROUND NARRATOR (CONTINUING) Well, I'm sure some of this made perfect sense, and some of it seemed like mumbo jumbo. It's that way to most of us, too. But the proof, they say, is in the pudding. Now that we've done a survey of the mathematical tools used by BIOwulf, let's ask the most important question: 104. CARD YELLOW LETTERS ON BLUE BACKGROUND: Does it work? NARRATOR (CONTINUING) Well, the short answer to this is: it sure does! In the paper "Gene Selection for Colon Classification using Support Vector Machines by Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik, the story is told of these 62 biopsies samples we mentioned earlier. They report that SVM is better than the next best known technique with 95.8% confidence. Furthermore, the SVM used 8 instead of 32 genes to make the classification, after recursive feature elimination, and still had a lower error rate. Remember that this was from frozen tissue samples that were combined with 30 years of follow-up patient data. This stuff can save lives! It was also interesting to find that other methods falsely associated the presence of skin cell genes with cancerous cells, just because cancerous tissue samples tended to have more skin cells while healthy samples had more muscle cells. The SVM method did not make this false correlation. Also, the SVM method did correlate lack cancer with the presence of genes from a tiny parasite, a protozoa named Trypanosoma, indigenous to Africa and South America. To quote the paper: "We thought this may be an anomaly of our gene selection method or an error in the data, until we discovered that clinical studies report that patients infected by Trypanosoma -- a colon parasite -- develop resistance against colon cancer." Clearly we need to put this new tool to work on medical diagnosis right away! Thank you for your time in viewing this explanatory video. For more information about BIOwulf, ______.
previous page

Last update 26-May-2001 09:41 PM by ABS.