- 1. Shift-Invariant Linear Systems
- Shift-Invariant Linear Systems: Definitions
- Convolution
- Convolution and Harmonic Functions
- The Discrete Fourier Series: Definitions
- The Discrete Fourier Series: Properties
- Measuring a Convolution System using Harmonic Functions
- 2. Display Calibration
- An Overview of a CRT Display
- The frame-buffer
- The Display Intensity
- The Display Spectral Power Distribution
- The Calibration Matrix
- Example Calculations
- Color Calibration Tips
- 3. Classification
- A Bayes classifier for an Intensity Discrimination Task
- A Bayes classifier for a Pattern Discrimination Task
- 4. Signal Estimation: A Geometric View
- 5. Motion Flow Field Calculation

# 1. Shift-Invariant Linear Systems

Many of the ideas in this book rely on properties of shift-invariant linear systems. In the text, I introduced these properties without any indication of how to prove that they are true. I have two goals for this section. First, I will sketch proofs of several important properties of shift-invariant linear systems. Second, I will describe *convolution* and the *Discrete Fourier Series*, two tools that help us take advantage of these properties ^{1}.

^{1 }The results in this appendix are expressed using real harmonic functions. This is in contrast to the common practice of using complex notation, specifically Euler’s complex exponential, as a shorthand to represent sums of harmonic functions. The exposition using complex exponentials is brief and elegant, but I believe it obscures the connection with experimental measurements. For this reason, I have avoided the development based on complex notation.

## Shift-Invariant Linear Systems: Definitions

Shift-invariance is a system property that can be verified by experimental measurement. For example, in Chapter 2 I described how to check whether the optics of the eye is shift-invariant by the following measurements. Image a line through the optics onto the retina, and measure the linespread function. Then, shift the line to a new position, forming a new retinal image. Compare the new image with the original. If the images have the same shape, differing only by a shift in position, then the optics is shift-invariant over the measured range.

We can express the empirical property of shift-invariance using the following mathematical notation. Choose a stimulus, and measure the system response, . Now, shift the stimulus by an amount and measure again. If the response is shifted by as well, then the system may be shift-invariant. Try this experiment for many values of the shift parameter, . If the experiment succeeds for all shifts, then the system is shift-invariant.

If you think about this definition as an experimentalist, you can see that there are some technical problems in making the measurements needed to verify shift-invariance. Suppose that the original stimulus and response are represented at distinct points, . If we shift the stimulus three locations so that now the fourth location contains the first entry, the fifth the second, and so forth, how do we fill in the first three locations in the new stimulus? And, what do we do with the last three values, , which have nowhere to go?

Theorists avoid this problem by treating the real observations as if they are part of an infinite, periodic set of observations. They assume that the stimuli and data are part of an infinite periodic series with a period of , equal to the number of original observations. If the data are infinite, the first three entries of the shifted vector are the three values at locations . If the data are periodic with period , these three values are the same as the values at .

The assumption that the measurements are part of an infinite and periodic sequence permits the theorist to avoid the experimentalist’s practical problem. The assumption is also essential for obtaining several of the simple closed-form results concerning the properties of shift-invariant systems. The assumption is not consistent with real measurements since real measurements cannot be made using infinite stimuli: there is always a beginning and an end to any real experiment. As an experimentalist you must always be aware that many theoretical calculations using shift-invariant methods are not valid near the boundaries of data sets, such as near the edge of an image.

Suppose we refer to the finite input as, , and the measured output, , is finite. In the theoretical analysis we extend both of these functions to be infinite and periodic. We will use a hat symbol to denote the extended functions,

(1)

The extended functions and agree with our measurements over the measurement range from . By the periodicity assumption, the values outside of the measurement range are filled in by looking at the values within the measurement range. For example,

## Convolution

Next, we derive some of the properties of linear shift-invariant systems. We begin by describing these properties in terms of the *system matrix* (see Chapter 2). Then, we will show how the simple structure of the shift-invariant system matrix permits us to relate the input and output by a summation formula called *cyclic convolution*. The convolution formula is so important that shift-invariant systems are sometimes called *convolution systems *^{2}.

^{2 }Since we use only cyclic convolution here, I will drop the word “cyclic” and refer to the formula simply as convolution. This is slightly abusive, but conforms to common practice in many fields.

In Chapter 2 I reviewed how to measure the system matrix of an optical system for one-dimensional input stimuli. We measure the image resulting from a single line at a series of uniformly spaced input locations. If the system is shift-invariant, then the columns of the system matrix are shifted copies of one another (except for edge artifacts). To create the system matrix, we extend the inputs and outputs to be periodic functions (Equation~1). Then, we select a central block of size to be the system matrix and we use the corresponding entries of the extended stimulus. For example, if the input stimulus consists of six values, , and the response to this stimulus is the vector, , then the system matrix is

(2)

For a general linear system, we calculate the output using the summation formula for matrix multiplication in Equation~??,

(3)

When the linear system is shift-invariant system, this summation formula simplifies for two reasons. First, because of the assumed periodicity, the summation is precisely the same when we sum over any consecutive integers. It is useful to incorporate this generalization into the summation formula as

(4)

where the notation means that summation can take place over any consecutive integers. Second, notice that for whichever block of values we choose, the typical entry of the system matrix will be

(5)

We can use this relationship to simplify the summation further,

(6)

In this form, we see that the response depends only on the input and the linespread. The summation formula in Equation~6 is called *cyclic convolution*. Hence, we have shown that to compute the response of a shift-invariant linear system to any stimulus, we need measure only the linespread function.

## Convolution and Harmonic Functions

Next, we study some of the properties of the convolution formula. Most important, we will see why harmonic functions have a special role in the analysis of convolution systems.

Beginning with our analysis of optics in Chapter 2, we have relied on the fact that the response of a shift-invariant system to a harmonic function at frequency is also a harmonic function at . In that chapter, the result was stated in two equivalent ways.

- If the input is a harmonic at frequency , the output is a shifted and scaled copy of the harmonic.
- The response to a harmonic at frequency will be the weighted sum of a sinusoid and cosinusoid at the same frequency (Equation~??).

We can derive this result from the convolution formula. Define a new variable, , and substitute into Equation~6. Remember that the summation can take place over any adjacent values. Hence, the substitution yields a modified convolution formula,

(7)

Next, we use the convolution formula in Equation~7 to compute the response to a sinusoidal input . From trigonometry we have that

(8)

Substitute Equation~8 into Equation~7, remembering that and .

(9)

We can simplify this expression to the form

(10)

where

(11)

We have shown that when the input to the system is a sinusoidal function at frequency , the output of the system is the weighted sum of a sinusoid and a cosinusoid, both at frequency . This is equivalent to showing that when the input is a sinusoid at frequency , the output will be a scaled and shifted copy of the input, (see Equation~??). As we shall see below, it is easy to generalize this result to all harmonic functions.

## The Discrete Fourier Series: Definitions

In general, when we measure the response of a shift-invariant linear system we measure output values. When the input is a sinusoid, or more generally a harmonic, we can specify the response using only the two numbers, and , in Equation~10. We would like to take advantage of this special property of shift-invariant systems. To do so, we need a method of representing input stimuli as the weighted sum of harmonic functions.

The method used to transform a stimulus into the weighted sum of harmonic functions is called the *Discrete Fourier Transform* (DFT). The representation of the stimulus as the weighted sum of harmonic functions is called the *Discrete Fourier Series* (DFS). We use the DFS to represent an extended stimulus, .

(12)

We are interested in that part of the extended stimulus that coincides with our measurements. We can express the relationship between the harmonic functions and the original stimulus, , using a matrix equation

(13)

which has the matrix tableau form

(14)

The vectors and contain the coefficients and , respectively. The columns of the matrices and contain the relevant portions of the cosinusoidal and sinusoidal terms, and , that are used in the DFS representation.

The DFS represents the original stimulus as the weighted sum of a set of harmonic functions (i.e., sampled sine and cosine functions). We call these sampled harmonic functions the *basis functions* of the DFS representation. The vectors and contain basis functions *weights* or *coefficients* that specify how much of each basis function must be added in to recreate the original stimulus, .

## The Discrete Fourier Series: Properties

Figure A1.1 shows the sampled sine and cosine functions for a period of . The functions are arrayed in a circle to show how they relate to one another. There are a total of 16 basis functions. But, as you can see from Figure A1.1, they are redundant. The sampled cosinusoids in the columns of repeat themselves (in reverse order); for example, when the cosinusoids for are the same as the cosinusoids for . The sampled sinusoids in the columns of also repeat themselves except for a sign reversal (multiplication by negative one). There are only four independent sampled cosinusoids, and four independent sampled sinusoids. As a result of this redundancy, neither the nor the matrix is invertible.

Nevertheless, the properties of these harmonic basis functions make it simple to calculate the vectors containing the weights of the harmonic functions from the original stimulus, . To compute and , we multiply the input by the basis functions, as in

(15)

We can derive the relationship in Equation~15 from two observations. First, the matrix sum, has a simple inverse. The columns of are orthogonal to one another, so that the inverse of is simply ^{3}.

^{3 }This observation is the basis of another useful linear transform method called theHartley Transform(Bracewell, 1986).

Second, the columns of the matrices and are perpendicular to one another: . This observation should not be surprising since continuous sinusoids and cosinusoids are also orthogonal to one another.

We can use these two observations to derive Equation~15 as follows. Express the fact that and are inverses as follows:

(16)

where is the identity matrix. Then, multiply both sides of Equation~16 by ,

%cn cn’ stim + sn sn’ stim = stim.

(17)

Compare Equation~17 and Equation~13. Notice that the Equations become identical if we make the assignments in Equation~15. This completes the sketch of the proof.

## Measuring a Convolution System using Harmonic Functions

Finally, we show how to predict the response of a shift-invariant linear system to any stimulus using only the responses to unit amplitude cosinusoidal inputs. This result is the logical basis for describing system performance from measurements of the response to harmonic functions. When the system is linear and shift-invariant, its responses to harmonic functions is a complete description of the system; but, this is not true for arbitrary linear systems.

Because cosinusoids and sinusoids are shifted copies of one another, the response of a shift-invariant linear system to these functions is the same except for a shift. From a calculation like the one in Equation~8, except using a cosinusoidal input, we can calculate the following result: If the response to a cosinusoid at is a sum of cosinusoid and sinusoid with weights, , then, the response to a sinusoid at frequency will have weights . Hence, if we know the response to a cosinusoid at , we also know the response to a sinusoid at .

Next, we can use our knowledge of the response to sinusoids and cosinusoid at to predict the response to any harmonic function at . Suppose that the input is a harmonic function , and the output is, say . If the response to a unit amplitude cosinusoid is , then the response to a unit amplitude sinusoid is . Using these two facts and linearity we calculate the coefficients of the response,

(18)

We have shown that if we measure the system response to unit amplitude cosinusoidal inputs, we can compute the system response to an arbitrary input stimulus as follows.

- Compute the DFS coefficients of the input stimulus using Equation~15.
- Calculate the DFS coefficients of the output using Equation~18.
- Reconstruct the output from the coefficients using Equation~12.

You will find this series of calculations used implicitly at several points in the text. For example, we followed this organization when I described measurements of the optical quality of the lens (Chapter 2) and when I described measurements of behavioral sensitivity to spatiotemporal patterns (Chapter 7).

# 2. Display Calibration

Visual displays based on a *cathode ray tube* (CRT) are used widely in business, education and entertainment. The CRT reproduces color images using the principles embodied in the color-matching experiment (Chapter 4).

The design of the color CRT is one of the most important applications of vision science; thus, it is worth understanding the design as an engineering achievement. Also, because the CRT is used widely in experimental vision science, understanding how to control the CRT display is an essential skill for all vision scientists. This appendix reviews several of the principles of monitor calibration.

## An Overview of a CRT Display

Figure A2.1a shows the main components of a color CRT display. The display contains a *cathode*, or *electron gun*, that provides a source of electrons. The electrons are focused into a beam whose direction is deflected back and forth in a raster pattern so that it scans the faceplate of the display.

Light is emitted by a process of absorption and emission that occurs at the faceplate of the display. The faceplate consists of a phosphor painted onto a glass substrate. The phosphor absorbs electrons from the scanning beam and emits light. A signal, generally controlled from a computer, modulates the intensity of the electron beam as it scans across the faceplate. The intensity of the light emitted by the phosphor at each point on the faceplate depends on the intensity of the electrons beam as it scans past that point.

Monochrome CRTs have a single electron beam and a single type of phosphor. In monochrome systems the control signal only influences the intensity of the phosphor emissions. Color CRTs use three electron beams; each stimulates one of three phosphors. Each of the phosphors emits light of a different spectral power distribution. By separately controlling the emissions of the three types of phosphors, the user can vary the spectral composition of the emitted light. The light emitted from the CRT is always the mixture of three primary lights from the three phosphors, usually called the red, green and blue phosphors.

In order that each electron beam stimulate emissions from only one of the three types of phosphors, a metal plate, called a *shadow-mask*, is interposed between the three electron guns and the faceplate. A conventional shadow-mask is a metal plate with a series of finely spaced holes. The relative positions of the holes and the electron guns are arranged, as shown in Figure A2.1b, so that as the beam sweeps across the faceplate the electrons from a single gun that pass through a hole are absorbed by only one of the three types of phosphors; electrons from that gun that would have stimulated the other phosphors are absorbed or scattered by the shadow-mask.

## The frame-buffer

In experiments, the control signal sent to the CRT display is usually created using computer software. There are two principal methods for creating these control signals. In one method, the user controls the three electron beam intensities by writing out the values of three matrices into three separate *video frame-buffers*. Each matrix specifies the control signal for one of the electron guns. Each matrix entry specifies the desired voltage level of the control signal at a single point on the faceplate. Usually, the intensity levels within each matrix are quantized to 8 bits, so this computer display architecture is called *24 bit color*, or *RGB color*.

In a second method, the user writes a single matrix into a single frame-buffer. The value at each location in this matrix is converted into three control signals sent to the display according to a code contained in in a *color look-up table*. This architecture is called *indexed color*. This method is cost-effective because it does away with two of the three frame-buffers. When using this method, the user can only select among 256 (8 bits) colors when displaying a single image.

## The Display Intensity

Calibrating a visual display means measuring the relationship between the frame-buffer values and the light emitted by the display. In this section we will discuss the relationship between the frame-buffer values and the intensity of the emitted light. In the next section we will discuss the relationship between the frame-buffer values and the spectral composition of the emitted light.

We can measure the relationship between the value of a frame-buffer entry and the intensity of the light emitted from the display as follows: Set the frame buffer entries controlling one of the three hosphor display intensities, say within a rectangular region, to a single value. Measure the intensity of the light emitted from the displayed rectangle. Repeat this measurement at many different frame-buffer values for this phosphor, and then for the other two phosphors.

The dashed curve in Figure A2.2 measures the ratio of the intensity at the highest frame-buffer level to the intensity at each of the other frame-buffer levels for the green phosphor. We can summarize the difference using this single ratio, the *relative intensity* because for over most of the range the spectral power distribution of the light emitted at one frame-buffer level is the same as the spectral power distribution of the light emitted from the monitor at maximum except for a scale factor. The insets in the graph show two examples of the spectral power distribution, measured when the frame-buffer was set to 255 and 190. These two curves have the same overall shape; they differ by a scale factor of one half.

We can approximate the curve relating the relative intensity of the light emitted from this CRT display, , and the frame-buffer value, , by a function of the form

where and are two fitting parameters. For most CRTs, the exponent of the power function, , has a value near (see Brainard, 1989; Berns et al., 1993ab).

The nonlinear function relating the frame-buffer values and the relative intensity is due to the physics of the CRT display. While nonlinear functions are usually viewed with some dread, this particular nonlinear relationship is desirable because most users want the frame-buffer values to be linear with the brightness; they don’t care how the frame-buffer values relate to intensity. As it turns out, perceived brightness, , is related to intensity through a power law relationship called *Stevens’ Power Law* (Stevens, 1962; Goldstein, 1989; Sekuler and Blake, 1985), namely,

where is a fitting parameter. Somewhat fortuitously, the nonlinear relationship between frame-buffer values and intensity compensates for the nonlinear relationship between intensity and brightness. To show this, I have replotted Figure 2.2a on a graph whose vertical scale is brightness, that is relative intensity raised to the power. This graph shows that the relationship between the frame-buffer value and brightness is nearly linear. This has the effect of equalizing the perceptual steps between different levels of the frame-buffer and simplifying certain aspects of controlling the appearance of the display.

## The Display Spectral Power Distribution

The spectral power distribution of the light emitted in each small region of a CRT display is the mixture of the light emitted by the three phosphors. Since the mixture of lights obeys superposition, we can characterize the spectral power distributions emitted by a CRT using simple linear methods.

Suppose we measure the spectral power distribution of the red, green and blue phosphors at their maximum intensity levels. We record these measurements in three column vectors, .

By superposition, we know that light from the monitor screen is always the weighted sum of light from the three phosphors. For example, if all three of the phosphors are set to their maximum levels, the light emitted from the screen, call it , will have a spectral power distribution of

More generally, if we set the phosphors to the three relative intensities , the light emitted by the CRT will be the weighted sum

Figure A2.3 is a matrix tableau that illustrates how to compute the spectral power distribution of light emitted from a CRT display. The vector contains the relative intensity of each of the three phosphors. The three columns of a matrix, call it , contain the spectral power distributions of the light emitted by the red, green and blue phosphors at maximum intensity. The spectral power distribution of light emitted from the monitor is the product .

The light emitted from a CRT is different from lights we encounter in natural scenes. For example, the spectral power distribution of the red phosphor, with its sharp and narrow peaks, is unlike the light we see in natural images. Nonetheless, we can adjust the three intensities of the three phosphors on a color CRT to match the appearance of most spectral power distributions, just as we can match appearance in the color-matching experiment by adjusting the intensity of three primary lights (see Chapter 4).

## The Calibration Matrix

In many types of psychophysical experiments, we must be able to specify and control the relative absorption rates in the three types of cones. In this section, I show how to measure and control the relative cone absorptions when the phosphor spectral power distributions and the relative cone photopigment spectral sensitivities are known.

We use two basic matrices in these calculations. We have already created the matrix whose three columns contain the spectral power distributions of the phosphors at maximum intensity. We also create a matrix, , whose three columns contain the cone absorption sensitivities (, , and ), measured through the cornea and lens (see Table~??.) Given a set of relative intensities, , we calculate the relative cone photopigment absorption rates, by the matrix product ^{4}

(19)

We call the matrix , the monitor’s *calibration matrix.* This matrix relates the linear phosphor intensities to the relative cone absorption rates.

As an example, I have calculated the calibration matrix for the monitor whose phosphor spectral power distributions are shown in Figure A2.3, and for the cone absorption spectra listed in Table~??. The resulting calibration matrix is

Each column of the calibration matrix describes the relative cone absorptions caused by emissions from one of the phosphors: The absorptions from the red phosphor are in the first column, and absorptions due to the green and blue phosphors are in the second and third columns, respectively.

Suppose the red phosphor stimulated only the cones, the green the cones, and the blue the cones. In that case, the calibration matrix would be diagonal, and we could control the absorptions in a single cone class by adjusting only one of the phosphor emissions. In practice, the light from each of the CRT phosphors is absorbed in all three cone types and the calibration matrix is never diagonal. As a result, to control the cone absorptions we must take into account the effect each phosphor has on each of the three cone types. This complexity is unavoidable because of the overlap of the cone absorption curves and the need to use phosphors with broadband emissions to create a bright display. Consequently, the calibration matrix is never diagonal.

^{4 }This calculation applies to spatially uniform regions of a display, in which we can ignore the complications of chromatic aberration. Marimont and Wandell (1994) describe how to include the effects of chromatic aberration.

## Example Calculations

Now, we consider two ways we might use the calibration matrix. First, we might wish to calculate the receptor absorptions to a particular light from the CRT. Second, we might wish to know how to adjust the relative intensities of the CRT in order to achieve a particular pattern of cone absorptions.

Using the methods described so far, you can calculate the relative cone absorption rates for any triplet of frame-buffer entries. Suppose the frame-buffer values are set to . The pattern will look yellow since only the red and green phosphors are excited. Assuming that the curves relating relative intensity to frame-buffer level are the same for all three phosphors (Figure A2.2), we find that the relative intensities will be . The product yields the relative cone absorption rates . If we set the frame-buffer to , which appears gray, the relative intensities are and the relative cone absorption rates are .

A second common type of calculation is used in behavioral experiments in which the experimenter wants to create a particular pattern of cone absorptions. To infer the relative display intensities necessary to achieve a desired effect on the cone absorptions, we must use the inverse of the calibration matrix, since

To continue our example, the inverse of the calibration matrix is

Suppose we wish to display a pair of stimuli that differ only in their effects on the cones. Let’s begin with a stimulus whose relative intensities are . Using the calibration matrix, we calculate that the relative cone absorption rates from this stimulus: they are . Now, let’s create a second stimulus with a slightly higher rate of cone absorptions: say, . Using the inverse calibration matrix, we can calculate the relative display intensities needed to produce the second stimulus: they are .

Notice that to create a stimulus difference seen only by the cones, we needed to adjust the intensities of all three phosphor emissions. For example, to increase the cone absorptions we need to increase the intensity of the blue phosphor emissions. This will also cause some increases in the and cone absorptions, and we must compensate for this by decreasing the emissions from the red and green phosphors.

## Color Calibration Tips

For most of us calibration is not, in itself, the point. Rather, we calibrate displays to describe and control experimental stimuli. If your experiments only involve a small number of stimuli, then it is best to calibrate those stimuli exhaustively. This strategy avoids making a lot of unnecessary assumptions, and your accuracy will be limited only by the quality of your calibration instrument.

In some experiments, and in most commercial applications, the number of stimuli is too large for exhaustive calibration. Brainard (1989) calculates that for the most extreme case, in which one has a spatial array with RGB buffers at bits of resolution there are different stimulus patterns. So many patterns, so little time.

Because of the large number of potential stimuli, we need to build a model of the relationship between the frame-buffer entries and the monitor output. The discussion of calibration in this appendix is based on an implicit model of the display. To perform a high quality calibration, you should check some of these assumptions. What are these implicit assumptions, and how close will they be to real performance?

### Spatial Independence

First, we have assumed that the transfer function from the frame buffer values to the monitor output is independent of the spatial pattern in the frame buffer. Specifically, we have assumed that the light measurements we obtain from a region are the same if the surrounding area were set to zero or set to any other intensity level.

In fact, the intensity in a region is not always independent of the spatial pattern in nearby regions (see e.g. Lyons and Farrell, 1989, Naiman and Makous, 1992). It can be very difficult and impractical to calibrate this aspect of the display. As an experimenter, you should choose your calibration conditions to match the conditions of your experiment as closely as possible so you don’t need to model the variations with spatial pattern. For example, if your experiments will use rectangular patches on a gray background, then calibrate the display properties for rectangular patches on a gray background, not on a black or white background.

If you are interested in a harsh test to evaluate spatial independence of a display, try displaying a square pattern consisting of alternating pixels with and . When you step away from the monitor, the pattern will blur; in principle, the brightness of the this pattern should match the brightness of a uniform square of the same size all of whose values are set to the frame-buffer value whose relative intensity is . You can try the test using alternating horizontal lines, or alternating vertical lines, or random dot arrays.

### Phosphor Independence

Brainard (1989) points out that the spatial independence assumption reduces the number of measurements to approximately . Still too many measurements to make; but, at least we have reduced the problem so that there are fewer measurements than the number of atoms in the universe.

It is our second assumption, *phosphor independence*, that makes calibration practical. We have assumed that the signals emitted from three phosphors can be measured independently of one another. For example, we measured the relative intensities for the red phosphor and assumed that this curve is the same no matter what the state of the green and blue phosphor emissions. The phosphor independence assumption implies that we need to make only measurements, the relative intensities of each of the three phosphors, to calibrate the display. (Once the spectral power distributions of the phosphors are known; or equivalently, the entries of the calibration matrix).

Before performing your experiments, it is important to verify phosphor independence. Measure the intensity of the monitor output to a range of, say, red phosphor values when green frame-buffer is set to zero. Then measure again when the green frame-buffer is set to its maximum value. The relative intensities of the red phosphor you measure should be the same after you correct for the additive constant from the green phosphor. In my experience, this property does fail on some CRT displays; when it fails your calibration measurements are in real trouble. I suspect that phosphor independence fails when the power supply of the monitor is inadequate to supply the needs of the three electron guns. Thus, when all three electron guns are being driven at high levels, the load on the power supply exceeds the compliance range and produces a dependence between the output levels of the different phosphors. This dependence violates the assumptions of our calibration procedure and makes calibration of such a monitor very unpleasant. Get a new monitor.

For further discussion of calibration, consult some papers in which authors have described their experience with specific display calibration projects (e.g. Brainard, 1989; Cowan and Rowell, 1986; Post, 1989; Berns et al., 1993ab).

# 3. Classification

Many visual judgments are classifications: an edge is present, or not; a part is defective, or not; a tumor is present, or not. The threshold and discrimination performances reviewed in Chapter 7 are classifications as well: I saw it, or not; the stimulus was this one, or that. This appendix explains some of the basic concepts in classification theory and their application to understanding vision.

We will explore the issues in classification by analyzing some simple behavioral decisions, the kind that take place in many vision experiments. Suppose that during an experimental trial, the observer must decide which of two visual stimuli, or , is present. Part of the observer’s decision will be based on the pattern of cone absorptions during the experimental trial. We can list the cone absorptions in a vector, , whose values represent the number of absorptions in each of the cones. Because there are statistical fluctuations in the light source, and variability in the image formation process, the pattern of cone absorptions created by the same stimulus varies from trial to trial. As a result, the pattern of cone absorptions from the two different stimuli may sometimes be the same, and perfect classification may be impossible.

What response strategy should the subject use to classify the stimuli correctly as often as possible? A good way to lead your life is this: *When you are uncertain and must decide, choose the more likely alternative.*

We can translate this simple principle into a statistical procedure by the following set of calculations. Suppose that we know the probability of the signal being when we observe . This is called the *conditional probability*, , which is read as “the probability of given .” Suppose we also know the conditional probability that the signal is is . The observer should decide that the signal is the one that is more likely given the data, namely

(20)

We call the expression in Equation~20 a *decision rule*. The decision rule in Equation~20 is framed in terms of the likelihood of the stimuli ( or ) given the data (). This formulation of the probabilities runs counter to our normal learning experience. During training, we know that stimulus has been presented, and we experience a collection of cone absorptions. Experience informs us about probability of the data given the stimulus, , not the probability of the stimulus given the data, . Hence, we would like to reformulate Equation~20 in terms of the way we acquire our experience.

*Bayes Rule* is a formula that converts probabilities derived from experience, , into the form we need for the decision-rule, . The expression for Bayes Rule is

(21)

where is the probability the experimenter presents signal (in this case one-half) and is the probability of observing quanta. As we shall see, this second probability turns out to be irrelevant to our decision.

The probabilities on the right hand side of Bayes Rule are either estimated from our experience, , or they are a structural part of the experimental design . The probability that stimulus or is presented is called the *a priori* probability, or *base rate*. The probability is called the *likelihood* of the observation given the hypothesis. The probability, is called the *a posteriori* probability. Bayes Rule combines the base rate and the likelihood of the observation to form the a posteriori probability of the hypothesis.

We can use Bayes Rule to express the decision criterion in Equation~20 in this more convenient form.

(22)

Since the probability of observing the data, , divides both

of sides of this inequality, this probability is irrelevant to the

decision. Therefore, we can re-write Equation 22

as

(23)

The term on the left hand side of the Equation~23 is called the *likelihood ratio*. The quantity on the right is called the *odds ratio*. The formula tells us that we should select when the likelihood ratio, which depends on the data, exceeds the odds ratio, which depends on the a priori knowledge. A system that uses this formula to classify the stimuli is called a *Bayes classifier*.

## A Bayes classifier for an Intensity Discrimination Task

In this section we will devise a Bayes classifier for a simple experimental decision. Suppose that we ask an observer to decide whether we have presented one of two brief visual stimuli; the two stimuli are identical in all ways but their intensity. We suppose that the observer decides which stimulus was presented based on the total number of cone absorptions during the trial, , and without paying attention to the spatial distribution of the cone absorptions.

Across experimental trials, the number of absorptions from will vary. There are two sources of this variability. One source of variability is in the stimulus itself. Photons emitted by a light source are the result of a change in the energy level of an electron within the source material. Each electron within the material has some probability of changing states and yielding some energy in the form of a photon. This change in excitation level is a purely statistical event and cannot be precisely controlled. The variability in the number of emitted photons, then, is inherent in the physics of light emission and cannot be eliminated.

A second source of variability is in the observer. On different experimental trials, the position of the eye, accommodative state of the lens, and other optical factors will vary. As these image formation parameters vary, the chance that photons are absorbed within the photopigment will vary. These statistic fluctuations are unavoidable as well.

The physical variability is easy to model, while the biological variability is quite subtle. For this illustrative calculation, we ignore the effects of biological variability and consider only the stimulus variability. In this case, the number of absorptions will follow the *Poisson distribution*. The formula for the Poisson probability distribution describes the probability of emitting quanta given a mean level of ,

(24)

The value is called the *rate-parameter* of the Poisson distribution. The variance of the Poisson distribution is also .

Figure A3.1 shows the probability distributions of the number of cone absorptions from the two stimuli, and . For this illustration, I have assumed that the mean absorptions from the two stimuli are and , respectively. Since the a priori stimulus probabilities are equal, the observer will select stimulus whenever exceeds . For this pair of distributions, this occurs whenever the observation is less than 9.

Now, suppose the experimenter presents with probability and with probability . In this case, the subject can be right three quarters of the time simply by always guessing . Given the strong a priori odds for , the Bayes classifier will only choose if the likelihood exceeds three to one. From the distributions in Figure A3.1 we find this occurs when there are at least 13 absorptions. The Bayes classifier uses the data and the a priori probabilities to make a decision ^{5}.

^{5 }In a classic paper, Hecht, Schlaer and Pirenne (1942) measured the smallest number of quanta necessary to reliably detect a signal. In the most sensitive region of the retina, under the optimized viewing conditions, they found that 100 quanta at the cornea, which corresponds to about 10 absorptions, are sufficient. The quanta are absorbed in an area covered by several hundred rods; hence, it is quite unlikely that two quanta will be absorbed by the same rod. They concluded, quite correctly, that a single photon of light can produce a measurable response in a single rod.To find further support for their observation, Hecht et al. analyzed how the subject’s responses varied across trials. They assumed that all of the observed variability was due to the stimulus, and none was due to the observer. Their account is repeated in a wonderful didactic style in Cornsweet’s book. I don’t think this assumption is justified; a complete treatment of the data must take into account variability intrinsic to the human observer (e.g., Nachmias and Kocher, 1970).

## A Bayes classifier for a Pattern Discrimination Task

To discriminate between different spatial patterns, or stimuli located at different spatial positions, the observer must use the spatial pattern of cone absorptions. Consequently, pattern discrimination must depend on comparisons of a vector of measurements, , not just a single value. In this section, we develop a Bayes classifier that can be applied to a pattern discrimination task.

Again, for illustrative purposes we assume that the variability is due entirely to the stimulus. Moreover, we will assume that the variability in light absorption at neighboring receptors is statistically independent. Independence means that the probability of absorptions at the receptor and at the is

We can extend this principle across all of the receptors to write the probability of observing given signal , namely

where is the likelihood of observing given the stimulus .

Finally, we must specify the spatial pattern of two stimuli. The expected number of absorptions at the cone will depend on the spatial pattern of the stimulus. We call the mean intensity of the stimulus at location , and , respectively.

We are ready to compute the proper Bayes classifier. The general form of the decision criterion is

(25)

If the two stimuli are presented in the experiment equally often, we can re-write this equation as

(26)

Next, we can use independence to write

(27)

Now, we substitute the Poisson formula, take the logarithm of both sides, and regroup terms to obtain

(28)

Equation~28 can be interpreted as a very simple computational procedure. First, notice that the Equation contains two terms. The first term is the weighted sum of the photopigment absorptions,

(29)

where . This is a very familiar computation; it is directly analogous to the calculation implemented by a linear neuron whose receptive field sensitivity at position is (see Chapter 5).

The second term is the difference between the total number of photopigment absorptions expected from each stimulus.

This term acts as a normalizing criterion to correct for the overall response to the two stimuli. The decision rule in Equation~28 compares a weighted sum of cone absorptions to the normalizing criterion. If the first term exceeds the second, then choose response , else .

We have learned that a Bayes classifier for pattern discrimination can be implemented using the results of simple linear calculations, like the calculations represented in the outputs of some peripheral neurons. In fact, making a decision like a Bayes classifier is equivalent to comparing the response of a linear neuron with a criterion value. If response of the linear neuron with receptive field defined by exceeds the criterion value, then choose stimulus , otherwise choose .

Figure A3.2 shows an example of a Bayes classifier computed for a simple pattern discrimination task. The two spatial patterns to be discriminated are shown in panels (a) and (b). The image in (a) is the retinal image of a line segment. Figure A3.2b shows the retinal image of a pair of line segments, separated by 30 sec of arc. The grid marks on the image are spaced by 30 sec of arc, essentially the same spacing as the cone inner segments. The curves below the images measure the intensity of the stimuli across a horizontal section.

Figure A3.2c shows the weights of the Bayes classifier for discriminating the two images; the weights were computing using Equation~29. In this image the gray level represents the value of the weight that should be applied to the individual cone absorptions: a medium gray intensity corresponds to a zero weight, lighter values are positive weights, and darker values are negative weights. The spatial pattern of weights bears an obvious resemblance to the spatial patterns of receptive fields in the visual cortex.

The Bayes classifier specifies how to obtain the optimal discrimination performance when we know the signal and the noise. Apart from the optimality of the Bayes classifier itself, observers in behavioral experiments can never know the true signal and noise perfectly. The performance of the Bayes classifier, therefore, must be equal or superior to human performance. In nearly all cases where a Bayes classifier analysis has been carried out, the Bayesian classifier performance is vastly better than human observer performance. Human observers usually fail to extract a great deal of information available in the stimulus. In general, then, the Bayes classifier does not model human performance (Banks et al., 1987).

Why, then, is the Bayes classifier analysis important? The Bayes classifier analysis defines what information is available to the observer. Defining the stimulus is an essential part of a good experimental design. The Bayes classifier defines what the observer can potentially do, and this serves as a good standard to use in evaluating performance. The Bayes classifier helps us to understand the task; only if we understand the task, can we understand the observer’s performance.

The real power of this approach is that the ideal discriminator [Bayes classifier] measures all the information available to the later stages of the visual system. I would like to suggest that measuring the information available in discrimination stimuli with a model of the sort I have described here should be done as routinely as measuring the luminances of the stimuli with a photometer. In other words, we should not only use a light meter but we should also use the appropriate information meter. It is simply a matter of following the first commandment of psychophysics: “know thy stimulus.” (Geisler, 1987, p. 30)

# 4. Signal Estimation: A Geometric View

This book is filled with calculations of the general form

(30)

This formula is called the *dot product* of the vectors and . It is so important that it is often given a special notation, such as . Every time we multiply a matrix by a vector, , we compute the dot product with the rows of . In this section, I want to discuss some geometric intuitions connected with the dot product operation ^{6}.

^{6 }Physicists often refer to Equation~30 as the scalar product because the result is a scalar; mathematicians often refer to the dot product as an inner product and write it using angle brackets .

Although we used the dot product calculation many times, I decided not to use the dot product notation in the main portion of the book because the notation treats the two vectors symmetrically; but, in most applications the vectors and had an asymmetrical role. The vector was usually a stimulus, say the wavelength composition of a light or a one-dimensional spatial pattern, and the vector was part of the sensory apparatus, say a photopigment wavelength sensitivity or a ganglion cell receptive field. Because the physical entities we described were not symmetric, I felt that the asymmetric notation, , was more appropriate.

The scalar value of the dot product between two vectors is equal to

where is the vector length and is the angle between the vectors. To simplify the discussion in the remainder of this section, we will assume that the vector has unit length.

Figure A4.1a is a geometric representation of the dot product operation. The unit vector and the signal vector are drawn as arrows extending from the origin. A dashed line is drawn from the tip of (the signal) at right angles to the vector (the sensor). The length of the vector from the origin to point of intersection between the perpendicular dashed line and is called the *projection* of onto . Because we have assumed has unit length, the length of the projection, , is equal to the dot product, .

By inspecting Figure A4.1b, you can see that there are many signal vectors, , that have the same dot product with . Specifically, all of the vectors whose endpoints fall along a line that is perpendicular to , have the same dot product with . Hence, any signal represented by a vector whose endpoint is on this line will cause the same response in the sensor represented by the vector .

The geometric intuition extends smoothly to vectors with more than two dimensions. Figure A4.2a shows the dot product between a pair of three-dimensional vectors. In three-dimensions, all of the vectors whose endpoints fall on a plane perpendicular to the unit-length vector have the same dot product with that vector (Figure A4.2b). In four or more dimensions the set of signals with a common dot product value fall on a *hyperplane*.

Figures A4.1b and A4.2b illustrate the information that is preserved and that is lost in a dot product calculation. When we measure the dot product of a pair of two-dimensional vectors, we learn that the signal must have fallen along a particular line; when we measure a three-dimensional dot-product, we learn that the signal must have fallen within a certain plane. Hence, each dot product helps to define the set of possible input signals.

By combining the results from several dot products, we can estimate the input signal. We can use the geometric representation of the dot product in Figure A4.1 to develop some intuitions about how to estimate the signal from a collection of sensor responses. This is a *signal estimation* problem.

First, imagine that the response of each linear neuron is equal to the dot product of a vector describing the spatial contrast pattern of a stimulus with a vector describing the neuron’s receptive field. Further, suppose that the input stimuli are drawn from a set of simple spatial patterns, namely the weighted sums of two cosinusoidal gratings,

We can represent each of these spatial contrast patterns using a two-dimensional vector, , whose entries are the weights of the cosinusoids. We can represent the sensitivity of each linear neuron to these spatial patterns using a two-dimensional vector, , whose two entries define the neuron’s sensitivity to each of the cosinusoidal terms. Because the neurons are linear, we can compute the neuron’s response to any pattern in the set from the linear calculation in the dot product, namely, .

Figure A4.3a shows geometrically how to use two responses to identify uniquely the two-dimensional input stimulus. Suppose that the response of the neuron with receptive field is . Then, we can infer that the stimulus must fall along a line perpendicular to the vector and intersecting at a distance from the origin ^{7}. If the response to the second neuron is , we can draw a second line that describes a second set of possible stimulus locations. The true stimulus must be on both lines, so it is located at the intersection of the two dashed lines.

When the sensor responses have some added noise, which is always the case in real measurements, some sensor combinations provide more precise estimates than others. Figure A4.3b shows a pair of sensors whose vectors are nearly orthogonal to one another. Because of the sensor noise, the true identity of the signal is uncertain. The lightly shaded bands drawn around the perpendicular dashed lines show the effect of sensor noise on the estimated region from the individual measurements. The darkly shaded region is the intersection of the bands, showing the likely region containing the signal. For these orthogonal sensors, the dark band that is likely to contain the signal is fairly small.

Figure A4.3c shows the same representation but for a pair of sensors represented nearly parallel vectors. Such a pair of sensors is said to be *correlated*. When the sensors are correlated in this way, the shaded region can be quite large in the direction perpendicular to the sensor vectors. In the presence of noise, correlated sensors provide a poor estimate of the signal.

^{7 }In general, we would place the perpendicular line at a distance of , but we have assumed that has unit length.

## Matrix equations

Ordinarily, we use matrix multiplication to represent a collection of dot products. Many aspects of the signal estimation problem can be expressed and solved using methods developed for matrix algebra.

To see the connection between matrix multiplication and the signal estimation problem, consider the following example. Write the dot product of a two-dimensional signal and a sensor in the tableau format,

Now, suppose we have two sensors and two responses. Then, we can expand the tableau by adding more rows to represent the new measurements, as in

(31)

where is a vector that represents the sensitivity of the sensor. In the usual signal estimation problem, we know the properties of the sensors represented in the rows of the matrix and we know the responses in . We wish to estimate the signal in . The pair of sensor vectors represented in the rows of the matrix, call it , are the counterpart of the vectors shown in Figure A4.3a. Our ability to estimate the signal from the resposnes depends on the correlation between the rows of the matrix.

The matrix tableau in Equation~31 shows a special case in which there are two sensors and two unknown values of the signal. In general, we might have more sensors or more unknowns in the signal. Having more sensors would force us to add more rows to the matrix; having more unknowns in the signal would force us to change the lengths of the sensor and signal vectors. Each of these changes might change the general shape of the matrix tableau. We can write the general case, without specifying the number of sensors or unknowns, using the concise representation

(32)

Again, in the (linear) signal estimation problem we know the responses () and the sensors in the rows of and we wish to estimate the signal ().

Many topics we have reviewed in the text can be framed as linear estimation problems within the matrix Equation~32. For example, you might imagine that the rows of the matrix are retinal ganglion receptive fields, the responses are neural firing rates, and the input is a spatial contrast pattern (Chapter 5). The brain may then need to estimate various properties of the contrast pattern from the neural firing patterns. Or, you might imagine that the rows of the matrix represent the spectral responsivities of the three cone types, the responses are the cone signals, and the signal is a spectral power distribution (Chapters 4 and 9). Or, you might imagine that the each row of the matrix represents the receptive field of a space-time oriented neuron, the output is the neuron’s response, and the signal are the two velocity components of the local motion flow field (Chapter 10).

There are many matrix algebra tools that are useful for solving matrix equations like Equation~32. To choose the proper tool for a problem, one must take into account a variety of specific properties of the measurement conditions. For example, in some cases there are more sensors than there are unknown entries in the vector . We can represent this estimation problem using matrix tableau as

(33)

The shape of the tableau makes it plain that there are more responses in the vector than there are unknowns in the vector . This type of estimation problem is called *over-constrained*. When there is noise in the measurements, no exact solution to the over-constrained problem may exist; that is, there may be no vector such that the equality in Equation~33 is perfectly satisfied. Instead, we must define an error criterion and try to find a “best” solution, that is a solution that minimizes the error criterion.

In general, the best solution will depend on the noise properties and the error criterion. In some cases, say when the noise is Gaussian and the error is the sum of the squared difference between the observed and predicted responses, there are good closed form solutions to the linear estimation problem. That is, one can find an estimate of the signal without using any search algorithms, but by direct computation. If one has other error criteria or noise, a search may be necessary.

A full discussion of the problems involved in signal estimation and matrix algebra would take us far beyond the scope of this book, but there are several excellent textbooks that explain these ideas (Strang, 1993; Lawson and Hanson, 1974, Vetterling, et al., 1992). Also, the manuals for several computer packages, such as *Matlab* and *Mathematica*, contain useful information about using these methods and further references. To see how some of these tools have been applied to vision science, you might consult some of the following references (Brainard, 1994, Heeger and Jepson, 1992; Marimont and Wandell,1992; Nielsen and Wandell, 1988; Thomas et al., 1994; Tomasi and Kanade, 1991, Wandell, 1987)

# 5. Motion Flow Field Calculation

The motion flow field describes how an image point changes position from one moment in time to the next, say, as an observer changes viewpoint. I reviewed several properties of the motion flow field in Chapter 10, and I also reviewed how to use the information in a motion flow field to estimate scene properties, including observer motion and depth maps.

In this appendix I describe how to compute a motion flow field. There are various reasons one might need to calculate a motion flow field. For example, motion flow fields are used as experimental stimuli to analyze human performance (e.g., Royden et al., 1992; Warren and Hannon, 1990). Also, algorithms designed to estimate depth maps from motion flow fields must be tested using artificial motion flow fields (e.g., Tomasi and Kanade, 1992, Heeger and Jepson, 1992).

To calculate the motion flow field, we will treat the eye as a pinhole camera and we will define the position of the points in space relative to the pinhole. Based on this framework, we will derive two formulae. First, we will derive the *perspective projection* formula. This formula describes how points in space project onto locations in the image plane of the pinhole camera. Second, we will derive how translating and rotating the pinhole camera position, i.e., the *viewpoint*, changes the point coordinates in space. Finally, we combine these two formulae to predict how changing the viewpoint changes the point locations in the image plane, thus creating the motion flow field.

## Imaging Geometry and Perspective Projection

We base our calculations on ray-tracing from points in the world onto an image plane in a pinhole camera (see Chapter 2). To simplify some of the graphics calculations, it is conventional to place the pinhole at the origin of the coordinate frame and the imaging plane in the positive quadrant of the coordinate frame. Figure A5.1a shows the geometric relationship between a point in space, the pinhole, and the image plane.

We define the pinhole to be the origin of the imaging coordinate frame. The image plane is parallel to the X-Y plane at a distance along the Z axis. A ray from passes towards the pinhole and intersects the image plane location . Since the third coordinate, , is the same for all points in the image plane, we can describe the image plane location using only the first two coordinates, .

Figures A5.1bc show the geometric relationship between a point in space and its location in the image plane from two different views: looking down the Y axis and X axis, respectively. From both views, we can identify a pair of similar triangles that relate the position of the point in three space and the image point position. There are two equations that relate the point coordinates, , the distance from the pinhole to the image plane , and the image plane coordinates ,

(34)

## Imaging Geometry, Camera Translation and Rotation

The coordinate frame is centered at the pinhole. Hence, when the pinhole camera moves, each point in the world is assigned a new position vector. Here, we calculate the change in coordinates of the points for each motion of the pinhole optics. We will use the new coordinate to calculate the change in the point’s image position, and then the motion flow field vectors.

At each moment in time we can describe the motion of the camera using two different terms. First, there is a translational component that describes the direction of the pinhole motion. Second, there is a rotational component, that describes how the camera rotates around the pinhole. We use two vectors to represent the camera’s velocity. The vector represents the translational velocity. Each term in this vector describes the velocity of the camera in one of the three spatial dimensions. The vector represents the angular velocity. Each term in this vector describes the rotation of the camera with respect to one of the three spatial dimensions. The six values in these vectors completely describe the rigid motion of the camera.

We can compute the change in the coordinate,, given the translation, rotation, and point coordinates as follows.

These formulae apply when the rotation and translation are small quantities, measuring the instantaneous change in position.

## Motion Flow

Finally, we compute motion flow by using Equation~34 to specify how the change in position in space maps into a change in image position. The original point mapped into an image position ; after the observer moves the new coordinate, maps into a new image position, . The resulting motion flow is the difference in the two image positions, .

Equation~35 defines how the motion of the pinhole camera and the depth map combine to create the motion flow field (Heeger and Jepson, 1992).

(35)

The term is the value of for the point in three space with image position . The entries of the matrices and depend only on known quantities, namely the distance from the pinhole to the image plane and the image position, .

Equation~35 expresses a relationship we have seen already: The local motion flow field is the sum of two vectors (see Figure~??). One vector describes a component of the motion flow field caused by viewpoint translation, . The second vector describes a component of the motion flow field caused by viewpoint rotation, . Only the first component, caused by translation, depends on the distance to the point, . The rotational component is the same for points at any distance from the viewpoint. Hence, all of the information concerning the depth map is contained in the motion flow field component that is caused by viewpoint translation.