Probability vs Parameters

We usually suppose that a neuron conveys analog information: scalars, real numbers, smoothly-varying values. But we don't know what that value represents, nor do we even know what we would like it to represent if we could design the whole system ourselves.

In an ideal world--one designed by theorists for purity of purpose and clarity of implementation--what kind of information would those numbers convey? Would each scalar represent a property or parameter of the input (like hue or angle), or represent the probability that the input has some property?

What do these codes mean?

Parameters
For example, the output of a neuron using the "parameter" interpretation might directly encode X-position, Y-position, angle, force, hue, or musical pitch. One pair of numbers says where an object is ("x=5 inches from the left, y=8 inches from the bottom"), another number gives its hue ("wavelength = 550 nm", i.e. green), and so on.

In the example below, the pixel image is represented by just three numbers: the angle of the line segment, its hue, and its position.

Probabilities
The alternative encoding--the "probability" approach to analog signals--is the natural approach for neural feature-detectors. A set of separate detectors looks at the input, and each one asks how much the input overlaps with what it's looking for; each unit independently encodes the relative presence or absence of some feature. One unit might encode "horizontal-ness," another "vertical-ness," and so on.

In the example below, the same pixel image is compared to a dozen different templates, and the templates with the most overlap--those which are most similar to the input--will signal the strongest output values.
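
The contrast between the two codes can also be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model of real neurons: the 15x15 grid, the helper names render_line and overlap, the 15-degree template spacing, and the choice to centre every template at the same position are all hypothetical, and hue is left out to keep the sketch short.

```python
import math

SIZE = 15  # hypothetical image width and height, in pixels


def render_line(angle_deg, cx, cy, half_len=5):
    """Return the set of (x, y) pixels lit by a line segment centred at (cx, cy)."""
    theta = math.radians(angle_deg)
    pixels = set()
    steps = half_len * 4  # sample the segment every quarter pixel
    for i in range(-steps, steps + 1):
        t = i / 4
        x = round(cx + t * math.cos(theta))
        y = round(cy + t * math.sin(theta))
        if 0 <= x < SIZE and 0 <= y < SIZE:
            pixels.add((x, y))
    return pixels


def overlap(image, template):
    """Fraction of the template's pixels that are also lit in the image."""
    return len(image & template) / len(template)


# Parameter code: the whole image is summarised by three numbers.
params = (30, 7, 7)  # angle in degrees, x-position, y-position
image = render_line(*params)

# Probability code: a dozen orientation detectors, each reporting its overlap.
template_angles = [k * 15 for k in range(12)]  # 0, 15, ..., 165 degrees
templates = {a: render_line(a, 7, 7) for a in template_angles}
responses = {a: overlap(image, templates[a]) for a in template_angles}

# The detector whose template best matches the input responds most strongly.
best = max(responses, key=responses.get)
```

Here the parameter code is the tuple params (three numbers), while the probability code is the dictionary responses (twelve numbers, one per template), with the 30-degree detector giving the largest value.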

Advantages and disadvantages

The parameter encoding is the more compact of the two: as long as only one value needs to be signalled, a single number can carry the same information (say X-position) as would be carried by a multitude of location-detectors scattered across the range of X. In addition, the relation between possible values of a single number automatically includes their natural ordering: it is implicitly clear that 2.1 lies between 1.0 and 3.3, whereas separate detectors at 2.1, 1.0, and 3.3 would need to have their natural order or rank deduced and encoded elsewhere in the system. Hence the structure of the signal is compactly and naturally represented (as long as there is but a single "object").
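
A small sketch may make the compactness and ordering argument concrete. It assumes Gaussian-tuned location detectors spaced 0.5 apart and a centre-of-mass read-out; those choices, and the names population_encode, population_decode, and centres, are hypothetical and introduced only for illustration.

```python
import math


def population_encode(x, centres, width=0.5):
    """Response of each detector: a Gaussian bump around its preferred value."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centres]


def population_decode(responses, centres):
    """Centre-of-mass read-out; note that it needs the table of preferred values."""
    total = sum(responses)
    return sum(r * c for r, c in zip(responses, centres)) / total


x = 2.1

# Parameter code: one number, and its order relative to other values is built in.
scalar_code = x
is_between = 1.0 < scalar_code < 3.3

# Probability code: eleven detectors covering the range 0.0 to 5.0.
centres = [i * 0.5 for i in range(11)]
population_code = population_encode(x, centres)

# Recovering the position requires knowing each unit's preferred value
# (the list centres), which is exactly the ordering information that the
# detector outputs alone do not carry.
decoded = population_decode(population_code, centres)
```

The single number carries its ordering for free, whereas the eleven detector outputs can only be turned back into a position with the help of the side table centres, which stands in for the order "deduced and encoded elsewhere in the system."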

The parameter encoding has four main disadvantages. First, it is hard to learn directly from unsupervised data (even discovering the underlying dimensions of the data is hard; finding the proper parameterization and interpolation scheme is even harder). Second, it only works by making very explicit assumptions about the input data (e.g. the world is made up of only straight lines)... or, equivalently, it only works when such structure exists and can be discovered. Third, simultaneously representing more than one object or hypothesis (say multiple lines at once) doesn't work, due to the "binding problem." Finally, the probability/confidence/likelihood of any representation is completely absent: you can't tell a blurred or dim bar from a crisp one.

The probability code, on the other hand, has several virtues: first, a large number of "objects" can be encoded simultaneously; second, it makes very few a priori assumptions about the structure of the input data (you can represent curved lines with it too); third, information about both value and probability is encoded in a single kind of signal (there is no need to have one channel for "values" and a distinct channel for "probabilities"); and finally, new feature-combinations can be discovered by examining the coincidences between active units, Hebb-style.
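
Two of these points, multiple simultaneous objects and built-in confidence, can be illustrated with a toy bank of orientation detectors. The sketch below assumes Gaussian orientation tuning, a "contrast" value standing in for how crisp or dim a bar is, and a simple local-maximum read-out; the names detector_bank, peaks, and PREFERRED are hypothetical. Averaging two angles is only one naive way a single-valued code could fail; the binding problem is the broader difficulty of keeping each object's parameters paired.

```python
import math

PREFERRED = [k * 15 for k in range(12)]  # detector angles: 0, 15, ..., 165 degrees


def angle_diff(a, b):
    """Smallest difference between two orientations (period 180 degrees)."""
    d = abs(a - b) % 180
    return min(d, 180 - d)


def detector_bank(lines, width=15.0):
    """Summed Gaussian response of each detector to a list of (angle, contrast) lines."""
    return [
        sum(c * math.exp(-angle_diff(a, p) ** 2 / (2 * width ** 2)) for a, c in lines)
        for p in PREFERRED
    ]


def peaks(responses, threshold=0.25):
    """Preferred angles of detectors that are local maxima above the threshold."""
    n = len(responses)
    return [
        PREFERRED[i]
        for i, r in enumerate(responses)
        if r >= threshold and r > responses[i - 1] and r > responses[(i + 1) % n]
    ]


# Two lines at once: the detector bank simply shows two bumps.
two_lines = detector_bank([(30, 1.0), (120, 1.0)])

# A single "angle" number has nowhere to put the second line; a naive average
# gives 75 degrees, which matches neither line.
naive_parameter = (30 + 120) / 2

# Confidence rides on the same signal: a dim bar peaks at the same detector
# as a crisp one, but with a weaker response.
crisp_line = detector_bank([(30, 1.0)])
dim_line = detector_bank([(30, 0.3)])
```

In this sketch peaks(two_lines) picks out both orientations, and the dim and crisp bars differ only in the height of the same peak, so value and confidence share one kind of signal.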

But the probability approach has serious costs too: first, it requires far more units to represent the same information; second, the natural ordering of scalar quantities like position and hue is never made explicit, but is set aside to be discovered elsewhere; and third, the input is not necessarily grouped into distinct "objects" or hypotheses, but can be left undifferentiated, again to be discovered elsewhere. So this method can represent more of the world in a more general way, but it does so by learning less about how the world is actually constructed.

Conclusion: Broadly speaking, the "parameter" approach is far more efficient, far closer to low-dimensional "reality," and tackles most of the hard problems up-front. The "probability" approach is easier, more general, and more flexible, but leaves a truly compact, meaningful representation to be discovered by other circuits yet unknown.