Types of invariance
'Invariance' means that different patterns are mapped onto a single output dimension. For example, many different images of dogs might all activate one 'dog-detector' (one of several output dimensions), while still activating other output dimensions that code size, furriness, or position.
But there can be many reasons to map different patterns onto a single dimension. Those different reasons correspond to different types of invariance, as discussed below.
If one pattern is mapped to a certain output dimension, then a nearly identical pattern should clearly be mapped to the same output. For example, if one picture of a dog is mapped to 'dog,' then the same picture with a single pixel changed should be mapped to 'dog' as well. Somehow, the system needs a minimal way of blurring very similar patterns together; at the very least, that can be done by the natural graininess of the input sensors and filters, and may also be done more explicitly at higher processing stages.
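As a minimal sketch of such 'graininess', assume the input is a short list of pixel values from 0 to 255 and the sensor can only resolve coarse bins. The function name `quantize`, the bin width `step`, and the toy images are illustrative assumptions, not part of any particular system.

```python
def quantize(pattern, step=32):
    """Map each input value to the index of a coarse bin of width `step`."""
    return tuple(value // step for value in pattern)


# Hypothetical 4-pixel images with values in 0..255.
image = [12, 200, 97, 45]
nearly_same = [12, 201, 97, 45]   # one pixel changed by one unit
different = [240, 10, 5, 180]

same_code = quantize(image) == quantize(nearly_same)   # blurred together
other_code = quantize(image) == quantize(different)    # kept apart
```

Hard binning of this kind is only a crude stand-in for blurring: two values that straddle a bin boundary (223 and 224 with this `step`) still land in different bins, which is one reason a system might also blur more explicitly at higher processing stages.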
Sometimes, very different patterns should be mapped to a common dimension, just because we know a priori when designing the system that this should happen. Because the patterns are quite different, the system needs an explicit, algorithmic mapping to discover the invariance; and because the designers know of this algorithm in advance, they (or Nature) can hard-wire it into the system. As a result, the system can discover invariances among patterns unlike any it has seen before, because the mapping algorithm is applied to all patterns, without waiting for any experience or learning. For example, all of the following could be accomplished by hard-wired invariance:
Note that the four examples above are rather restricted: the first three involve transformations of time alone, and the final one only works for dimensions coding a very specific kind of value (intensity, which is distinct from other common scalars like probability or position). The more general case--hard-wiring invariance among different 'spatial' patterns--is nearly impossible, because the patterns themselves have to be learned, and the system does not know a priori which dimensions go into which patterns. Time is the main exception, since it is the only 'dimension' which the system can access directly and separately everywhere; intensity is an exception too, but only at the outermost sensory layers.
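To make the intensity case concrete, here is one possible hard-wired mapping, sketched under the assumption that every input dimension codes an intensity: dividing a pattern by its overall magnitude removes the global brightness and keeps only the relative profile. The name `normalize_intensity` and the sample values are hypothetical, and this is only one of several normalizations that would serve.

```python
import math


def normalize_intensity(pattern):
    """Divide by the Euclidean norm so that only the relative profile remains."""
    norm = math.sqrt(sum(v * v for v in pattern))
    if norm == 0.0:
        # An all-zero input has no profile to preserve.
        return tuple(0.0 for _ in pattern)
    return tuple(v / norm for v in pattern)


dim = [1.0, 2.0, 2.0]
bright = [3.0, 6.0, 6.0]   # the same profile at three times the intensity

dim_code = normalize_intensity(dim)
bright_code = normalize_intensity(bright)
```

Because the rule is fixed in advance, it applies equally to patterns the system has never encountered, which is the defining property of hard-wired invariance. It would be meaningless for dimensions that code probability or position, where rescaling changes what the value means.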
So in order to find the most general kind of invariances (like 'dogs'), we need to explore arbitrary mappings.
This is what people usually mean by 'invariance': utterly dissimilar images (e.g. various views of various kinds of dogs) all get mapped to a single category. Arbitrary mappings like this can be defined easily enough if there is a supervisor (or 'teacher') to indicate which pattern goes into which category... but a natural learning system needs to figure out the categories on its own, without a teacher. What are the naturally available signals from which it might learn these invariances?
Here are some possible natural clues for learning invariances. They are somewhat vague, and each is doubtless used somewhere yet probably not used everywhere.
Temporal contiguity (grouping together patterns that occur close together in time) is an extremely general approach, potentially useful for all sensory modalities and levels of abstraction. It could be used to group together similarly-oriented contours in early vision (as occurs in 'complex cells'), yet also to group together utterances from a single speaker, or life-experiences in a single country. The key is that time is a unique sensory dimension, and thus provides unique cues to grouping and organizing the myriad other 'spatial' dimensions.
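A toy sketch of grouping by temporal contiguity follows. It assumes the input arrives as a stream of discrete pattern labels, one per time step, and that two patterns belong together if they are adjacent in time at least `min_count` times. The function name, the threshold, and the labels are all hypothetical; a real system would work on continuous patterns rather than ready-made symbols.

```python
from collections import Counter


def group_by_temporal_contiguity(stream, min_count=2):
    """Group pattern labels that repeatedly occur at adjacent time steps."""
    # Count how often each unordered pair of distinct patterns is adjacent in time.
    pair_counts = Counter()
    for earlier, later in zip(stream, stream[1:]):
        if earlier != later:
            pair_counts[frozenset((earlier, later))] += 1

    # Union-find over patterns: merge every pair seen together often enough.
    parent = {p: p for p in stream}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for pair, count in pair_counts.items():
        if count >= min_count:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical stream: a dog seen from changing angles, then a cut to a cat.
stream = [
    "dog_left", "dog_front", "dog_left", "dog_front", "dog_right",
    "dog_front", "dog_right", "cat_left", "cat_front", "cat_left", "cat_front",
]
groups = group_by_temporal_contiguity(stream)
```

Here the three dog views end up in one group and the two cat views in another. The single dog-to-cat transition falls below the threshold, so it is treated as a scene change rather than as evidence that the two belong together.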
While this works well at discovering objects whose transformations vary quickly, it cannot work when a transformation is fixed for a while. For example, you cannot use temporal contiguity to learn pitch-invariance--that a tune in one key is 'the same' as in a different key--because you never get to hear the two keys right after each other. Nor can you use this trick to discover that a word spoken by one voice is the same as that word spoken by someone else, since you don't get two different people to say the same word in quick succession. This specific failure occurs because temporal contiguity cannot be used as a cue for patterns which stretch across time; if the pattern takes up time, the cue cannot do so simultaneously.