The original hypothesis was that processing an image in sequential steps (saccades) should improve object recognition. The first step is to retrieve the salient locations of an image and process each such region at high resolution, while the other parts are blurred. This could offer three advantages.
After each saccade, the accumulated representation is updated as

    accumulated representation = f(accumulated representation, new saccade feature, vector movement)
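The text does not specify the form of f, so the sketch below is purely illustrative: it assumes all three inputs are sparse binary vectors, binds the saccade feature to its movement with an elementwise AND, and folds the result into the accumulator with an OR. The sparsity levels and vector sizes are made up for the example.

```python
import numpy as np

def accumulate(acc, feature, movement):
    # Bind the saccade feature to the movement that produced it
    # (elementwise AND is one simple binding choice), then fold
    # the result into the accumulator with OR.
    bound = feature & movement
    return acc | bound

rng = np.random.default_rng(0)
n = 1000
acc = np.zeros(n, dtype=np.uint8)
for _ in range(3):  # three saccades
    feature = (rng.random(n) < 0.05).astype(np.uint8)   # sparse binary feature
    movement = (rng.random(n) < 0.05).astype(np.uint8)  # sparse binary movement
    acc = accumulate(acc, feature, movement)

# The accumulator stays binary and only gains bits across saccades.
assert set(np.unique(acc)) <= {0, 1}
```

Any associative, order-sensitive combination would fit the recurrence equally well; AND/OR is just the simplest binary choice.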
The second step is to encode the feature into neural network activity. We took an unsupervised approach and used a linear activation followed by selection of the k most active cells (k-winners-take-all, kWTA), which is biologically plausible. The result is a sparse binary representation of the feature. The movement to the next salient point is also encoded as a binary vector that relates the previous and current feature representations. The network learns rules of the form "if I see feature A and make move B, then I will see feature C." The hope was that these coincidences, in the very high-dimensional space of all rules, would generalize well and yield good recognition accuracy. However, we could not make it work, largely because of the problem of unsupervised feature encoding into the network. So we decided to set this project aside for a while and address the encoding problem first.
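The kWTA encoding step can be sketched as follows. This is a minimal illustration, not the project's actual code: the weight matrix is random (untrained), and the layer size, input size, and k are arbitrary choices for the example.

```python
import numpy as np

def kwta(activations, k):
    """k-winners-take-all: set the k largest activations to 1, the rest to 0."""
    out = np.zeros_like(activations, dtype=np.uint8)
    out[np.argsort(activations)[-k:]] = 1
    return out

rng = np.random.default_rng(42)
n_cells, n_input, k = 2000, 64, 40
W = rng.standard_normal((n_cells, n_input))  # hypothetical (untrained) weights
feature = rng.standard_normal(n_input)       # a feature extracted from one saccade
sdr = kwta(W @ feature, k)                   # sparse binary representation

assert sdr.sum() == k  # exactly k active cells, by construction
```

The linear activation (W @ feature) followed by top-k selection yields a fixed-sparsity binary code regardless of the input's scale, which is what makes the representation usable as a key in a high-dimensional rule space.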
See the preprint draft
See the GitHub repository