Active Vision

The original hypothesis was that processing an image in sequential steps (saccades) should improve object recognition. The first step is to retrieve the salient locations of an image and process each such region at high resolution while the rest of the image is blurred. This can give three advantages:

  1. Simpler features are much easier to classify.
  2. Salient features are encoded in local coordinates, so there should be no problems with translation invariance.
  3. Local features are encoded in the context of the global object (in a sequential manner), which can capture the spatial relationships between features:

`accumulated representation = f(accumulated representation, new saccade feature, vector movement)`
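The accumulation rule above can be sketched in code. This is only an illustrative toy, not the project's implementation: the binding of a feature with a movement is shown here as an elementwise XOR of binary vectors, and the merge into the running representation as an elementwise OR, whereas the actual function f is meant to be learned. All dimensions and sparsity levels are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # dimensionality of the binary representations (illustrative)

def bind(feature, movement):
    """Combine a local feature with the movement that led to it.
    XOR is a common fixed binding for binary vectors; the project's
    actual f is learned, not fixed."""
    return np.logical_xor(feature, movement)

def accumulate(accumulated, feature, movement):
    # accumulated representation = f(accumulated representation,
    #                                new saccade feature, vector movement)
    return np.logical_or(accumulated, bind(feature, movement))

# Simulate three saccades over an object.
accumulated = np.zeros(N, dtype=bool)
for _ in range(3):
    feature = rng.random(N) < 0.05   # sparse binary local feature
    movement = rng.random(N) < 0.05  # binary-encoded eye movement
    accumulated = accumulate(accumulated, feature, movement)

print(accumulated.sum())  # the representation grows as saccades accumulate
```

The OR-merge makes the accumulated vector monotonically denser with each saccade, which is one simple way to keep a running trace of "what was seen and how we got there".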

The second step is to encode the feature into neural network activity. We took an unsupervised approach and used a linear activation followed by selection of the k most active cells (k-winners-take-all, kWTA), which is biologically plausible. The result is a sparse binary representation of the feature. The movement to the next salient point is also encoded in a binary vector that relates the previous and current feature representations. The network learns rules of the form "if I see feature A and make move B, then I will see feature C". The hope was that such coincidences in the very high-dimensional space of all rules would give good generalization capacity and good recognition accuracy. However, we could not make it work, largely because of the problem of unsupervised feature encoding into the network. So we decided to set this project aside for a while and deal with the encoding problem first.
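The kWTA encoding step described above can be sketched as follows. This is a minimal NumPy illustration, assuming a random linear projection as the "linear activation"; the weight matrix, the input patch, and the values of n_cells and k are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def kwta(x, k):
    """k-winners-take-all: keep the k most active cells, binarize the rest."""
    winners = np.argsort(x)[-k:]          # indices of the k largest activations
    y = np.zeros_like(x, dtype=bool)
    y[winners] = True
    return y

# Linear activation + kWTA, as described in the text: the k most active
# cells are selected, yielding a sparse binary code for the feature.
n_input, n_cells, k = 64, 500, 25          # illustrative sizes
W = rng.random((n_cells, n_input))         # random projection weights
feature_patch = rng.random(n_input)        # stand-in for a fixated image patch
sparse_code = kwta(W @ feature_patch, k)

print(sparse_code.sum())  # prints 25: exactly k cells are active
```

The resulting code is sparse and binary by construction, with a fixed activity level of k/n_cells, which is what makes it usable as an address into the high-dimensional space of "feature A + move B → feature C" rules.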

See the preprint draft:

active-vision.pdf

See the GitHub repository:

https://github.com/KyivAIGroup/ActiveVision