Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy

Spatial positions need the bounding box: normalized box coordinates (plus relative area) are appended to the region's CNN features.
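A minimal sketch of that kind of spatial encoding (the function name and argument order are mine, not the paper's):

```python
def bbox_features(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """5-dim spatial encoding: top-left and bottom-right corners
    normalized by image size, plus box area over image area."""
    area_ratio = (x_br - x_tl) * (y_br - y_tl) / (img_w * img_h)
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            area_ratio]

# e.g. a 100x200 box at (50, 40) in a 640x480 image
print(bbox_features(50, 40, 150, 240, 640, 480))
```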

Evaluating the mid-level vectors is hard because they may be an internal encoding: the model will communicate better with itself, in its own encoding, than with us.

  • Co-training as a multi-view system
  • Multiple models trained on different views of the data, improving each other iteratively

Training in this paper is semi-supervised, and it is essentially bootstrapping: a model trained on the small labeled set captions unlabeled boxes, keeps only the captions it can comprehend back to the correct box, and retrains (see the sketch below).
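A rough sketch of that bootstrapping loop. `train`, `generate`, and `locate` are hypothetical callables standing in for the paper's generator and comprehension models, and the IoU threshold is my stand-in for the paper's filtering criterion:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_tl, y_tl, x_br, y_br)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def bootstrap(labeled, unlabeled, train, generate, locate,
              rounds=3, iou_threshold=0.5):
    """Each round: caption unlabeled boxes, keep only captions the model can
    resolve back to the source box, then retrain on labeled + pseudo-labeled."""
    model = train(labeled)
    for _ in range(rounds):
        pseudo = []
        for image, box in unlabeled:
            caption = generate(model, image, box)          # describe the box
            if iou(locate(model, image, caption), box) >= iou_threshold:
                pseudo.append((image, box, caption))       # self-consistent
        model = train(labeled + pseudo)
    return model
```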

The examples show interesting visual features that can encode relations like 'behind'. The assumption is that the training dataset must contain such words.

The hardest ground truth is the most confusing caption: one that almost fits a distractor region as well.

The paper trains with a maximum mutual information (MMI) objective: a good description should pick out its region uniquely, so descriptions that also fit other regions are penalized (sketch below).
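A tiny numpy sketch of the softmax-MMI idea: given log p(S | R_i, I) for the true region and its distractors, maximize the posterior p(R_true | S, I). Putting the true region at index 0 is a convention of this sketch, not the paper's:

```python
import numpy as np

def mmi_loss(logp_caption_given_region):
    """Negative log posterior of the true region (index 0):
    -log [ p(S|R_0,I) / sum_i p(S|R_i,I) ], computed with log-sum-exp."""
    logp = np.asarray(logp_caption_given_region, dtype=float)
    lse = logp.max() + np.log(np.exp(logp - logp.max()).sum())
    return -(logp[0] - lse)

# The true region scores the caption higher than two distractor regions,
# so the loss is small; a confusable distractor would dominate the sum.
print(mmi_loss([-2.0, -5.0, -6.0]))
```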

N is a hyperparameter, taken from the UNC dataset.

Applications

  • Communication with automated systems

Guide attention models

  • Unambiguous descriptions could be used to generate soft and hard attention (see the sketch below)
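Not from the paper, but for reference, the two attention flavors the note contrasts, as a generic numpy sketch:

```python
import numpy as np

def soft_attention(scores, region_feats):
    """Soft attention: softmax weights over regions, return the weighted
    average of region features (differentiable)."""
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ region_feats

def hard_attention(scores, region_feats, rng=None):
    """Hard attention: sample a single region from the same softmax
    distribution (non-differentiable, usually trained with REINFORCE)."""
    rng = rng or np.random.default_rng()
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return region_feats[rng.choice(len(w), p=w)]
```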

Dividing the image into regions

  • Harms captioning and VQA
  • A big enough LSTM should be able to understand different regions on its own

  • Like residual networks
    • Forces the network to remember the image
    • The image vector is fed in again at every step (see the sketch below)
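A minimal PyTorch sketch of that "re-feed the image" pattern: the fixed region/image feature is concatenated with the word embedding at every LSTM step instead of only seeding the initial state. Class and dimension choices are illustrative (e.g. a 1000-d CNN feature plus the 5-d bbox vector), not the paper's:

```python
import torch
import torch.nn as nn

class CaptionLSTM(nn.Module):
    """Feed the (fixed) image/region feature at every time step, so the
    LSTM keeps re-seeing the image instead of only being initialized with it."""
    def __init__(self, vocab_size=10000, embed_dim=512, feat_dim=1005, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, words, feat):
        # words: (T, B) word indices; feat: (B, feat_dim) CNN + bbox features
        h = feat.new_zeros(feat.size(0), self.cell.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for w in words:                                  # one step per word
            x = torch.cat([self.embed(w), feat], dim=1)  # image fed in again
            h, c = self.cell(x, (h, c))
            logits.append(self.out(h))                   # next-word scores
        return torch.stack(logits)                       # (T, B, vocab)
```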

Are they training word embeddings on their own? Their dimensions differ from the standard pre-trained word embeddings.

P(sentence | region, image)
  • Like an n-gram model
  • Next word given the sentence so far
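Spelled out, "next word given the sentence so far" is the chain-rule factorization p(S | R, I) = ∏_t p(w_t | w_1, …, w_{t-1}, R, I); the LSTM models one factor per time step, like an n-gram but without truncating the history.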

Fine-tuning on MS COCO should give a last layer with 80 dims, but the paper says it is 1000 dims. Why?

  • Likely because the 1000-dim ImageNet classification layer is kept as the feature vector; fine-tuning adjusts its weights rather than resizing it to COCO's 80 classes
  • The fine-tuning is end to end

Interesting to note that the weak (pseudo) labels did not degrade results on the actual ground-truth labels.

A possible idea that comes to mind: can unambiguous object descriptions be used for an image retrieval task?

  • What is the one thing about this image that defines it uniquely?
  • Something that is different from all other images?
  • However, what is unique within an image may not be unique over the entire dataset
    • This could be a problem when approaching the image retrieval task.



Published

24 June 2016
