Convolutional Neural Network for Visual Tracking

Image Question Answering

2015.09.24, Computer Vision Lab., Hyeonwoo Noh

News: Increasing interest in Image Question Answering

NIPS 2015
- Visalogy: Answering Visual Analogy Questions. Fereshteh Sadeghi (University of Washington), Ross Girshick (Microsoft Research), Larry Zitnick (Microsoft Research), Ali Farhadi (University of Washington)
- Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. Haoyuan Gao (Baidu), Junhua Mao (UCLA), Jie Zhou (Baidu), Zhiheng Huang (Baidu), Lei Wang (Baidu), Wei Xu (Baidu)
- Exploring Models and Data for Image Question Answering. Mengye Ren, Ryan Kiros, Richard Zemel (University of Toronto)

ICCV 2015
- Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. Mateusz Malinowski, Marcus Rohrbach, Mario Fritz (Oral)
- VQA: Visual Question Answering. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, Larry Zitnick, Devi Parikh
- Visual Madlibs: Fill in the blank Description Generation and Question Answering. Licheng Yu, Eunbyung Park, Alex Berg, Tamara Berg

Contents
- Recap: Necessity of Image Question Answering, Datasets

- State-of-the-art methods
- Important Questions
- Strategy for Solving Image QA
- Sub-problems in Image QA
- Progress: Experiments on multi-domain classification
- Discussion

Recap: Necessity of Image Question Answering

If we want to build a machine that understands a scene for general purposes, which task should we solve?
- Image Classification?
- Object Detection? (person)
- Action Recognition? (eating an apple)
- Semantic Segmentation? (apple,

person)

There is a lot of other information: the baby is wearing a hat; the color of the hat is sky blue; it is a sunny day.

Limitation of the Current Approach
- We cannot define a separate task for each different level of understanding (and we cannot annotate training data for each of them either).
- We cannot run a different algorithm whenever we need a different level of understanding.

Recap: How could we define the Image Q/A task?

The task is defined by the question:
- What is the person doing? Tennis
- What is the animal? Elephant
- What color is the umbrella? Red
- How many people are there? Two

We can collect only the questions that we are interested in.

- Synthetic datasets
- Carefully collected question/answer pairs annotated by humans

Recap: Datasets for Image Question Answering
- DAQUAR-894/37 [1]
- COCO-QA [2]
- FM-IQA [3]
- VQA [4]

[1] Mateusz Malinowski and Mario Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[2] Mengye Ren, Ryan Kiros, and Richard Zemel, Exploring Models and Data for Image Question Answering. In NIPS, 2015.
[3] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In NIPS, 2015.
[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, VQA: Visual Question Answering. In ICCV, 2015.

Recap: Dataset Overview

|                     | DAQUAR-894/37 [1]                                      | COCO-QA [2]                      | FM-IQA [3]       | VQA [4]                             |
| Image dataset       | NYUD v2                                                | MSCOCO                           | MSCOCO           | MSCOCO                              |
| How to collect      | Human annotation                                       | Auto-generation (from captions)  | Human annotation | Human annotation                    |
| Number of images    | 795 / 654 (train/test)                                 | 78,736 / 38,948 (train/test)     | 120,360          | 123,287                             |
| Number of questions | 6,794 / 5,674 [894]; 3,876 / 594 [37] (train/test)     | 78,736 / 38,948 (train/test)     | 250,569          | 369,861                             |
| Number of answers   | 6,794 / 5,674 [894]; 3,876 / 594 [37] (train/test)     | 78,736 / 38,948 (train/test)     | 250,569          | 3,698,610                           |
| Release             | Yes                                                    | Yes                              | Not yet          | Beta v0.9 (full release: 2015/09)   |
| Reported score      | Exists                                                 | Exists                           | Human evaluation | No reported score                   |

Notes:
- The number of images differs across the MSCOCO-based datasets because some are still being collected [3,4] or some examples were discarded while generating Q/A pairs automatically [2].
- DAQUAR-894/37: [894] contains 894 categories; [37] contains 37 classes (for object questions).

Recap: Summary on Datasets and Evaluation
- DAQUAR: Too little data. The objective is relatively clear (number, object, color, {object, color}).
- COCO-QA: Clear question types and a lot of data help in checking an algorithm's improvement on subdivided problems (object, color, location, number). Syntactically wrong or illogical questions/answers might be a problem.
- FM-IQA: As it contains many multi-word answers, automatic evaluation is difficult, but it might be useful for extending the model to generate multiple answers by fine-tuning.

Translated question/answer pairs might be useful.
- VQA: Various problems must be solved to perform well on this dataset. Comparison of overall scores will be possible (annual challenge).

Recap: How to Solve QA (State of the Art)

Result on DAQUAR-37
| Method                                | Accuracy | WUPS 0.9 | WUPS 0.0 |
| Multi-World [1]                       | 0.1273   | 0.1810   | 0.5147   |
| IMG [2]                               | -        | -        | -        |
| BOW [2]                               | 0.3267   | 0.4319   | 0.8130   |
| LSTM [2]                              | 0.3273   | 0.4350   | 0.8162   |
| IMG+BOW [2]                           | 0.3417   | 0.4499   | 0.8162   |
| VIS+LSTM [2]                          | 0.3441   | 0.4605   | 0.8223   |
| 2-VIS+BLSTM [2]                       | 0.3578   | 0.4683   | 0.8215   |
| QA-CNN [5]                            | 0.3966   | 0.4419   | 0.8214   |
| No img [5]                            | 0.3270   | 0.4432   | 0.8098   |
| Ensemble [2] (IMG+BOW + 2-VIS+BLSTM)  | 0.3694   | 0.4815   | 0.8268   |
| Human [1]                             | 0.6027   | 0.6104   | 0.7896   |

Result on COCO-QA
| Method                                | Accuracy | WUPS 0.9 | WUPS 0.0 |
| IMG [2]                               | 0.4302   | 0.5864   | 0.8585   |
| BOW [2]                               | 0.3752   | 0.4854   | 0.8278   |
| LSTM [2]                              | 0.3676   | 0.4758   | 0.8234   |
| IMG+BOW [2]                           | 0.5592   | 0.6678   | 0.8899   |
| VIS+LSTM [2]                          | 0.5331   | 0.6391   | 0.8825   |
| 2-VIS+BLSTM [2]                       | 0.5509   | 0.6534   | 0.8825   |
| QA-CNN [5]                            | 0.5495   | 0.6536   | 0.8858   |
| without multimodal [5]                | 0.5314   | 0.6433   | 0.8841   |
| Ensemble [2] (IMG+BOW + 2-VIS+BLSTM)  | 0.5784   | 0.6790   | 0.8952   |

[1] Mateusz Malinowski and Mario Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[2] Mengye Ren, Ryan Kiros, and Richard Zemel, Exploring Models and Data for Image Question Answering. In NIPS, 2015.

[5] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.

Contents
- Recap: Necessity of Image Question Answering, Datasets, State-of-the-art methods
- Important Questions
- Strategy for Solving Image QA
- Sub-problems in Image QA
- Progress: Experiments on multi-domain classification
- Discussion

Important Questions - 1
- Could these methods solve Image Question Answering perfectly?
- Why couldn't these models solve Image QA perfectly? How could you know?
- What does solving Image QA mean? How can we objectively decide whether an algorithm has solved Image QA or not?

Important Questions - 2

- When can we say that Image QA is solved perfectly? When we reach human-level accuracy on a real-world Image QA dataset.
- How can we see the progress? My opinion: quantitative results on a real-world Image QA dataset. Why? Remember the main reason why we have to solve Image QA.
- But isn't it hard to understand the source of improvement on a real-world Image QA dataset? Yes, but we can use sub-tasks (synthetic datasets or special-purpose QA datasets) to understand the characteristics of algorithms. Note: improvement on a sub-task should lead to improvement on the real-world Image QA dataset.

Contents
- Recap
- Important Questions
- Strategy for Solving Image QA
- Sub-problems in Image QA
- Progress: Experiments on multi-domain classification
- Discussion

Strategy for Solving Image QA
- Objective: improve accuracy on a real-world Image QA dataset.
- Strategy: find a sub-problem, generate a subtask, analyze previous methods, evaluate on real-world data, make an improvement.

Sub-problems in Image QA
- Classification with a complex setting: multi-domain classification; classification with input/output connection; zero-shot learning
- Novel computer vision tasks: reference problem; spatial relation problem

  - visual semantic role labeling [6]; weakly supervised learning to count
- Data efficiency problems: operation compositionality; Image QA task compositionality
- Natural language understanding: extracting operations and inputs from the question

[6] Saurabh Gupta and Jitendra Malik. "Visual Semantic Role Labeling." arXiv preprint arXiv:1505.04474 (2015).

Classification with Complex Setting: Multi-domain Classification
- Classification on multiple datasets at the same time.
- Performance decreases due to more candidate answers.
- Some images might have more than one answer depending on the task:
  - What is this? Person (object classification)
  - What is she doing? Jumping (action classification)

(Figure: separate object and action classifiers vs. a single unified classifier; an object classifier and a yes/no classifier over the same classes {elephant, lion, dog, cat, hippo}.)

Classification with Complex Setting: Classification with Input/Output Connection

Relation between classification and yes/no questions:

- Standard classification: no class input; a class as output.
- Yes/No classification: a class as input; yes/no as output.
- If we regard the two tasks separately, we might need more data and more parameters.
  - What is this? Elephant
  - Is this an elephant? Yes

Classification with Complex Setting: Zero-shot Learning
- Real-world Image QA contains too many classes.
- Could we generalize more with a finite number of labeled examples? (Figure: labeled examples "cat", "cats", "dog", "dogs", "bear", "bears", "penguin"; question: "Are these penguins?")

- Another such question: "Is this a kitty?"

Novel Computer Vision Task: Reference Problem
- Even in the same image, the answer differs according to the question.
- This task is different from multi-domain classification: in multi-domain classification, the problem is solvable if we separate each sub-task; here, even a separated sub-task cannot be solved with simple classification.
- Examples: What color is the cup? What color is the teapot? What color is the spoon?

Novel Computer Vision Task: Spatial Relation Problem
- Lots of questions require spatial-relation reasoning to answer.
- Spatial relations in Image QA are not directly related to coordinates in the image; they are related to object class, object direction, the 3D configuration of objects, etc.
- Examples: What is behind the horse? What is in front of the bed? What is beside the cat?

Novel Computer Vision Task

Visual Semantic Role Labeling [6]
- Inferring semantic relations between objects.
- A new dataset for this specific task has been released as well [6].
- Examples: What is the woman holding? What is the man riding? What is the man throwing?

[6] Saurabh Gupta and Jitendra Malik. "Visual Semantic Role Labeling." arXiv preprint arXiv:1505.04474 (2015).

Novel Computer Vision Task: Weakly Supervised Learning to Count
- Counting could be solved by object detection, but object detection requires bounding-box annotations, and there are too many classes in the QA task.
- Weakly supervised detection is difficult, but weakly supervised learning to count might be easier (we only have to count discriminative parts).
- Examples: How many people? How many dishes? How many snowboards?

Data Efficiency Problem: Operation Compositionality

- An operation has to generalize to any class (object). Otherwise, the amount of data we need grows like (data to learn a class) x (data to train an operation) x (number of solvable operations).
- If our model can solve an operation, it should be able to apply that operation to any class (object). Training: "How many A? How many B? How many C?", "Is this A? Is this B? Is this C? Is this D?". Testing: "How many D?"

Data Efficiency Problem: Image QA Task Compositionality
- If we can solve some Image QA tasks, we should be able to solve their compositions. Otherwise, we have to solve every composition of operations separately, and we might also need a separate dataset for each composition of operations.
- Example composing the reference problem, spatial relations, and classification:

"What is the man on the horse doing?"

Natural Language Understanding
- In the Image QA task, information about the task is given in natural language.
- We have to extract information (the sequence of operations and their inputs) from natural language.
- Open questions: What is the required information for each task? How can we learn to extract that information from the question?

Contents
- Recap
- Important Questions
- Strategy for Solving Image QA
- Sub-problems in Image QA
- Progress: Experiments on multi-domain classification
- Discussion

Progress: Multi-domain Classification

Synthetic dataset
- If we merge two separate classification tasks, will performance decrease?
- Experiment on synthetic data: a simple naive approach works well. Separate networks classify MNIST (0~9) and CIFAR-10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck); the unified network classifies over the union of the 20 labels.

| Accuracy          | Separate networks | Simple unified network |
| MNIST             | 99.62%            | 99.66%                 |
| CIFAR-10          | 88.61%            | 88.40%                 |
| MNIST + CIFAR-10  | 94.12%            | 94.03%                 |

Progress: Multi-domain Classification, Synthetic dataset
- A more difficult problem: what if the same image has different labels according to the task?
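The "simple unified network" above amounts to training one softmax over the union of both label spaces. A minimal sketch of that label-space merge (the label lists are from the slide; the helper name and offsets are my own illustration):

```python
import numpy as np

# Per-domain label spaces, as in the experiment above.
MNIST_LABELS = [str(d) for d in range(10)]
CIFAR_LABELS = ["airplane", "automobile", "bird", "cat", "deer",
                "dog", "frog", "horse", "ship", "truck"]
UNIFIED = MNIST_LABELS + CIFAR_LABELS  # 20-way output of the unified net

def to_unified(domain, label_idx):
    """Map a per-domain label index to the unified 20-way index."""
    offset = 0 if domain == "mnist" else len(MNIST_LABELS)
    return offset + label_idx

# A CIFAR "cat" (per-domain index 3) becomes unified index 13.
print(UNIFIED[to_unified("cifar", 3)])  # prints "cat"
```

With labels remapped this way, both datasets can be mixed into one training stream for a single 20-way classifier, which is all the naive approach requires.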

Naive vs. ideal setups:
- Task 1: label an MNIST image as a digit (0~9).
- Task 2: label an MNIST image as "number" (among Number, airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

| Setup                                                   | Score  |
| Naive (one softmax over the union, no task information) | 87.66% |
| Task feature                                            | 94.16% |
| Ideal (separate output layers per task)                 | 94.16% |

Progress: Multi-domain Classification, Synthetic dataset analysis

A simple hypothesis on why the task feature works so well. The classifier is trained by softmax classification:

  p_i = exp(s_i) / sum_j exp(s_j)

We can apply an element-wise multiplication by a vector t to the softmax:

  p_i = t_i exp(s_i) / sum_j t_j exp(s_j),   where t_i exp(s_i) = exp(s_i + log t_i)

In the ideal case, t_i is 1 for classes 0~9 and 0 for the others for task 1, and t_i is 0 for classes 0~9 and 1 for the others for task 2. If t is trained appropriately, this behavior can be approximated.
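The masking identity above can be checked numerically. A small sketch (scores and mask are made-up numbers): multiplying the exponentials by a 0/1 task vector gives exactly the softmax restricted to the allowed classes:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift for numerical stability
    return e / e.sum()

def masked_softmax(s, t):
    """Element-wise multiply the class exponentials by a task vector t."""
    e = t * np.exp(s - s.max())
    return e / e.sum()

s = np.array([2.0, 1.0, 0.5, 3.0])   # class scores
t = np.array([1.0, 1.0, 0.0, 0.0])   # task allows only the first two classes

p = masked_softmax(s, t)
# For a 0/1 mask this equals the softmax over the allowed classes only,
# since t_i * exp(s_i) = exp(s_i + log t_i) zeroes out masked classes.
restricted = softmax(s[:2])
```

So a hard task vector recovers per-task classifiers exactly, and a learned (soft) task vector can approximate that behavior.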

Progress: Multi-domain Classification, Synthetic dataset with NLP
- More difficult: what if you have to learn the question type from natural language?
- I define 3 tasks and several questions for each task, and use an LSTM to generate the task feature from the question sentence.

| Task feature              | Score  |
| Ground-truth task         | 94.01% |
| LSTM task feature (trained) | 94%  |

Task and question definitions:
- what is this number? (task 1: 1~10)
- which number is this? (task 1: 1~10)
- read this number (task 1: 1~10)
- how do you call this digit? (task 1: 1~10)
- how do you call this number? (task 1: 1~10)
- what is this? (task 2: number, cls 1~10)
- what is this object? (task 2: number, cls 1~10)
- which object is this? (task 2: number, cls 1~10)
- how do you call this object? (task 3: 1~10, cls 1~10)
- what is the name of this object? (task 3: 1~10, cls 1~10)
- what is the class of this image? (task 3: 1~10, cls 1~10)

- which class is this? (task 3: 1~10, cls 1~10)

Progress: Multi-domain Classification, Real dataset (COCO-QA)

Relation to the current state of the art [IMG+BOW]:
- IMG+BOW concatenates the fc7 feature from the image with a BOW feature and feeds the result to a softmax classifier. (If BOW is changed to LSTM, this is the same as our question-feature model.)
- So we implemented this model and evaluated it on COCO-QA:

| Model    | Score  |
| IMG+BOW  | 55.92% |
| IMG+LSTM | 55.42% |

Progress: Multi-domain Classification, Real dataset (COCO-QA)
- Why is IMG+BOW better than IMG+LSTM and other models?
- My hypothesis: the lack of training data (questions) in the Image QA dataset makes it hard to generalize. When validation accuracy reaches its highest score, training accuracy already exceeds 90%.
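A minimal sketch of the IMG+BOW architecture described above, with made-up toy dimensions (the real model uses a 4096-d fc7 feature from a CNN and the dataset's full vocabulary and answer set):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_ANSWERS, FC7_DIM = 8, 5, 16   # toy sizes; real fc7 is 4096-d

def bow_feature(token_ids, vocab=VOCAB):
    """Bag-of-words: sum of one-hot vectors for each word in the question."""
    feat = np.zeros(vocab)
    for t in token_ids:
        feat[t] += 1.0
    return feat

def img_bow_logits(fc7, token_ids, W, b):
    """Concatenate image and question features, then one linear softmax layer."""
    x = np.concatenate([fc7, bow_feature(token_ids)])
    return W @ x + b

W = rng.normal(size=(N_ANSWERS, FC7_DIM + VOCAB))  # untrained weights
b = np.zeros(N_ANSWERS)
fc7 = rng.normal(size=FC7_DIM)                     # stands in for a CNN feature
logits = img_bow_logits(fc7, [0, 3, 3, 5], W, b)
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```

Swapping `bow_feature` for the final hidden state of an LSTM over the question gives the IMG+LSTM variant compared in the table above.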

Solution: utilize information from a large language corpus.
1. Use word2vec to cluster answers into similar types.
2. Use a pre-trained sentence-embedding model for question feature extraction.

Progress: Multi-domain Classification, Answer Type Clustering
- Cluster answers using word2vec vectors (word2vec: word vectors trained by skip-gram; similar words have similar vectors).
- k-means clustering with k = 40 is used.
- Interesting clustering results. Example answer words drawn from the clusters include colors (yellow, green, brown, black, purple, blue, gray, white), counts (One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Ten), vehicles and transport (Airplane, Airplanes, Airliner, Buses, Scooter, Wheel, Wheelchair, Bicycle, Bicycles, Motorcycles, Trains, Locomotive, Tram, Ship, Freight, Trailer, Wagon, Shuttle, Engine, Rail, Highway, Terminal, Airport, Ramp), foods (carrots, Carrot, turkey, Chicken, Bananas, Vegetable, Vegetables, Cheese, Drink, Chocolate, Apples, lamb), and signs/flags (sign, signs, flag, flags).

Progress: Multi-domain Classification, Answer Type Clustering
- Problem for answer type clustering: predicting the answer cluster from the question is very difficult.
- We use the answer cluster number as the answer type, and each answer type defines a separate task.
- Accuracy when the ground-truth answer cluster is known: 59.34%
- Answer-type classification accuracy with an LSTM: 61.83%
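The clustering step is plain k-means on word vectors. This is not the author's pipeline (which clusters word2vec vectors of the COCO-QA answers with k = 40); it is a toy sketch with hand-made 2-D stand-in "word vectors", where color words and count words sit in two well-separated groups:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep empty clusters where they are
                centroids[j] = X[assign == j].mean(axis=0)
    return assign

# Toy 2-D "word vectors": colors near (0, 0), counts near (10, 10).
words = ["yellow", "green", "blue", "two", "three", "four"]
X = np.array([[0.1, 0.2], [0.0, 0.4], [0.3, 0.1],
              [10.1, 9.8], [9.9, 10.2], [10.0, 10.0]])
assign = kmeans(X, k=2)
```

With real skip-gram vectors the same procedure groups colors, counts, vehicles, and so on, which is what the cluster examples above show.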

- Accuracy on training data: 92.40% (overfitting)
- Expected total prediction accuracy: 59.34% x 61.83% = 36.69%
- We need a better question feature extraction method for better generalization.

Progress: Multi-domain Classification, Skip-thought Vector as Question Feature
- Skip-thought vector: similar to word2vec, but at the sentence level. Given a sentence, predict the previous and the next sentence.
- Uses a GRU (gated recurrent unit) for sentence embedding.
- This sentence embedding has been used for the answer-type detection task in Q/A in NLP.

Progress: Multi-domain Classification, Skip-thought Vector as Question Feature

Answer-type classification:
| Model                          | Score  |
| LSTM, random init              | 61.83% |
| GRU, skip-thought fixed        | 67.81% |
| GRU, random init               | 70.05% |
| GRU, skip-thought fine-tuning  | 71.34% |

Total task score with answer-type classification:
| Model                          | Score  |
| Answer-type detection + fixed  | 40.62% |
| Answer-type detection + fine-tune | 43.11% |
| Same architecture, end-to-end  | 43.37% |

Skip-thought vector as question feature, answer-type regularization:
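The "ans type loss (0.2)" and "(0.5)" entries in the results that follow suggest an auxiliary answer-type loss added to the answer loss with a weight. A minimal sketch of such a weighted multi-task objective (my reading of the setup, not the author's code):

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (target is a class index)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def total_loss(ans_logits, ans_target, type_logits, type_target, lam=0.2):
    """Answer loss plus a lambda-weighted answer-type auxiliary loss."""
    return (cross_entropy(ans_logits, ans_target)
            + lam * cross_entropy(type_logits, type_target))

ans_logits = np.array([2.0, 0.5, -1.0])   # made-up answer scores
type_logits = np.array([0.3, 1.2])        # made-up answer-type scores
loss = total_loss(ans_logits, 0, type_logits, 1, lam=0.2)
```

With lam = 0 this reduces to the plain answer classifier; the 0.2 and 0.5 rows below vary only this weight.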

| Model                                          | Score  |
| IMG+LSTM                                       | 55.42% |
| IMG+GRU, random init                           | 55.83% |
| IMG+GRU, random init + word vectors            | 56.75% |
| IMG+GRU, skip-thought                          | 57.27% |
| IMG+GRU, skip-thought + answer-type loss (0.2) | 57.16% |
| IMG+GRU, skip-thought + answer-type loss (0.5) | 56.73% |

Contents
- Recap
- Important Questions
- Strategy for Solving Image QA
- Sub-problems in Image QA
- Progress: Experiments on multi-domain classification
- Discussion

Discussion

Sharing simple ideas on solving some sub-tasks:
- Yes/no questions: score selection layer

- Reference problem: synthetic dataset, attention or decoupled net
- Weakly supervised learning to count: connection to weakly supervised detection, using a recurrent attention model

Simple idea for solving yes/no questions
- Use the classification result for yes/no answering.
- Use a task-selection feature to choose the final answer between the class scores and a yes/no output.

Simple idea for solving the reference problem
- Use synthetic data: if there is a large enough amount of data, would the model solve the reference problem?
- How to solve it? Spatial Transformer Networks [7], attention models [8], or a decoupled deep neural network.
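The yes/no-from-classification idea above can be sketched in a few lines. This is an illustration with made-up scores, not the proposed score-selection layer itself: the same class scores that answer "What is this?" also answer "Is this X?":

```python
import numpy as np

CLASSES = ["elephant", "lion", "dog", "cat", "hippo"]

def answer_what(scores):
    """'What is this?' -> the top-scoring class."""
    return CLASSES[int(np.argmax(scores))]

def answer_is_this(scores, queried):
    """'Is this X?' -> yes iff X is the top-scoring class."""
    return "yes" if answer_what(scores) == queried else "no"

scores = np.array([3.1, 0.2, -0.5, 0.0, 1.4])  # made-up classifier scores
print(answer_what(scores))              # prints "elephant"
print(answer_is_this(scores, "lion"))   # prints "no"
```

Reusing the classifier this way is what ties the two tasks together, so no separate yes/no training data is needed for classes the model already knows.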

(Figure: digits 1, 2, 5 with the questions "What is the color of 5?" and "What is the segmentation for the person?")

[7] Spatial Transformer Networks: http://arxiv.org/abs/1506.02025
[8] Kelvin Xu, et al. "Show, attend and tell: Neural image caption generation with visual attention." arXiv preprint arXiv:1502.03044 (2015).

Simple idea for weakly supervised learning to count
- Use synthetic data. (Figure: digits 1, 1, 1 with the question "How many 1?")
- How to solve it? A recurrent attention model [9].

[9] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. "Multiple object recognition with visual attention." arXiv preprint arXiv:1412.7755 (2014).

Appendix: VIS+LSTM, Multi-World (model diagrams omitted)

Preliminary: BOW with word embedding

- Plain BOW counts word frequencies by adding the one-hot vector of each word. For "what is the color of the flower", the one-hot vectors 000000001, 000000010, 000010000, 000000100, 001000000, 000100000, 010000000 sum to the BOW vector 011110111.
- BOW with word embedding adds the word vector of each word instead of its one-hot vector. (Figure: the word vectors for "what is the color of the flower" summed into a single vector.)
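Both variants are one-line sums. A small sketch (vocabulary, embedding size, and vectors are made up; real models learn the embedding rows):

```python
import numpy as np

VOCAB = ["what", "is", "the", "color", "of", "flower", "red"]
EMB_DIM = 4

rng = np.random.default_rng(0)
# Trainable word vectors, one row per vocabulary word (random stand-ins here).
embeddings = rng.normal(size=(len(VOCAB), EMB_DIM))

def bow_onehot(words):
    """Plain BOW: sum of one-hot vectors, i.e. word counts."""
    v = np.zeros(len(VOCAB))
    for w in words:
        v[VOCAB.index(w)] += 1.0
    return v

def bow_embedding(words):
    """Embedded BOW: sum of the word vectors of each word."""
    return sum(embeddings[VOCAB.index(w)] for w in words)

q = "what is the color of the flower".split()
counts = bow_onehot(q)    # note "the" appears twice
feat = bow_embedding(q)   # equals counts @ embeddings
```

Since the embedded BOW is just `counts @ embeddings`, gradients flow into the embedding rows, which is exactly how this representation is trained.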

Summing the word vectors gives the sentence representation. This BOW representation can be trained by treating the word vector of each word as parameters.

Preliminary: RNN, LSTM

Solution to the vanishing gradient: LSTM
- The vanishing gradient is due to the recursive multiplication by the hidden-to-hidden weights.
- Instead of updating the hidden state multiplicatively, LSTM maintains a cell c that is updated by addition, the constant error carousel [10], so the gradient can be kept.
- Gates refine this: the input gate i controls what is written (the input is updated by addition), the forget gate f keeps the magnitude of the cell value in check, and the output gate o filters what flows out (and hence filters the gradient):

  g_t = tanh(W_g x_t + U_g h_{t-1})       (candidate input)
  i_t = sigmoid(W_i x_t + U_i h_{t-1})    (input gate)
  f_t = sigmoid(W_f x_t + U_f h_{t-1})    (forget gate)
  o_t = sigmoid(W_o x_t + U_o h_{t-1})    (output gate)
  c_t = f_t * c_{t-1} + i_t * g_t         (cell updated by addition)
  h_t = o_t * tanh(c_t)

[10] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735-1780.

Gated Recurrent Unit (GRU)
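The equations above translate directly into code. A minimal single-cell sketch (random untrained parameters, toy dimensions; bias and gate-stacking conventions vary across implementations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b hold the stacked i, f, o, g parameters."""
    z = W @ x + U @ h + b                  # shape (4 * hidden,)
    n = h.shape[0]
    i = sigmoid(z[0*n:1*n])                # input gate
    f = sigmoid(z[1*n:2*n])                # forget gate
    o = sigmoid(z[2*n:3*n])                # output gate
    g = np.tanh(z[3*n:4*n])                # candidate cell input
    c_new = f * c + i * g                  # additive cell update (the CEC)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
X_DIM, H_DIM = 3, 4
W = rng.normal(size=(4 * H_DIM, X_DIM))
U = rng.normal(size=(4 * H_DIM, H_DIM))
b = np.zeros(4 * H_DIM)

h = np.zeros(H_DIM); c = np.zeros(H_DIM)
for x in rng.normal(size=(5, X_DIM)):      # run over a 5-step toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive line `c_new = f * c + i * g` is the constant error carousel: the gradient through c is scaled only by the forget gate, not repeatedly multiplied by the recurrent weight matrix.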
