Official site: GQA: Visual Reasoning in the Real World
The questions are designed to reduce strong language biases, and most of them are constructed from scene graphs.
consistency, validity, plausibility, grounding and distribution scores
Our starting point in creating the GQA dataset is the Visual Genome Scene Graph annotations [20] that cover 113k images from COCO [23] and Flickr [36]. The scene graph serves as a formalized representation of the image: each node denotes an object, a visual entity within the image, like a person, an apple, grass or clouds. It is linked to a bounding box specifying its position and size, and is marked up with about 1–3 attributes, properties of the object: e.g., its color, shape, material or activity. The objects are connected by relation edges, representing actions (verbs), spatial relations (prepositions), and comparatives.
The GQA dataset consists of 22,669,678 questions over 113,018 images, which cover a wide range of reasoning skills and vary in length and number of required inference steps (figure 6). The dataset has a vocabulary size of 3097 words and 1878 possible answers. While smaller than natural language datasets, further investigation reveals that it covers 88.8% and 70.6% of VQA questions and answers respectively, corroborating its wide diversity. A wide selection of dataset visualizations is provided in the supplementary.
We associate each question with two types: structural and semantic. The structural type is derived from the final operation in the question's functional program. It can be (1) verify for yes/no questions, (2) query for all open questions, (3) choose for questions that present two alternatives to choose from, e.g. "Is it red or blue?"; (4) logical which involve logical inference, and (5) compare for comparison questions between two or more objects. The semantic type refers to the main subject of the question: (1) object: for existence questions, (2) attribute: consider the properties or position of an object, (3) category: related to object identification within some class, (4) relation: for questions asking about the subject or object of a described relation (e.g. "what is the girl wearing?"), and (5) global: about overall properties of the scene such as weather or place. As shown in figure 6, the questions' types vary at both the semantic and structural levels.
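validity: the answer must fall within the scope of the question type and must not be off-topic. For example, answering yes/no to a question about color is invalid.
plausibility: a stricter requirement; the answer must also agree with common-sense knowledge. It must not contradict common sense, e.g. an elephant talking or eating pizza.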
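The two checks above can be read as set-membership tests. The following is only a minimal sketch under that reading: the `choices` dictionary, its `valid` / `plausible` keys and the helper names are assumptions made for illustration, and the data are made up.

```python
# Hypothetical per-question answer sets (made-up data for illustration).
choices = {
    "q1": {  # e.g. "What color is the apple?"
        "valid": {"red", "green", "yellow", "blue", "purple"},
        "plausible": {"red", "green", "yellow"},
    },
}

def is_valid(qid, prediction):
    """True if the prediction lies within the scope of the question type."""
    return prediction in choices[qid]["valid"]

def is_plausible(qid, prediction):
    """True if the prediction also agrees with common sense."""
    return prediction in choices[qid]["plausible"]

def score(predictions, check):
    """Percentage of predictions (question id -> answer) that pass `check`."""
    hits = sum(check(qid, answer) for qid, answer in predictions.items())
    return 100.0 * hits / len(predictions)

print(score({"q1": "purple"}, is_valid))      # valid colour name
print(score({"q1": "purple"}, is_plausible))  # but not a plausible apple colour
```

A "yes" answer to the color question would fail both checks, which matches the example given for validity.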
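The reason is that the GQA scene graph files contain only object names; there is no dictionary mapping object names to object ids.
So we first need to build such a dictionary. The VG dictionary is used as the basis, and the GQA scene graph JSON is reorganized accordingly.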
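A rough sketch of how such a dictionary could be built. It assumes a VG-style vocabulary with one class per line and comma-separated synonyms, and a scene graph layout of `{image_id: {"objects": {object_id: {"name": ...}}}}`. Giving fresh ids to names missing from the base vocabulary is one possible choice, not necessarily what was done here; all function names are hypothetical.

```python
def build_name_to_id(vocab_lines):
    """Map every object name, including each synonym, to one class id.

    Assumed layout: one class per line, synonyms separated by commas,
    e.g. 'tshirt,t-shirt,t shirt'.
    """
    name_to_id = {}
    for class_id, line in enumerate(vocab_lines):
        for name in line.split(","):
            name_to_id[name.strip().lower()] = class_id
    return name_to_id

def extend_with_scene_graphs(name_to_id, scene_graphs):
    """Assign a fresh id to every scene graph object name not yet known."""
    next_id = max(name_to_id.values(), default=-1) + 1
    for graph in scene_graphs.values():
        for obj in graph["objects"].values():
            name = obj["name"].strip().lower()
            if name not in name_to_id:
                name_to_id[name] = next_id
                next_id += 1
    return name_to_id

# Tiny made-up example.
vocab = ["donut,doughnut", "television,tv"]
graphs = {"1": {"objects": {"10": {"name": "tv"}, "11": {"name": "pipe"}}}}
name_to_id = extend_with_scene_graphs(build_name_to_id(vocab), graphs)
print(name_to_id)
```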
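The images are split the same way LXMERT splits them, and the scene graphs are partitioned according to that split, because the GQA official site does not offer pre-split files for download.
Extract the image ids from train.json, valid.json and testdev.json in the LXMERT GitHub repository and compare them with the keys of the scene graph file, to check whether every image has a human-annotated scene graph.
The experiment shows that every train image has a graph in the scene graph file, but the origin of the 2000+ extra scene graphs could not be determined.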
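The comparison described above comes down to a few set operations. This sketch assumes each split file is a JSON list of records carrying an `img_id` field and that the scene graph file is a dictionary keyed by image id; both layouts and the helper names are assumptions, and the demo data are made up.

```python
import json  # needed only when reading the real files, see the comment below

def image_ids(records):
    """Distinct image ids referenced by the records of one split."""
    return {str(r["img_id"]) for r in records}

def compare_with_scene_graphs(ids, scene_graphs):
    """Return (split images without a graph, graphs unused by the split)."""
    keys = set(scene_graphs)
    return ids - keys, keys - ids

def split_scene_graphs(scene_graphs, ids):
    """Keep only the scene graphs that belong to the given split."""
    return {k: g for k, g in scene_graphs.items() if k in ids}

# With real data one would load the files first, for example:
#   with open("train.json") as f:
#       records = json.load(f)
records = [{"img_id": "1"}, {"img_id": "2"}, {"img_id": "1"}]
scene_graphs = {"1": {}, "2": {}, "3": {}}

ids = image_ids(records)
missing, extra = compare_with_scene_graphs(ids, scene_graphs)
train_graphs = split_scene_graphs(scene_graphs, ids)
print(len(missing), len(extra), len(train_graphs))
```

An empty `missing` set means every image of the split has a scene graph; `extra` holds the graphs that the split never references.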
Next, consider the val_sceneGraphs.json set.
In the end, it turns out that testdev has no scene graphs.
Now count the distribution of object classes in train.
['stop sign,stopsign', 'microwave,microwave oven', 'refrigerator,fridge', 'television,tv', 'sailboat,sail boat', 'racket,racquet', 'headboard,head board', 'tennis racket,tennis racquet', 'skateboard,skate board', 'hot dog,hotdog', 'surfboard,surf board', 'fire hydrant,hydrant', 'suitcase,suit case', 'donut,doughnut', 'sidewalk,side walk', 'stove top,stovetop', 'nightstand,night stand', 'donuts,doughnuts', 'lamp post,lamppost', 'fire truck,firetruck', 'tail light,taillight', 'hot dogs,hotdogs', 'tshirt,t-shirt,t shirt', 'streetlight,street light']
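A minimal sketch of counting the class distribution, assuming the same scene graph layout as before (`objects` keyed by object id, each with a `name`). The optional `synonyms` mapping from a variant spelling to a canonical name is a hypothetical helper for merging pairs like the ones listed above.

```python
from collections import Counter

def object_class_counts(scene_graphs, synonyms=None):
    """Count how often each object class occurs across all images."""
    synonyms = synonyms or {}
    counts = Counter()
    for graph in scene_graphs.values():
        for obj in graph["objects"].values():
            name = obj["name"]
            counts[synonyms.get(name, name)] += 1
    return counts

# Tiny made-up example.
graphs = {
    "1": {"objects": {"10": {"name": "tv"}, "11": {"name": "wall"}}},
    "2": {"objects": {"20": {"name": "television"}}},
}
counts = object_class_counts(graphs, synonyms={"tv": "television"})
print(counts.most_common())
```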
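Next, count the number of objects in each image, so that a suitable number of boxes per image can be chosen.
That is, simply take the length of each image's objects collection.
These counts are first computed over all images in the official train scene graph file.
The counts in objnumtrain are then filtered down to the same train image set that LXMERT uses.
The average is about 16 objects per image; 180 images contain more than 50 objects, and 766 contain more than 40.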
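The statistics above could be computed along these lines. This is a sketch only: the scene graph layout, the `keep_ids` filter standing in for the LXMERT train image set, and the function name are assumptions, and the demo data are made up.

```python
from statistics import mean

def object_count_stats(scene_graphs, keep_ids=None, thresholds=(40, 50)):
    """Mean objects per image, plus how many images exceed each threshold."""
    counts = {
        image_id: len(graph["objects"])
        for image_id, graph in scene_graphs.items()
        if keep_ids is None or image_id in keep_ids
    }
    over = {t: sum(c > t for c in counts.values()) for t in thresholds}
    return mean(counts.values()), over

# Tiny made-up example with 2, 4 and 9 objects per image.
graphs = {
    "1": {"objects": {str(i): {} for i in range(2)}},
    "2": {"objects": {str(i): {} for i in range(4)}},
    "3": {"objects": {str(i): {} for i in range(9)}},
}
avg, over = object_count_stats(graphs, keep_ids={"1", "2"}, thresholds=(3,))
print(avg, over)
```

`len()` works whether `objects` is stored as a list or as a dictionary keyed by object id, so the sketch does not depend on that detail.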
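Next, convert each object's box to (x1, y1, x2, y2) form, together with the object coordinates, in preparation for extracting relations.
The result is organized as JSON; only the final state is shown here: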
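A small sketch of the coordinate conversion, assuming each object stores its box as `x`, `y`, `w`, `h` (top-left corner plus width and height). The output layout with `name`, `box` and `relations` keys is illustrative, not the format actually used here.

```python
def to_corner_box(obj):
    """Convert an (x, y, w, h) box to [x1, y1, x2, y2]."""
    x1, y1 = obj["x"], obj["y"]
    return [x1, y1, x1 + obj["w"], y1 + obj["h"]]

def convert_graph(graph):
    """Rewrite every object of one scene graph with a corner-format box."""
    return {
        obj_id: {
            "name": obj["name"],
            "box": to_corner_box(obj),
            "relations": obj.get("relations", []),
        }
        for obj_id, obj in graph["objects"].items()
    }

# Tiny made-up example.
graph = {"objects": {"11": {"name": "pipe", "x": 10, "y": 20, "w": 30, "h": 40}}}
converted = convert_graph(graph)
print(converted)
```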
import numpy as np
a = [1, 2, 3]
np.diag(a)  # 3x3 matrix with 1, 2, 3 on the diagonal
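The val scene graphs contain 295 relation types in total.
The train scene graphs contain 296 types in total, but from what I have seen most of them are redundant. Only about 150 are really useful.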
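Relation types can be tallied with a sketch like the one below. It assumes each object carries a `relations` list whose entries have a `name` field. Treating a relation as "useful" when it reaches a minimum frequency is just one possible criterion, chosen here for illustration.

```python
from collections import Counter

def relation_type_counts(scene_graphs):
    """Count how often each relation name occurs across all scene graphs."""
    counts = Counter()
    for graph in scene_graphs.values():
        for obj in graph["objects"].values():
            for rel in obj.get("relations", []):
                counts[rel["name"]] += 1
    return counts

def frequent_relations(counts, min_count):
    """Relation names seen at least `min_count` times, sorted by name."""
    return sorted(name for name, c in counts.items() if c >= min_count)

# Tiny made-up example.
graphs = {
    "1": {"objects": {
        "10": {"name": "pipe", "relations": [{"name": "on", "object": "11"}]},
        "11": {"name": "wall", "relations": []},
    }},
    "2": {"objects": {
        "20": {"name": "cup", "relations": [
            {"name": "on", "object": "21"},
            {"name": "to the left of", "object": "22"},
        ]},
    }},
}
rel_counts = relation_type_counts(graphs)
print(len(rel_counts), frequent_relations(rel_counts, min_count=2))
```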
By contrast, the VG Faster R-CNN provides only 20 types in total.
Of these, the following are not among VG's 20 classes:
{'categoryThis', 'existAttrNotC', 'activityWho', 'verifyAttrAnd', 'weatherChoose', 'materialVerify', 'materialChoose', 'companyVerifyC', 'existAttrOr', 'existAnd', 'categoryThisChoose', 'directOf', 'typeVerifyC', 'typeVerify', 'weather', 'sameRelate', 'activityChoose', 'positionQuery', 'companyVerify', 'weatherVerifyC', 'existRelSC', 'objThisChoose', 'typeChoose', 'company', 'weatherVerify', 'sameGender', 'verifyAttrKC', 'categoryAttr', 'sameAnimalsC', 'chooseAttr', 'categoryThatChoose', 'categoryRelO', 'existThatOr', 'relS', 'categoryThat', 'existAttrOrC', 'categoryRelS', 'diffAnimalsC', 'diffAnimals', 'twoSameMaterial', 'verifyMaterialAnd', 'placeChoose', 'sameAnimals', 'materialVerifyC', 'existRelS', 'existThatNotC', 'stateChoose', 'dir', 'existOrC', 'relVerifyCop', 'relVerify', 'positionVerifyC', 'comparativeChoose', 'twoSameC', 'twoDifferent', 'existMaterialC', 'existAndC', 'twoCommon', 'diffGender', 'locationVerifyC', 'sameGenderC', 'positionVerify', 'material', 'locationChoose', 'sameMaterialRelate', 'place', 'twoSameMaterialC', 'existMaterialNot', 'existAttr', 'relChooser', 'relVerifyCo', 'verifyAttr', 'how', 'existOr', 'verifyAttrs', 'verifyAttrsC', 'verifyAttrC', 'placeVerifyC', 'companyChoose', 'existC', 'existAttrNot', 'existMaterialNotC', 'categoryRelOChoose', 'twoDifferentC', 'existThatOrC', 'category', 'existThat', 'verifyAttrThis', 'twoSame', 'existThatNot', 'activity', 'relVerifyCr', 'verifyAttrCThis', 'state', 'existMaterial', 'exist', 'existAttrC', 'positionChoose', 'relO', 'directWhich', 'existRelSRC', 'existThatC', 'placeVerify', 'locationVerify', 'verifyAttrK'}
{"02930152": {"semantic": [{"operation": "select", "dependencies": [], "argument": "sky (2486325)"}, {"operation": "verify color", "dependencies": [0], "argument": "dark"}], "entailed": ["02930160", "02930158", "02930159", "02930154", "02930155", "02930156", "02930153"], "equivalent": ["02930152"], "question": "Is the sky dark?", "imageId": "2354786", "isBalanced": true, "groups": {"global": null, "local": "06-sky_dark"}, "answer": "yes", "semanticStr": "select: sky (2486325)->verify color: dark [0]", "annotations": {"answer": {}, "question": {"2": "2486325"}, "fullAnswer": {"2": "2486325"}}, "types": {"detailed": "verifyAttr", "semantic": "attr", "structural": "verify"}, "fullAnswer": "Yes, the sky is dark."},
"07333408": {"semantic": [{"operation": "select", "dependencies": [], "argument": "wall (722332)"}, {"operation": "filter color", "dependencies": [0], "argument": "white"}, {"operation": "relate", "dependencies": [1], "argument": "_,on,s (722335)"}, {"operation": "query", "dependencies": [2], "argument": "name"}], "entailed": [], "equivalent": ["07333408"], "question": "What is on the white wall?", "imageId": "2375429", "isBalanced": true, "groups": {"global": "", "local": "14-wall_on,s"}, "answer": "pipe", "semanticStr": "select: wall (722332)->filter color: white [0]->relate: _,on,s (722335) [1]->query: name [2]", "annotations": {"answer": {"0": "722335"}, "question": {"4:6": "722332"}, "fullAnswer": {"1": "722335", "5": "722332"}}, "types": {"detailed": "relS", "semantic": "rel", "structural": "query"}, "fullAnswer": "The pipe is on the wall."},
"07333405": {"semantic": [{"operation": "select", "dependencies": [], "argument": "pipe (722335)"}, {"operation": "verify color", "dependencies": [0], "argument": "red"}], "entailed": ["07333406"], "equivalent": ["07333405"], "question": "Is that pipe red?", "imageId": "2375429", "isBalanced": true, "groups": {"global": null, "local": "06-pipe_red"}, "answer": "no", "semanticStr": "select: pipe (722335)->verify color: red [0]", "annotations": {"answer": {}, "question": {"2": "722335"}, "fullAnswer": {"2": "722335"}}, "types": {"detailed": "verifyAttrC", "semantic": "attr", "structural": "verify"}, "fullAnswer": "No, the pipe is white."},
"15736264": {"semantic": [{"operation": "select", "dependencies": [], "argument": "clock (746851)"}, {"operation": "filter height", "dependencies": [0], "argument": "tall"}, {"operation": "choose size", "dependencies": [1], "argument": "large|small"}], "entailed": ["15736259", "15736258", "15736267", "15736253", "15736252", "15736251", "15736257", "15736256", "15736255", "15736254", "15736291", "15736249"], "equivalent": ["15736264"], "question": "Is the tall clock small or large?", "imageId": "2368326", "isBalanced": true, "groups": {"global": "size", "local": "10c-clock_size"}, "answer": "large", "semanticStr": "select: clock (746851)->filter height: tall [0]->choose size: large|small [1]", "annotations": {"answer": {}, "question": {"2:4": "746851"}, "fullAnswer": {"1": "746851"}}, "types": {"detailed": "chooseAttr", "semantic": "attr", "structural": "choose"}, "fullAnswer": "The clock is large."}}
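The question entries above can be picked apart with a few helpers. This is a sketch based only on the sample entries shown: the regular expression assumes that object references look like `name (id)` with numeric ids, and the helper names are hypothetical. In these samples the first word of the final operation coincides with `types.structural`; whether that holds for the logical and compare types is not checked here.

```python
import re

# Matches arguments such as 'sky (2486325)' or '_,on,s (722335)'.
ARG_WITH_ID = re.compile(r"^(?P<text>.*?)\s*\((?P<ids>[\d,]+)\)$")

def parse_argument(argument):
    """Split 'sky (2486325)' into ('sky', ['2486325']); no ids -> empty list."""
    m = ARG_WITH_ID.match(argument)
    if m is None:
        return argument, []
    return m.group("text"), m.group("ids").split(",")

def referenced_objects(question):
    """Object ids mentioned anywhere in the question's functional program."""
    ids = []
    for step in question["semantic"]:
        ids.extend(parse_argument(step["argument"])[1])
    return ids

def final_operation_keyword(question):
    """First word of the last operation, e.g. 'verify' for 'verify color'."""
    return question["semantic"][-1]["operation"].split()[0]

# Entry "07333408" from the sample above, reduced to the fields used here.
question = {
    "semantic": [
        {"operation": "select", "dependencies": [], "argument": "wall (722332)"},
        {"operation": "filter color", "dependencies": [0], "argument": "white"},
        {"operation": "relate", "dependencies": [1], "argument": "_,on,s (722335)"},
        {"operation": "query", "dependencies": [2], "argument": "name"},
    ],
    "types": {"detailed": "relS", "semantic": "rel", "structural": "query"},
}
print(referenced_objects(question), final_operation_keyword(question))
```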
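As for submitting to the official leaderboard, you can submit via code, which is a much better experience than the submit button: the button mostly stays greyed out and the submission status never changes.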