Worth re-reading: √ (open-source code is available, and it matches the direction I want to research)


0.1 Sentence-by-sentence translation

We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map.
The contributions of this work are three-fold.

First, we develop a new large-scale visual localization method targeted for indoor environments.

The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders.

Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization.

Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario.

Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.

0.2 Summary

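  • Pose estimation uses dense matching (local features are the more common choice)
  • Pose verification uses virtual view synthesis to cope with occlusion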

1. Introduction


Autonomous navigation inside buildings is a key ability of robotic intelligent systems [24, 39]. Successful navigation requires both to localize a robot and to determine a path to its goal.
在建筑物内自主导航是机器人智能系统的一项关键能力 [24, 39]。成功的导航既需要定位机器人,也需要确定通往目标的路径。

One approach to solving the localization problem is to build a 3D map of the building and then use a camera to estimate the current position and orientation of the robot (Figure 1).
解决定位问题的一种方法是绘制建筑物的三维地图,然后使用摄像头估算机器人的当前位置和方向(图 1)。

Imagine also the benefit of an intelligent indoor navigation system that helps you find your way, for example, at Chicago airport, Tokyo Metropolitan station or the CVPR conference center. Besides intelligent systems, the visual localization problem is also highly relevant for any type of Mixed Reality application, including Augmented Reality [16, 44, 72].
想象一下智能室内导航系统的好处,例如在芝加哥机场、东京大都会车站或 CVPR 会议中心,它可以帮助您找到方向。除智能系统外,视觉定位问题还与任何类型的混合现实应用(包括增强现实)高度相关 [16, 44, 72]。


Due to the availability of datasets, e.g., obtained from Flickr [38] or captured from autonomous vehicles [19, 43], large-scale localization in urban environments has been an active field of research [6, 9, 14, 15, 19, 20, 27, 29, 34, 38, 44, 53–57, 65–67, 75, 79, 80].
由于数据集的可用性,例如从 Flickr [38] 或自动驾驶车辆 [19, 43] 获取的数据集,城市环境中的大规模定位已成为一个活跃的研究领域 [6, 9, 14, 15, 19, 20, 27, 29, 34, 38, 44, 53-57, 65-67, 75, 79, 80]。

In contrast, indoor localization [11, 12, 39, 58, 59, 64, 69, 74] has received less attention in the last years.
相比之下,室内定位 [11, 12, 39, 58, 59, 64, 69, 74] 在过去几年受到的关注较少。

At the same time, indoor localization is, in many ways, a harder problem than urban localization: 1) Due to the short distance to the scene geometry, even small changes in viewpoint lead to large changes in image appearance.
与此同时,室内定位在许多方面比城市定位更难:1) 由于与场景几何形状的距离很短,即使视角的微小变化也会导致图像外观的巨大变化。

For the same reason, occluders such as humans or chairs often have a stronger impact compared to urban scenes.

Thus, indoor localization approaches have to handle significantly larger changes in appearance between a query and reference images.

2) Large parts of indoor scenes are textureless and textured areas are typically rather small.
2) 室内场景的大部分区域都没有纹理,而且纹理区域通常很小。

As a result, feature matches are often clustered in small regions of the images, resulting in unstable pose estimates [29].
因此,特征匹配往往集中在图像的小区域,导致姿态估计不稳定 [29]。

3) To make matters worse, buildings are often highly symmetric with many repetitive elements, both on large (similar corridors, rooms, etc.) and small (similar chairs, tables, doors, etc.) scale.
3) 更糟糕的是,建筑物通常高度对称,有许多重复的元素,无论是大的(相似的走廊、房间等)还是小的(相似的椅子、桌子、门等)都是如此。

While structural ambiguities also cause problems in urban environments, they often only occur in larger scenes [9, 54, 67].
虽然结构模糊也会在城市环境中造成问题,但它们往往只出现在较大的场景中 [9, 54, 67]。

4) The appearance of indoor scenes changes considerably over the course of a day due to the complex illumination conditions (indirect light through windows and active illumination from lamps).
4) 由于光照条件复杂(窗户的间接光照和灯具的主动光照),室内场景的外观在一天中会发生很大变化。

5) Indoor scenes are often highly dynamic over time as furniture and personal effects are moved through the environment. In contrast, the overall appearance of building facades does not change too much over time.
5) 随着家具和个人物品在环境中的移动,室内场景通常会随着时间的推移而高度动态化。相比之下,建筑物外墙的整体外观不会随时间发生太大变化。


This paper addresses these difficulties inherent to indoor visual localization by proposing a new localization method.

Our approach starts with an image retrieval step, using a compact image representation [6] that scales to large scenes.

Given a shortlist of potentially relevant database images, we apply two progressively more discriminative geometric verification steps: (i) We use dense matching of CNN descriptors that capture spatial configurations of higher-level structures (rather than individual local features) to obtain the correspondences required for camera pose estimation. (ii) We then apply a novel pose verification step based on virtual view synthesis that can accurately verify whether the query image depicts the same place by dense pixel-level matching, again not relying on sparse local features.


Historically, the datasets used to evaluate indoor visual localization were restricted to small, often room-scale, scenes.

Driven by the interest in semantic scene understanding [10, 23, 78] and enabled by scalable reconstruction techniques [28, 47, 48], large-scale indoor datasets covering multiple rooms or even whole buildings are becoming available [10, 17, 23, 64, 74, 76–78].
随着人们对语义场景理解的兴趣[10, 23, 78]以及可扩展重建技术的发展[28, 47, 48],覆盖多个房间甚至整栋建筑的大规模室内数据集开始出现[10, 17, 23, 64, 74, 76-78]。

However, most of these datasets focus on reconstruction [76,77] and semantic scene understanding [10, 17, 23, 78] and are not suitable for localization.
然而,这些数据集大多侧重于重建[76,77]和语义场景理解[10, 17, 23, 78],并不适用于定位。

To address this issue, we create a new dataset for indoor localization that, in contrast to other existing indoor localization datasets [10, 26, 64], has two important properties.
为了解决这个问题,我们创建了一个新的室内定位数据集,与其他现有的室内定位数据集[10, 26, 64]相比,该数据集具有两个重要特性。

First, the dataset is large-scale, capturing two university buildings.

Second, the query images are acquired using a smartphone at a time months apart from the date of capture of the reference 3D model.
其次,查询图像是使用智能手机获取的,获取时间与参考 3D 模型的获取时间相差数月。

As a result, the query images and the reference 3D model often contain large changes in scene appearance due to the different layout of furniture, occluders (people), and illumination, representing a realistic and challenging indoor localization scenario.
因此,由于家具、遮挡物(人)和光照的布局不同,查询图像和参考 3D 模型的场景外观往往会有很大的变化,这代表了一种现实而又具有挑战性的室内定位场景。


Contributions. Our contributions are three-fold.

First, we develop a novel visual localization approach suitable for large-scale indoor environments.

The key novelty of our approach lies in carefully introducing dense feature extraction and matching in a sequence of progressively stricter verification steps.

To the best of our knowledge, the present work is the first to clearly demonstrate the benefit of dense data association for indoor localization.

Second, we create a new dataset suitably designed for large-scale indoor localization that contains large variation in appearance between queries and the 3D database due to large viewpoint changes, moving furniture, occluders or changing illumination.

The query images are taken at a different time from the reference database, using a handheld device, and at different moments of the day, to capture enough variability, bridging the gap to realistic usage scenarios.

The code and data are publicly available on the project page [1].

Third, the proposed method shows a solid improvement over existing state-of-the-art results, with an absolute improvement of 17–20% in the percentage of correctly localized queries within a 0.25–0.5 m error, which is of high importance for indoor localization.
第三,与现有的最先进的结果相比,所提出的方法有了明显的改进,在0.25 - 0.5 m的误差范围内,正确定位查询的百分比绝对提高了17 - 20%,这对室内定位非常重要。

2. Related work

We next review previous work on visual localization.

Paragraph 1 (image-retrieval-based localization)

Image retrieval based localization. Visual localization in large-scale urban environments is often approached as an image retrieval problem.

The location of a given query image is predicted by transferring the geotag of the most similar image retrieved from a geotagged database [6, 9, 18, 35, 54, 66, 67].

This approach scales to entire cities thanks to compact image descriptors and efficient indexing techniques [7, 8, 22, 31, 33, 49, 63, 70] and can be further improved by spatial re-ranking [51], informative feature selection [21, 22] or feature weighting [27, 32, 54, 67].

Most of the above methods are based on image representations using sparsely sampled local invariant features.

While these representations have been very successful, outdoor image-based localization has recently also been approached using densely sampled local descriptors [66] or (densely extracted) descriptors based on convolutional neural networks [6, 35, 40, 75].

However, the main shortcoming of all the above methods is that they output only an approximate location of the query, not an exact 6DoF pose.


Visual localization using 3D maps.

Another approach is to directly obtain the 6DoF camera pose with respect to a pre-built 3D map.
另一种方法是根据预先构建的三维地图直接获取 6DoF 摄像机姿态。

The map is usually composed of a 3D point cloud constructed via Structure-from-Motion (SfM) [2] where each 3D point is associated with one or more local feature descriptors.

The query pose is then obtained by feature matching and solving a Perspective-n-Point problem (PnP) [14, 15, 20, 29, 34, 38, 53, 55].
然后通过特征匹配和解决透视-点问题(PnP)来获取查询姿势[14, 15, 20, 29, 34, 38, 53, 55]。

Alternatively, pose estimation can be formulated as a learning problem, where the goal is to train a regressor from the input RGB(D) space to camera pose parameters [11, 34, 59, 73]. While promising, scaling these methods to large-scale datasets is still an open challenge.
另外,姿势估计也可以表述为一个学习问题,目标是训练一个从输入 RGB(D)空间到相机姿势参数的回归器 [11、34、59、73]。虽然这些方法前景广阔,但将其推广到大规模数据集仍是一项公开挑战。

Paragraph 3 (indoor 3D maps)

Indoor 3D maps. Indoor scene datasets [50, 52, 62, 68] have been introduced for tasks such as scene recognition, classification, and object retrieval.

With the increased availability of laser range scanners and time-of-flight (ToF) sensors, several datasets include depth data besides RGB images [5, 10, 23, 26, 36, 60, 78] and some of these datasets also provide reference camera poses registered into the 3D point cloud [10, 26, 78], though their focus is not on localization.

Datasets focused specifically on indoor localization [59, 64, 69] have so far captured fairly small spaces such as a single room (or a single floor at largest) and have been constructed from densely-captured sequences of RGBD images.

More recent datasets [17, 76] provide larger scale (multi-floor) indoor 3D maps containing RGBD images registered to a global floor map.

However, they are designed for object retrieval, 3D reconstruction, or training deep-learning architectures.

Most importantly, they do not contain query images taken from viewpoints far from database images, which are necessary for evaluating visual localization.


To address the shortcomings of the above datasets for large-scale indoor visual localization, we introduce a new dataset that includes query images captured at a different time from the database, taken from a wide range of viewpoints, with a considerably larger 3D database distributed across multiple floors of multiple buildings.

Furthermore, our dataset contains various difficult situations for visual localization, e.g., textureless and highly symmetric office scenes, repetitive tiles, and repetitive objects that confuse the existing visual localization methods designed for outdoor scenes. The newly collected dataset is described next.

3. The InLoc dataset for visual localization 用于视觉定位的InLoc数据集


Our dataset is composed of a database of RGBD images geometrically registered to the floor maps augmented with a separate set of RGB query images taken by hand-held devices to make it suitable for the task of indoor localization (Figure 2). The provided query images are annotated with manually verified ground-truth 6DoF camera poses (reference poses) in the global coordinate system of the 3D map.


Database. The base indoor RGBD dataset [76] consists of 277 RGBD panoramic images obtained from scanning two buildings at the Washington University in St. Louis with a Faro 3D scanner.
数据库基础室内 RGBD 数据集 [76] 由 277 幅 RGBD 全景图像组成,这些图像是用 Faro 3D 扫描仪扫描圣路易斯华盛顿大学的两座建筑后获得的。

Each RGBD panorama has about 40M 3D points in color.
每张 RGBD 全景图像都有大约 4000 万个彩色 3D 点。

The base images are divided into five scenes: DUC1, DUC2, CSE3, CSE4, and CSE5, representing five floors of the mentioned buildings, and are geometrically registered to a known floor plan [76].
基础图像分为五个场景:DUC1、DUC2、CSE3、CSE4 和 CSE5 分别代表上述建筑的五个楼层,并与已知的平面图进行几何注册[76]。

The scenes are scanned sparsely on purpose, to cover a larger area with a small number of scans to reduce the required manual work, as well as due to the long operating times of the high-end scanner used. The area per scan varies between 23.5 and 185.8 m².
由于使用的高端扫描仪工作时间较长,为了以较少的扫描次数覆盖较大的面积,减少所需的人工工作,我们特意对场景进行了稀疏扫描。每次扫描的面积在 23.5 至 185.8 平方米之间

This inherently leads to critical view changes between query and database images when compared with other existing datasets [64, 69, 74].
与其他现有数据集[64、69、74]相比,这必然会导致查询图像和数据库图像之间的重要视图变化。


For creating an image database suitable for indoor visual localization evaluation, a set of perspective images is generated by following the best practices from outdoor visual localization [19, 66, 79].

We obtain 36 perspective RGBD images from each panorama by extracting standard perspective views (60° FoV) with a sampling stride of 30° in yaw and ±30° in pitch directions, resulting in 10K perspective images in total (Table 1).
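The view count above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes that "±30° in pitch" means the three pitch angles -30°, 0° and +30°, which is the reading that reproduces 36 views per panorama (12 yaw angles × 3 pitch angles). The function and constant names are made up for illustration.

```python
# Sketch: enumerate the perspective view directions cut from one panorama.
# Assumption: the pitch angles are -30, 0 and +30 degrees (gives 12 x 3 = 36).
YAW_STRIDE_DEG = 30          # from the text
PITCHES_DEG = (-30, 0, 30)   # assumed interpretation of "±30° in pitch"
NUM_PANORAMAS = 277          # from the text


def view_directions(yaw_stride=YAW_STRIDE_DEG, pitches=PITCHES_DEG):
    """Return the (yaw, pitch) pairs, in degrees, sampled from one panorama."""
    return [(yaw, pitch)
            for pitch in pitches
            for yaw in range(0, 360, yaw_stride)]


views = view_directions()
views_per_panorama = len(views)
total_views = NUM_PANORAMAS * views_per_panorama
print(views_per_panorama, total_views)
```

Under these assumptions the total is 277 × 36 = 9,972 images, which the text rounds to 10K.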

Our database contains significant challenges, such as repetitive patterns (stairs, pillars), frequently appearing building structures (doors, windows), furniture changing position, people moving across the scene, and textureless and highly symmetric areas (walls, floors, corridors, classrooms, open spaces).


Query images. We captured 356 photos using a smartphone camera (iPhone 7), distributed only across two floors, DUC1 and DUC2.
查询图片。我们使用智能手机摄像头(iPhone 7)拍摄了 356 张照片,仅分布在 DUC1 和 DUC2 两个楼层。

The other three floors in the database are not represented in the query images, and play the role of confusers at search time, contributing to the building-scale localization scenario.

Note that these query photos are taken at different times of the day, to capture the variety of occluders and layouts (e.g., people, furniture) as well as illumination changes.


Reference pose generation. For all query photos, we estimate 6DoF reference camera poses w.r.t. the 3D map. Each query camera reference pose is computed as follows:

(i) Selection of the visually most similar database images.

For each query, we manually select one panorama location which is visually most similar to the query image using the perspective images generated from the panorama.

(ii) Automatic matching of query images to selected database images.

We match the query and perspective images by using affine covariant features [45] and nearest-neighbor search followed by Lowe’s ratio test [42].
我们使用仿射协变特征[45]和最近邻搜索,然后使用Lowe’s ratio检验[42]来匹配查询图像和透视图像。

(iii) Computing the query camera pose and visually verifying the reprojection.

All the panoramas (and perspective images) are already registered to the floor plan and have pixel-wise depth information.

Therefore, we compute query pose via P3P-RANSAC [25], followed by bundle adjustment [3], using correspondences between query image points and scene 3D points obtained by feature matching.

We evaluate the obtained poses visually by inspecting the reprojection of edges detected in the corresponding RGB panorama into the query image (see examples in Figure 3).


(iv) Manual matching of difficult queries to selected database images.

Pose estimation from automatic matches often gives inaccurate poses for difficult queries which are, e.g., far from any database image.

Hence, for queries with significant misalignment in reprojected edges, we manually annotate 5 to 20 correspondences between image pixels and 3D points and apply step (iii) on the manual matches.

(v) Quantitative and visual inspection. For all estimated poses, we measure the median reprojection error, computed as the distance of the reprojected 3D database point to the nearest edge pixel detected in the query image, after removing correspondences with gross errors (with distance over 20 pixels) due to, e.g., occlusions.
(v) 定量和目测。对于所有估计姿势,我们测量重投影误差中值,计算方法是重投影三维数据库点到查询图像中检测到的最近边缘像素的距离,然后剔除由于遮挡等原因造成的严重误差(距离超过 20 像素)的对应点。

For query images that have under 5 pixels median reprojection error, we manually inspect the reprojected edges in the query image and finally accept 329 reference poses out of the 356 query images.
对于中位重投影误差低于 5 像素的查询图像,我们会手动检查查询图像中的重投影边缘,最终在 356 张查询图像中接受了 329 个参考姿势。
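Two of the steps above are simple enough to sketch in NumPy: the nearest-neighbor matching with the ratio test from step (ii), and the quantitative check from step (v). This is an illustration rather than the authors' code. The ratio threshold of 0.8 is an assumption (the text gives no value), the function names are hypothetical, and the edge distances are assumed to be already computed. The manual inspection that follows the check is not modeled.

```python
import numpy as np

RATIO = 0.8              # assumed threshold; not stated in the text
GROSS_ERROR_PX = 20.0    # from the text: drop distances over 20 pixels
ACCEPT_MEDIAN_PX = 5.0   # from the text: median must be under 5 pixels


def ratio_test_matches(desc_query, desc_db, ratio=RATIO):
    """Step (ii): nearest-neighbor search followed by the ratio test.

    Returns (query_index, db_index) pairs. desc_db needs at least two rows.
    """
    # Pairwise Euclidean distances, shape (num_query, num_db).
    dists = np.linalg.norm(desc_query[:, None, :] - desc_db[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc_query))
    # Keep a match only if it is clearly better than the runner-up.
    keep = dists[rows, nearest] < ratio * dists[rows, second]
    return [(int(q), int(nearest[q])) for q in rows[keep]]


def median_edge_error(distances_px, gross_threshold=GROSS_ERROR_PX):
    """Step (v): median distance after removing gross errors."""
    d = np.asarray(distances_px, dtype=float)
    kept = d[d <= gross_threshold]
    if kept.size == 0:
        return float("inf")  # nothing reliable left to measure
    return float(np.median(kept))


def passes_quantitative_check(distances_px):
    """True if the pose qualifies for the final manual inspection."""
    return median_edge_error(distances_px) < ACCEPT_MEDIAN_PX
```

For example, `median_edge_error([1, 2, 3, 50])` discards the 50-pixel outlier and returns the median of the remaining three distances.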

4. Indoor visual localization with dense matching and view synthesis 基于密集匹配和视图合成的室内视觉定位

We propose a new method for large-scale indoor visual localization. We address the three main challenges of indoor environments:


(1) Lack of sparse local features. Indoor environments are full of large textureless areas, e.g., walls, ceilings, floors and windows, where sparse feature extraction methods detect very few features.

To overcome this problem, we use multi-scale dense CNN features for both image description and feature matching.

Our features are generic enough to be pre-trained beforehand on (outdoor) scenes, avoiding costly re-training, e.g., as in [11, 34, 73], of the localization machine for each particular environment.


(2) Large image changes. Indoor environments are cluttered with movable objects, e.g., furniture and people, and 3D structures, e.g., pillars and concave bays, causing severe occlusions when viewed from a close distance.

The most similar images obtained by retrieval may therefore be visually very different from a query image.

To overcome this problem, we rely on dense feature matches to collect as much positive evidence as possible.

We employ image descriptors extracted from a convolutional neural network that can match higher-level structures of the scene rather than relying on matching individual local features.

In detail, our pose estimation step performs coarse-to-fine dense feature matching, followed by geometric verification and estimation of the camera pose using P3P-RANSAC.


(3) Self-similarity. Indoor environments are often very self-similar, e.g., due to many symmetric and repetitive elements on a large and small scale (corridors, rooms, tiles, windows, chairs, doors, etc.).
3) 自相似性。室内环境通常具有很强的自相似性,例如,由于存在许多对称和重复的大小元素(走廊、房间、瓷砖、窗户、椅子、门等)。

Existing matching strategies count the positive evidence, i.e., how much of the image (or how many inliers) have been matched, to decide whether two images match.

This is, however, problematic as large textureless areas can be matched well, hence providing strong (incorrect) positive evidence.

To overcome this problem, we propose to count also the negative evidence, i.e., what portion of the image does not match, to decide whether two views are taken from the same location.

To achieve this, we perform explicit pose estimate verification based on view synthesis.

In detail, we compare the query image with a virtual view of the 3D model rendered from the estimated camera pose of the query.
具体来说,我们将查询图像与根据查询图像的相机姿态估计值渲染的 3D 模型虚拟视图进行比较。

This novel approach takes advantage of the high quality of the RGBD image database and incorporates both the positive and negative evidence by counting matching and non-matching pixels across the entire query image.
这种新颖的方法利用了 RGBD 图像数据库的高质量,通过计算整个查询图像中匹配和不匹配的像素,同时纳入了正反两方面的证据。

As shown by our experiments, this approach is orthogonal to the choice of local descriptors.

The proposed verification by view synthesis consistently shows a significant improvement regardless of the choice of features used for estimating the pose.
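The idea of counting both matching and non-matching pixels can be sketched as follows. This is a simplified stand-in, not the paper's verification: this section does not give the similarity measure, so the sketch compares grayscale intensities with a fixed tolerance. The function name, the `tol` parameter and the `valid_mask` argument (pixels where the rendered view has data) are all assumptions.

```python
import numpy as np


def verification_score(query, rendered, valid_mask, tol=0.1):
    """Fraction of valid pixels where the query agrees with the rendered view.

    Matching pixels raise the score (positive evidence) and non-matching
    pixels lower it (negative evidence), because both are counted in the
    denominator.
    """
    query = np.asarray(query, dtype=float)
    rendered = np.asarray(rendered, dtype=float)
    valid = np.asarray(valid_mask, dtype=bool)
    n_valid = np.count_nonzero(valid)
    if n_valid == 0:
        return 0.0  # the rendering covers none of the query image
    agree = np.abs(query - rendered) <= tol
    n_match = np.count_nonzero(agree & valid)
    return n_match / n_valid
```

Candidate poses could then be re-ranked by this score in descending order. A score that only counted `n_match` would reward large textureless regions that happen to agree, which is the failure mode the text describes.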


The pipeline of InLoc has the following three steps.

Given a query image, (1) we obtain a set of candidate images by finding the N best matching images from the reference image database registered to the map.

(2) For these N retrieved candidate images, we compute the query poses using the associated 3D information that is stored together with the database images.

(3) Finally, we re-rank the computed camera poses based on verification by view synthesis.

The three steps are detailed next.
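As a reading aid, the three steps can be written as a small skeleton. The three callables `retrieve`, `estimate_pose` and `verify` are hypothetical placeholders for the steps in Sections 4.1 to 4.3 and do not correspond to any released API. The sketch also assumes that pose estimation may fail for a candidate and return `None`.

```python
def localize(query, retrieve, estimate_pose, verify):
    """Skeleton of the three-step pipeline with pluggable steps."""
    # (1) Candidate retrieval: the N best matching database images.
    candidates = retrieve(query)

    # (2) Pose estimation for each candidate; failures are dropped.
    poses = [estimate_pose(query, candidate) for candidate in candidates]
    poses = [pose for pose in poses if pose is not None]
    if not poses:
        return None

    # (3) Re-rank by view-synthesis verification and keep the best pose.
    return max(poses, key=lambda pose: verify(query, pose))
```

Passing the steps as functions keeps the skeleton independent of the choice of descriptor or matcher, which mirrors the remark that the verification step is orthogonal to the features used for pose estimation.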

4.1. Candidate pose retrieval 候选位姿检索

As demonstrated by existing work [6, 35, 66], aggregating feature descriptors computed densely on a regular grid mitigates issues such as a lack of repeatability of local features detected on textureless scenes, large illumination changes, and a lack of discriminability of image description, dominated by features from repetitive structures (burstiness).

As already mentioned in section 1, these problems are also occurring in large-scale indoor localization, which motivates our choice of using an image descriptor based on dense feature aggregation.

Both query and database images are described by NetVLAD [6] (but other variants could also be used), normalized L2 distances of the descriptors are computed, and the poses of the N best matching images from the database are chosen as candidate poses.
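The retrieval step can be sketched in NumPy once the global descriptors exist. This sketch assumes that "normalized L2 distance" means the Euclidean distance between L2-normalized descriptors, and it treats the descriptors as given arrays; computing NetVLAD itself is out of scope here. The function name and the default `n` are illustrative.

```python
import numpy as np


def retrieve_candidates(query_desc, db_descs, n=10):
    """Return the indices of the n database images closest to the query.

    query_desc: shape (d,); db_descs: shape (num_images, d).
    """
    # L2-normalize so that the distance depends only on direction.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)
    # Smallest distance first; the poses of these images are the candidates.
    return np.argsort(dists)[:n]
```

For a database of about 10K images a brute-force search like this is cheap. Larger databases would typically use an approximate nearest-neighbor index instead.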

In section 5, we compare our approach with the state-of-the-art image descriptors based on local feature detection and show benefits of our approach for indoor localization.

4.2. Pose estimation using dense matching 基于密集匹配的姿态估计


A severe problem in indoor localization is that standard geometric verification based on local feature detection [51,54] does not work on textureless or self-repetitive scenes, such as corridors, where robots (and also humans) often get lost.

Motivated by the improvements in candidate pose retrieval with dense feature aggregation (Section 4.1), we use features densely extracted on a regular grid for verifying and re-ranking the candidate images by feature matching and pose estimation.

A possible approach would be to match DenseSIFT [41] followed by RANSAC-based verification.

Instead of tailoring DenseSIFT description parameters (patch sizes, strides, scales) to match across images with significant viewpoint changes, we use an image representation extracted by a convolutional neural network (VGG-16 [61]) as a set of multi-scale features extracted on a regular grid that describes higher-level information with a larger receptive field (patch size).


We first find geometrically consistent sets of correspondences using the coarser conv5 layer containing high-level
information. Then we refine the correspondence by search

