An autonomous driving simulator based on Nerf with three features: instance-aware, modular, and realistic.
For current autonomous driving algorithms, training on corner cases is key to pushing past their performance bottlenecks.
Existing autonomous driving simulation methods such as CARLA, AADS, and GeoSim each have their own limitations.
Recently, Neural Scene Graph (NSG) decomposes dynamic scenes into learned scene graphs and learns latent representations for category-level objects. However, its multi-plane-based representation for background modeling cannot synthesize images under large viewpoint changes.
The input to the system consists of a set of RGB images $\{\mathcal{I}_i\}^N$, sensor poses $\{\mathcal{T}_i\}^N$ (calculated from IMU/GPS signals), and object tracklets, including 3D bounding boxes $\{\mathcal{B}_{ij}\}^{N\times M}$, categories $\{\mathrm{type}_{ij}\}^{N\times M}$, and instance IDs $\{\mathrm{idx}_{ij}\}^{N\times M}$. $N$ is the number of input frames and $M$ is the number of tracked instances $\{\mathcal{O}_j\}^M$ across the whole sequence. A set of depth maps $\{\mathcal{D}_i\}^N$ and semantic segmentation masks is optional.
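To make the input specification concrete, here is a minimal sketch of a per-frame container. The class and field names are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FrameInputs:
    """Hypothetical container for one input frame i (names are illustrative)."""
    image: np.ndarray          # RGB image I_i, shape (H, W, 3)
    sensor_pose: np.ndarray    # camera-to-world pose T_i, shape (4, 4), from IMU/GPS
    boxes: np.ndarray          # 3D bounding boxes B_ij for M instances, shape (M, 7)
    categories: list           # type_ij: category label per tracked instance
    instance_ids: list         # idx_ij: persistent instance ID per tracked instance
    depth: np.ndarray = None     # optional depth map D_i, shape (H, W)
    semantic: np.ndarray = None  # optional semantic mask, shape (H, W)

frame = FrameInputs(
    image=np.zeros((2, 2, 3)),
    sensor_pose=np.eye(4),
    boxes=np.zeros((1, 7)),
    categories=["car"],
    instance_ids=[0],
)
```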
Architectures: It supports various NeRF backbones, which can be roughly categorized into two hyper-classes: MLP-based methods, or grid-based methods and this paper gives a formal exposition of grid-based methods(Instant-ngp).
Foreground Nodes: Similar to NSG, it exploits latent codes to encode
instance features and shared category-level decoders to encode class-wise priors.
It uses the standard volume rendering process to render pixel-wise properties:
$$\hat{\mathbf{c}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{c}_i+(1-\mathrm{accum})\cdot\mathbf{c}_{\mathrm{sky}},\qquad T_i=\exp\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)$$
$$\hat{d}(\mathbf{r})=\sum_{P_i}T_i\alpha_i t_i+(1-\mathrm{accum})\cdot\mathrm{inf}$$
$$\hat{\mathbf{s}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{s}_i+(1-\mathrm{accum})\cdot\mathbf{s}_{\mathrm{sky}},$$
where $P_i\in\mathrm{sorted}(\{P_i^{\mathrm{bg,\,obj}}\})$, $\alpha_i=1-\exp(-\sigma_i\delta_i)$, $\delta_i=t_{i+1}-t_i$, $\mathrm{accum}=\sum_{P_i}T_i\alpha_i$, $\mathbf{c}_{\mathrm{sky}}$ is the color rendered by the sky model, $\mathrm{inf}$ is the upper-bound distance, and $\mathbf{s}_{\mathrm{sky}}$ is the one-hot semantic logits of the sky category.
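The compositing equations above can be sketched in a few lines of NumPy. This is an illustrative single-ray implementation, assuming the merged background/foreground samples are already sorted by depth; all function and argument names are my own:

```python
import numpy as np

def composite_ray(sigmas, deltas, colors, ts, sem_logits, c_sky, s_sky, t_inf):
    """Volume-render one ray from depth-sorted samples (background + foreground).

    alpha_i = 1 - exp(-sigma_i * delta_i), T_i = exp(-sum_{k<i} sigma_k * delta_k);
    any mass not absorbed along the ray (1 - accum) is attributed to the sky model.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # per-sample opacity
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))  # T_i
    weights = trans * alphas                                 # T_i * alpha_i
    accum = weights.sum()                                    # total opacity along the ray
    c_hat = (weights[:, None] * colors).sum(0) + (1 - accum) * c_sky
    d_hat = (weights * ts).sum() + (1 - accum) * t_inf
    s_hat = (weights[:, None] * sem_logits).sum(0) + (1 - accum) * s_sky
    return c_hat, d_hat, s_hat, accum
```

With zero density everywhere, the ray falls through entirely to the sky model; with a fully opaque first sample, the first sample's color dominates.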
For semantic segmentation, we assign a one-hot vector to every object:
$$\mathbf{s}_k^{\mathrm{obj}\text{-}j}[l]=\begin{cases}\sigma_k^{\mathrm{obj}\text{-}j} & \text{if } l \text{ is the category of } j\text{'s instance}\\ 0 & \text{otherwise},\end{cases}\qquad \text{for each category } l.$$
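A minimal sketch of this per-object one-hot assignment: the object's density is placed at its category slot and all other slots are zero (the function name is illustrative):

```python
import numpy as np

def object_semantic_logits(sigma, category_index, num_categories):
    """Semantic logits for one foreground sample of object j: the sample's
    density sigma at the object's category slot, zero everywhere else."""
    s = np.zeros(num_categories)
    s[category_index] = sigma
    return s
```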
Sky Model
However, blending the sky color $\mathbf{c}_{\mathrm{sky}}$ with the background and foreground rendering (Eq. 4) leads to potential inconsistency. Therefore, we introduce a BCE (binary cross-entropy) semantic regularization to alleviate this issue:
$$\mathcal{L}_{\mathrm{sky}}=\mathrm{BCE}(1-\mathrm{accum},\,\mathcal{S}_{\mathrm{sky}}).$$
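A NumPy sketch of this regularizer, assuming `accum` is the per-ray accumulated opacity and `sky_mask` is the binary sky label $\mathcal{S}_{\mathrm{sky}}$ (names are mine):

```python
import numpy as np

def sky_loss(accum, sky_mask, eps=1e-6):
    """BCE between (1 - accum) and the binary sky mask S_sky: rays labeled sky
    should carry zero accumulated opacity, non-sky rays full opacity."""
    p = np.clip(1.0 - accum, eps, 1.0 - eps)  # predicted "sky probability"
    return -np.mean(sky_mask * np.log(p) + (1 - sky_mask) * np.log(1 - p))
```

A sky ray with near-zero accumulated opacity incurs almost no loss, while a sky ray fully absorbed by geometry is penalized heavily.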
Resolving Conflict Samples:
Because background and foreground sampling are performed independently, background samples may fall inside a foreground bounding box, so foreground geometry can be incorrectly explained as background.
This ambiguity is not observed in NSG [21], which samples only a few points at ray-plane intersections and is therefore unlikely to produce many truncated background samples.
To address this issue, we devise a regularization term that penalizes the density sum of truncated background samples, suppressing their influence during rendering:
$$\mathcal{L}_{\mathrm{accum}}=\sum_{P_i^{(\mathrm{tr})}}\sigma_i,$$
where $\{P_i^{(\mathrm{tr})}\}$ denotes the truncated background samples.
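An illustrative implementation of this regularizer. For simplicity it uses a 1D interval test along the ray as a stand-in for the full 3D point-in-box check; all names and the interval representation are assumptions:

```python
import numpy as np

def truncated_accum_loss(bg_ts, bg_sigmas, box_intervals):
    """L_accum: sum the background densities at samples whose ray depth falls
    inside any foreground box interval [t_enter, t_exit], pushing the
    background field toward zero density inside object boxes."""
    mask = np.zeros_like(bg_ts, dtype=bool)
    for t_enter, t_exit in box_intervals:
        mask |= (bg_ts >= t_enter) & (bg_ts <= t_exit)  # truncated samples
    return float(bg_sigmas[mask].sum())
```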
$$\mathcal{L}=\lambda_1\mathcal{L}_{\mathrm{color}}+\lambda_2\mathcal{L}_{\mathrm{depth}}+\lambda_3\mathcal{L}_{\mathrm{sem}}+\lambda_4\mathcal{L}_{\mathrm{sky}}+\lambda_5\mathcal{L}_{\mathrm{accum}},$$
where $\mathcal{L}_{\mathrm{depth}}$ follows Depth-supervised NeRF and $\mathcal{L}_{\mathrm{sem}}$ follows Semantic-NeRF.
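The total objective is a plain weighted sum of the five terms; a one-line sketch (names are mine):

```python
def total_loss(losses, weights):
    """L = sum_i lambda_i * L_i over (color, depth, sem, sky, accum)."""
    return sum(w * l for w, l in zip(weights, losses))
```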
Photorealistic Rendering
Dataset: KITTI+VKITTI
Instance-wise Editing
The blessing of modular design
Ablation Results
Unlike prior works [26 SUDS, 15 Panoptic Neural Fields, 21 NSG] that evaluate on a short sequence of 90 images, we use the full sequence from the dataset for all evaluations.