One encoder. Every view of the road

A single shared encoder is trained jointly on depth, pose, 3D scene flow and four kinds of segmentation. Once frozen, its latent space predicts steering angle more accurately than an ImageNet-pretrained encoder. First-author paper with MIT CSAIL, accepted to ICRA 2026.

venue: ICRA 2026 · accepted
collaboration: Capgemini × MIT CSAIL
role: First author
domain: Autonomous driving
year: 2024

Pipeline diagram: input frames feed one shared encoder trained on every perception task at once (segmentation, depth, pose, 3D scene flow and motion masks). Its weights are then frozen and reused, unchanged, to predict the steering angle.

Grid of Cityscapes scenes: one encoder, many outputs. Panoptic, instance and semantic segmentation, depth, 3D scene flow and motion masks for each scene, all read from a single shared latent space.

01 OVERVIEW

Labelled data for estimating steering angle straight from a camera is scarce: train an encoder on that one task alone and it overfits. A human driver doesn't work that way: we read depth, motion, layout, and who's about to move, all at once. So we trained a single encoder to do the same thing.

Together with MIT CSAIL (Wei Xiao, Tsun-Hsuan Wang, Ramin Hasani, Daniela Rus) and the Capgemini Engineering Hybrid Intelligence team, I built one Swin-Transformer encoder trained jointly on depth, camera pose, 3D scene flow, and semantic, instance, panoptic and motion segmentation. A multi-scale pose network sharpens the depth signal, and knowledge distillation from several pretrained backbones keeps the joint training stable.

The shared encoder holds its own against task-specific specialists: it is on par with OneFormer for panoptic segmentation and competitive on depth across KITTI and Cityscapes. The headline result: with the encoder frozen, its latent space predicts steering more accurately than both a fine-tuned copy of the same encoder and an ImageNet-pretrained one. The paper was accepted to ICRA 2026.

02 WHAT I BUILT

Unified multi-task encoder

One Swin-Transformer encoder trained jointly across depth, pose, 3D scene flow and four segmentation tasks — a single latent space that carries every cue a driver uses.
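
The shared-encoder idea can be sketched in a few lines of PyTorch. This is only an illustration under stated assumptions: a tiny convolutional stack stands in for the Swin-Transformer encoder, only two of the tasks (depth and semantic segmentation) are shown, and the class name `SharedEncoderModel`, the layer sizes and the loss weights are hypothetical rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoderModel(nn.Module):
    """One encoder, several task heads (illustrative stand-in, not the paper's model)."""

    def __init__(self, num_classes: int = 19, width: int = 16):
        super().__init__()
        # Small conv stack standing in for the Swin-Transformer encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Every head reads the same latent feature map.
        self.heads = nn.ModuleDict({
            "depth": nn.Conv2d(width, 1, kernel_size=1),
            "semantic": nn.Conv2d(width, num_classes, kernel_size=1),
        })

    def forward(self, images):
        latent = self.encoder(images)
        return {name: head(latent) for name, head in self.heads.items()}


torch.manual_seed(0)
model = SharedEncoderModel()
images = torch.randn(2, 3, 64, 64)  # random stand-in for camera frames
outputs = model(images)

# Dummy targets at the latent resolution (two stride-2 convs: 64 -> 16).
depth_target = torch.rand(2, 1, 16, 16)
semantic_target = torch.randint(0, 19, (2, 16, 16))

# Joint objective: a weighted sum of per-task losses (weights are assumptions).
depth_loss = F.l1_loss(outputs["depth"], depth_target)
semantic_loss = F.cross_entropy(outputs["semantic"], semantic_target)
loss = 1.0 * depth_loss + 0.5 * semantic_loss
loss.backward()  # gradients from both tasks flow into the shared encoder
```

Because both losses are summed before the backward pass, the encoder weights receive gradient from every task in the same step, which is what makes the latent space shared rather than task-specific.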

Multi-scale pose network

A pose branch that reads features at several scales, which in turn sharpens the self-supervised depth the rest of the system depends on.
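
One way a pose branch could read features at several scales is sketched below. The page does not describe the actual architecture, so everything here is an assumption: the class name `MultiScalePoseHead`, the global-average pooling, and the 6-value output (3 for translation, 3 for an axis-angle rotation) are illustrative choices only.

```python
import torch
import torch.nn as nn


class MultiScalePoseHead(nn.Module):
    """Regress a 6-DoF relative pose from feature maps at several scales (hypothetical)."""

    def __init__(self, channels_per_scale):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse each map to one vector
        self.fc = nn.Linear(sum(channels_per_scale), 6)

    def forward(self, feature_maps):
        # Pool every scale to (batch, channels), then concatenate along channels.
        pooled = [self.pool(f).flatten(1) for f in feature_maps]
        pose = self.fc(torch.cat(pooled, dim=1))
        return pose[:, :3], pose[:, 3:]  # translation, axis-angle rotation


torch.manual_seed(0)
# Random stand-ins for encoder features of a frame pair at three resolutions.
features = [
    torch.randn(2, 32, 32, 32),
    torch.randn(2, 64, 16, 16),
    torch.randn(2, 128, 8, 8),
]
pose_head = MultiScalePoseHead([32, 64, 128])
translation, rotation = pose_head(features)
```

In a self-supervised depth setup, a pose estimate like this is typically used to warp one frame into its neighbour, so a better pose directly tightens the photometric signal that trains depth.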

Distillation from multi-backbone teachers

Knowledge distilled from several pretrained backbones to anchor and stabilise the joint training, so no single task collapses the shared representation.
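
A minimal sketch of distilling from several teachers at once, assuming a simple feature-matching loss. The exact distillation objective is not given on this page; the class name `MultiTeacherDistillation`, the per-teacher linear projections and the mean-squared-error criterion are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTeacherDistillation(nn.Module):
    """Average feature-matching loss against several frozen teachers (hypothetical)."""

    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        # One projection per teacher maps student features into that teacher's space.
        self.projections = nn.ModuleList(
            [nn.Linear(student_dim, dim) for dim in teacher_dims]
        )

    def forward(self, student_feat, teacher_feats):
        losses = [
            F.mse_loss(proj(student_feat), teacher.detach())  # teachers get no gradient
            for proj, teacher in zip(self.projections, teacher_feats)
        ]
        return torch.stack(losses).mean()


torch.manual_seed(0)
student_feat = torch.randn(4, 32, requires_grad=True)     # student encoder output
teacher_feats = [torch.randn(4, 64), torch.randn(4, 128)]  # two pretrained teachers

distill = MultiTeacherDistillation(student_dim=32, teacher_dims=[64, 128])
distill_loss = distill(student_feat, teacher_feats)
distill_loss.backward()
```

Added to the task losses, a term like this pulls the shared features towards representations that are already known to be useful, which is one plausible reading of how distillation anchors the joint training.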

Frozen latent space for steering

Left frozen, the encoder's dense representation predicts steering angle better than both its own fine-tuned version and an ImageNet-pretrained encoder. This is the transfer result at the heart of the paper.
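
The frozen-encoder setup can be sketched as follows: the encoder's parameters are excluded from optimisation and only a small steering head is trained on top of its latent features. The encoder below is a small hypothetical stand-in, and the head, learning rate and random data are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the pretrained multi-task encoder (hypothetical layers).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze: no gradients for the encoder, and eval mode for any norm/dropout layers.
encoder.requires_grad_(False)
encoder.eval()

# Only the steering head is trainable.
steering_head = nn.Linear(16, 1)
optimizer = torch.optim.Adam(steering_head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 64, 64)  # random stand-in for camera frames
angles = torch.randn(8, 1)          # random stand-in for steering labels

encoder_before = [p.clone() for p in encoder.parameters()]

for _ in range(5):
    with torch.no_grad():           # the latent is computed without a graph
        latent = encoder(images)
    loss = F.mse_loss(steering_head(latent), angles)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The encoder weights are bit-for-bit identical after training the head.
encoder_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(encoder_before, encoder.parameters())
)
```

Passing only `steering_head.parameters()` to the optimiser is what keeps the encoder untouched; the fine-tuned baseline would instead leave the encoder trainable and hand its parameters to the optimiser as well.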

03 STACK

Perception tasks

depth · pose · 3D scene flow · panoptic seg · motion seg · steering

Method

Swin Transformer · multi-task learning · knowledge distillation · frozen-encoder transfer

Data · tools

KITTI · Cityscapes · PyTorch

04 REFERENCES

↗ paper (arXiv) · ↗ project page · ↗ code