RESEARCH DIRECTION

Self-Supervised Representation Learning

Humans and animals acquire common sense from observation, not annotation. We study self-supervised objectives, in both predictive and joint-embedding formulations, that let systems learn multi-level representations of the world directly from unlabeled streams. These objectives predict in representation space rather than pixel space, and at multiple time horizons simultaneously.

Prediction in representation space

Pixel-perfect prediction wastes capacity on irrelevant detail — the flutter of every leaf. Joint-embedding predictive architectures sidestep this by predicting abstract representations of future states rather than the states themselves, with energy-based formulations handling the irreducible uncertainty. The model learns what is predictable and represents the rest as latent variables, not noise to be painted.
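To make the idea concrete, here is a minimal sketch of an energy defined in representation space, not a description of any particular published model. The linear encoder and predictor, the sizes `D_IN`, `D_REP`, `D_LAT`, and the helper names `encode`, `energy`, and `free_energy` are all hypothetical stand-ins for learned deep networks. Details that practical systems rely on, such as a separate target encoder or safeguards against representation collapse, are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: raw observation, representation, latent variable.
D_IN, D_REP, D_LAT = 16, 4, 2

# Hypothetical linear weights standing in for trained networks.
W_enc = rng.normal(size=(D_REP, D_IN)) / np.sqrt(D_IN)
W_pred = rng.normal(size=(D_REP, D_REP + D_LAT)) / np.sqrt(D_REP + D_LAT)


def encode(x):
    """Map a raw observation to an abstract representation."""
    return np.tanh(W_enc @ x)


def energy(x_now, x_future, z):
    """Squared error between predicted and actual future representation.

    The comparison happens in representation space; x_future is never
    reconstructed pixel by pixel.
    """
    s_now = encode(x_now)
    s_future = encode(x_future)
    s_pred = W_pred @ np.concatenate([s_now, z])
    return float(np.sum((s_pred - s_future) ** 2))


def free_energy(x_now, x_future, candidates):
    """Lowest energy over candidate latents z.

    The latent absorbs what cannot be predicted from x_now alone.
    """
    return min(energy(x_now, x_future, z) for z in candidates)
```

In this sketch, a low free energy means some value of the latent makes the observed future compatible with the present, which is how the latent variable carries the uncertainty instead of the prediction having to render it.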

Hierarchies of abstraction

Reasoning at a single timescale is a pathology: a long-horizon problem becomes intractable when every step must be planned at the finest resolution. We build representation stacks in which each level abstracts and predicts over a longer horizon than the one below, with millisecond dynamics at the bottom and task and goal structure at the top, so that planning can decompose long-horizon problems into short-horizon subgoals natively.
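One way to picture such a stack is the sketch below, in which each level summarizes a fixed number of steps from the level beneath it. Mean pooling is used here only as a placeholder for a learned abstraction, and the names `build_hierarchy` and `horizon_seconds`, the stride of 4, and the 1 ms base step are illustrative assumptions rather than values from this work.

```python
import numpy as np


def build_hierarchy(states, stride=4, levels=3):
    """Return one array per level; a row at level k covers stride**k base steps.

    Mean pooling is a placeholder for a learned abstraction function.
    """
    out = [np.asarray(states, dtype=float)]
    for _ in range(1, levels):
        prev = out[-1]
        n = (prev.shape[0] // stride) * stride  # drop the ragged tail
        pooled = prev[:n].reshape(-1, stride, prev.shape[1]).mean(axis=1)
        out.append(pooled)
    return out


def horizon_seconds(level, base_dt=0.001, stride=4):
    """Time span covered by one step at the given level (assumed 1 ms base step)."""
    return base_dt * stride ** level
```

With these assumed settings, a sequence of 64 base states yields levels of 64, 16, and 4 rows, so a single step at the top level stands for 16 steps at the bottom. A planner could pick a target at the top level and hand the level below a short-horizon subgoal.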

Labels as scaffolding, not foundation

Supervised objectives remain useful for evaluation and steering, but a system whose representations depend on labeled data inherits the coverage limits of its annotators. Our foundations are trained on prediction; supervision is applied sparingly, at the top, where it is cheapest and most meaningful.
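A common way to apply supervision only at the top is a linear probe: the pretrained encoder stays frozen and a small head is fit on a handful of labels. The sketch below assumes a random matrix `W_frozen` as a stand-in for a pretrained encoder and uses ridge regression for the head. The names `features`, `fit_probe`, and `predict` are hypothetical, and this illustrates the general technique rather than the procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder weights (stand-in for a pretrained network).
W_frozen = rng.normal(size=(8, 32)) / np.sqrt(32)


def features(x):
    """Frozen representation; these weights are never updated by labels."""
    return np.tanh(x @ W_frozen.T)


def fit_probe(x_labeled, y_labeled, ridge=1e-3):
    """Fit a linear head on frozen features with ridge regression."""
    f = features(x_labeled)
    a = f.T @ f + ridge * np.eye(f.shape[1])
    return np.linalg.solve(a, f.T @ y_labeled)


def predict(x, w):
    """Apply the linear head to the frozen features of x."""
    return features(x) @ w
```

Only the head weights returned by `fit_probe` depend on labels, so the coverage limits of the annotations touch a few parameters at the top and leave the representation untouched.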

WORKING PRINCIPLES

How we hold this work to account.

Abstraction over pixels

Predict the representation, not the rendering.

Multiple horizons

Every level of the hierarchy owns its own timescale.

Observation is the curriculum

The world supplies more supervision than any dataset.
