SMART: Scalable Multi-agent Real-time Simulation via Next-token Prediction (2024)

Wei Wu, Tsinghua University, wuwei@senseauto.com
Xiaoxin Feng*, SenseTime Research, fengxiaoxin@senseauto.com
Ziyan Gao*, SenseTime Research, gaoziyan@senseauto.com
Yuheng Kan, SenseTime Research, kanyuheng@senseauto.com
* Equal contribution

Abstract

Data-driven autonomous driving motion generation tasks are frequently impacted by limited dataset sizes and the domain gap between datasets, which precludes their extensive application in real-world scenarios. To address this issue, we introduce SMART, a novel autonomous driving motion generation paradigm that models vectorized map and agent trajectory data as discrete sequence tokens. These tokens are processed by a decoder-only transformer architecture trained on a next-token prediction task over spatial-temporal series. This GPT-style method allows the model to learn the motion distribution of real driving scenarios. SMART achieves state-of-the-art performance on most metrics of the generative Sim Agents challenge, ranking 1st on the Waymo Open Motion Dataset (WOMD) leaderboard while demonstrating remarkable inference speed. Moreover, SMART is a generative model in the autonomous driving motion domain that exhibits zero-shot generalization: using only the NuPlan dataset for training and WOMD for validation, SMART achieves a competitive score of 0.71 on the Sim Agents challenge. Lastly, we have collected over 1 billion motion tokens from multiple datasets to validate the model's scalability. These results suggest that SMART initially emulates two important properties, scalability and zero-shot generalization, and preliminarily meets the needs of large-scale real-time simulation applications. We have released all code to promote the exploration of models for motion generation in the autonomous driving field.

1 Introduction

In the context of autonomous driving, leveraging vectorized maps and vehicle trajectory data facilitates various motion generation tasks, including motion planning hu2023planning ; cheng2023rethinking ; huang2023gameformer ; huang2023dtpp ; cui2021lookout , motion prediction wilson2023argoverse ; ettinger2021large ; song2020pip , and Sim Agents gulino2024waymax . Previous research cui2019multimodal ; chen2023interaction ; nayakanti2023wayformer has predominantly employed encoder networks to represent driving scenes and decoder networks to generate multi-modal motions. These generated motions are then directly regressed to continuous trajectory distributions using Gaussian chai2019multipath or Laplace zhou2023qcnext mixture loss functions. While this framework demonstrates strong performance in prediction tasks that prioritize regression accuracy, it often underperforms in motion generative tasks that emphasize the safety and reasonableness of driving behavior, such as planning caesar2021NuPlan or Sim Agents montali2024waymo . The primary reasons for this underperformance are as follows: First, the framework does not represent future interactions between the motions of different agents, leading to inconsistent scene-level forecasting. Second, the model generates multi-modal motion by initializing multiple intention queries in the decoder, which is typically limited by GPU memory, resulting in a fixed number of motion modalities. Consequently, it is uncertain whether the generated modalities sufficiently represent the diversity of future behaviors. Thirdly, these models struggle to generalize across different datasets, requiring new data collection for training in new urban environments or maps.

The advent of autoregressive large language models (LLMs) floridi2020gpt ; touvron2023llama has ushered in a new era in artificial intelligence. Drawing inspiration from this, some studies in the driving motion generation domain philion2023trajeglish ; seff2023motionlm have tokenized agent trajectories into discrete motion tokens and employed a Next Token Prediction (NTP) task based on cross-entropy loss for autoregression. These models continue to utilize an encoder-decoder architecture, encoding continuous vectorized map and historical trajectory data with an encoder, and decoding discrete tokens solely in the decoder module. Compared to continuous distribution regression methods, the autoregressive paradigm of NTP has the following advantages: the model adopts step-by-step next token prediction, allowing it to model interactions between agents' motions at each time step, and the number of modalities is not limited, leading to better diversity in generative tasks.

However, existing NTP-based motion models still fail to address the aforementioned issues of generalizability and scalability, which have a critical impact on industrial applications. Generalizability means achieving satisfactory results across diverse datasets through zero-shot and few-shot learning, while scalability involves improving model performance as dataset size or model parameters increase, following scaling laws defined by hoffmann2022training . This shortfall is due to two main factors: First, current model architectures lack generalizability under the constraints of limited data scale. Due to the high cost of acquiring extensive driving data, open-source datasets typically cover only a few hundred hours of driving in specific urban areas, with significant domain gaps caused by perceptual and regional differences. Second, unlike tasks involving the serialization of a single dimension, motion generation requires the serialization of both the temporal dimension of trajectories and the spatial interactions between maps and agents. To tackle these challenges, this paper introduces the SMART model: Scalable Multi-Agent Real-Time Motion Generation via Next-token Prediction. The model incorporates a tokenizer for map data and proposes an autoregressive prediction task for the next road token prediction to enhance the model’s spatial comprehension. Subsequently, a GPT-style approach is adopted, tokenizing agent trajectories across the entire time series to establish a decoder-only transformer model. The decoder-only transformer allows SMART to compute the next token for the upcoming frame at the current moment during inference, eliminating the need to re-encode historical motion tokens with each inference, which significantly improves inference efficiency for real-time interactive autonomous driving simulation.

In summary, our contributions to the community include: (1) We propose a novel framework for motion generation, incorporating a tokenization scheme for both vectorized roads and agent trajectories and utilizing a decoder-only transformer trained on the next token prediction task. This approach offers new insights into the design of motion generation algorithms for autonomous driving. (2) In the field of driving motion generation, we have pioneered a focus on the model's zero-shot generalizability across different datasets. Notably, the model trained solely on the NuPlan dataset performed well on the WOMD test dataset, despite the lack of overlap between the map areas of these two datasets. We also empirically validate SMART's scalability, showing that it emulates the appealing properties of large foundation models. (3) SMART achieves state-of-the-art performance across most metrics in the generative Sim Agents challenge, ranking 1st on the WOMD leaderboard (https://waymo.com/open/challenges/2024/sim-agents/). Furthermore, SMART's single-frame inference time is within 15 ms, meeting the real-time requirements for interactive simulation in autonomous driving.

2 Related work

2.1 Properties of auto-regressive large models

Scalability and zero-shot generalization

Power-law scaling laws kaplan2020scaling ; floridi2020gpt ; radford2019language mathematically describe the relationship between the growth of model parameters, dataset sizes, computational resources, and the performance improvements of machine learning models, providing several distinct benefits. Firstly, they enable the extrapolation of a larger model’s performance by scaling up model size, data size, and computational cost. Secondly, the scaling laws have demonstrated a consistent and non-saturating increase in performance, corroborating their sustained advantage in enhancing model capabilities. Zero-shot generation refers to the ability of models to generate predicted motions for time series from unseen datasets. Previous work orozco2020zero ; jin2022domain on zero-shot generation typically involves training on a single time series dataset and testing on a different dataset. In this study, we utilize the NuPlan dataset for training SMART models and the WOMD validation dataset for testing. Existing methods in the autonomous driving field sima2023drivelm ; tian2024drivevlm often rely on LLMs or VLMs to assist in decision-making and planning to enhance generalizability and interpretability. However, no studies have attempted to directly construct a foundational model for the driving motion field to validate scalability and zero-shot generalizability.

2.2 Tokenizer in continuous domains

Language models touvron2023llama ; touvron2023llama2 rely on Byte Pair Encoding or WordPiece algorithms for text tokenization. Visual generation models yu2024language ; yu2022scaling based on language models also necessitate the encoding of 2D images into 1D token sequences. Early endeavors such as VQ-VAE van2017neural have demonstrated the ability to represent images as discrete tokens, although the reconstruction quality was relatively moderate. In the driving motion domain, MotionLM seff2023motionlm used a simple uniform quantization of axis-aligned deltas between consecutive waypoints of agent trajectories.

2.3 Driving motion generation

Our work builds heavily on recent advancements in driving motion generation. A comprehensive range of generative models has been applied to this problem, including continuous motion distribution regression salzmann2020trajectron++ ; amirloo2022latentformer ; suo2021trafficsim , diffusion models zhong2023guided ; jiang2023motiondiffuser , and discrete autoregressive models philion2023trajeglish ; seff2023motionlm . MotionDiffuser jiang2023motiondiffuser is a diffusion-based representation method for modeling the joint distribution of future trajectories across multiple agents, leveraging a simple predictor design and PCA compression for efficient, top-performing multi-agent motion prediction. While these diffusion-based models produce multi-modal future trajectories of individual agents, they only capture the marginal distributions of possible agent movements and do not model interactions among agents’ future motions. Typical distribution regression models use parametric continuous distributions such as Gaussian shi2023mtr++ or Laplace zhou2023qcnext to model the future motion distribution. A limitation of these models is the uncertainty of whether the Gaussian or Laplace mixture distribution is flexible enough to represent the distribution over future states. Additionally, to generate multi-modal future motions, these models often need to incorporate motion goal candidates gu2021densetnt or learnable latent embeddings varadarajan2022multipath++ as multi-modal queries in the decoder module, resulting in significant memory usage and increased inference time. MotionLM seff2023motionlm treats multi-agent motion prediction in autonomous vehicles as a language modeling task, generating interactive trajectories through a simplified autoregressive process without requiring complex optimizations and latent anchor embeddings. On this basis, Trajeglish philion2023trajeglish targets multi-agent offline closed-loop simulation.

3 Method

In this section, we introduce SMART, an autoregressive generative model for dynamic driving scenarios. While both language and agent motions are sequential, they differ in their representation: natural language consists of words from a finite vocabulary, whereas agent motions are continuous real-valued data. This distinction necessitates the unique design outlined in Sec. 3.1 for the agent motion and road vector tokenizers, including the construction of the vocabulary and the tokenization of motion sequences. Sec. 3.2 provides a comprehensive description of the model's architecture. Sec. 3.3 elaborates on the training tasks designed for the proposed model to learn the distribution of motion tokens within the temporal sequence and the distribution of road tokens within the spatial sequence.

3.1 Tokenization

Figure 1: Tokenization of agent motion and road vectors: (a) discretization of the ground-truth trajectory, (b) the sampled agent motion token vocabulary, and (c) parallel tokenization of road vectors.

Agent Motion tokenization

To apply discrete sequence modeling in continuous domains, prior works typically follow one of two approaches: either use a pre-trained tokenizer, such as VQ-VAE van2017neural or VQGAN esser2021taming, to encode continuous features into discrete tokens, or normalize continuous features and divide continuous values into discrete slots at equal intervals ansari2024chronos ; seff2023motionlm. For the former approach, establishing a latent vocabulary often requires a large amount of raw data to train the tokenizer; otherwise, the tokenizer itself will be biased towards the pre-training dataset. Since our work aims to enable the model to generalize effectively when trained on a small number of data samples, SMART opts to discretize explicit trajectory and map features. Specifically, similar to philion2023trajeglish, we segment the continuous trajectories of all agents in the dataset into trajectory sets at fixed time intervals of $t = 0.5\,\mathrm{s}$. Then, we cluster the trajectory sets using the k-disks algorithm. As shown in Figure 1(b), the sampled trajectories serve as our final agent motion token vocabulary $V_a$.
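To make the vocabulary construction concrete, below is a minimal sketch of a greedy k-disks style sampling over 0.5 s trajectory segments; the disk radius, the array shapes, and the use of segment endpoints as the distance measure are illustrative assumptions rather than the exact procedure used in the paper.

```python
import numpy as np

def build_motion_vocab(segments: np.ndarray, radius: float = 0.3,
                       max_tokens: int = 512) -> np.ndarray:
    """Greedy k-disks sampling over 0.5 s trajectory segments.

    segments: (N, S, 2) x/y waypoints, each segment expressed in the frame of
    its own first pose so that segments are comparable across agents.
    Returns a (V, S, 2) vocabulary of representative segments, V <= max_tokens.
    """
    rng = np.random.default_rng(0)
    remaining = segments.copy()
    vocab = []
    while len(vocab) < max_tokens and len(remaining) > 0:
        # Pick a random remaining segment as a new token.
        center = remaining[rng.integers(len(remaining))]
        vocab.append(center)
        # Drop every segment whose endpoint falls inside the disk around the new token.
        dist = np.linalg.norm(remaining[:, -1] - center[-1], axis=-1)
        remaining = remaining[dist > radius]
    return np.stack(vocab)
```

A vocabulary of roughly 512 tokens per agent class is consistent with the motion vocabulary sizes reported in Table 5.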

As shown in Figure 1(a), the blue box represents the tokens obtained after discretizing the ground-truth trajectory. At every 0.5-second interval, a search is conducted within the token vocabulary for candidate tokens, from which an appropriate (closest) token is selected to represent the current moment. Note that to prevent matching errors during the tokenization of the agent motion sequence, we implement a rolling matching approach over the entire continuous motion sentence in a given period $T$. This means that the token for the next time step is matched with reference to the position of the currently matched token, rather than to the ground-truth position. However, because the transformer decoder must perform sequential inference step by step, this approach inevitably leads to out-of-distribution issues caused by compounding errors rawte2023survey. In the field of autonomous driving in particular, these accumulated errors may result in collisions and off-map events 9636795. To address this issue, we introduce noise into the tokenization process so that the model can simulate distribution shifts during training. Specifically, we perturb the currently matched token by selecting one from the top-k tokens closest to the ground-truth token in the vocabulary. Then, in the next time step, we match the motion token based on the perturbed vehicle state. This data augmentation method allows the model to effectively handle distribution shifts and accumulated errors, thereby enhancing robustness in generative tasks. Finally, the agent motion tokens are represented as $A \in \mathbb{R}^{N_A \times N_T \times F_A}$, where $N_A$ denotes the total number of agents, $N_T$ the number of time steps, and $F_A$ the feature size, containing coordinates, heading, and shape.
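A sketch of the rolling matching with top-k perturbation described above; heading alignment and agent-frame transforms are omitted, and matching on token endpoints is an illustrative simplification.

```python
import numpy as np

def tokenize_trajectory(traj: np.ndarray, vocab: np.ndarray, k: int = 5,
                        seed: int = 0) -> list[int]:
    """Roll through a ground-truth trajectory (T, 2) sampled at 0.5 s and pick one
    vocabulary token per step, matching against the rolled-out state rather than
    the ground-truth state so that training already sees its own drift."""
    rng = np.random.default_rng(seed)
    state = traj[0].astype(float)        # current rolled-out position
    token_ids = []
    for t in range(1, len(traj)):
        target = traj[t] - state         # displacement we would like to realize
        dists = np.linalg.norm(vocab[:, -1] - target, axis=-1)
        top_k = np.argsort(dists)[:k]    # k closest candidate tokens
        choice = int(rng.choice(top_k))  # noise injection: sample among the top-k
        token_ids.append(choice)
        state = state + vocab[choice, -1]  # advance with the perturbed token
    return token_ids
```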

Road vector tokenization

To enhance the model's generalization capabilities, we apply a similar tokenization process to road vectors as we did for agent motion. Each road vector is a directed lane segment with features including start and end positions, length, turn direction, and other semantics from the dataset. To obtain fine-grained inputs for the road network, all road vectors are segmented into tokens spanning no more than 5 meters in length. Unlike the motion sequence, the tokenization of the road sentence has no time-series dependency. As shown in Figure 1(c), it is performed in parallel, directly tokenizing all the original road vector segments. The road vector tokens are represented as $R \in \mathbb{R}^{N_R \times F_R}$, where $N_R$ denotes the total number of road vectors and $F_R$ represents the token features.
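A minimal sketch of splitting lane polylines into road vector tokens of at most roughly 5 meters; exact interpolation at the cut points and the additional semantic attributes (turn direction, lane type) are omitted for brevity.

```python
import numpy as np

def split_polyline(points: np.ndarray, max_len: float = 5.0) -> list[np.ndarray]:
    """Split a lane centerline (N, 2) into directed segments no longer than
    roughly max_len meters; each (start, end) pair becomes one road token."""
    tokens, start, acc = [], points[0], 0.0
    for a, b in zip(points[:-1], points[1:]):
        acc += float(np.linalg.norm(b - a))
        if acc >= max_len:
            tokens.append(np.stack([start, b]))   # close the current token at b
            start, acc = b, 0.0
    if acc > 0.0:
        tokens.append(np.stack([start, points[-1]]))  # trailing shorter piece
    return tokens
```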

3.2 Model Architecture

Figure 2 illustrates the simple but expressive model architecture of SMART. The model comprises an encoder for road map encoding and a motion decoder that predicts a category distribution based on motion token embeddings.

Figure 2: Overview of the SMART architecture, consisting of the RoadNet road token encoder and the MotionNet factorized agent motion decoder.

RoadNet: road token encoder

We employ multi-head self-attention (MHSA) to model the relationships among road tokens, after which the updated road token encodings assist motion token decoding. For the $i^{th}$ road token, we derive a query from its embedding $r_i$ and let it attend to the neighboring tokens $r_j \in R_i$:

$$r_{i'} = \text{MHSA}\left(q(r_i),\; k(r_j, \text{RPE}_{ij}),\; v(r_j, \text{RPE}_{ij})\right), \quad j \in R_i \qquad (1)$$

where $R_i$ denotes the neighbor set of the $i^{th}$ road token. To incorporate spatial awareness for map encoding, we generate the $j^{th}$ key/value vector from the concatenation of $r_j$ and the relative positional embedding $\text{RPE}_{ij}$ cui2023gorela.
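The layer below sketches Eq. (1) in PyTorch: queries come from the road token embedding alone, while keys and values are produced from the embedding concatenated with a relative positional embedding. For brevity the pairwise RPE is collapsed to one embedding per key token, the attention-radius mask is omitted, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoadTokenSelfAttention(nn.Module):
    """RoadNet layer in the spirit of Eq. (1)."""

    def __init__(self, dim: int = 128, num_heads: int = 8, rpe_dim: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim + rpe_dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, r: torch.Tensor, rpe: torch.Tensor) -> torch.Tensor:
        # r:   (B, N, dim)      road token embeddings
        # rpe: (B, N, rpe_dim)  relative positional embedding attached to each key token
        q = self.q_proj(r)
        k, v = self.kv_proj(torch.cat([r, rpe], dim=-1)).chunk(2, dim=-1)
        out, _ = self.attn(q, k, v)
        return out
```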

MotionNet: factorized agent motion decoder

Prevailing methods for encoding agents prioritize capturing the temporal dynamics of an agent's movements, followed by the integration of agent-map and agent-agent interactions, as highlighted by shi2022motion. Factorized attention effectively captures detailed agent-map interactions across temporal scales ngiam2021scene. In our work, we leverage a factorized transformer architecture with multi-head cross-attention (MHCA) to decode complex road-agent and agent-agent relationships along the time series. Akin to query-centric methodologies zhou2023query, we utilize relative positional embeddings to differentiate between agents' local coordinate frames, enabling symmetric encoding. Take the $i^{th}$ agent at time step $t$ as an example. As denoted in Eq. 2a, given the query derived from the agent motion token's embedding $e_i^t$, we apply temporal attention, computing the keys and values from the $i^{th}$ agent's token embeddings from time step $t-\tau$ to time step $t-1$ and the corresponding relative positional embeddings.

$$e_{i'} = \text{MHSA}\left(q(e_i^t),\; k(e_i^{t-\tau} + \text{RPE}_i^{t,t-\tau}),\; v(e_i^{t-\tau}, \text{RPE}_i^{t,t-\tau})\right), \quad 0 < \tau < t \qquad (2a)$$
$$e_{i'} = \text{MHCA}\left(q(e_i^t),\; k(r_j, \text{RPE}_{ij}),\; v(r_j, \text{RPE}_{ij})\right), \quad j \in N_i \qquad (2b)$$
$$e_{i'} = \text{MHSA}\left(q(e_i^t),\; k(e_j^t, \text{RPE}_{ij}^t),\; v(e_j^t, \text{RPE}_{ij}^t)\right), \quad j \in N_i \qquad (2c)$$

Likewise, in Eq. 2b and Eq. 2c, the keys and values for agent-map and agent-agent attention are derived from the road tokens $r_j, j \in N_i$ and the agents' motion tokens $e_j^t, j \in N_i$ in the neighborhood, respectively, where the neighbor set $N_i$ is determined by a distance threshold of 50 meters. We stack the temporal, agent-agent, and agent-map attention sequentially as one fusion block and repeat such blocks $K$ times.
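A compact sketch of one fusion block (Eq. 2a-2c): temporal self-attention over each agent's own past tokens, cross-attention to road tokens, then self-attention across agents within the same time step. Relative positional embeddings, the 50 m neighborhood mask, residual connections, and feed-forward layers are all omitted, and the shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One factorized decoder block in the spirit of Eq. (2a)-(2c)."""

    def __init__(self, dim: int = 128, num_heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.agent_map = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.agent_agent = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, e: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # e: (A, T, dim) agent motion token embeddings; r: (1, R, dim) road token embeddings
        A, T, D = e.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=e.device), diagonal=1)
        e, _ = self.temporal(e, e, e, attn_mask=causal)       # (2a) causal over each agent's history
        flat = e.reshape(1, A * T, D)
        flat, _ = self.agent_map(flat, r, r)                  # (2b) attend to nearby road tokens
        e = flat.reshape(A, T, D)
        step = e.transpose(0, 1)                              # (T, A, dim): agents at the same step
        step, _ = self.agent_agent(step, step, step)          # (2c) agent-agent interaction
        return step.transpose(0, 1)
```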

3.3 Spatial-temporal next token prediction

In the training stage, we train SMART to understand the temporal and spatial relationships in the traffic scene. This is achieved with two next token prediction tasks on RoadNet and MotionNet; the model is optimized with the sum of the two tasks' objectives.

Road vector next token prediction

As shown in Figure 2(b), the road vector NTP task targets RoadNet to learn the spatial structure of road vector inputs. Unlike agent motions, road vectors form a graph rather than a sequence, making it challenging to apply next token prediction tasks directly. To address this issue, we extract the original topological information of roads and model the road vector tokens with sequential relationships based on their predecessor-successor connections. As depicted in Figure 2(b), in the pre-training NTP task, the subsequent road vector token is predicted from the preceding road token based on the road topology. This approach requires RoadNet to understand the connectivity and continuity among unordered road vectors.
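As an illustration of how sequential supervision could be derived from an unordered lane graph, the sketch below enumerates predecessor-to-successor chains of road tokens; the function and its interface are hypothetical and not taken from the released code.

```python
def road_token_sequences(successors: dict[int, list[int]], roots: list[int],
                         max_len: int = 32) -> list[list[int]]:
    """Turn a lane graph into training sequences for the road-vector NTP task.

    successors maps a road token id to the ids of tokens that follow it in the
    lane topology; roots are tokens without predecessors. Each returned list is
    an ordered chain on which "predict the next road token" can be supervised.
    """
    sequences = []
    stack = [[r] for r in roots]
    while stack:
        path = stack.pop()
        nxt = successors.get(path[-1], [])
        if not nxt or len(path) >= max_len:
            sequences.append(path)
            continue
        for s in nxt:                      # branch at forks: one sequence per successor
            stack.append(path + [s])
    return sequences
```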

Motion next token prediction

The motion NTP task targets MotionNet to understand not only the temporal dependencies in agents' motions but also the spatial dependencies between agents and the map and among agents. SMART is trained to minimize the cross-entropy between the distribution of the ground-truth token labels and the predicted distribution. Formally, the loss function for a single tokenized motion sentence is given by:

$$\mathrm{loss}(\theta) = -\sum_{t=1}^{T}\sum_{i=1}^{V_a} \mathbb{1}\left(a_i^{t+1} = a_{i^{gt}}^{t+1}\right) \log p_\theta\left(a_i^{t+1} \mid e_i^{1:t}, r_j\right) \qquad (3)$$

where $p_\theta(a_i^{t+1} \mid e_i^{1:t}, r_j)$ denotes the categorical distribution predicted by the model parameterized by $\theta$, $e_i^{1:t}$ is the sequence of historical tokenized agent motion embeddings, $a_i^{t+1} \in A$ is the next agent motion token, and $r_j$ is the tokenized nearby road vector series. Note that SMART performs autoregression via classification torgo1996regression. Opting for a categorical output distribution offers a key advantage: it imposes no restrictions on the structure of the output distribution, allowing the model to learn arbitrary distributions, including multimodal ones. This flexibility is especially valuable for a foundation model, as agent and road tokens from diverse datasets may follow distinct output distribution patterns.
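A minimal PyTorch rendering of Eq. (3), assuming the model has already produced per-step logits over the motion vocabulary; the masking convention for padded agents is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def motion_ntp_loss(logits: torch.Tensor, target_tokens: torch.Tensor,
                    valid_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of Eq. (3).

    logits:        (A, T, V) next-token scores for every agent and time step
    target_tokens: (A, T)    index of the ground-truth motion token at t+1
    valid_mask:    (A, T)    1 where the agent is observed, 0 for padding
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (A*T, V)
        target_tokens.reshape(-1),             # (A*T,)
        reduction="none",
    )
    loss = loss * valid_mask.reshape(-1).float()
    return loss.sum() / valid_mask.float().sum().clamp(min=1)
```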

4 Experiments

To validate the generalizability and scalability of the SMART model, we conducted extensive experiments and trained models at various scales. On the official WOMD Sim Agents Challenge (WOSAC), we employed the SMART 8 million parameter (8M) model, trained exclusively on the WOMD dataset. The SMART 8M model was also used for the generalization experiments and ablation studies. In the scaling law experiments, we integrated additional datasets and trained models at multiple scales. For all experiments, testing used the validation split of WOMD. Detailed hyperparameters of the SMART architecture can be found in Section A.1. In the following sections, Section 4.1 presents the results of rollouts generated by SMART on the WOSAC benchmark montali2024waymo. Evaluations of SMART's generalizability and scalability are detailed in Sections 4.2 and 4.3, respectively. Finally, an ablation analysis of our design choices is conducted in Section 4.4.

4.1 Comparison for motion generation task

Performance comparison

We compare the proposed SMART with existing motion generation approaches, including diffusion models guo2023scenedm, continuous distribution regression models wang2023multiverse ; shi2023mtr++, and next-token autoregressive models philion2023trajeglish. Because the Sim Agents challenge metrics have changed twice, we report the performance of our model on both the WOMD Sim Agents 2023 and 2024 benchmarks montali2024waymo to allow a broader comparison with previous methods. As shown in Table 1 and Table 2, SMART achieves not only the best Realism Meta metric but also high prediction precision. Most of the improvement comes from SMART modeling agent-agent and map-agent interactions significantly better than prior work. For further detailed comparisons, please refer to Appendix A.2.

Table 1: Results on the WOMD Sim Agents 2023 benchmark.

Method | Realism Meta metric ↑ | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑ | minADE ↓
SMART 8M | 0.6587 | 0.4190 | 0.8014 | 0.8523 | 1.7453
Trajeglish philion2023trajeglish | 0.6451 | 0.4166 | 0.7845 | 0.8216 | 1.5712
MVTE wang2023multiverse | 0.6448 | 0.4202 | 0.7666 | 0.8387 | 1.6770
VPD-PRIOR | 0.6315 | 0.4261 | 0.7233 | 0.8330 | 1.3400
QCNeXt zhou2023qcnext | 0.4538 | 0.3109 | 0.5654 | 0.5051 | 1.0830
MultiPath++ varadarajan2022multipath++ | 0.4766 | 0.1792 | 0.6380 | 0.6866 | 2.0517
Table 2: Results on the WOMD Sim Agents 2024 benchmark.

Method | Realism Meta metric ↑ | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑ | minADE ↓
SMART 96M | 0.7564 | 0.4768 | 0.7986 | 0.8618 | 1.5500
SMART 8M | 0.7511 | 0.4445 | 0.8053 | 0.8571 | 1.5435
BehaviorGPT | 0.7473 | 0.4333 | 0.7997 | 0.8593 | 1.4147
GUMP | 0.7431 | 0.4780 | 0.7887 | 0.8359 | 1.6041
MVTE | 0.7302 | 0.4503 | 0.7706 | 0.8381 | 1.6770
VBD | 0.7200 | 0.4169 | 0.7819 | 0.8137 | 1.4743
TrafficBOTv1.5 | 0.6988 | 0.4304 | 0.7114 | 0.8360 | 1.8825

Efficiency comparison

SMART also demonstrates remarkable speed in multi-agent motion generation. Previous encoder-decoder models seff2023motionlm ; shi2023mtr++ suffer from high computational costs, as they require multiple query embeddings in the decoder module to generate multi-modal motions. Benefiting from the decoder-only transformer architecture, SMART only needs to compute the next token for the upcoming frame at the current moment during inference, without re-encoding historical motion tokens. By reusing the token embeddings computed in previous observation horizons, the complexity of the agent motion decoder is reduced to $O(N_A N_T) + O(N_A N_R) + O(N_A^2)$. In contrast, encoder-decoder models like makansi2019overcoming require, besides the computational load of the encoder module, additional computations of $O(N_A^2 N_M) + O(N_A N_M N_R)$ to generate multiple trajectory modalities, where $N_M$ represents the number of modalities. The average single-step inference time of SMART depends on the number of map and agent motion tokens, fluctuating between 5 and 20 ms and averaging under 10 ms, which meets the current needs of interactive real-time online simulation in autonomous driving.
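A schematic of the decoder-only rollout loop implied above; `model.step` is a hypothetical interface that consumes only the newest token per agent plus an embedding cache, which is what removes the re-encoding cost. Greedy decoding is used here for simplicity; Appendix A.1 describes the top-k sampling used for WOSAC.

```python
import torch

@torch.no_grad()
def closed_loop_rollout(model, road_tokens, motion_tokens, horizon: int = 16):
    """motion_tokens: (A, T0) token ids observed so far; returns (A, T0 + horizon) ids.
    Embeddings of earlier steps are reused through `cache` instead of being recomputed."""
    cache = None
    for _ in range(horizon):
        # Hypothetical interface: logits (A, V) for each agent's next 0.5 s token.
        logits, cache = model.step(motion_tokens[:, -1], road_tokens, cache)
        next_token = logits.argmax(dim=-1)                                  # greedy next token, (A,)
        motion_tokens = torch.cat([motion_tokens, next_token.unsqueeze(-1)], dim=-1)
    return motion_tokens
```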

4.2 Generalization

Zero-shot generalization on different dataset

Zero-shot generation is the ability of models to generate motions for time series from different datasets. In this work, we train SMART models on the training data of the NuPlan dataset and test on the WOMD validation dataset. As shown in Table 3, SMART* still achieves good performance on the overall metrics. Because the accuracy of the calibrated ground-truth agent positions and headings differs significantly between datasets, the gap in the agent kinematic metrics is larger, resulting in lower scores. However, SMART* demonstrates excellent generalization on the agent interaction and drivable map metrics. It is worth mentioning that the sizes of the two datasets do not differ greatly, so the SMART model can generalize well when trained on a modest amount of data.

Table 3: Zero-shot generalization results on the WOMD validation set.

Method | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑ | minADE ↓
SMART | 0.4537 | 0.8034 | 0.8514 | 1.5127
SMART* | 0.4161 | 0.7853 | 0.7970 | 2.3041
SMART** | 0.4310 | 0.8087 | 0.8559 | 1.5671

Zero-shot generalization on unseen scenarios

Multiple map scenarios as shown in Figure 3 are present only in the WOMD but not in the NuPlan dataset. Without modifications to the network architecture or tuning parameters, SMART trained only on NuPlan has achieved decent results in these scenarios, substantiating the generalization ability of SMART.

Figure 3: Generated rollouts on WOMD map scenarios that do not appear in the NuPlan dataset.

4.3 Scalability

Prior research kaplan2020scaling ; touvron2023llama has established that scaling up large language models (LLMs) leads to a predictable decrease in test loss $L$. This trend correlates with the parameter count $N$ and the number of training tokens $T$, following a power law:

$$\log(L) = \beta \log(X) + \alpha \qquad (4)$$

where $X$ can be either $N$ or $T$. The exponent $\beta$ reflects the smoothness of the power law, and $L$ denotes the reducible loss normalized by the irreducible loss. The data sources used to validate the scaling laws are detailed in Appendix A.3. Overall, we trained models at four sizes, ranging from 1M to 100M parameters, on a training set containing 2.2M scenarios (about 1B motion tokens under the 0.5 s agent motion tokenization).
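For reference, Eq. (4) can be fit by ordinary least squares in log-log space; the numbers in the example below are placeholders, not the paper's measurements.

```python
import numpy as np

def fit_power_law(x: np.ndarray, loss: np.ndarray) -> tuple[float, float]:
    """Least-squares fit of log(L) = beta * log(X) + alpha (Eq. 4).
    x: parameter counts or training-token counts; loss: reducible test loss."""
    beta, alpha = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(beta), float(alpha)

# Placeholder example (not the paper's measurements):
params = np.array([1e6, 8e6, 36e6, 96e6])
losses = np.array([1.95, 1.40, 1.12, 0.96])
print(fit_power_law(params, losses))
```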

Figure 4: Test loss as a function of model parameters, showing a power-law scaling trend.

Scaling laws with model parameters

We investigate the test loss trend as the model size increases. We assess the final test cross-entropy loss $L$ on a validation set of 100,000 traffic scenarios. The results are plotted in Figure 4, where we observe a clear power-law scaling trend for the loss $L$ as a function of model size $N$. The fitted power law can be expressed as:

$$\log(L) = -0.157 \log(X) + 1.52 \qquad (5)$$

These results verify the strong scalability of SMART and provide valuable insight into how model performance scales with model size.

4.4 Ablation

In this study, we verify the effectiveness of each component of SMART. Results are reported in Table 4. The initial model, denoted as M1, is built on the architecture described in Sec. 3.2 and employs only agent tokenization. The introduction of road vector tokenization in M2, which tokenizes the road vector states into discrete tokens, results in marked improvements over M1 in generalization capability. Comparing models M1 and M2 also reveals that, when training solely on the WOMD dataset, tokenizing road vectors causes a certain reduction in the overall metrics; we speculate that discretized map tokens lose some fine-grained geometric information about roads. M3 incorporates noised agent motion tokenization, designed to address cumulative errors and distribution shifts during inference. This modification leads to enhancements in both the interaction metric and the map-based metric.

Table 4: Ablation study of SMART components (RVT: road vector tokenization; NAT: noised agent tokenization; NRVT as in the text), evaluated for models trained on WOMD and on NuPlan.

Model | RVT | NAT | NRVT | Train on WOMD: kinematics | interactive | map | Train on NuPlan: kinematics | interactive | map
M1 | | | | 0.459 | 0.827 | 0.857 | 0.376 | 0.593 | 0.603
M2 | √ | | | 0.436 | 0.807 | 0.840 | 0.389 | 0.696 | 0.724
M3 | √ | √ | | 0.448 | 0.809 | 0.848 | 0.413 | 0.750 | 0.743
M4 | √ | √ | √ | 0.453 | 0.803 | 0.851 | 0.416 | 0.785 | 0.797

5 Conclusions

In this paper, we have introduced SMART, a novel paradigm for autonomous driving motion generation that leverages vectorized map and agent trajectory data, processed through a decoder-only transformer architecture in a GPT-style framework. We have observed that SMART emulates two critical properties: scalability and zero-shot generalization, which are essential for advancing large models. We believe that our findings and the release of all codes will encourage further exploration and development of models for motion generation in the autonomous driving field, ultimately contributing to more reliable autonomous driving systems.

Limitations

In this work, we primarily focus on the design of the learning paradigm and maintain a relatively simple design for the discrete token vocabulary. We believe that iterating SMART with an advanced tokenizer mentzer2023finite or sampling technique can further improve the performance. Although we have collected training data from multiple datasets, we are still limited by dataset size when validating the model's scalability, restricting us to models with a maximum scale of 100 million parameters. Given the focus of this work on generalization and scaling laws, a large number of hyperparameter ablation experiments remain to be verified, including the time granularity of agent motion tokens and the size of the token vocabulary. As a motion generation model, the ability of SMART to transfer to planning and prediction tasks still needs to be verified, and this is our top priority for future work.

References

  • [1]Elmira Amirloo, Amir Rasouli, Peter Lakner, Mohsen Rohani, and Jun Luo.Latentformer: Multi-agent transformer-based interaction modeling and trajectory prediction.arXiv preprint arXiv:2203.01880, 2022.
  • [2]AbdulFatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, SyamaSundar Rangapuram, SebastianPineda Arango, Shubham Kapoor, etal.Chronos: Learning the language of time series.arXiv preprint arXiv:2403.07815, 2024.
  • [3]Holger Caesar, Juraj Kabzan, KokSeang Tan, WhyeKit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari.nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021.
  • [4]Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov.Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449, 2019.
  • [5]Yongli Chen, Shen Li, Xiaolin Tang, Kai Yang, Dongpu Cao, and Xianke Lin.Interaction-aware decision making for autonomous vehicles.IEEE Transactions on Transportation Electrification, 2023.
  • [6]Jie Cheng, Yingbing Chen, Xiaodong Mei, Bowen Yang, BoLi, and Ming Liu.Rethinking imitation-based planner for autonomous driving.arXiv preprint arXiv:2309.10443, 2023.
  • [7]Alexander Cui, Sergio Casas, Abbas Sadat, Renjie Liao, and Raquel Urtasun.Lookout: Diverse multi-future prediction and planning for self-driving.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16107–16116, 2021.
  • [8]Alexander Cui, Sergio Casas, Kelvin Wong, Simon Suo, and Raquel Urtasun.Gorela: Go relative for viewpoint-invariant motion forecasting.In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7801–7807. IEEE, 2023.
  • [9]Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric.Multimodal trajectory predictions for autonomous driving using deep convolutional networks.In 2019 International Conference on Robotics and Automation (ICRA), pages 2090–2096. IEEE, 2019.
  • [10]Patrick Esser, Robin Rombach, and Bjorn Ommer.Taming transformers for high-resolution image synthesis.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • [11]Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, CharlesR Qi, Yin Zhou, etal.Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9710–9719, 2021.
  • [12]Luciano Floridi and Massimo Chiriatti.Gpt-3: Its nature, scope, limits, and consequences.Minds and Machines, 30:681–694, 2020.
  • [13]Junru Gu, Chen Sun, and Hang Zhao.Densetnt: End-to-end trajectory prediction from dense goal sets.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15303–15312, 2021.
  • [14]Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, etal.Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.Advances in Neural Information Processing Systems, 36, 2024.
  • [15]Zhiming Guo, Xing Gao, Jianlan Zhou, Xinyu Cai, and Botian Shi.Scenedm: Scene-level multi-agent trajectory generation with consistent diffusion models.arXiv preprint arXiv:2311.15736, 2023.
  • [16]Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego deLas Casas, LisaAnne Hendricks, Johannes Welbl, Aidan Clark, etal.Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022.
  • [17]Yihan Hu, Jiazhi Yang, LiChen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, etal.Planning-oriented autonomous driving.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  • [18]Zhiyu Huang, Peter Karkus, Boris Ivanovic, Yuxiao Chen, Marco Pavone, and Chen Lv.Dtpp: Differentiable joint conditional prediction and cost evaluation for tree policy planning in autonomous driving.arXiv preprint arXiv:2310.05885, 2023.
  • [19]Zhiyu Huang, Haochen Liu, and Chen Lv.Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving.arXiv preprint arXiv:2303.05760, 2023.
  • [20]Chiyu Jiang, Andre Cornman, Cheolho Park, Benjamin Sapp, Yin Zhou, Dragomir Anguelov, etal.Motiondiffuser: Controllable multi-agent motion prediction using diffusion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9644–9653, 2023.
  • [21]Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, and Yuyang Wang.Domain adaptation for time series forecasting via attention sharing.In International Conference on Machine Learning, pages 10280–10297. PMLR, 2022.
  • [22]Jared Kaplan, Sam McCandlish, Tom Henighan, TomB Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020.
  • [23]Ilya Loshchilov and Frank Hutter.Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017.
  • [24]Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox.Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7153, 2019.
  • [25]Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen.Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023.
  • [26]Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nicholas Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, etal.The waymo open sim agents challenge.Advances in Neural Information Processing Systems, 36, 2024.
  • [27]Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, KhaledS Refaat, and Benjamin Sapp.Wayformer: Motion forecasting via simple & efficient attention networks.In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2980–2987. IEEE, 2023.
  • [28]Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-TienLewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, etal.Scene transformer: A unified architecture for predicting multiple agent trajectories.arXiv preprint arXiv:2106.08417, 2021.
  • [29]BernardoPérez Orozco and StephenJ Roberts.Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks.arXiv preprint arXiv:2003.12162, 2020.
  • [30]Jonah Philion, XueBin Peng, and Sanja Fidler.Trajeglish: Learning the language of driving scenarios.arXiv preprint arXiv:2312.04535, 2023.
  • [31]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • [32]Vipula Rawte, Amit Sheth, and Amitava Das.A survey of hallucination in large foundation models, 2023.
  • [33]Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone.Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 683–700. Springer, 2020.
  • [34]Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, KhaledS Refaat, Rami Al-Rfou, and Benjamin Sapp.Motionlm: Multi-agent motion forecasting as language modeling.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023.
  • [35]Shaoshuai Shi, LiJiang, Dengxin Dai, and Bernt Schiele.Motion transformer with global intention localization and local movement refinement.Advances in Neural Information Processing Systems, 35:6531–6543, 2022.
  • [36]Shaoshuai Shi, LiJiang, Dengxin Dai, and Bernt Schiele.Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying.arXiv preprint arXiv:2306.17770, 2023.
  • [37]Chonghao Sima, Katrin Renz, Kashyap Chitta, LiChen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li.Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150, 2023.
  • [38]Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, MichaelYu Wang, and Qifeng Chen.Pip: Planning-informed trajectory prediction for autonomous driving.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 598–614. Springer, 2020.
  • [39]Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun.Trafficsim: Learning to simulate realistic multi-agent behaviors.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10400–10409, 2021.
  • [40]Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao.Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024.
  • [41]Luís Torgo and João Gama.Regression by classification.In Advances in Artificial Intelligence: 13th Brazilian Symposium on Artificial Intelligence, SBIA’96 Curitiba, Brazil, October 23–25, 1996 Proceedings 13, pages 51–60. Springer, 1996.
  • [42]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
  • [43]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
  • [44]Aaron Van DenOord, Oriol Vinyals, etal.Neural discrete representation learning.Advances in neural information processing systems, 30, 2017.
  • [45]Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, KhaledS Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, ChiPang Lam, Dragomir Anguelov, etal.Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction.In 2022 International Conference on Robotics and Automation (ICRA), pages 7814–7821. IEEE, 2022.
  • [46]YuWang, Tiebiao Zhao, and Fan Yi.Multiverse transformer: 1st place solution for waymo open sim agents challenge 2023.arXiv preprint arXiv:2306.11868, 2023.
  • [47]Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, JhonyKaesemodel Pontes, etal.Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023.
  • [48]Jiahui Yu, Yuanzhong Xu, JingYu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, BurcuKaragol Ayan, etal.Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  • [49]Lijun Yu, José Lezama, NiteshB. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, AlexanderG. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, DavidA. Ross, and LuJiang.Language model beats diffusion – tokenizer is key to visual generation, 2024.
  • [50]Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, and Marco Pavone.Guided conditional diffusion for controllable traffic simulation.In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3560–3566. IEEE, 2023.
  • [51]Jinyun Zhou, Rui Wang, XuLiu, Yifei Jiang, Shu Jiang, Jiaming Tao, Jinghao Miao, and Shiyu Song.Exploring imitation learning for autonomous driving with feedback synthesizer and differentiable rasterization.In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1450–1457, 2021.
  • [52]Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang.Query-centric trajectory prediction.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17863–17873, 2023.
  • [53]Zikang Zhou, Zihao Wen, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang.Qcnext: A next-generation framework for joint multi-agent trajectory prediction.arXiv preprint arXiv:2306.10508, 2023.

Appendix A Appendix

A.1 Implementation and Simulation Inference

Architecture details

Table 5 summarizes the hyperparameters of the different models used in our implementation. We train a single model to generate the future motion of all three categories (i.e., Vehicle, Pedestrian, Cyclist), with each category having its own motion token vocabulary. The input road token feature contains three types of information: the position of each road token point, the road token direction at each point, and the type of each road token. For the prediction head in each decoder layer, we use a three-layer MLP, and the model weights are not shared across different decoder layers.

Table 5: Hyperparameters of SMART models at different scales.

Module | Hyperparameter | SMART 1M | SMART 8M | SMART 36M | SMART 96M
RoadNet | Number of self-attention layers | 1 | 1 | 1 | 2
RoadNet | Road token embedding dimension | 32 | 128 | 256 | 512
RoadNet | Size of road token vocabulary | 1024 | 1024 | 1024 | 1024
RoadNet | Road token attention radius | 40 | 40 | 40 | 40
MotionNet | Number of temporal attention layers | 1 | 3 | 3 | 4
MotionNet | Number of agent-agent attention layers | 1 | 3 | 3 | 4
MotionNet | Number of map-agent attention layers | 1 | 3 | 3 | 4
MotionNet | Number of attention heads | 4 | 8 | 8 | 8
MotionNet | Dimension of attention head | 8 | 16 | 32 | 64
MotionNet | Feature dimension of agent token embeddings | 32 | 128 | 256 | 512
MotionNet | Size of motion token vocabulary | 512 | 512 | 512 | 2048
SMART | Total parameters | 1.1M | 8.1M | 36.7M | 96.3M

Training details

The simulation model is trained end-to-end for all three agent types using the AdamW optimizer [23]. Both the dropout rate and the weight decay rate are set to 0.1. The learning rate is decayed from 0.0002 to 0 using a cosine annealing scheduler. Training includes all vehicles within a scene. The batch size is set to 4, with a maximum GPU memory usage of 30GB.
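A sketch of the optimization setup implied by the paragraph above (AdamW, weight decay 0.1, cosine decay from 2e-4 to 0); the `total_steps` argument and the helper itself are illustrative, not the released training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_optimization(model: torch.nn.Module, total_steps: int):
    """AdamW with weight decay 0.1; cosine-annealed learning rate from 2e-4 to 0."""
    optimizer = AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0.0)
    return optimizer, scheduler
```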

Inference for WOSAC

The test set comprises 44,920 scenes, and each scene requires running model inference $32 \times T$ times to generate the 32 simulations for a group of agents. During inference, each simulation step produces the categorical distribution over next tokens. There are two options for next token sampling: selecting the maximum-likelihood token, or sampling among the top-k motion tokens with redistributed probability. The first approach, while accurate, tends to yield less varied generations. Conversely, opting for the top-k motion tokens encourages diversity but can compound errors, generating trajectories with unrealistic kinematic motions or even drift. To balance realism and diversity, we use top-5 sampling at every step of the simulation. Videos of rollouts can be found on our project page and in the supplementary materials. For each scenario, the SMART model directly controls all agents within the scene. Since this article focuses on the generalization and scalability of the model, we obtained these results without extensive exploration of scene-specific tricks.
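The top-k sampling rule described above can be written as follows; with k=1 it falls back to maximum-likelihood decoding, while k=5 is the setting used for the WOSAC submissions.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Sample one motion token per agent from the renormalized top-k distribution.
    logits: (A, V) next-token scores; returns (A,) vocabulary ids."""
    probs = torch.softmax(logits, dim=-1)
    top_p, top_idx = torch.topk(probs, k, dim=-1)        # keep the k most likely tokens
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)      # redistribute probability mass
    choice = torch.multinomial(top_p, num_samples=1)     # (A, 1)
    return top_idx.gather(-1, choice).squeeze(-1)
```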

A.2 Detailed comparison in the WOSAC leaderboard

Table 6: Detailed comparison on the WOSAC leaderboard. The first four metric columns are kinematic, the next three interactive, and the following two map-based; higher is better for all except minADE.

Method | Linear speed ↑ | Linear accel. ↑ | Ang. speed ↑ | Ang. accel. ↑ | Dist. to obj. ↑ | Collision ↑ | TTC ↑ | Dist. to road ↑ | Offroad ↑ | minADE ↓
WAYFORMER | 0.202 | 0.144 | 0.248 | 0.312 | 0.192 | 0.449 | 0.766 | 0.379 | 0.305 | 6.823
SBTA-ADIA | 0.317 | 0.174 | 0.478 | 0.463 | 0.265 | 0.337 | 0.770 | 0.557 | 0.483 | 3.611
CAD | 0.346 | 0.252 | 0.432 | 0.311 | 0.33 | 0.311 | 0.789 | 0.637 | 0.539 | 2.314
JOINT-MULTIPATH++ | 0.431 | 0.230 | 0.019 | 0.035 | 0.349 | 0.485 | 0.811 | 0.637 | 0.613 | 2.051
MTR+++ | 0.411 | 0.106 | 0.483 | 0.436 | 0.345 | 0.414 | 0.796 | 0.654 | 0.577 | 1.681
QCNeXt | 0.477 | 0.242 | 0.325 | 0.198 | 0.375 | 0.324 | 0.756 | 0.609 | 0.360 | 1.083
MVTE | 0.442 | 0.221 | 0.535 | 0.481 | 0.382 | 0.450 | 0.832 | 0.664 | 0.640 | 1.677
Trajeglish | 0.450 | 0.192 | 0.538 | 0.485 | 0.387 | 0.922 | 0.836 | 0.659 | 0.886 | 1.571
SMART 8M | 0.363 | 0.296 | 0.423 | 0.564 | 0.376 | 0.963 | 0.832 | 0.659 | 0.936 | 1.54


The Waymo Open Sim Agents Challenge (WOSAC) is a significant initiative aimed at advancing the development and evaluation of simulation agents for autonomous vehicles. This challenge leverages the Waymo Open Motion Dataset (WOMD) to provide high-fidelity object behaviors and shapes produced by a state-of-the-art offboard perception system. Participants are required to simulate scenarios involving up to 128 agents, focusing on closed-loop evaluation to ensure realism in agent behaviors and interactions. The evaluation framework employs various metrics, including kinematic features, interaction-based features, and map-based features, to assess the performance of simulation agents in generating realistic and diverse behaviors that match real-world driving data. WOSAC computes three metrics over nine measurements: kinematic metrics (linear speed, linear acceleration, angular speed, angular acceleration magnitude), object interaction metrics (distance to nearest object, collisions, time-to-collision), and map-based metrics (distance to road edge, road departures).

In the benchmark comparisons presented in Table 6, the SMART 8M method, developed by our team, demonstrates superior performance across multiple metrics, particularly excelling in interactive and safety-related indicators. Notably, SMART 8M achieved the highest scores in angular acceleration, distance to nearest object, collision avoidance, and off-road metrics, underscoring its effectiveness in complex driving scenarios. These results highlight the robustness of SMART 8M in ensuring safety and reliability, indicating its advanced capability in managing dynamic and potentially hazardous traffic conditions more effectively than other evaluated methods. This performance also suggests the potential of the SMART model to be applied to planning tasks.

A.3 Additional ablation studies

Comparison of different tokenizer

In Trajeglish [30], a detailed comparison of various discretized tokenizers is conducted. As introduced in Sec. 3.1 of this paper, we ultimately adopted the k-disks approach for token vocabulary construction. Prior to our work, no studies had attempted to construct a vocabulary for agent motion and road tokens using latent tokenizer methods [44]. Therefore, we drew on the visual domain's VQ-VAE approach to perform latent autoencoding of motion tokens and compare this tokenizer with the method selected in this paper.

Table 7: Comparison of tokenizers, evaluated for models trained on WOMD and on NuPlan.

Tokenizer | Train on WOMD: kinematics | interactive | map | Train on NuPlan: kinematics | interactive | map
VQ-VAE | 0.461 | 0.810 | 0.853 | 0.376 | 0.687 | 0.703
K-disks | 0.453 | 0.803 | 0.851 | 0.416 | 0.785 | 0.797

From the results in Table7, it is evident that VQ-VAE performs better on a single dataset compared to k-disks. Specifically, both methods achieve similar results in interactive and map-based metrics, but VQ-VAE outperforms k-disks in kinematic metrics. The k-disks approach loses fine-grained trajectory information during discretization, whereas VQ-VAE better fits the true distribution of the dataset when reconstructing trajectories. However, when comparing the two methods’ performance in zero-shot generalization, k-disks significantly outperform VQ-VAE. We speculate that during the training of the VQ-VAE tokenizer to construct motion and road token vocabularies, the tokenizer may have already memorized or overfitted to the training dataset. Therefore, to achieve better generalization performance using the VQ-VAE approach, it is essential to pre-train the VQ-VAE tokenizer on a large-scale dataset.

Comparison of SMART models with different scales

For language models, large and diverse datasets are relatively easy to obtain. In contrast, the autonomous driving motion domain lacks a data source of comparable size and diversity. To validate scaling laws on a larger dataset, we integrated data from Waymo, NuPlan, and our proprietary dataset. The proprietary dataset was introduced solely for validating scaling laws. For the WOSAC leaderboard evaluation, we exclusively used the Waymo dataset. For generalization and other ablation experiments, we used the NuPlan and Waymo open-source datasets to facilitate reproducibility of the experiments. Table 8 summarizes the scenario count, scenario duration, and total motion token count for each dataset.

Table 8: Datasets used for the scaling-law experiments.

Dataset | Scene Count | Single Scenario Duration | Total Motion Token Count
NuPlan | 0.30M | 10 s | 0.13B
Waymo | 0.48M | 9 s | 0.18B
Proprietary | 1.50M | 11 s | 0.68B
Total | 2.28M | - | 1B

Table 9: Performance and cost of SMART models at different scales.

Method | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑ | Training time | Average inference time
SMART 1M | 0.417 | 0.761 | 0.828 | 8 hours | 10.30 ms
SMART 8M | 0.453 | 0.803 | 0.851 | 23 hours | 10.79 ms
SMART 36M | 0.456 | 0.810 | 0.861 | 3 days | 19.73 ms
SMART 96M | 0.457 | 0.819 | 0.872 | 1 week | 46.58 ms

The results in Table 9 highlight the performance of SMART models at different parameter scales across various metrics. As the model scale increases from SMART 1M to SMART 96M, there is a significant improvement in both the interactive and the map-based metrics, indicating that larger models are better at capturing interactions and understanding map context. The kinematic metrics, however, show minimal variation. Additionally, the training time and average inference time increase substantially with larger models, reflecting the trade-off between model performance and computational cost. Validation is conducted every 50,000 training steps, and the model is considered converged if there is no significant loss reduction or metric improvement after five consecutive validations. Training and inference times are measured on 32 NVIDIA Tesla V100 GPUs.
