An increase in car ownership brings convenience to people’s lives, but it also leads to frequent traffic accidents. Precisely forecasting surrounding agents’ future trajectories can effectively reduce vehicle-vehicle and vehicle-pedestrian collisions. Long short-term memory (LSTM) networks are often used for vehicle trajectory prediction, but they suffer from shortcomings such as gradient explosion and low efficiency. A trajectory prediction method based on an improved Transformer network is proposed to forecast agents’ future trajectories in complex traffic environments. It replaces the sequential, step-by-step processing of LSTM with the parallel, attention-based processing of the Transformer. To make trajectory prediction more efficient, a probabilistic sparse self-attention mechanism is introduced, which lowers attention complexity by reducing the number of queries evaluated in the attention mechanism. The activate-or-not (ACON) activation function is adopted to learn adaptively whether to activate each neuron, improving model flexibility. The proposed method is evaluated on the publicly available next-generation simulation (NGSIM) and ETH/UCY benchmarks. The experimental results indicate that the proposed method can accurately and efficiently predict agents’ trajectories.
Automated driving technology has become a research hot spot due to its wide applications in intelligent transportation systems. As shown in
With the developments of machine learning [
With the rise of deep learning [
To solve the above problems, we propose to achieve trajectory prediction based on Transformer networks, which have achieved state-of-the-art performance on multiple tasks such as Natural Language Processing (NLP) [
LSTM can only process sequences step by step due to its recurrent structure, whereas Transformer networks [
To sum up, the main contributions of this paper can be summarized as follows:
Considering that traditional forecasting methods do not make full use of historical information and cannot process large amounts of data in parallel, a Transformer network is used to realize vehicle trajectory prediction more effectively. To exploit the sparsity of attention queries and reduce the high space-time complexity of the self-attention mechanism, a probabilistic sparse self-attention mechanism is introduced that computes query-key values selectively and lowers the space-time overhead of self-attention. To address the fact that traditional activation functions are sensitive to parameter initialization and the learning rate and are prone to neuron death, the activate-or-not (ACON) activation function [
Transformer networks have achieved excellent performance on multiple NLP tasks such as machine translation [
As shown in
When the attention mechanism [
The input consists of a
To improve on the ordinary self-attention layer, the Transformer network extends it to a multi-head attention mechanism. A single self-attention head is constrained to attend to a particular pattern of positions, whereas multi-head attention lets different sub-spaces attend to multiple positions, which significantly expands the capacity of the attention mechanism. The specific process of the multi-head attention mechanism with H heads is as follows:
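As an illustration only (not the paper’s implementation), a minimal PyTorch sketch of scaled dot-product attention with H heads might look as follows; the module and parameter names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention: H heads attend to different sub-spaces."""
    def __init__(self, d_model=128, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project and split into H heads: (batch, heads, seq_len, d_k)
        q, k, v = [w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
                   for w, x in zip((self.w_q, self.w_k, self.w_v), (q, k, v))]
        # Scaled dot-product attention per head
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)
        # Concatenate heads and project back to d_model
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```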
Activation function [
Through a series of improvements, Google proposed an activation function called Swish [
A stationary reference system is used in this work. The origin of the reference frame is fixed on the predicted agent at time
We aim to predict the future trajectories of different agents, including vehicles and pedestrians. Specifically, their future trajectories are predicted based on their historical trajectories.
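As an illustration, historical trajectories can be shifted into such an agent-centred frame before being fed to the network; the sketch below assumes 2-D (x, y) coordinates and a hypothetical reference step `t_obs` at which the origin is fixed:

```python
import numpy as np

def to_agent_frame(trajectory, t_obs):
    """Shift a trajectory of shape (T, 2) so the origin coincides with the
    predicted agent's position at the reference time step t_obs."""
    origin = trajectory[t_obs].copy()
    return trajectory - origin, origin

def to_world_frame(relative_trajectory, origin):
    """Map predicted relative coordinates back to the original frame."""
    return relative_trajectory + origin
```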
The model is mainly composed of three parts: the data processing part, the feature extraction part, and the trajectory prediction part. Descriptions of different parts are as follows:
Data Processing: For agent
In
To add timing information to the input embedding, the position embedding PE(
Feature Extraction: After data processing,
Afterward, the sparse self-attention module calculates the first
Trajectory Prediction: Trajectory prediction is mainly achieved in the decoder, which predicts future trajectories in an auto-regressive manner. The decoder mainly takes two inputs: the ground-truth trajectory embedding
Afterward, the multi-head probabilistic sparse self-attention module takes three embeddings as input and outputs the attention matrix
Finally,
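To make the three-part pipeline concrete, the following is a simplified end-to-end sketch built on PyTorch’s standard `nn.Transformer` (ordinary attention rather than the probabilistic sparse variant used in this work); positional encoding is omitted here and sketched after the position-embedding discussion below. All names and default sizes are illustrative:

```python
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    """Simplified encoder-decoder trajectory predictor. Uses standard
    attention (not the probabilistic sparse variant) and omits positional
    encoding for brevity."""
    def __init__(self, d_model=128, num_heads=8, num_layers=4):
        super().__init__()
        self.input_embed = nn.Linear(2, d_model)    # observed (x, y) -> d_model
        self.output_embed = nn.Linear(2, d_model)   # decoder input (x, y) -> d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=num_heads,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dropout=0.1, batch_first=True)
        self.head = nn.Linear(d_model, 2)           # d_model -> predicted (x, y)

    def forward(self, history, decoder_inputs):
        # history: (B, T_obs, 2); decoder_inputs: (B, T_pred, 2) shifted targets
        src = self.input_embed(history)
        tgt = self.output_embed(decoder_inputs)
        # Causal mask so each step only attends to earlier prediction steps
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.head(out)
```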
The Transformer network could not capture the sequential properties of sequences since it has no loop structure like LSTM. Hence, a position embedding technology [
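A minimal sketch of the widely used sinusoidal position embedding, which we assume here as the encoding added to the input embedding, is:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal position embedding added to the input embedding so that
    the model can distinguish the order of time steps."""
    def __init__(self, d_model, max_len=500):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
```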
In traditional self-attention, each position of the trajectory attends to all other positions. However, the learned attention matrix is very sparse. Therefore, the computational complexity can be reduced by incorporating a structural bias that limits the number of query-key pairs each query attends to. Under this restriction, we introduce sparse attention, in which only similar Q-K pairs are calculated. The complexity of attention can then be reduced by decreasing the number of queries, which is done by selecting query prototypes.
In this work, several query prototypes are selected as the primary sources to calculate the distribution of attention. The model either copies the distributions to the locations of the represented queries or populates those locations with uniform distributions.
A query sparsity measurement is adopted to select prototypes from the queries. The measurement is the Kullback-Leibler divergence between a query’s attention distribution and the discrete uniform distribution. We define the
The first term is the log-sum-exp (LSE) of
Based on the above discussions, the sparse self-attention [
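A simplified sketch of this probabilistic sparse self-attention is given below. The query sparsity measurement is taken as the gap between the log-sum-exp and the arithmetic mean of each query’s scaled dot products (a proxy for its KL divergence from the uniform distribution). For clarity the full score matrix is formed in this sketch, whereas the actual mechanism estimates the measurement on a sampled subset of keys to obtain the complexity reduction; the ratio `top_ratio` is an illustrative parameter:

```python
import torch
import torch.nn.functional as F

def prob_sparse_attention(q, k, v, top_ratio=0.25):
    """Sketch of probabilistic sparse self-attention.
    q, k, v: (batch, seq_len, d). Only the top-u 'active' queries compute a
    full softmax over the keys; the remaining 'lazy' queries receive the
    mean of V, i.e. an approximately uniform attention distribution."""
    d = q.size(-1)
    # Full score matrix, kept only for clarity; the real mechanism estimates
    # the measurement on a random subset of keys.
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d ** 0.5)        # (B, Lq, Lk)

    # Query sparsity measurement: M(q_i, K) = LSE_j(s_ij) - mean_j(s_ij)
    sparsity = torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1)  # (B, Lq)

    # Select the top-u queries with the largest measurement
    u = max(1, int(top_ratio * q.size(1)))
    top_idx = sparsity.topk(u, dim=-1).indices                        # (B, u)

    # Lazy queries: output the mean of V
    out = v.mean(dim=1, keepdim=True).expand_as(q).clone()
    # Active queries: ordinary softmax attention
    top_scores = scores.gather(
        1, top_idx.unsqueeze(-1).expand(-1, -1, scores.size(-1)))
    top_out = torch.matmul(F.softmax(top_scores, dim=-1), v)          # (B, u, d)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, d), top_out)
    return out
```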
Ma et al. [
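Assuming the ACON-C form proposed by Ma et al., f(x) = (p1 − p2)·x·σ(β(p1 − p2)·x) + p2·x with learnable p1, p2 and switching factor β, a per-channel PyTorch sketch is:

```python
import torch
import torch.nn as nn

class ACONC(nn.Module):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    The learnable switching factor beta decides whether the unit activates
    (non-linear, Swish-like) or not (approximately linear)."""
    def __init__(self, width):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, width))
        self.p2 = nn.Parameter(torch.randn(1, width))
        self.beta = nn.Parameter(torch.ones(1, width))

    def forward(self, x):
        # x: (..., width); parameters broadcast over the channel dimension
        dp = (self.p1 - self.p2) * x
        return dp * torch.sigmoid(self.beta * dp) + self.p2 * x
```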
The loss function [
The proposed method is evaluated on two publicly available benchmarks, the next-generation simulation (NGSIM) and ETH/UCY datasets. The former contains vehicle trajectories, and the latter contains pedestrian trajectories. Their details are as follows:
NGSIM: NGSIM [
ETH/UCY: This dataset includes a total of five videos (ETH, HOTEL, UNIV, ZARA1, and ZARA2) covering four different scenes (ZARA1 and ZARA2 were recorded by the same camera at different times) [
Average Displacement Error (ADE) and Final Displacement Error (FDE) are used as metrics for evaluation. ADE is used to measure the average difference at each time step between the predicted trajectory and the ground truth. FDE is used to measure the distance between the final destinations of the predicted trajectory and the ground truth.
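For reference, both metrics reduce to simple L2 distances over 2-D positions; a NumPy sketch:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all time steps.
    pred, gt: arrays of shape (T, 2)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last predicted step."""
    return np.linalg.norm(pred[-1] - gt[-1])
```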
The experiments are run on Ubuntu with the PyTorch framework. The Adam optimizer is used to train the network with a learning rate of 0.0004, and the dropout rate is set to 0.1. Other hyper-parameters, including the number of training epochs, the number of layers, the embedding size, and the number of attention heads in the Transformer network, are fine-tuned through cross-validation on the ETH/UCY dataset.
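A minimal training-loop sketch with the reported settings (Adam, learning rate 0.0004, dropout 0.1 inside the network); the `TrajectoryTransformer` from the earlier sketch, the `train_loader`, and the MSE loss used here stand in for the paper’s own model, data pipeline, and loss function:

```python
import torch
import torch.nn.functional as F

# TrajectoryTransformer and train_loader are placeholders: the model sketch
# from above and a hypothetical DataLoader yielding (history, future) pairs.
model = TrajectoryTransformer(d_model=128, num_heads=8, num_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)   # reported learning rate

for epoch in range(300):                         # epochs tuned over {100, 300, 500}
    for history, future in train_loader:
        optimizer.zero_grad()
        # Teacher forcing: decoder sees the ground truth shifted by one step
        pred = model(history, future[:, :-1])
        loss = F.mse_loss(pred, future[:, 1:])   # MSE stands in for the paper's loss
        loss.backward()
        optimizer.step()
```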
Unlike LSTM-based trajectory prediction methods, the Transformer network has a large space-time overhead. Hence, hyper-parameter fine-tuning is performed on the ETH/UCY dataset, which contains far fewer trajectories than NGSIM. Specifically, we evaluate the proposed method on ETH/UCY with different numbers of layers, embedding sizes, numbers of heads, and numbers of training epochs (100, 300, 500). The initial values of the number of layers, embedding size, and number of heads are empirically set to 4, 128, and 8, respectively. When one hyper-parameter is fine-tuned, the others are kept fixed.
| Number of layers | ADE (100 epochs) | ADE (300 epochs) | ADE (500 epochs) | FDE (100 epochs) | FDE (300 epochs) | FDE (500 epochs) |
|---|---|---|---|---|---|---|
| 1 | 0.505 | 0.502 | 0.504 | 1.011 | 1.008 | 1.003 |
| 2 | 0.494 | 0.492 | 0.491 | 0.992 | 0.990 | 0.988 |
| 4 | 0.479 | 0.475 | 0.985 | | | |
| 6 | 0.478 | 0.476 | 0.474 | 0.984 | 0.984 | |
| Embedding size | ADE (100 epochs) | ADE (300 epochs) | ADE (500 epochs) | FDE (100 epochs) | FDE (300 epochs) | FDE (500 epochs) |
|---|---|---|---|---|---|---|
| 32 | 0.495 | 0.491 | 0.490 | 0.998 | 0.995 | 0.992 |
| 64 | 0.481 | 0.488 | 0.487 | 0.990 | 0.990 | 0.989 |
| 128 | 0.480 | 0.478 | 0.478 | 0.990 | 0.989 | 0.989 |
| 256 | 0.477 | 0.984 | | | | |
| 512 | 0.479 | 0.477 | 0.476 | 0.983 | 0.983 | |
| Number of heads | ADE (100 epochs) | ADE (300 epochs) | ADE (500 epochs) | FDE (100 epochs) | FDE (300 epochs) | FDE (500 epochs) |
|---|---|---|---|---|---|---|
| 2 | 0.501 | 0.498 | 0.497 | 1.021 | 1.010 | 1.002 |
| 4 | 0.484 | 0.482 | 0.482 | 0.990 | 0.991 | 0.987 |
| 8 | 0.478 | 0.477 | 0.983 | | | |
| 16 | 0.481 | 0.479 | 0.985 | 0.983 | 0.983 | |
The main contribution of this paper is to replace the traditional self-attention mechanism with a probabilistic sparse self-attention mechanism, which significantly reduces the space-time complexity of the model and improves efficiency. Besides, the ACON-C activation function is used to decide adaptively whether to activate. We perform a detailed ablation study on the two benchmarks to show the effects of the proposed improvements. Specifically, we compare trajectory prediction performance between the probabilistic sparse self-attention and the traditional self-attention, and between ReLU and ACON-C. As indicated by
| Probabilistic sparse self-attention | Self-attention | ACON-C | ReLU | ADE/FDE |
|---|---|---|---|---|
| √ | √ | | | 0.922/1.874 |
| √ | √ | | | 0.833/1.782 |
| √ | √ | | | 0.989/2.061 |
| √ | √ | | | 0.962/1.894 |
In this work, an improved Transformer network is proposed to perform trajectory prediction for both vehicles and pedestrians. Firstly, we evaluate the influence of the prediction span on vehicle trajectory prediction.
| Model | 2 s | 3 s | 4 s | 5 s |
|---|---|---|---|---|
| LSTM | 3.86/7.89 | 5.45/13.02 | 8.57/21.27 | |
| Ours | 1.81/3.83 | | | |
To further demonstrate the effects of the proposed method, we make a quantitative analysis by comparing the proposed method with several state-of-the-art methods, including the LSTM-based method, Social-LSTM [
| Dataset | LSTM | Social-LSTM | SGAN | Sophie | Ours |
|---|---|---|---|---|---|
| I80-1 | 8.89/21.06 | 9.06/21.56 | 8.28/20.24 | 7.26/18.66 | 6.89/15.78 |
| I80-2 | 8.26/19.84 | 8.54/19.29 | 7.78/18.28 | 5.76/14.24 | 6.21/15.01 |
| I80-3 | 7.66/19.01 | 7.61/18.71 | 7.01/17.31 | 5.61/13.71 | 5.60/13.58 |
| US101-1 | 9.86/25.02 | 10.21/25.16 | 9.56/24.11 | 8.86/22.16 | 7.76/18.95 |
| US101-2 | 8.62/21.58 | 8.82/22.61 | 8.89/22.63 | 7.82/20.61 | 7.35/18.12 |
| US101-3 | 8.13/21.08 | 8.23/20.88 | 8.02/20.14 | 7.02/18.40 | 6.21/16.88 |
| Avg | 8.57/21.27 | 8.75/21.37 | 8.26/20.45 | 7.06/17.96 | 6.67/16.39 |
For comparisons on ETH/UCY as shown in
| Dataset | LSTM | Social-LSTM | SGAN | Sophie | Ours |
|---|---|---|---|---|---|
| ETH | 1.09/2.41 | 1.09/2.35 | 0.87/1.62 | 0.85/1.67 | |
| Hotel | 0.86/1.91 | 0.79/1.76 | 0.67/1.37 | 0.76/1.67 | |
| Univ | 0.61/1.31 | 0.67/1.40 | 0.76/1.52 | 0.54/1.24 | |
| Zara1 | 0.41/0.88 | 0.47/1.00 | 0.35/0.68 | 0.38/0.86 | |
| Zara2 | 0.52/1.11 | 0.56/1.17 | 0.42/0.84 | 0.38/0.78 | |
| Avg | 0.70/1.52 | 0.72/1.54 | 0.61/1.21 | 0.54/1.15 | |
We perform a qualitative analysis to demonstrate the effects of the proposed method.
In this work, Gaussian noise is added to the model to simulate uncertain factors. As a result, the predicted vehicle trajectories are more multimodal than before. As shown in
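A possible way to realize this, sketched below with an illustrative noise scale and a hypothetical `predict` helper, is to draw several forward passes with independently perturbed inputs:

```python
import torch

def sample_trajectories(model, history, num_samples=5, sigma=0.05):
    """Draw several plausible futures by perturbing the observed history
    with Gaussian noise before each forward pass. `sigma` and the
    auto-regressive decoding helper `model.predict` are placeholders."""
    samples = []
    with torch.no_grad():
        for _ in range(num_samples):
            noisy_history = history + sigma * torch.randn_like(history)
            samples.append(model.predict(noisy_history))
    return samples
```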
An improved Transformer network is proposed to perform trajectory prediction in this work. A traditional Transformer network utilizes the multi-head attention mechanism to capture sequential information in agents’ trajectories and already outperforms LSTM-based trajectory prediction methods. Further, a probabilistic sparse self-attention mechanism is introduced to exploit the sparsity of attention queries and reduce the high space-time complexity of the self-attention mechanism. The ACON activation function is used to address the fact that traditional activation functions are sensitive to parameter initialization and the learning rate and are prone to neuron death. Evaluations on the publicly available NGSIM and ETH/UCY benchmarks indicate that the proposed method is well suited to forecasting the future trajectories of vehicles and pedestrians. Our future work will mainly focus on recognizing pedestrians’ crossing intentions [