An Analysis of the Data Requirements of End-to-End Intelligent Driving

Intelligent driving technology is moving rapidly toward end-to-end architectures, and each technical iteration is closely watched by the whole intelligent driving industry. In terms of overall structure, the end-to-end technical route can be divided into global end-to-end and modular end-to-end.

Global end-to-end approaches fall roughly into two categories: world models and VLMs (vision-language models). Their parameter scale is usually very large, their generalization ability is stronger, and the agent has a strong understanding of interactions.

Modular end-to-end is closer to traditional algorithms. Auxiliary modules can be added freely, which makes it easy to handle traffic-rule constraints. Its resource consumption is relatively low, but its generalization ability is weaker.

In practice, modular end-to-end is currently the mainstream approach that is close to deployment, and only a few companies are studying global end-to-end. (Tesla may have used a global end-to-end approach, but very little information is public; NIO (Weilai) plans to launch a world model.) In China, Huawei ADS 3.0 is widely regarded as the leading intelligent driving system, and its technical route is modular end-to-end.

Huawei ADS 3.0 architecture

This article analyzes the data requirements of end-to-end technology from an industry perspective, mainly covering data scale, quality, diversity, real-time performance and closed-loop feedback.

End-to-end intelligent driving relies on deep learning models, which usually require enormous amounts of data for training. Tesla relies on millions of cars around the world, equipped with cameras and sensors, to continuously collect driving data for training the FSD (Full Self-Driving) system. When discussing a financial report, Musk once described the data needed to train the model: "One million video training clips are barely enough; with 2 million it is slightly better; with 3 million you will feel 'wow'; and when it reaches 10 million, it becomes unbelievable."

Tesla data engine framework (image source: Internet)

Requirements

  • Millions to billions of kilometers of data, covering different weather, lighting, road conditions, traffic flow and other complex environments.

High-quality data, especially labeled data, is very important for end-to-end learning. Traditional autonomous driving (such as Waymo) usually relies on high-precision maps and accurately labeled 3D point cloud data. Tesla and Chinese OEMs are abandoning high-precision maps and turning to a "vision-first" strategy, which requires neural networks to understand the environment end to end. They rely on automatic labeling systems to reduce manual intervention and improve data quality. Automatic labeling can replace 5 million hours of manual work, leaving humans to check and fill in only a very small part.

Requirements

  • Accuracy: Mislabeled data may lead to learning bias and affect safety.
  • Denoising: Filter out sensor errors and low-quality data (blur, occlusion, etc.).
  • Consistency: Data for the same scene at different points in time should be reasonably consistent so that the network can generalize.
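The accuracy and denoising requirements above can be illustrated with a small filtering pass. This is only a minimal sketch: the `Frame` fields, the score ranges and all thresholds are hypothetical and do not describe any vendor's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    frame_id: str
    sharpness: float         # hypothetical score in [0, 1]; higher means sharper
    occluded_ratio: float    # hypothetical fraction of the image that is blocked
    label_confidence: float  # hypothetical auto-labeling confidence in [0, 1]


def filter_frames(frames, min_sharpness=0.4, max_occlusion=0.3, min_label_conf=0.8):
    """Split frames into (kept, needs_review, dropped) lists.

    Blurry or heavily occluded frames are dropped; frames whose automatic
    label is uncertain are routed to manual review instead of training.
    """
    kept, needs_review, dropped = [], [], []
    for frame in frames:
        if frame.sharpness < min_sharpness or frame.occluded_ratio > max_occlusion:
            dropped.append(frame)
        elif frame.label_confidence < min_label_conf:
            needs_review.append(frame)
        else:
            kept.append(frame)
    return kept, needs_review, dropped


frames = [
    Frame("a", sharpness=0.9, occluded_ratio=0.1, label_confidence=0.95),
    Frame("b", sharpness=0.2, occluded_ratio=0.1, label_confidence=0.95),
    Frame("c", sharpness=0.9, occluded_ratio=0.1, label_confidence=0.50),
]
kept, needs_review, dropped = filter_frames(frames)
```

Routing low-confidence labels to review rather than discarding them mirrors the idea that manual work only checks a small remainder of the automatically labeled data.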

Across the industry, improving data quality through self-supervised learning is a new trend. NVIDIA, for example, has improved the efficiency and reliability of data annotation through the self-supervised learning technique EmerNeRF.

EmerNeRF decomposes a scene by introducing three neural fields (a static field, a dynamic field and a flow field), which allows it to learn complex scenes effectively.

EmerNeRF decomposition and reconstruction pipeline

The static field represents static elements such as buildings, signs and street lamps; the dynamic field represents all moving objects; and the flow field models the motion of dynamic objects and is used to aggregate dynamic features over time. Most importantly, EmerNeRF learns this decomposition automatically from raw data without any manual annotation. After training, the model can render the temporal and spatial changes of a scene at the same time, achieving high-fidelity reconstruction of both static scenes and dynamic objects. This technology can help mass-production companies further increase their training volume and gain an advantage in end-to-end autonomous driving.
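To make the static/dynamic split more concrete, the sketch below blends the outputs of two fields at a single 3D point using a density-weighted average, a composition commonly used in static/dynamic radiance-field decompositions. It is an illustrative assumption, not EmerNeRF's actual code; the function name and arguments are hypothetical, and the flow field is omitted.

```python
def blend_fields(sigma_static, color_static, sigma_dynamic, color_dynamic):
    """Combine static and dynamic field outputs at one point.

    Densities add up; the color is the density-weighted average of the two
    fields, so whichever field "owns" the point dominates its appearance.
    """
    sigma = sigma_static + sigma_dynamic
    if sigma == 0.0:
        # Empty space: no density, so the color does not matter.
        return 0.0, (0.0, 0.0, 0.0)
    w_static = sigma_static / sigma
    w_dynamic = sigma_dynamic / sigma
    color = tuple(
        w_static * cs + w_dynamic * cd
        for cs, cd in zip(color_static, color_dynamic)
    )
    return sigma, color


# A point mostly occupied by a moving object (dynamic density 3x the static one).
sigma, color = blend_fields(1.0, (1.0, 0.0, 0.0), 3.0, (0.0, 0.0, 1.0))
```

In this toy case the blended color leans toward the dynamic field's blue, which is how a moving object ends up attributed to the dynamic component without manual labels.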

The end-to-end model needs to deal with the complex and changeable driving environment, so the data must have a wide range of scene coverage:

  • Geographical diversity: Cities, highways, rural roads, mountainous areas, etc.
  • Weather conditions: Sunny, rainy, snowy, foggy, etc.
  • Differences in traffic rules: Different countries and regions have different traffic regulations, signs and driving habits.
  • Long-tail scenes: Extreme situations (pedestrians suddenly entering the road, animals crossing, sudden accidents, etc.) are often the hardest part of autonomous driving. Tesla uses "shadow mode" to capture these cases and use them for training.

Requirements

  • For vehicles sold globally, it is necessary to collect global data covering different cultures and regulations.
  • Reinforcement learning should pay special attention to "long-tail" data so that the model does not fail on corner cases.
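One simple way to check scene coverage is to count clips per combination of scene tags and list the combinations that are under-represented. The sketch below assumes each clip carries hypothetical `geography` and `weather` tags; the tag names and the `min_clips` threshold are illustrative only.

```python
from collections import Counter
from itertools import product


def coverage_gaps(clips, geographies, weathers, min_clips=2):
    """Return sorted ((geography, weather), count) pairs with too few clips."""
    counts = Counter((clip["geography"], clip["weather"]) for clip in clips)
    return sorted(
        (combo, counts[combo])
        for combo in product(geographies, weathers)
        if counts[combo] < min_clips
    )


clips = [{"geography": "city", "weather": "sunny"}] * 3 + [
    {"geography": "highway", "weather": "rainy"}
]
gaps = coverage_gaps(clips, ["city", "highway"], ["sunny", "rainy"])
for combo, count in gaps:
    print(combo, count)
```

Combinations with zero or few clips are candidates for targeted collection or simulation, which is one practical reading of the long-tail requirement.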

Since Tesla FSD entered China in 2025, many institutions have evaluated it. According to their feedback, FSD's performance in China is moderate: its basic driving ability is excellent, but it has many defects in traffic light recognition, compliance with traffic rules and mixed-traffic scenes. The main reason is the difference in data diversity (FSD's model was trained on essentially no Chinese road traffic data).

An intelligent driving system needs continuously updated data to adapt to new driving situations. When there are many test vehicles, valuable scenes must be filtered, compressed and transmitted efficiently. Tesla's FSD Beta adopts a fast iteration strategy: its data pipeline automatically discovers and collects "valuable cases" to improve the model. The nature of end-to-end learning makes it more dependent on real-time data feedback. For example, when the model makes an error in a specific driving environment, the system needs to quickly send the data back, analyze the problem, and update the model through OTA.

Requirements

  • Data collection, processing, training and deployment should all be highly efficient.
  • Automatic scene identification and screening should distill scene data efficiently, so that effective data is obtained quickly for model training and evaluation.
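The screening step can be pictured as a scoring pass followed by selection under an upload budget. The sketch below is a greedy illustration under stated assumptions: the trigger names, their weights and the `Clip` fields are hypothetical, and clip sizes are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    size_mb: float        # assumed to be greater than zero
    triggers: frozenset   # hypothetical trigger names fired during the clip


# Hypothetical weights expressing how valuable each trigger is for training.
TRIGGER_WEIGHTS = {"driver_takeover": 3.0, "hard_brake": 2.0, "rare_object": 2.5}


def score(clip):
    """Sum the weights of all known triggers that fired in the clip."""
    return sum(TRIGGER_WEIGHTS.get(t, 0.0) for t in clip.triggers)


def select_for_upload(clips, budget_mb):
    """Greedily pick clips with the highest value per megabyte within budget."""
    ranked = sorted(
        (c for c in clips if score(c) > 0),
        key=lambda c: score(c) / c.size_mb,
        reverse=True,
    )
    chosen, used = [], 0.0
    for clip in ranked:
        if used + clip.size_mb <= budget_mb:
            chosen.append(clip)
            used += clip.size_mb
    return chosen


clips = [
    Clip("a", 100, frozenset({"driver_takeover"})),
    Clip("b", 50, frozenset({"hard_brake"})),
    Clip("c", 80, frozenset()),
    Clip("d", 100, frozenset({"hard_brake", "rare_object"})),
]
chosen = select_for_upload(clips, budget_mb=160)
```

Clips with no trigger never leave the vehicle, which keeps transmission costs proportional to the number of valuable scenes rather than to total driving time.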

The core advantage of end-to-end driving lies in data-driven closed-loop learning.

  • Application of shadow mode: Even when ADS is not actually controlling the vehicle, the system can still run in the background and record the difference between the model's decision and the human driver's decision. If a deviation is found, the scene data is sent back and used for model optimization.

Tesla shadow mode
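The shadow-mode comparison described above can be sketched as a loop over paired model and human actions that flags segments of sustained disagreement. This is a minimal sketch, not Tesla's implementation: the `Action` fields, the tolerances and the `min_consecutive` rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    steering_deg: float
    accel_mps2: float


def deviates(model, human, steer_tol_deg=5.0, accel_tol=1.0):
    """True if the model's proposed action differs noticeably from the driver's."""
    return (
        abs(model.steering_deg - human.steering_deg) > steer_tol_deg
        or abs(model.accel_mps2 - human.accel_mps2) > accel_tol
    )


def find_deviation_segments(log, min_consecutive=3):
    """Return (start, end) index pairs, end exclusive, of sustained deviations.

    `log` is a list of (model_action, human_action) pairs, one per time step.
    Short blips are ignored so that only persistent disagreement is reported.
    """
    segments, start = [], None
    for i, (model, human) in enumerate(log):
        if deviates(model, human):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_consecutive:
                segments.append((start, i))
            start = None
    if start is not None and len(log) - start >= min_consecutive:
        segments.append((start, len(log)))
    return segments


agree = (Action(0.0, 0.0), Action(0.0, 0.0))
disagree = (Action(10.0, 0.0), Action(0.0, 0.0))
log = [agree, disagree, disagree, disagree, agree, disagree, agree]
segments = find_deviation_segments(log)
```

Each reported segment marks the time window whose sensor data would be sent back for model optimization; the single-step disagreement near the end is treated as noise.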

  • Data augmentation and simulation: Tesla, NVIDIA, NIO (Weilai), Xiaomi and other companies use neural rendering to generate scenes, while Waymo and other companies build large-scale virtual test scenarios for model fine-tuning.

Tesla trained the model through simulation.

High confidence synthetic virtual scene (aiSim)

Requirements

  • Data mining: Automatically identify low-performance areas and improve the convergence speed of the model.
  • Simulation capability: Use high-fidelity simulation data to improve data utilization.
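Identifying low-performance areas can be as simple as grouping evaluation results by scenario tag and ranking the tags by failure rate. The sketch below assumes a hypothetical input format of `(scenario_tag, passed)` pairs; `min_samples` and `top_k` are illustrative parameters.

```python
from collections import defaultdict


def weakest_scenarios(results, min_samples=5, top_k=3):
    """Rank scenario tags by failure rate, highest first.

    Tags with fewer than `min_samples` results are skipped because their
    failure rate is too noisy to act on.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tag, passed in results:
        totals[tag] += 1
        if not passed:
            failures[tag] += 1
    rates = [
        (tag, failures[tag] / totals[tag])
        for tag in totals
        if totals[tag] >= min_samples
    ]
    # Sort by descending failure rate, then by tag name for a stable order.
    rates.sort(key=lambda item: (-item[1], item[0]))
    return rates[:top_k]


results = (
    [("night_rain", False)] * 3
    + [("night_rain", True)] * 2
    + [("highway", True)] * 9
    + [("highway", False)]
    + [("rare", False)] * 2
)
ranking = weakest_scenarios(results)
```

The top-ranked tags indicate where additional real or simulated data is likely to help most, while sparsely sampled tags such as `rare` point to a collection gap rather than a measured weakness.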

In summary, end-to-end intelligent driving places extremely strict requirements on data, including:

  • Large-scale data: Covering millions to billions of kilometers of driving data.
  • High-quality labeling: Reduce manual intervention and improve the accuracy and efficiency of labeling.
  • Scene diversity: Cover various road conditions, weather, geographical environments, etc.
  • Real-time performance: Quickly collect, analyze and optimize, then update via OTA.
  • Closed-loop feedback: Automatically discover problem data and optimize the model.