The new language of lanes
At Tesla AI Day, the Autopilot team revealed massive improvements and upgrades to their software. To date, Full Self-Driving (FSD) has received 35 software updates. Ashok Elluswamy, the director of Autopilot, announced that approximately 160,000 customers worldwide are using the Autopilot and self-driving beta software, up from 2,000 customers last year.
The Autopilot team explained how the FSD system is trained and operated, from neural networks to training data, planning, training infrastructure, the AI compiler, inference steps, and more.
The occupancy network is a multi-camera neural network that predicts the environment around the car from camera images. Inference runs entirely on the vehicle's onboard computer and does not depend on a server. The network also predicts the future motion and position of surrounding objects.
The occupancy network uses the vehicle's eight cameras, capturing 12-bit images, to detect objects around the car and build a single, unified volumetric 3D occupancy vector space. Because it operates on video inputs, it can detect changes in the environment almost instantly (in less than 10 milliseconds), such as passing pedestrians, debris, or accelerating cars, and adjust the car's speed and position depending on the uncertainty.
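To ground the idea, here is a deliberately tiny PyTorch sketch of a multi-camera occupancy head; the layers, the fusion scheme, and the voxel-grid size are illustrative assumptions, far simpler than Tesla's production network:

```python
import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    """Illustrative multi-camera occupancy head (not Tesla's architecture).

    Eight camera images -> shared CNN backbone -> cross-camera attention ->
    dense 3D voxel grid of per-voxel occupancy probabilities.
    """
    def __init__(self, grid=(100, 100, 8), feat_dim=256):
        super().__init__()
        self.grid = grid
        # Shared per-camera backbone (a real system would use a much larger
        # pretrained image encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fuse the eight per-camera features with self-attention.
        self.fuse = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # Decode the fused feature into a flattened voxel grid of logits.
        self.decode = nn.Linear(feat_dim, grid[0] * grid[1] * grid[2])

    def forward(self, images):                         # images: (B, 8, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))    # (B*8, D, 1, 1)
        feats = feats.view(b, n, -1)                   # (B, 8, D)
        fused, _ = self.fuse(feats, feats, feats)      # attend across cameras
        logits = self.decode(fused.mean(dim=1))        # (B, X*Y*Z)
        return logits.view(b, *self.grid).sigmoid()    # per-voxel occupancy

occ = ToyOccupancyNet()(torch.randn(1, 8, 3, 128, 256))   # -> (1, 100, 100, 8)
```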
Additionally, the team develops Neural Radiance Field (NeRF) networks, treating the occupancy network's output vectors as NeRF inputs. Using images from the vehicle cameras, NeRF can reconstruct dense 3D meshes through volumetric rendering.
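Tesla has not published this pipeline, but the volumetric-rendering step at the heart of NeRF is standard. The sketch below composites color along a single ray from per-sample densities; extracting a dense mesh would then require an extra step such as marching cubes over the recovered density field:

```python
import torch

def render_ray(densities, colors, deltas):
    """Standard NeRF volume rendering along one ray.

    densities: (S,) non-negative sigma at each of S samples
    colors:    (S, 3) RGB at each sample
    deltas:    (S,) distance between consecutive samples
    Returns the composited RGB for the ray.
    """
    alpha = 1.0 - torch.exp(-densities * deltas)      # opacity per sample
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )
    weights = alpha * trans                           # contribution per sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # (3,) rendered RGB

rgb = render_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.05))
```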
The network is trained on a large auto-labeled dataset without human annotation. The team built three in-house supercomputers comprising 14,000 GPUs for training and automatic labeling. Training videos are stored in a 30-petabyte distributed video cache, with half a million videos entering and leaving the system daily.
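Neither the cache nor Tesla's loader internals are public; as a generic picture of how video clips stream into a PyTorch training loop at this scale, here is a minimal sketch in which every size, name, and the dummy decode step are placeholder assumptions:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ClipDataset(Dataset):
    """Sketch of a video-clip dataset (sizes and layout are invented).

    Each item stands in for a short multi-camera clip; a real pipeline
    would decode frames here (e.g. with a hardware video decoder) instead
    of generating random tensors.
    """
    def __init__(self, num_clips=1000, cams=8, frames=8, hw=(96, 160)):
        self.num_clips, self.cams, self.frames, self.hw = num_clips, cams, frames, hw

    def __len__(self):
        return self.num_clips

    def __getitem__(self, i):
        # Placeholder decode step: random frames keep the sketch runnable.
        return torch.rand(self.cams, self.frames, 3, *self.hw)

if __name__ == "__main__":
    loader = DataLoader(
        ClipDataset(),
        batch_size=4,
        num_workers=2,       # decode clips in parallel with the training step
        pin_memory=True,     # faster host-to-GPU copies
        prefetch_factor=2,   # keep batches queued ahead of the GPU
    )
    batch = next(iter(loader))   # (4, 8, 8, 3, 96, 160)
```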
The language of lanes
In its previous lane-detection method, Tesla used pixelwise 2D instance segmentation, which could only detect the ego lane and adjacent lanes. It worked effectively only on well-designed, structured roads such as highways; on city streets, intersections and lane topologies are far more complex.
Tesla introduced the "FSD Lanes Neural Network," which includes three components: a vision component, a map component, and a language component.
The "vision component" consists of a set of convolutional layers, attention layers, and other neural network layers that produce a visual representation from the video of the vehicle's eight cameras. This visual representation is then enhanced by the "map component," which feeds in road-level navigation map data through the "lane guidance module."
The lane guidance module consists of neural network layers that provide information about intersections, lane counts, and various other road features that the vehicle cameras might not be able to identify easily in real time. Together, the first two components produce a dense 3D world tensor.
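As a concrete (and heavily simplified) picture of that fusion, the sketch below concatenates bird's-eye-view visual features with an embedded map raster; the module name, channel counts, and the rasterized-map format are all assumptions, not Tesla's design:

```python
import torch
import torch.nn as nn

class WorldTensorFusion(nn.Module):
    """Illustrative fusion of the vision and map components (shapes invented).

    Visual bird's-eye-view features (from the eight cameras) are concatenated
    with an embedded lane-guidance raster from the navigation map, then mixed
    by convolutions into a single dense world tensor.
    """
    def __init__(self, vis_ch=128, map_ch=16, out_ch=128):
        super().__init__()
        self.map_embed = nn.Conv2d(map_ch, 32, 1)      # embed coarse map raster
        self.mix = nn.Sequential(
            nn.Conv2d(vis_ch + 32, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, bev_feats, map_raster):
        # bev_feats:  (B, 128, H, W) bird's-eye-view visual features
        # map_raster: (B, 16, H, W)  rasterized lane counts, connectivity, etc.
        m = self.map_embed(map_raster)
        return self.mix(torch.cat([bev_feats, m], dim=1))  # dense world tensor

world = WorldTensorFusion()(torch.randn(1, 128, 50, 50), torch.randn(1, 16, 50, 50))
```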
This dense world tensor is treated as an input image and combined with the third component: a language Tesla developed to encode lanes and lane topology, called the "lane language." The approach works like a language model, in which the words and tokens are lane positions in 3D space, predicted autoregressively.
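A toy version of such an autoregressive lane-language decoder might look like the following; the vocabulary, the START/FORK/END tokens, and all model sizes are invented for illustration and are not Tesla's model:

```python
import torch
import torch.nn as nn

GRID_TOKENS = 1024                   # quantized lane-point positions (assumed)
START, FORK, END = 1024, 1025, 1026  # special topology tokens (assumed)
VOCAB = GRID_TOKENS + 3

class ToyLaneDecoder(nn.Module):
    """Toy 'lane language' decoder: an autoregressive transformer that emits
    lane tokens conditioned on the dense world tensor."""
    def __init__(self, d=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, VOCAB)

    @torch.no_grad()
    def generate(self, world_tokens, max_len=32):
        # world_tokens: (B, N, d) flattened world-tensor features used as memory.
        seq = torch.full((world_tokens.size(0), 1), START, dtype=torch.long)
        for _ in range(max_len):
            x = self.embed(seq)
            # Causal mask so each position attends only to earlier tokens.
            mask = torch.triu(torch.ones(x.size(1), x.size(1)), diagonal=1).bool()
            h = self.decoder(x, world_tokens, tgt_mask=mask)
            nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)  # greedy decode
            seq = torch.cat([seq, nxt], dim=1)
            if (nxt == END).all():
                break
        return seq  # token string describing lane geometry and topology

lanes = ToyLaneDecoder().generate(torch.randn(1, 2500, 128))
```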
Labeling the half a million training videos that pass through the supercomputers every day is a daunting task. The team built an automatic labeling machine for the Lanes Network which, from the vehicle camera videos, can reconstruct 3D vector spaces by combining the occupancy network with the newly developed lane language. Creating a vector space from a single trip takes the system only about 30 minutes.
Then, using "multi-trip reconstruction," footage from different cars is combined and aligned. This produces a map in even less time, and ultimately requires human intervention only to finalize the output label.
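Tesla did not detail how trips are matched; as a stand-in for the geometric core of such alignment, here is the standard Kabsch/Procrustes solution for rigidly registering matched 2D lane points from two trips:

```python
import torch

def align_trip(src, dst):
    """Rigid 2D alignment (Kabsch/Procrustes) of matched lane points from
    one trip onto another. src, dst: (N, 2) tensors of corresponding points.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    s, d = src - mu_s, dst - mu_d
    u, _, vt = torch.linalg.svd(s.T @ d)   # SVD of the 2x2 cross-covariance
    r = (u @ vt).T                         # optimal rotation R = V U^T
    if torch.det(r) < 0:                   # guard against reflections
        vt[-1] *= -1
        r = (u @ vt).T
    t = mu_d - r @ mu_s                    # optimal translation
    return src @ r.T + t                   # src expressed in dst's frame

# Example: a second trip sees the same lane points, rotated and shifted.
theta = torch.deg2rad(torch.tensor(10.0))
rot = torch.tensor([[theta.cos(), -theta.sin()], [theta.sin(), theta.cos()]])
a = torch.randn(100, 2)
b = a @ rot.T + torch.tensor([1.0, 2.0])
aligned = align_trip(a, b)                 # ~= b up to numerical noise
```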
To correct labels where the automated labeling system struggled, such as parked vehicles, trucks, vehicles on winding roads, or parking lots, the team manually corrected 13,900 video labels to refine the data engine's training set.
Using its accelerated video library built on PyTorch, the team reported roughly a 30% increase in training speed.

Using data generated by the occupancy network, the lane language, and the 3D reconstructions produced by NeRF, the team built a simulation environment. In this synthetic 3D world, the team introduced new challenges, environments, and objects to train the system on changing situations such as road designs, biomes, weather conditions, and so on.
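That kind of scenario variation can be pictured as procedural sampling over environment parameters. The schema below is entirely invented for illustration and is not Tesla's simulator API:

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    """Illustrative simulation-scenario parameters (not Tesla's schema)."""
    weather: str
    biome: str
    lane_count: int
    curvature: float  # road curvature in 1/m

def sample_scenarios(n, seed=0):
    """Procedurally vary environment parameters to cover rare combinations."""
    rng = random.Random(seed)
    weathers = ["clear", "rain", "fog", "snow"]
    biomes = ["urban", "suburban", "desert", "forest"]
    return [
        Scenario(
            weather=rng.choice(weathers),
            biome=rng.choice(biomes),
            lane_count=rng.randint(1, 5),
            curvature=rng.uniform(-0.02, 0.02),
        )
        for _ in range(n)
    ]

for s in sample_scenarios(3):
    print(s)
```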
Elon Musk said the FSD beta will be available worldwide by the end of this year. "But, for many countries, we need regulatory approval. So we're kind of stuck with regulatory approval in other countries," Musk explained. "Technically, it'll be ready to go to a global beta by the end of this year."