From a Tesla accident to the Transformer

Source: China Financial Information Center  Time: 2022.08.02

Author: Zhong Juan, autonomous-driving algorithm expert at Furui Microelectronics; graduated from the Chinese Academy of Sciences, worked at a joint research institute of Tsinghua University, and has served as a senior algorithm expert and system architecture expert.

According to reports from Taiwan's United Daily News and other outlets, on the morning of July 22 a traffic accident occurred while Taiwanese entertainer Lin was driving a Tesla, drawing wide attention online. The accident took place on Zhongzheng North Road, Taoyuan City, Taiwan: the vehicle departed from its lane, struck the median divider, and caught fire. Fortunately, workers from a nearby construction site rushed to the rescue and pulled Lin and his son to a safe area, avoiding more serious casualties; both were taken to Linkou Chang Gung Hospital for treatment.

Judging from on-site video, the road conditions, visibility, and weather on the section where the impact occurred were all fairly good, and the vehicle's speed does not look high. Some viewers analyzed the video in detail: the straight-line distance from where the vehicle comes into view to the point of impact is about 180 meters, covered in about 12 seconds, for an average speed of 55 km/h. About 2 seconds before impact the vehicle began to drift out of its lane, striking the divider at nearly 60 km/h. According to rescuers, Lin's seat belt was already unbuckled when he was pulled out, but whether it was unbuckled before or after the accident has not been confirmed.

The cause of the accident is still under investigation. The front half of the vehicle was burned out. Lin is currently fully conscious, but the impact to his head has caused partial memory loss, so the full course of events may be difficult to reconstruct. There are two main speculations about the cause of the accident:

1

Some believe the vehicle's system was at fault: with driver assistance engaged, Tesla's Autopilot likely failed to recognize the V-shaped gore area and ramp entrance ahead, leading to the collision. A similar accident occurred in March 2018: a Model X driven by Apple engineer Huang Weilun, with Autopilot engaged, struck a ramp barrier on a California highway; the vehicle caught fire and Huang was killed.

2

Others believe it was human error, because Lin's seat belt was already unbuckled when he was rescued. In a Tesla, unbuckling the seat belt forces the automatic driver-assistance system to disengage and hands control back to the driver. Two seconds of lane departure should have been enough time for manual intervention, but the vehicle's trajectory shows that no corrective action was taken, so the driver was most likely distracted.

The Model X in this accident is equipped with an L2-level driver-assistance system. At L2, the system supports the driver by handling steering and acceleration/deceleration based on the driving environment, but the driver must pay close attention throughout the drive, observe the surroundings in real time, and be ready to take over the vehicle at any moment. At the L2 stage, the task of detecting and responding to objects and events is shared between the driver and the system. An unclear division of responsibility between driver and system, or the driver's excessive trust in the automation, is the underlying cause of many L2 accidents.

At present, most L2 vehicles rely on pure vision, or on vision as the primary sensor with radar as a supplement; each sensor performs its own detection and the results are fused with vision given the greater weight. If vision detects an obstacle, the vehicle responds regardless of whether the radar confirms it, but not the other way around. Vision has blind spots for objects it has never seen; only long-term accumulation of data gradually improves the robustness of the model, and novel things such as strange shapes or unusual clothing may still cause visual misjudgment. Future models therefore plan to perceive with tighter combinations of radar and vision.

Many domestic vehicles are now at the L2+ stage, and the transition to L3/L4 faces huge challenges. Besides hardware upgrades, algorithms built on that hardware keep evolving, and the Transformer has been studied in depth in this round of algorithm iteration. Tesla's pure-vision approach uses a Transformer to fuse multi-camera features into BEV (bird's-eye-view) space for perception, which has drawn widespread attention to the Transformer. Recently we conducted an in-depth analysis of the Transformer model, thinking through its application scenarios, algorithm principles, operators, and hardware acceleration.

Transformer principle

The Transformer is a neural network model proposed by Google for machine translation. It was originally designed to solve the problem that RNNs (recurrent neural networks), long used in NLP (natural language processing), are difficult to compute in parallel: the Transformer replaces recurrence with attention so that computation can be parallelized.

The Transformer uses the concepts of Q, K, and V in its attention mechanism. Q, K, and V stand for Query, Key, and Value, terms borrowed from information retrieval. For example, when you search for a product on an e-commerce platform, what you type into the search box is the Query; the engine matches it against Keys (product attributes), and returns Values (products) according to the similarity between Query and Key. Q, K, and V in self-attention play a similar role. In matrix terms, the dot product is one way to measure the similarity of two matrices: Q and K are multiplied to compute similarities, the similarities are normalized into weights, and V is summed with those weights; the weight on each Value is the similarity between its Key and the Query. After a self-attention layer, connections are established among all positions of the input, and all of this can be done by parallel computation (matrix operations). However, matrix computation by itself knows nothing about position (it has no notion of order), so the Transformer adds positional encoding, feeding position and order information into the network together with the input, preserving the spatial layout of images and the temporal order of audio. The output of self-attention then passes through normalization and a feed-forward network to complete one encoder layer; stacking multiple encoder layers completes the encoding and builds stronger connections within the input.
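The Q/K/V computation described above can be sketched in a few lines of NumPy (the shapes and data here are illustrative, not from any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Dot product measures Q-K similarity; softmax turns similarities into
    weights; V is summed with those weights."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V, weights

# self-attention over a toy sequence of 4 tokens of dimension 8 (Q = K = V source)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(X, X, X)       # out: (4, 8)
```

Because the whole computation is a pair of matrix products plus a row-wise softmax, every token's output can be computed at once, which is exactly the parallelism the text contrasts with RNNs.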

The Transformer decoder likewise contains multiple decoder layers. The biggest difference between a decoder layer and an encoder layer is that the decoder layer uses cross-attention: the Query comes from the decoder, while Key and Value come from the encoder. The encoder's output size matches that of the original signal, whereas the decoder's output size is customized to the task and is independent of the input size. For example, in autonomous-driving image processing the encoder output may be an image feature pyramid of the input size (240x135x256), while the decoder output can be a feature map of a custom bird's-eye-view (BEV) size, or a custom set of detection boxes (100x3x256). This is the magic of the Transformer: the network converts objects from one feature space to another without you having to specify how the two spaces correspond; the Transformer completes the mapping automatically.
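The size relationship described above (the decoder output length is set by the queries, not by the encoder sequence) can be checked with a small sketch; the 240x135 feature size, 256 channels, and 100 queries follow the figures in the text, while the data itself is random:

```python
import numpy as np

def attention(Q, K, V):
    """Minimal scaled dot-product attention (same as the encoder sketch)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d = 256
rng = np.random.default_rng(1)
memory = rng.normal(size=(240 * 135, d))   # encoder output: flattened 240x135 map
queries = rng.normal(size=(100, d))        # 100 learned decoder queries
decoded = attention(queries, memory, memory)
# decoded has 100 rows: the output length is set by the queries,
# independent of the 32,400-token encoder sequence
```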

Transformer model development in computer vision

BERT, built on the Transformer, achieved large improvements across 11 NLP tasks and was the most exciting news in deep learning in 2018. In 2020, a Google team proposed applying the Transformer to image classification. Some regard the CNN as a subset of the Transformer: the Transformer applies global attention over the image, while the CNN attends only locally to surrounding pixels, and the Transformer's global attention includes the CNN's local attention as a special case (with the weights of all pixels outside the neighborhood set to 0). The Transformer has the following key models in the image domain:
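The "special case" claim above can be illustrated with attention masks over a toy 5x5 image: zeroing all weights outside each pixel's 3x3 neighborhood turns global attention into a CNN-like local receptive field.

```python
import numpy as np

H = W = 5
coords = np.array([(i, j) for i in range(H) for j in range(W)])
# Chebyshev distance between every pair of pixels
d = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=-1)

global_mask = np.ones((H * W, H * W))     # every pixel may attend to every pixel
local_mask = (d <= 1).astype(float)       # keep only the 3x3 neighborhood
# local_mask is global_mask with non-neighbor entries set to 0
```

Multiplying attention weights by `local_mask` before normalizing would recover the locality of a 3x3 convolution, which is the sense in which local attention is a restricted form of global attention.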

1

ViT. ViT splits the image into multiple 16x16 patches to form a sequence of vectors (256 of them), converting the image's two-dimensional structure into a one-dimensional sequence that is fed into a Transformer encoder, completing the image-classification task. ViT thus obtains classification results with a Transformer alone.
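The patch-splitting step can be sketched as follows; the 256x256 image size (which yields the 256 patches mentioned above) and the 512-dimensional embedding are illustrative, and the projection weights would normally be learned:

```python
import numpy as np

P = 16                                        # patch size from the text
img = np.arange(256 * 256 * 3, dtype=float).reshape(256, 256, 3)
n = 256 // P                                  # 16 patches per side -> 256 total

# carve the image into non-overlapping P x P patches and flatten each one
patches = (img.reshape(n, P, n, P, 3)
              .swapaxes(1, 2)
              .reshape(n * n, P * P * 3))     # (256, 768): 2-D image -> 1-D sequence

W_embed = np.zeros((P * P * 3, 512))          # linear projection to the model width
tokens = patches @ W_embed                    # sequence fed to the Transformer encoder
```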

2

DETR. DETR first uses a CNN to extract image features, then treats each pixel of the feature map (along the channel direction) as a vector, producing an HxW sequence of vectors; the image's two-dimensional structure is thus converted into a one-dimensional sequence and fed into the Transformer for encoding, completing the object-detection task. Combining CNN and Transformer, DETR was the first fully end-to-end object detector. However, because its attention mechanism is pixel-level, the amount of computation is very large, giving it unacceptable computational complexity in practical applications.
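A rough count shows why pixel-level attention is so expensive: the similarity matrix alone has (H×W)² entries, so the cost grows with the square of the number of feature-map pixels.

```python
# every pixel attends to every other pixel, so the score matrix alone
# has (H*W)^2 entries (feature-map sizes here are illustrative)
def full_attention_scores(h, w):
    n = h * w              # sequence length = number of feature-map pixels
    return n * n

small = full_attention_scores(32, 32)     # about a million scores
large = full_attention_scores(128, 128)   # 256x more for a map only 16x larger
```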

3

Swin Transformer. Swin Transformer proposed a window-based approach that restricts pixel-level attention to local windows, and adopts a CNN-like hierarchical structure in which the windows are positioned differently from layer to layer to strengthen connections between pixels across window boundaries. This structure allows Swin Transformer to replace CNNs as the backbone network for extracting image features in visual tasks.
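The window partition that makes this efficient can be sketched as follows (the 56x56x96 feature size and 7x7 window are illustrative stage-1 figures; attention would then run independently inside each window):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    return (x.reshape(H // ws, ws, W // ws, ws, C)
             .swapaxes(1, 2)
             .reshape(-1, ws * ws, C))

feat = np.zeros((56, 56, 96))
wins = window_partition(feat, 7)   # 64 windows of 49 tokens each
```

Attention over 49 tokens per window is far cheaper than attention over all 3,136 pixels at once, which is the point of restricting the mechanism to local windows.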

4

Deformable DETR. Deformable DETR uses deformable attention, an idea derived from deformable convolution. It keeps DETR's overall design; the difference is that attention is computed only over a small fixed number (4) of sampling points in the neighborhood of each reference point on the feature map. Thanks to this sparse attention pattern, the amount of computation drops significantly. The deformable attention mechanism lets the model automatically learn which regions (pixels) to attend to, and it is now widely used in multi-camera BEV feature fusion, multi-level feature-pyramid fusion, and multi-sensor feature fusion.

Transformer applications in the evolution of autonomous-driving algorithms
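The sparse sampling at the heart of deformable attention, which several of the models below reuse, can be sketched as follows. The offsets and attention weights are random stand-ins for learned parameters, and nearest-neighbour lookup replaces the real module's bilinear interpolation:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C, K = 32, 32, 64, 4                     # feature map; K = 4 sampling points
feat = rng.normal(size=(H, W, C))

ref = np.array([16.0, 16.0])                   # one query's reference point
offsets = rng.normal(scale=2.0, size=(K, 2))   # stand-in for learned offsets
attn_w = np.full(K, 1.0 / K)                   # stand-in for learned weights

# gather only K positions instead of all H*W pixels
pts = np.clip(np.rint(ref + offsets).astype(int), 0, [H - 1, W - 1])
sampled = feat[pts[:, 0], pts[:, 1]]           # (K, C) sampled features
out = attn_w @ sampled                         # (C,) weighted sum for this query
```

Each query touches only 4 positions rather than the full 1,024-pixel map, which is where the large reduction in computation comes from.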

The application of the Transformer in autonomous driving has become a trend. Beyond the widely used BEV-based fusion, the Transformer can also replace CNNs and RNNs for image feature extraction, feature-pyramid fusion, and temporal feature extraction. Representative recent Transformer-based work at home and abroad includes:

1

BEVSegFormer. After a ResNet extracts feature maps, a Deformable-DETR-like structure encodes each camera's multi-scale features, then decodes them into the BEV feature space to obtain a BEV feature map, from which BEV semantic-segmentation results are output.

2

BEVFormer. After ResNet feature extraction, it also adopts a Deformable-DETR-like structure. It first uses temporal self-attention to query the previous timestamp and fuse information from historical moments, then uses spatial cross-attention to project the multiple cameras into the BEV feature space and fuse multi-camera information, obtaining BEV features; corresponding heads then complete 3D object detection and semantic segmentation.

3

DETR3D. It uses ResNet to extract features, fuses them with an FPN to obtain multi-level pyramid features, and feeds these into a Deformable-DETR-style decoder. Unlike BEVFormer and BEVSegFormer, DETR3D's decoder does not take or produce a BEV feature map; instead its object queries output 9-dimensional detection boxes in the BEV view (position 3, size 3, orientation 2, velocity 1). DETR3D's decoder self-attention module lets object queries interact so that multiple queries do not converge on the same object. At the same time, each query's 3D reference point is projected into the camera images by a specific projection calculation to gather projected features; the projected features and the self-attention output then serve as inputs to cross-attention, which updates the detection box and category of each object query.

4

FUTR3D. FUTR3D adopts a DETR3D-like structure; the difference is that FUTR3D fuses multi-sensor information from cameras, lidar, and millimeter-wave radar. It uses ResNet to extract camera features and PointPillars for lidar feature extraction, while the millimeter-wave radar point cloud is used directly. During feature projection, the radar uses the three-dimensional reference points under BEV directly, while the camera uses a DETR3D-like projection. The structure is otherwise the same as DETR3D, and the output is likewise 3D detection boxes and categories.

5

MUTR3D. MUTR3D is an end-to-end object-tracking framework. It uses ResNet and an FPN to extract features and fuse multi-scale features. MUTR3D then maintains two kinds of queries per frame: newborn queries obtain 3D detection boxes in a DETR3D-like way, while old queries carry targets successfully detected or tracked in previous frames. Old queries are responsible for tracking previously seen targets in the current frame (an identity is assigned at the first successful detection), while newborn queries are responsible for detecting new targets in the current frame (those matching no old query). MUTR3D uses a DETR3D-like Transformer structure to complete joint 3D object detection and tracking.

In addition, VectorMapNet applies the Transformer to high-definition map learning; BEVerse and BEVDet use Swin Transformer for multi-camera feature extraction; and PETR improves on DETR3D by projecting camera images into 3D frustum space and then obtaining 3D detections and categories in a DETR3D-like way, also using Swin Transformer for feature extraction.

Transformer requirements for hardware acceleration

From the recent evolution of these algorithms, the Transformer models actually deployed in autonomous driving are concentrated in two uses: Swin Transformer for image feature extraction, and Deformable-DETR-style attention for multi-scale, multi-camera, and multi-sensor feature fusion. Dismantling the structure of these two models and comparing them with CNN networks, the following hardware-acceleration requirements emerge:

1. Large matrix multiplications

2. Many 1x1 convolution computations, in contrast to the 3x3 convolutions that dominate CNN networks

3. Sin/cos computation introduced by positional encoding

4. In Swin Transformer, the effect of the different windows at each layer on data tiling and on-chip storage requirements

5. In the deformable attention mechanism, the data tiling and on-chip storage demands of gathering features at multiple sampled positions

6. Deployment of large models on the acceleration unit (before quantization, Transformer models run to hundreds of MB, even GB)
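Item 3 refers to the sin/cos terms of the classic positional encoding; a minimal sketch (sequence length and model width are illustrative):

```python
import numpy as np

def sinusoidal_encoding(n_pos, d_model):
    """Classic sin/cos positional encoding: each position gets one sine and one
    cosine value per frequency, interleaved along the model dimension."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels: sine
    pe[:, 1::2] = np.cos(angles)   # odd channels: cosine
    return pe

pe = sinusoidal_encoding(100, 256)
```

Every input token requires d_model/2 sine and d_model/2 cosine evaluations, which is why transcendental-function throughput appears in the list above alongside matrix multiplication.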

The Transformer has already been deployed on GPUs, but power consumption is high. To achieve lower power consumption, we carried out a new NPU architecture design that accelerates both CNN and Transformer modules at the same time.

Source of this article: Lujiazui Financial Network

- END -
