Regarding this communication failure, I want to say a few more words ...

Author: Fresh jujube classroom  Time: 2022.07.05

Over the past few days, everyone has been following the large-scale communication failure at the Japanese telecom operator KDDI.

The failure had a major impact, affecting a total of 39.15 million users in Japan. It also lasted a long time: service took almost two days to basically recover.

Many public accounts have already written about the specific cause of the failure, so I will not repeat that analysis.

In today's article I want to widen the topic and discuss it in depth: it is 2022, so why do our communication networks still fail so often, and is there an ultimate solution?

█ Communication failure: a game that lasted a century

Failure is a natural attribute of communication networks. Just as people get sick, failures have accompanied communication networks since their birth. Put another way, we created the communication network in the very process of solving failures.

After solving countless faults, the phone was invented

For more than a hundred years, countless communications professionals have fought an unremitting battle against faults. They have worked hard to develop technologies and tried every available means to deal with communication failures.

From a macro perspective, the fight has paid off. With the steady accumulation of experience and continuous technological progress, the probability of communication network failure keeps falling.

Young readers may not know that more than 20 years ago, a landline call failing to connect (not many families had a phone back then) was as common as a water or power cut. More than 10 years ago, a mobile call failing to connect was also common.

In the past ten years, these phenomena have become increasingly rare. When they do happen, everyone finds it strange. When the network goes down, many people's first reaction is that their phone is broken or their account is in arrears, so they hurry to restart or top up. Isn't that so?

In today's information society, the communication network is important infrastructure, just like water and electricity. Our work and life, and the operation of every industry, cannot do without it.

Under this premise, communication operators, as state-owned enterprises and as maintainers of the network, always put the security and stability of the network first.

To ensure network stability, the Ministry of Industry and Information Technology sets strict assessment indicators for operators. If a network failure occurs in a particular province or city, the leaders there are held responsible and their careers are at risk.

The pressure on operators is passed on to their employees, and also to equipment vendors and outsourcing vendors.

Market competition is fierce now. Once there is an accident, the result is either huge compensation or the loss of market share in that province. Neither equipment vendors nor outsourcers can bear such a loss.

Therefore, the communication industry as a whole certainly pays enough attention to the security and stability of the network. The key lies in capability and execution.

█ Where is the weakness of the communication network?

First, let me explain how the security levels of communication networks are defined.

Depending on the scenario, the security of a communication network is divided into levels. From low to high, they are home level, enterprise level, and telecom level.

Safety level of communication system

Devices such as the routers we use at home belong to the home level. Their safety and reliability are low. They can break at any time and easily cause network interruptions.

The enterprise level covers the network equipment used in organizations. Depending on the size of the network and the number of users, enterprise-level equipment has higher security and reliability and is less prone to service interruption.

Telecom-level requirements are higher still. The networks of operators such as China Mobile, China Telecom, and China Unicom serve hundreds of millions of users, so failures are simply not allowed. Generally speaking, telecom-level reliability must meet a standard of several nines of availability or more (carrier grade is conventionally taken to mean five nines, 99.999%).
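To make "nines" concrete, here is a minimal sketch that converts a number of nines into the downtime it allows per year. The function names are illustrative, and the calculation assumes a 365-day year and that availability is measured simply as uptime divided by total time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(nines: int) -> float:
    """Availability as a fraction for a given number of nines, e.g. 3 -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum downtime per year that the given number of nines allows."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in range(2, 7):
    print(f"{n} nines: {availability(n):.4%} available, "
          f"{downtime_minutes_per_year(n):.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves only a little over five minutes of downtime per year, which shows how far an outage lasting almost two days exceeds a telecom-level budget.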

The communication network Xiao Zaojun is talking about today is the operators' public communication network, which includes both cellular mobile communication networks and fixed broadband networks. Both are telecom level.

The architectures of cellular mobile networks and fixed broadband networks are actually similar. The main difference lies in the access network.

A cellular mobile network uses a wireless access network, and the access device is the base station. A fixed broadband network uses a wired access network, and the access devices are PON (passive optical network) devices, including optical modems.

Let us take the cellular mobile communication network as the example for analysis.

Public communication networks serve hundreds of millions of users. They therefore usually adopt a pyramid-shaped hierarchical architecture: the core network is the center, the transmission network (bearer network) is the trunk, and the access network is the limbs.

As you can see at a glance, the biggest weaknesses of this architecture are the core network and the transmission network (especially the backbone network).

The core network is the management center, the heart and brain of the network. If it goes down, the entire network goes down. That is why the core network engineer (as I once was) holds the riskiest and most stressful position.

Core network machine room

The transmission network (bearer network) is the blood vessels and nerves of the communication network. A fault at the periphery is tolerable, because it affects a small area at most. But what if the vessels of the heart or brain break? That also means complete paralysis.

Optical transmission equipment

This KDDI failure, the DOCOMO failure in October 2021, the failure of the four major British operators in 2020, and the CenturyLink failure in the United States in 2020 were all related to core routers. To put it plainly, the heart and the cerebral blood vessels had a problem, and the whole person (the network) was paralyzed.

In contrast, the probability of a major problem in the access network is very low. An individual base station going out of service affects at most hundreds or thousands of users. The scope is small and the complaints are manageable.

Base station equipment

If a large-scale failure does occur in the access network, it is most likely caused by a device software version problem or a hardware batch problem. The probability of this is extremely low.

█ What did the communicator do in order to prevent faults?

So what methods do we use to keep the communication network running safely and smoothly and to prevent faults?

First, improvements to top-level architecture design.

The architecture of the network is the foundation of network security. A good architecture must consider performance, capacity, and cost as well as safety and redundancy.

Please remember this: a complex product, however it is designed and however much material goes into it, can always fail. It is only a question of how likely and how soon.

Since faults may occur, rather than trying to guard against them at all costs, it is better to plan for what to do when a failure happens.

Therefore, introducing a backup mechanism is the most effective means of dealing with faults.

Backup mechanism

Everyone has studied probability and statistics. If the probability of one device failing is 1%, then the probability of two devices failing at the same time (assuming they fail independently) is 1% × 1% = 0.01%. Right?
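The arithmetic above can be checked with a short sketch. The first function assumes device failures are fully independent. The second adds a hypothetical shared cause (for example a common software defect or a shared power feed) with probability `p_common`, a parameter introduced here only for illustration, to show why redundancy helps less when devices can fail together.

```python
def prob_all_fail(p_single: float, n_devices: int) -> float:
    """Probability that all n devices fail, assuming independent failures."""
    return p_single ** n_devices


def prob_all_fail_with_common_cause(
    p_single: float, n_devices: int, p_common: float
) -> float:
    """Same, plus a shared cause that takes every device down at once.

    Either the shared cause strikes (probability p_common), or it does not
    and all devices happen to fail independently.
    """
    return p_common + (1 - p_common) * p_single ** n_devices


print(f"one device:            {prob_all_fail(0.01, 1):.4%}")
print(f"two devices:           {prob_all_fail(0.01, 2):.4%}")
print(f"two devices + shared:  "
      f"{prob_all_fail_with_common_cause(0.01, 2, 0.001):.4%}")
```

With a shared-cause probability of only 0.1%, the combined failure probability is dominated by the shared cause rather than by the independent term.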

To maximize security, the network architecture is designed with pool (POOL) networking.

Several devices together form a pool, and each handles part of the traffic. If one fails, the others immediately take over so that service is not affected.
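The pool idea can be sketched in a few lines. This is a toy model, not how any particular vendor's equipment works: the `Pool` class, the node names, and the hash-based selection are all assumptions made for illustration.

```python
import zlib


class Pool:
    """A group of equivalent devices that share the load."""

    def __init__(self, members):
        # Every member starts out healthy.
        self.healthy = {member: True for member in members}

    def mark_failed(self, member: str) -> None:
        self.healthy[member] = False

    def mark_recovered(self, member: str) -> None:
        self.healthy[member] = True

    def select(self, user_id: str) -> str:
        """Pick a healthy member for this user; failed members are skipped."""
        alive = sorted(m for m, ok in self.healthy.items() if ok)
        if not alive:
            raise RuntimeError("no healthy member left in the pool")
        # A stable hash spreads users evenly across the healthy members.
        return alive[zlib.crc32(user_id.encode()) % len(alive)]


pool = Pool(["node-a", "node-b", "node-c"])
serving = pool.select("user-1")
pool.mark_failed(serving)          # the serving device breaks
takeover = pool.select("user-1")   # another pool member takes over
print(serving, "->", takeover)
```

Note that this simple modulo scheme also moves some unaffected users when the pool size changes. It is meant only to show the takeover behavior described above.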

There are usually two or more sets of core equipment, located in different areas of the provincial capital and physically far apart.

In addition, in network architecture design, important network elements are usually placed in higher-level core equipment rooms.

Core equipment room

For example, the most important element in the mobile communication network is the one responsible for storing and managing user data (the former HLR, which holds each user's mobile phone number, authentication data, service information, and so on). It is placed in the core equipment room. In addition, maintenance staff regularly make separate backups of the data.

In recent years, with geological disasters as well as war or terrorist attacks in mind, operators have even begun to keep backups in other provinces.

For example, during last year's flood in Zhengzhou, the core equipment room was flooded and the HLR went out of service. An HLR in another provincial capital was urgently brought into use to restore service temporarily.

Different levels of disaster recovery

The second method is the low-level active/standby mechanism.

What we just described is the redundancy mechanism in top-level design. Down at the level of equipment rooms, racks, boards, and cables, there are also active/standby designs, which can be called the low-level active/standby mechanism.

If you have been in an equipment room, you will have noticed that the subracks in the cabinets are filled with various boards, and that these boards basically appear in pairs.

Front view of one vendor's 3G equipment

In other words, there are usually two boards of each type.

The same goes for network cables and optical fibers. You will hardly ever see a single cable; they come in pairs.

Front view of one vendor's 4G equipment

The reason is mutual backup. If one board fails, the other can keep working so that service is not affected. At the same time, the system raises an alarm to remind staff to replace the failed board as soon as possible.
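The switchover-plus-alarm behavior can be sketched as follows. This is a simplified model under the assumption of exactly one active and one standby board; the class name, board names, and alarm text are hypothetical.

```python
class RedundantPair:
    """Two identical boards: one carries traffic, the other is a hot standby."""

    def __init__(self, active: str, standby: str) -> None:
        self.active = active
        self.standby = standby
        self.alarms = []  # messages a maintenance engineer would see

    def report_failure(self, board: str) -> None:
        if board == self.active:
            if self.standby is None:
                raise RuntimeError("both boards failed: service is interrupted")
            # Switch over: the standby board takes the traffic.
            self.active, self.standby = self.standby, None
        elif board == self.standby:
            # Service continues, but the pair has lost its protection.
            self.standby = None
        else:
            raise ValueError(f"unknown board: {board}")
        self.alarms.append(f"ALARM: {board} failed, replace it soon")


pair = RedundantPair("board-A", "board-B")
pair.report_failure("board-A")
print("now active:", pair.active)
print(pair.alarms)
```

The point of the alarm is visible in the model: after the first failure the pair keeps working, but a second failure before the board is replaced interrupts service.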

The same is true for power. Every equipment cabinet in a telecom equipment room must have at least two power inputs.

Multiple power inputs (one red plus one blue forms one feed)

In addition to mains power, important equipment rooms also have emergency power equipment such as batteries, UPS units, and generators.

The battery bank of the equipment room

Third, sound management systems and regulations.

Technology is never the only factor that affects network security and stability. The biggest threat to the communication network is actually people, not technology.

Xiao Zaojun believes every communications professional will feel the same way about this.

In management processes and systems, and in engineering technical specifications, we have learned countless lessons.

Why must an upgrade plan be reviewed again and again? Why are engineering specifications so strict? Why build a spare parts warehouse? Why must cutover steps be double-checked, or even triple-checked? Why must someone stay on duty after a major operation? Why is the network frozen during important holidays?

These are all lessons summed up by those who came before us.

Keep in awe at all times

In addition to internal management systems and process standards, the state has also established increasingly strict laws and regulations to punish the deliberate destruction of communication networks, which still occurs often.

For example, cutting fiber during illegal excavation, deliberately destroying base stations, and deliberately cutting fiber are all punishable by law.

Base station feeder cable that was maliciously cut

█ The deeper reasons behind communication failures

With reasonable network architecture design, active/standby mechanisms, and complete systems and specifications in place, why do so many failures still occur?

Next, let me talk about the deeper reasons.

First, the point most people will probably agree on is the cutthroat internal competition ("involution") in the communication industry.

In recent years, vicious competition and lowest-price bidding have prevailed. Equipment vendors and subcontractors must both win orders and stay profitable, so they can only cut costs desperately: product design costs, material costs, construction material costs and, more importantly, salary costs.

The continuous compression of costs inevitably affects product reliability and engineering quality. Wages that are too low drive away many experienced people. To get the construction done, subcontractors can only recruit fresh graduates, who are sent to work on site after simple training (or even none). These people lack the necessary training and practice, their skills and technical ability are insufficient, and they become a major risk.

Some of them are paid very little, and when they are pushed too hard, it is not impossible that one of them simply deletes the database.

In the past few years, to make sure that frontline employees' pay was not skimmed, some vendors even signed contracts with subcontractors that stipulated the income of outsourced employees.

In addition to low-cost competition, another important factor affecting network operation security is growing technological complexity.

The more advanced the technology, the higher the complexity and the lower the reliability. As technology evolves, operators' networks keep growing, networking becomes more and more complicated, and the probability of problems rises significantly.

The tidal effect in communication networks is pronounced. Load at busy and idle times can differ by a factor of ten or even a hundred. If an emergency (a disaster, for example) occurs and call volume surges, the difference may reach a factor of a thousand.

Operators cannot possibly build in a thousandfold redundancy. Therefore, without reasonable bypass design or threshold design, the probability of network congestion is extremely high. (Signaling traffic congestion was a factor in several major faults of recent years.)
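As an illustration of threshold design, the sketch below rejects new requests once the load in the current time window reaches a fixed share of capacity, so that a surge is shed at the edge instead of overloading the node. It is a simplified, hypothetical model: `ThresholdGate`, the capacity figure, and the 80% threshold are assumptions, and real signaling overload control is considerably more involved.

```python
class ThresholdGate:
    """Admit requests until a per-window threshold is reached, then reject."""

    def __init__(self, capacity: int, threshold_percent: int = 80) -> None:
        # Integer arithmetic avoids floating-point rounding surprises.
        self.limit = capacity * threshold_percent // 100
        self.in_window = 0

    def start_window(self) -> None:
        """Call at the start of each time window (for example every second)."""
        self.in_window = 0

    def try_admit(self) -> bool:
        if self.in_window >= self.limit:
            return False  # shed the request to protect the node
        self.in_window += 1
        return True


def simulate(gate: ThresholdGate, offered: int):
    """Offer a number of requests in one window; return (accepted, rejected)."""
    gate.start_window()
    accepted = sum(1 for _ in range(offered) if gate.try_admit())
    return accepted, offered - accepted


gate = ThresholdGate(capacity=1000, threshold_percent=80)
print("normal load:", simulate(gate, 500))
print("surge load: ", simulate(gate, 100_000))
```

During the surge most requests are rejected, but the node keeps serving at its threshold instead of collapsing, which is the purpose of such a design.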

At present, few people can fully understand the complicated networking of an operator. As time passes and personnel turn over, the network becomes even less familiar to those who maintain it.

Communication networks have always been something of a dark art.

The third potential risk to network security, and the one Xiao Zaojun worries about most, is external attack: hackers, viruses, and system vulnerabilities, for example.

Nowadays communication equipment is basically IP-based and cloud-based. Networks are becoming more and more open, and some are deployed directly on public clouds. Physical isolation from the outside world keeps weakening, which makes attacks more likely than before.

Today's attackers are far more skilled than before and their methods are more diverse, which poses a great threat to the network.

Of course, operators and equipment vendors have also invested a lot in defending against network attacks.

Now, all vendors emphasize the concept of "security hardening". As the name suggests, security hardening means closing system vulnerabilities to make the system more robust. Operators use third-party tools, or hire third-party vendors, to run security scans on live network equipment, find vulnerabilities, and then require the equipment vendor to fix and close them.

Everything is for safety

This game of "the defense rises a foot, the attack rises ten" has been going on for a long time.

However, Xiao Zaojun personally believes that the defending side currently has serious problems in both security awareness and technical ability. We will encounter more and more security incidents in the future.

I hope the relevant organizations and departments will not just pay lip service to security, but will really put effort into improving the quality of their staff and strengthening training. Otherwise it will be too late to remedy.

█ Final words

The KDDI failure in Japan is not the first of its kind and will certainly not be the last. Communication network failures are like a game of pass the parcel: no one knows whether they will be next.

Now, vendors have proposed introducing AI and letting artificial intelligence take over the network in order to reduce its failure rate. Some vendors, building on network cloudification, carry out grayscale upgrades (that is, upgrading only part of the network at a time), which can also greatly reduce network risk. These are good trends.
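A grayscale upgrade can be sketched as a batch-by-batch rollout with a health check after each batch. This is a minimal illustration, not any vendor's procedure: the function, the batch sizes, and the `upgrade`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins for real operations and monitoring.

```python
from typing import Callable, List, Sequence


def grayscale_upgrade(
    nodes: Sequence[str],
    batch_sizes: Sequence[int],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Upgrade nodes batch by batch; roll back on the first unhealthy batch.

    Returns the upgraded nodes, or an empty list if the rollout was rolled
    back. Nodes not covered by batch_sizes are left untouched.
    """
    upgraded: List[str] = []
    remaining = list(nodes)
    for size in batch_sizes:
        batch, remaining = remaining[:size], remaining[size:]
        for node in batch:
            upgrade(node)
            upgraded.append(node)
        # Check health after each batch, before widening the rollout.
        if not all(is_healthy(node) for node in upgraded):
            for node in reversed(upgraded):
                rollback(node)
            return []
    return upgraded


# Toy network: ten nodes, all on version "v1".
versions = {f"node-{i}": "v1" for i in range(10)}

result = grayscale_upgrade(
    nodes=list(versions),
    batch_sizes=[1, 3, 6],  # 10% of nodes, then 40%, then 100%
    upgrade=lambda n: versions.__setitem__(n, "v2"),
    is_healthy=lambda n: True,  # stand-in for real monitoring
    rollback=lambda n: versions.__setitem__(n, "v1"),
)
print(len(result), "nodes upgraded")
```

The design choice is that a bad version is caught while it runs on only a small part of the network, so a defect affects a few nodes rather than all of them at once.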

On the road of fighting communication network failures, I think we still have a long way to go. The road ahead is long, and communications professionals will keep searching high and low.

Well, that is all for today's article. Thank you for your patience, and see you next time!

- END -
