Rules-based systems can’t cope when a network generates millions of alarms. Time to consider machine learning algorithms…
Every year, mobile network operators (MNOs) spend nearly a quarter of their revenue on network management and maintenance.
They have little choice.
It goes without saying that network outages annoy customers. This is not just bad for an MNO’s reputation, it is also potentially ruinous for its bottom line.
Why? Because downtime is one of the biggest drivers of churn. In one study (of cable customers), 60 percent of subscribers who had churned cited network performance as their main reason for quitting. Among ‘conditional churners’ – those considering leaving – 75 percent reported network issues.
In addition to lost customers and damaged credibility, network outages can also lead to costly employee overtime and possibly even penalties for not meeting service level agreements (SLAs).
It all explains why one study estimated that network outages cost the world’s MNOs around $15 billion a year.
However, that study was carried out in 2013. Today, the stakes are so much higher.
This is because traditional telecommunications carriers have become Internet service providers (ISPs). Thanks to the widespread adoption of 4G, people and enterprises now depend on mobile networks for their connectivity.
So, a network outage no longer leads merely to dropped phone calls but – potentially – to the failure of critical enterprise services.
Now, as the industry prepares to evolve to standalone 5G, the volume of data signals generated by the network is set to escalate yet again.
It raises the question: how can MNOs predict and fix network faults in such a complex environment filled with so much ‘noise’?
The solution for more forward-looking carriers lies with handing over diagnostics to advanced systems that use machine learning to:
• Accurately analyse millions of network alarms
• Instantaneously identify alarms most likely to lead to a fault
• Discover alarm relationships and alarm families
• Eliminate ‘noisy’ alarms
Before we dive deeper into these ML-based solutions, let’s look at how the majority of today’s network fault detection systems work.
Network diagnostics now: people, rules and alarms
Every day, the world’s MNOs engage in a battle to keep their networks operational. Unfortunately, there is plenty that can go wrong.
In addition to technical faults (physical link failures, traffic congestion/overload, chip failures) there are other, more unpredictable factors – from cyberattacks to thunderstorms.
When something in the system does fail, it shows up as an anomaly in the network data. So MNOs put in place monitoring systems that trigger alarms when these anomalies occur. Network Operations Centers (NOCs) staffed by teams of human analysts then scrutinise this data and try to answer three key questions:
• What happened to the network?
• Why did it happen?
• What will happen next?
The problem with this human-centered approach is obvious. NOC agents can only process a limited amount of information. They have to prioritise their attention. For this reason, they ‘mute’ the majority of the alarms they receive.
In so doing they risk ignoring small problems that might develop into bigger faults and bring the network down.
Chris Neisinger, Chief Technology Officer at network diagnostics specialist Guavus, elaborates on why this ‘old school’ approach is limited.
“Most fault detection systems use rules that were defined and tuned by experts based on their experience,” he says.
“But these systems can only find what they are looking for. And they rely on people to update them. The problem is that humans are unreliable. Engineers write things down and maybe don’t pass them on. Meanwhile, rules get too complex to maintain or they change and then you get false alarms.
“Also, network engineers can’t look at every signal, so they put in silencing features. As a result, they miss things. They look through their logs after an outage, and often find a signal from weeks before that they muted.”
Clearly, these human-centered alarm-based systems are struggling to cope with the volume of signals generated by today’s networks.
However, things are about to get vastly more challenging. Standalone 5G (SA) is coming – and it will drastically increase network complexity yet again.
Standalone 5G: an unprecedented explosion in network data
How much greater is the challenge of analysing the data emerging from standalone 5G networks than their 4G predecessors?
Far greater. Why? Because the ‘standalone’ 5G Core is a new kind of network. It is virtual – with foundational technologies (Network Functions Virtualization and Software-Defined Networking) that turn physical network components into software.
Virtualization will make 5G networks much bigger and able to support millions more connections. MNOs can use this extra capacity to allocate bandwidth to enterprises. Effectively, this ‘network slicing’ gives private companies the ability to run their own mini-networks.
All of which vastly escalates the amount of data generated by the network as a whole.
And to further complicate matters, the vast majority of connected 5G devices will be machines. So when there is an outage, these devices will not be able to report a problem.
Ken Rokoff, VP Head of Product & Strategic Alliances at Guavus, believes many industry insiders radically underestimate the extent of the 5G data deluge coming their way.
He uses a simple analogy to illustrate the depth of this misunderstanding.
“People lack perspective when it comes to 5G,” he says. “They think, well I can already ride a bicycle, so now I am going to ride a motorcycle. But this move from 4G to 5G is more like going from a bicycle to stepping into the cockpit of an Airbus plane.
“The sheer amount of data and telemetry involved is orders of magnitude greater.”
Rokoff believes the exponential increase in complexity will compel carriers to consider a new automated approach to network analytics.
“In the past, MNOs have used rules to solve problems,” he says. “But the truth is, rules-based systems won’t work in a 5G world, where there are millions of elements rather than hundreds.
“In this world, you need systems that can do advanced network analytics. They are no longer a ‘nice to have’. They are mandatory if you are going to efficiently run your network.”
How to mitigate the risks: moving to ML-based fault detection
In the world of network fault analytics, everything comes back to three acronyms:
• MTTA (mean-time-to-acknowledge)
• MTTD (mean-time-to-diagnose)
• MTTR (mean-time-to-resolve)
MNOs have to reduce these numbers if they want to reduce the number of damaging outages.
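To make the three metrics concrete, here is a minimal sketch of how they might be computed from incident records. The field names and timestamps are invented for illustration, and the sketch assumes all three intervals are measured from the moment the alarm was raised; some teams measure them differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (illustrative field names and times only).
incidents = [
    {"raised": "2024-01-01T10:00:00", "acknowledged": "2024-01-01T10:05:00",
     "diagnosed": "2024-01-01T10:35:00", "resolved": "2024-01-01T11:00:00"},
    {"raised": "2024-01-02T14:00:00", "acknowledged": "2024-01-02T14:15:00",
     "diagnosed": "2024-01-02T14:45:00", "resolved": "2024-01-02T16:00:00"},
]

def mean_minutes(records, end_field, start_field="raised"):
    """Average elapsed minutes between start_field and end_field."""
    return mean(
        (datetime.fromisoformat(r[end_field])
         - datetime.fromisoformat(r[start_field])).total_seconds() / 60
        for r in records
    )

# Assumption: every interval starts at the time the alarm was raised.
mtta = mean_minutes(incidents, "acknowledged")
mttd = mean_minutes(incidents, "diagnosed")
mttr = mean_minutes(incidents, "resolved")

print(f"MTTA={mtta:.0f} min, MTTD={mttd:.0f} min, MTTR={mttr:.0f} min")
```

Tracking these averages over time is one simple way to check whether a new diagnostics approach is actually shortening the path from alarm to fix.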
However, we have already established how difficult it is for rules-based systems to handle the sheer volume of alarms generated by today’s mobile networks.
For this reason, many forward-looking MNOs now use ML-based probabilistic algorithms to do the work instead.
These systems carry out the task of monitoring network activity without human intervention. They can manage millions of alarms simultaneously. Over time, they can identify which to act upon and which to ignore.
Here are four ways in which ML-based systems produce better results.
#1. They escalate alarms for predicted incidents
Probabilistic algorithms prioritize alarms that have a high probability of leading to network incidents. Typically, this is just 10 percent of all alarms. Engineers can then resolve these issues without relying on network inventory, topology or static rules.
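As a rough illustration of the idea, the sketch below estimates the probability that each alarm type leads to an incident from historical outcomes, then escalates only the types above a threshold. It is not any vendor's algorithm: the alarm names, the history and the 0.5 threshold are all hypothetical, and a simple smoothed frequency count stands in for a real ML model.

```python
from collections import Counter

# Hypothetical history of (alarm_type, led_to_incident) observations.
history = [
    ("LINK_DOWN", True), ("LINK_DOWN", True), ("LINK_DOWN", False),
    ("HIGH_TEMP", False), ("HIGH_TEMP", False), ("HIGH_TEMP", True),
    ("CFG_CHANGE", False), ("CFG_CHANGE", False), ("CFG_CHANGE", False),
]

def incident_probabilities(records):
    """Estimate P(incident | alarm type) with add-one (Laplace) smoothing."""
    seen, bad = Counter(), Counter()
    for alarm_type, led_to_incident in records:
        seen[alarm_type] += 1
        bad[alarm_type] += int(led_to_incident)
    return {t: (bad[t] + 1) / (seen[t] + 2) for t in seen}

def escalate(alarm_types, probs, threshold=0.5):
    """Return alarm types at or above the threshold, most likely first."""
    unknown = 0.5  # assumed prior for an alarm type with no history
    scored = [(probs.get(t, unknown), t) for t in alarm_types]
    return [t for p, t in sorted(scored, reverse=True) if p >= threshold]

probs = incident_probabilities(history)
escalated = escalate(["CFG_CHANGE", "LINK_DOWN", "HIGH_TEMP"], probs)
print(escalated)
```

The smoothing keeps alarm types with very little history from being scored as certain or impossible. A production system would use far richer features than the alarm type alone.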
#2. They reduce the noise of low-impact alarms
Conversely, ML-based systems learn over time which alarms don’t indicate serious problems. They deprioritize them. They can also suppress any alarm related to scheduled maintenance events.
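The maintenance part of this can be pictured as a simple filter. The sketch below drops any alarm raised on an element while that element is inside a scheduled maintenance window. Element names, alarm types and window times are hypothetical, and this is a plain lookup rather than a learned model.

```python
from datetime import datetime

# Hypothetical maintenance schedule: element -> list of (start, end) windows.
maintenance = {
    "cell-17": [(datetime(2024, 3, 1, 1, 0), datetime(2024, 3, 1, 3, 0))],
}

def in_maintenance(element, when, windows=maintenance):
    """True if the element has a scheduled window covering the given time."""
    return any(start <= when <= end for start, end in windows.get(element, []))

# Hypothetical alarms, one inside the window and two outside it.
alarms = [
    {"element": "cell-17", "time": datetime(2024, 3, 1, 2, 0), "type": "LINK_DOWN"},
    {"element": "cell-17", "time": datetime(2024, 3, 1, 4, 0), "type": "LINK_DOWN"},
    {"element": "cell-42", "time": datetime(2024, 3, 1, 2, 0), "type": "HIGH_TEMP"},
]

# Keep only the alarms that are not explained by planned work.
active = [a for a in alarms if not in_maintenance(a["element"], a["time"])]
print([(a["element"], a["type"]) for a in active])
```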
#3. They consolidate alarms
Sometimes a single event can trigger multiple alarms. ML-based systems can consolidate them into one. This avoids engineers wasting time investigating multiple trails.
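One simple way to picture consolidation is a time-window grouping, sketched below. It merges alarms from the same element when each arrives within 60 seconds of the previous one. The alarm tuples, the element names and the 60-second window are assumptions for illustration; a real ML-based system would learn which alarms belong together rather than rely on a fixed window.

```python
def consolidate(alarms, window_seconds=60):
    """Group alarms per element when each follows the previous one closely.

    alarms: iterable of (timestamp_seconds, element, alarm_type) tuples.
    """
    groups = []      # each group: {"element", "start", "end", "types"}
    open_group = {}  # element -> its most recent group
    for ts, element, alarm_type in sorted(alarms):
        group = open_group.get(element)
        if group is not None and ts - group["end"] <= window_seconds:
            # Close enough to the previous alarm: extend the existing group.
            group["end"] = ts
            group["types"].append(alarm_type)
        else:
            # Too far apart (or first alarm on this element): start a new group.
            group = {"element": element, "start": ts, "end": ts,
                     "types": [alarm_type]}
            groups.append(group)
            open_group[element] = group
    return groups

# Hypothetical alarm stream with timestamps in seconds.
alarms = [
    (0, "router-1", "LINK_DOWN"),
    (5, "router-1", "BGP_SESSION_LOST"),
    (12, "router-1", "PACKET_LOSS"),
    (300, "router-1", "LINK_DOWN"),
    (8, "cell-9", "HIGH_TEMP"),
]
groups = consolidate(alarms)
print(len(groups), "consolidated alarms from", len(alarms), "raw alarms")
```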
#4. They reveal relationships between alarms
Similarly, ML-based systems can gather a set of alarm families together for further root issue analysis.
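A basic way to surface such relationships is to count how often alarm types appear together. The sketch below scores each pair with Jaccard similarity over a set of time windows and keeps the strongly related pairs. The windows, alarm names and 0.6 cut-off are hypothetical, and co-occurrence counting is only a stand-in for whatever technique a commercial system uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical time windows, each holding the alarm types observed in it.
windows = [
    {"LINK_DOWN", "BGP_SESSION_LOST", "PACKET_LOSS"},
    {"LINK_DOWN", "BGP_SESSION_LOST"},
    {"HIGH_TEMP", "FAN_FAILURE"},
    {"HIGH_TEMP", "FAN_FAILURE", "PACKET_LOSS"},
    {"LINK_DOWN", "BGP_SESSION_LOST"},
]

def related_pairs(windows, min_jaccard=0.6):
    """Return alarm-type pairs whose Jaccard similarity meets the cut-off."""
    single, pair = Counter(), Counter()
    for w in windows:
        single.update(w)
        pair.update(combinations(sorted(w), 2))
    result = {}
    for (a, b), both in pair.items():
        # Jaccard = windows containing both / windows containing either.
        jaccard = both / (single[a] + single[b] - both)
        if jaccard >= min_jaccard:
            result[(a, b)] = jaccard
    return result

pairs = related_pairs(windows)
for (a, b), score in sorted(pairs.items()):
    print(f"{a} <-> {b}: {score:.2f}")
```

Pairs that score highly are candidates for the same alarm family and can be handed to engineers for root issue analysis together.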
Of course, the ultimate pay-off of advanced analytics solutions is that they are self-healing – and can even anticipate faults before they occur.
Chris Neisinger says: “These models are continuously training themselves – and they get more accurate over time. So let’s say the operator adds a new cell site. This changes the network architecture, and it means a new model needs to be created.
“An ML-based system will automatically adjust itself. Within days it will be very accurate again. And no person needs to be involved.”