Service Assurance for Intent-based Network Architecture

Network Service YANG Modules [RFC8199] describe the configuration, state data, operations, and notifications of abstract representations of services implemented on one or multiple network elements.

Quoting RFC8199: “Network Service YANG Modules describe the characteristics of a service, as agreed upon with consumers of that service. That is, a service module does not expose the detailed configuration parameters of all participating network elements and features but describes an abstract model that allows instances of the service to be decomposed into instance data according to the Network Element YANG Modules of the participating network elements. The service-to-element decomposition is a separate process; the details depend on how the network operator chooses to realize the service. For the purpose of this document, the term “orchestrator” is used to describe a system implementing such a process.”

In other words, orchestrators deploy Network Service YANG Modules through the configuration of Network Element YANG Modules. Network configuration is based on those YANG data models, with protocol/encoding combinations such as NETCONF/XML [RFC6241], RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf [openconfig], etc. Since knowing that a configuration is applied doesn’t imply that the service is running correctly (for example, the service might be degraded because of a failure in the network), the network operator must monitor the service operational data at the same time as the configuration. The industry has been standardizing on telemetry to push network element performance information.
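To make that first step concrete, here is a minimal sketch (my own, not taken from the drafts) of an orchestrator pushing one element-level change derived from a service intent, over NETCONF with the ncclient Python library. The device name, credentials, and interface values are made up for the example, and a candidate datastore is assumed.

# Minimal sketch: one element-level YANG change derived from a service intent,
# pushed over NETCONF with ncclient.  Host, credentials, and the interface
# values are illustrative assumptions.
from ncclient import manager

EDIT = """
<config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
    <interface>
      <name>GigabitEthernet0/0/0/1</name>
      <description>l3vpn-blue attachment circuit</description>
      <enabled>true</enabled>
    </interface>
  </interfaces>
</config>
"""

with manager.connect(host="pe1.example.net", port=830,
                     username="admin", password="admin",
                     hostkey_verify=False) as m:
    m.edit_config(target="candidate", config=EDIT)  # element-level YANG config
    m.commit()                                      # apply on the device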

A network administrator needs to monitor her network and services as a whole, independently of the use cases or the management protocols. With different protocols come different data models, and different ways to model the same type of information. When network administrators deal with multiple protocols, the network management system must perform the difficult and time-consuming job of mapping data models: the model used for configuration against the model used for monitoring. This problem is compounded by a large, disparate set of data sources (MIB modules, YANG models [RFC7950], IPFIX information elements [RFC7011], syslog plain text [RFC3164], TACACS+ [RFC8907], RADIUS [RFC2138], etc.). In order to avoid this data model mapping, the industry converged on model-driven telemetry to stream the service operational data, reusing the YANG models used for configuration. Model-driven telemetry greatly facilitates the notion of closed-loop automation whereby events from the network drive remediation changes back into the network.
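A tiny, purely conceptual sketch of why that matters: the telemetry subscription can reuse the very YANG path that was just configured, so there is no model-to-model mapping to maintain. The TelemetryClient class below is a hypothetical stand-in for a gNMI or YANG-Push (RFC 8641) subscriber; its interface is invented for illustration.

# Conceptual sketch only: config and telemetry share the same YANG path, so no
# MIB-to-YANG (or syslog-to-YANG) mapping is needed.  TelemetryClient is a
# hypothetical placeholder for a real gNMI or YANG-Push session.
class TelemetryClient:
    """Placeholder that would wrap a real gNMI or YANG-Push subscription."""
    def __init__(self, host):
        self.host = host
    def subscribe(self, path, period_s, callback):
        print(f"would subscribe to {path} on {self.host} every {period_s}s")

CONFIG_PATH = "/ietf-interfaces:interfaces/interface[name='GigabitEthernet0/0/0/1']"
OPER_PATH = CONFIG_PATH + "/statistics"   # operational data, same YANG model

def on_update(counters):
    # counters would be a dict of leafs from the ietf-interfaces statistics container
    if counters.get("in-errors", 0) > 0:
        print("symptom on the attachment circuit:", counters)

TelemetryClient("pe1.example.net").subscribe(OPER_PATH, period_s=10, callback=on_update)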

However, it proves difficult for network operators to correlate a service degradation with its network root cause. For example, why does my L3VPN fail to connect? Why is this specific service slow? The reverse question, i.e. which services are impacted when a network component fails or degrades, is even more interesting for operators. For example, which service(s) are impacted when the received power (dBm) on this specific optic begins to degrade? Which application is impacted by this ECMP imbalance? Is that issue actually impacting any other customers? Intent-based approaches are often declarative, starting from a statement such as “The service works correctly” and trying to enforce it. Such approaches are mainly suited for greenfield deployments.

Instead of approaching intent in a declarative way, this framework focuses on already defined services and tries to infer the meaning of “The service works correctly”. To do so, the framework works from an assurance tree, deduced from the service definition and from the network configuration. This assurance tree is decomposed into components, which are then assured independently. The root of the assurance tree represents the service to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well.
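A small sketch of the idea (mine, not the drafts’): a service instance at the root, the components it depends on below it, and a health score that propagates up the tree. The aggregation rule here (take the minimum) and the component names are assumptions for illustration only.

# Illustrative assurance tree: the root is the service instance, children are the
# components it depends on.  The "minimum of the scores" aggregation is an
# assumption made for this example, not the rule defined in the SAIN drafts.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    name: str
    score: int = 100                                  # 100 = healthy, 0 = broken
    symptoms: List[str] = field(default_factory=list)
    dependencies: List["Component"] = field(default_factory=list)

    def health(self) -> int:
        """Aggregate health of this component and everything below it."""
        return min([self.score] + [d.health() for d in self.dependencies])

# "l3vpn-blue" depends on two PE devices; each PE depends on an interface.
if1 = Component("pe1:GigabitEthernet0/0/0/1", score=30,
                symptoms=["rx power degraded (dBm)"])
pe1 = Component("pe1", dependencies=[if1])
pe2 = Component("pe2", dependencies=[Component("pe2:GigabitEthernet0/0/0/3")])
l3vpn_blue = Component("l3vpn-blue", dependencies=[pe1, pe2])

print(l3vpn_blue.health())   # 30 -> the service is degraded, and the tree says why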

When a service is degraded, the framework will highlight where in the assurance tree to look, as opposed to going hop by hop to troubleshoot the issue. Not only can this framework help to correlate service degradation with network root causes/symptoms, it can also deduce from the assurance tree the number and type of services impacted by a component degradation/failure. This added value informs the operational team where to focus its attention for maximum return.
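Continuing the same hypothetical sketch, that impact analysis is simply the reverse walk over the assurance trees of all known service instances:

# Continuing the sketch above: given a degraded component, walk the assurance
# trees of all known services and report which ones depend on it.
def impacted_services(services, component_name):
    def depends_on(node, name):
        return node.name == name or any(depends_on(d, name) for d in node.dependencies)
    return [s.name for s in services if depends_on(s, component_name)]

# Which services should the operations team worry about when this optic degrades?
print(impacted_services([l3vpn_blue], "pe1:GigabitEthernet0/0/0/1"))
# -> ['l3vpn-blue']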

For the rest of the story, refer to these two IETF drafts:

… and a typical IETF ASCII-art architecture, to get you interested.

+-----------------+
| Service         |
| Configuration   |<--------------------+
| Orchestrator    |                     |
+-----------------+                     |
   |          |                         |
   |          | Network                 |
   |          | Service                 | Feedback
   |          | Instance                | Loop
   |          | Configuration           |
   |          |                         |
   |          V                         |
   |  +-----------------+    +-------------------+
   |  | SAIN            |    | SAIN              |
   |  | Orchestrator    |    | Collector         |
   |  +-----------------+    +-------------------+
   |          |                        ^
   |          | Configuration          | Health Status
   |          | (assurance graph)      | (Score + Symptoms)
   |          V                        |  Streamed
   |   +-------------------+           |  via Telemetry
   |   |+-------------------+          |
   |   ||+-------------------+         |
   |   +|| SAIN              |---------+
   |    +| agent             |
   |     +-------------------+
   |           ^   ^   ^
   |           |   |   |
   |           |   |   |   Metric Collection
   V           V   V   V
+-------------------------------------------------------------+
|                      Monitored Entities                      |
|                                                               |
+-------------------------------------------------------------+
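To give a flavour of the “Health Status (Score + Symptoms)” arrow, here is what one update streamed from a SAIN agent to the collector could look like. The field names are my own illustration; the real structure is the YANG module defined in the drafts.

# Purely illustrative: one health-status update as a SAIN agent might stream it
# to the collector.  Field names are assumptions for this sketch; the actual
# structure is the YANG module defined in the SAIN drafts.
import json

health_update = {
    "service": "l3vpn-blue",
    "component": "pe1:GigabitEthernet0/0/0/1",
    "health-score": 30,
    "symptoms": [
        {"id": "rx-power-degraded",
         "description": "received power below threshold (dBm)"},
    ],
}
print(json.dumps(health_update, indent=2))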

A quick insane sanity quiz: what does SAIN stand for? 🙂
