Advancing anomaly detection with AIOps: introducing AiDice

This blog post has been co-authored by Jeffrey He, Product Manager, AIOps Platform and Experiences Team.

In Microsoft Azure, we invest tremendous effort in ensuring our services are reliable by predicting and mitigating failures as quickly as we can. In large-scale cloud systems, however, we may still experience unexpected issues simply because of the massive scale of the system. Given this, using AIOps to continuously monitor health metrics is fundamental to running a cloud system successfully, as we have shared in our earlier posts. First, we shared more about this in Advancing Azure service quality with artificial intelligence: AIOps. We also shared an example deep dive of how we use AIOps to help Azure in the safe deployment space in Advancing safe deployment with AIOps. Today, we share another example, this time about how AI is used in the field of anomaly detection. Specifically, we introduce AiDice, a novel anomaly detection algorithm developed jointly by Microsoft Research and Microsoft Azure that identifies anomalies in large-scale, multi-dimensional time series data. AiDice not only captures incidents quickly, it also provides engineers with important context that helps them diagnose issues more effectively, providing the best experience possible for end customers.

Why is AIOps needed for anomaly detection?

We need AIOps for anomaly detection because the data volume is too large to analyze without AI. In large-scale cloud environments, we monitor an innumerable number of cloud components, and each component logs countless rows of data. In addition, each row of data for any given cloud component might contain dozens of columns, such as the timestamp, the hardware type of the virtual machine, the generation number, the OS version, the datacenter where the nodes hosting the virtual machine reside, or the country. The data we have is essentially multi-dimensional time series data, which contains an exponential number of individual time series because of the various combinations of dimensions. This means that iterating through and monitoring every single time series is simply not practical; applying AIOps is necessary.
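To make the scale concrete, here is a minimal Python sketch that counts how many dimension-value combinations a handful of dimensions can produce. The dimension names echo the columns mentioned above, but the cardinalities and the `count_pivots` helper are assumptions made up for illustration, not real Azure figures or part of AiDice.

```python
from math import prod

# Hypothetical dimension cardinalities, chosen only for illustration;
# they are not real Azure figures.
CARDINALITIES = {
    "HardwareType": 20,
    "Generation": 5,
    "OSVersion": 30,
    "Datacenter": 100,
    "Country": 40,
}


def count_pivots(cardinalities):
    """Count the non-empty dimension-value combinations.

    Each dimension is either left unconstrained or fixed to one of its
    values, giving (cardinality + 1) choices per dimension. Subtracting 1
    drops the combination in which no dimension is fixed at all.
    """
    return prod(c + 1 for c in cardinalities.values()) - 1


if __name__ == "__main__":
    print(count_pivots(CARDINALITIES))
```

Because the count is a product over dimensions, every added dimension multiplies the number of candidate time series, which is why exhaustive monitoring stops being practical with only a few dozen columns.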

How did we approach this before AiDice?

Before AiDice, the way we handled anomaly detection in large-scale, high-dimensional time series data was to run anomaly detection on a selected set of the most important dimensions. By focusing on a scoped subset, we could detect anomalies within those combinations quickly. Once these anomalies were detected, engineers would then dive deeper into the issues, using pivot tables to drill down into the other dimensions not included to better diagnose the issue. Although this approach worked, we saw two key opportunities to improve the process. First, the previous approach required a lot of manual effort by engineers to determine the exact pivot of anomalies. Second, the approach also limited the scope of direct monitoring by only allowing us to input a limited number of dimensions into our anomaly detection algorithms. For these reasons, Microsoft Research and Azure worked together to develop AiDice, which improves both of these areas.

How do we approach this now with AiDice, and how does it work?

Now with AiDice, we can automatically localize pivots on time series data even when looking at dozens of dimensions at the same time. This allows us to add many more attributes, whether that be the hardware generation or hardware microcode, the OS version, or the networking agent version. Although this makes the search space much larger, AiDice encodes the problem as a combinatorial optimization problem, allowing it to search through the space more efficiently than traditional approaches. Brief details of AiDice are described below, but for a full explanation of the algorithm, please see the paper published at ESEC/FSE ’20: the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.

Part 1: AiDice algorithm (formulation as a search problem)

The AiDice algorithm works by first turning the data into a search problem. Search nodes are formed by starting at a given pivot and building the relationships out to the neighbors. For example, if we take a node, “Country=USA, Datacenter=DC1, DiskType=SSD”, we can form the neighboring nodes by swapping, adding, or removing a dimension-value pair, as shown in the diagram below.

This image shows how the search space is formed. On the left is a node graph. On the right is a zoomed in version to a specific set of nodes and arrows labeled with the relationships between the nodes.
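The neighbor relationship described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `neighbors` function, the dictionary representation of a pivot, and the small `DOMAIN` of possible values are all hypothetical, and the real algorithm in the paper may represent and explore the space differently.

```python
# Hypothetical value domain for each dimension, for illustration only.
DOMAIN = {
    "Country": ["USA", "Canada"],
    "Datacenter": ["DC1", "DC2"],
    "DiskType": ["SSD", "HDD"],
    "OSVersion": ["v1", "v2"],
}


def neighbors(pivot, domain):
    """Return the pivots reachable by one swap, add, or remove.

    A pivot is a dict mapping a dimension name to its fixed value.
    An empty dict stands for the overall aggregate with nothing fixed.
    """
    result = []
    for dim, value in pivot.items():
        # Remove: drop this dimension-value pair from the pivot.
        result.append({d: v for d, v in pivot.items() if d != dim})
        # Swap: keep the dimension but change its value.
        for other in domain[dim]:
            if other != value:
                result.append({**pivot, dim: other})
    # Add: fix a value for a dimension the pivot does not constrain yet.
    for dim, values in domain.items():
        if dim not in pivot:
            for value in values:
                result.append({**pivot, dim: value})
    return result


if __name__ == "__main__":
    start = {"Country": "USA", "Datacenter": "DC1", "DiskType": "SSD"}
    for node in neighbors(start, DOMAIN):
        print(node)
```

Generating neighbors lazily like this means a search only has to materialize the part of the space it actually visits, rather than enumerating every combination up front.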

Part 2: AiDice algorithm (objective function)

Next, the AiDice algorithm searches through the search space in a smart way by maximizing an objective function that emphasizes two key components. First, the bigger the sudden burst or change in errors, the higher AiDice scores the objective function. Second, the higher the proportion of the errors that occur in this pivot in relation to the total number of errors, the higher AiDice scores the objective function. For example, if there are 5,000 total errors that occurred, it’s more important to alert the user about the pivot that went from 3,000 errors to 4,000 errors than the pivot that went from 10 to 20 errors.
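As a rough illustration of how these two components can combine, the Python sketch below scores the two pivots from the example. The `toy_score` function is an assumption invented for this post, not the objective function AiDice actually uses (that is defined in the paper); it simply multiplies the absolute increase in errors by the pivot's share of all errors.

```python
def toy_score(before, after, total):
    """Toy stand-in for an objective function (not AiDice's real one).

    before, after: error counts for the pivot before and after the change.
    total: total number of errors across all pivots after the change.
    """
    change = after - before
    if total <= 0 or change <= 0:
        # No errors at all, or no increase: nothing worth alerting on.
        return 0.0
    # Absolute increase weighted by the pivot's share of all errors.
    return change * after / total


if __name__ == "__main__":
    large_pivot = toy_score(before=3000, after=4000, total=5000)
    small_pivot = toy_score(before=10, after=20, total=5000)
    print(large_pivot > small_pivot)
```

Note that the small pivot doubled while the large one grew by only a third, so a score based purely on relative change would rank them the other way around. Weighting by the share of total errors is what keeps the larger pivot on top in this sketch.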

Part 3: Customization of alerts to reduce noise

Next, the alerts that AiDice produces need to be filtered and customized to be less noisy and more actionable, because the results so far are optimized from a mathematical perspective but have not yet incorporated domain knowledge about the meaning of the input data. This step can vary widely depending on the nature of the input data, but one example is that consecutive alerts that share the same error code may be grouped together to reduce the total number of alerts.
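The grouping example above could look something like the following Python sketch. The alert fields (`time`, `error_code`) and the `group_consecutive` helper are hypothetical; real alerts would carry more context, and the right grouping rule depends on the input data.

```python
from itertools import groupby


def group_consecutive(alerts):
    """Merge runs of consecutive alerts that share an error code.

    alerts: list of dicts with "time" and "error_code" keys, already
    sorted by time. Returns one summary dict per run.
    """
    grouped = []
    # groupby only merges adjacent items, which is exactly the
    # "consecutive alerts" behaviour wanted here.
    for code, run in groupby(alerts, key=lambda alert: alert["error_code"]):
        run = list(run)
        grouped.append({
            "error_code": code,
            "first_seen": run[0]["time"],
            "last_seen": run[-1]["time"],
            "count": len(run),
        })
    return grouped


if __name__ == "__main__":
    sample = [
        {"time": 1, "error_code": "E1"},
        {"time": 2, "error_code": "E1"},
        {"time": 3, "error_code": "E2"},
        {"time": 4, "error_code": "E1"},
    ]
    for summary in group_consecutive(sample):
        print(summary)
```

In this sketch the last `E1` alert stays separate from the first two because an `E2` alert sits between them; whether such runs should be merged anyway is the kind of domain decision this step is meant to capture.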

AiDice in action: an example

The following is a real example in which AiDice helped detect an issue early on. The details are altered for confidentiality reasons.

  • We applied AiDice to monitor low memory error events in a certain type of virtual machine, with more than a dozen dimensions of attribute information alongside the fault count, including the region, the datacenter location, the cluster, the build, the RAM, and the event type.
  • AiDice identified an increase in the number of low memory events on distinct nodes in a particular pivot, which indicated a memory leak.

    • Build=11.11111, Ram=00.0, ProviderName=Xxxxx-x-Xxxxxx, EventType=8888 (details have been altered for privacy).

  • When looking at the aggregate trend, this issue is hidden, and without AiDice it would take manual effort to detect the exact location of the issue (see graphs below, data normalized for privacy).
  • The engineer responsible for the ticket looked at the alert and some example cases shown in the alerts, and was quickly able to figure out what was going on.

This image is a line chart of the aggregate trend in the low memory events for a certain type of VM over 17 timestamps. Overall, the trend remains relatively stable over time.

This image is a line chart of the anomaly identified by AiDice in a particular pivot over 17 timestamps. Overall, the trend clearly exhibits an anomaly, starting low then constantly increasing.

In this real-world example, AiDice was able to automatically detect an issue in a dimension combination that was causing a particular error type, quickly and efficiently. Soon after, the memory leak was discovered and Azure engineers were able to mitigate the issue.

Looking forward

Looking ahead, we hope to improve AiDice to make Azure even more resilient and reliable. Specifically, we plan to:

  • Support additional scenarios in Azure: AiDice is being applied to many scenarios in Azure already, but the algorithm has room to improve with respect to the types of metrics it can operate on. Microsoft Azure and the Microsoft Research team are working together to support more metric scenarios.
  • Prepare additional data feeds in Azure for AiDice: In addition to upgrading the AiDice algorithm itself to support more scenarios, we are also working to add supporting attributes to certain data sources to fully leverage the power of AiDice.
