By: Francesco Paola
24 May, 2021
AI is top of mind for every CIO and her team. The promise of artificial intelligence, with machine learning models that analyze ever-increasing data volumes and diverse data types and proactively resolve customer inquiries and system alerts, is prompting IT leaders to invest in platforms that build on their existing IT infrastructure investments and deliver lower total cost of ownership, increased customer satisfaction and enhanced bottom lines.
However, these benefits have been slow to materialize. Today’s distributed IT environments combine physical and virtual applications and infrastructure with higher levels of automation, massive amounts of diverse data and siloed platforms. Compounded by customer access across multiple channels – voice, text, chat, social, online – they can strain even the best organized service desks and network operations centers (NOCs).
Here we analyze the top challenges facing today’s Service Desk and IT support ecosystems and provide guidance on how to rapidly and efficiently address these challenges by intelligently deploying AIOps solutions, positioning the organization for scale.
Let’s start with the process from when alerts are generated to when tickets are created – the “alert to
The most prevalent IT Ops challenge is alert storms, the primary manifestation of the “too much data” problem in IT support. The uncontrolled generation of alerts overwhelms the support desk: the most critical alerts are missed or processed only after significant delays, and support teams are too inundated with alerts to do their job. The delays cause unnecessary system downtime and potential service interruptions, impacting revenue and customer satisfaction.
Take as an example a sell-side ad-trading platform that relies on uninterrupted and streamlined network connectivity to generate revenue. As trading volumes increase, server capacity warning thresholds are breached, but the system still functions within its SLAs and performance is not impacted. At the same time, a critical network slowdown occurs, directly impacting trade volumes and therefore revenue. Which alerts should the support team be working on? If the server threshold alerts are allowed to inundate the NOC, the critical network alerts may be lost, directly impacting the ad platform’s revenue.
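To make the ad-trading example concrete, here is a minimal sketch of business-aware triage. The Alert fields, the scoring rule and the sample alerts are all assumptions made for illustration, not features of any particular monitoring product; the point is only that ranking by SLA and revenue impact lets the network alert surface above the threshold noise.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str              # hypothetical label, e.g. "server-capacity"
    message: str
    sla_breached: bool       # is the service currently outside its SLA?
    revenue_impacting: bool  # does the issue directly affect revenue?


def priority(alert: Alert) -> int:
    """Lower number = work on it first (illustrative scoring rule)."""
    if alert.revenue_impacting and alert.sla_breached:
        return 0
    if alert.revenue_impacting or alert.sla_breached:
        return 1
    return 2


def triage(alerts: list[Alert]) -> list[Alert]:
    # sorted() is stable, so alerts of equal priority keep arrival order.
    return sorted(alerts, key=priority)


alerts = [
    Alert("server-capacity", "CPU warning threshold breached", False, False),
    Alert("server-capacity", "Disk warning threshold breached", False, False),
    Alert("network-latency", "Critical slowdown on trading link", True, True),
]
queue = triage(alerts)
print(queue[0].message)
```

With this ordering the critical network slowdown reaches the rep first, even though it arrived after the server threshold warnings.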
Alert storms have several causes. Some have to do with the monitoring platforms taking an alert-based view of the world rather than a business or customer landscape view (what is important to a specific customer), and others with the poor implementation of the platforms.
One common occurrence is that existing monitoring platforms are reactive to alerts and inquiries rather than proactive: they simply pass alerts on instead of correlating them with similar or related alerts and proactively recommending a resolution to the rep. For example, when procurement and supply chain resources are unable to access their inventory management system while operational line staff cannot access the ERP platform, the monitoring platform should be able to correlate these similar inquiries and determine the root cause of the failures, rather than leaving that work to the reps.
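As a rough illustration of what such correlation could look like, the sketch below groups alerts that arrive close together and share an infrastructure dependency. The dependency map, component names and five-minute window are assumptions invented for this example; a real platform would derive them from its CMDB and tune them to its environment.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical dependency map, as might be derived from a CMDB:
# each business system -> the infrastructure components it relies on.
DEPENDS_ON = {
    "inventory-management": {"db-cluster-1", "core-switch-a"},
    "erp": {"app-server-7", "core-switch-a"},
    "hris": {"vm-host-3"},
}


def correlate(alerts, window=timedelta(minutes=5)):
    """Return {component: [systems]} for components shared by two or more
    systems that alerted within `window` of the earliest alert."""
    if not alerts:
        return {}
    alerts = sorted(alerts, key=lambda a: a[0])
    start = alerts[0][0]
    affected = defaultdict(set)
    for timestamp, system in alerts:
        if timestamp - start <= window:
            for component in DEPENDS_ON.get(system, ()):
                affected[component].add(system)
    # A component is a root-cause candidate only if 2+ systems share it.
    return {c: sorted(s) for c, s in affected.items() if len(s) > 1}


t0 = datetime(2021, 5, 24, 9, 0)
incoming = [
    (t0, "inventory-management"),
    (t0 + timedelta(minutes=2), "erp"),
    (t0 + timedelta(minutes=40), "hris"),  # outside the window, unrelated
]
candidates = correlate(incoming)
print(candidates)  # {'core-switch-a': ['erp', 'inventory-management']}
```

Here the two access failures are tied to the one component they have in common, which is the kind of lead a rep would otherwise have to find by hand.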
Another reason is that the monitoring platform may have been implemented poorly. It may be poorly instrumented: for example, physical (or virtual) hardware assets such as servers, routers, and firewalls are not tagged appropriately and are not integrated well with the CMDB, or the CMDB is not kept up to date or configured properly.
Classification and categorization of alerts seem to be an afterthought in some IT organizations. For example, service level agreements are applied incorrectly, or not at all, to alerts, so critical alerts for inaccessible systems, network congestion, malware detection or other security breaches get lost in the noise that overwhelms the IT support team.
Finally, the monitoring and alerting platform itself may be sub-par or outdated: it does not enable automated triage, causality analysis and categorization of alerts prior to ticket generation, so the service desk and NOC are inundated with the unmanageable volumes discussed earlier.
In the ad-trading example above, the system was poorly instrumented, alerts were not properly classified or categorized, and the monitoring tool had no correlation capabilities. This shifted the analytical onus to the reps and directly impacted revenue.
A second common issue is the proliferation of siloed monitoring solutions. As IT departments invest in new and improved technology platforms to keep up with demand, more often than not they do so with point solutions that bring the promise of quick and easy deployment and integration across the enterprise. Sorry to disappoint you, but enterprise IT is not known for visionary investments. It’s more about “how can I solve this problem now?”, which incurs technical debt.
This lack of integration compounds the data issue in more ways than one. With no unified platform to bring it all together, these point solutions don’t provide alert and ticket correlation, exacerbating the challenge of managing and processing massive volumes of data that live in siloed, disparate systems. For example, in many support organizations the Service Desk is a separate entity from the NOC and uses different systems of engagement: the Service Desk may use Front or Zendesk to manage its inquiry queue, while the NOC may receive alerts from an APM like AppDynamics and use an enterprise ITSM platform like ServiceNow.
Tickets (both user-generated and machine- or alert-generated) arrive in the support queue with little or no context, leaving the rep to correlate tickets and research possible resolution options on their own. This delays resolving the issues and closing the tickets, extending troubleshooting time and raising the cost per ticket. For example, in a CPG manufacturing organization, a simple “user cannot access the MRP system” ticket that does not indicate whether the cause is a network issue, a server issue, a system overload, a denial of service (DoS) or another system issue will take up unnecessary research time by the rep and prolong the resolution time, potentially delaying the manufacturing of goods and causing stockouts at the retail level.
So when the Service Desk receives an inquiry that is caused by a systems issue, for example, a user is unable to access their benefits and payroll information in the HRIS platform (say, because the virtual server hosting the instance has been inadvertently taken offline), the Service Desk may not have the requisite systems visibility and simply passes the ticket on to the NOC with little to no context, forcing the NOC rep to research the issue from square one.
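One way to picture the missing context is a small enrichment step that runs before a ticket is handed over. The sketch below is illustrative only: the CMDB extract, the open-alert feed, the asset names and the routing rule are all hypothetical stand-ins for whatever the Service Desk and NOC systems actually expose.

```python
# Hypothetical CMDB extract: application -> hosting asset.
CMDB = {"hris": "vm-hris-01", "mrp": "vm-mrp-02"}

# Hypothetical open alerts from the monitoring platform, keyed by asset.
OPEN_ALERTS = {"vm-hris-01": ["Host unreachable: VM powered off"]}


def enrich(ticket: dict) -> dict:
    """Return a copy of the ticket with infrastructure context attached."""
    asset = CMDB.get(ticket["application"])
    enriched = dict(ticket)  # leave the original ticket untouched
    enriched["asset"] = asset
    enriched["related_alerts"] = OPEN_ALERTS.get(asset, [])
    # Route to the NOC only when monitoring shows an infrastructure problem.
    enriched["route_to"] = "NOC" if enriched["related_alerts"] else "Service Desk"
    return enriched


ticket = {
    "id": 1042,
    "application": "hris",
    "summary": "User cannot access benefits and payroll information",
}
enriched_ticket = enrich(ticket)
print(enriched_ticket["route_to"], enriched_ticket["related_alerts"])
```

With the hosting asset and its open alert attached, the NOC rep starts from the offline virtual server instead of from square one.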
Even once the issues with alert storms and integrated monitoring are sorted out, the challenges pertaining to ticket resolution remain. What happens once an alert is converted to a ticket, and what functions does the IT support desk have to perform in order to expeditiously resolve the
As a support engineer tries to determine the root cause of the inquiry, disparate knowledge bases containing SOPs, service level agreements, contracts and system configurations force the rep to access multiple systems and configuration files and manually determine the appropriate resolution, instead of having the system of engagement recommend one or more possible paths. Unless the process has been automated, the support engineer may also have difficulty accessing system data such as log and configuration files. This delays their ability to extract insights about the underlying IT systems, monitor operational and usage statistics, and proactively solve application performance problems.
Because the support engineer is left to their own devices to research the issue, the challenges are compounded by disparate systems and disparate data sources. This is the same problem highlighted above, but this time the nature of modern IT infrastructure, with its mix of physical and virtual environments and large, disparate data sources that may not be up to date, makes it challenging for individual reps to quickly identify the right resolution to the inquiry or alert.
In addition, the mantra of “automate everything” has permeated many an IT organization, which is good. But in many cases executing the automation still requires human intervention: the platforms are not sophisticated enough, and hence not trusted enough, so a human has to trigger the automation script, delaying the resolution. This is what we refer to as augmented automation.
Finally, once the issue is cleared and the ticket is closed, the resolution is not necessarily institutionalized. In reality, the job is not complete until the resolution is correlated with the ticket and similar tickets, and that information is stored and made accessible for the next rep. It is difficult for these disparate systems to learn, i.e., there is little to no institutional memory that can be leveraged to continually optimize the
This issue is especially critical where processes in IT support organizations are people-dependent: if there is high churn, the institutional memory walks out the door and you’re back at the starting point. Sure, you can train the rep to manually update the knowledge base, but will they do it, and will they have time if they are inundated with alerts and tickets?
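To show what even a very simple institutional memory might look like, the sketch below records each closed ticket’s resolution and suggests it when a similar ticket arrives. Everything here is an assumption for illustration: the ResolutionMemory class, the word-overlap (Jaccard) similarity, the 0.5 threshold and the sample resolutions are invented, and a production system would use far richer matching.

```python
def tokens(text: str) -> set[str]:
    """Lower-cased words of a ticket summary."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Word overlap between two summaries, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ResolutionMemory:
    """Stores resolutions of closed tickets and suggests them for like tickets."""

    def __init__(self):
        self._records = []  # list of (summary tokens, resolution text)

    def record(self, summary: str, resolution: str) -> None:
        self._records.append((tokens(summary), resolution))

    def suggest(self, summary: str, threshold: float = 0.5):
        query = tokens(summary)
        scored = [(jaccard(query, t), r) for t, r in self._records]
        best = max(scored, default=(0.0, None), key=lambda pair: pair[0])
        # Only suggest when the closest past ticket is similar enough.
        return best[1] if best[0] >= threshold else None


memory = ResolutionMemory()
# Hypothetical past tickets and resolutions.
memory.record("user cannot access the MRP system", "Restart the MRP application server")
memory.record("printer offline on floor 3", "Power-cycle the print server")

print(memory.suggest("user cannot access MRP system"))
print(memory.suggest("VPN certificate expired"))
```

Because the resolution is captured when the ticket closes rather than in a separate manual step, the memory does not depend on the rep finding time to update a knowledge base.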
The issues with the current state of IT Ops are well understood and have been addressed to varying degrees with traditional process improvement methods and tools. Does machine learning hold some answers that will deliver a step change, rather than an incremental one, to the alerting and monitoring process?
Look out for the next part of our blog on how machine learning can indeed be that vector of change.