By: Francesco Paola
24 May, 2021
AI is top of mind for every CIO and her team. The promise of artificial intelligence, machine learning models that analyze ever-increasing volumes and types of data and proactively resolve customer inquiries and system alerts, is prompting IT leaders to invest in platforms that leverage their existing IT infrastructure and deliver on the promise of lower total cost of ownership, increased customer satisfaction and an enhanced bottom line.
However, these benefits have been slow to materialize. Today’s distributed IT environments, a mix of physical and virtual applications and infrastructure with higher levels of automation, massive amounts of diverse data and siloed platforms, compounded by customer access across multiple channels – voice, text, chat, social, online – can strain even the best organized service desks and network operations centers (NOCs).
Here we analyze the top challenges facing today’s Service Desk and IT support ecosystems and provide guidance on how to rapidly and efficiently address these challenges by intelligently deploying AIOps solutions, positioning the organization for scale.
Let’s start with the process from when alerts are
generated to when tickets are created – the “alert to
ticket” challenges.
The most prevalent IT Ops challenge is the alert storm, the primary manifestation of the “too much data” problem in IT support. The uncontrolled generation of alerts overwhelms the support desk, causing the most critical alerts to be missed or processed after significant delays, and impairing the ability of support teams to do their job – they are too busy being inundated with alerts. Those delays cause unnecessary system downtime and potential service interruptions, impacting revenue and customer satisfaction.
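To give a sense of how mechanical the first line of defense can be, the sketch below shows simple deduplication: collapsing repeated copies of the same alert into a single record with an occurrence count before anything reaches a human. It is a minimal illustration in Python; the alert fields (source, check, message) and the five-minute window are assumptions made for the example, not features of any particular monitoring product.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical alert shape for the example: (timestamp, source, check, message).
def deduplicate(alerts, window=timedelta(minutes=5)):
    """Collapse repeated alerts from the same source/check raised within a
    short window into a single record with an occurrence count."""
    groups = defaultdict(list)
    for ts, source, check, message in alerts:
        groups[(source, check)].append((ts, message))

    deduped = []
    for (source, check), items in groups.items():
        items.sort()
        window_start, first_msg, count = items[0][0], items[0][1], 1
        for ts, msg in items[1:]:
            if ts - window_start <= window:
                count += 1                      # same storm, just bump the count
            else:
                deduped.append((source, check, first_msg, count))
                window_start, first_msg, count = ts, msg, 1
        deduped.append((source, check, first_msg, count))
    return deduped

if __name__ == "__main__":
    now = datetime.now()
    storm = [(now + timedelta(seconds=i), "srv-42", "cpu_high", "CPU > 90%")
             for i in range(200)]
    storm.append((now, "core-rtr-1", "link_down", "Uplink to ISP lost"))
    for record in deduplicate(storm):
        print(record)  # 200 CPU alerts collapse to one record; the network alert stays visible
```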
Take, as an example, a sell-side ad-trading platform that relies on uninterrupted, streamlined network connectivity to generate revenue. As trading volumes increase, server capacity warning thresholds are breached, but the system still functions within its SLAs and performance is not impacted. At the same time, a critical network slowdown occurs, directly impacting trade volumes and therefore revenue. Which alerts should the support team be working on? If the server threshold alerts are allowed to inundate the NOC, the critical network alerts may be lost, directly impacting the ad platform’s revenue.
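One way to answer that question systematically is to rank alerts by business impact rather than by arrival order. The sketch below illustrates the idea; the service names, impact weights and severity scores are invented for the example and would, in practice, come from the business service map.

```python
from dataclasses import dataclass, field

# Assumed mapping of services to business (revenue) impact, for illustration only.
BUSINESS_IMPACT = {
    "ad-trading-network": 100,   # direct revenue path
    "trade-servers": 40,         # capacity headroom, still within SLA
    "internal-reporting": 10,
}

SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}

@dataclass(order=True)
class Alert:
    priority: int = field(init=False)
    service: str
    severity: str
    message: str

    def __post_init__(self):
        # Higher score sorts first: business impact dominates raw severity.
        self.priority = -(BUSINESS_IMPACT.get(self.service, 1)
                          * SEVERITY_WEIGHT.get(self.severity, 1))

queue = sorted([
    Alert("trade-servers", "warning", "CPU threshold breached on srv-17"),
    Alert("ad-trading-network", "critical", "Latency spike on core switch"),
    Alert("internal-reporting", "info", "Nightly job slow"),
])

for a in queue:
    print(a.service, a.severity, a.message)
# The network slowdown that threatens revenue surfaces first,
# even though the server warnings arrive in far greater numbers.
```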
Why do critical alerts get lost in the first place? There are several reasons: some have to do with monitoring platforms taking an alert-based view of the world rather than a business or customer view (what is important to a specific customer), and others with poor implementation of those platforms.
One common occurrence is that existing monitoring platforms are reactive to alerts and inquiries rather than proactive: they simply pass the alerts on instead of correlating them with similar or related alerts and proactively providing a recommended resolution to the rep. For example, when procurement and supply chain staff are unable to access their inventory management system and, at the same time, operational line staff cannot access the ERP platform, the monitoring platform should be able to correlate these like inquiries and determine the root cause of the failures, rather than leaving the reps to do so.
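A monitoring layer can perform this correlation mechanically if it knows which services share an underlying dependency. The sketch below is a simplified illustration: the dependency map and alert format are assumptions for the example, and a real implementation would pull these relationships from a CMDB or service map.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed service-to-dependency map; in practice this comes from a CMDB or service map.
DEPENDS_ON = {
    "inventory-mgmt": ["erp-db-cluster"],
    "erp-platform": ["erp-db-cluster"],
    "hr-portal": ["hr-db"],
}

def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts whose services share a dependency and that fired close together,
    so a single probable root cause is surfaced instead of N separate tickets."""
    by_dependency = defaultdict(list)
    for ts, service, message in alerts:
        for dep in DEPENDS_ON.get(service, [service]):
            by_dependency[dep].append((ts, service, message))

    incidents = []
    for dep, items in by_dependency.items():
        items.sort()
        if len(items) > 1 and items[-1][0] - items[0][0] <= window:
            incidents.append({"probable_root_cause": dep,
                              "related_alerts": [m for _, _, m in items]})
    return incidents

if __name__ == "__main__":
    now = datetime.now()
    alerts = [
        (now, "inventory-mgmt", "Users cannot access inventory management"),
        (now + timedelta(minutes=2), "erp-platform", "Line staff cannot reach ERP"),
    ]
    for incident in correlate(alerts):
        print(incident)  # both inquiries roll up to the shared erp-db-cluster dependency
```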
Another reason is that the monitoring platform implementation may simply have been performed poorly. For example, the environment is badly instrumented: physical (or virtual) hardware assets such as servers, routers and firewalls are not tagged appropriately and are not integrated well with the CMDB, or the CMDB itself is not kept up to date or configured properly.
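Basic instrumentation hygiene can itself be checked automatically. The snippet below sketches a simple audit over hypothetical asset records; the required tags and the CMDB link field are placeholders, since a real check would query the CMDB’s own API and discovery tooling.

```python
# Hypothetical asset records; a real audit would query the CMDB / discovery tooling.
ASSETS = [
    {"name": "srv-17", "type": "server",
     "tags": {"env": "prod", "service": "ad-trading"}, "cmdb_ci": "CI0012345"},
    {"name": "fw-edge-2", "type": "firewall", "tags": {}, "cmdb_ci": None},
    {"name": "rtr-core-1", "type": "router", "tags": {"env": "prod"}, "cmdb_ci": "CI0098765"},
]

REQUIRED_TAGS = {"env", "service"}  # assumed tagging policy for the example

def audit(assets):
    """Flag assets missing required tags or a CMDB configuration item link."""
    findings = []
    for asset in assets:
        missing_tags = REQUIRED_TAGS - set(asset["tags"])
        if missing_tags:
            findings.append(f"{asset['name']}: missing tags {sorted(missing_tags)}")
        if not asset["cmdb_ci"]:
            findings.append(f"{asset['name']}: not linked to a CMDB CI")
    return findings

for finding in audit(ASSETS):
    print(finding)
# fw-edge-2: missing tags ['env', 'service']
# fw-edge-2: not linked to a CMDB CI
# rtr-core-1: missing tags ['service']
```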
Classification and categorization of alerts seem to be an afterthought in some IT organizations. For example, service level agreements are applied to alerts incorrectly or not at all, and hence critical alerts for inaccessible systems, network congestion, malware detection or other security breaches get lost in the noise and overwhelm the IT support team.
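Even simple, explicit classification rules go a long way before any machine learning enters the picture. The sketch below maps alert categories to severities and SLA response targets; the categories, priority labels and response times are assumptions made up for the illustration.

```python
# Assumed category-to-SLA policy; the values are illustrative only.
SLA_POLICY = {
    "security_breach":    {"severity": "P1", "respond_within_minutes": 15},
    "system_unreachable": {"severity": "P1", "respond_within_minutes": 30},
    "network_congestion": {"severity": "P2", "respond_within_minutes": 60},
    "capacity_warning":   {"severity": "P3", "respond_within_minutes": 240},
}

def classify(alert):
    """Attach a severity and SLA target to an alert before it becomes a ticket."""
    policy = SLA_POLICY.get(alert["category"],
                            {"severity": "P4", "respond_within_minutes": 480})
    return {**alert, **policy}

alerts = [
    {"id": 1, "category": "capacity_warning", "message": "Disk 85% on srv-03"},
    {"id": 2, "category": "security_breach", "message": "Malware signature detected"},
]

for ticket in sorted((classify(a) for a in alerts), key=lambda t: t["severity"]):
    print(ticket["severity"], ticket["message"])
# The malware alert is classified P1 and surfaces ahead of the disk warning.
```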
Finally, the monitoring and alerting platform itself is sub-par or
outdated. It does not enable automated
triage, causality analysis and categorization of alerts prior to ticket
generation, inundating the service desk
and NOC with the unmanageable volumes discussed earlier.
The ad-trading platform in the example above was poorly instrumented, its alerts were not properly classified or categorized, and the monitoring tool had no correlation capabilities, shifting the analytical onus to the reps and directly impacting revenue.
A second common issue is the proliferation of
siloed monitoring solutions. As IT departments invest in new
and improved technology platforms to keep up with demand, more often
than not they do so with point
solutions that bring the promise of quick and easy deployment and
integration across the enterprise. Sorry to disappoint you, but enterprise IT is not known for visionary investments; it’s more about “how can I solve this problem now,” thereby incurring technical debt.
This lack of integration compounds the data problem in more ways than one. With no unified platform to bring it all together, these point solutions don’t provide alert and ticket correlation, exacerbating the challenge of managing and processing massive volumes of data that live in siloed, disparate systems. For example, in many support organizations the Service Desk is a separate entity from the NOC, and each uses different systems of engagement: the Service Desk may use Front or Zendesk to manage its inquiry queue, while the NOC may receive alerts from an APM tool like AppDynamics and use an enterprise ITSM platform like ServiceNow.
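Bridging those silos usually starts with normalizing events from each tool into a common schema so they can be correlated at all. The sketch below shows that idea in miniature; the payload fields for the APM-like and service-desk-like sources are hypothetical and do not reflect the actual APIs of AppDynamics, Zendesk or ServiceNow.

```python
from datetime import datetime, timezone

def from_apm(raw):
    """Normalize a hypothetical APM health-rule violation into the common event schema."""
    return {
        "source": "apm",
        "service": raw.get("application"),
        "severity": raw.get("level", "warning").lower(),
        "summary": raw.get("rule"),
        "timestamp": raw.get("detected_at"),
    }

def from_service_desk(raw):
    """Normalize a hypothetical service-desk ticket into the same schema."""
    return {
        "source": "service_desk",
        "service": raw.get("affected_service"),
        "severity": raw.get("priority", "normal").lower(),
        "summary": raw.get("subject"),
        "timestamp": raw.get("created_at"),
    }

now = datetime.now(timezone.utc).isoformat()
events = [
    from_apm({"application": "mrp-system", "level": "CRITICAL",
              "rule": "Response time > 5s", "detected_at": now}),
    from_service_desk({"affected_service": "mrp-system", "priority": "High",
                       "subject": "User cannot access MRP", "created_at": now}),
]

# Once both streams share a schema, the NOC and the Service Desk can see that the
# user complaint and the APM violation describe the same underlying incident.
for e in events:
    print(e["source"], e["service"], e["severity"], e["summary"])
```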
Tickets (both user-generated and machine- or alert-generated) are received in the support queue with little or no context, meaning that the rep is left to their own devices to correlate tickets and research possible resolution options. This delays the process of resolving issues and closing tickets, extending troubleshooting time and raising the cost per ticket. For example, in a CPG manufacturing organization, a simple “user cannot access the MRP system” ticket, with no indication of whether the cause is a network issue, a server issue, a system overload, a denial of service (DoS) or another system issue, forces the rep to spend unnecessary research time and prolongs resolution, potentially delaying the manufacturing of the goods and causing stockouts at the retail level.
So when the Service Desk receives an inquiry that is caused by a systems issue – for example, a user is unable to access their benefits and payroll information in the HRIS platform because the virtual server hosting the instance has inadvertently been taken offline – the Service Desk may not have the requisite systems visibility and simply passes the ticket on to the NOC with little to no context, forcing the NOC rep to research the issue from square one.
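What both examples call for is automatic enrichment: attaching the relevant system context to the ticket before it reaches a rep or is handed to the NOC. The sketch below is a hypothetical illustration; the ticket fields, the service-to-CI mapping and the in-memory “recent alerts” store stand in for whatever ITSM and monitoring integrations are actually in place.

```python
from datetime import datetime, timedelta

# Stand-in for recent monitoring data; in practice this would be queried from the
# monitoring platform at ticket-creation time.
RECENT_ALERTS = [
    {"ci": "vm-hris-01", "when": datetime.now() - timedelta(minutes=12),
     "message": "Virtual server powered off"},
    {"ci": "core-rtr-1", "when": datetime.now() - timedelta(hours=3),
     "message": "Interface flap"},
]

SERVICE_TO_CI = {"hris": "vm-hris-01", "mrp": "srv-mrp-04"}  # assumed CMDB-style mapping

def enrich(ticket, lookback=timedelta(hours=1)):
    """Attach recent alerts for the affected configuration item to the ticket
    so it arrives at the NOC with context instead of a bare complaint."""
    ci = SERVICE_TO_CI.get(ticket["service"])
    cutoff = datetime.now() - lookback
    ticket["related_ci"] = ci
    ticket["related_alerts"] = [a["message"] for a in RECENT_ALERTS
                                if a["ci"] == ci and a["when"] >= cutoff]
    return ticket

ticket = {"id": "INC001", "service": "hris",
          "description": "User cannot access benefits and payroll"}
print(enrich(ticket))
# The NOC rep sees immediately that the HRIS virtual server went offline 12 minutes ago.
```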
Having sorted out issues with alert storms and integrated monitoring, the challenges pertaining to ticket resolution remain. What happens once an alert is converted into a ticket, and what functions does the IT support desk have to perform in order to expeditiously resolve the issue?
As a support engineer tries to determine the root cause of the inquiry, disparate knowledge bases containing SOPs, service level agreements, contracts and system configuration force the rep to access multiple systems and configuration files and manually determine the appropriate resolution, as opposed to having the system of engagement recommend one or more possible paths. Unless the process has been automated, the support engineer may also have difficulty accessing system data, for example log and configuration files, delaying the ability to extract insights about the underlying IT systems, monitor operational and usage statistics, and proactively solve application performance problems.
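Much of that manual legwork can be scripted into a single context-gathering step. The sketch below is purely illustrative: the file paths, the SOP index and the keyword matching are assumptions, standing in for whatever log stores and knowledge bases an organization actually runs.

```python
from pathlib import Path

# Assumed SOP index; in practice this would live in one or more knowledge bases.
SOP_INDEX = {
    "database connection refused": "SOP-114: Restart connection pool, check credential rotation",
    "disk full": "SOP-078: Expand volume or purge archived logs",
}

def tail(path, lines=20):
    """Return the last few lines of a file, or a note if it is unreachable."""
    try:
        return Path(path).read_text(errors="replace").splitlines()[-lines:]
    except OSError as exc:
        return [f"(could not read {path}: {exc})"]

def gather_context(symptom, log_path, config_path):
    """Pull logs, config and matching SOPs into one view for the support engineer."""
    matching_sops = [sop for key, sop in SOP_INDEX.items() if key in symptom.lower()]
    return {
        "symptom": symptom,
        "recent_log": tail(log_path),
        "config": tail(config_path, lines=50),
        "suggested_sops": matching_sops or ["(no matching SOP found)"],
    }

context = gather_context(
    symptom="App reports: database connection refused",
    log_path="/var/log/app/app.log",      # placeholder paths for illustration
    config_path="/etc/app/app.conf",
)
print(context["suggested_sops"])
```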
As the support engineer is left to their own devices to research the issue, the challenge is compounded by disparate systems and disparate data sources: the same problem highlighted above, but this time the mix of physical and virtual environments in modern IT infrastructure, combined with large, disparate data sources that may not be up to date, makes it challenging for individual reps to quickly identify the right resolution to the inquiry or alert.
In addition, the mantra of “automate everything” has permeated many an IT organization – good. But in many cases executing the automation still requires human intervention: the platforms are not sophisticated enough, and hence not trusted enough, so a human has to trigger the automation script, delaying the resolution. This is what we refer to as augmented automation.
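In practice, augmented automation often amounts to an approval gate in front of an otherwise automated runbook. The sketch below illustrates that pattern with a made-up remediation action; the function names and the approval prompt are hypothetical and not tied to any specific orchestration tool.

```python
def restart_service(host, service):
    """Placeholder remediation action; a real runbook would call an orchestration tool."""
    print(f"Restarting {service} on {host} ...")
    return True

def run_remediation(action, require_approval=True, approver=input):
    """Execute a remediation step, optionally gated behind a human approval.
    With require_approval=False the same runbook becomes fully automated."""
    if require_approval:
        answer = approver(f"Approve action '{action['name']}'? [y/N] ").strip().lower()
        if answer != "y":
            print("Action skipped; escalating to engineer.")
            return False
    return action["run"]()

action = {"name": "restart MRP app service",
          "run": lambda: restart_service("srv-mrp-04", "mrp-app")}

# Human-in-the-loop today; flip require_approval to False once the automation is trusted.
run_remediation(action, require_approval=True)
```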
Finally, once the issue is cleared and the ticket is closed, the resolution is not necessarily institutionalized. In reality, the job is not complete unless the resolution is correlated with the ticket and with like tickets, and that information is stored and made accessible for the next rep. It is difficult for these disparate systems to learn; i.e., there is little to no institutional memory that can be leveraged to continually optimize the service.
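Capturing that memory does not require anything exotic: it can start with persisting each resolution against a normalized ticket signature and consulting that store when a similar ticket arrives. The sketch below is a minimal, hypothetical version; in production the store would sit behind the ITSM platform and the matching would be far more robust than an exact-signature lookup.

```python
import json
from pathlib import Path

STORE = Path("resolutions.json")   # placeholder for a real knowledge store

def signature(ticket):
    """Reduce a ticket to a coarse signature so like tickets map to the same entry."""
    return f"{ticket['service']}::{ticket['category']}"

def record_resolution(ticket, resolution):
    data = json.loads(STORE.read_text()) if STORE.exists() else {}
    data.setdefault(signature(ticket), []).append(resolution)
    STORE.write_text(json.dumps(data, indent=2))

def suggest_resolution(ticket):
    data = json.loads(STORE.read_text()) if STORE.exists() else {}
    return data.get(signature(ticket), [])

closed = {"service": "hris", "category": "access_failure"}
record_resolution(closed, "Virtual server had been powered off; restarted via hypervisor console")

new_ticket = {"service": "hris", "category": "access_failure"}
print(suggest_resolution(new_ticket))
# The next rep sees last time's fix instead of starting from square one.
```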
This issue is especially critical where processes in IT support organizations are people-dependent; if there is high churn, the institutional memory walks out the door and you’re back at the starting point. Sure, you can train the rep to manually update the knowledge base, but will they do it, and will they have time if they are inundated with alerts and tickets?
The issues with the current state of IT Ops are well understood and have been addressed to varying degrees with traditional process improvement methods and tools. Does machine learning hold answers that will bring a step change, rather than an incremental one, to the alerting and monitoring process?
Look out for the next part of our blog on how machine learning can
indeed be that vector of change.