While AI has emerged as a significant value driver in today’s digital economy, not all businesses are able to surmount the challenges associated with the traditional approach to data-driven solutioning. The key to delivering value with AI perhaps lies elsewhere – in this article, we look at how AI projects headed towards failure can be spotted early on, and how enterprises can, instead, take on a proven approach to implementing nimble, lean and scalable AI from the get-go.
Gartner estimates that AI will deliver 2.9 trillion dollars of business value over the next few years. A McKinsey survey from 2019 shows us how.
Enterprises are adopting AI across functions with significant impact on their cost and revenue
(Source: McKinsey & Co. Global AI Survey 2019)
Data scientists across enterprises (Harvard Business Review says theirs is the most in-demand job these days) are doing their best on this five-step journey to identify AI use cases and design solutions that create impact and keep adding to AI’s business value.
Despite their best efforts, however, estimates say 87% of machine learning projects never make it into production, and almost 80% of projects take up to 6 months to even reach the production stage. The reasons behind these startling estimates range from ‘dirty data’ to a lack of integration of AI findings into next-step actions or decisions.
Source: Business Over Broadway, 2018
Attributing the failure of enterprise AI to dirty data sounds a bit like attributing one’s lack of productivity in winter to an arctic blast. Enterprise AI must understand what an enterprise inherently is and be tuned to meet the needs of enterprises, rather than expecting enterprises to be magically redesigned to suit AI algorithms.
Let’s dwell on dirty data for a moment. Data can get dirty for multiple reasons:
- Lack of understanding of how data is born – are we expecting dimensional information that does not exist?
- Does it get dirty in transit – are incentives aligned to dirtiness rather than cleanliness?
- Does it get dirty in storage – is it getting mixed without reason? Are there processes using this data without the knowledge of its creator?
Net net, data scientists can’t throw up their hands and expect someone to ‘clean’ the data (Reason 1), convince executives to change their processes and use AI in their decision-making (Reasons 6 and 14), and expect tons of support (Reason 3) while ignoring the lack of clarity (Reason 4), all because we, as AI players and data scientists, have conjured up a model and are claiming success in a utopian world.
Studying the failure drivers closely, it’s evident that 10 out of 14 reflect a lack of understanding of processes and people that create and use the data.
Here’s our take on how we can fix it in two steps. First up, we need to embrace 4 home truths:
- Data is only a symptom: business processes are the root cause. Let us therefore call them data generating processes (DGPs). Borrowing a telecom analogy, DGPs generate both real signal and noise. Temporary workers and new joiners, for instance, unknowingly inject errors because they lack process and system knowledge. If they make up 15% of the people generating data, there is still a way to learn from it; if they make up 60%, the problem is a very different one.
- Treating symptoms is a hit-or-miss game: are we supposed to reject some signals from the DGP as noise, or should the model retrain so that this time’s miss is accounted for next time? Is the model scoring better because it was trained on data generated during morning shifts, when employees are less tired than in later shifts? Or should we rethink the training data set completely if 60% of the population are always temps and new hires?
- Data quality management (DQM) is yet another cost black hole: according to McKinsey, leading firms spend over 74% of their employees’ time on non-value-adding tasks that pertain to data quality issues. Overfitting remains a problem in models currently in production and is one of the key reasons behind executives’ lack of faith (Reasons 3, 4 and 14).
- The returns on dimensionality reduction, and why they lie untapped: unstructured data will account for 93% of the data universe by 2022. Modelling unstructured data remains very difficult, because sequential or unstructured data such as language-based queries translates into a massive, high-dimensional problem space where the intuitions of low-dimensional geometry begin to fail. As a result, models that give excellent results on structured data give mediocre returns on unstructured data. Remember the chatbots that pop up on the sites of banks and other enterprises these days? Beyond the Von Neumann menu, unless there is a live agent behind them, they fail to truly understand the query and start repeating the same trivialities.
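The point about low-dimensional intuition failing can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: points are drawn uniformly at random from a unit hypercube, and the function name `distance_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest. As the number of dimensions grows, that contrast tends to shrink, so "near" and "far" become harder to tell apart.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for random points around a random query.

    Assumption for illustration: all points are uniform in the unit hypercube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # Print the contrast for a few dimensionalities to see the trend.
    for dim in (2, 10, 100, 1000):
        print(dim, round(distance_contrast(dim), 3))
```

A model that relies on distances or neighbourhoods behaves very differently when this contrast collapses, which is one reason reducing the problem space matters for unstructured data.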
Once we get this context (that data is just a symptom, that overfitting is a problem because executives want resilience as much as accuracy and are open to discussing how to balance the two, and that we need to watch unstructured data as much as or more than structured data), we have already reached a good place for cracking the enterprise AI problem.
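One simple way to start probing the shift question raised in the home truths is to score a model separately for each slice of the data generating process instead of only in aggregate. The sketch below is illustrative: the record fields (`shift`, `predicted`, `actual`), the function name `accuracy_by_segment` and the toy records are all assumptions made up for this example, not data from any real system.

```python
from collections import defaultdict


def accuracy_by_segment(records, segment_key):
    """Return {segment value: share of records where predicted == actual}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        if record["predicted"] == record["actual"]:
            hits[segment] += 1
    return {segment: hits[segment] / totals[segment] for segment in totals}


# Toy, made-up records: each is a model prediction tagged with the shift
# during which the underlying data was entered.
toy_records = [
    {"shift": "morning", "predicted": "A", "actual": "A"},
    {"shift": "morning", "predicted": "B", "actual": "B"},
    {"shift": "morning", "predicted": "A", "actual": "B"},
    {"shift": "morning", "predicted": "A", "actual": "A"},
    {"shift": "late", "predicted": "A", "actual": "B"},
    {"shift": "late", "predicted": "B", "actual": "B"},
    {"shift": "late", "predicted": "B", "actual": "A"},
    {"shift": "late", "predicted": "A", "actual": "B"},
]

if __name__ == "__main__":
    for shift, accuracy in accuracy_by_segment(toy_records, "shift").items():
        print(f"{shift}: {accuracy:.2f}")
```

A large gap between segments would suggest the model has learned the habits of one group of data generators rather than the underlying signal, which is a cue to revisit the training set.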
The following 3 steps can help reduce this damning 87% failure rate in enterprise AI:
- Make the model account for the DGP: Data is only a symptom; the underlying processes, behaviors and practices are the root causes. If we can study and map the data generating process and capture its nuances, such as seasonality and boundary conditions, then selecting a model becomes a relatively easy task. Understanding the DGP helps inform the model about the sources of signal so that it can ignore noise, thereby avoiding overfitting. The evolution of the DGP must also be captured in the model; therein lies the key to avoiding the failures associated with rapid-fire, black-box AI that selects models from pull-down menus. Lastly, the DGP also contains critical information about underlying data quality, which helps enterprises cut the costs associated with DQM.
- Focus on dimensionality reduction: In analyzing unstructured data, dimensionality reduction holds the key to making AI solutions lightweight, relevant and accurate in the business context. Dimensionality reduction is effectively a problem of representation learning. Reducing the problem space without losing signal is a challenge for most traditional dimensionality reduction techniques, such as statistical correlation analysis and association analysis.
- Close the loop on deployment: While data modelling can never be fully automated with existing technical capabilities, machine learning solutions must always close the loop. For example, if a chatbot fails to deliver relevant results for a user query, can the model recognize that it cannot render the required result with confidence, and consequently redirect the user to a live agent? Can the solution demonstrate some degree of self-awareness when a previously trained model begins returning irrelevant results?
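The chatbot example in the last step can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: the names `Answer`, `route` and `HandoffMonitor` are hypothetical, the confidence score is assumed to be a number between 0 and 1 supplied by the model, and the threshold values are arbitrary placeholders that would need tuning.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be a score in [0, 1] supplied by the model


def route(answer, threshold=0.7):
    """Return ("bot", text) when confident enough, else hand off to a live agent."""
    if answer.confidence >= threshold:
        return ("bot", answer.text)
    return ("live_agent", "Connecting you to a live agent.")


class HandoffMonitor:
    """Tracks recent handoffs so a model that has gone stale can be flagged."""

    def __init__(self, window=50, threshold=0.7, max_handoff_rate=0.4):
        self.threshold = threshold
        self.max_handoff_rate = max_handoff_rate
        self.recent = deque(maxlen=window)  # True where the bot had to hand off

    def record(self, answer):
        self.recent.append(answer.confidence < self.threshold)

    def needs_review(self):
        # Only judge once the window is full, to avoid reacting to a few queries.
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) > self.max_handoff_rate
```

Routing handles the single query, while the monitor covers the second question in that step: if the share of low-confidence answers over a recent window climbs above the chosen rate, the previously trained model is flagged for review.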
We were conscious of each of the 14 failure drivers while designing Sainapse. Sainapse conquers most of these failure drivers so that enterprises do not need to spend time and money on trying out models and cleaning data. We have baked in capabilities to handle all forms of structured and unstructured data, including text-annotated images and mixed-language messages. Sainapse accounts for the length and breadth of enterprise customer support, service operations and even engineering support, internalizing quality management and cutting-edge dimensionality reduction techniques that turn any form of data into the right answers, instantly, on the first attempt.