Understanding the importance of data quality in causal AI
Thu, 18th Apr 2024

Causal AI, the latest innovation in artificial intelligence, promises to revolutionise the way we understand cause and effect. Unlike traditional AI, which relies on correlations to identify possible causes, causal AI dives deeper to pinpoint the exact reasons behind events.

This makes it ideal for troubleshooting complex systems and identifying the root causes of problems. For example, imagine the difference between a doctor who simply observes symptoms and prescribes medicine versus one who delves into the underlying causes of illness to provide targeted treatment. Causal AI offers that same level of precision in the realm of data analysis.

But here's the catch: causal AI is only as good as the data it is fed. High-quality data is crucial for generating reliable insights. Inaccurate or incomplete data can lead the AI model down rabbit holes, producing misleading conclusions.

For this reason, it's important to understand what makes data 'good' for causal AI. The five key pillars of data quality, several of which are illustrated in the code sketch after this list, are:

  1. Accuracy: The data needs to be a faithful reflection of reality. This means ensuring data points are free from errors and correctly represent the situations they depict. 
  2. Completeness: It's important to determine whether any information is missing from the data set. Omissions can lead to wrong conclusions and introduce bias. 
  3. Consistency: There should be no discrepancies in the data. Contradictory or inconsistent data confuses AI models and increases the risk of errors. 
  4. Timeliness: The data should be up-to-date and relevant to the current context. This is a critical factor in AI for IT operations (AIOps): because IT systems change often, models trained only on historical data struggle to diagnose novel events, so causal AI needs its models updated in real time. 
  5. Relevancy: The data needs to be appropriate for the questions asked. In AIOps, this means providing the model with the full range of logs, events, metrics, and traces needed to understand the inner workings of a complex system. Feeding a model with irrelevant data is like asking a historian to solve a physics equation: they simply don't have the right information to provide a meaningful answer.
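
Several of these pillars can be checked automatically before data ever reaches a causal model. Below is a minimal sketch, not a production validator: it assumes telemetry records arrive as Python dictionaries with hypothetical field names (host, metric, value, timestamp) and covers only the checks that work record-by-record - completeness, a basic accuracy sanity check, and timeliness. Consistency and relevancy need cross-record comparison and domain context.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema for an AIOps telemetry record (illustrative only).
REQUIRED_FIELDS = {"host", "metric", "value", "timestamp"}
MAX_AGE = timedelta(minutes=5)  # assumed timeliness threshold

def quality_issues(record: dict) -> list[str]:
    """Return the data-quality problems found in a single record."""
    issues = []

    # Completeness: every required field must be present and non-null.
    present = {k for k, v in record.items() if v is not None}
    missing = REQUIRED_FIELDS - present
    if missing:
        issues.append(f"incomplete: missing {sorted(missing)}")

    # Accuracy (sanity check): metric values must be numeric.
    value = record.get("value")
    if value is not None and not isinstance(value, (int, float)):
        issues.append(f"inaccurate: value {value!r} is not numeric")

    # Timeliness: stale records are flagged rather than fed to the model.
    ts = record.get("timestamp")  # expected to be timezone-aware
    if ts is not None and datetime.now(timezone.utc) - ts > MAX_AGE:
        issues.append(f"stale: older than {MAX_AGE}")

    return issues

record = {"host": "web-01", "metric": "cpu", "value": 0.92,
          "timestamp": datetime.now(timezone.utc)}
print(quality_issues(record))  # [] when the record passes every check
```

In practice, checks like these sit in the ingestion pipeline so that suspect records are quarantined before they can skew the model's view of cause and effect.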

How to cultivate high-quality data for causal AI
Achieving high-quality data requires a strategic, organisation-wide effort. Everyone who interacts with data - from creation to utilisation - plays a crucial role, and the foundation lies in establishing data governance practices.

These define clear guidelines for data handling across various aspects, including quality, security, legal compliance, storage, ownership, and integration.

Data stewardship is becoming increasingly important in the pursuit of clean data. It ensures that data generated and maintained by different departments or individuals is accurate, consistent, and complete.

A recent concept called 'data mesh' encourages data creators to treat the data they produce as something requiring dedicated management, just like any other product. However, this approach can have drawbacks, such as the creation of duplicate data sets, which can lead to inconsistencies.

For teams to gain consistent and reliable insights, having high-quality operational data readily available in a central location is vital. This central repository, often called a data lakehouse, facilitates instant analytics.

Challenges on the road to data quality
Organisations may encounter numerous barriers to ensuring data quality. For a start, the sheer amount of data generated today can make management daunting.

Modern, cloud-native architectures have many moving parts, and identifying them all is a task that can overwhelm human effort alone. Modern observability solutions that automatically and instantly detect all IT assets in an environment – applications, containers, services, processes, and infrastructure – can save time and resources.

Fragmented and siloed data storage creates inconsistencies and redundancies. Stakeholders need to put aside ownership issues and agree to share information about the systems they oversee, including success factors and critical metrics.

Another common impediment is manual data tagging and handling - an error-prone process that teams should minimise. Observability solutions automate much of the task of identifying the variables that go into application performance and availability. Human involvement should be limited to verifying the features or attributes machine-learning algorithms use to make predictions or decisions.
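
As a concrete illustration of that limited human role, here is a minimal sketch assuming scikit-learn is available; the feature names and toy incident data are invented for the example. It trains a simple classifier on operational signals and prints a ranked list of the features the model relies on, so an engineer reviews a short list rather than hand-tagging every record.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical operational features; the data is toy data for illustration.
FEATURES = ["cpu_util", "error_rate", "gc_pause_ms", "deploy_recent"]
X = [[0.72, 0.01, 35, 0], [0.95, 0.12, 410, 1],
     [0.40, 0.00, 20, 0], [0.88, 0.09, 380, 1]]
y = [0, 1, 0, 1]  # 1 = an incident occurred in this system state

model = RandomForestClassifier(random_state=0).fit(X, y)

# Surface the learned feature weights for human verification.
for name, weight in sorted(zip(FEATURES, model.feature_importances_),
                           key=lambda pair: -pair[1]):
    print(f"{name:>13}: {weight:.2f}")
```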

A double-edged sword for data quality management
Improving data quality management using causal AI is a fascinating prospect. Causal AI can be a powerful tool for improving systems management, observability, and troubleshooting. It can highlight inconsistencies or outliers in data sets that indicate anomalies and pinpoint the root causes.

This allows for targeted data cleaning efforts, focusing on areas with the most significant impact. Additionally, causal AI enables an AIOps approach with proactive visibility that helps companies improve operational efficiency and reduce false-positive alerts by 95%, according to a recent Forrester Consulting report.
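
To make the outlier-highlighting idea concrete, here is a deliberately simple statistical screen - not causal AI itself - that flags points sitting far from the median of a metric series, using the median absolute deviation as a robust scale. The latency figures are invented for illustration; in a real pipeline, the flagged indices would direct cleaning effort to the records with the most impact.

```python
import statistics

def flag_outliers(series: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of points far from the median, scaled by the
    median absolute deviation (MAD), a robust alternative to z-scores."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales MAD so the score is comparable to a z-score
    return [i for i, x in enumerate(series)
            if 0.6745 * abs(x - med) / mad > threshold]

latencies = [102, 98, 105, 101, 97, 940, 103]  # response times in ms
print(flag_outliers(latencies))  # -> [5], the 940 ms spike
```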

Causal AI holds immense potential for problem-solving and process optimisation. By prioritising data quality, organisations can unlock the true power of causal AI and make data-driven decisions with confidence.