
Machine Learning and Artificial Intelligence for Improved Data Quality

Author: Will Barnes, Senior Data Scientist, on behalf of AtkinsRéalis

Introduction

The UK water industry faces significant challenges in ensuring the accuracy and quality of the vast amounts of data being generated. This is compounded by a growth in demand for data which can be seen across many AMP8 investment plans, with several initiatives dependent on having good quality data available, including:
-    Prevention of pollution / combined-sewer-overflow discharges, which requires accurate Event Duration Monitor (EDM) data;
-    Demand management, which needs accurate data capture about current water usage;
-    Flood alleviation schemes, where accurate data capture associated with flooding incidents is needed to ensure capital investment targets appropriate areas; and
-    Capital maintenance programmes, which require accurate inventories of asset location and condition.
Alongside these initiatives, the water industry is increasingly adopting innovative solutions that leverage big data analytics to enhance organisational decision-making. However, the accuracy of the data is paramount to realising the benefits promised by advanced analytics. 
This article reviews techniques typically employed by data scientists working in the water sector to identify outliers within a dataset, before exploring how advances in artificial intelligence may influence this field in the future.

Outlier Removal

An outlier can be defined as a datapoint that lies an abnormal distance from the other values in a population. Outlier detection, which aims to detect these unexpected datapoints, is a critical topic that has attracted significant attention. Despite this interest, a 2016 survey reported that data scientists still spend around 60% of their time cleaning and organising data [1]. Increasing the efficiency and accuracy of this process will free up time for developing digital solutions that can help deliver the change the sector needs.

Statistical Techniques

Statisticians were the first to study outliers, which can often be identified using a range of traditional distribution-based techniques such as z-scores, interquartile ranges and adjusted box plot limits (see Figure 1).

Figure 1: Example of Statistical Outlier Removal Techniques - Box Plots

Distribution-based methods identify outliers as datapoints that deviate significantly from an assumed distribution or expected range. Table 1 provides a brief overview of the advantages and disadvantages of these distribution-based techniques.

Table 1: Advantages and Disadvantages of Distribution-Based Outlier Techniques
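
As a minimal sketch of two of the distribution-based techniques named above, the Python below flags datapoints by z-score and by interquartile range. The flow readings, function names and thresholds (3 standard deviations and 1.5 × IQR) are illustrative assumptions, not values taken from the work described in this article.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical flow readings (l/s) containing one obvious spike at index 7.
flows = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 48.0, 12.1, 11.7]

print(iqr_outliers(flows))                    # [7]
print(zscore_outliers(flows))                 # []
print(zscore_outliers(flows, threshold=2.5))  # [7]
```

The z-score check misses the spike at the default threshold because the spike itself inflates the standard deviation: with only ten readings, no single point can score above 3. The quartile-based limits are less sensitive to this effect, which is one reason box plot limits are often preferred for short records.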

One disadvantage cited in Table 1 deserves further explanation: distribution-based approaches often consider the distribution of a single parameter only, without accounting for correlations with other variables. To give a water engineering example, water demand should be considered alongside temperature before deciding whether an observation is an outlier given the conditions.
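
The demand and temperature example can be sketched as follows. The same robust check (a modified z-score based on the median and the median absolute deviation) is applied first to demand alone, and then to the residuals of a straight-line fit of demand against temperature. The data, the linear relationship, the helper names and the 3.5 threshold are all assumptions made for illustration.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def linear_residuals(x, y):
    """Fit y = slope * x + intercept by least squares and return the residuals."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical daily temperature (deg C) and demand (Ml/d).
# The last record shows high demand on a cool day.
temperature = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 12]
demand = [120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 155]

print(mad_outliers(demand))                                  # []
print(mad_outliers(linear_residuals(temperature, demand)))   # [11]
```

A demand of 155 is unremarkable within the overall range, so the single-parameter check flags nothing. Once temperature is accounted for, the same value stands out clearly.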

Application of Machine Learning

To overcome several of the limitations of distribution-based methods, data scientists have increasingly turned to machine learning techniques for identifying outliers in recent years. Two key terms are commonly used when discussing machine learning:
-    Supervised learning – the relationship between input and output data is learned from labelled data; and
-    Unsupervised learning – algorithms learn patterns exclusively from unlabelled data.
Both approaches are applicable to outlier detection. For example, a clustering algorithm (unsupervised learning) may identify a group of District Metered Areas (DMAs) with suspiciously high water demands, whereas Figure 2 illustrates the use of Facebook’s Prophet model (supervised machine learning) to flag outliers in flow data whilst considering the influence of rainfall.

Figure 2: Flow Outlier Removal using Supervised Machine Learning (Prophet Model)

However, as with statistical methods, machine learning techniques for detecting outliers still have several drawbacks (see Table 2).

Table 2: Advantages and Disadvantages of Machine Learning Outlier Techniques
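
To illustrate the unsupervised case mentioned above, the sketch below runs a small one-dimensional k-means clustering over per-DMA demand and reports the cluster with the highest centroid. The DMA identifiers, demand figures and the choice of k-means with two clusters are assumptions for illustration; in practice a library implementation with more features per DMA would normally be used.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster scalar values with k-means (k >= 2); return (labels, centroids)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Hypothetical average demand per property (litres/day) for seven DMAs.
dma_demand = {
    "DMA-A": 310, "DMA-B": 295, "DMA-C": 325, "DMA-D": 300,
    "DMA-E": 640, "DMA-F": 315, "DMA-G": 610,
}

labels, centroids = kmeans_1d(list(dma_demand.values()))
high = centroids.index(max(centroids))
suspicious = [name for name, lab in zip(dma_demand, labels) if lab == high]

print(suspicious)  # ['DMA-E', 'DMA-G']
```

Note that k-means always returns two groups here, even when every DMA is behaving normally, so the gap between centroids should be reviewed before any group is treated as suspicious.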
 

Potential Use of Artificial Intelligence for Outlier Detection

Both methods discussed above primarily rely on numerical reasoning to remove outliers, often neglecting contextual information related to the dataset. Consequently, prior to applying any statistical or machine learning outlier methodologies, data scientists must review literature, analyse sensor documents for accuracy, and consult experts to understand data limitations. This logic is then hard-coded, and outliers are removed based on contextual understanding – see Figure 3.

Figure 3: Traditional Workflow for Outlier Removal

Artificial intelligence, and large language models (LLMs) in particular, may offer a way to shorten this time-consuming process of extensive literature review. The rapid growth of LLMs has been well reported over recent years (e.g. ChatGPT, Microsoft Copilot, etc.), and such models are well suited to working with sequential text input. LLMs have now advanced to the point of taking text input and transforming it into code that wrangles data automatically (e.g. PandasAI).
In a recent study conducted by AtkinsRéalis, an LLM approach was tested on strain measurements recorded on an offshore wind turbine. Manufacturer documentation associated with sensor instrumentation, along with applicable maintenance information, was fed into a ChatGPT-3.5 model. This model successfully identified relevant text associated with data outliers, e.g. date intervals where maintenance was undertaken, and minimum / maximum values associated with sensor readings. The strain data was then cleaned automatically through AI-generated code, saving data scientists many hours of reviewing extensive literature and hard-coding such logic – see Figure 4.
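
The final cleaning step might look something like the sketch below, which is not the code used in the study. It assumes the LLM has been asked to return the extracted context as JSON; the schema, field names, dates and strain limits shown are hypothetical. The rules are checked before use, and no LLM is called in the example.

```python
import json
from datetime import date

# Hypothetical JSON, standing in for context extracted from documentation.
RULES_JSON = """
{
  "valid_range": {"min": -500.0, "max": 500.0},
  "maintenance_windows": [{"start": "2023-03-01", "end": "2023-03-03"}]
}
"""


def load_rules(text):
    """Parse and sanity-check the rules before they are applied to any data."""
    raw = json.loads(text)
    lo, hi = raw["valid_range"]["min"], raw["valid_range"]["max"]
    if lo >= hi:
        raise ValueError("valid_range min must be below max")
    windows = [(date.fromisoformat(w["start"]), date.fromisoformat(w["end"]))
               for w in raw["maintenance_windows"]]
    if any(start > end for start, end in windows):
        raise ValueError("maintenance window ends before it starts")
    return lo, hi, windows


def clean(readings, rules):
    """Split (ISO date, value) readings into kept and removed lists."""
    lo, hi, windows = rules
    kept, removed = [], []
    for day, value in readings:
        d = date.fromisoformat(day)
        in_maintenance = any(start <= d <= end for start, end in windows)
        out_of_range = not (lo <= value <= hi)
        (removed if in_maintenance or out_of_range else kept).append((day, value))
    return kept, removed


# Hypothetical daily strain readings (microstrain).
readings = [
    ("2023-02-27", 105.0), ("2023-03-01", 98.0), ("2023-03-02", 4000.0),
    ("2023-03-05", 110.0), ("2023-03-06", -900.0),
]

kept, removed = clean(readings, load_rules(RULES_JSON))
print(kept)     # [('2023-02-27', 105.0), ('2023-03-05', 110.0)]
print(removed)  # the two maintenance-window readings and the -900.0 value
```

Keeping the extracted rules as data rather than as generated code makes them easier for a domain expert to review before any readings are discarded.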
Nonetheless, before wide-scale uptake of an LLM approach, several challenges in the data science community and water industry need to be considered, including:
-    Computational expense: Training and deploying models like GPT-3.5 is resource-intensive;
-    Validation: Large language models have been reported to ‘hallucinate’, so their integration into critical workflows needs careful assessment; and
-    Data availability: Current data storage methods in the water sector must be evaluated to ensure contextual information associated with key measurements is easily identifiable.
Once the above challenges are addressed, there are exciting opportunities for the use of large language models within the water sector. Integrating these models with statistical / machine learning techniques could enhance data cleaning and outlier removal, enabling data scientists to focus more on developing models, uncovering hidden insights, and tackling some of the pressing challenges facing the UK water industry.

References

[1] Press, G. (2016). Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. [online] Forbes. Available at: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/.

 
