DR-Africa: Data Rescue Africa

DR-Africa will explore NLP models for automated observation transcription in the context of downstream applications such as climate change modelling.

Grant Value: £358k; Start: May 2024; End: March 2027

This project is funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT and the Centre for Machine Intelligence (CMI).

Investigators

PI Dr Stuart E. Middleton, University of Southampton, UK
CoI Prof Ed Hawkins MBE, NCAS, University of Reading, UK
CoI Prof Tim Osborn, University of East Anglia, UK

Researchers

Dr Gyanendro Loitongbam, University of Southampton, UK

Methodology

DR-Africa is exploring a range of approaches for automated observation transcription. We will (a) fine-tune and evaluate them on scanned images of historical measurement log books from Africa and (b) develop a hybrid model and compare its performance with that of the individual models. The goal is to advance the state of the art in automated observation transcription models and to release a set of open source tools that can be used by stakeholder partners in the environmental science domain (e.g. South African data rescue teams).

In year 1 we will adapt the academic state-of-the-art models for automated table cell detection and Optical Character Recognition (OCR) developed by PI Middleton under the NERC-funded GloSAT grant (NE/S015604/1) and tailor them for DR-Africa, specifically for new African datasets. In addition, commercial and open source multimodal Large Language Models (LLMs) will be investigated.

In year 2 we will review other open source solutions for OCR components and develop an automated observation transcription pipeline using a combination of manual table detection, handwriting-focused OCR and post-processing heuristics.
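
To illustrate what a post-processing heuristic for OCR output might look like, here is a minimal Python sketch. It is not the project's method: the function name, the character confusion table and the plausible value range are all assumptions made for this example. It repairs characters that are often misread as digits in numeric table cells and rejects values outside an expected range.

```python
from typing import Optional

# Hypothetical table of characters that OCR may confuse with digits (an assumption).
CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_numeric_cell(raw: str, lo: float, hi: float) -> Optional[float]:
    """Return the cell as a float, or None if it cannot be parsed or is implausible."""
    # Swap confusable characters for digits and treat a decimal comma as a point.
    text = raw.strip().translate(CONFUSIONS).replace(",", ".")
    try:
        value = float(text)
    except ValueError:
        return None
    # Range check: lo and hi are assumed bounds for the observed variable.
    return value if lo <= value <= hi else None


# Example row of OCR'd cells, with an assumed plausible range of -10 to 50.
row = ["2l.5", "I9,O", "abc", "99.0"]
cleaned = [clean_numeric_cell(cell, lo=-10.0, hi=50.0) for cell in row]
print(cleaned)  # [21.5, 19.0, None, None]
```

Cells returned as None could be flagged for manual review rather than discarded, so that a rescued record keeps a trace of every value that was rejected.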

In year 3 we will develop and evaluate a hybrid approach using models from both systems (developed in years 1 and 2 respectively). This evaluation will include some real data rescue from African datasets and an assessment of the quality of the rescued data in the context of a downstream application, climate change modelling.

Contact

Dr Stuart E. Middleton, University of Southampton, UK