Building a data toolkit for testing times

Data analytics and artificial intelligence (AI) are critical to the way our public sector customers solve business problems, today and in the future. They are often presented as a silver bullet for harnessing the power of the data we hold, promising the ability to understand and utilise information in ways previously unimaginable. While the scale of investment in AI and easy access to its computational power through the cloud have removed many barriers to entry, there are many considerations for our public sector customers to take on board before they can confidently count on AI to revolutionise their decision-making processes.

With this in mind, our 2021 new starter projects have focused on developing a toolkit to overcome some of the basic challenges that can derail the effective adoption of data analytics and AI. In addition to our graduate intake, Principle One was pleased to welcome Adam Read, about to begin the final year of a master's in Mathematics and Computer Science at Durham University, to complete a summer internship with us.

Adam was teamed with new starters Celine Daniel, Josh Alder and Harees Hussain to build a data toolkit that would provide the foundation for data analytics and could easily be used to explore a range of customer problems.

We’ve often found that there is an impetus to start using AI tools without first gaining a basic understanding of your data or the business questions that you are seeking to address. We therefore set up a number of short sprints, enabling the team to work iteratively, validating their assumptions, learning as they went and seeking to understand more about the data at each step.

Although it may feel like stating the obvious that you first need data, many projects set off with grand ambitions without having understood the raw materials they have to work with, and rapidly reach the conclusion that “data was really hard to get”. Recognising that one of the most time-consuming elements of any data analytics process is acquiring data, the team were tasked with developing a toolkit that would support the acquisition, validation and consolidation of data from different structured and unstructured sources, ready for analysis.

For each step of the approach, the team were tasked with identifying a toolset they could use and building up a catalogue that could be used across the wider Principle One team.

To focus their work, we selected a topic that had provoked a lot of office (and Teams!) chat over the summer. With many of the Principle One team looking forward to the prospect of overseas holidays again, could Adam and the team analyse the many commercial, government-approved COVID test providers and recommend the most cost-effective and reliable tests to their colleagues?

After investigating different tools, the team went back to basics, relying on Adam’s Python expertise to support the scraping, validation and cleaning of the data, automating much of the data preparation work an analyst would need to do at the start of an assignment. Having created a single, consolidated dataset, the next challenge was how best to use the range of analytical tools at their disposal. Without a true indicator of reliability in the data, the team had to consider a range of different tools that they could combine to create a risk indicator their colleagues could use to pick a test provider.
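The kind of data preparation described above can be sketched in a few lines of Python. This is only an illustration, not the team's code: the records, field names and validation rules are all hypothetical, and a real register would need far more careful matching of duplicate providers.

```python
import re

# Hypothetical raw records, as they might look after scraping (invented data).
RAW_ROWS = [
    {"name": " Acme Testing Ltd ", "price": "£45.00", "email": "info@acmetesting.example"},
    {"name": "QuickSwab", "price": "from 20", "email": "not-an-email"},
    {"name": "acme testing ltd", "price": "£45", "email": "info@acmetesting.example"},
]

# Deliberately loose check: something@something.something, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalise_name(raw):
    """Trim, collapse whitespace and lower-case so duplicates can be matched."""
    return " ".join(raw.split()).lower()


def parse_price(raw):
    """Return the first number found in the text as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def clean_record(row):
    """Validate and standardise one scraped row."""
    return {
        "name": normalise_name(row["name"]),
        "price": parse_price(row["price"]),
        "email_valid": bool(EMAIL_PATTERN.match(row["email"].strip())),
    }


def consolidate(rows):
    """Clean every row and keep the first record seen for each provider name."""
    merged = {}
    for row in rows:
        record = clean_record(row)
        merged.setdefault(record["name"], record)
    return list(merged.values())


providers = consolidate(RAW_ROWS)
```

Here the two spellings of the same provider collapse into one record, and the invalid email address is flagged rather than dropped, so it can later serve as an indicator in its own right.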

Working with Microsoft Azure’s analytical toolset, the team applied a clustering algorithm to a range of indicators derived from the data: validity of contact details, name changes, price differences from those published, TrustPilot scores and other data anomalies identified from the register.
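To give a feel for what clustering on such indicators involves, here is a minimal sketch using plain k-means written in standard Python. It is a stand-in, not the Azure tooling the team used, and the article does not say which clustering algorithm was chosen. The provider names and indicator values are invented, and each indicator is assumed to be scaled to the range 0 to 1.

```python
import math

# Hypothetical indicator vectors per provider, each scaled to 0..1:
# (contact details invalid, name changed, price gap vs published price)
INDICATORS = {
    "provider_a": (0.0, 0.0, 0.05),
    "provider_b": (0.0, 0.0, 0.10),
    "provider_c": (1.0, 1.0, 0.80),
    "provider_d": (1.0, 0.0, 0.90),
}


def mean_point(points):
    """Average a list of equal-length tuples, dimension by dimension."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def k_means(points, centroids, max_rounds=20):
    """Assign points to the nearest centroid, recompute centroids, repeat."""
    for _ in range(max_rounds):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its previous centroid.
            updated.append(mean_point(members) if members else old)
        if updated == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = updated
    return labels, centroids


names = list(INDICATORS)
points = [INDICATORS[n] for n in names]
# Seed with the first and last points to keep the sketch deterministic.
labels, centroids = k_means(points, [points[0], points[-1]])
clusters = dict(zip(names, labels))
```

With this toy data the two providers with clean records fall into one cluster and the two with anomalies into the other. Which cluster counts as "risky" is still a judgement for the analyst, since the algorithm only groups similar providers.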

This was supplemented by using TextAnalytics to identify sentiment from keywords in published reviews of the different companies, visualised simply with a word cloud, which possibly shows that COVID test providers are generally only reviewed when something goes wrong!
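The idea behind a word cloud and keyword-based sentiment can be illustrated with a small standard-library sketch. It does not call the TextAnalytics service and is far cruder than a trained sentiment model; the review snippets and word lists below are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical review snippets and word lists, invented for illustration.
REVIEWS = [
    "Results arrived late and the refund was refused",
    "Late results, no reply to emails",
    "Fast results and helpful staff",
]
STOP_WORDS = {"and", "the", "was", "to", "no"}
NEGATIVE_WORDS = {"late", "refused", "refund"}
POSITIVE_WORDS = {"fast", "helpful"}


def tokenise(text):
    """Lower-case the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def keyword_counts(reviews):
    """Count keywords across all reviews; a word cloud sizes words by these counts."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenise(review))
    return counts


def sentiment(review):
    """Positive keyword hits minus negative hits; below zero reads as negative."""
    words = tokenise(review)
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return positive - negative


counts = keyword_counts(REVIEWS)
scores = [sentiment(r) for r in REVIEWS]
```

The counts are the input a word cloud renderer would use to size each word, and the per-review scores give a rough signal that could be averaged per provider.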

So, with Adam looking forward to the end of his internship and some holiday time in Europe before returning to his studies, could we help recommend the right company to use for his COVID tests? Despite the lack of a clear indicator of reliability, the team found a strong correlation across a number of factors that suggested whether a provider would be reliable, and were able to develop a confidence score for each test provider. However, it should not be forgotten that correlation is not causation, and a single statistic cannot explain a complex and fast-changing market. Nevertheless, with limited time and sometimes unreliable data, the team succeeded in providing a simple tool to help colleagues quickly assess which provider they should pick before planning a trip.
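One simple way to turn several indicators into a single confidence score is a weighted sum, sketched below. The article does not describe how the team combined their factors, so the indicator names, the weights and the 0 to 100 scale are all assumptions made for illustration.

```python
# Hypothetical weights; the article does not say how the factors were combined.
# They sum to 1 so that the combined risk stays between 0 and 1.
WEIGHTS = {
    "contact_invalid": 0.3,
    "name_changed": 0.2,
    "price_gap": 0.3,
    "poor_reviews": 0.2,
}


def clamp(value):
    """Keep an indicator within the 0..1 range."""
    return min(max(value, 0.0), 1.0)


def confidence_score(indicators):
    """Map risk indicators (each 0..1, higher is worse) to a 0..100 confidence score."""
    risk = sum(WEIGHTS[name] * clamp(indicators.get(name, 0.0)) for name in WEIGHTS)
    return round(100 * (1 - risk))


# Invented example providers.
clean_provider = {"contact_invalid": 0, "name_changed": 0, "price_gap": 0.0, "poor_reviews": 0.0}
risky_provider = {"contact_invalid": 1, "name_changed": 1, "price_gap": 0.5, "poor_reviews": 1.0}
```

In this sketch `confidence_score(clean_provider)` gives 100 and `confidence_score(risky_provider)` gives 15. As the caveat above makes clear, a score like this is a prompt for closer inspection rather than a verdict on any provider.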

Despite the limits of the data we chose for testing the toolkit developed over the summer, the project shared enough of the challenges of applying AI in the real world to provide a broad set of tools that we will be able to reuse across many similar customer problems. All our new starters gained a valuable grounding in the techniques and pitfalls of data analytics, and we enjoyed seeing Adam both challenge himself to learn new skills and share his existing knowledge with our wider team.

For more information about opportunities for graduates and interns at Principle One, visit Join us | Principle One or email your CV to