
Could Large Language Models help policing – or are they all talk?

While the concept of language modelling has existed for many years, it was OpenAI’s release of ChatGPT in November 2022 that captured the tech world’s attention – and generated a mixture of awe, excitement, curiosity and a healthy dose of scepticism from the media and public alike. Ever since, LLM software has been proposed as a potential solution to a broad range of business challenges – including many within policing.

 

So what are Large Language Models?

 

Large Language Models (LLMs) are deep-learning AI systems trained on vast amounts of text data. This training allows the software to generate language, draw on relevant information and present answers back to the user in a conversational manner. It can even adjust the style of its answer to suit the requirements of the user.

 

Every year, Principle One welcomes summer interns for ten weeks to explore emerging ideas, concepts and technologies and how they could impact customers. This summer, we were joined by Farah Suleman, who is studying for a Master’s in Biochemistry at University College London, and Erin Hennessy, who has just completed a degree in Criminology with Psychology at Lancaster University.



With support from Principle One’s consultants, we sought to test whether Large Language Models can really assist policing – and if so, how? We started by considering four different areas where LLMs could impact policing.

For the intern project we decided to focus on ‘AI for intelligence and investigation – to improve operational outcomes’. Our reasoning was that we know the police collect more information about victims, witnesses and suspects than they can feasibly manage. This data is consolidated into crime reports, without necessarily being structured in a way that makes it easy to search and identify connections. Many of these reports will be linked through recurring information about suspects, patterns of offending, locations or specific methodologies, although the connections may not be immediately obvious: the dots that need to be connected are buried within the large volume of unstructured ‘free text’ that sits in each report.

 

Our investigative work was underpinned by four key questions:

 

  • Can LLMs identify basic and even complex links between reports?

  • Can we trust the accuracy and reliability of LLMs?

  • How well can LLMs handle large volumes of police data?

  • Will our initial testing justify testing on real police data?

 

The first challenge that policing always faces when testing AI technology is getting access to a realistic dataset to experiment with. In the case of LLMs, that means largely unstructured datasets like intelligence reports and crime reports, so the first task we set our LLMs (ChatGPT and Gemini) was the creation of synthetic versions of crime reports to provide a dataset for exploration. We could then weave links and associations manually into the reports to test how effectively LLMs could identify them and create knowledge graphs – replicating the work that would be undertaken by investigators and analysts working through different reports to identify links. Some of the links were obvious, where suspects had been named; others were less obvious and less certain, where links existed only through descriptions or partial identifiers of vehicles.
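To give a flavour of that generation step, here is a minimal sketch assuming the OpenAI Python API rather than the chat interface we actually worked through; the model name and prompt wording are illustrative, not the exact prompts we used.

```python
# Minimal sketch: generate one synthetic crime report with a seeded link.
# Model name and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": (
            "Write a realistic but entirely fictional UK police crime report "
            "describing a burglary. Mention a partial vehicle registration "
            "beginning 'KX12' so it can later be linked to other reports."
        ),
    }],
)
print(response.choices[0].message.content)
```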

 

Once we had our ‘ground truth’ of the dataset, we could start to see how effective the LLMs were at reverse-engineering the data to create the links and associations.

 

We took an iterative approach working through a cycle of Research, Build and Test, reviewing the results at the end of each phase and revising our approach based on our findings. We reviewed a range of LLM tools to trial as part of the research phase and adopted a Retrieval Augmented Generation (RAG) approach.

 

Retrieval-Augmented Generation (RAG) is a method that combines two steps to improve how AI models generate answers. First, it retrieves relevant information from a database or document collection, like finding pages from a book that might have the answer. Then, it generates a response based on both the retrieved information and the question asked.
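As a rough illustration, the retrieve-then-generate pattern can be sketched in a few lines of Python. The `embed` helper and `index` store below are placeholders named for illustration, not a specific product, and the model name is an example.

```python
# Retrieve-then-generate sketch (illustrative only).
# `embed(text)` returns a vector and `index` is a pre-built vector store
# of report chunks - both are placeholders, not real components we name here.
from openai import OpenAI

client = OpenAI()

def answer(question: str, index, embed, top_k: int = 5) -> str:
    # Step 1: retrieve - find the report chunks most similar to the question.
    query_vector = embed(question)
    retrieved_chunks = index.search(query_vector, top_k)

    # Step 2: generate - pass the retrieved text to the model with the question.
    context = "\n\n".join(chunk.text for chunk in retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the supplied reports."},
            {"role": "user", "content": f"Reports:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```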

 

We also considered how best to visualise the outputs of the model and settled on knowledge graphs as the best way to represent the connections within the data. We used ChatGPT due to its compatibility with Microsoft Azure and the relative ease with which data could be transferred into a separate open-source visualisation product called Neo4j, using Python to manage the extraction, transformation and loading of data, as shown below.
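The load step of that pipeline can be pictured with a simplified sketch along the following lines. It assumes the LLM has already returned POLE entities and relationships as JSON; the connection details, labels and field names are illustrative rather than our exact implementation.

```python
# Simplified load step: push LLM-extracted POLE data into Neo4j.
# Connection details and field names are illustrative, not a fixed schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_graph(extracted: dict) -> None:
    with driver.session() as session:
        # One node per extracted entity (Person, Object, Location, Event).
        for entity in extracted["entities"]:
            session.run(
                f"MERGE (n:{entity['type']} {{id: $id}}) SET n.name = $name",
                id=entity["id"],
                name=entity["name"],
            )
        # One relationship per connection the LLM identified.
        for rel in extracted["relationships"]:
            session.run(
                "MATCH (a {id: $source}), (b {id: $target}) "
                f"MERGE (a)-[:{rel['type']}]->(b)",
                source=rel["source"],
                target=rel["target"],
            )
```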

We instructed the LLM as follows.

 

'You are a helpful assistant to an intelligence analyst working within the police, you want to do a very thorough job. You need to read a set of reports describing incidents that police officers have responded to as well as various intelligence reports. Each report is delimited by triple quotes.

 

Your job is to extract all people, locations, events, objects, and then make connections for a knowledge graph. 

 

The information needs to be extracted in a JSON format using the JSON schema provided.'

 

The JSON schema we created told the LLM to output the data using POLE (Person, Object, Location, Event), a common standard for investigative analysis. The schema described what a person, object, location and event was, as well as what connections across these would look like.
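For illustration, a cut-down schema of that shape might look something like the following, expressed here as a Python dictionary; the exact fields in our schema were more detailed.

```python
# Cut-down illustration of a POLE-style JSON schema (fields are examples only).
pole_schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "type": {"type": "string",
                             "enum": ["Person", "Object", "Location", "Event"]},
                    "name": {"type": "string"},
                },
                "required": ["id", "type", "name"],
            },
        },
        "relationships": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "source": {"type": "string"},
                    "target": {"type": "string"},
                    "type": {"type": "string"},
                },
                "required": ["source", "target", "type"],
            },
        },
    },
    "required": ["entities", "relationships"],
}
```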

 

The first test was to see what associations could be identified in the reports without clear prompts from the user. Without a starting point, the software became somewhat overwhelmed and whilst it did recognise some links between crimes and suspects, these were quite basic. As we knew how we had seeded the data with various connections it was also easy for us to identify ‘false negatives’ – where the LLM had missed the links entirely. To address this, we directed the LLM to adopt a more iterative approach and identify similar reports to start with, and then create the POLE data from these reports. We called this a ‘Similarity Search’.
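Conceptually, the ‘Similarity Search’ step looks something like the sketch below, assuming OpenAI embeddings and a simple cosine-similarity comparison; the model name and threshold are illustrative choices rather than tuned values.

```python
# Sketch of a 'Similarity Search': embed each report, then pair up reports
# whose embeddings are close before extracting POLE data from them.
# Model name and similarity threshold are illustrative choices.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_reports(reports: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=reports)
    return np.array([item.embedding for item in response.data])

def similar_pairs(reports: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    vectors = embed_reports(reports)
    # Normalise so the dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = vectors @ vectors.T
    pairs = []
    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            if similarity[i, j] >= threshold:
                pairs.append((i, j))
    return pairs
```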

 

Following this simple change, we asked the software to visualise connections to a person called Christopher Hunt and this time the algorithm rapidly linked the person to the five offences we had seeded into the reports.

It also pulled out details of the victims and the locations of the offences. We reviewed this output against the manually produced graph and were able to validate that the content and analysis were accurate.

Once we got to grips with the limits of the software when used without any direction, we were very rapidly able to create simple interactive visualisations, improving the reliability and consistency of the LLM with each developmental iteration. Our understanding of what was going on under the covers was an essential part of being able to tune the model – highlighting the risks of relying on ‘black box’ AI!

 

As an accelerator for analysts, it was easy to see the benefits – we could begin exploring the knowledge graph after less than a minute of processing – whereas extracting these links manually to create the visualisation would take far longer. However, it was important to then use the knowledge graph interface to drill down, check results and also consider what the connections actually meant – going back to the richer explanation within the original crime report.
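That drill-down can be as simple as a short Cypher query run through the same Python driver. In the sketch below, the person’s name comes from our synthetic data and the two-hop limit and connection details are arbitrary examples.

```python
# Example drill-down: pull everything within two hops of a named person.
# Connection details, the name and the hop limit are illustrative.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Person {name: $name})-[*1..2]-(connected) RETURN connected",
        name="Christopher Hunt",
    )
    for record in result:
        print(record["connected"])
```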

 

Transferring a tool like this to the real world presents further challenges. We worked with a relatively small synthetic data set which was consistent in its spellings and accuracy, even with some partial matches introduced for testing. Testing on real or more representative crime reports introduces many more challenges – misspellings, less accurate or precise descriptions, partial matches and inconsistencies. Scaling this up for further testing will eventually require access to real data – however effective we are at working with our AI to create synthetic versions, we can’t be certain we can replicate the full crime report experience.

 

What’s even more challenging, however, is working with a ‘black box’, and one that has effectively been designed to be a ‘people pleaser’ with its responses. Our LLM was ‘learning’ as it went, and we had to work hard on our prompts to ensure we were getting the right information returned each time, being mindful that some results could be missed as the data volumes increased.

 

In just ten weeks, Erin and Farah had created a strong foundation for testing LLMs – and learnt a lot about experimentation, failing fast and learning from each new iteration. What this work also showed is the importance of starting slowly and making sure there is still a human in the loop. While it may be tempting to go out and buy a piece of ‘plug and play’ technology, our interns’ experience has taught us that small, well-informed steps are likely to produce a better outcome for policing in the long run.


If you are interested in applying for an internship at Principle One, please send us a CV and cover letter outlining your interest to careers@principleone.co.uk.

 

 

 

 
