The role of the Digital Forensics Unit (DFU) has never been more critical to timely and effective investigations – in particular in the context of crimes such as rape, child abuse and exploitation. Due to the nature of these crimes and the need to rapidly interrogate data held on personal devices, officers are spending an ever-increasing proportion of their time exploring the large data sets extracted from devices – either seized in the course of an investigation or provided by a victim. This explosion in data volumes has meant that the DFU can become the bottleneck in an investigation, with long lead times to complete data extraction and analysis, and there is a pressing need for innovation in ways of working.
However, while there is scope to look at applying technology to automate many of the tasks currently undertaken manually, any solution must enable rather than replace the tradecraft and judgement that can only come from the investigators themselves.
The challenges and constraints faced in the DFU, with an ever-expanding workload, made it an excellent focus for upskilling three of our 2021 graduate intake – Celine Daniel, Josh Alder and Harees Hussain – allowing them to strengthen their technical skills and adopt agile ways of working while learning about the challenges faced by our customers.
One of the more manual processes that felt like a good candidate for automation was the review of browser history obtained from a device, examining each URL to determine whether it could yield any useful information. In the case of serious crimes such as rape or child abuse and exploitation, the time required to sift through internet history is considerable.
From research completed as part of our business analysis process, it became clear that there were many instances where an inability to complete a triage of web history at pace had delayed arrest or conviction, putting further lives at risk or prolonging the pain and uncertainty of the investigation for victims and their families. Working with domain experts within the Principle One team and testing out ideas with existing customers, the team mapped out the current manual process for URL extraction and analysis, developing end user scenarios to understand a current day in the life of a DFU officer.
Using this initial analysis, they were tasked with responding to a ‘problem on a page’: automating the process of opening and screenshotting a list of extracted URLs. This would enable officers to scroll through a gallery of images rather than cutting and pasting each URL, waiting for every website to open, and determining there and then what may merit further investigation.
With that objective in mind, the team’s initial sprint focused on designing a simple productivity tool exploiting open source Python libraries to create a low-cost, easily deployable proof-of-concept solution under the working title of RELAY.
The initial functionality focused on automation – opening each URL extracted from the device’s browsing history and capturing a screenshot to enable investigators to rapidly browse the data to understand the content and triage it for further investigation. By automating this process and effectively removing the need for officers to cut and paste each URL into their browser, the initial triage time could be significantly reduced.
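As a sketch of how this screenshot-automation step might look – the article names open source Python libraries but not which ones, so the Selenium-style driver interface, function names and file layout below are all assumptions rather than the actual RELAY code:

```python
# Hypothetical sketch: visit each extracted URL and save a screenshot.
# `driver` is any object exposing Selenium-style get(url) and
# save_screenshot(path) methods, so a headless browser (or a test stub)
# can be injected.
import re
from pathlib import Path


def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe screenshot filename."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return safe[:100] + ".png"


def capture_screenshots(urls, driver, out_dir="screenshots"):
    """Open each URL in turn and save a screenshot, returning the paths."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    saved = []
    for url in urls:
        driver.get(url)                      # load the page
        path = out / url_to_filename(url)
        driver.save_screenshot(str(path))    # capture it for the gallery
        saved.append(path)
    return saved
```

In practice the driver would be something like a headless Chrome instance from the Selenium library; injecting it as a parameter keeps the triage logic testable without a browser.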
Expanding on this capability, the automation process was then extended with the ability to ‘sift’ out websites, removing URLs that, in the officer’s judgement, are unlikely to be relevant to the investigation. This allowed some of the sites that many of us visit on a day-to-day basis (news sites, Netflix, supermarkets) to be excluded, reducing the volume of data to be processed by the tool and the officer.
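A minimal sketch of that sifting step, assuming exclusion works from a simple officer-maintained domain blocklist – the listed domains and function names are illustrative only:

```python
# Illustrative sketch: drop URLs whose domain the officer has excluded.
from urllib.parse import urlparse

# Example blocklist entries, not RELAY's actual list.
EXCLUDED_DOMAINS = {"netflix.com", "bbc.co.uk", "tesco.com"}


def sift(urls, excluded=EXCLUDED_DOMAINS):
    """Return only the URLs whose host is not on the exclusion list."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        # match the excluded domain itself or any subdomain of it
        if any(host == d or host.endswith("." + d) for d in excluded):
            continue
        kept.append(url)
    return kept
```

Keeping the blocklist as plain data means the officer stays in control of what is excluded – the tool applies the judgement call, it does not make it.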
The next step was to consider what additional analysis tools could be developed to support the investigator. Using web scraping capabilities, it was possible to automate the extraction of common themes across websites, enabling further analysis of unusual or repeated search patterns. By capturing key nouns, verbs and adjectives from the text and titles, and extracting tags from video content on YouTube, the team could rapidly extract key information and create easy-to-understand charts and visualisations.
Sticking with the constraint of using low-cost and easily available tools, the team used Microsoft’s Power BI to test out different approaches to visualising the data, using word clouds and simple frequency charts that combined to enable the officer to search for keywords or terms relevant to the investigation.
To keep the outputs associated with the source material, a traceability feature was added, enabling the officer to rapidly trace back from a keyword identified in the data to the source websites and investigate further. Throughout, the focus has been on enabling faster analysis of the data while never using anything that could be considered artificial intelligence or computer-aided decision making to replace the officer’s understanding of the investigation and the human tradecraft that can only be derived from experience.
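One way the traceability idea could be sketched is as an inverted index from each extracted keyword back to its source URLs – the data shapes and names here are assumptions, not the actual implementation:

```python
# Hypothetical sketch: index keywords back to the pages they came from,
# so a term spotted in a word cloud can be traced to its source websites.
from collections import defaultdict


def build_keyword_index(page_keywords):
    """Map keyword -> set of source URLs.

    `page_keywords` is an iterable of (url, keywords) pairs produced by
    the extraction step.
    """
    index = defaultdict(set)
    for url, keywords in page_keywords:
        for kw in keywords:
            index[kw].add(url)
    return index
```

With this index, clicking a term in a visualisation can resolve straight to the list of pages to re-examine.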
For Celine, Josh and Harees, developing the RELAY toolset has helped them consolidate their skills in a number of ways: from getting to grips with agile working to learning how to start from a ‘problem on a page’ and work through it with their Product Owner. Although the initial problem focused on automation, it is through the team's further analysis and innovative use of data visualisation tools that the RELAY toolset can add even more value. By taking an iterative approach to development, they progressed from a simple tool to accelerate triage of relevant websites to a suite of productivity tools that streamline analysis and identification of common themes in the data – helping find the needles in the digital haystack.