
Leveraging Retrieval Augmented Generation (RAG) to Analyze Crash Reports Narratives

Project Description

Crash reports are a vital source of information for understanding road crashes, devising prevention strategies, and informing policy. However, the coded fields on these reports often lack the detail needed for comprehensive analysis of pedestrian and bicyclist crashes. The structured data typically omit nuances that appear in the narrative section describing the circumstances of a crash. Information such as the unhoused status of a pedestrian, a detailed account of the vehicle’s movement before it struck a pedestrian, a witness’s description of a speeding vehicle’s pre-crash behavior, and the conditions of a hit-and-run crash may be embedded in the narrative but remain unrecorded in the structured fields of the report form. Extracting this implicit information poses a significant challenge for traditional analysis methods. Retrieval Augmented Generation (RAG) employs an embedding model to scan extensive text for segments that are similar to a query, which here asks about the presence of a vulnerability factor or demographic context. Once the relevant segments are identified, a Large Language Model (LLM) analyzes the query together with the retrieved context. In this project, the LLM confirms whether the factor is present and extracts the pertinent information. This study will explore the ability of RAG to identify crash characteristics found only in crash report narratives, using crash reports from California.
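The retrieval step described above can be illustrated with a minimal Python sketch. It is not the project's implementation. A bag-of-words count vector stands in for the embedding model, the LLM call is omitted (the sketch stops at assembling the prompt), the narrative text is invented for illustration, and all function names (`embed`, `cosine`, `split_segments`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_segments(narrative):
    """Split a narrative into sentence-level segments."""
    parts = re.split(r"(?<=[.!?])\s+", narrative)
    return [p.strip() for p in parts if p.strip()]


def retrieve(query, narrative, top_k=2):
    """Return up to top_k segments ranked by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(seg)), seg) for seg in split_segments(narrative)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for score, seg in scored[:top_k] if score > 0]


def build_prompt(query, segments):
    """Combine the query and retrieved context into one prompt for an LLM."""
    context = "\n".join(f"- {seg}" for seg in segments)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer yes or no, then quote the supporting text."
    )


# Invented narrative for illustration only; not taken from a real crash report.
narrative = (
    "Vehicle 1 was traveling northbound on Main Street. "
    "The driver struck a pedestrian in the crosswalk and fled the scene "
    "without stopping. "
    "A witness stated the vehicle was speeding before the collision."
)
query = "Did the driver flee the scene?"

segments = retrieve(query, narrative, top_k=1)
prompt = build_prompt(query, segments)
print(prompt)  # In a full pipeline, this prompt would be sent to an LLM.
```

In this toy version, common words such as "the" contribute to the similarity score and the query word "flee" does not match "fled" in the narrative. A learned embedding model is meant to overcome both limits by matching on meaning rather than on exact word overlap.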

Outputs

The project will produce a policy brief narrative for distribution by CPBS and a general public outreach article for distribution by SafeTREC. It will also produce a final report and a public GitHub repository containing the Python code for the proof-of-concept software that analyzes crash report narratives. The team will also prepare an academic research paper to be submitted for presentation or for publication in a transportation journal such as Transportation Research Record.

Outcomes / Impacts

This project will provide a proof of concept for a method that could greatly improve the ability of transportation safety engineers, planners, and researchers to review crash narratives efficiently and glean information that the coded fields do not capture. Because the transportation system changes faster than crash report forms can be updated, this method will allow those working on the most pressing safety issues to make informed decisions and respond quickly to new safety challenges.

Dates

06/01/2024 to 05/31/2025

Universities

University of California at Berkeley

Principal Investigator

Julia Griswold

University of California at Berkeley

juliagris@berkeley.edu

ORCID: 0000-0002-1125-3316

Research Project Funding

Federal: $115,020

Contract Number

69A3552348336

Project Number

24UCB03

Research Priority

Promoting Safety
