
How Good is GPT-4 at Doc Review?

A look at Sidley's assessment of GPT-4's document review skills.

Ofer Bleiweiss

[A review of a recent article "Replacing Attorney Review? Sidley's Experimental Assessment of GPT-4’s Performance in Document Review," by Colleen M. Kenney, Matt S. Jackson, and, good friend of the 'Chron, Robert D. Keeling. The original article can be found here.]

Our latest post dives into the world of artificial intelligence advancements in e-discovery, specifically focusing on the use of LLMs for document review. ChatGPT's impressive capabilities have captured the attention of legal professionals, and its subsequent iteration, GPT-4, promises even more profound implications for the future of law.

The article presents findings from an experiment assessing the use of GPT-4 to conduct a document review. The collaboration between the law firm Sidley and Relativity tested GPT-4's ability to sift through documents and code them for responsiveness, a classic first-level review task traditionally performed by human attorneys (in some cases, with the help of TAR). The experiment had two phases: an initial pass in which GPT-4 coded documents based on the same instructions provided to the attorney reviewers who performed the original review, and a second pass in which those instructions were refined with feedback from the initial pass, akin to the QC step in a traditional review.
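To make the first-pass workflow concrete, here is a minimal sketch of how a document's extracted text might be sent to a model along with the review instructions and the reply parsed into a coding decision. All names here (`REVIEW_INSTRUCTIONS`, `call_model`, the prompt wording) are illustrative assumptions, not details from the Sidley experiment, and `call_model` is a stand-in for a real GPT-4 API call.

```python
# Hypothetical first-pass responsiveness coding. The instructions below are
# placeholders standing in for whatever the original attorney reviewers
# received; they are not from the article.
REVIEW_INSTRUCTIONS = (
    "You are reviewing documents for responsiveness to a document request. "
    "Answer RESPONSIVE or NON-RESPONSIVE on the first line, then give a "
    "one-sentence reason."
)

def build_prompt(doc_text: str) -> str:
    """Combine the shared review instructions with one document's text."""
    return f"{REVIEW_INSTRUCTIONS}\n\nDocument text:\n{doc_text}"

def call_model(prompt: str) -> str:
    # Stand-in for a real GPT-4 API call; swap in your client of choice.
    return "RESPONSIVE - the document discusses the contract at issue."

def parse_call(reply: str) -> str:
    """Map the model's free-text reply to a coding decision."""
    first_line = reply.strip().splitlines()[0].upper()
    return "non-responsive" if "NON-RESPONSIVE" in first_line else "responsive"

def code_document(doc_text: str) -> str:
    """One document in, one responsiveness call out."""
    return parse_call(call_model(build_prompt(doc_text)))
```

The second pass would amount to editing `REVIEW_INSTRUCTIONS` based on QC feedback and re-running the same loop.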

GPT-4's performance was notable, reaching an overall accuracy of roughly 70%. When the AI was most confident, its predictions were strong: an 85% success rate in tagging non-responsive documents and 84% in tagging responsive ones. However, its performance dipped on documents needing nuanced contextual understanding, specifically: "documents that were part of a responsive family, documents that contained short responsive messages, or documents that had responsive attachments." The researchers were not surprised by this result, given that GPT-4 was only looking at the four corners of each document (i.e., solely at the document's extracted text) when making its coding decision. They noted that there could be ways to improve its performance by providing additional context.

While GPT-4 coded documents much faster than human reviewers (at a rate of approximately one document per second), it was slower than TAR tools, pointing to another area for future improvement. Nonetheless, the study notes two aspects where GPT-4 shows meaningful promise over traditional TAR:

"While TAR tools score documents based on the decisions of reviewers on so-called 'training' documents, GPT-4 performs independent evaluations of each document against the prompt. Based on our review of the results, this leads to far more consistency in coding than we see traditionally with human review. It also leads to a cleaner and more efficient QC process that would likely change traditional review workflows. Specifically, GPT-4 can generate explanations about its responsiveness determination, which will likely expedite the QC process by allowing a human reviewer to more quickly confirm that a document is responsive."

In conclusion, the Sidley experiment offers a tangible look at GPT-4's current and potential impact on document review. While AI may not replace human legal review entirely, it can significantly augment the process. For those in the legal industry, this experiment provides a glimpse into a future where AI and human expertise collaborate for a more efficient and reliable document review process.

[This is (very intentionally) an AI-human collaboration. Let us know if you found it helpful.]
