Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Sawicki, Piotr, Grzes, Marek, Brown, Dan, Goes, Fabricio (2025) Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics ISBN 979-8-89176-332-6. (doi:10.18653/v1/2025.emnlp-main.1625) (KAR id:112727)

Document version:	Author's Accepted Manuscript (PDF)
Language:	English
License:	Creative Commons Attribution 4.0 International License
Official URL: https://aclanthology.org/2025.emnlp-main.1625/

Abstract

This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
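The abstract names two ingredients: forced-choice ranking within small, randomized batches, and Spearman's Rank Correlation (SRC) against a ground truth. The sketch below illustrates how these could fit together. It is not the authors' implementation. The function names, the batch size of 5, the number of rounds, and the aggregation rule (mean within-batch position) are all assumptions made for illustration, and `noisy_judge` is a hypothetical stand-in for an LLM call.

```python
import random
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def batched_ranking_scores(items, rank_batch, batch_size=5, rounds=20, seed=0):
    """Shuffle items into small batches, ask a judge to rank each batch
    (best first), and average each item's within-batch score.
    The aggregation rule here is an assumption, not taken from the paper."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(rounds):
        shuffled = list(items)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            ordered = rank_batch(batch)
            for position, item in enumerate(ordered):
                totals[item] += len(batch) - position  # best item scores highest
                counts[item] += 1
    return {item: totals[item] / counts[item] for item in items}


# Toy data: 90 item ids with a made-up hidden quality (NOT the paper's dataset).
poems = list(range(90))
true_quality = {p: p for p in poems}
judge_rng = random.Random(1)


def noisy_judge(batch):
    """Hypothetical stand-in for an LLM: ranks by quality plus Gaussian noise."""
    return sorted(batch, key=lambda p: true_quality[p] + judge_rng.gauss(0, 10),
                  reverse=True)


scores = batched_ranking_scores(poems, noisy_judge)
rho = spearman([scores[p] for p in poems], [true_quality[p] for p in poems])
print(f"SRC between aggregated scores and toy ground truth: {rho:.2f}")
```

Forcing a ranking inside each batch means the judge never assigns an absolute score, only a relative order, and repeating the process over reshuffled batches lets each item be compared against many different neighbours. In a real setting `noisy_judge` would be replaced by a call that sends the batch to a model and parses the returned order.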

Item Type:	Conference proceeding
DOI/Identification number:	10.18653/v1/2025.emnlp-main.1625
Uncontrolled keywords:	Computational Creativity, Large Language Models, Poetry Evaluation, Natural Language Processing, Consensual Assessment Technique, Automated Assessment, Human vs AI Evaluation, Generative AI, GPT-4, Claude-3
Subjects:	Q Science > Q Science (General) > Q335 Artificial intelligence
Institutional Unit:	Schools > School of Computing
Funders:	University of Kent (https://ror.org/00xkeyj56)
Depositing User:	Piotr Sawicki
Date Deposited:	14 Jan 2026 14:03 UTC
Last Modified:	21 Jan 2026 03:47 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/112727

University of Kent Author Information

Sawicki, Piotr.

Creator's ORCID:	https://orcid.org/0009-0004-0973-4892

Grzes, Marek.

Creator's ORCID:	https://orcid.org/0000-0003-4901-1539