Kent Academic Repository

The Digital Muse: Advancing LLM-Based Methods for Poetry Generation and Automated Evaluation

Sawicki, Piotr (2026) The Digital Muse: Advancing LLM-Based Methods for Poetry Generation and Automated Evaluation. Doctor of Philosophy (PhD) thesis, University of Kent. (KAR id:112926)

PDF (English, 2MB): 74sawicki2026phdfinal.pdf

Abstract

This thesis investigates the application of Large Language Models (LLMs) to poetry generation and evaluation, chronicling methodological advancements during a period of unprecedented technological development in artificial intelligence (2021-2025). Through a series of interconnected studies spanning multiple generations of language models, from GPT-2 to Claude-3-Opus and GPT-4o, we develop frameworks for style-controlled poetry generation and automated evaluation that document both specific technical implementations and durable conceptual contributions.

Our research begins by examining the challenges of fine-tuning GPT-2 models to generate poetry in the style of specific Romantic-era poets, highlighting the importance of guarding against memorization and developing multi-faceted evaluation approaches. We then extend these methodologies to GPT-3, demonstrating the effectiveness of structured prompt-completion pairs for generating poetry with controlled content while preserving stylistic elements. Subsequent investigations reveal the limitations of zero-shot and many-shot prompting with early GPT-3.5-turbo and GPT-4 models, emphasizing the continued importance of fine-tuning for specialized stylistic tasks at that technological stage.
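
For illustration only, the following minimal Python sketch shows one plausible way structured summary-poem pairs of the kind described above could be assembled into prompt-completion training records for fine-tuning. The field names, the style tag, the stop marker, and the output file are assumptions made for this example; they are not taken from the thesis.

    # Illustrative sketch only: assembling structured prompt-completion pairs
    # for fine-tuning, as the abstract describes. Field names and formatting
    # are assumptions, not the thesis's actual data layout.
    import json

    examples = [
        {
            "summary": "A traveller pauses by woods filling with snow at nightfall.",
            "style": "Romantic-era, rhymed quatrains",
            "poem": "(full poem text)",
        },
    ]

    with open("poetry_finetune.jsonl", "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                # Prompt: a prose summary plus a style cue, so content can be
                # controlled at generation time while style stays consistent.
                "prompt": f"Style: {ex['style']}\nSummary: {ex['summary']}\nPoem:",
                # Completion: the target poem, with a stop marker appended.
                "completion": " " + ex["poem"] + "\n###",
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")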

In later chapters, we shift focus to the evaluation challenge, developing a methodology inspired by the Consensual Assessment Technique (CAT) that leverages state-of-the-art LLMs as judges of poetic quality. We demonstrate that these models can significantly outperform non-expert human judges in aligning with established ground truth quality rankings, providing a reliable and scalable alternative to traditional evaluation methods. Finally, we apply these generation and evaluation methodologies to compare the quality of AI-generated and human-written poetry, finding that, according to LLM evaluators, recent AI models can produce poems matching or exceeding certain categories of human poetry. However, this assessment reveals significant differences in evaluation biases between LLMs and underscores the need for further validation by human literary experts.
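
As a rough, non-authoritative sketch of the kind of LLM-as-judge evaluation loop described here, the following Python fragment scores each poem on a small set of criteria and ranks poems by their mean score. The criteria, the 1-10 scale, and the query_llm() placeholder are assumptions made for this example, not the thesis's actual rubric or API.

    # Illustrative sketch only: a minimal LLM-as-judge scoring loop in the
    # spirit of a CAT-inspired evaluation. query_llm() is a hypothetical
    # placeholder for any chat-style LLM API.
    import re
    from statistics import mean

    CRITERIA = ["imagery", "originality", "emotional impact", "craftsmanship"]

    def query_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to a judge LLM (e.g. via a vendor SDK)
        and return its text reply. Not implemented in this sketch."""
        raise NotImplementedError

    def score_poem(poem: str) -> dict[str, float]:
        """Ask the judge model for a 1-10 score on each criterion."""
        scores = {}
        for criterion in CRITERIA:
            prompt = (
                "You are an experienced poetry judge.\n"
                f"Rate the following poem for {criterion} on a scale of 1-10.\n"
                "Reply with a single number.\n\n" + poem
            )
            reply = query_llm(prompt)
            match = re.search(r"\d+(?:\.\d+)?", reply)
            scores[criterion] = float(match.group()) if match else float("nan")
        return scores

    def rank_poems(poems: dict[str, str]) -> list[tuple[str, float]]:
        """Rank poems by mean criterion score, highest first."""
        overall = {pid: mean(score_poem(text).values()) for pid, text in poems.items()}
        return sorted(overall.items(), key=lambda kv: kv[1], reverse=True)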

The primary contributions of this thesis include:

1. A demonstrably effective methodology for fine-tuning LLMs with structured summary-poem pairs to generate user-controlled, stylistically consistent poetry.

2. A novel LLM-based, CAT-inspired in-context evaluation framework for assessing individual poems across multiple criteria.

3. Empirical evidence that LLM-based evaluators can surpass non-expert human judges in aligning with poetry quality benchmarks when assessing human-written works.

4. Systematic documentation and analysis of the evolving capabilities of successive LLM generations in creative text generation and evaluation.

5. Identification and critical analysis of significant evaluation biases between state-of-the-art LLMs (Claude-3-Opus and GPT-4o). This analysis reveals that while LLMs can reliably tier human poetry, their assessment of AI-generated poetry may reflect preferences for 'LLM-native' characteristics, underscoring that high LLM-assigned quality scores for AI work necessitate careful interpretation and validation by human literary experts.

Collectively, these contributions advance the state of the art in computational poetry generation while establishing methodological frameworks that maintain relevance despite the rapid evolution of underlying language model technologies.

Item Type: Thesis (Doctor of Philosophy (PhD))
Thesis advisor: Grześ, Marek
Thesis advisor: Brown, Dan
Uncontrolled keywords: Large Language Models (LLMs), Computational Creativity, Poetry Generation, Automated Evaluation, Natural Language Processing, Fine-tuning, Prompt Engineering, Stylistic Generation, LLM-as-Judge, Generative AI
Subjects: Q Science > QA Mathematics (inc Computing science) > QA 76 Software, computer programming
Institutional Unit: Schools > School of Computing
Funders: University of Kent (https://ror.org/00xkeyj56)
SWORD Depositor: System Moodle
Depositing User: System Moodle
Date Deposited: 30 Jan 2026 15:10 UTC
Last Modified: 02 Feb 2026 13:48 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/112926 (The current URI for this page, for reference purposes)

University of Kent Author Information

Sawicki, Piotr.
