
Is GPT-4 good enough to evaluate jokes?

Goes, Fabricio; Sawicki, Piotr; Grześ, Marek; Brown, Dan; Volpe, Marco (2023) Is GPT-4 good enough to evaluate jokes? In: International Conference for Computational Creativity, Waterloo, Canada. (In press) (KAR id:101552)

PDF: Author's Accepted Manuscript (212kB)
Language: English


In this paper, we investigate the ability of large language models (LLMs), specifically GPT-4, to assess the funniness of jokes in comparison to human ratings. We use a dataset of jokes annotated with human ratings and explore different system descriptions in GPT-4 to imitate human judges with various types of humour. We propose a novel method to create a system description using many-shot prompting, providing numerous examples of jokes and their evaluation scores. Additionally, we examine the performance of different system descriptions when given varying amounts of instructions and examples on how to evaluate jokes. Our main contributions include a new method for creating a system description in LLMs to evaluate jokes and a comprehensive methodology to assess LLMs' ability to evaluate jokes using rankings rather than individual scores.
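The abstract describes two ideas: building a system description by many-shot prompting with scored example jokes, and comparing the model's judgements to human ratings via rankings rather than raw scores. A minimal sketch of both (not the authors' actual code; the prompt wording, the 1-5 scale, and the plain Spearman implementation are illustrative assumptions):

```python
# Sketch: (1) compose a many-shot "system description" from (joke, score)
# examples; (2) compare model vs. human score lists by rank correlation.
# The prompt format and rating scale here are assumptions for illustration.

def build_system_description(examples, scale=(1, 5)):
    """Embed many (joke, score) examples into a single system prompt."""
    lines = [
        f"You are a judge who rates jokes on a scale from {scale[0]} to {scale[1]}.",
        "Here are jokes you have rated before:",
    ]
    for joke, score in examples:
        lines.append(f'Joke: "{joke}" -> Score: {score}')
    lines.append("Rate new jokes consistently with these examples.")
    return "\n".join(lines)

def _ranks(values):
    """Return 1-based ranks, averaging ranks within tied groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Comparing rankings instead of individual scores sidesteps calibration differences: an LLM judge that consistently rates every joke one point higher than humans still achieves a perfect rank correlation.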

Item Type: Conference or workshop item (Poster)
Uncontrolled keywords: Creativity; GPT-4; LLMs; NLP
Subjects: Q Science > Q Science (General) > Q335 Artificial intelligence
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Funders: University of Kent
Depositing User: Piotr Sawicki
Date Deposited: 05 Jun 2023 17:28 UTC
Last Modified: 07 Jun 2023 14:16 UTC
