Kent Academic Repository

Crowd score: a method for the evaluation of jokes using Large Language Model AI voters as judges

Goes, Fabricio and Zhou, Zisen and Sawicki, Piotr and Grześ, Marek and Brown, Dan (2022) Crowd score: a method for the evaluation of jokes using Large Language Model AI voters as judges. [Preprint] (doi:10.48550/arXiv.2212.11214) (KAR id:101553)

The full text of this publication is not currently available from this repository. You may be able to access a copy if URLs are provided.
Official URL: https://doi.org/10.48550/arXiv.2212.11214

Abstract

This paper presents the Crowd Score, a novel method to assess the funniness of jokes using large language models (LLMs) as AI judges. Our method relies on inducing different personalities into the LLM and aggregating the votes of the AI judges into a single score to rate jokes. We validate the votes using an auditing technique that checks, via the LLM, whether the explanation given for a particular vote is reasonable. We tested our methodology on 52 jokes in a crowd of four AI voters with different humour types: affiliative, self-enhancing, aggressive and self-defeating. Our results show that few-shot prompting leads to better results than zero-shot for the voting question. Personality induction showed that, on a set of aggressive/self-defeating jokes, the aggressive and self-defeating voters were significantly more inclined to find jokes funny than the affiliative and self-enhancing voters. The Crowd Score follows the same trend as human judges, assigning higher scores to jokes that human judges also consider funnier. We believe that our methodology could be applied to other creative domains such as stories, poetry and slogans. It could help the Computational Creativity (CC) community adopt a flexible and accurate standard approach for comparing different work under a common metric, and, by minimizing human participation in assessing creative artefacts, it could accelerate the prototyping of creative artefacts and reduce the cost of hiring human participants to rate them.
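The record itself carries no code, but the pipeline the abstract describes (personality-induced voters, binary votes, aggregation into one score) can be sketched briefly. In the minimal Python sketch below, the ask_llm interface, the prompt wording, the few-shot examples and the simple mean aggregation are all illustrative assumptions; only the four humour types and the reported advantage of few-shot voting prompts come from the abstract.

from typing import Callable

# The four humour types used as AI voter personalities (from the abstract).
PERSONALITIES = ["affiliative", "self-enhancing", "aggressive", "self-defeating"]

# Hypothetical few-shot preamble; the abstract reports that few-shot
# prompting outperformed zero-shot for the voting question.
FEW_SHOT_EXAMPLES = (
    'Joke: "Example joke rated funny." Funny (yes/no)? yes\n'
    'Joke: "Example joke rated not funny." Funny (yes/no)? no\n'
)

def vote(ask_llm: Callable[[str], str], personality: str, joke: str) -> bool:
    """One personality-induced AI voter casts a binary funny/not-funny vote."""
    prompt = (
        f"Your sense of humour is {personality}.\n"
        f"{FEW_SHOT_EXAMPLES}"
        f'Joke: "{joke}" Funny (yes/no)? '
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def crowd_score(ask_llm: Callable[[str], str], joke: str) -> float:
    """Aggregate the crowd's votes into a single score: here, simply the
    fraction of AI voters that judged the joke funny (an assumption)."""
    votes = [vote(ask_llm, p, joke) for p in PERSONALITIES]
    return sum(votes) / len(votes)

With ask_llm bound to any text-completion backend, crowd_score(ask_llm, joke) returns a value between 0 and 1. The paper's auditing step, which checks whether the explanation for a vote is reasonable, would filter votes before this aggregation.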

Item Type: Preprint
DOI/Identification number: 10.48550/arXiv.2212.11214
Refereed: No
Other identifier: https://arxiv.org/abs/2212.11214
Name of pre-print platform: arXiv
Uncontrolled keywords: Large Language Models; Jokes Evaluation; Computational Creativity; Crowd Score; Prompt Engineering; AI judges; Personality Induction; Creativity; GPT-4; LLMs; NLP
Subjects: Q Science > Q Science (General) > Q335 Artificial intelligence
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Funders: University of Kent (https://ror.org/00xkeyj56)
Depositing User: Piotr Sawicki
Date Deposited: 05 Jun 2023 17:38 UTC
Last Modified: 17 Oct 2023 09:09 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/101553 (The current URI for this page, for reference purposes)

University of Kent Author Information

Sawicki, Piotr.


Grześ, Marek.

Creator's ORCID: https://orcid.org/0000-0003-4901-1539
