Kent Academic Repository

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill, Mark J., Hengchen, Simon (2019) Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34 (4). pp. 825-843. ISSN 2055-7671. (doi:10.1093/llc/fqz024) (KAR id:90143)

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

Item Type: Article
DOI/Identification number: 10.1093/llc/fqz024
Uncontrolled keywords: digital humanities, quantitative text analysis, natural language processing, authorship attribution
Divisions: Divisions > Division for the Study of Law, Society and Social Justice > School of Social Policy, Sociology and Social Research
Depositing User: Mark Hill
Date Deposited: 10 Sep 2021 13:56 UTC
Last Modified: 04 Mar 2024 16:30 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/90143 (The current URI for this page, for reference purposes)

