Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill, Mark J., Hengchen, Simon (2019) Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34 (4). pp. 825-843. ISSN 2055-7671. (doi:10.1093/llc/fqz024) (KAR id:90143)

PDF Author's Accepted Manuscript
Language: English
Official URL: https://doi.org/10.1093/llc/fqz024

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

Item Type: Article
DOI/Identification number: 10.1093/llc/fqz024
Uncontrolled keywords: digital humanities, quantitative text analysis, natural language processing, authorship attribution
Divisions: Divisions > Division for the Study of Law, Society and Social Justice > School of Social Policy, Sociology and Social Research
Depositing User: Mark Hill
Date Deposited: 10 Sep 2021 13:56 UTC
Last Modified: 13 Sep 2021 10:07 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/90143
Hill, Mark J.: https://orcid.org/0000-0001-7273-1775