Hill, Mark J., Hengchen, Simon (2019) Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34 (4). pp. 825-843. ISSN 2055-7671. (doi:10.1093/llc/fqz024) (KAR id:90143)
Author's Accepted Manuscript (PDF, 1MB). Language: English.
Official URL: https://doi.org/10.1093/llc/fqz024
Abstract
This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
| Item Type: | Article |
|---|---|
| DOI/Identification number: | 10.1093/llc/fqz024 |
| Uncontrolled keywords: | digital humanities, quantitative text analysis, natural language processing, authorship attribution |
| Divisions: | Divisions > Division for the Study of Law, Society and Social Justice > School of Social Policy, Sociology and Social Research |
| Depositing User: | Mark Hill |
| Date Deposited: | 10 Sep 2021 13:56 UTC |
| Last Modified: | 04 Mar 2024 16:30 UTC |
| Resource URI: | https://kar.kent.ac.uk/id/eprint/90143 (The current URI for this page, for reference purposes) |