Kent Academic Repository

Computer Assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes

Mills, Jon (2002) Computer Assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes. Doctor of Philosophy (PhD) thesis, University of Exeter. (doi:10.22024/UniKent/01.02.8301) (Access to this publication is currently restricted. You may be able to access a copy if URLs are provided) (KAR id:8301)

Language: English

Restricted to Repository staff only


This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language in order that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be capable of uniquely identifying every lexical item that is attested in the corpus. A survey of published and unpublished Cornish dictionaries, glossaries and lexicographical notes was carried out. A corpus was compiled incorporating specially prepared new critical editions. An investigation into the history of Cornish lemmatisation was undertaken. A systemic description of Cornish inflection was written. Three methods of corpus lemmatisation were trialled. The findings were as follows. Lexicographical history shapes current Cornish lexicographical practice. Lexicon-based tokenisation has advantages over character-based tokenisation. System networks provide the means to generate base forms from attested word types. Grammatical difference is the most reliable way of disambiguating homographs. A lemma that contains three fields, the canonical form, the part-of-speech and a semantic field label, provides a unique code for every lexeme attested in the corpus. Programs which involve human interaction during the lemmatisation process allow bootstrapping of the lemmatisation database. Computerised morphological processing may be used to create the lemmatisation database, at least in part. Disambiguation of at least some of the most common homographs may be automated by the use of computer programs.
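The three-field lemma described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which is not reproduced on this page. The class name `Lemma`, the helper functions, and the sample entries are all hypothetical, and the entries use an English homograph as a placeholder rather than attested Cornish data.

```python
from collections import defaultdict
from typing import NamedTuple


class Lemma(NamedTuple):
    """Hypothetical three-field lemma: the fields together act as a unique key."""
    canonical_form: str
    part_of_speech: str
    semantic_field: str


# Placeholder entries (English, for illustration only; not Cornish data).
LEXICON = [
    Lemma("bank", "noun", "finance"),
    Lemma("bank", "noun", "landscape"),
    Lemma("bank", "verb", "finance"),
]


def homographs(lexicon):
    """Group lemmas by canonical form and keep forms shared by several lemmas."""
    by_form = defaultdict(list)
    for lemma in lexicon:
        by_form[lemma.canonical_form].append(lemma)
    return {form: lemmas for form, lemmas in by_form.items() if len(lemmas) > 1}


def is_unique(lexicon):
    """True when no two entries share all three fields."""
    return len(set(lexicon)) == len(lexicon)
```

In this sketch the part-of-speech field separates the verb from the two nouns, and the semantic field label separates the two nouns from each other, so the canonical form alone is ambiguous while the full triple is not.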

Item Type: Thesis (Doctor of Philosophy (PhD))
DOI/Identification number: 10.22024/UniKent/01.02.8301
Additional information: The author of this thesis has requested that it be held under closed access. We are sorry but we will not be able to give you access or pass on any requests for access. 24/05/22
Uncontrolled keywords: Cornish Language, lexicography, computational linguistics, linguistics
Subjects: P Language and Literature > P Philology. Linguistics
Divisions: Divisions > Division of Arts and Humanities > School of Culture and Languages
Depositing User: Francis Mills
Date Deposited: 10 Mar 2009 01:02 UTC
Last Modified: 27 Jul 2022 09:03 UTC

University of Kent Author Information

Mills, Jon.
