Kent Academic Repository

Lexicon Based Critical Tokenisation: An Algorithm

Mills, Jon (1998) Lexicon Based Critical Tokenisation: An Algorithm. In: Actes EURALEX'98 Proceedings: Papers Submitted to the Eighth EURALEX International Congress on Lexicography in Liège, Belgium. University of Liège, Liège, pp. 213-220. ISBN 2-87233-091-7. (The full text of this publication is not currently available from this repository. You may be able to access a copy if URLs are provided) (KAR id:8349)

Official URL:
http://www.euralex.org/elx_proceedings/Euralex1998...

Abstract

In some languages, spaces and punctuation marks are used to delimit word boundaries. This is the case with Cornish. However, there is considerable inconsistency of segmentation within the Corpus of Cornish, and the individual texts that make up this corpus are not even internally consistent. The first stage in lemmatising the Corpus of Cornish, therefore, involves the resegmentation of the corpus into tokens. The whole notion of what is considered to be a word has to be examined. A method for the logical representation of segmentation into tokens is proposed in this paper. The existing segmentation of the Corpus of Cornish, as indicated by spaces in the text, is abandoned and an algorithm for dictionary-based critical tokenisation of the corpus is proposed.
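
The full text is not available from this page, so the paper's own algorithm cannot be reproduced here. As a rough illustration of the general idea of lexicon-based tokenisation, the following sketch assumes that a "critical" tokenisation is a segmentation whose set of token boundaries is not a strict superset of the boundaries of any other valid segmentation (that is, it cannot be obtained simply by splitting the tokens of another segmentation further). That definition, the function names, and the toy English lexicon are all assumptions made for this example and are not taken from the paper.

```python
from typing import FrozenSet, List, Set


def segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into a sequence of lexicon entries."""
    results: List[List[str]] = []

    def extend(start: int, tokens: List[str]) -> None:
        if start == len(text):
            results.append(list(tokens))
            return
        # Try every lexicon entry that begins at `start`.
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in lexicon:
                tokens.append(piece)
                extend(end, tokens)
                tokens.pop()

    extend(0, [])
    return results


def boundaries(tokens: List[str]) -> FrozenSet[int]:
    """Character offsets at which one token ends and the next begins."""
    cuts = set()
    position = 0
    for token in tokens[:-1]:
        position += len(token)
        cuts.add(position)
    return frozenset(cuts)


def critical_segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Keep segmentations that are not mere refinements of another one.

    Assumption for this sketch: a segmentation is dropped when some other
    segmentation uses a strict subset of its boundaries.
    """
    candidates = segmentations(text, lexicon)
    cut_sets = [boundaries(candidate) for candidate in candidates]
    return [
        candidate
        for candidate, cuts in zip(candidates, cut_sets)
        if not any(other < cuts for other in cut_sets)
    ]


if __name__ == "__main__":
    toy_lexicon = {"fund", "s", "funds", "and", "sand"}  # hypothetical lexicon
    for tokens in critical_segmentations("fundsand", toy_lexicon):
        print(" ".join(tokens))
```

With the toy lexicon, the space-free string "fundsand" has three lexicon-licensed segmentations; the sketch discards "fund s and" because it only refines the other two, leaving "fund sand" and "funds and" as the ambiguous candidates that a later stage would have to choose between. The exhaustive search is exponential in the worst case and is meant only to make the idea concrete.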

Item Type: Book section
Additional information:
Cited in De Schryver, G.-M. & D.J.P., 2000. The Compilation of Electronic Corpora, with Special Reference to African Languages. Southern African Linguistics and Applied Language Studies, 18, pp. 89–106. Available at: http://tshwanedje.com/publications/Corpora.pdf.
Cited in Saphou-bivigat, G., 2010. A theoretical model for an encyclopaedic dictionary for the Gabonese languages with reference to Yilumbu. Available at: https://scholar.sun.ac.za/handle/10019.1/3990.
Cited in Yamashita, T. & Matsumoto, Y., 2000. Language Independent Morphological Analysis. In Sixth Applied Natural Language Processing Conference. Association for Computational Linguistics, pp. 232–238. Available at: http://aclweb.org/anthology//A/A00/A00-1032.pdf.
Subjects: P Language and Literature > P Philology. Linguistics
T Technology
Divisions: Divisions > Division of Arts and Humanities > School of Culture and Languages
Depositing User: Francis Mills
Date Deposited: 25 Jun 2009 11:05 UTC
Last Modified: 16 Nov 2021 09:46 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/8349 (The current URI for this page, for reference purposes)

University of Kent Author Information

Mills, Jon.
