Mills, Jon (1998) Lexicon Based Critical Tokenisation: An Algorithm. In: Actes EURALEX'98 Proceedings: Papers Submitted to the Eighth EURALEX International Congress on Lexicography in Liège, Belgium. University of Liège, Liège pp. 213-220. ISBN 2-87233-091-7.
The full text of this publication is not available from this repository.
Abstract
In some languages, spaces and punctuation marks are used to delimit word boundaries. This is the case with Cornish. However, there is considerable inconsistency of segmentation within the Corpus of Cornish, and the individual texts that make up this corpus are not even internally consistent. The first stage in lemmatising the Corpus of Cornish therefore involves resegmenting the corpus into tokens. The whole notion of what is considered to be a word has to be examined. This paper proposes a method for the logical representation of segmentation into tokens. The existing segmentation of the Corpus of Cornish, as indicated by spaces in the text, is abandoned, and an algorithm for dictionary-based critical tokenisation of the corpus is proposed.
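The general idea of discarding the existing spaces and resegmenting against a dictionary can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It only enumerates every way a de-spaced string can be split into entries of a lexicon. The function names `despace` and `segmentations` and the English toy lexicon are hypothetical, chosen for illustration only.

```python
from functools import lru_cache


def despace(text):
    """Drop all whitespace, i.e. abandon the segmentation given in the text."""
    return "".join(text.split())


def segmentations(text, lexicon):
    """Return every split of `text` whose tokens all occur in `lexicon`."""
    lexicon = frozenset(lexicon)
    # No token can be longer than the longest lexicon entry.
    max_len = max((len(word) for word in lexicon), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one empty continuation.
        if start == len(text):
            return ((),)
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            token = text[start:end]
            if token in lexicon:
                for rest in split_from(end):
                    results.append((token,) + rest)
        return tuple(results)

    return [list(seg) for seg in split_from(0)]


# Hypothetical toy lexicon (English, not Cornish), for illustration only.
toy_lexicon = {"a", "an", "nice", "ice"}
candidates = segmentations(despace("a nice"), toy_lexicon)
print(candidates)
```

With this toy lexicon the string admits two candidate segmentations, "a nice" and "an ice", which shows why a lexicon alone leaves ambiguity. Choosing among such candidates is the part a tokenisation algorithm has to address, and it is not attempted in this sketch.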
| Item Type: | Conference or workshop item (Paper) |
|---|---|
| Subjects: | T Technology; P Language and Literature; P Language and Literature > P Philology. Linguistics |
| Divisions: | Faculties > Humanities > School of European Culture and Languages |
| Depositing User: | Jon Mills |
| Date Deposited: | 25 Jun 2009 11:05 |
| Last Modified: | 14 Jan 2010 14:30 |
| Resource URI: | http://kar.kent.ac.uk/id/eprint/8349 (The current URI for this page, for reference purposes) |

