Developing a tagset and tagger for the African languages of South Africa with special reference to Xhosa
Department of Linguistics, University of South Africa, PO Box 392, Pretoria 0003, South Africa.
2003 (English). In: Southern African Linguistics and Applied Language Studies, ISSN 1607-3614, E-ISSN 1727-9461, Vol. 21, no 4, p. 223-237. Article in journal (Refereed). Published.
Abstract [en]

There are currently two distinct but not necessarily mutually exclusive approaches to the retrieval of information from linguistic corpora. ‘Corpus-driven’ approaches rely solely on the corpus itself to yield significant patterns. With the exception of orthographic spacing, no additional annotations to a ‘raw’ corpus are used to guide searches and the retrieval of information from the corpus. Typically, key word in context (KWIC) analyses are applied to relevant concordance lines to extract statistically significant lexical and grammatical patterns. In ‘corpus-based’ approaches, on the other hand, information is retrieved from an enriched corpus on the basis of annotations in the form of linguistic tags. That is, the annotations are used to direct the searches to specific grammatical and lexical phenomena in a corpus. In this article, we propose a corpus-based approach and a tagset to be used on a corpus of spoken language for the African languages of South Africa. A number of problematic linguistic phenomena, such as fixed expressions, agglutination and morphemic merging, and spoken language phenomena, such as interrupted words, often have some effect on tagging principles. These problematic phenomena are discussed and illustrated. The development of the tagset is based on the morphosyntactic properties of Xhosa for reasons that are outlined in the article. Manual tagging of a large corpus would be a daunting and time-consuming task, not to mention the potential for various kinds of errors. This problem is addressed in a two-step process. Firstly, a computer-based drag-and-drop tagger was developed to facilitate the manual tagging of a so-called training corpus. This training corpus then forms the input to the development of an automatic tagger. The principles and procedures for the development of an automatic tagger for African languages are also discussed. © 2003 NISC Pty Ltd.
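
The contrast between the two retrieval approaches described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English sentence, the function names `kwic` and `find_by_tag`, and the tag labels (`DET`, `N`, `V`, `CONJ`) are hypothetical and are not the tagset or the tagger proposed in the article.

```python
def kwic(tokens, keyword, width=2):
    """Corpus-driven style: return key-word-in-context lines for `keyword`,
    using only the token sequence itself (no annotations)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def find_by_tag(tagged_tokens, tag):
    """Corpus-based style: return every token annotated with `tag`."""
    return [tok for tok, t in tagged_tokens if t == tag]


# Raw corpus: orthographic spacing is the only structure used.
raw = "the dog saw the cat and the cat ran"
tokens = raw.split()
hits = kwic(tokens, "cat", width=2)

# Enriched corpus: the same tokens with hypothetical part-of-speech tags.
tagged = [
    ("the", "DET"), ("dog", "N"), ("saw", "V"),
    ("the", "DET"), ("cat", "N"), ("and", "CONJ"),
    ("the", "DET"), ("cat", "N"), ("ran", "V"),
]
nouns = find_by_tag(tagged, "N")

if __name__ == "__main__":
    for line in hits:
        print(line)
    print(nouns)
```

In the first call the search is guided only by the word form; in the second, the annotation directs the search to a grammatical category regardless of word form, which is the kind of retrieval a tagged corpus is meant to support.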

Place, publisher, year, edition, pages
2003. Vol. 21, no 4, 223-237 p.
National Category
General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:hj:diva-24939
Scopus ID: 2-s2.0-9944243099
OAI: oai:DiVA.org:hj-24939
DiVA: diva2:754815
Available from: 2014-10-13. Created: 2014-10-10. Last updated: 2014-10-13. Bibliographically approved.

By author/editor
Allwood, Jens