Using Corpus Analysis Software to Analyse Specialised Texts
What is a corpus?
A corpus is a
collection of texts of written (or spoken) language presented in electronic
form. It provides the evidence of how language is used in real situations, from
which lexicographers can write accurate and meaningful dictionary entries.
Using the corpus enables lexicographers to examine a word in
detail by looking at all the different contexts in which it occurs. Below is a
typical way of viewing the results of a search of the corpus, using a display
format called KWIC (or ‘key word in context’).
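As a rough illustration of what a KWIC display involves, the sketch below lines up every hit of a search word with a fixed amount of context on either side. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example; this is not how any particular concordancer is implemented.

```python
import re

def kwic(text, keyword, width=30):
    """Return one display line per hit: left context, [keyword], right context."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text, for illustration only.
sample = ("A corpus is a collection of texts. Lexicographers search the corpus "
          "to see how a word is used. Each corpus search returns many hits.")

for line in kwic(sample, "corpus"):
    print(line)
```

Because the keyword sits in the same column on every line, recurring patterns to its left and right are easy to spot by eye.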
The corpus
contains over 2.5 billion words of real 21st-century English; this
is the largest lexical corpus in the world. Size is not all that matters,
though: it is the size of the corpus coupled with the careful selection and
development of its contents that makes it a resource unlike any other
in the world. Moreover, because the corpus is a collection of texts, there are
not 2.5 billion different words: the humble word
‘the’, the commonest in the written language, accounts for almost 100 million
of all the words in the corpus!
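The difference between the total number of running words (tokens) and the number of different words (types) can be shown with a small word-frequency count. This is a minimal sketch on a made-up sentence; the `tokenize` helper and its very simple pattern are assumptions for illustration, and real corpus tools tokenize far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat on the log."

tokens = tokenize(sample)
freq = Counter(tokens)

n_tokens = len(tokens)   # every running word counts
n_types = len(freq)      # each different word counts once

print("tokens:", n_tokens)
print("types:", n_types)
print("most frequent:", freq.most_common(3))
```

Even in this tiny sample, a function word such as ‘the’ makes up a large share of the tokens, which is the same effect the corpus figures above describe.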
Keeping track of our language
Meanings of words and phrases change, and so do spellings,
despite the existence of ‘standard’ or ‘correct’ spelling. A strength of the
corpus is that it contains not only published works in which the text has been
edited (and made to conform to standard spellings and grammar) but also
unpublished and unedited writing like emails and blogs. Some of the most inventive
uses and deliberate exploitations of language (as well as genuine mistakes)
start out in this kind of informal and unselfconscious language, so tracking
them is an essential part of tracking the language as a whole.
Sources of language corpora
Subscribe to a large corpus provider such as the British National Corpus (BNC).
Use web concordancing.
Compile your own corpora and analyze the data using analysis software:
AntConc (for monolingual corpora)
WordSmith (for monolingual corpora)
ParaConc (for multilingual corpora)
Designing a specialized corpus
Corpus size
There are no fixed rules; the size depends on
research purposes, availability of data, and time.
Large, general corpora may be less
useful than small, focused corpora if searches are made on context-specific terms.
Corpora that are ‘too small’ have limitations, e.g. they may not contain enough
instances of the concepts, terms, or patterns under investigation.
It is preferable to create a ‘monitor’ or ‘open’ corpus (one that is continually updated) because
specialized words and usage
are dynamic.
Text extracts vs. full texts
The choice depends
on the aim of corpus compilation.
Whole texts offer more coverage because the words or
terms to be examined may be distributed randomly throughout the text.
Specific sections may be helpful if we are looking for words or phrases in particular
content areas or want to create purposeful sub-corpora.
Number of texts
A choice can be made between collecting a few
large texts or a larger number of smaller texts.
A choice can also be made between
selecting texts written by one or two key writers or sources, or texts retrieved
from different sources or written by different authors.
The choice depends
on your research focus, e.g. whether you want to
study overall language use or the idiosyncrasies and linguistic choices
preferred by particular writers.
Subject and text type
The corpus should mainly focus on the
specialized texts under investigation, although this is less clear-cut in multidisciplinary
subjects.
Texts may come from different subjects
if the research focus is on the study of particular language features rather
than term extraction.
Text types within a specialized subject
field may vary from ‘expert-to-expert’ texts to ‘expert-to-non-expert’ texts, or in other
words, from technical to popular texts.
Other considerations
Authorship: Texts written by
experts in a field tend to present more reliable and authentic examples of
specialized language.
Language: Specialized texts can be stored and retrieved in the form
of monolingual, comparable, or parallel corpora.
Publication date: Texts should come
from recent publications unless queries are made in relation to particular
periods of time.
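One way to put these design criteria into practice is to record a little metadata for each text and filter on it when building the corpus or a sub-corpus. The sketch below is only an illustration: the `TextRecord` fields, the `select_texts` function, and the sample records are all hypothetical and are not tied to any corpus tool.

```python
from dataclasses import dataclass

@dataclass
class TextRecord:
    title: str
    text_type: str       # e.g. "expert-to-expert" or "expert-to-non-expert"
    language: str        # language code such as "en"
    year: int            # publication year
    expert_author: bool  # True if written by an expert in the field

def select_texts(records, min_year=None, language=None,
                 text_type=None, expert_only=False):
    """Keep only the records that satisfy every criterion that was given."""
    selected = []
    for rec in records:
        if min_year is not None and rec.year < min_year:
            continue
        if language is not None and rec.language != language:
            continue
        if text_type is not None and rec.text_type != text_type:
            continue
        if expert_only and not rec.expert_author:
            continue
        selected.append(rec)
    return selected

# Hypothetical records, for illustration only.
records = [
    TextRecord("Journal article A", "expert-to-expert", "en", 2021, True),
    TextRecord("Magazine feature B", "expert-to-non-expert", "en", 2019, False),
    TextRecord("Journal article C", "expert-to-expert", "en", 2005, True),
    TextRecord("Journal article D", "expert-to-expert", "th", 2022, True),
]

# Recent English texts written by experts.
recent_expert = select_texts(records, min_year=2015, language="en", expert_only=True)
for rec in recent_expert:
    print(rec.title, rec.year)
```

Keeping the metadata separate from the text itself means the same collection can be re-filtered later, for example by period when a query concerns a particular span of time.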