Abbreviations and important terms

Associations and institutions:

  • ACL: Association for Computational Linguistics;
  • ACL/DCI: Data Collection Initiative of the ACL;
  • CLR: Consortium for Lexical Research
  • ECI: European Corpus Initiative
  • ICAME: International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities in Bergen, Norway; http://nora.hd.uib.no/icame.html
  • LDC: Linguistic Data Consortium (offers text corpora of different European languages; information about available corpora via http://ldc.upenn.edu/ldc/catalog/index.html.)
  • OTA: Oxford Text Archive, Oxford University Computing Services; http://sable.ox.ac.uk/ota/

General index:

  • ASCII: American Standard Code for Information Interchange
  • BNC: British National Corpus
  • CALL: Computer-assisted language learning
  • CD-ROM: Compact disc (read only memory): stored data cannot be changed;
  • HTML = Hypertext markup language: ASCII plus structuring and formatting instructions, e.g. for creating links; convenient HTML editors and converters are available;
  • EAGLES: Expert Advisory Groups on Language Engineering Standards, an EU-sponsored project;
  • KWIC: keyword in context
  • KWOC: keyword out of context
  • MRD = Machine-Readable Dictionary: important tools/resources in MT, Intelligent Information Retrieval and NLP; they can be converted into an LDB = lexical database or an LKB = lexical knowledge base
  • MI = Mutual Information
  • MT = Machine Translation and MAT = Machine Aided Translation
  • NLP = Natural Language Processing: simulation of human language processing on the computer by programming the knowledge of human cognitive mechanisms; the task of computer scientists is not to understand the human language production and understanding but to model and reproduce it according to the insights of  AI = Artificial Intelligence Research; typical applications are MT, Intelligent Information Retrieval, query answering, summary generation;
  • OCR = Optical Character Recognition: the technique of converting the graphical image of a scanned text into its constituent characters so that it becomes a computer-readable text;
  • OPAC = Online Public Access Catalogue: search software for the literature databases of university and departmental libraries;
  • TCP/IP Protocol = Transmission Control Protocol/Internet Protocol: networking software belonging to a protocol family that provides uniform procedures and formats for uniquely identifying the computers on the Internet and for regulating the type of access and data transfer;
  • SGML = Standard Generalised Mark-up Language: a set of conventions and procedures which allows consistent approaches to be adopted in adding structural and other information to computer-readable texts;
  • SPSS = Statistical Package for the Social Sciences
  • URL = Uniform resource locator: the unique website address;
  • MEC = The Middle English Compendium: a set of electronic resources for medievalists, for use both in research and teaching (contains the electronic version of the MED, an electronic HyperBibliography of ME texts, manuscripts, editions, and other material, and a Corpus of Middle English Prose and Verse); available from the Univ. of Michigan Press, http://www.hti.umich.edu/mec;
  • TOE = A Thesaurus of Old English: a pilot study for the HTE;

 

Recommended further reading:

 

Kenny, Anthony. The Computation of Style: An Introduction to Statistics for Students of Literature and Humanities. Oxford: Pergamon Press, 1982.

 

Oakes, Michael P. Statistics for Corpus Linguistics. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: Edinburgh University Press, 1998.

 

 

algorithm: 
Inside-Outside Algorithm: mathematical procedure to calculate the probability of possible different parses in sentences like She eats pizza without anchovies vs. She eats pizza without hesitation; used to statistically train/probabilize a grammar with a training corpus/treebank, associating a probability of use for each of its rules; (example in Black/Garside/Leech 1993, 6.1); 
Forward-Backward Algorithm: (developed by Baum and Eagon 1967) a method of determining the probability of possible tags (usually in unsupervised training); the exact probability is computed directly as the product of the forward and backward probabilities to that tag, divided by the probability of the word sequence given a model; first used as a statistical estimation procedure to obtain the values of the parameters in HMMs by training on real data (brief description in Furui 1989, Appendix D3, 343-7);
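
A minimal sketch of the forward-backward computation on a toy Hidden Markov Model; the tag set, initial, transition and emission probabilities below are invented for illustration (a real tagger would estimate them from training data), and gamma gives the posterior probability of each tag at each position:

import numpy as np

tags = ["DET", "N", "V"]
pi = np.array([0.6, 0.3, 0.1])              # initial tag probabilities (invented)
A = np.array([[0.1, 0.8, 0.1],              # transition p(next tag | current tag)
              [0.2, 0.3, 0.5],
              [0.5, 0.4, 0.1]])
# emission p(word | tag) for the toy vocabulary: the=0, dog=1, barks=2
B = np.array([[0.9, 0.05, 0.05],            # DET
              [0.1, 0.6,  0.3],             # N
              [0.05, 0.15, 0.8]])           # V

def forward_backward(word_ids):
    T, N = len(word_ids), len(tags)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, word_ids[0]]
    for t in range(1, T):                   # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, word_ids[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):          # backward pass
        beta[t] = A @ (B[:, word_ids[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()            # p(word sequence | model)
    # posterior tag probability = forward * backward / sequence probability
    return alpha * beta / likelihood

print(forward_backward([0, 1, 2]).round(3))  # "the dog barks"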

alignment: the practice of defining explicit links between texts in a parallel corpus; 
on a program for aligning sentences in bilingual corpora (a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) and the Canadian Hansards), i.e. for identifying correspondences between sentences, based on a simple statistical model of the character lengths of sentences/paragraphs, without any lexical clues (an overall error rate of only 4.2%, and under 0.7% for the best-scoring 80% of the alignments, considerably better than word-based alignment procedures with an error rate of 6.5%): see Gale/Church (in Armstrong 1994); Kay/Röscheisen (in Armstrong 1994) propose an algorithm for aligning words, essentially based on the similarity of their distributions (notably the morphological/phonological similarity of technical terms and proper names);
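
A minimal sketch of length-based sentence alignment in the spirit of the Gale/Church approach: sentences are matched purely by character length, using a dynamic program over 1-1, 1-0 and 0-1 pairings (the full algorithm also handles 2-1, 1-2 and 2-2 beads and uses a probabilistic length model; the sample sentences and the skip penalty are invented):

def align_by_length(src, tgt, skip_penalty=10.0):
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # 1-1 match: penalise differing character lengths
                c = cost[i - 1][j - 1] + abs(len(src[i - 1]) - len(tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i and cost[i - 1][j] + skip_penalty < cost[i][j]:  # 1-0 (skip source)
                cost[i][j], back[i][j] = cost[i - 1][j] + skip_penalty, (i - 1, j)
            if j and cost[i][j - 1] + skip_penalty < cost[i][j]:  # 0-1 (skip target)
                cost[i][j], back[i][j] = cost[i][j - 1] + skip_penalty, (i, j - 1)
    path, ij = [], (n, m)                   # trace back the cheapest alignment path
    while ij != (0, 0):
        path.append(ij)
        ij = back[ij[0]][ij[1]]
    return list(reversed(path))

src = ["The bank reported higher profits.", "Shares rose sharply."]
tgt = ["Die Bank meldete hoehere Gewinne.", "Die Aktien stiegen stark."]
print(align_by_length(src, tgt))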

 

annotation: adding of explicit information to machine-readable texts;

 

chi-squared test = Pearson's χ² test: the most commonly used significance test (unreliable with small frequencies); compares the difference between the actual (observed) frequencies and those which one would expect if no factor other than chance had been operating; the greater the difference between the observed and the expected frequencies, the more likely it is that the observed frequencies are being influenced by, for instance, a true difference in the grammars of the varieties; 
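
A small worked example of the chi-squared test on an invented 2x2 contingency table (the frequency of some feature in two text varieties), using the scipy library:

from scipy.stats import chi2_contingency

observed = [[120, 880],    # variety A: feature present / absent
            [ 75, 925]]    # variety B: feature present / absent
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, d.f. = {dof}, p = {p:.4f}")
# a p value below 0.05 suggests the difference between the varieties
# is unlikely to be due to chance alone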


cluster: a group of words which follow each other in a text and often correspond to phrases or idioms;

 

colligation: collocation patterns based on syntactic groups rather than on individual words (e.g. between depend and on);

collocation: recurrent combinations of words that co-occur more often than expected by chance in a text and that correspond to arbitrary word usages (the word choices do not follow from any syntactic or semantic rules); they are common in all types of English, in non-technical and especially technical genres (domain-specific collocations often render technical jargon totally unintelligible to the layman: a dry/wet suit used by sailors); collocations can be of any length (but they usually contain two to six words), interrupted or uninterrupted by other words; not all words that co-occur frequently are "true" collocations; some are simple word associations in a given context, with related meanings as in doctor - nurse - hospital (there is no consistent functional relationship as between modifying adj./N, V/O, S/V etc.); 
different types of collocations depending on the number of words involved, the way they are combined and the degree of rigidness or flexibility (Smadja 1994 identifies 3 types of collocations: rigid noun phrases, predicative relations, and phrasal templates); 
"Collocations are a purely statistical phenomenon: potentioal collocates are found by examining and counting the span of words that appear to the left and right of a given item (the so-called node). If some words appear more frequently together than in isolation (i.e. if the frequency of co-occurance is greater than chance would predict), they form a significant grouping and are considered collocations." English Today, Vol 10 No 4 (Oct. 1994), p. 47; 
lexical collocations involve only open class words, grammatical collocations at least one closed class word (particles, prepositions, auxiliary verb, determiner) (Benson 1986); 
there are a number of techniques based on statistical methods for retrieving and identifying collocations from large textual corpora (Choueka 1983, Church and Hanks 1989); all collocation analysis is primarily based on concordance lines (KWIC indexes), which contain the basic information needed; most concordance packages include facilities for simple frequency information about the words near the keyword within predetermined proximity limits (see the sketch after this entry), and some provide more sophisticated statistical routines which calculate the figures needed for comparison and use them to produce measures of significance (WordCruncher: z-score, d.f. etc.); 
one of the more elaborate and precise collocation retrieval tools is Xtract (developed by Smadja/McKeown 1990), trained on a 10 million word corpus of stock market news reports with an estimated precision of 80%;
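
A minimal sketch of span-based collocate counting for a given node word, as described above; real collocation tools add a significance measure (MI, t-score, z-score) on top of these raw counts:

from collections import Counter

def collocates(tokens, node, span=4):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # count the words within +/- span tokens of the node
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = "he will depend on the results and she can depend on him".split()
print(collocates(text, "depend").most_common(3))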

 

concordance:  list of all the contexts in which a keyword occurs in a text (i.e. a KWIC-Index), so that its use and behaviour can be properly analysed and typical patterns detected on the basis of the contextual information provided by the other words; 
Concordance Software (Concordancer): most concordance programs produce as a standard display lists of keywords in the centre of 1 line of context (with variable context shown) in their original order of appearance within the text; some programs allow sorting by specified words of the context, in alphabetical order of the context words to the left or right of the keyword or in order of keywords (if there are different keywords or multiple keywords specified by wild cards); 
most programs also provide basic statistical information about the size of the file, the number of types and tokens, and the number of occurrences of the keywords (their absolute frequency);
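
A minimal KWIC concordancer that prints every occurrence of a keyword centred in one line of context, roughly as described above:

def kwic(tokens, keyword, width=4):
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>35}  [{tok}]  {right}")

tokens = "the cause of the accident was never established".split()
kwic(tokens, "cause")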

 

cookies: small files used to store information downloaded from the Web on individual computers; more generally, a handle, transaction ID, or other token of agreement between cooperating programs ("I give him a packet, he gives me back a cookie"): the claim check you get from a dry-cleaning shop is a perfectly mundane example of a cookie; the only thing it is useful for is to relate a later transaction to this one (so you get the same clothes back). 
HTTP cookie: a packet of information sent by an HTTP server to a World-Wide Web browser and then sent back by the browser each time it accesses that server. Cookies can contain any arbitrary information the server chooses and are used to maintain state between otherwise stateless HTTP transactions. Typically this is used to authenticate or identify a registered user of a web site without requiring them to sign in again every time they access that site. Other uses are, e.g., maintaining a "shopping basket" of goods selected for purchase during a session at a site, site personalisation (presenting different pages to different users), and tracking a particular user's access to a site. 
magic cookie: something passed between routines or programs that enables the receiver to perform some operation; a capability ticket or opaque identifier; especially used of small data objects that contain data encoded in a strange or intrinsically machine-dependent way.

 

corpus: a collection of texts, selected to represent a particular type of language (held in computer-readable form); main considerations of corpus design are 
1) the size of the corpus (including length and number of text samples), 
2) the range of text categories (or registers), 
3) the intended use of the corpus; 
starting with the COBUILD dictionary (Sinclair et al. 1987) British lexicographers (especially those working on dictionaries for foreign learners) are beginning to depend more and more on corpus data (as an alternative to traditional citation indexes: Murray 1877-9: "There was not a single quotation for imaginable, a word used by Chaucer, Sir Thomas More, and Milton."); "... dictionaries produced in the USA pay little attention to MR corpora, and are based on collections of selected citations, augmented by introspection..." (Zernik 1991:115); 
a corpus is usually distinguished from a mere random collection of texts in text archives;

  • general purpose corpus: includes a wide range of styles and varieties; overall linguistic characterizations of a language represent a kind of averaging of the systematic linguistic patterns found across different registers (see Biber in Armstrong 1994);
  • representative corpus: contains all the characteristic grammatical and lexical/semantic features, structural patterns and their distribution in a language type (variety, register, text type, sublanguage); a representative corpus of general language must therefore include a "balanced" selection of all the relevant language types (size cannot compensate for a lack of diversity!); examples: BROWN and LOB as carefully and explicitly structured small corpora; COBUILD/Bank of English, the Longman/Lancaster Corpus and the BNC combine the aspect of "balance" with the emphasis on size to varying extents;
  • monitor corpus: a growing, non-finite collection of texts;
  • parallel corpora: collection of pairs of texts translated from one language into another; application: contrastive studies (English shows less fronting of direct objects and of adverbials than Swedish; contains fewer clefts than Norwegian; uses fewer modal adverbs or particles and prefers modal or semi-modal verbs compared to German), MT (problem of translationese); for example the Canadian Hansard: "such bilingual corpora make it possible to undertake statistical, and other kinds of  empirical, studies of translation on a scale that was previously unthinkable" (Kay/Röscheisen in Armstrong 1994:121);

corpus analysis: with the availability of large samples of textual data (since the 1990s) and automated tools (POS taggers, parsers, concordancers) it has become possible to retrieve important lexical information that is not yet available in dictionaries or grammars, as it is usually unconscious, not directly observable and not open to the native speaker's (or the linguist's) intuition and introspection; intuition cannot identify the most typical and significant units and patterns of a language (words/phrases/collocations) or provide data on their relative frequency and preference relations; 
empirical linguistic research based on large corpora is becoming more and more important in lexicography and lexicology, natural language processing, information retrieval, machine translation as well as in theoretical linguistics and the testing (and training) of linguistic hypotheses and grammars;

 

degrees of freedom: d.f. = (number of features/forms/variables - 1) x (number of texts/compared varieties/samples - 1); is relevant when determining how significant a chi-square is;

 

entropy: an information-theoretic measure of the predictability of a corpus (as defined by Shannon 1948, a measure of the information content of a probabilistic source, with the bit as its standard unit); the bigger the entropy of a corpus, the less predictable it is, according to the formula: 
      H(corpus) = - Σ p(w) log2 p(w) 
(summed over all words w, where p(w) is the probability of occurrence of word w). 
In most cases entropy grows with the size of the corpus. 
example: Smadja (in Armstrong 1994:170) compares the entropies of three subcorpora of 1 million words each (BROWN = 10.5, Dow/Jones = 9.6, New York Times = 10.4); the more restricted or focused the subjects and domains of a variety, the lower its entropy;
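
A minimal sketch of the entropy calculation defined above: p(w) is estimated from a word frequency list and -p(w) log2 p(w) is summed over all word types (the toy text is invented):

import math
from collections import Counter

def corpus_entropy(tokens):
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tokens = "the dog saw the cat and the cat saw the dog".split()
print(f"{corpus_entropy(tokens):.3f} bits per word")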

 

frequency: expected vs. observed frequency; 
frequency distribution: represented in a histogram or frequency polygon (usually positively skewed, i.e. with the tail to the right); 
frequency list: a list of strings (usually words, but also other linguistic units, constructions or sequences of words or tags) with their frequencies of occurrence in a corpus; frequency lists are a prerequisite for any statistical analysis; word frequency lists can be sorted alphabetically, in order of (descending) frequency, in order of appearance, in reversed order, in word-length order etc. (depending on the program); major problems are related to the identification and definition of units, for example the disambiguation of homographs and lemmatisation;
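
A minimal word frequency list, sorted in descending order of frequency; note that this naive tokenisation neither disambiguates homographs nor lemmatises inflected forms:

from collections import Counter
import re

text = "The cats sat on the mat. The cat sat."
tokens = re.findall(r"[a-z]+", text.lower())
for word, freq in Counter(tokens).most_common():
    print(f"{freq:4d}  {word}")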

 

Interlingua: conceptual and structural graphs used in MT, derived from the source language and representing the language-independent meaning of a sentence;

 

Lexical database (LDB): can be filled to a large extent with lexical information by the automatic extraction of relevant facts from MRDs (merging of dictionaries by mapping one dict. onto another: see Atkins/Levin in Zernik 1991), supplemented with information from textual databases (TDB); 
the ideal LDB must provide information (if appropriate) about: semantic class, aktionsart and arguments; selectional restrictions on the arguments; subcategorization; morphologically related words; related extended uses; related idiomatic uses; collocates; domain labels; pragmatic force; and corpus citations exemplifying each feature (see Zernik:242); thus, the circularity in definitions and consistency problems of traditional (printed) dictionaries can be overcome (information is often implicit and not uniform within the same dictionary and often depends on examples; see Calzolari on lexical databases and Atkins/Levin, "Admitting Impediments" in Zernik 1991);

 

lemmatisation: the grouping together of all the different inflected forms of the same word (I/my/me/mine, is/was/are/being etc.); a complicated procedure when performed automatically; one of the major problems in early CL (especially in the design of word frequency programs); WordSmith offers a number of procedures (manual, auto-joining, file-based joining) for joining different entries in a word list;

 

lexical density: the proportion of lexical words in a text expressed as a percentage: if N is the number of words and L the number of lexical words, then lexical density = 100 x L/N; most written texts have a lexical density above 40%, while spoken texts show a wider range, but usually under 40% (24 to 43), since speakers can rely to a large extent on the immediate context; a high degree of shared information and shared context (as in casual spoken language) can lead to a very low lexical density (Stubbs 1996: Fig. 3.1, 74);
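
A minimal lexical density calculation following the formula above; the small stop list standing in for "grammatical words" is an illustrative simplification, not a standard inventory:

GRAMMATICAL = {"the", "a", "an", "of", "in", "on", "and", "is", "was", "to", "it"}

def lexical_density(tokens):
    lexical = [t for t in tokens if t.lower() not in GRAMMATICAL]
    return 100 * len(lexical) / len(tokens)

tokens = "the results of the experiment confirmed the initial hypothesis".split()
print(f"{lexical_density(tokens):.1f}%")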

 

loglinear analysis: starting from a frequency cross-tabulation, the statistical significance of hypothesised factors/variables responsible for linguistic variation (text genre, semantic category, syntactic position/context etc.) can be calculated; the best model is the one with the fewest variables which still retain statistical significance; can be compared to 
variable rule analysis (VARBRUL) pioneered by the sociolinguist David Sankoff; (examples: explanation of the presence or omission of the preposition for in phrases of duration: she studied -for- three years in Munich; negation in spoken and written English; factors influencing the use of modals; McEnery 1996:83);

 

multimedia application: complex computer software which typically includes visual images, short video clips, animation and sound, often alongside large quantities of text (encyclopaedias, reference works, CALL-software etc.);

 

multivariate analysis: statistical procedures to determine and summarise the complex interrelationships of similarity and difference between a large number of samples and a large number of variables, with the aim of discovering or confirming broader patterns within variables; all the different techniques, such as factor analysis, principal components analysis, correspondence analysis, multidimensional scaling and cluster analysis, start off with a traditional basic cross-tabulation; 
example: Biber's (1988, 1989, 1992) studies, which have shown systematic patterns of variation among different registers in English (based on a quantitative analysis of the distribution of 67 linguistic features in LOB and LLC; brief methodological overview in Armstrong 1994);

 

MI = mutual information: a formula (borrowed from information theory) that compares the probability that two items occur together as a joint event with the probability that they occur individually (and that their co-occurrences are a result of chance); word pairs with high positive mutual information scores are likely to constitute characteristic collocations (cf. Church in Zernik 1991); 
      MI = log2 (O / E) 
where O is the observed frequency of the word pair and E its expected frequency if the two words were independent; 
MI > 0: joint probability larger than chance (6-10: strong association); MI = 0: no interesting relationship; MI < 0: complementary distribution; 
not very reliable with small corpora or with words that appear less than 10 times per 1 million words of text!
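
A minimal mutual information score for a word pair, following the observed/expected formula above; the counts and the corpus size are invented purely for illustration:

import math

def mutual_information(pair_freq, freq_x, freq_y, corpus_size):
    observed = pair_freq
    expected = freq_x * freq_y / corpus_size   # expected co-occurrences if independent
    return math.log2(observed / expected)

# e.g. word x occurs 1,200 times, word y 800 times, and the pair 30 times
# in a 1-million-word corpus
print(f"MI = {mutual_information(30, 1200, 800, 1_000_000):.2f}")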

 

parsing: method of analyzing, or "diagramming" sentences by assigning syntactic structure; 
canonical case of structural ambiguity in prepositional phrase attachment: I saw the man with the telescope; lexical preference (frequency of co-occurrence) can be used to resolve structural ambiguity in a parser (see Hindle/Rooth in Armstrong 1994); 
parser: a computer program "that supplies for a given sentence a correct grammatical analysis, demarcating its parts, labelling each, identifying the part of speech of every word used in the sentence, and perhaps offering additional information", for instance the grammatical function of phrases and clauses (S, O etc.) or the "semantic class" (semantic noun categories such as positive emotion, family member etc.); poor performance of early, "theory-based" parsing systems, with only 30-50% of the input sentences correctly analyzed; considerable improvement through the development of probabilistic, "frequency-based" grammars; 
probabilistic parser: "program for grammatical analysis (including a detailed grammar of English) ... that uses statistical methods to single out the correct parse, among all those offered by the grammar, on the basis of probabilities"; 
Lancaster Parsed Corpus (by the UCREL team; subset of LOB of about 135,000 words; sentences tagged by the CLAWS system were parsed with a probabilistic parser making use of the Lancaster/ Leeds treebank as training data, and then manually corrected): 
skeleton parsing/full parsing: skeleton parsing is a simplified syntactic analysis restricted to basic distinctions and categories, thus giving "flat" structure trees (the ratio of daughter to mother nodes is relatively high) and excluding areas of controversy; "a mode of manual parsing in which a reduced set of grammatical categories is employed, and in which constituent boundaries and labels are assigned to text only where the treebankers are quite confident of these assignments. The goal is not exhaustive bracketing .., but rather the indication of a minimal set of [basic] constituents which must be present in any "correct" analysis of the sentence." (Black/Garside 1993:214);

 

probabilistic models/methods vs. knowledge-based methods: lexical/conceptual information and rule-based techniques, as typically represented in traditional dictionaries and purely rule-based systems, can be effectively supplemented with probabilistic information from corpora; this can be used effectively to reduce lexical and grammatical ambiguities in text analysis and processing (e.g. information retrieval): "lexical ambiguity is hard if we ignore lexical preferences" (Church in Armstrong 1994:9); probabilistic information (about the pragmatic and linguistic context) offers a means of predicting the most likely interpretation; 
rule-based methods cannot deal with the variety of possible constructions (and long sentences); probabilistic models must be trained on a corpus; if the corpus is (manually) tagged this is called "supervised training" (see Weischedel et al. in Armstrong 1994); 
a typical application of probabilistic techniques is the Hidden Markov Model in POS tagging, used to determine the most likely tag for each word in a sentence; 
probabilistic taggers can be improved by integrating on-line dictionaries into the system, so that alternative (uncommon) uses of a word (that did not occur in the training set) will be considered;

 

probability value p: can be calculated depending on the degrees of freedom: if p is close to 0 the difference is strongly significant, if p is close to 1 it is almost certainly due to chance; the normal cut-off point to assume significance is p < 0.05; p < 0.0001 means that the probability is 1 in 10,000 that a calculated difference is due to chance; 
  
semantic prosody: the tendency for certain words to occur in negative or positive contexts (e.g. cause and accident);

 

significant: reaching a degree of statistical certainty at which it is unlikely that a result is due purely to chance;

 

methods of statistical analysis: 
- descriptive statistics: uses measurable and countable events and summarises the most important properties of the observed data, such as its average or degree of variation; thus, the characteristic features of a particular text, author, genre can be identified; 
1) absolute and relative frequency; 
2) measures of central tendency: the mode (the most frequently observed score), the median (the central score of a distribution) and the mean (the average of all scores in a data set, e.g. the mean sentence length in words); each results in a single value which is the most typical score for a data set; 
3) relative proportion (ratio):  type/token ratio; lexical density 
4) measures of variability: calculate whether and to what degree the scores of a data set are spread out above and below the most typical score: the range (highest value - lowest value), the variance (the sum of the squared deviations of the single scores from the mean, Σ(x - x̄)²/(N - 1)) and the standard deviation (a small worked example of these descriptive measures follows this entry); 
- inferential and evaluative statistics: compares groups of (abstracted) data and calculates whether observed differences are significant or due to chance and the inherent variability of the data; one must state the level of confidence with which one can accept a given result (usually 95% which corresponds to a p value of 0.05!); 
1) statistical significance tests to determine standard deviation/tolerable error (how high or low is the probability that the difference between texts on certain features is due to chance: chi-squared test, Student's t-test, Wilcox's rank sum test); 
2) significant collocation tests (mutual information, z-score); 
3) multivariate statistics (to examine relations between many variables in many samples); 
4) loglinear models (to find out which variable/s seem statistically most likely to be responsible for a particular effect); 
5) probabilistic language modelling (performs more accurately and robustly than most purely rule-based systems); used in natural language analysis for part-of-speech annotation and syntactic parsing, speech recognition, word sense disambiguation; 
The common statistical tests are based on the assumption that random variables are distributed normally in texts; they are not reliable for rare events in small or moderate-sized texts (20-30% of typical English words have a frequency of less than 1 in 50,000 words, i.e. many of the content-bearing words and nearly all the technical jargon). Dunning (in Armstrong 1994) proposes more applicable methods based on likelihood ratio tests (binomial or multinomial distribution) that yield good results with relatively small samples.
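
A small illustration of the descriptive measures listed above (mode, median, mean, range, variance, standard deviation), applied to invented sentence lengths in words:

import statistics

sentence_lengths = [12, 18, 9, 21, 15, 12, 30, 7, 14, 12]
print("mean    ", statistics.mean(sentence_lengths))
print("median  ", statistics.median(sentence_lengths))
print("mode    ", statistics.mode(sentence_lengths))
print("range   ", max(sentence_lengths) - min(sentence_lengths))
print("variance", round(statistics.variance(sentence_lengths), 2))  # sum (x - mean)^2 / (N - 1)
print("st. dev.", round(statistics.stdev(sentence_lengths), 2))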

 

standard deviation: measures the dispersion of the items in terms of their deviation from the mean value (= square root of the average squared deviation from the mean);  a statistical measure of how widely the individual items in a  frequency distribution differ from the mean; 
degree of variability, as the mean value does not tell us anything about the actual distribution and variance of individual values (e.g. sentence length in words); the minimum value is 0, i.e. every value is equal to the mean and to every other value; 
provides also a common unit of measurement for the comparison of distributions in z-scores [Kenny, 52-58].

 

sublanguage: "a restriction of natural language to a particular fields or subject area (legal, medical texts, stock reports). They differ from general language in their distribution of words and word meanings, and sometimes in the syntactic structures they use" (Grisham/Kittredge 1986);

 

tagging: marking items in a text with additional information, often relating to their linguistic properties; usually: labelling of the grammatical word categories (POS); more or less detailed subclassification by feature values; an important measure of the usefulness of a tagging system is how well it performs in other domains; the accuracy can be improved by retraining the probabilites on the new data; 
context-sensitive models (like the tri-tag model, which bases the probability of a tag on what tags are in the adjacent context) vs. context-free models, which associate a probability with grammar rules: see Chitrao/Grishman 1990, mentioned in Weischedel et al. 1994:330f.; (Black/Garside/Leech 1993: accuracy rate 96-97%); context-free models are especially helpful for training automatic parsers; 
word sense tagging: used in information retrieval systems to index texts by word senses taken from MRDs; the aim is not to always identify the single correct sense of a word but to rule out as many incorrect senses as possible; those word senses which cannot be distinguished (or for which there is no preference) are considered as falling into a word sense equivalence class (Zernik/Krovetz 51); 
tagging of unknown words: error rate can be reduced by adding to a purely probabilistic bi- or tri-tag model (taking into account only the context of the word) information about the word itself (a probability model of typical endings and capitalization): accuracy increased to 85% (Weischedel et al. in Armstrong 1994:325);

 

tagset: a collection of tags in the form of a scheme for annotating corpora; 
examples of POS tagset: LOB manual; Penn Treebank (Marcus et al. in Armstrong 1994:277); 
example of syntactic tagset: Penn Treebank (Marcus et al. in Armstrong 1994:281);

 

tokens: individual occurrences in a text = total number of words; 
types =  number of  unique word forms in a text; 
type/token ratio: the average frequency of the different words (word forms) in a text; computed by the formula (number of types) ÷ (number of tokens), or (number of types x 100) ÷ (number of tokens); 
this ratio varies very widely in accordance with the length of the text; a 1,000-word article might have a type/token ratio of 0.5 or 50% (which means that every word is used twice on average); a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%; so, if you compare texts of different size, type/token information is rather meaningless; 
standardised type/token ratio: average type/token ratio based on consecutive chunks of text of equal length (e.g. 1,000 words)
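
A minimal sketch of the standardised type/token ratio described above: the text is cut into consecutive chunks of equal length and the per-chunk ratios are averaged (the chunk size is reduced here so that the toy example yields several chunks):

def standardised_ttr(tokens, chunk=5):
    ratios = []
    for i in range(0, len(tokens) - chunk + 1, chunk):
        piece = tokens[i:i + chunk]
        ratios.append(len(set(piece)) / len(piece))   # per-chunk type/token ratio
    return 100 * sum(ratios) / len(ratios)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(f"{standardised_ttr(tokens):.1f}%")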

 

treebank: a corpus of (manually) skeleton-parsed sentences (annotated with phrase structure information) used for the testing and training of statistically-driven grammars for automatic parsing (IBM Grammar for parsing Computer Manuals) = Grammar Improvement Corpus;

  • Lancaster/Leeds treebank (1983-86 by UCREL, subset of LOB of some 50,000 words; manual annotation of sentences already tagged by CLAWS, originally produced in order to train a probabilistic parser);
  • UCREL/IBM Treebank (1986-1992): 3 million words of data (1 million AP newswire, 750,000 Canadian Hansard, c. 200,000 APHB, c. 53,000 SEC, 800,000 IBM Computer Manuals) tagged with the CLAWS2a tagset (= the "Lancaster Tagset");
  • Penn Treebank (first phase 1989-92; automatically tagged with the probabilistic PARTS tagger, whose output was manually corrected and then parsed using the FIDDITCH parser); 4.8 million words of American English (including BROWN, DJ Newswire stories, Dept. of Energy abstracts etc.);
  • [available as part of the ACL/Data Collection Initiative CD-ROM; for more information: e-mail to treebank@unagi.cis.upenn.edu]; (see Marcus et al. in Armstrong 1994:273ff.);


treebanking: "the construction of bodies of hand-parsed sentences, often from a particular domain of English, in accordance with a previously established scheme of grammatical analysis",  (Black/Garside 1993:214);

 

t-score: a statistical significance measure based on the difference between observed and expected frequencies; focuses more on subtle distinctions (differences, dissimilarity) between different word co-occurrences; a difference of at least 1.65 standard deviations is regarded as significant; 
a useful tool for lexicographers to find out the more subtle contrasts between word pairs, especially synonyms (food/water, ship/boat etc.), and for information retrieval to disambiguate polysemic/homonymic keywords (the 'money' or 'river' sense of bank, including upper vs. lower case); 
(Zernik 1991:122ff.; powerful support is 13 standard deviations less likely than strong support);

 

z-score = significance value: a procedure for determining significant collocations on the basis of statistical/relative frequency parameters (the absolute frequencies of the node and the collocate, the co-occurrence count, the size of the collocational span, and the size of the file); 
Haskel (1971), Berry-Rogghe (1973): the problem of fixing the collocational span (sentence boundaries with a maximum of ±4 linguistic units around the node, or flexible spans); the problem of inflected forms; 
the z-score is the difference between the observed (O) and expected (E) frequencies of 2 (or more) words in a text divided by the standard deviation; if z < 0 a collocation is not significant, if z > 0 it is not necessarily significant; a further distinction is made between moderately significant and highly significant (the cut-off measure for significance is around 3); 
      z = (O - E) / σ 
where σ is the standard deviation of the occurrence of the word in the whole text.

 

a statistical measure that quantifies the distance (measured in standard deviations) of a data point from the mean of a data set; a z-score of 0 corresponds to a 50% probability; if the observed frequency is above the mean, z will be positive, if below, negative [Kenny, 57].
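
A minimal sketch of a z-score for a collocation, following the formula above; the expected frequency and standard deviation are derived here from a simple binomial model of chance co-occurrence within the span, and all counts are invented for illustration:

import math

def z_score(observed, collocate_freq, corpus_size, node_freq, span=4):
    p = collocate_freq / corpus_size     # chance probability of the collocate
    trials = node_freq * span            # positions examined around the node
    expected = p * trials
    sd = math.sqrt(trials * p * (1 - p))
    return (observed - expected) / sd

# e.g. the node occurs 500 times, the collocate 2,000 times in a
# 1-million-word corpus, and they co-occur 25 times within a span of 4 words
print(f"z = {z_score(25, 2000, 1_000_000, 500):.2f}")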

_______________________________________________________________________________________________

 

Essential criteria for evaluating linguistic annotation of corpora:

productivity/speed, accuracy, uniformity of analysis, and linguistic validity; constant need to adapt the annotation mechanisms to new and more specialized text types (supplementary wordlist, idiomlist, tag transition matrix);

 

TAGGING PROGRAMS/AUTOMATIC TAGGERS

 

CLAWS  = Constituent-Likelihood Automatic Word-tagging System;

developed on a general corpus of written English and modified in the light of experience and due to different goals and different text types; tendency to conservatism due to the need to rely on existing tagged data for the matrix of transitional frequencies; consists of the CLAWS lexicon, the idiom-list and the probabilistic tag-transition matrix
