Morpheme Meanings and Constructional Meanings
(Linguistics and Applied Linguistics – University of Melbourne)
The current focus of the E-MELD project is GOLD (General Ontology for Linguistic Description). GOLD aims to enable the comparison of data categories across different language documentation projects. This paper looks at current efforts to build out GOLD in the light of its potential uses. A recent development is the DDLOD project (Data–driven linguistic ontology development), which is currently mining IGT (interlinear glossed text) for categories with which to build out GOLD. The first question addressed here is ‘What is the nature of IGT?’, which leads to the second question: ‘Is IGT a suitable data source for finding categories to build out GOLD?’ The glosses in IGT are linked to specific morphemes but do not refer only to morphological knowledge. In fact, the core of the knowledge coded for by glosses is the linguist's understanding of the syntactic constructions found in the source language.
The nature of IGT is discussed in some detail and it is argued that IGT is primarily a system of shorthand and a system of cross–referencing. Thus it does not provide a suitable source for finding data categories for GOLD. As an alternative to mining IGT it is suggested that grammatical descriptions and contemporary typological work form the basis for building out GOLD. GOLD is understood as a work of typology because it involves comparing syntactic categories across languages. A parallel is drawn between the possible ways to build out GOLD and two contrasting approaches to typology. These alternative approaches are labelled the ‘morpheme–level’ approach and the ‘construction–level’ approach.
Constructions found in different languages can be grouped into semantic–functional domains. Similar domains have been argued by some linguists to be the only part of languages which are truly comparable cross–linguistically (Croft 2001). These domains form the basis for much functionally or semantically–oriented typological work. If we start with such a domain and then work downwards to the construction level, many of the pitfalls associated with comparing constructions and categories across languages are avoided[1]. A case study of this type of approach forms the last part of this paper, in which I look in detail at some group reference constructions and their associated grammatical morphemes.
The DDLOD methodology is based on assumptions shared with the morpheme–level approach to typology. If we compare languages at the level of syntactic constructions instead, we can make more meaningful comparisons of different languages, minimise errors and better reflect the real understanding linguists have of how particular languages work. IGT is not a repository of linguistic knowledge as such so it does not constitute a useful source of categories for GOLD. A more construction–oriented approach to building GOLD would look for categories from grammatical descriptions and typological work and thus utilise the main wealth of linguistic knowledge available. The resulting GOLD would be more useful to typologists and others interested in accessing material on specific languages in the future.
GOLD has many possible end uses and consequently a range of different interest groups. The main aims of GOLD are summarised in the draft GOLD community process document below:
While the aims of the interest groups involved overlap to some extent, the interest groups and their aims can roughly be divided into two. The first group is those with an interest in computational linguistics. The aim of GOLD is to link in with other ontologies and ultimately the semantic web, a development still in its experimental stage (Farrar and Langendoen 2003a). The semantic web could benefit all interest groups but at this stage those who are aware of this end–use form a small group. This interest group comprises computationally–oriented linguists and linguistically–oriented computer scientists who are in touch with incipient developments and interested in being involved with them. The main aims of this interest group are listed in (1).
(1) Aims of the ‘computational’ interest group
The second aim listed above has been referred to as an important purpose for GOLD (Farrar and Langendoen 2003a; Farrar et al. 2002). Web crawling is part of the DDLOD methodology and requires this type of ‘smart searching’. However, it is unclear what the uses of ‘smart searching’ will be after GOLD has been built, in particular, whether it will be useful to typologists.
GOLD is being developed in line with the overarching aim of the E-MELD project: that archived language documentation projects will have the maximum possible utility in the future. Those who are likely to access archived language documentation projects fall into two main interest groups. The first includes those linguists interested in comparing languages such as typologists and historical linguists. It also includes descriptive linguists. Descriptive linguists work in detail on a single language, but it is rare that they do this in isolation. In describing a language, a linguist needs to compare their analyses with work on similar phenomena. This requires comparison of the language under investigation with areally and genetically related languages, as well as unrelated languages which exhibit similar phenomena. So it is justified to group together typologists, historical linguists and descriptive linguists. The aims of this interest group are associated with accessing language documentation materials and are listed in (2).
(2) Aims of the ‘comparative’ interest group
There is a third important interest group relevant to the E-MELD project as a whole, but without a specific perspective on GOLD. These are those with an interest purely in a particular language. They may be speakers of endangered languages or their descendants or field linguists focussing purely on a single language (such as formally untrained linguists). This group also includes historians and anthropologists who may wish to access non-linguistic content in texts and dictionaries.
This interest group has been alluded to in publications about GOLD. However, I cannot claim to have any particular insight into the aims of this interest group. Whether their aims differ from the ‘comparative’ group or not is unclear. The perspective presented in this paper is that of the second interest group — from typology and descriptive linguistics. In addition to having an interest in typology, I also do fieldwork and am compiling the results of my fieldwork on the indigenous Australian language Mawng as a language documentation project. Some of the examples used in this paper are drawn from my work which provides an example of a typically imperfect language documentation project.
Before going any further it is useful to briefly consider the steps taken in building GOLD. The first step was setting up the computational framework. Initial efforts were guided by the need both to represent linguistic knowledge adequately and to link to existing upper ontologies (Lewis et al. 2001; Farrar and Langendoen 2003b). The second step was to start inputting this knowledge. The first sources of linguistic knowledge that were used included the glossaries in (3), the more general books in (4) and the terminology used in dictionaries and grammars of endangered languages like those in (5) (Lewis et al. 2001).
(3) Morphosyntactic glossaries
(5) Dictionaries and grammars of endangered languages
The third step in building GOLD is to expand the total range of categories. Current efforts to build out GOLD are being carried out as part of a separate project, the DDLOD project. The DDLOD project focuses on the glosses given for grammatical morphemes. Different grammatical morphemes are sourced through web–crawlers which can find IGT (Eggers et al. 2004). The meaning of the grammatical morphemes in the harvested texts is then deduced through a complex, multipronged process (Lewis and Farrar undated). It is clear from the information given that an analysis of the distributional properties of the morphemes is considered. What is unclear is whether relevant grammatical descriptions of the source language are consulted. It is possible that the IGT texts found by the crawler do not have published grammatical descriptions. Alternatively, the methodology may treat grammatical descriptions as irrelevant.
GOLD is designed to make IGT more comparable between language documentation projects. It may appear straightforward that the data an ontology is designed to interpret is the data that should be used to build the ontology, but the nature of IGT presents problems for this approach. The various types of IGT and how IGT could be modelled computationally have been discussed at previous E-MELD workshops (Bow et al. 2003). The next section explores the place of IGT in language description.
The ideal place of IGT is as part of a language documentation project that includes a comprehensive dictionary and grammar as well as an interlinearized text collection. In such an ideal project, the dictionary would list bound and free morphemes and the grammar would list all grammatical morphemes in its index. In the real world IGT is found with varying amounts of other documentation. This may be because that documentation has never been published or alternatively a sample of IGT may be found in one place, such as on the web or quoted in an academic publication, while the other documentation types are found elsewhere.
To paint a clearer picture of the contexts in which IGT can be found, three examples are discussed in the following paragraph. First, we may find a partial grammatical description which includes IGT. For example, in an article giving a syntactic description of coordinating constructions in Hausa (Abdoulaye 2004), we find a detailed description of the uses of the coordinating particle da. Consequently the reader learns much about da but the meaning of many other glosses found in the examples remains a mystery. Second, we may find only a short sample of IGT in a particular language, accompanied by a few brief comments. This type of context is typical of a typological survey of a large range of languages such as Stassen (2000). Stassen (2000) surveys conjunction and comitative constructions in 260 languages. The author gives excerpts of IGT from many languages, sometimes explaining certain aspects of the gloss to illustrate a particular point, sometimes not. The third context in which we may find IGT is when it is completely isolated, lacking any accompanying documentation. For example, I happened to use Austin (1997), a collection of interlinearly glossed texts in some Mantharta languages, to look for a particular construction I was working on, although I did not have access to the relevant dictionary or grammatical description at that time. Similarly, if we use a web-crawler we may harvest some IGT but be unable to access any other materials on the source language.
The uses of IGT depend largely on what other language documentation materials complement it. In the context of the ideal language documentation project IGT has three main functions. First, the process of annotating a text with an interlinear gloss helps the linguist to develop a consistent and comprehensive analysis of the language. Secondly, the glosses are important in published grammatical descriptions. The use of examples with interlinear glossing makes grammatical descriptions more transparent and gives other linguists greater insight into the analysis. The third function is perhaps the most important: the interlinear gloss acts as a system of cross–referencing between a text collection, grammatical description and a dictionary. These three functions of interlinear glossing are summarised in (6).
(6) Uses of interlinear glossing within a language documentation project
Outside of the ideal language documentation project, interlinear glossing can still partially fulfil these functions. Some glossing symbols are universally recognizable and can be correctly interpreted if used in a standard way. Person information associated with number, such as the symbol ‘1sg’, is recognizable, as is information about grammatical relations such as NOM, ACC and S and O. Based only on our guesses of the meaning of familiar symbols we can make a hypothesis as to the categories encoded by many gloss symbols. For example in (7) we might recognise the symbol 1sg.O and deduce that what is the subject in the English translation is expressed as an object in the Mawng sentence. Thus the interlinear gloss line allows the reader to construct part of a more literal translation of the text in (7).
(7) Mawng, Iwaidjan, Australia.
Faced with the example in (7), many linguists would hypothesise that the first word is the main verb and that this language has head marking for arguments and TAM. Some linguists would hypothesise further that the first two words form a complex predicate. This is because the use of capitals for grammatical morphemes is nearly conventionalized. The use of capitals for inflecting verbs in complex predicates is also fairly widespread, but many would not be familiar with this system of glossing. It is highly unlikely that any linguist other than those already familiar with my work on Mawng would deduce that the symbol GEN codes the fact that the verb has a subject of nonmasculine gender. Similarly the symbol LL would not generally be recognised as coding the Land gender article. The information to be gained from a section of IGT with no context is at best a hypothesis as to what the glosses mean. An isolated occurrence of IGT can fulfil the functions listed in (8).
(8) Uses of IGT with no context
IGT is usually found in a context between the two extremes discussed here: the ideal language documentation project and a complete lack of complementary material. So the uses of any actual instance of IGT are a combination of those listed in (6) and (8) above. IGT has many inherent limitations and in addition its utility is limited by the context in which it is found. IGT does not constitute a morphological, syntactic or semantic analysis in itself and is never designed to be self–explanatory.
Interlinear glosses are always created within an ideal world because they are created by linguists with the knowledge required to produce a dictionary and a grammar, whether they ever do so or not. In short, interlinear glosses are a system of cross–referencing and a system of shorthand. Any system of cross–referencing requires complementary materials to cross–reference. IGT requires a dictionary and a grammar to be fully functional. Shorthand is a good analogy for IGT: although standard systems of shorthand are taught, practitioners invariably adapt these and end up using a slightly idiomatic system of their own. It is often not possible to decode shorthand without the help of the author. Similarly the meanings encoded by interlinear gloss symbols vary in the degree to which they can be read by linguists other than their author. The ability of a reader to correctly interpret an interlinear gloss depends on whether the author has adhered to unofficial conventions and how much background knowledge is shared by the author and the reader[4].
Interlinear glossing is the product of a compromise between what the author knows and what they are able to present within a very limited visual format. This format requires that all information is associated with a morpheme and can be abbreviated to a fairly short symbol. Glosses are a shorthand for the author's understanding drawn from the areas of morphology, syntax, semantics and discourse[5]. However, it is difficult to encode those linguistic phenomena not strictly associated with a single morpheme. As an example of some of the problems inherent to constructing glosses, I refer to the system I use to gloss Mawng, which is by no means a model system.
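The cross–referencing role of glosses can be made concrete with a small sketch. The Python below is illustrative only: the class `IGTLine`, the function `resolve_glosses` and both abbreviation tables are hypothetical names introduced here, not part of GOLD, E-MELD or any glossing tool, and the morphemes are placeholders rather than real language data. The sketch assumes a three-line format and shows that a gloss symbol can be resolved only when the documentation available to the reader defines it. The symbols GEN and LL are given the readings described for my Mawng glossing above.

```python
from dataclasses import dataclass


@dataclass
class IGTLine:
    """One three-tier example: morphemes, aligned glosses, free translation."""
    morphemes: list
    glosses: list
    translation: str


def resolve_glosses(line, abbreviations):
    """Split the glosses of a line into those the table defines and those it does not."""
    resolved = {}
    unresolved = []
    for gloss in line.glosses:
        if gloss in abbreviations:
            resolved[gloss] = abbreviations[gloss]
        else:
            unresolved.append(gloss)
    return resolved, unresolved


# Placeholder line (assumption): "m1".."m3" are not real morphemes and the
# combination of symbols is not taken from any actual Mawng example.
example = IGTLine(
    morphemes=["m1", "m2", "m3"],
    glosses=["NOM", "GEN", "LL"],
    translation="(placeholder)",
)

# Symbols a reader might recognise with no accompanying documentation.
widely_known = {"NOM": "nominative", "ACC": "accusative"}

# Hypothetical abbreviation table of the kind a grammar would supply.
author_table = dict(widely_known)
author_table["GEN"] = "verb has a subject of nonmasculine gender"
author_table["LL"] = "Land gender article"

without_grammar = resolve_glosses(example, widely_known)
with_grammar = resolve_glosses(example, author_table)
```

With only the widely known symbols, GEN and LL remain unresolved; with the author's table every symbol is resolved. The lookup itself is trivial, which is the point: the information resides in the table, not in the gloss line.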
Mawng verbs have five different TAM suffixes. Two are glossed using the labels Irrealis 1 (I1) and Irrealis 2 (I2). de Haan (2004) discussed the meanings of these two suffixes in the context of semantic maps at the E-MELD 2004 workshop. The commonly used term ‘irrealis’ is able to encompass the diverse meanings of these suffixes, which depend largely on the syntactic and discourse context in which they are used. The main way of expressing negation in Mawng is to combine the preverbal particle marrik with one of the two irrealis suffixes. The combination of the Irrealis 1 suffix and the preverbal particle marrik expresses present or future negation as in (9).
The combination of the Irrealis 2 suffix and the preverbal particle marrik expresses past tense negation as in (10).
In the absence of marrik the Irrealis 2 suffix can be used to form the imperative. Both Irrealis 1 and 2 can be used alone to express different types of hypothetical and counterfactual meanings such as ‘might (have)’, ‘could (have)’, ‘would (have)’, ‘tried to’, ‘wanted to’, etc.
The meaning of a syntactic construction might sometimes appear to reside in a single key grammatical morpheme. However, syntactic constructions are inherently idiomatic — their meaning is more than the sum of their parts (Goldberg 1995). The meaning ‘past negative’ in Mawng does not reside in a particular verbal suffix, or in the preverbal particle marrik, but is produced by a particular combination of these morphemes. Although we can find constructions in any language that are easily identified by a particular characteristic grammatical morpheme, the occasional occurrence of this type of morpho–syntactic–semantic alignment does not justify an assumption that such a relationship is found in all constructions.
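The Mawng facts just described can be restated as a small lookup, as a rough sketch of the difference between the two levels of description. The table `CONSTRUCTIONS` and both function names are hypothetical, and the meaning labels are simplified paraphrases of the description above; the real meanings also depend on syntactic and discourse context, which this sketch ignores.

```python
# Keys are (marrik present?, irrealis suffix); values are the meanings
# described in the text for that combination (simplified labels).
CONSTRUCTIONS = {
    (True, "I1"): ["present or future negation"],
    (True, "I2"): ["past negation"],
    (False, "I1"): ["hypothetical or counterfactual"],
    (False, "I2"): ["hypothetical or counterfactual", "imperative"],
}


def construction_meanings(has_marrik, suffix):
    """Construction-level view: meaning is looked up for the combination."""
    return CONSTRUCTIONS[(has_marrik, suffix)]


def suffix_meanings(suffix):
    """Morpheme-level view: pool every meaning the suffix takes part in."""
    pooled = []
    for (_, sfx), meanings in CONSTRUCTIONS.items():
        if sfx == suffix:
            for meaning in meanings:
                if meaning not in pooled:
                    pooled.append(meaning)
    return pooled
```

Here `construction_meanings(True, "I2")` yields only the past negative reading, whereas `suffix_meanings("I2")` returns an undifferentiated pool of three labels, none of which can be attributed to the suffix on its own.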
A typological approach that is based on the assumption of a one–to–one relationship between constructions, their characteristic grammatical morphemes, and meaning will be referred to here as a ‘morpheme–level’ approach. A construction–level approach to typology involves comparing syntactic constructions within a particular semantic–functional domain. In this section some constructions that occur in the semantic–functional domain ‘group reference constructions’ are compared. Group reference constructions are found universally in languages; they provide a means of referring to a group of entities in such a way that some information is also given about the membership of the group. The fact that all languages have at least one group reference construction suggests that we are dealing with a useful semantic–functional domain for cross–linguistic comparison.
Three constructions in this domain are useful for illustrating the very different types of relationships that may exist between a construction and its characteristic grammatical morpheme. The constructions considered here are the comitative construction, the nominal conjunction construction and the inclusory construction. These are only a subset of all the possible constructions that could be categorised within the semantic–functional domain. As illustrated in Figure 1 there are many other constructions that could be discussed in this domain. Figure 1 also shows the relationship between group reference constructions and other nonsingular referring expressions such as nonsingular nouns and pronouns. The discussion of group reference constructions here is informed by recent typological work by Haspelmath (2004), Stassen (2000), Lichtenberk (2000) and detailed discussions of coordinating constructions in specific languages such as the contributions to Haspelmath (ed) (2004).
Figure 1. Group reference constructions
The comitative construction is an exemplar of the type of relationship between morphemes and constructions that is assumed by a morpheme–level approach. Firstly, the comitative construction is relatively uniform cross-linguistically (Stassen 2000). Secondly, the comitative construction requires the use of a grammatical morpheme to mark the added participant almost by definition[6]. An example is the comitative preposition with in English as illustrated in (11).
Stassen (2000) surveyed the comitative construction in a sample of 260 languages and found that it does not vary greatly between languages. However, in some languages we do not find a distinct comitative construction at all, as a single construction covers the functions of both comitative and conjunction constructions. Where the distinction does exist, it is both semantic and syntactic. The referents of conjunction constructions are presented as having the same semantic role whereas in the comitative construction one participant always has a slightly different role to the other. The syntactic distinction between the two constructions is that agreement with a comitative construction is with one of the referents but agreement with a conjunction construction is with both of the referents.
Conjunction constructions often have a characteristic grammatical morpheme but they may not. Grammatical morphemes that serve as the indicator of the presence of a particular syntactic construction will be referred to as ‘markers’. Markers with a range of different distribution patterns have been found in conjunction constructions (Stassen 2000). A single marker may occur between the two NPs like the English and, or it may be associated with the second NP. There may be two markers — each associated with one of the NPs. The third main pattern found is that in which there is no marker at all — the construction is then often referred to as an ‘asyndetic’ conjunction construction (Haspelmath 2004). The three main patterns for markers in the conjunction construction are listed in (12).
(12) Common distributions of markers in conjunction constructions
Stassen found that in one third of the 260 languages he surveyed the same marker was used to mark both comitative and conjunction constructions. This type of multifunctionality in markers should be able to be encoded in GOLD in its current form; however, it may present some problems for the DDLOD methodology. A brief glance at the contributions in Haspelmath (ed) (2004) shows that it is common for authors to use a monosemous gloss for markers which participate in both comitative and conjunction constructions, for example ‘AND’ or ‘COORD’. This type of gloss is likely to mislead those applying the DDLOD methodology as these markers may be interpreted as simply marking conjunction constructions, when in fact they mark both comitative and conjunction constructions.
It will not be possible to search for asyndetic conjunction in IGT with a simple three–tiered structure. There can be no getting around this problem in terms of how GOLD is built or how the IGT search engines based on it are designed. It is a limitation of IGT that will be carried over to GOLD: constructions which lack markers cannot be encoded in IGT with a three-tiered structure. The fact that some constructions lack any characteristic marker underlines the fact that descriptive grammars will always be central to typology. Language documentation projects that comprise text and lexicons alone cannot fully capture the knowledge of a linguist with a special understanding of a language.
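The risk posed by a monosemous gloss can be illustrated with a minimal sketch. Everything in the Python below is an assumption made for illustration: the marker belongs to no particular language, and the gloss–to–category table is a deliberately naive reading of gloss labels, not a description of how the DDLOD methodology actually works.

```python
# What a grammatical description would say about a hypothetical marker.
marker_uses = {"comitative", "conjunction"}

# The single gloss an author might choose for that marker.
marker_gloss = "AND"

# Naive reading of gloss labels (assumption, for illustration only).
GLOSS_TO_CATEGORY = {
    "AND": {"conjunction"},
    "COORD": {"conjunction"},
}


def infer_from_gloss(gloss):
    """Guess the constructions a marker takes part in from its gloss alone."""
    return GLOSS_TO_CATEGORY.get(gloss, set())


# Uses recorded in the grammar but invisible in the gloss.
missed = marker_uses - infer_from_gloss(marker_gloss)
```

In this toy case `missed` contains the comitative use: the gloss label alone recovers only half of what the grammatical description records.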
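The search problem can be shown with a toy corpus. The sketch below is hypothetical throughout: the forms `w1`, `w2` and `m` are placeholders rather than real language data, the function `find_by_gloss` is invented for this illustration, and the `construction` field stands for knowledge that a grammatical description would supply, since it has no place in a three–tiered gloss.

```python
def find_by_gloss(corpus, gloss_symbol):
    """Return the examples whose gloss tier contains the given symbol."""
    return [ex for ex in corpus if gloss_symbol in ex["glosses"]]


corpus = [
    {
        "id": "syndetic",
        "morphemes": ["w1", "m", "w2"],      # placeholder forms
        "glosses": ["man", "AND", "woman"],
        "construction": "conjunction",       # known only from a grammar
    },
    {
        "id": "asyndetic",
        "morphemes": ["w1", "w2"],           # no marker to carry a gloss
        "glosses": ["man", "woman"],
        "construction": "conjunction",       # known only from a grammar
    },
]

hits = [ex["id"] for ex in find_by_gloss(corpus, "AND")]
all_conjunctions = [ex["id"] for ex in corpus if ex["construction"] == "conjunction"]
```

A search on the gloss tier finds only the syndetic example, although both examples are conjunction constructions. The asyndetic one can be retrieved only through the extra annotation, which is exactly the information a grammar provides and the gloss line lacks.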
The inclusory construction is part of the same semantic–functional domain as conjunction and comitative constructions but is less well known despite being common cross–linguistically (Lichtenberk 2000). The inclusory construction is another way of referring to a group and a member of it. The underlying semantics are different but in many contexts the inclusory construction is the translation equivalent of a conjunction or comitative construction. The inclusory construction involves a pronominal element referring to an entire group and an NP referring to a participant who is construed as a subset of that group. There are two main forms that the construction can take, depending on whether the form of the pronominal element is a free pronoun like kamwa in (13) or a bound pronominal element like the verb prefix ngarr– in (14).
(13) Mokilese, Oceanic, Micronesia
(14) Mawng, non-Pama Nyungan, Australia.
It is common for the inclusory construction to lack a characteristic grammatical morpheme altogether as in the Mokilese and Mawng examples above. When a marker does occur it is often a marker which also occurs in the comitative construction. For example, in Polish a sentence like that in (15) can be interpreted as containing either a comitative or an inclusory construction. If it is interpreted as an inclusory construction then the entire group referred to can consist of only two people, but if it is interpreted as a comitative construction then the group must consist of three or more people.
(15) Polish, Slavic, Indo–European, Poland
In Nêlêmwa the same marker is used in the inclusory, comitative, conjunction and associative constructions[7]. An example of a Nêlêmwa inclusory construction is shown in (16).
(16) Nêlêmwa, Remote Oceanic, Oceanic, New Caledonia
The inclusory construction is globally common. It is found in most large–scale linguistic areas and is very common in some language groups such as Austronesian languages and Australian languages (Schwartz 1985; Singer 2001). In many languages the construction involves the use of a marker but no language has yet been found in which the inclusory construction has its own unique marker. The patterns of ‘marker–sharing’ that have been found between the inclusory construction and other constructions are listed in (17).
(17) Constructions attested to share a marker with the inclusory construction in some language
Given the patterns of marker sharing listed in (17) the multifunctionality of the markers involved can be characterised as in (18).
(18) Functions of markers (based on marker–sharing patterns in (17))
If we took a purely morpheme–level approach we would find two types of grammatical morphemes in our data: the first two listed in (18). If we start with constructions instead we see that the relationship between grammatical constructions and their markers is not at all straightforward. The relatively unknown status of the inclusory construction among linguists and its lack of a unique grammatical marker make it likely that such a construction would go under the radar of an IGT–mining approach to building GOLD. The DDLOD method finds a grammatical morpheme and then looks at its distribution in texts. This methodology is an imperfect and unnecessary replication of the long and complex path descriptive linguists take to understanding the role of a morpheme in syntactic constructions. The linguist considers morphemes within the broader context of syntactic constructions, rather than simply looking at their distribution. The DDLOD methodology is an attempt to start at the morpheme and work up and consequently faces many of the same problems as morpheme–level typology.
There is a parallel between the morpheme–level approach to typology and the current approach to building GOLD. The morpheme–level approach takes a narrower view of the relationship between syntactic meanings and morphemes, and thus excludes much linguistic knowledge that GOLD aims to include. Although the symbols in interlinear glosses are pegged to individual morphemes, these morphemes are understood by the author of the IGT on the basis of their role in syntactic constructions. This type of knowledge is found primarily in works of grammatical description.
A construction–level approach could make GOLD more useful to all the interest groups identified earlier. Taking the semantics of constructions as the basis for GOLD rather than particular morphemes will give the GOLD ontology a higher level of generality. Morphemes do not really have their own ‘meanings’, independently of the constructions they participate in. Documenting such apparent ‘meanings’ is a potentially infinite task, whereas constructions can be grouped into semantic-functional domains. These domains are more comparable across languages than individual morphemes. These two reasons for advocating a construction–level approach to building GOLD are summarised in (19).
(19) Reasons to take a construction-level approach to building out GOLD
A construction–level approach to building GOLD will allow the builders to draw on the primary source of linguistic knowledge on particular languages — grammatical descriptions. Such an approach will allow GOLD to reflect much more linguistic knowledge than the current approaches. A construction–level approach could fully exploit the large amount of high quality grammatical description and typological work that is freely available.
1. Although by no means all, and for this reason Croft (2001) does not admit comparison of constructions or grammatical categories across languages.
2. See references section below for full details of these publications.
3. Note that in discussing IGT I assume the three-line format, which Bow et al. (2003) describe as the most common type; an example is shown in (7).
4. Good (2004) describes some grammars as more accessible to ‘subcommunities’ of linguists. The same point applies to interlinear glosses.
5. Brugman (2004) argues a 12-tiered gloss would better reflect the different ontological types encoded in interlinear glosses.
6. Although it is theoretically possible that a comitative NP might be coded by word order alone, I have not found any descriptions of this strategy being used in a language. I exclude applicatives here, for simplicity.
7. See Moravcsik (2003) for more information on associatives.
Austin, P. (1997) Texts in the Mantharta languages, Western Australia. Tokyo, Institute for the Study of Languages and Cultures of Asia and Africa.