Recommendations of Resource Discovery Working Group


Jeff Good (University of Pittsburgh)

Ulrike Kiefer (Förderverein für Jiddische Sprache und Kultur e.V.)

Robert Neumann (Förderverein für Jiddische Sprache und Kultur e.V.)

Other participants:

Anthony Aristar (Wayne State University)

Hennie Brugman (Max Planck Institute for Psycholinguistics, Nijmegen)

Scott Farrar (University of Bremen)

William Lewis (California State University, Fresno)

(And some others whose names, unfortunately, were not recorded.)

0. Introduction

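The resource discovery working group focused on four questions with respect to EMELD and the issue of resource discovery: (1) what kinds of resources EMELD should provide search services for; (2) what the architecture of an EMELD search should be; (3) how EMELD can get desirable metadata into its search database; and (4) what level of metadata should be exposed.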

The working group's recommendations for each of these questions are given in the sections below.

1. What kinds of resources should EMELD provide search services for?

The working group recommends that EMELD search services should allow users to locate as wide a range of resources as they might be interested in for their linguistic needs. Such resources could include, among other things:

However, it would clearly be a major undertaking to implement all these searches, and it is far from clear that the working group on its own was able to determine everything that people might be interested in looking for. Therefore, the working group recommends that a questionnaire be distributed to determine linguists' searching needs (this questionnaire should presumably be distributed by Linguist). The working group volunteered to work on a draft of this questionnaire for EMELD.

Another aspect of the question of what resources EMELD should provide search services for is what resources, of a given category, should be within a search database. Here, the working group made the distinction between two kinds of best-practice resources: (a) those resources with best-practice metadata and (b) those digital resources which are in best-practice formats. The working group recommends that any resource for which there is best-practice metadata should be put into the search database.

In addition, EMELD should encourage all digital resources to be in best-practice formats. Such resources should be marked in a special way when they appear as search results (perhaps even ``ranked'' higher than other resources). Furthermore, enhanced search features would automatically be available for such resources, since only they would be interoperable with other best-practice documents.

An issue raised by the recommendation that best-practice digital resources get special marking is the general problem that, as of now, there are no reviewing bodies for digital resources. Making recommendations for the development of such groups is beyond the scope of this working group. However, we would like to suggest that another working group (perhaps under the auspices of OLAC) be formed to deal with this issue.

A final topic discussed by the group with respect to resource discovery is that, at present, the adopted metadata standards are, for the most part, designed for describing linguistic resources but not non-linguistic resources (like tools or advice documents) of interest to linguists. While developing metadata standards for such resources is outside the scope of EMELD, this working group recommends that EMELD attempt to classify all the resources which are part of the School of Best Practice within the OLAC metadata standard. It could then use what it learns from this process to make initial recommendations to OLAC about metadata needs for non-linguistic resources of interest to linguists.

2. What should the architecture be for an EMELD search?

In discussing this question, the working group worked under the assumption that both metadata and data will be distributed---in a way similar to the OAI metadata model where each archive disseminates its own metadata.

There are two very distinct kinds of searches which EMELD wants to implement. The first is simple resource discovery. A basic architecture for this kind of search already exists in the OAI/OLAC model. The second is querying across the data in interoperable documents, and there is no accepted basic architecture for it at this point. The need for tools and services of this second kind is intensifying as interoperable documents are increasingly being developed.
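As a rough illustration of the first kind of search, a harvester in the OAI model asks each archive for its metadata records with an HTTP request. The sketch below only builds such a request URL and does not contact any server. The archive address is a placeholder, and the assumption that the archive answers to the `olac` metadata prefix is ours.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="olac", from_date=None):
    """Build an OAI-PMH ListRecords request URL for a single archive.

    base_url is the archive's OAI endpoint (a placeholder in this sketch).
    from_date, if given, restricts the harvest to records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical archive endpoint, used only for illustration.
print(list_records_url("http://archive.example.org/oai"))
```

A resource discovery service would issue one such request per registered archive and index the records that come back.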

The working group's recommendation about the basic architecture for this second kind of search is that a generalized query protocol should be developed for extracting data from linguistic documents. Specifically, a series of ``methods'' could be defined that could be called on resources and that would return structured linguistic data matching the query parameters. These methods would be the foundation for standardized services to facilitate extensive and intensive harvest of linguistic resources.
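To make the idea of a ``method'' more concrete, here is a minimal sketch of what one might look like for a lexical resource. The class names, the `find_entries` method, and its parameters are all hypothetical; the working group did not specify any particular methods.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One structured record returned by a query method."""
    form: str
    gloss: str
    language: str  # ISO 639-3 code


class LexiconResource:
    """A hypothetical resource that exposes a single query method."""

    def __init__(self, entries):
        self._entries = list(entries)

    def find_entries(self, gloss=None, language=None):
        """Return the entries that match every parameter that was supplied."""
        results = []
        for entry in self._entries:
            if gloss is not None and entry.gloss != gloss:
                continue
            if language is not None and entry.language != language:
                continue
            results.append(entry)
        return results


resource = LexiconResource([
    Entry("Hund", "dog", "deu"),
    Entry("hond", "dog", "nld"),
    Entry("Katze", "cat", "deu"),
])
for entry in resource.find_entries(gloss="dog"):
    print(entry.language, entry.form)
```

If several resources implemented the same method with the same parameters, a single query could be sent to all of them and the structured results merged.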

There are many barriers to developing such a query language. One of the most basic questions raised is what methods would need to be implemented. To help answer this question, the working group recommends structuring the search questionnaire in a way which allows the general community to give input on the sort of cross-database searching which is considered most desirable.

Another barrier to implementing this sort of ideal database querying system is that, as of now, very few repositories allow their data to be accessed in any generalized way. As an initial step in this direction, the working group recommends that EMELD encourage organizations hosting online database searches to document their data access systems and to develop a metadata standard to describe these systems. Such documentation would also be helpful in determining what kinds of query methods are of value to the linguistic community.

For the long-term prospects of cross-database searching, the working group recommends that EMELD encourage OLAC to develop a database query protocol. This would include a well-defined query language and a way to ``package'' queries. In addition, in order for such a query language to become widely used, there would need to be a linguistic data search registry where sites implementing the query protocol could register the query methods they implement. One member of the working group informed us that Steven Bird had already secured a grant to work on such a query language, so this recommendation may, in some sense, be obsolete. Nevertheless, the working group also recommends that, once such a protocol is developed, EMELD implement it for its own database and for archived best-practice documents whose creators are not capable of implementing the protocol themselves.
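The registry idea can be sketched in a few lines. This is only an in-memory illustration under our own assumptions: the class, the method names, and the site addresses are hypothetical, and a real registry would need persistence and some way to validate what sites claim to implement.

```python
from collections import defaultdict


class QueryRegistry:
    """A hypothetical registry of which sites implement which query methods."""

    def __init__(self):
        self._sites_by_method = defaultdict(set)

    def register(self, site_url, methods):
        """Record that the site at site_url implements the given methods."""
        for method in methods:
            self._sites_by_method[method].add(site_url)

    def sites_supporting(self, method):
        """Return the sites that registered the method, in sorted order."""
        return sorted(self._sites_by_method.get(method, ()))


registry = QueryRegistry()
# Placeholder site addresses and method names, for illustration only.
registry.register("http://lexicon.example.org", ["find_entries"])
registry.register("http://texts.example.org", ["find_entries", "find_sentences"])
print(registry.sites_supporting("find_entries"))
```

A cross-database search service would consult the registry first and then send a packaged query only to the sites that support the requested method.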

A further step towards devising this query protocol recommended by the working group is that EMELD form a pilot project involving a few distributed resources in a cross-database search. To begin this project, EMELD could extract the FIELD search out of FIELD and then implement it both over the FIELD database, where it is already implemented, and over some other set of similar databases. The AISRI dictionary database and some of the databases run by MPI Nijmegen are probably good choices for other databases for this pilot project.

Another possible approach to cross-database searching, and one that would be simpler for EMELD to implement, would be to build tools around a grammatical thesaurus that gives common synonym sets for grammatical terms (e.g. ``oral stop'' and ``plosive''). This thesaurus could then be used to allow a user's search to be ``expanded'' to include related terms, in addition to other possible applications. Such a thesaurus is already being developed by the GOLD group. Much of the work may therefore already have been done, and what would be required is the development of search tools exploiting a thesaurus rather than of a thesaurus itself.
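A minimal sketch of such query expansion follows. The only synonym set included is the one mentioned above; the data structure and the `expand_term` function are our own assumptions and do not reflect how the GOLD thesaurus is actually organized.

```python
# Each set groups terms that are treated as synonyms for search purposes.
# Only the example from the text is included here.
SYNONYM_SETS = [
    {"oral stop", "plosive"},
]


def expand_term(term, synonym_sets=SYNONYM_SETS):
    """Return the term together with its synonyms, sorted alphabetically.

    Terms not found in any synonym set are returned unchanged (lowercased).
    """
    key = term.strip().lower()
    for synonyms in synonym_sets:
        if key in synonyms:
            return sorted(synonyms)
    return [key]


print(expand_term("Plosive"))
```

A search tool would then match a resource if its description contains any of the expanded terms, rather than only the term the user typed.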

A separate issue discussed with respect to the architecture of an EMELD search system was that EMELD should not only develop its own search interface but also implement a Viser-like feature over its search system. That is, developers should be allowed to use the EMELD search engine as a ``back-end'' for their own specialized searches.

The working group discussed, but had no particular recommendations for, the question of what the EMELD search interface should look like. Instead, it recommends adding questions about desired interfaces to the questionnaire so that EMELD can get advice from the larger linguistics community. It also recommends that all the EMELD advisors be asked to use the current search and give EMELD feedback about what they do and do not like about it.

3. How can EMELD get desirable metadata into its search database?

The working group discussed two broad classes of strategies for making sure that desirable metadata gets into the EMELD database: ``stick'' and ``carrot'' strategies.

With respect to the ``stick'' category, the working group did not arrive at definitive recommendations. One suggestion was that EMELD appoint ``ambassadors'' in different linguistic subfields and areas to actively look for data creators in their assigned area and to encourage them to produce OLAC metadata. These ambassadors could be assisted by a ``Linguist Spider'' which develops a database of linguistic web sites based on the URLs which have been harvested by Linguist over the years. There is always a danger, of course, that if EMELD is too pushy it could alienate people, which is why it was difficult to devise definitive recommendations in this area.

With respect to ``carrots'', the working group has several suggestions, though it does not recommend any particular suggestion over another. (It does recommend that some subset of them be implemented, however.) The suggestions were: EMELD could support harvesting metadata contained in document headers; resources with best-practice metadata could be assigned a standard EMELD URI allowing them to be referenced in papers; and resources with best-practice metadata could be stored on Linguist servers and disseminated and advertised by Linguist. In addition, ``juicier'' carrots could be given to best-practice resources. For example, they could be assigned ``preferred'' EMELD URIs, be given special marking when they appear in search results, be eligible for ``advanced'' search techniques, or be peer-reviewed and vetted by some group organized specifically to review digital linguistic resources.
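The first of these suggestions, harvesting metadata from document headers, could look roughly like the sketch below. It assumes that the documents are HTML pages and that their authors have placed Dublin Core elements in `meta` tags with names such as `DC.title`; the working group did not specify a header format, so both points are assumptions, as is the sample page.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        # Keep only Dublin Core elements; ignore keywords, robots, etc.
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            self.metadata.setdefault(element, []).append(content)


def extract_header_metadata(html_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Hypothetical page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Sample word list</title>
<meta name="DC.title" content="Sample word list">
<meta name="DC.language" content="yid">
<meta name="keywords" content="ignored">
</head><body></body></html>"""

print(extract_header_metadata(SAMPLE_PAGE))
```

The extracted elements could then be mapped onto OLAC metadata records, which would lower the cost of participation for data creators who already maintain web pages.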

4. What level of metadata should be exposed?

A final issue discussed by the working group was the ``granularity'' problem. Right now there are no recommendations as to what level of metadata should be exposed by a data provider. This problem is most acute for large archives whose metadata has a complex hierarchical structure and for ``cutting-edge'' archives which cannot easily be accommodated in a model where a resource is considered to be a static object (perhaps resources are created dynamically based on user input, for example).

The lack of recommendations in this area at present inhibits metadata creation. Since what metadata is exposed by an archive significantly affects what can be searched, there is an urgent need for such recommendations. The working group, therefore, recommends that EMELD encourage IMDI and OLAC to devise best-practice recommendations for metadata exposure. There are at least three major classes of audiences in need of advice: individuals, trusted repositories (i.e. ``real'' archives), and ``cutting-edge'' data providers which create resources dynamically.