We propose and give examples of a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species profile weighted comparative indicators.
Access to biodiversity data is essential for understanding the state of the art of biotic diversity and for taking informed decisions about sustainable use of biotic resources and their conservation. Among several impediments to publishing and discovery of primary biodiversity data is a lack of professional recognition [1]. An important incentive for scientists to publish research articles, monographs or conference papers is the explicit recognition that their work receives by means of citations from fellow scholars. Owing to the common but tacit conventions established in academic communities, scientists recognize the use of previous research, critically as well as in a positive sense, by adding references to such work into their publication text and list of references. References can be seen as a kind of normative payment [2]. The reference lists can easily be broken down into single units, whereby each reference turns into a citation that can be aggregated in many different ways, forming a wide range of citation impact indicators [3, 4]. Typically, original academic articles, their references and the citations they receive are indexed in citation indexes, such as the Thomson-Reuters' Web of Science database [5]. These means of academic recognition and impact in science constitute the central indicators applied in the well established field of 'scientometrics'. Similarly, we believe that institutionalization of a Data Usage Index (DUI) [1] demonstrating impact of data publishing is feasible, even for a dynamic and complex network, such as the Global Biodiversity Information Facility (GBIF). So far, however, no metrics exist for data usage, especially biodiversity data usage, that recognize all players involved in the life cycle of those data from collection to publication. A set of DUI indicators is lacking [1]. 
We propose the indicators for the development of a DUI based on search events and dataset download instances - thus not based on traditional scholarly references and citations because no data citation mechanism now exists. As a spin-off, the DUI is also intended to provide novel insights into how scholars make use of primary biodiversity data in a variety of ways. Similar to scientometric analyses applying rank distributions, time series, impact indicators and similar calculations based on academic publications, the usage of primary biodiversity datasets leads to the development of a family of indicators and other significant metrics.
By applying instances of viewing, searching and downloading biodiversity dataset records, three characteristics are observable that differ from the use of publication references. First, in contrast to having fairly complete information on the nature of the original work (and its journal) citing a work, one has only limited knowledge about the internet protocol (IP) address from which dataset records were viewed, searched or downloaded, such as its location, that is, its geographic and institutional affiliation. We do not know who actually viewed or downloaded the dataset records. Second, we only know the data publisher's name and location. We do not know who in reality designed, collected and prepared the contents of the dataset and its records. The proposed DUI indicators are thus directly attributable to the academic institution rather than to the scholars behind it. Third, the basic unit in the proposed DUI is a biodiversity dataset record. Thus, we regard the dataset record as analogous to a journal article and the dataset as analogous to a journal. Biodiversity datasets are produced by data publishers, each of which may produce several datasets. As in similar scientometric analyses, normalization is done by means of the basic analysis unit: here this is the dataset record.
In line with the publication and citation behavior mentioned above, and as stated by Chavan and Ingwersen (p. 5 of [1]), "[the] DUI is intended to demonstrate to data publishers that their biodiversity efforts creating primary biodiversity datasets do have impact by being accessed, searched and viewed or downloaded by fellow scientists". All players and their host institutions involved in the data life cycle from collection of data up to its publication require incentives to continue their efforts and recognition of their contribution. In a scientific digital library and open access environment, such as that developed for bibliographic information in astronomy [6], usage is measured in a two-dimensional way. The straightforward way is to apply common scientometric indicators with respect to citation patterns and impact. However, this track is not yet feasible in the case of biodiversity datasets. There are no robust and universally accepted standards for data(set) citations in scientific papers and quantitative analyses of citations to biodiversity datasets will provide unreliable results (see also below). A second avenue is to define usage metrics, based on requests, viewing and downloading of research publications in the form of metadata, abstracts or full text via the astronomy digital library client logs [6]. The citation analysis avenue clearly refers to the authors' intellectual property in the astronomy papers. The usage metrics avenue may also include players responsible for the technical infrastructure presenting such properties. The usage impact could be shared and the distribution of credit would be the responsibility of the host institution: a digital library or a data publisher.
By applying the usage logs of the GBIF data portal [7], the DUI indicators are confined to that context. The usage as measured by searches or downloads of dataset records is detectable only within the coverage of the host system logs, as in a library. However, in contrast to a closed library log system, a substantial amount of log data is publicly accessible from the GBIF data portal usage logs and is consequently open to indicator calculations. The properties of the GBIF-mobilized data usage indicators bridge known scientometric indicators on impact and existing socio-cognitive relevance or social utility measures used in information retrieval studies [10], such as download events, recommendation and rating metrics. The dynamic nature of the GBIF network suggests that short analysis windows be used for the indicator calculations, such as semi-annually, monthly or less, and that the underlying data structures become frozen in logs for later reproduction of analyses.
Figure 2 in [1] depicts a simplified representation of the current GBIF network configuration of servers and their contextual datasets (see [1] for a detailed description and discussion of this infrastructure). The proposed first phase of the DUI indicator developments is based on data usage logs of the GBIF data portal. These provide general usage data on kinds of access and searches via IP addresses as well as download events of datasets accessible through the GBIF data portal, established in 2001 [7]. Currently (as of 5 September 2011), over 300 million records published by 344 data publishers, with the largest data resource containing 42.2 million records, are accessible through the GBIF data portal.
Aside from directly gaining access to the dataset volumes and distributions, the GBIF data portal also provides free access to search and downloading events for each publisher and associated datasets, which can be defined for specific time slots via the dataset entry, Path C, to the data portal. At present only a maximum of 250,000 searching events can be effectively analyzed online from the data usage logs. Semi-annual, monthly or less extensive analysis periods should therefore be applied. Only the current number of stored datasets and records potentially available for searching or downloading in the same time window may be elicited from the internal GBIF data portal log for immediate public online analysis. The extraction of the searchers' IP addresses, location and number of times they individually search the portal can only be performed by staff in charge of the GBIF data portal. In the examples below we concentrate on usage indicators that are feasible to calculate online by the public in open access mode. They are thus reproducible. In total, the GBIF data portal provides five dimensions of data, characterizing datasets that can be used in a variety of dataset usage analyses:
The events of viewing data publisher and dataset metadata belong to characteristics of a searcher's interests and are indeed available in the GBIF data usage logs. However, they cannot be applied in further calculations because they do not entail record viewing. Such events are thus regarded as bounces. Further, similar to analyses of scholarly citations, the usage analyses do not discriminate between different purposes of use of datasets and records, nor their actual usefulness to later research works in the cases of usage through downloads. The latter would require comparison between download volume by a user and his or her actual use in publications shown through direct references to the dataset(s) in question.
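The exclusion of bounces described above can be sketched as a simple filtering step. This is a minimal illustration only: the event kinds, the `LogEvent` fields and the function name are hypothetical and do not reflect the actual schema of the GBIF data usage logs.

```python
from dataclasses import dataclass

# Hypothetical event kinds; the real GBIF log schema is not given in the text.
RECORD_EVENTS = {"search", "download"}
METADATA_EVENTS = {"view_publisher_metadata", "view_dataset_metadata"}


@dataclass(frozen=True)
class LogEvent:
    dataset: str   # dataset identifier (assumed field)
    kind: str      # one of the event kinds above
    records: int   # dataset records involved; 0 for metadata views


def split_bounces(events):
    """Separate record-level events from metadata-only views (bounces)."""
    usable, bounces = [], []
    for event in events:
        if event.kind in RECORD_EVENTS:
            usable.append(event)
        elif event.kind in METADATA_EVENTS:
            bounces.append(event)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return usable, bounces
```

Only the events returned in `usable` would enter the indicator calculations; the bounces are kept separately so that they can still be reported as an expression of searcher interest.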
The preliminary set of indicators relies on counting various events of searching and downloading records from selected GBIF units in given time windows. By 'unit' we mean typical GBIF-defined entities, such as individual datasets or data publishers at institutional or geographical level, or group(s) of species. Because the hierarchy of data record, dataset and data publishers is well established by GBIF as a return of a query to the system, as is the entity of species group or individual species, it is up to the analyst to define further suitable aggregation entities of such units. 'Searching' (and viewing) indicates interest, whereas 'downloading' signifies usage on the side of the visitor accessing the GBIF data.
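The counting of events per unit and time window can be illustrated with a short sketch. It assumes a hypothetical log layout of (day, dataset, kind, records) rows and a hypothetical dataset-to-publisher mapping; the values are invented for illustration and are not drawn from the GBIF logs.

```python
from collections import defaultdict
from datetime import date

# Hypothetical log rows: (day, dataset, kind, records). Invented values.
EVENTS = [
    (date(2011, 1, 10), "dataset_a", "search", 120),
    (date(2011, 1, 15), "dataset_a", "download", 40),
    (date(2011, 2, 2), "dataset_b", "search", 300),
    (date(2011, 7, 1), "dataset_b", "download", 50),  # outside the window below
]

# Hypothetical mapping from dataset to data publisher.
PUBLISHER_OF = {"dataset_a": "publisher_x", "dataset_b": "publisher_x"}


def count_usage(events, start, end, unit_of=lambda dataset: dataset):
    """Tally search and download events and records per unit within [start, end]."""
    totals = defaultdict(lambda: {
        "search_events": 0, "searched_records": 0,
        "download_events": 0, "downloaded_records": 0,
    })
    for day, dataset, kind, records in events:
        if not start <= day <= end:
            continue  # event falls outside the analysis window
        if kind not in ("search", "download"):
            continue  # metadata views (bounces) are not counted
        row = totals[unit_of(dataset)]
        if kind == "search":
            row["search_events"] += 1
            row["searched_records"] += records
        else:
            row["download_events"] += 1
            row["downloaded_records"] += records
    return dict(totals)


window = (date(2011, 1, 1), date(2011, 6, 30))  # a semi-annual window
per_dataset = count_usage(EVENTS, *window)
per_publisher = count_usage(EVENTS, *window, unit_of=PUBLISHER_OF.get)
```

Passing a different `unit_of` function changes the aggregation level, from individual datasets to data publishers here, which mirrors the analyst's freedom to define suitable aggregation entities.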