Kæmpeprojekter

Project proposal D: Web service technologies in the GBIF network

The Global Biodiversity Information Facility (GBIF) is a megascience project headquartered in Copenhagen. Its purpose is to make the world's biodiversity data freely and universally available on the Internet. More at http://www.gbif.org/

The GBIF network consists of 1) data providers, also called nodes, 2) a central services registry based on UDDI, 3) a central index of the metadata held at the providers, and 4) a portal. The architecture is described at http://circa.gbif.net/Public/irc/gbif/dadi/library. Some of the components are still being built, but the network is scheduled to enter production at the end of 2003. It will provide a platform for scientific research, mainly in the biological and ecological sciences, but also for large-scale distributed computing.
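
To fix the vocabulary used in the proposals below, the four roles can be summarised as plain data types. This is a sketch only; the field names are illustrative, not taken from any GBIF specification:

    from dataclasses import dataclass, field

    @dataclass
    class DataProvider:
        """A node serving primary biodiversity data."""
        name: str
        endpoint: str  # URL of the node's DiGIR/BioCASE service

    @dataclass
    class Registry:
        """The central UDDI registry of services."""
        providers: list = field(default_factory=list)

    @dataclass
    class MetadataIndex:
        """The central index of metadata harvested from the providers."""
        entries: dict = field(default_factory=dict)

    @dataclass
    class Portal:
        """The user-facing entry point that ties the other parts together."""
        registry: Registry
        index: MetadataIndex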

The following subjects could already be studied in 2003-2004. This is by no means an exhaustive list.

Protocols

GBIF web services currently use two dedicated protocols, DiGIR and BioCASE, for querying and returning data. How can these be merged into a single next-generation protocol? This requires analysis of the GBIF data sources and of XML technologies, in particular XPath and XQuery. The current data providers do not use WSDL descriptions of their services. WSDL support could be built into them to allow more dynamic discovery of the data contents. DiGIR and BioCASE do not build on SOAP. Embedding them as payload within SOAP would allow many generic tools to be used for building the GBIF network, as sketched below.
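
As a rough illustration of the SOAP-embedding idea, the following Python sketch wraps a DiGIR-style query as the payload of a SOAP 1.1 envelope. The element names and namespace under the digir prefix are illustrative placeholders, not the exact DiGIR schema:

    import xml.etree.ElementTree as ET

    SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
    DIGIR_NS = "http://digir.net/schema/protocol"  # placeholder namespace

    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("digir", DIGIR_NS)

    # The DiGIR-style request travels opaquely inside the SOAP body, so
    # generic SOAP tooling (routing, logging, security) can handle it
    # without understanding the biodiversity payload itself.
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{DIGIR_NS}}}request")
    search = ET.SubElement(request, f"{{{DIGIR_NS}}}search")
    ET.SubElement(search, f"{{{DIGIR_NS}}}filter").text = "ScientificName LIKE 'Puma*'"

    print(ET.tostring(envelope, encoding="unicode"))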

UDDI registry technology

The GBIF UDDI registry at http://registry.gbif.net/ is currently based on a commercial tool from Systinet. What is the status of open-source UDDI products, and what are the possibilities for migrating to one? UDDI version 2, currently in use, does not support replication between multiple servers; UDDI version 3 will. What is a suitable approach for distributing the GBIF registry to several countries?
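
For concreteness, a UDDI version 2 inquiry is itself a SOAP message. The sketch below posts a standard find_business request; the inquiry URL is an assumed placeholder, not the documented GBIF endpoint. A survey of open-source registries could start by checking which of them accept exactly this message:

    import urllib.request

    INQUIRY_URL = "http://registry.gbif.net/uddi/inquiry"  # assumed endpoint

    # A standard UDDI v2 find_business inquiry, wrapped in a SOAP envelope.
    request_body = """<?xml version="1.0" encoding="UTF-8"?>
    <Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
      <Body>
        <find_business generic="2.0" xmlns="urn:uddi-org:api_v2">
          <name>GBIF</name>
        </find_business>
      </Body>
    </Envelope>"""

    req = urllib.request.Request(
        INQUIRY_URL,
        data=request_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": '""'},
    )
    with urllib.request.urlopen(req) as response:
        print(response.read().decode("utf-8"))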

Data vaults

The GBIF network will consist of hundreds of data nodes. They cannot always be online at the time when queries are made. Therefore, intermediate storages such as caches and data vaults should be devised. How can queries be redirected to these? What kind of high-performance query engines can be built to support such a large number of queries? And what if the data in the source changes after copies have been cached?
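
The redirect-and-fall-back behaviour can be prototyped in a few lines. The sketch below (all names hypothetical) tries the live node first and falls back to a vault copy, marking it stale when it is older than a freshness window:

    import time

    class DataVault:
        """Holds the last known copy of each record plus its fetch time."""
        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self.store = {}  # record id -> (data, timestamp)

        def put(self, record_id, data):
            self.store[record_id] = (data, time.time())

        def get(self, record_id):
            entry = self.store.get(record_id)
            if entry is None:
                return None
            data, fetched_at = entry
            stale = time.time() - fetched_at > self.ttl
            return data, stale

    def query(record_id, fetch_from_node, vault):
        """Try the live node first; fall back to the vault if it is offline."""
        try:
            data = fetch_from_node(record_id)
            vault.put(record_id, data)  # refresh the vault copy
            return data, "live"
        except ConnectionError:
            cached = vault.get(record_id)
            if cached is None:
                raise
            data, stale = cached
            return data, "vault (stale)" if stale else "vault"

    # Example: the node is offline, so the query is served from the vault.
    vault = DataVault(ttl_seconds=3600)
    vault.put("specimen-12345", "<record>...</record>")

    def offline_node(record_id):
        raise ConnectionError("node unreachable")

    print(query("specimen-12345", offline_node, vault))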

Data usage and data discovery patterns

GBIF makes data freely and universally available. Who is downloading whose data? Intelligent usage statistics could be developed that would 1) support reporting to data providers on who is downloading their data, 2) allow those who use the data to recognise the data sources, and 3) support optimisation of the network architecture and load balancing. Data may not always be downloaded from the original source but from value-adding caching and data-vault services. This requires that all data be tagged with globally unique identifiers that allow tracing the data back even after several steps of caching and forwarding. How can the identity of data be maintained? One possibility that should be investigated is the use of LSIDs/URNs.
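
An LSID has the fixed form urn:lsid:authority:namespace:object[:revision]. The sketch below mints and parses such identifiers so that a record remains traceable through caching and forwarding; the authority and namespace values are invented examples:

    def mint_lsid(authority, namespace, object_id, revision=None):
        """Build an LSID-style URN for a data record."""
        parts = ["urn", "lsid", authority, namespace, str(object_id)]
        if revision is not None:
            parts.append(str(revision))
        return ":".join(parts)

    def parse_lsid(lsid):
        """Split an LSID back into its components."""
        parts = lsid.split(":")
        if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
            raise ValueError(f"not an LSID: {lsid}")
        authority, namespace, object_id = parts[2], parts[3], parts[4]
        revision = parts[5] if len(parts) > 5 else None
        return authority, namespace, object_id, revision

    # A cache can forward the record unchanged while the LSID still points
    # back to the original provider:
    record_id = mint_lsid("example.gbif.net", "specimens", 12345)
    print(record_id)            # urn:lsid:example.gbif.net:specimens:12345
    print(parse_lsid(record_id))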

Libraries for accessing data sources

The current data-provider tools are based on PHP and Python. Tools based on other languages could be made available, and the performance could be increased. Toolkits for building portals that make use of the GBIF data providers, their descriptions in UDDI, and the metadata caches could also be written in various languages.
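
As a starting point for such a toolkit, the skeleton below shows a minimal provider service using only the Python standard library. The query handling is a stub; a real provider would translate the incoming request into a database query and return protocol-conformant XML:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ProviderHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            request_xml = self.rfile.read(length)  # the incoming query document
            # Stub response; a real implementation would parse request_xml
            # and run the search against the collection database.
            response = b'<?xml version="1.0"?><response><diagnostics/></response>'
            self.send_response(200)
            self.send_header("Content-Type", "text/xml")
            self.end_headers()
            self.wfile.write(response)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), ProviderHandler).serve_forever()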

Grid

Migration of the web-services architecture of the GBIF network to a Grid-based architecture. This includes comparing the OGSA and Data Grid architectures with the current GBIF solutions. Also relevant are the Semantic Web and the Semantic Grid as approaches to organising biological names and concepts and linking them to primary data (observations and specimens).
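
To make the Semantic Web idea concrete, names, concepts, and primary data can be linked as subject-predicate-object triples. In the sketch below, plain tuples stand in for a real RDF store, and all URIs are invented examples:

    triples = [
        ("http://example.org/name/Puma_concolor", "hasRank", "species"),
        ("http://example.org/specimen/12345", "identifiedAs",
         "http://example.org/name/Puma_concolor"),
        ("http://example.org/specimen/12345", "heldBy",
         "http://example.org/provider/zmuc"),
    ]

    def objects(subject, predicate):
        """All objects matching a (subject, predicate, ?) pattern."""
        return [o for s, p, o in triples if s == subject and p == predicate]

    # Which name was this specimen identified as?
    print(objects("http://example.org/specimen/12345", "identifiedAs"))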

Supervisors

Jyrki Katajainen, DIKU
Hannu Saarenmaa, GBIF
jyrki-projekter@diku.dk
Last modified: 05.12.2003