Project Abstract

The objective of this network is to draw together a wide group of experts throughout Europe who are involved in the use of information technology in the biomolecular sciences. The EMBRACE network will optimise informatics and information exploitation by pure and applied biological scientists in both the academic and commercial sectors.

The network will work to integrate the major databases and software tools in bioinformatics, using existing methods and emerging Grid service technologies. The integration efforts will be driven by an expanding set of test problems representing key issues for bioinformatics service providers and end-user biologists. As a result, groups throughout Europe will be able to use the EMBRACE service interfaces for their own local or proprietary data and tools.

The many publicly available collections of biomolecular information do a reasonable job for a given domain. Software tools to organise and analyse this information are available both in the public domain and commercially. Cross-references in these databases in principle allow inter-database navigation; however, the links are sparse and coarse-grained, and their exploitation requires biological knowledge and expert programming. The effect of this is to burden every serious bioinformatics centre with the task of maintaining local data and software, and supporting users in the nontrivial task of exploring the natural biological connections between data. This requires substantial human effort.

Today's trend towards systems biology demands very much better connections between different domains of knowledge, and the weaknesses in information integration are becoming an intolerable hindrance. This network will address these weaknesses by enabling data providers and tool builders to standardise their data access and software tools using the new grid computing technologies ideally adapted to the task. The use of these standard methods will allow data resources to be, in essence, self-describing, allowing software to work out largely automatically the structure of the data. Aside from facilitating widespread integration of software and data, this will make the interacting systems easy to update, for example to reflect changes to the schemata of the data.
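As an illustration only, the idea of a self-describing data resource can be sketched in a few lines of Python. This is not an EMBRACE interface: the payload layout, the field names, the identifiers and the function `read_self_describing` are all assumptions made up for the sketch. The point it shows is that when the schema travels with the data, a generic client can work out the structure without code written for one particular database.

```python
import json

# Hypothetical payload: the schema is shipped alongside the records.
# Field names and values are invented for illustration.
PAYLOAD = json.dumps({
    "schema": {
        "fields": [
            {"name": "accession", "type": "str"},
            {"name": "length", "type": "int"},
        ]
    },
    "records": [
        ["ACC0001", "430"],
        ["ACC0002", "128"],
    ],
})

# Mapping from declared type names to Python constructors.
CASTS = {"str": str, "int": int, "float": float}


def read_self_describing(payload):
    """Decode records using only the schema embedded in the payload."""
    doc = json.loads(payload)
    fields = doc["schema"]["fields"]
    rows = []
    for raw in doc["records"]:
        # Pair each raw value with its field description and cast it.
        row = {f["name"]: CASTS[f["type"]](value) for f, value in zip(fields, raw)}
        rows.append(row)
    return rows


rows = read_self_describing(PAYLOAD)
```

If the provider later adds a field to the schema, the same client code reads the new records unchanged, which is the kind of easy updating the paragraph above describes.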

Project objectives

The objective of this network is to draw together a wide group of experts throughout Europe who are involved in the use of information technology in the biomolecular sciences. The EMBRACE network will optimise informatics and information exploitation by pure and applied biological scientists in both the academic and commercial sectors. The result will be highly integrated access to a broad range of biomolecular data and software packages.

Groups in the network will be involved in:

  • collection, curation and provision of biomolecular information
  • development of tools and programming interfaces to exploit that information
  • tracking and exploiting advances in information technology with a view to their application in bioinformatics
  • training and outreach to groups which can benefit from the work of the network.

These groups will work together to enable highly functional interactive access to a wide range of biomolecular data (sequence, structure, annotation, etc.) and tools to exploit the data. This will, very naturally, include many core databases and tools available from the EBI; but, crucially, the methods used will support the integration of dispersed, autonomous information. As a result, groups throughout Europe will be expected to integrate their own local or proprietary databases and tools into the collaborative "information space" which constitutes the "EMBRACE grid", a "data grid" allowing integrated exploitation of data, and analogous to a "compute grid", which enables unified exploitation of dispersed computer resources.

The many publicly available collections of biomolecular information already do a reasonable job for a given domain (e.g. EMBLBank, SWISS-PROT). Software tools to organise and analyse this information are available both in the public domain and commercially (e.g. EMBOSS, SRS, GCG). Cross-references in these databases in principle allow inter-database navigation; however, the links are sparse and coarse-grained, and their exploitation requires biological knowledge and expert programming. Historically, attempts to unify this information by drawing it into huge, monolithic databases have failed, due to the complexity of duplicating and maintaining information from a multitude of ever-changing databases in some central resource.

The effect of this is to burden every serious bioinformatics centre with the task of maintaining local data and software, and supporting users in the nontrivial task of exploring the natural biological connections between data. This requires substantial human effort, and the resulting services are far from ideal.

Today's trend towards systems biology demands very much better connections between different domains of knowledge, and the weaknesses in information integration are becoming an intolerable hindrance. This network will address these weaknesses by enabling data providers and tool builders to standardise their data access and integration methods using new computing methodologies ideally adapted to the task:

  • Grid computing, embraced by particle physics to render huge computing problems approachable, promises robust protocols to integrate dispersed computer resources.
  • Web-service methods are beginning to enhance web functionality to enable programmatic interfaces via the web.
  • The Open Grid Services Architecture (OGSA) integrates the rather specific 'distributed-compute' vision of the grid with the web and web-services.

This will remove much of the burden on bioinformatics centres by allowing diverse tools and data (dispersed and local) to interact exactly as produced by their developers.

In broad terms, this will be done by delivering:

  • Standardised application programming interfaces (APIs) to all the core biological databases at the EBI, as well as to a wide range of other information distributed throughout Europe. This will be the subject of a "Content Integration" workpackage.
  • Software tools which exploit the data through the new APIs to give a working environment to access and analyse the data, and also to facilitate the development of further tools in a consistent programming environment. This will be done in a set of "Tool Integration" workpackages.
  • Training and outreach to enable biologists to get the best out of the resulting tools and data, and bioinformaticians to develop ever better tools in the knowledge that they are firmly connected to all the data. This will be done through a set of "Outreach" workpackages.
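To make the notion of a standardised programming interface over dispersed data more concrete, here is a minimal Python sketch. It assumes nothing about the actual EMBRACE APIs: the `DataSource` protocol, the `InMemorySource` class, the `federated_fetch` function and all identifiers are hypothetical names chosen for the example. It shows how a central database and a local or proprietary one could be queried by the same tool, provided both expose the same interface.

```python
from typing import Dict, Iterable, Optional, Protocol


class DataSource(Protocol):
    """Hypothetical common interface every data provider would expose."""

    name: str

    def fetch(self, identifier: str) -> Optional[dict]:
        ...


class InMemorySource:
    """Stand-in for a real database, holding a few invented records."""

    def __init__(self, name: str, records: Dict[str, dict]) -> None:
        self.name = name
        self._records = dict(records)

    def fetch(self, identifier: str) -> Optional[dict]:
        # Return the record, or None when this source does not hold it.
        return self._records.get(identifier)


def federated_fetch(sources: Iterable[DataSource], identifier: str) -> Dict[str, dict]:
    """Ask every source for the identifier and collect the hits by source name."""
    hits = {}
    for source in sources:
        record = source.fetch(identifier)
        if record is not None:
            hits[source.name] = record
    return hits


central = InMemorySource("central", {"ID1": {"kind": "sequence"}})
local = InMemorySource("local_lab", {"ID1": {"kind": "annotation"},
                                     "ID2": {"kind": "structure"}})
result = federated_fetch([central, local], "ID1")
```

Because `federated_fetch` depends only on the shared interface, a group could add its own source without changes to the tools built on top, which is the intent behind the three deliverables listed above.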

Content

Data content under the content integration includes:

  • DNA sequence information
  • Protein sequence information
  • Genome annotation
  • Macromolecular Structure Data
  • Expression information
  • Literature
  • Orthologs
  • Untranslated regions
  • 3D Electron Microscopy data
  • Protein Families
  • Alignments
  • Protein/protein-associations
  • Structural domains
  • Gene3D
  • ORFandDB
  • SNPs in regulatory regions (rSNPs)

Software tools

Software application areas under tool integration include:

  • DNA sequence analysis
  • Protein sequence analysis
  • Pattern matching
  • Genome annotation
  • Expert systems
  • Hidden Markov Models
  • High performance homology searches
  • Phylogenetic analysis
  • Protein structure analysis
  • Protein structure comparison
  • Protein domain mapping
  • Microarrays and gene expression
  • Bioinformatics workflows
  • Integrated bioinformatics tool environments
  • Protein structure prediction
  • Electron Microscopy
  • Electron microscope tomography
  • Systems biology modelling
  • Text mining

Development process

The key feature of the development method chosen is that it will be Test Problem Driven. We will take a 'software process' approach, developing an 'information architecture' and achieving adequate functionality as rapidly as possible. This will be followed by refinement and enhancement. In developing this architecture, we will draw on the expertise of experienced software architects at partner sites, with Peter Rice (EBI) coordinating this effort. We will use industry standard code repositories to ensure integrity, and will draw on the strengths of modern software development methods - particularly the "Unified Process".

We plan the development of services through a set of "Test problems" workpackages, which will use tasks from real biological research, designed to stretch the system in critical ways. We begin by defining a few test problems in this proposal, and will add to and refine the list early in the project. While these may reflect the real research interests of the proposers, from the perspective of the software engineers, they are tools to guide system development. Success of the project will be monitored with respect to real research problems, which will be used to drive the system towards the needs of biologists and bioinformaticians.

We anticipate that the technology required to provide the fully integrated "EMBRACE grid" will become stable only in another one or two years. To ensure we are ready to implement this essential new technology as soon as it is available, we have a set of "Technology" workpackages which will be used to guide the transition from exploration of the content and tool integration needs, firstly to the implementation of fully integrated services, and then to the adoption of these services by the European bioinformatics community.