
Carlos S. Zamudio of Semantic Laboratories

www.semanticlaboratories.com
February 12

Semantic Mashup for Nicotine Dependence Research

Here’s an excellent paper by S Sahoo, O Bodenreider, JL Rutter, KJ Skinner, and AP Sheth titled “An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence”. From the abstract:

“This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.”

The paper describes the methodology they used to create an RDF store for data related to genes and pathways associated with nicotine dependence. I was particularly interested in how they handled data extracted from the NCBI Entrez database. I’ve struggled with this myself in generating RDF data from PubChem. They developed a simple XML->RDF translator and then used OWL to provide the semantic framework for the results.
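To make the XML->RDF step concrete, here is a minimal sketch in Python of what such a translator might look like. It is not the authors' code: the XML layout, the `example.org` namespace and the function names are all assumptions made up for illustration, and the real Entrez schema is far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record, invented for this sketch (not the Entrez format).
SAMPLE_XML = """
<genes>
  <gene id="1137">
    <symbol>CHRNA4</symbol>
    <organism>Homo sapiens</organism>
  </gene>
</genes>
"""

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def xml_to_ntriples(xml_text):
    """Turn each <gene> element into N-Triples lines, one per child element."""
    root = ET.fromstring(xml_text)
    lines = []
    for gene in root.iter("gene"):
        subject = f"<{BASE}gene/{gene.get('id')}>"
        for child in gene:
            predicate = f"<{BASE}vocab#{child.tag}>"
            value = escape_literal((child.text or "").strip())
            lines.append(f'{subject} {predicate} "{value}" .')
    return lines


triples = xml_to_ntriples(SAMPLE_XML)
for line in triples:
    print(line)
```

Each child element becomes one triple whose predicate is derived from the tag name. An OWL ontology would then supply the meaning of those predicates, which is the part a purely syntactic translation like this one leaves out.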

This is a good paper illustrating the advantages of using semantic technologies for data integration.

W3C Final Report on Relational Data Schemas to RDF

The W3C have published the final report from the RDB2RDF incubator group, with their recommendation that the W3C proceed to initiate a formal working group to standardize a language for mapping relational database schemas into RDF and OWL.

“Such a standard will enable the vast amount of data stored in relational databases to be published easily and conveniently on the Web. It will also facilitate integrating data from separate relational databases and adding semantics to relational data.”

They go on to describe a use case for integration of enterprise information systems:

“Efficient information and data exchange between application systems within and across enterprises is of paramount importance in the increasingly networked and IT-dominated business atmosphere. Existing Enterprise Information Systems such as CRM, CMS and ERP systems use Relational database backends for persistence. RDF and Linked Data can provide data exchange and integration interfaces for such application systems, which are easy to implement and use, especially in settings where a loose and flexible coupling of the systems is required.”

It’s easy to see where this is headed.  One scenario, for drug discovery data integration purposes, would facilitate publishing data from relational databases to an RDF store, and the RDBMS schema semantics would be maintained.  Additional data semantics could be integrated within the RDF store.
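As a rough illustration of that scenario, the Python sketch below maps rows of a relational table to triples while keeping the table and column names as the schema-level vocabulary. This is a naive row-to-triples mapping of my own, not the language the incubator group proposes; the table, the `example.org` namespace and the function name are assumptions.

```python
import sqlite3

BASE = "http://example.org/"  # placeholder namespace


def table_to_triples(conn, table, key_column):
    """Map each row to a subject and each non-key column to a predicate."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        # The table name becomes the class, so the schema semantics carry over.
        triples.append((subject, "rdf:type", f"{BASE}schema#{table}"))
        for column, value in record.items():
            if column != key_column and value is not None:
                triples.append((subject, f"{BASE}schema#{column}", value))
    return triples


# Toy in-memory database standing in for a discovery data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, mol_weight REAL)")
conn.execute("INSERT INTO compound VALUES (1, 'nicotine', 162.23)")
compound_triples = table_to_triples(conn, "compound", "id")
for triple in compound_triples:
    print(triple)
```

Once the rows are triples, additional semantics (for example, linking the compound to an assay ontology) can be added in the RDF store without touching the source schema.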

February 10

Interview with Martin Leach of Merck Research Labs

Here’s an interview from BioInform with Martin Leach, Executive Director of Information Technology at Merck Research Labs. He discusses the issues and challenges of supporting Merck’s research data output. Even though I’ve visited almost all of the large pharmaceutical companies over the years, it’s still hard to imagine the complexity of trying to manage the knowledge output across research. From my experience, Merck seems to do as good a job at this as any other organization I’ve been exposed to.

The practical nuts-and-bolts issues of how to deal with petabytes of data, especially data coming from high-resolution instruments such as Illumina’s, are a bit sobering. Even though Martin Leach doesn’t seem impressed with the contributions the Semantic Web can make, I think that if he thought of RDF, RDFS and OWL as the platform for data integration across instruments, he might see an opportunity that goes beyond exporting XML files and then importing them into Oracle. Just using RDF to capture instrument metadata could be a significant step in integrating experiments across laboratories.

Anyway, my hat’s off to these guys. I’m sure there are many at Merck who don’t really have a good appreciation of their efforts.

February 03

Java Content Repositories

I’ve been exploring ideas (using Apache’s Jackrabbit) for a specialized Content Repository as the basis for a collaboration tool to be used by researchers involved in drug discovery research.  Most research project teams are organized as a matrix of specialized laboratory and computational skill sets that combine to collaborate on data acquisition, analysis, integration and publication.  Much of the knowledge produced is stored in a variety of structured, semi-structured and unstructured formats. Capturing the knowledge generated during the research workflow and supporting the variety of  data formats is challenging.  However, I’m starting to see the value of applying a Content Repository data model for capturing research workflow data as an alternative to a traditional relational database.  Here’s an essay by Bertil Chapuis comparing the rationale for choosing content repositories versus relational databases.   This is an excellent introduction to the design of the Java Content Repository specification.

The Java community have worked to define a Java API specification called JSR 170: Content Repository for Java Technology API. The Apache community have released a reference implementation of JSR 170 called Jackrabbit.
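The content repository data model is easier to picture with a small example. The Python sketch below is only loosely inspired by the tree of nodes and properties that a content repository exposes. It is not the JSR 170 API or Jackrabbit code, and the class, method and path names are hypothetical.

```python
class Node:
    """A repository node: a name, free-form properties, and child nodes."""

    def __init__(self, name):
        self.name = name
        self.properties = {}
        self.children = {}

    def add_node(self, name):
        """Create a child node and return it so calls can be chained."""
        child = Node(name)
        self.children[name] = child
        return child

    def get_node(self, path):
        """Resolve a slash-separated path such as 'a/b/c' below this node."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node


# Build a small hierarchy for one (made-up) research project.
root = Node("")
project = root.add_node("projects").add_node("kinase-screen")
run = project.add_node("run-001")
run.properties["instrument"] = "plate reader"
run.properties["status"] = "complete"
notes = run.add_node("notes")
notes.properties["text"] = "Repeat wells A1-A3."

found = root.get_node("/projects/kinase-screen/run-001")
print(found.properties)
```

Unlike a relational table, each node can carry whatever properties its content needs, which is what makes the model attractive for the mix of structured, semi-structured and unstructured research output described above.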

February 02

LabAutomation 2009 Presentations

I recently participated in a couple of sessions at the LabAutomation 2009 conference in Palm Springs, CA.  LabAutomation is an annual meeting of the laboratory automation industry focused on the Life Sciences.

I participated in two sessions: “Data Management, Mining & Visualization” chaired by Petar Stojadinovic and “Ontologies and Semantic Technologies in Drug Discovery” chaired by Dr. Reinhold Shafer.

You can find my presentations at semanticlaboratories.com/labautomation2009. This was the first LabAutomation meeting where semantic technologies have been discussed as a potential technology for Life Sciences data integration. So, the main goal at this meeting was to introduce the audience to the concepts of the technology.

During the first session there was an excellent talk by Randall Julian, President of Indigo Biosystems, who described some of his work using semantic technologies for data integration in the Life Sciences. I’m hoping that he will share his presentation at some point in the near future.

During the second session there was an excellent talk by Alan Ruttenberg of the Science Commons. Science Commons is a project within the Creative Commons framework.  Check out their excellent work at the Neurocommons, where they are conceiving a platform for knowledge management for biological research. 

October 21

D2RQ Platform

I ran across another reference to the D2RQ platform recently and decided to explore this RDF mapping tool.  D2RQ describes itself as a platform "for accessing non-RDF, relational databases as virtual, read-only RDF graphs. D2RQ offers a variety of different RDF-based access mechanisms to the content of huge, non-RDF databases without having to replicate the database into RDF.  Using D2RQ you can:

    • query a non-RDF database using SPARQL or find(spo) queries,
    • access information in a non-RDF database using the Jena API or the Sesame API,
    • access the content of the database as Linked Data over the Web,
    • ask SPARQL queries over the SPARQL Protocol against the database."

I tried it on some simple data sets I downloaded from PubChem Bioassay using the Jena API. Originally I planned to build an RDF store from the bioassay data as a way to reason across data sets, but now I'm thinking I can accomplish some of my goals with D2RQ. Descriptions of other similar tools can be found on the ESW Wiki. D2RQ strikes me as an interesting platform for performing data integration across the various chemical screening data repositories generated in support of a drug discovery program.

October 10

OWL 2 Web Ontology Language Drafts

The W3C has just released draft specifications for the OWL 2 Web Ontology Language. Interestingly, the sub-languages of OWL 2 are being called "profiles".  As described in the Profiles document, there are 3 OWL 2 profiles currently specified:

    • "OWL 2 EL is particularly useful in applications employing ontologies that contain very large numbers of properties and/or classes: it captures the expressive power used by many such ontologies and is a subset of OWL 2 for which the basic reasoning problems can be performed in time that is polynomial with respect to the size of the ontology [EL++]. Dedicated reasoning algorithms for this profile are available and have been demonstrated to be implementable in a highly scalable way.
    • OWL 2 QL is aimed at applications that use very large volumes of instance data, and where query answering is the most important reasoning task. In OWL 2 QL, conjunctive query answering can be implemented using conventional relational database systems, and can directly access data stored in such systems. Using this technique, sound and complete query answering can be performed in LOGSPACE with respect to the size of the data (assertions). As in OWL 2 EL, there are polynomial time algorithms for consistency, subsumption, and classification reasoning. The expressive power of the profile is necessarily quite limited, although it does include most of the main features of conceptual models such as UML class diagrams and ER diagrams.
    • OWL 2 RL is aimed at applications that require scalable reasoning without sacrificing too much expressive power. It is designed to accommodate both OWL 2 applications that can trade the full expressivity of the language for efficiency, and RDF(S) applications that need some added expressivity. OWL 2 RL reasoning systems can be implemented using rule-based reasoning engines. Such rule-based approaches can be used to perform consistency, satisfiability, subsumption, classification, instance checking, and conjunctive query answering in time that is polynomial with respect to the size of the ontology."
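
To give a feel for what rule-based reasoning means in the RL description, here is a toy forward-chaining sketch in Python. It applies just two subclass rules to a handful of made-up triples until nothing new can be derived. The class names and the plain-string predicates are assumptions for illustration, and a real OWL 2 RL engine implements a much larger rule set.

```python
def forward_chain(triples):
    """Apply two subclass rules until no new triples appear (a fixpoint)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            if p != "subClassOf":
                continue
            for (s2, p2, o2) in inferred:
                # Rule 1: subClassOf is transitive.
                if p2 == "subClassOf" and s2 == o:
                    new.add((s, "subClassOf", o2))
                # Rule 2: an instance of a subclass is an instance of the superclass.
                if p2 == "type" and o2 == s:
                    new.add((s2, "type", o))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Made-up facts for illustration only.
facts = {
    ("NicotinicReceptor", "subClassOf", "IonChannel"),
    ("IonChannel", "subClassOf", "Protein"),
    ("alpha4", "type", "NicotinicReceptor"),
}
closure = forward_chain(facts)
for triple in sorted(closure):
    print(triple)
```

Because each pass only adds triples built from terms already present, the loop is guaranteed to stop, which is the intuition behind the polynomial-time claim in the profile description.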

October 09

Semantic Web Industry Review

David Provost has published an industry survey titled "On The Cusp: A Global Review of the Semantic Web Industry" where he reviews the current international industry players in the Semantic Web arena.  From his conclusions:

"The Semantic Web industry is alive, well, and it’s increasingly competitive as a commercial technology. At this point, there are too many success stories and too much money being invested to dismiss the technology as non-viable. The Semantic Web is presently building a track record, which means the big wins and unanticipated uses are yet to come. In the meantime, adoption is occurring, and the early news is very good indeed."

Although I'm focused on open source technologies for my Semantic Web development, companies like TopQuadrant are definitely on my radar for future integration.

October 08

Corporate (and Research) Intranet Wikis

Here's a recent article in CIO by C.G. Lynch making the argument for integrating wikis into a corporate intranet. For those who are using or experimenting with wikis, the benefits are a no-brainer. The ease with which wikis can be used to support collaborative sharing of information is well documented. I probably use dozens of public wikis on a regular basis for collecting information from the research projects I follow. It truly is a "bottom up" technology as the CIO article suggests, where the wiki authors are in charge of the platform. Many public wikis allow any user to add content to the wiki. The research organization, however, can present some additional challenges, such as restricted access and data security. In these cases, user authentication should be required for authoring and accessing a wiki. Setting up single sign-on for intranet web applications is still a challenge, at least in my hands, and for this, routine monitoring and configuration by the IT team is necessary.

There are many excellent open-source wikis that have been developed. Probably the best known is MediaWiki, the software used by Wikipedia. I was able to install and configure MediaWiki myself, with no problems. Here is a comparison of wiki software provided by Wikipedia. The CIO article describes Socialtext, a commercial product targeted at the enterprise.

Much like content management systems,  discussion forums and blogs, I see wikis as an excellent research IT component to support the "documentation" of projects and processes of a typical research organization. By reducing the barriers to timely publication of project-related information, wikis offer a robust platform for capturing research knowledge that can be further mined using intranet search engines. 

September 30

IBM Lotus Symphony

Slowly we have been migrating from Microsoft Office to the open source office suite OpenOffice. I've yet to find a reason why an entire organization could not productively drop Microsoft Office and switch to OpenOffice for their document/presentation/spreadsheet/desktop database requirements. A nice integrated feature is the ability to generate a PDF document for distribution. There's even a useful Draw program that is missing from the Microsoft suite. And, of course, OpenOffice runs on all the major operating systems, including Windows, Mac OS X and Linux.

Recently, I upgraded to the most recent IBM Lotus Symphony, which is based on OpenOffice.  I think they've done a good job of integrating the various tools into a single user experience, which might be important for some organizations.

For developers, IBM built Lotus Symphony using the Eclipse IDE, as an Eclipse RCP application, and it is therefore extendable via Eclipse plugins. This means that an organization can safely customize Lotus Symphony using standard Eclipse and Java technologies.

With IT budgets being squeezed, migrating to OpenOffice strikes me as a no-brainer.

September 29

OpenWetWare

In the spirit of open-source science, here's an effort called OpenWetWare that promotes sharing of protocol knowledge and uses wikis as a form of publication and communication - think of it as Wikipedia for protocols. Since a lot of these details don't make it into publications, I can see this type of effort really establishing an audience. I would like to see better use of images and perhaps video, which would really distinguish this from static publications.

BioHackers

This article titled "As Synthetic Biology Becomes Affordable, Amateur Labs Thrive", in MIT's The Tech, describes the new wave of amateur scientists doing synthetic biology experiments as a hobby. The author uses the term biohacker, which I believe is intended to be positive. I don't know if I have enough power outlets in my garage, but I'm inspired to try something. I'm not sure about waste disposal, however...

July 20

PUG

The NCBI has provided a new way of accessing PubChem services. It's called PUG, for the PubChem Power User Gateway. PUG provides a programmatic interface to PubChem services. In addition to the basic download of structures in PubChem, analytical services such as search by molecular formula, 2D similarity search, and substructure search are provided. PubChem entries are already annotated for physical-chemical properties.

The original PUG application interface required constructing XML-formatted requests and parsing XML-formatted responses. Although straightforward, I found the code necessary to encapsulate these capabilities to be error prone and complicated to support. Luckily, the NCBI team recently added a SOAP web services interface to PUG. I have been integrating these services in my applications, and I'm finding these tools to be very useful for basic cheminformatics analyses on compounds and substances in PubChem. PubChem is fast becoming the premier provider of compound information with an ever expanding list of integrated chemical libraries.

July 16

Uncertainty Conference

I came across this posting for a workshop as part of the larger upcoming Semantic Web conference in Karlsruhe, Germany. The workshop is the 4th Workshop on Uncertainty Reasoning for the Semantic Web, focused on expressing uncertainty in reasoning and information. There is a W3C incubator for this area of research, Uncertainty Reasoning for the World Wide Web. I'm interested in seeing what comes out of the W3C incubator, since expressing uncertainty in life sciences ontologies could go a long way to improve reasoning results for data integration.

The OWLED 2008 conference is also co-located with this conference.

The OWL: Experiences and Directions (OWLED) workshop series is a forum for practitioners in industry and academia, tool developers, and others interested in OWL to describe real and potential applications, to share experience, and to discuss requirements for language extensions and/or modifications.

There are other interesting workshops and tutorials as part of the conference.  It might make a worthwhile trip to southern Germany.

July 07

Textpresso for text-mining

At a meeting recently, I was introduced to the Textpresso platform (developed at Caltech) for text mining of scientific literature. The system integrates ontologies to classify documents. It has been developed as part of WormBase for curation of the C. elegans literature. This is exactly what I've been looking for to help classify documents in my content management system and provide a way of identifying documents related to search results. The challenge will probably be in integrating a variety of ontologies. I'll post my comments on the integration.

The publication for Textpresso is:

Textpresso: an ontology-based information retrieval and extraction system for biological literature
Muller HM, Kenny EE, Sternberg PW, PLoS Biol. 2004 Nov;2(11):e309. Epub 2004 Sep 21.

June 30

Biology Communication Workshop

Last week I attended a workshop at UCSD titled "New Communication Channels for Biology Workshop" sponsored by the calit2 group (California Institute for Telecommunications and Information Technology).

First, calit2 is an interesting organization created to fuel innovation in California leveraging the University of California research environment.  Here's a link to their brochure. They have an impressive facility at UCSD, where researchers can experiment with digital communication technologies to facilitate collaboration.

The workshop focused on exploring innovative technologies for collaboration within the biology community.  The agenda outlines the topics presented and discussed.

A few of the presentations highlighted the use of wikis to establish communities of researchers around fields of collaboration. This definitely renewed my interest in exploring how to use wikis in the biotechnology research environment. Luckily, there are open source initiatives to deal with the security and authentication systems required for an intranet.

A few of the presentations focused on integrating the publication and sharing of video as a collaborative tool.  One group in particular, SciVee, looks promising. The group have developed a mechanism for authors to provide video commentary on their peer-reviewed publications. The education potential looks promising, especially when the video is tied to an on-line (and open) version of the publication.

Probably the most profound implication of a lot of this is that there is a strong push for "open" science, where limits on access to publications are removed. Even the idea of peer review is being re-evaluated in light of the existing communication technologies and the ability to publish instantaneously via the Web.

April 14

OWL 2

Looks like OWL version 2 (previously called 1.1) is on its way to becoming the next iteration of the W3C OWL standard. The latest working draft of the specification has recently been released. Here's a link to the primer: OWL 2 Web Ontology Language: Primer. A brief description of the new version:

"OWL 2 is a backwards compatible revision to the Web Ontology Language (OWL). OWL 2 adds several new constructs to extend the expressivity of OWL including those for qualified cardinality restrictions, role chains, and expressive data predicates. OWL 2 also includes a new XML Serialization (targeted to the XML tool chain, i.e., XSLT, schema languages, etc.) and a set of subsetting profiles with various desirable application and computational properties."

Practically, the new OWL syntax provides the potential for an improvement to the bridge between OBO and OWL. I'm waiting for the Jena folks to formally state their intention to support the new standard. In the meantime, we can use Protege 4 as a design tool for implementing ontologies with these new language features.

March 17

Yahoo and the Semantic Web

Yahoo has recently started to announce their efforts to integrate semantic web-based metadata into their search engine. I think this has the potential to differentiate Yahoo's search platform from competing search engines. However, given recent news reports of a Microsoft/Yahoo merger, perhaps its future is clouded.

Nonetheless, at least the Yahoo engineers have convinced themselves that semantic annotation of web content is scalable (using RDFa). In the end, this approach has the potential of overcoming some of the limitations in determining search relevance based on term frequencies.

March 15

Semantic Web Tutorial

Here's a link to a Semantic Web tutorial by Lee Feigenbaum and Eric Prud'hommeaux, which was presented on March 5, 2008 at the Conference on Semantics in Healthcare & Life Sciences. This is a really nice introduction to the application of semantic web technologies in the life sciences. It would be hard to imagine doing the types of scientific knowledge integration they illustrate without RDF (and OWL).

March 06

Mapping Relational Data and RDF

A new incubator group, the RDB2RDF Group, has been formed as part of W3C efforts. Sponsored by Oracle and HP among others, the charter for the group is to:

  • examine and classify existing approaches to mapping relational data into RDF and decide whether standardization is possible and/or necessary
  • examine and classify existing approaches to mapping OWL classes to Relational data, or, more accurately, SQL queries

The relationship between RDF/OWL and RDBMS is an important one, as most experimental data is managed in an RDBMS, but benefits from a transformation to RDF/OWL for integration and reasoning. It is also useful to persist RDF/OWL in an RDBMS while processing it through an API such as Jena. The Jena team have implemented a couple of approaches to this, with emphasis on support for the SPARQL query language.

It's not hard to see that efforts like these mirror the object-relational mapping work that has resulted in systems like Hibernate, which seamlessly map Java objects to relational schemas.

February 08

Text Mining

Here's a recent publication in PLoS Computational Biology by K. Bretonnel Cohen and Lawrence Hunter titled "Getting Started In Text Mining".  The authors discuss the strategies for developing text mining applications in the biomedical field and why they believe bioscientists appear to be better at developing (useful) systems than text mining specialists.  Text mining in a specialized domain will always require domain experts to help evaluate algorithmic approaches, so it's no surprise that bioscientists can develop useful tools for the general scientist (that old precision/recall evaluation).
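For readers who haven't run into the precision/recall evaluation mentioned above, here is a minimal Python sketch. The gene and term lists are made up for illustration and the function name is my own; it is not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical output of a text mining system versus a curated answer key.
system_output = ["CHRNA4", "CHRNB2", "nicotine", "smoking"]
gold_standard = ["CHRNA4", "CHRNB2", "CHRNA7"]

precision, recall = precision_recall(system_output, gold_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The answer key is where the domain expert comes in: someone has to decide which extracted items actually count as correct.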

It would have been nice for the authors to have explored the role of scientific ontologies in emerging text mining algorithms, as I believe this is the next wave of development.  The authors include a nice list of references at the end.

February 05

Comparison of Programming Languages

A recent paper in BMC Bioinformatics, "A Comparison of Common Programming Languages used in Bioinformatics" by Mathieu Fourment and Michael R. Gillings compares the performance of C, C++, C#, Java, Perl and Python on a small number of common bioinformatics algorithms. Not surprisingly,  Perl and Python were significantly slower than the other languages.  Java compared well with the C programming language family, and with a little optimization (e.g., compiling) I am sure it would perform even better in their tests.

It would be fun to see how an object-oriented scripting language like Groovy would perform.  Since Groovy sits on top of Java, perhaps this  might be the best of both worlds, with the ease of programming of a scripting language and the performance of a compiled language.  I've been using Groovy on a couple of small projects and it's a compelling choice if you have a significant investment in Java components.

January 28

SPARQL elevated to W3C Recommendation

The SPARQL query language has recently been elevated to the status of a formal recommendation by the World Wide Web Consortium (W3C). SPARQL is the query language for the RDF data format, which is one of the foundations for the Semantic Web. The simplest analogy is to relate SPARQL to SQL, since the query syntaxes look similar. However, SPARQL is a profound innovation in allowing for queries across distributed data independent of format.

I am envisioning a day soon when research data managed within a drug discovery organization will be discovered, integrated and synthesized using SPARQL as the primary data access method to data represented in RDF. The source data will continue to be acquired and managed in traditional RDBMSs like Oracle, but the power of SPARQL will come in the integration of data which is distributed throughout the organization and is structured, semi-structured and unstructured. SPARQL will be a key component in realizing re-usable drug discovery knowledge.
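To show why querying triples differs from querying tables, here is a toy Python sketch of the core idea behind a SPARQL basic graph pattern: match patterns with variables against triples and join on shared variables. This is not a SPARQL engine, and the two "sources", the identifiers and the function names are all made up for illustration.

```python
def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, graph):
    """Return every variable binding that satisfies all patterns (a join)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Two hypothetical sources, merged simply by concatenating their triples.
assay_db = [("cmpd:1", "inhibits", "gene:CHRNA4")]
pathway_db = [("gene:CHRNA4", "memberOf", "pathway:example")]
graph = assay_db + pathway_db

# Roughly: SELECT ?c ?p WHERE { ?c :inhibits ?g . ?g :memberOf ?p }
results = query([("?c", "inhibits", "?g"), ("?g", "memberOf", "?p")], graph)
print(results)
```

The point of the sketch is the merge step: because both sources share the triple shape and the identifier for the gene, combining them needs no schema alignment before the join.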

January 07

WordNet in OWL

I've been integrating the RDF/OWL representation of WordNet in my search application and I'm quite appreciative of the effort that the W3C task group made to develop the translation of Princeton University's WordNet lexical resource. The defined synonym sets provide a valuable lexical source for improving search queries. I tried in the past to come up to speed on the Prolog-based system and floundered, but now that the data is represented in the OWL language I can use tools such as Jena and SPARQL to develop applications for information retrieval.

Hopefully the group will find the resources to keep the RDF/OWL representation in sync with the Princeton release.  WordNet is now at version 3.0, while the RDF/OWL represents the 2.0 version.
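
The way synonym sets can improve a search query is easy to sketch. The Python example below expands each query term with the other members of any synonym set containing it. The synonym sets here are hand-written toy data, not taken from WordNet, and the function name is my own.

```python
# Toy synonym sets standing in for WordNet synsets (not real WordNet data).
SYNSETS = [
    {"drug", "medication", "pharmaceutical"},
    {"tumor", "tumour", "neoplasm"},
]


def expand_query(terms, synsets):
    """Replace each term with the sorted list of its synonyms (itself included)."""
    expanded = []
    for term in terms:
        group = {term}
        for synset in synsets:
            if term in synset:
                group |= synset
        expanded.append(sorted(group))
    return expanded


expanded = expand_query(["tumor", "screening"], SYNSETS)
# Each inner list would be OR-ed together, and the lists AND-ed, in a search engine.
print(expanded)
```

A term with no synonym set, like "screening" here, simply passes through unchanged.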

Science Commons and Knowledge Re-use

In a post on the Science Commons Weblog, titled "Why we need to figure out what we already know" by Donna Wentworth of the Science Commons group, she reiterates the goal of the Science Commons to "enable scientists to use the Web to get precise answers to complex research questions — not 388,000+ search results that contain the word pyridinium."  I believe John Wilbanks, who leads the group, has the right vision for what the community needs to support a culture of knowledge re-use.

Although I am sure there are good reasons to be skeptical of exactly how the Semantic Web can support these kinds of efforts, in my own work I am convinced that the application of the Semantic Web technologies goes a long way to formalize the annotation of life sciences data with the appropriate metadata and enable improved search, data integration and knowledge synthesis.  The tools are robust, the data languages such as RDF and OWL are stable, and the work to develop scalable RDF stores is showing good progress.

 
