Carlos S. Zamudio of Semantic Laboratories
www.semanticlaboratories.com

February 12

Semantic Mashup for Nicotine Dependence Research

Here’s an excellent paper by S Sahoo, O Bodenreider, JL Rutter, KJ Skinner, and AP Sheth titled “An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence”. From the abstract:
The paper describes the methodology they used to create an RDF store for data related to genes and pathways associated with nicotine dependence. I was particularly interested in how they handled data extracted from the NCBI Entrez database. I’ve struggled with this myself in generating RDF data from PubChem. They developed a simple XML->RDF translator and then used OWL to provide the semantic framework for the results. This is a good paper illustrating the advantages of using semantic technologies for data integration.

W3C Final Report on Relational Data Schemas to RDF

The W3C have published the final report from the RDB2RDF incubator group, with their recommendation that the W3C proceed to initiate a formal working group to standardize a language for mapping relational database schemas into RDF and OWL.
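To make the idea of mapping relational data into RDF concrete, here is a minimal sketch in Python. It is not the language the incubator group proposes, only a naive convention I am assuming for illustration: each row becomes a subject URI and each column becomes a predicate. The `http://example.org/` namespace, the `compound` table and the function names are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the W3C report


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    text = str(value)
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rows_to_ntriples(conn, table, key):
    """Yield one N-Triples line per non-key, non-NULL column of each row.

    Assumed convention: subject = BASE/table/key-value,
    predicate = BASE/table#column, object = plain string literal.
    The table name is trusted here; do not pass user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue
            predicate = f"<{BASE}{table}#{column}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


def demo():
    """Build a tiny in-memory table and return its triples as a list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, formula TEXT)"
    )
    conn.execute("INSERT INTO compound VALUES (1, 'nicotine', 'C10H14N2')")
    return list(rows_to_ntriples(conn, "compound", "id"))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

A real mapping language would also need to handle foreign keys, datatypes and vocabulary reuse, which this sketch deliberately ignores.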
They go on to describe a use case for integration of enterprise information systems:
It’s easy to see where this is headed. One scenario, for drug discovery data integration purposes, would be to publish data from relational databases to an RDF store while maintaining the RDBMS schema semantics. Additional data semantics could then be integrated within the RDF store.

February 10

Interview with Martin Leach of Merck Research Labs

Here’s an interview from BioInform with Martin Leach, Executive Director of Information Technology at Merck Research Labs. He discusses the issues and challenges of supporting Merck’s research data output. Even though I’ve visited almost all of the large pharmaceutical companies over the years, it’s still hard to imagine the complexity of trying to manage the knowledge output across research. From my experience, Merck seems to do as good a job at this as any other organization I’ve been exposed to. The practical nuts-and-bolts issues of how to deal with petabytes of data, especially data coming from high-resolution instruments such as Illumina’s, are a bit sobering. Even though Martin Leach doesn’t seem impressed with the contributions the Semantic Web can make, I think that if he thought of RDF, RDFS and OWL as the platform for data integration across instruments, he might see an opportunity that goes beyond exporting XML files and then importing them into ORACLE. Just using RDF to capture instrument metadata could be a significant step in integrating experiments across laboratories. Anyway, my hat’s off to these guys. I’m sure there are many at Merck who don’t really have a good appreciation of their efforts.

February 03

Java Content Repositories

I’ve been exploring ideas (using Apache’s Jackrabbit) for a specialized Content Repository as the basis for a collaboration tool to be used by researchers involved in drug discovery research.
Most research project teams are organized as a matrix of specialized laboratory and computational skill sets that combine to collaborate on data acquisition, analysis, integration and publication. Much of the knowledge produced is stored in a variety of structured, semi-structured and unstructured formats. Capturing the knowledge generated during the research workflow and supporting the variety of data formats is challenging. However, I’m starting to see the value of applying a Content Repository data model for capturing research workflow data as an alternative to a traditional relational database. Here’s an essay by Bertil Chapuis comparing the rationale for choosing content repositories versus relational databases. This is an excellent introduction to the design of the Java Content Repository specification. The Java community have worked to define a Java API specification called the JSR 170: Content Repository for Java technology API. The Apache community have released a reference implementation of JSR 170 called Jackrabbit.

February 02

LabAutomation 2009 Presentations

I recently participated in a couple of sessions at the LabAutomation 2009 conference in Palm Springs, CA. LabAutomation is an annual meeting of the laboratory automation industry focused on the Life Sciences. The two sessions were “Data Management, Mining & Visualization” chaired by Petar Stojadinovic and “Ontologies and Semantic Technologies in Drug Discovery” chaired by Dr. Reinhold Shafer. You can find my presentations at semanticlaboratories.com/labautomation2009. This was the first LabAutomation meeting where semantic technologies were discussed as a potential technology for Life Sciences data integration. So, the main goal at this meeting was to introduce the audience to the concepts of the technology.
During the first session there was an excellent talk by Randall Julian, President of Indigo Biosystems, who described some of his work using semantic technologies for data integration in the Life Sciences. I’m hoping that he will share his presentation at some point in the near future. During the second session there was an excellent talk by Alan Ruttenberg of the Science Commons. Science Commons is a project within the Creative Commons framework. Check out their excellent work at the Neurocommons, where they are conceiving a platform for knowledge management for biological research.

October 21

D2RQ Platform

I ran across another reference to the D2RQ platform recently and decided to explore this RDF mapping tool. D2RQ describes itself as a platform "for accessing non-RDF, relational databases as virtual, read-only RDF graphs. D2RQ offers a variety of different RDF-based access mechanisms to the content of huge, non-RDF databases without having to replicate the database into RDF. Using D2RQ you can:
I tried it on some simple data sets I downloaded from PubChem Bioassay using the Jena API. Originally I planned to build an RDF store from the bioassay data as a way to reason across data sets, but now I'm thinking I can accomplish some of my goals with D2RQ. Descriptions of other similar tools can be found on the ESW Wiki. D2RQ strikes me as an interesting platform for performing data integration across the various chemical screening data repositories generated in support of a drug discovery program.

October 10

OWL 2 Web Ontology Language Drafts

The W3C has just released draft specifications for the OWL 2 Web Ontology Language. Interestingly, the sub-languages of OWL 2 are being called "profiles". As described in the Profiles document, there are 3 OWL 2 profiles currently specified:

October 09

Semantic Web Industry Review

David Provost has published an industry survey titled "On The Cusp: A Global Review of the Semantic Web Industry" where he reviews the current international industry players in the Semantic Web arena. From his conclusions:
Although I'm focused on open source technologies for my Semantic Web development, companies like TopQuadrant are definitely on my radar for future integration.

October 08

Corporate (and Research) Intranet Wikis

Here's a recent article in CIO by C.G. Lynch making the argument for integrating wikis into a corporate intranet. For those who are using or experimenting with wikis, the benefits are a no-brainer. The ease with which wikis can be used to support sharing of collaborative information is well documented. I probably use dozens of public wikis on a regular basis for collecting information from the research projects I follow. It truly is a "bottom up" technology as the CIO article suggests, where the wiki authors are in charge of the platform. Many public wikis allow any user to add content to the wiki. The research organization, however, can present some additional challenges, such as restricted access and data security. In these cases, user authentication should be required for authoring and accessing a wiki. Setting up single sign-on for intranet web applications is still a challenge, at least in my hands, and it requires routine monitoring and configuration by the IT team. There are many excellent open-source wikis that have been developed. Probably the best known is MediaWiki, the software used by Wikipedia. I was able to install and configure MediaWiki myself, with no problems. Here is a comparison of wiki software provided by Wikipedia. The CIO article describes Socialtext, a commercial product targeted at the enterprise. Much like content management systems, discussion forums and blogs, I see wikis as an excellent research IT component to support the "documentation" of projects and processes of a typical research organization.
By reducing the barriers to timely publication of project-related information, wikis offer a robust platform for capturing research knowledge that can be further mined using intranet search engines.

September 30

IBM Lotus Symphony

Slowly we have been migrating from Microsoft Office to the open source office suite OpenOffice. I've yet to find a reason why an entire organization could not productively drop Microsoft Office and switch to OpenOffice for their document/presentation/spreadsheet/desktop database requirements. A nice integrated feature is the ability to generate a PDF document for distribution. There's even a useful Draw program that is missing from the Microsoft suite. And, of course, OpenOffice runs on all the major operating systems, including Windows, Mac OS X and Linux. Recently, I upgraded to the most recent IBM Lotus Symphony, which is based on OpenOffice. I think they've done a good job of integrating the various tools into a single user experience, which might be important for some organizations. For developers, IBM built Lotus Symphony as an Eclipse RCP application, so it is extensible via Eclipse plugins. This means that organizations can safely customize Lotus Symphony using standard Eclipse and Java technologies. With IT budgets being squeezed, migrating to OpenOffice strikes me as a no-brainer.

September 29

OpenWetWare

In the spirit of open-source science, here's an effort called OpenWetWare that promotes sharing of protocol knowledge and uses wikis as a form of publication and communication. Think of it as Wikipedia for protocols. Since a lot of these details don't make it into publications, I can see this type of effort really establishing an audience. I would like to see better use of images and perhaps video, which would really distinguish this from static publications.

BioHackers

This article titled "As Synthetic Biology Becomes Affordable, Amateur Labs Thrive", in MIT's The Tech, describes the new wave of amateur scientists doing synthetic biology experiments as a hobby. The author uses the term biohacker, which I believe is intended to be positive. I don't know if I have enough power outlets in my garage, but I'm inspired to try something. Though I'm not sure about waste disposal...

July 20

PUG

The NCBI has provided a new way of accessing PubChem services. It's called PUG, for the PubChem Power User Gateway. PUG provides a programmatic interface to PubChem services. In addition to the basic download of structures in PubChem, analytical services such as search by molecular formula, 2D similarity search, and substructure search are provided. PubChem entries are already annotated for physical-chemical properties. The original PUG application interface required constructing XML-formatted requests and parsing XML-formatted responses. Although straightforward, I found the code necessary to encapsulate these capabilities to be error prone and complicated to support. Luckily, the NCBI team recently added a SOAP web services interface to PUG. I have been integrating these services in my applications, and I'm finding these tools to be very useful for basic cheminformatics analyses on compounds and substances in PubChem. PubChem is fast becoming the premier provider of compound information with an ever expanding list of integrated chemical libraries.

July 16

Uncertainty Conference

I came across this posting for a workshop as part of the larger upcoming Semantic Web conference in Karlsruhe, Germany. The workshop is the 4th Workshop on Uncertainty Reasoning for the Semantic Web, focused on expressing uncertainty in reasoning and information. There is a W3C incubator for this area of research, Uncertainty Reasoning for the World Wide Web.
I'm interested in seeing what comes out of the W3C incubator, since expressing uncertainty in life sciences ontologies could go a long way to improve reasoning results for data integration. The OWLED 2008 conference is also co-located with this conference.
There are other interesting workshops and tutorials as part of the conference. It might make a worthwhile trip to southern Germany.

July 07

Textpresso for text-mining

At a meeting recently, I was introduced to the Textpresso platform (developed at Caltech) for text mining of scientific literature. The system integrates ontologies to classify documents. It has been developed as part of WormBase for curation of the C. elegans literature. This is exactly what I've been looking for to help classify documents in my content management system and provide a way of identifying documents related to search results. The challenge will probably be in integrating a variety of ontologies. I'll post my comments on the integration. The publication for Textpresso is: Textpresso: an ontology-based information retrieval and extraction system for biological literature

June 30

Biology Communication Workshop

Last week I attended a workshop at UCSD titled "New Communication Channels for Biology Workshop" sponsored by the calit2 group (California Institute for Telecommunications and Information Technology). First, calit2 is an interesting organization created to fuel innovation in California leveraging the University of California research environment. Here's a link to their brochure. They have an impressive facility at UCSD, where researchers can experiment with digital communication technologies to facilitate collaboration. The workshop focused on exploring innovative technologies for collaboration within the biology community. The agenda outlines the topics presented and discussed. A few of the presentations highlighted the use of wikis to establish communities of researchers around a field of collaboration. This definitely renewed my interest in exploring how to use wikis in the biotechnology research environment. Luckily, there are open source initiatives to deal with the security and authentication systems required for an intranet.
A few of the presentations focused on integrating the publication and sharing of video as a collaborative tool. One group in particular, SciVee, looks promising. The group have developed a mechanism for authors to provide video commentary on their peer-reviewed publications. The education potential looks promising, especially when the video is tied to an on-line (and open) version of the publication. Probably the most profound implication of a lot of this is that there is a strong push for "open" science, where limits on the access to publications are removed. Even the idea of peer-review is being re-evaluated in light of the existing communication technologies and the ability to publish instantaneously via the Web.

April 14

OWL 2

Looks like OWL version 2 (previously called 1.1) is on its way to becoming the next iteration of the W3C OWL standard. The latest working draft of the specification has recently been released. Here's a link to the primer: OWL 2 Web Ontology Language: Primer. A brief description of the new version:
Practically, the new OWL syntax provides the potential for an improvement to the bridge between OBO and OWL. I'm waiting for the Jena folks to formally state their intention to support the new standard. In the meantime, we can use Protege 4 as a design tool for implementing ontologies with these new language features.

March 17

Yahoo and the Semantic Web

Yahoo has recently started to announce their efforts to integrate semantic web-based metadata into their search engine. I think this has the potential to differentiate Yahoo's search platform from competing search engines. However, given recent news reports of a Microsoft/Yahoo merger, perhaps its future is clouded. Nonetheless, at least the Yahoo engineers have convinced themselves that semantic annotation of web content is scalable (using RDFa). In the end, this approach has the potential of overcoming some of the limitations in determining search relevance based on term frequencies.

March 15

Semantic Web Tutorial

Here's a link to a Semantic Web tutorial by Lee Feigenbaum and Eric Prud'hommeaux, which was presented on March 5, 2008 at the Conference on Semantics in Healthcare & Life Sciences. This is a really nice introduction to the application of semantic web technologies in the life sciences. It would be hard to imagine doing the types of scientific knowledge integration they illustrate without RDF (and OWL).

March 06

Mapping Relational Data and RDF

A new incubator group, the RDB2RDF Group, has been formed as part of W3C efforts. Sponsored by Oracle and HP among others, the charter for the group is to:
The relationship between RDF/OWL and RDBMS is an important one, as most experimental data is managed in an RDBMS, but benefits from a transformation to RDF/OWL for integration and reasoning. It is useful to have RDF/OWL persist in an RDBMS, while being processed through an API such as Jena. The Jena team have implemented a couple of approaches to this, with emphasis on support for the SPARQL query language. It's not hard to see that efforts like these mirror the object-relational mapping work that has resulted in systems like Hibernate that seamlessly map Java objects to relational schemas.

February 08

Text Mining

Here's a recent publication in PLoS Computational Biology by K. Bretonnel Cohen and Lawrence Hunter titled "Getting Started In Text Mining". The authors discuss the strategies for developing text mining applications in the biomedical field and why they believe bioscientists appear to be better at developing (useful) systems than text mining specialists. Text mining in a specialized domain will always require domain experts to help evaluate algorithmic approaches, so it's no surprise that bioscientists can develop useful tools for the general scientist (that old precision/recall evaluation). It would have been nice for the authors to have explored the role of scientific ontologies in emerging text mining algorithms, as I believe this is the next wave of development. The authors include a nice list of references at the end.

February 05

Comparison of Programming Languages

A recent paper in BMC Bioinformatics, "A Comparison of Common Programming Languages used in Bioinformatics" by Mathieu Fourment and Michael R. Gillings, compares the performance of C, C++, C#, Java, Perl and Python on a small number of common bioinformatics algorithms. Not surprisingly, Perl and Python were significantly slower than the other languages.
Java compared well with the C programming language family, and with a little optimization (e.g., compiling) I am sure it would perform even better in their tests. It would be fun to see how an object-oriented scripting language like Groovy would perform. Since Groovy sits on top of Java, perhaps this might be the best of both worlds, with the ease of programming of a scripting language and the performance of a compiled language. I've been using Groovy on a couple of small projects and it's a compelling choice if you have a significant investment in Java components.

January 28

SPARQL elevated to W3C Recommendation

The SPARQL query language has recently been elevated to the status of a formal recommendation by the World Wide Web Consortium (W3C). SPARQL is the query language for the RDF data format, which is one of the foundations for the Semantic Web. The simplest analogy is to relate SPARQL to SQL, since the query syntaxes look similar. However, SPARQL is a profound innovation in allowing for queries across distributed data independent of format. I am envisioning a day soon when research data managed within a drug discovery organization will be discovered, integrated and synthesized using SPARQL as the primary data access method to data represented in RDF. The source data will continue to be acquired and managed in traditional RDBMSs like ORACLE, but the power of SPARQL will come in the integration of data which is distributed throughout the organization and is structured, semi-structured and unstructured. SPARQL will be a key component to realizing re-usable drug discovery knowledge.

January 07

WordNet in OWL

I've been integrating the RDF/OWL representation of WordNet in my search application and I'm quite appreciative of the effort that the W3C task group made to develop the translation of Princeton University's WordNet lexical resource. The defined synonym sets provide a valuable lexical source for improving search queries.
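As a rough illustration of how synonym sets can improve a search query, here is a minimal Python sketch of query expansion. It is not the code from my search application: the synonym sets are hard-coded, hypothetical examples (a real application would load them from the WordNet RDF/OWL data), and the function names are made up for this example.

```python
# Hypothetical synonym sets for illustration only; a real application
# would read these from the WordNet RDF/OWL representation.
SYNSETS = [
    {"tumor", "tumour", "neoplasm"},
    {"drug", "medication", "medicine"},
]


def build_index(synsets):
    """Map each word to the union of all synonym sets it appears in."""
    index = {}
    for synset in synsets:
        for word in synset:
            index.setdefault(word, set()).update(synset)
    return index


def expand_query(query, index):
    """Return one alphabetically sorted synonym group per query term.

    Terms with no known synonyms are kept as a single-word group.
    """
    groups = []
    for term in query.lower().split():
        groups.append(sorted(index.get(term, {term})))
    return groups


def to_boolean_query(groups):
    """Render the groups as an AND of ORs, e.g. '(a OR b) AND c'."""
    parts = []
    for group in groups:
        if len(group) > 1:
            parts.append("(" + " OR ".join(group) + ")")
        else:
            parts.append(group[0])
    return " AND ".join(parts)


if __name__ == "__main__":
    index = build_index(SYNSETS)
    print(to_boolean_query(expand_query("tumor drug", index)))
```

The AND-of-ORs form is just one choice; how the expanded terms are weighted or combined depends on the search engine behind it.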
Having tried in the past to come up to speed on the Prolog-based system and floundered, now that the data is represented in the OWL language I can use tools such as Jena and SPARQL to develop applications for information retrieval. Hopefully the group will find the resources to keep the RDF/OWL representation in sync with the Princeton release. WordNet is now at version 3.0, while the RDF/OWL represents the 2.0 version.

Science Commons and Knowledge Re-use

In a post on the Science Commons Weblog, titled "Why we need to figure out what we already know", Donna Wentworth of the Science Commons group reiterates the goal of the Science Commons to "enable scientists to use the Web to get precise answers to complex research questions — not 388,000+ search results that contain the word pyridinium." I believe John Wilbanks, who leads the group, has the right vision for what the community needs to support a culture of knowledge re-use. Although I am sure there are good reasons to be skeptical of exactly how the Semantic Web can support these kinds of efforts, in my own work I am convinced that the application of the Semantic Web technologies goes a long way to formalize the annotation of life sciences data with the appropriate metadata and enable improved search, data integration and knowledge synthesis. The tools are robust, the data languages such as RDF and OWL are stable, and the work to develop scalable RDF stores is showing good progress.