How many languages do you speak?
An essential problem facing all areas of computing is that of managing multiple ways of representing data. Recently, I’ve started wondering if there are too many languages for representing knowledge. Let me give you an idea of what I mean.
We are developing a prototype portal for finding and downloading datasets generated by climate models. The name of the system is CDP-Curator because it is an extension to an existing system called the Community Data Portal (CDP).
Just for kicks, I’m going to briefly outline all of the data representations I can think of that we have to deal with in hosting the climate model datasets. I will also list our motivations for using each one.
- NetCDF – This is the Network Common Data Form developed at Unidata. It serves as a common format for array-oriented scientific data. Although there are other similar representations, almost all of the datasets we are working with are already in NetCDF. In a sense, NetCDF is really outside of the CDP-Curator system boundary. We are pretty much forced to use this format because it is what the climate modeling community uses and it is the format of the existing datasets. I should also point out that NetCDF files have a “header” containing metadata about the fields contained in the file.
- XML – This is the eXtensible Markup Language. It is an extremely popular, tag-based syntax for data exchange, particularly among web-based systems. Our plan so far is for XML to serve as the syntax for metadata crossing the system boundary. This simply means that when someone wants to submit a new dataset (or climate model description) we expect the metadata to be delivered in XML. Our motivations for using XML include its wide acceptance throughout the climate community, the fact that it is human and machine readable/writeable, and the maturity of tools and APIs for manipulating XML.
- W3C XML Schema – The schema language constrains the XML by defining which elements and attributes we expect to appear in a given XML document. Clearly, an XML schema language of some sort is required in order to let data contributors know the expected format of the metadata. Our specific choice of W3C XML Schema is based on its wide tool support and the fact that other community members are already comfortable with it. Another option would be the RELAX NG schema language.
- RDF/OWL – Although technically distinct, I am treating RDF/OWL as one language. OWL (Web Ontology Language) is an ontology language built on top of RDF (Resource Description Framework). These two languages are (or will be, in theory) at the heart of the Semantic Web. The RDF layer describes “resources” using subject-predicate-object triples. OWL sits on top of RDF and is a full-blown ontology language with a theoretical basis in Description Logics. The metadata we receive in XML will be translated into RDF/OWL and stored in a Sesame triple store. Our motivations for using RDF/OWL: it is a “web-friendly” language (XML syntax, URIs as identifiers), it is good for representing lots of dense relationships (arbitrary graphs), it is conceptual in nature, it has good support for class hierarchies, and it seems to work well with our faceted search interface.
- RDBMS – We also plan on integrating with an existing relational database (RDBMS) for long-term storage of the metadata (but not the climate data itself). RDBMSs are very mature and reliable, and have been around for a while. They are highly scalable, very fast for most querying needs, connect well with Java and web-based programming languages, and have sophisticated backup and replication capabilities. This is a natural choice for ensuring that the metadata will not be lost.
- UML – We are using UML (Unified Modeling Language) class diagrams to model the RDF/OWL ontology. Currently our process is a bit backwards because we make the change first in the RDF/OWL and then we go back and update our conceptual model in UML.
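To make the subject-predicate-object idea from the RDF/OWL item a bit more concrete, here is a minimal sketch of triples held in plain Python, with a pattern match of the kind a faceted search might issue. This is not how Sesame works internally; the namespace, the property names (`type`, `generatedBy`, `variable`), and the dataset and model identifiers are all made up for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and vocabulary, invented for this sketch.
EX = "http://example.org/cdp#"

TRIPLES: List[Triple] = [
    (EX + "dataset1", EX + "type", EX + "Dataset"),
    (EX + "dataset1", EX + "generatedBy", EX + "modelA"),
    (EX + "dataset1", EX + "variable", "surface_temperature"),
    (EX + "dataset2", EX + "type", EX + "Dataset"),
    (EX + "dataset2", EX + "generatedBy", EX + "modelB"),
    (EX + "modelA", EX + "type", EX + "ClimateModel"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


def datasets_generated_by(triples: Iterable[Triple], model: str) -> List[str]:
    """One 'facet': the subjects whose generatedBy value is the given model."""
    return sorted(subj for subj, _, _ in match(triples, p=EX + "generatedBy", o=model))
```

Calling `datasets_generated_by(TRIPLES, EX + "modelA")` returns a list containing only the `dataset1` URI. Because every fact is one flat triple, adding a new kind of relationship needs no schema change, which is part of what makes the representation attractive for dense, graph-shaped metadata.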
What I have been considering lately is the following question: What is the cost of having all of these languages in place in one system? Maybe a better question is: What metrics do we use to measure the cost of dealing with data in multiple languages?
Probably the biggest cost involved is language translation. For example, in CDP-Curator, our current thinking is to ingest XML, load it into an RDBMS, populate the triple store periodically (e.g., nightly) from the RDBMS, and have the interface query the triple store. This involves the following translations:
- XML to relational. This involves parsing the XML and writing SQL statements to insert the data into the RDBMS. Some RDBMSs may take the XML directly and do the conversion internally. A possible tradeoff here is a lack of control over the translation process.
- Relational to RDF/OWL. Certainly many folks have already done this, although it is probably not understood as well as XML/relational translations. The translation could be done programmatically by requesting data from the RDBMS using SQL and then writing out the corresponding RDF. However, it may be difficult to do this serially because of the graph nature (triples) of RDF. A more suitable option might be to use an RDF/OWL library such as Jena. Jena will create an in-memory object model of the RDF/OWL and it can then be written out serially.
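The two translations above can be sketched end to end with nothing but the Python standard library. This is only an illustration under stated assumptions, not the CDP-Curator implementation: the XML layout, the `dataset` table, the `http://example.org/cdp/` base URI, and the `title` and `generatedBy` properties are all hypothetical, SQLite stands in for the real RDBMS, and the output is N-Triples rather than RDF/XML written by a library such as Jena.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical metadata submission; the element names are assumptions.
SAMPLE_XML = """
<datasets>
  <dataset id="ds1">
    <title>Control run</title>
    <model>modelA</model>
  </dataset>
  <dataset id="ds2">
    <title>Doubled CO2 run</title>
    <model>modelB</model>
  </dataset>
</datasets>
"""

BASE = "http://example.org/cdp/"  # made-up base URI for identifiers


def xml_to_relational(xml_text: str, conn: sqlite3.Connection) -> int:
    """Step 1: parse the XML and insert one row per <dataset> element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset "
        "(id TEXT PRIMARY KEY, title TEXT, model TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (d.get("id"), d.findtext("title"), d.findtext("model"))
        for d in root.iter("dataset")
    ]
    # Parameterized statements keep the XML content out of the SQL text.
    conn.executemany("INSERT INTO dataset (id, title, model) VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def relational_to_ntriples(conn: sqlite3.Connection) -> List[str]:
    """Step 2: read rows back with SQL and emit one N-Triples line per fact."""
    lines = []
    query = "SELECT id, title, model FROM dataset ORDER BY id"
    for ds_id, title, model in conn.execute(query):
        subject = f"<{BASE}dataset/{ds_id}>"
        lines.append(f'{subject} <{BASE}title> "{escape_literal(title)}" .')
        lines.append(f"{subject} <{BASE}generatedBy> <{BASE}model/{model}> .")
    return lines


def run_pipeline(xml_text: str = SAMPLE_XML) -> List[str]:
    """Run both translations against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        xml_to_relational(xml_text, conn)
        return relational_to_ntriples(conn)
    finally:
        conn.close()
```

`run_pipeline()` returns four lines, two per dataset. A line-oriented syntax like N-Triples makes serial output straightforward because each triple stands alone; the difficulty mentioned above shows up more with nested serializations such as RDF/XML, where an in-memory model like the one Jena builds is a more natural starting point.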
I guess the point that I am getting at is that our choice of languages for data/knowledge representation is definitely non-trivial, but at the same time it is hard to quantify which languages are suitable for which purposes. It is also hard to measure the impact of using one language over another, or one combination of languages versus a different combination. In a future post, I’ll attempt to talk about what kinds of questions we should ask when choosing a data/knowledge representation language and what kinds of metrics we could imagine.