Rocky Dunlap’s Weblog

Entries tagged as ‘ontology’

Categorizing the Web

December 5, 2008

A perfect intelligence would not confine itself to one order of thought, but would simultaneously regard a group of objects as classified in all the ways of which they are capable.  

Stanley W. Jevons, 1874.

We have known for some time that humans mentally categorize objects in order to make sense of the world and to reason about it.  Linguists and cognitive scientists suggest that categories are useful because they allow us to infer unobserved properties from observed ones.  Although we can never know everything about an object, observing just a few things (e.g., it is furry, has four legs, and barks) lets us determine a lot about it (e.g., this is a dog, and most dogs are nice, so I need not run away).

Meanwhile, the Web can be seen as another “world” of observable objects.  The massive scale and complexity of the Web make it essentially impossible for humans (and artificial cognitive agents) to come to grips with more than a very small slice of the content available.  It is truly information overload.  As one way to cope, Web scientists envision a “Semantic Web” in which content is given explicit meaning, a level of semantics beyond what HTML provides, so that intelligent agents can reason about the content found there.

The task of creating a more intelligent Web is one of categorization.  Just as humans tend to work in categories as the basis for higher level reasoning, cognitive agents rely on categories to reason about objects on the Web.  To be sure, much work has already been done along these lines to move us toward the Semantic Web vision.  The notion of ontology as a formal description of concepts has caught on like wildfire.  The first set of standardized ontology languages has already emerged (e.g., RDF, OWL) and researchers continue to work on improving the expressivity and reasoning capabilities of these languages.  While the need for formalized ontologies has gained enormous acceptance, critics still question whether such an approach is even feasible given the scale of the Internet.  Who is going to classify the Web?  There are far too many objects and far too many categories for any individual, large corporation, or government to make sense of it all.

Meanwhile, another phenomenon that has recently gained popularity is tagging.  Tagging, at least as far as the Web is concerned, is simply the unstructured, unconstrained labeling of objects (e.g., web pages, books, people) by people on the Internet.  While tagging is certainly useful personally (in the same way that labeling jars of spices in your kitchen is useful), much of its power comes from the fact that a single item may be tagged by hundreds or thousands of people.  Presumably, many of those tags will agree, but there will of course be some outliers.  Despite its informal, unconstrained nature, tagging is still a form of categorization designed to help us come to grips with the Web.
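
To make the "many taggers, a few outliers" idea concrete, here is a minimal sketch in Python. The users, the tags, and the threshold of two users are all made up for illustration; real tagging sites may aggregate tags quite differently.

```python
from collections import Counter

# Hypothetical tags applied to a single web page by different users.
tags_by_user = {
    "alice": ["ontology", "semantic-web"],
    "bob": ["ontology", "rdf"],
    "carol": ["semantic-web", "ontology"],
    "dave": ["toread"],  # an outlier that nobody else uses
}

# Count how many distinct users applied each tag (set() guards against
# one user repeating the same tag).
counts = Counter(tag for tags in tags_by_user.values() for tag in set(tags))


def consensus_tags(counts, min_users=2):
    """Keep only the tags applied by at least min_users distinct users."""
    return sorted(tag for tag, n in counts.items() if n >= min_users)


consensus = consensus_tags(counts)
```

With this toy data, the tags used by only one person fall below the threshold, while the tags that several people agree on survive as the page's informal category labels.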

Because each kind of categorization supports fundamentally different purposes and reasoning capabilities, it is unlikely that one scheme will emerge as the “end-all” mechanism for categorizing objects on the Web.  Instead, the Semantic Web will emerge as a synthesis of categorization schemes, where each scheme provides only the specific reasoning services that it is good at, and leaves the rest to other schemes.  The unanswered question, though, is how the different schemes will interact to form a fully integrated system.  In this article I present some possible interactions among categorization schemes found on the Web.

The full article is available here:  Categorizing the Web.

Categories: Research

How many languages do you speak?

May 7, 2008

An essential problem facing all areas of computing is that of managing multiple ways of representing data. Recently, I’ve started wondering if there are too many languages for representing knowledge. Let me give you an idea of what I mean.

We are developing a prototype portal for finding and downloading datasets generated by climate models. The name of the system is CDP-Curator because it is an extension to an existing system called the Community Data Portal (CDP).

Just for kicks, I’m going to briefly outline all of the data representations I can think of that we have to deal with in hosting the climate model datasets. I will also list our motivations for using each one.

  • NetCDF – This is the Network Common Data Form developed at Unidata. It serves as a common data format for array-oriented scientific data. Although there are other similar representations, almost all of the datasets we are working with are already in NetCDF. In a sense, NetCDF is really outside of the CDP-Curator system boundary. We are pretty much forced to use this format because that’s what the climate modeling community is using and that’s the format of existing datasets. I should also point out that NetCDF files have a “header” containing metadata about the fields contained in the file.
  • XML – This is the eXtensible Markup Language. It is an extremely popular, tag-based syntax for data exchange, particularly among web-based systems. Our plan so far is for XML to serve as the syntax for metadata crossing the system boundary. This simply means that when someone wants to submit a new dataset (or climate model description) we expect the metadata to be delivered in XML. Our motivations for using XML include its wide acceptance throughout the climate community, the fact that it is human and machine readable/writeable, and the maturity of tools and APIs for manipulating XML.
  • W3C XML Schema – The schema language constrains the XML by defining what elements and attributes we expect to appear in a given XML document. Clearly, an XML schema language of some sort is required in order to let data contributors know the expected format of the metadata. Our specific choice of W3C XML Schema is based on the fact that it has wide tool support and the fact that other community members are already comfortable with it. Another option would be the RELAX NG schema language.
  • RDF/OWL – Although technically distinct, I am treating RDF/OWL as one language. OWL (Web Ontology Language) is an ontology language built on top of RDF (Resource Description Framework). These two languages are (or will be, in theory) at the heart of the Semantic Web. The RDF layer describes “resources” using subject-predicate-object triples. OWL sits on top of RDF and is a full-blown ontology language with a theoretical basis in Description Logics. The metadata we receive in XML will be translated into RDF/OWL and stored in a Sesame triple store. Our motivations for using RDF/OWL: it is a “web-friendly” language (XML syntax, URIs as identifiers), it is good for representing lots of dense relationships (arbitrary graphs), it is conceptual in nature, it has good support for class hierarchies, and it seems to work well with our faceted search interface.
  • RDBMS – We also plan on integrating with an existing relational database (RDBMS) for long term storage of the metadata (but not the climate data itself). RDBMSs are very mature, reliable, and have been around for a while. They are highly scalable, very fast for most querying needs, connect well with Java and web-based programming languages, and have sophisticated backup and replication capabilities. This is a natural choice for ensuring that the metadata will not be lost.
  • UML – We are using UML (Unified Modeling Language) class diagrams to model the RDF/OWL ontology. Currently our process is a bit backwards because we make the change first in the RDF/OWL and then we go back and update our conceptual model in UML.
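
To make the subject-predicate-object idea from the RDF/OWL item a little more concrete, here is a minimal sketch in plain Python. The identifiers (ex:dataset42, ex:generatedBy, and so on) are invented for illustration and are not taken from the CDP-Curator ontology, and a Python set stands in for what would really be an RDF library and a triple store.

```python
from typing import Optional

# A triple is (subject, predicate, object). All names below are hypothetical.
Triple = tuple[str, str, str]

triples: set[Triple] = {
    ("ex:dataset42", "rdf:type", "ex:Dataset"),
    ("ex:dataset42", "ex:generatedBy", "ex:modelA"),
    ("ex:dataset42", "ex:format", "NetCDF"),
    ("ex:modelA", "rdf:type", "ex:ClimateModel"),
}


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> list[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Follow one edge of the graph, then ask what kind of thing is at the other end.
model = match(s="ex:dataset42", p="ex:generatedBy")[0][2]
model_types = [obj for _, _, obj in match(s=model, p="rdf:type")]
```

The point of the sketch is only that the same flat set of triples can be walked as an arbitrary graph, which is the property that makes this representation attractive for dense relationships.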

What I have been considering lately is the following question: What is the cost of having all of these languages in place in one system? Maybe a better question is: What metrics do we use to measure the cost of dealing with data in multiple languages?

Probably the biggest cost involved is language translation. For example, in CDP-Curator, our current thinking is to ingest XML, load it into an RDBMS, populate the triple store periodically (e.g., nightly) from the RDBMS, and have the interface query the triple store. This involves the following translations:

  • XML to relational. This involves parsing the XML and writing SQL statements to insert the data into the RDBMS. Some RDBMSs may take the XML directly and do the conversion internally. A possible tradeoff here is a lack of control over the translation process.
  • Relational to RDF/OWL. Certainly many folks have already done this, although it is probably not understood as well as XML/relational translations. The translation could be done programmatically by requesting data from the RDBMS using SQL and then writing out the corresponding RDF. However, it may be difficult to do this serially because of the graph nature (triples) of RDF. A more suitable option might be to use an RDF/OWL library such as Jena. Jena will create an in-memory object model of the RDF/OWL and it can then be written out serially.
  • RDF/OWL to XHTML/DHTML. This seems to be more of a second-class translation since the XHTML will not be stored; it is just generated dynamically for presentation purposes. Nonetheless, it is a translation that we cannot ignore. Many of the latest GUI widgets use JSON to move bits of data around because it is JavaScript friendly. So, we might go RDF/OWL –> JSON –> XHTML. Another aspect of the latest GUI packages is that more and more code is moving into JavaScript. This means that we are writing less HTML and more JavaScript calls (i.e., manipulating the DOM manually). There are data-enabled widgets (such as the YUI DataSource utility) that automatically link a GUI element to some datastore. Again, this hides but does not avoid the need for language translation.
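
The chain of translations above can be sketched end to end in a few lines of Python. This is only a toy under stated assumptions: the XML layout, the table and column names, and the ex: identifiers are all hypothetical (the real CDP-Curator schema is not shown here), an in-memory SQLite database stands in for the RDBMS, and a plain list of tuples stands in for the triple store.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata submission; element and attribute names are made up.
SAMPLE_XML = """
<datasets>
  <dataset id="ds1"><title>Surface temperature</title><model>modelA</model></dataset>
  <dataset id="ds2"><title>Precipitation</title><model>modelB</model></dataset>
</datasets>
""".strip()


def xml_to_relational(xml_text, conn):
    """Translation 1: parse the XML and insert one row per <dataset>."""
    conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT, model TEXT)")
    rows = [
        (d.get("id"), d.findtext("title"), d.findtext("model"))
        for d in ET.fromstring(xml_text).iter("dataset")
    ]
    conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)", rows)


def relational_to_triples(conn):
    """Translation 2: turn each row into subject-predicate-object triples."""
    triples = []
    query = "SELECT id, title, model FROM dataset ORDER BY id"
    for ds_id, title, model in conn.execute(query):
        subject = f"ex:{ds_id}"
        triples.append((subject, "rdf:type", "ex:Dataset"))
        triples.append((subject, "ex:title", title))
        triples.append((subject, "ex:generatedBy", f"ex:{model}"))
    return triples


def triples_to_json(triples):
    """Translation 3: group triples by subject into JSON for a GUI widget."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    return json.dumps(by_subject, sort_keys=True)


conn = sqlite3.connect(":memory:")
xml_to_relational(SAMPLE_XML, conn)
triples = relational_to_triples(conn)
payload = triples_to_json(triples)
```

Even in this toy version, each hop hard-codes a mapping decision (which column becomes which predicate, which triples collapse into which JSON keys), and those decisions are exactly the translation cost I am trying to get a handle on.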

I guess the point that I am getting at is that our choice of languages for data/knowledge representation is definitely non-trivial, but at the same time it is hard to quantify which languages are suitable for which purposes. It is also hard to measure the impact of using one language over another, or one combination of languages versus a different combination. In a future post, I’ll attempt to talk about what kinds of questions we should ask when choosing a data/knowledge representation language and what kinds of metrics we could imagine.

Categories: Research