Rocky Dunlap’s Weblog

Entries tagged as ‘tagging’

Categorizing the Web

December 5, 2008

A perfect intelligence would not confine itself to one order of thought, but would simultaneously regard a group of objects as classified in all the ways of which they are capable.  

Stanley W. Jevons, 1874.

We have known for some time that we humans tend to mentally categorize objects in order to facilitate our understanding of the world and to enable reasoning capabilities.  Linguists and cognitive scientists suggest that categories are useful because they allow us to infer unobserved properties from observed properties.  While it is certainly impossible for us to know everything about an object, we can observe only a few things (e.g., furry with four legs and barks) but still determine a lot about the object (e.g., this is a dog and most dogs are nice so I should not run away).

Meanwhile, the Web can be seen as another “world” of observable objects.  The massive scale and complexity of the Web make it essentially impossible for humans (and artificial cognitive agents) to come to grips with all but a very small slice of the content available.  It is truly information overload.  As one way to cope, Web scientists envision a “Semantic Web” in which content is given explicit meaning, at a level of semantics greater than that provided by HTML, so that intelligent agents can reason about the content found there.

The task of creating a more intelligent Web is one of categorization.  Just as humans tend to work in categories as the basis for higher level reasoning, cognitive agents rely on categories to reason about objects on the Web.  To be sure, much work has already been done along these lines to propel us toward the Semantic Web vision.  The notion of ontology as a formal description of concepts has caught on like wildfire.  The first set of standardized ontology languages has already emerged (e.g., RDF, OWL), and researchers continue to work on improving the expressivity and reasoning capabilities of these languages.  While the need for formalized ontologies has gained enormous acceptance, critics still question whether such an approach is even feasible given the scale of the Internet.  Who is going to classify the Web?  There are far too many objects and far too many categories for any individual, corporation, or government to make sense of it all.

Meanwhile, another phenomenon that has recently gained popularity is tagging.  Tagging, at least as far as the Web is concerned, is simply the unstructured, unconstrained labeling of objects (e.g., web pages, books, people) by people on the Internet.  While tagging is certainly useful personally (in the same way that labeling jars of spices in your kitchen is useful), much of its power comes from the fact that a single item may be tagged by hundreds or thousands of people.  Presumably, many of those tags will agree, but there will of course be some outliers.  Still, despite its informal, unconstrained nature, tagging is a form of categorization designed to help us come to grips with the Web.
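
To make the "many tags agree, a few are outliers" idea concrete, here is a minimal sketch in Python.  It keeps only the tags that a given share of taggers applied to one item.  The data shape, the user names, the function name `consensus_tags`, and the 50% threshold are all assumptions made for illustration; no particular tagging site works exactly this way.

```python
from collections import Counter


def consensus_tags(taggings, min_share=0.5):
    """Return tags applied by at least `min_share` of the users who tagged an item.

    `taggings` maps a (hypothetical) user name to the set of tags that user
    applied to a single item. The threshold is an arbitrary assumption.
    """
    if not taggings:
        return []
    counts = Counter()
    for tags in taggings.values():
        # Normalize case and whitespace so "Dog" and "dog " count as one tag,
        # and use a set so each user counts at most once per tag.
        counts.update({t.strip().lower() for t in tags})
    n_users = len(taggings)
    kept = [(tag, c) for tag, c in counts.items() if c / n_users >= min_share]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return [tag for tag, _ in kept]


# Made-up example data: three users tagging the same web page.
taggings = {
    "alice": {"semantic-web", "ontology"},
    "bob": {"Semantic-Web", "rdf"},
    "carol": {"semantic-web", "ontology", "todo"},
}
print(consensus_tags(taggings))  # ['semantic-web', 'ontology']
```

The personal tag "todo" and the one-off "rdf" drop out, while the tags most people agree on survive.  A real system would likely weight users or tags differently; a simple share threshold is just the easiest rule to show.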

Because each kind of categorization supports fundamentally different kinds of purposes and reasoning capabilities, it is unlikely that one scheme will emerge as the “end-all” mechanism for categorizing objects on the Web.  Instead, the Semantic Web will emerge as a synthesis of categorization schemes, where each scheme provides only the specific reasoning services that it is good at, and leaves the rest to other schemes.  The unanswered question, though, is how the different schemes will interact to form a fully integrated system.  In this article I present some possible interactions among categorization schemes found on the Web.

The full article is available here:  Categorizing the Web.

Categories: Research

“Standardization” and e-science

April 29, 2008

Much of the work I have done on the Earth System Curator project is geared toward the standardization of a data model for describing climate modeling software and the output from climate simulations. (Okay, technically we are not creating a “standard” because we were not really chartered to do that nor do we wish to be prescriptive for the entire climate community. But, nonetheless, our task has been very much like a standardization effort.) For a moment, I want to step back from Curator and consider “standardization” itself.

Standardization is a task that leads us toward interoperability of systems. Although standardization is common in both industrial and scientific endeavors, it is interesting to consider what differences might arise between the standardization process for e-science vs. that of industry. The question I would like to answer is this: “What does standardization mean for e-science?” I contend that there are significant differences that affect how we should think about standardization in each arena.

This post is based on observations I have made while working on the Curator project. At the outset, our task was basically to create a common metadata formalism for describing climate models and output datasets. (I know this description of the project is far too short to be helpful, so please visit the website to read up on what we're doing.) To be perfectly honest, the task of coming up with standardized metadata has proven to be very difficult. Lately I have been wondering whether standardization takes on a different meaning for e-science than for other kinds of communities (e.g., business-driven standardization).

Here are some observations that affect the way we look at standardization for e-science.

1. Users of scientific data are diverse and often anonymous.

This means that it is very difficult to say up front who exactly will be using scientific data (e.g., simulation output or observations from sensors) once it is published. Certainly, there is an immediate set of users in mind before we begin collecting data for a scientific endeavor, but before long we realize that folks working in other domains might also benefit from the collected data.

So, in the name of interoperability, we set out to standardize our data so that when others acquire it, they can actually interpret it. However, this can be very challenging since we do not know exactly who will ultimately be using the data. Additionally, most scientific communities have developed their own “lingo,” and the word for describing a particular phenomenon depends on the “lingo” you are using. These “lingos” have deep roots, and we cannot ask that entire communities change vocabularies (even though many will admit the deficiencies in their own vernacular). For a real-life example of “lingo tension,” check out this thread in the CF Metadata mailing list archives.

Now, changing gears to an e-business perspective, you could argue that before a standardization effort even gets off the ground, there is a pretty clear idea of which players are involved and how they plan on using the resource being standardized. This makes (or should make) the whole process better defined, since we know the audience and the usage patterns up front.

2. Scientific data is often repurposed and applied in ways not intended by the data’s originator

The raw data collected or generated by a scientific community may be repurposed, used by scientists in other communities, and otherwise applied in new ways not intended by the data’s originator. In fact, science thrives in an environment where previous findings can be reapplied to new situations.

The impact on standardization is that it is not possible to know up front the context in which scientific data will be used. This points to a need to keep standards as general as possible while still being precise and informative. One way to resolve the tension between these two is to allow for customization through extension. In other words, the standard itself could serve as a framework allowing community members to provide domain-specific customizations and/or mappings to terms in other domains. The recent explosion of “tagging” might be one way to solicit terms from diverse community members. What is unclear is how the highly unstructured nature of tagging can be reconciled with the highly structured world of data standardization.
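
As a rough illustration of "customization through extension," here is a minimal Python sketch of a small core vocabulary that communities extend by mapping their own terms onto it. Everything in it is hypothetical: the class name `ExtensibleVocabulary`, the core terms, and the community names are invented for this example and are not taken from Curator or any existing standard.

```python
class ExtensibleVocabulary:
    """A small core vocabulary that communities extend with their own terms."""

    def __init__(self, core_terms):
        # core term (lowercased) -> human-readable definition
        self.core = {term.lower(): text for term, text in core_terms.items()}
        # (community, local term) -> core term
        self.mappings = {}

    def map_term(self, community, local_term, core_term):
        """Record that a community's local term means the same as a core term."""
        core_key = core_term.lower()
        if core_key not in self.core:
            raise KeyError(f"unknown core term: {core_term}")
        self.mappings[(community, local_term.lower())] = core_key

    def resolve(self, community, term):
        """Translate a term to the core vocabulary, or return None if unmapped."""
        key = term.lower()
        if key in self.core:
            return key
        return self.mappings.get((community, key))


# Hypothetical core terms and community "lingo", invented for illustration.
vocab = ExtensibleVocabulary({
    "air_temperature": "temperature of the air",
    "rainfall_rate": "depth of rain falling per unit time",
})
vocab.map_term("hydrology", "precip intensity", "rainfall_rate")
vocab.map_term("agronomy", "Rain Rate", "rainfall_rate")

print(vocab.resolve("hydrology", "Precip Intensity"))  # rainfall_rate
print(vocab.resolve("agronomy", "soil moisture"))      # None
```

The core stays small and general, and each community keeps its own words. The sketch says nothing about who gets to approve a mapping, which is where free-form tagging and a governed standard would have to meet.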

3. Complexity of “configuration” involved in scientific data collection

I have used the general term “configuration” here to refer to all of the many complexities involved in preparing to collect scientific data, either via simulation or observation. I have more experience on the simulation side of things, and I can say with confidence that there is an extreme amount of configuration involved before a large-scale computer simulation is run. Everything is parameterized, and all those parameters have to be set. For example, it is not uncommon for a shell script that kicks off a global climate simulation to be over 1500 lines long.

Now, say you are a scientist and you are planning on downloading some dataset over the Web and using it to inform your own research. You had better be very sure about what all went into creating that dataset. The best way to gain trust in a dataset is to know exactly how it was produced. This kind of metadata is often called “provenance.”

The sheer complexity of configuration bleeds over into the standardization process. In other words, you don’t just want to get a dataset in a standardized format, you also want a nice description of the configuration that took place leading up to the generation of that dataset. This kind of description is likely much more complex than a typical purchase order XML document. A scientific dataset should be accompanied by more than just a set of standard field names. It should include a “deep description” of what each field means, how it was generated, how it was post-processed, etc.
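
To give a feel for what a “deep description” might carry beyond field names, here is a minimal Python sketch. The class names, attributes, and sample values are all hypothetical; this is not the Curator data model, just one plausible shape for such a record.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldDescription:
    """What one field means and how its values came to be."""
    name: str
    meaning: str
    units: str
    generated_by: str
    post_processing: list = field(default_factory=list)


@dataclass
class DatasetDescription:
    """A dataset plus the configuration that produced it (its provenance)."""
    title: str
    configuration: dict
    fields: list = field(default_factory=list)

    def to_json(self):
        # asdict() recurses into the nested FieldDescription objects.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Made-up example values, for illustration only.
description = DatasetDescription(
    title="Example simulation output",
    configuration={"grid_resolution_deg": 2.0, "time_step_s": 1800},
    fields=[
        FieldDescription(
            name="tas",
            meaning="near-surface air temperature",
            units="K",
            generated_by="atmosphere component, written every model day",
            post_processing=["monthly mean", "regridded to 2 degree grid"],
        )
    ],
)
print(description.to_json())
```

Even this toy record is already richer than a list of column names, and a real simulation would have hundreds of configuration entries rather than two.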

Perhaps all of this is pointing to the fact that in a scientific setting, the process is just as important as (if not more important than!) the resulting data. Therefore, standardization efforts must be involved with the process part of doing science. The focus on recording process information seems less evident in other settings (e.g., it doesn’t make much sense to talk about how a purchase order was generated). Compounding the problem is the fact that the configuration process differs greatly among scientists, even in the same domain. If we cannot standardize the configuration processes themselves, how can we at least describe them in a standardized way?

Categories: Research