Categorizing the Web

A perfect intelligence would not confine itself to one order of thought, but would simultaneously regard a group of objects as classified in all the ways of which they are capable.  

Stanley W. Jevons, 1874.

We have known for some time that humans mentally categorize objects in order to make sense of the world and to reason about it.  Linguists and cognitive scientists suggest that categories are useful because they allow us to infer unobserved properties from observed ones.  Although we can never know everything about an object, we can observe just a few things (e.g., it is furry, has four legs, and barks) and still determine a lot about it (e.g., this is a dog, and most dogs are nice, so I should not run away).

Meanwhile, the Web can be seen as another “world” of observable objects.  The massive scale and complexity of the Web make it essentially impossible for humans (and artificial cognitive agents) to come to grips with more than a very small slice of the content available.  It is truly information overload.  As one way to cope, Web scientists envision a “Semantic Web” in which content is given explicit meaning: a level of semantics greater than that provided by HTML, through which intelligent agents can reason about the content found there.

The task of creating a more intelligent Web is one of categorization.  Just as humans use categories as the basis for higher-level reasoning, cognitive agents rely on categories to reason about objects on the Web.  To be sure, much work has already been done along these lines to move us toward the Semantic Web vision.  The notion of an ontology as a formal description of concepts has caught on like wildfire.  The first set of standardized ontology languages has already emerged (e.g., RDF, OWL), and researchers continue to improve the expressivity and reasoning capabilities of these languages.  While the need for formalized ontologies has gained enormous acceptance, critics still question whether such an approach is even feasible given the scale of the Internet.  Who is going to classify the Web?  There are far too many objects and far too many categories for any individual, large corporation, or government to make sense of it all.

Meanwhile, another phenomenon that has recently gained popularity is tagging.  Tagging, at least as far as the Web is concerned, is simply the unstructured, unconstrained labeling of objects (e.g., web pages, books, or people) by people on the Internet.  While tagging is certainly useful personally (in the same way that labeling jars of spices in your kitchen is useful), much of its power comes from the fact that a single item may be tagged by hundreds or thousands of people.  Presumably, many of those tags will agree, but there will of course be some outliers.  Despite its informal, unconstrained nature, tagging is still a form of categorization designed to help us come to grips with the Web.
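
To make the idea of agreement among taggers concrete, here is a minimal sketch in Python.  It is only an illustration, not a description of how any real tagging service works: the function name consensus_tags, the min_share threshold, and the sample data are all made up for this example.  It counts each tag once per person and keeps the tags that at least half of the taggers applied, which leaves the outliers behind.

```python
from collections import Counter


def consensus_tags(taggings, min_share=0.5):
    """Return the tags applied by at least min_share of taggers, most common first.

    taggings is a list with one entry per person; each entry is the list
    of tags that person gave to the same item (hypothetical data layout).
    """
    counts = Counter()
    for person_tags in taggings:
        # Count each tag once per person, ignoring case and stray whitespace.
        counts.update({tag.strip().lower() for tag in person_tags})
    threshold = min_share * len(taggings)
    return [tag for tag, n in counts.most_common() if n >= threshold]


# Four hypothetical people tagging the same web page.
taggings = [
    ["semantic web", "ontology"],
    ["Semantic Web", "tagging"],
    ["semantic web", "ontology", "toread"],
    ["ontology", "semantic web"],
]

print(consensus_tags(taggings))
```

Personal tags such as “toread” are still useful to the person who wrote them; they simply do not show up in the shared view of the item.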

Because each kind of categorization supports fundamentally different kinds of purposes and reasoning capabilities, it is unlikely that one scheme will emerge as the “end-all” mechanism for categorizing objects on the Web.  Instead, the Semantic Web will emerge as a synthesis of categorization schemes, where each scheme provides only the specific reasoning services that it is good at, and leaves the rest to other schemes.  The unanswered question, though, is how the different schemes will interact to form a fully integrated system.  In this article I present some possible interactions among categorization schemes found on the Web.

The full article is available here:  Categorizing the Web.



About rsdunlapiv

Computer science PhD student at Georgia Tech

One response to “Categorizing the Web”

  1. Heather says :

    I love that you have tagged this post with the tag “tagging”. =)

    It’s funny; I’m in a knitting community and we can tag knitting projects (personal folksonomy). You’d be surprised how many variations there are in tags generated by a cohesive community of “experts” about related content. For instance, one person’s project might get the tags “cable”, “cables” and “cabled”. Another person’s, knit from the same pattern, might just be tagged “cables”. I may only think to search for “cables” but not “cabled”, and I’d miss out on some relevant content. Thanks to your article, I know this is an artifact of “stemming”. =) There are other issues, too…there’s a way to look up patterns and projects by designer, but sometimes projects will get tagged by designer as well….
