Rocky Dunlap’s Weblog

Entries categorized as ‘Research’

Who owns your Facebook profile?

February 3, 2009 · 3 Comments

Increasingly, our online presences define who we are.  Our lives have a sort of virtual counterpart as we report on what happens in our “real world” lives to the rest of the world in online social forums such as Facebook, MySpace, and blogs.  As our lives become increasingly exposed to the online world, you can’t help but wonder which is more important:  your real life, or the life the world sees through your online presence.  Just as your credit score (a number), not your real-life financial habits, is the primary mechanism for determining your creditworthiness, your online presence is, for better or worse, the true essence of who you are in many contexts, and the side of you that matters most.  For example, to what degree do hiring managers use your LinkedIn network to decide whether you are a good fit for the company where you are seeking employment?

Despite differences in the perceived influence of our online profiles, most of us will go so far as to say that the content we create and post online is at least an important part of our lives, and is content that we wish to keep and control.  But, as important as our online profiles are, we are generally happy to give up rights to our data and transfer control of it to third parties.  As an example, let’s take a look at the Facebook Terms of Use.

When you post User Content to the Site, you authorize and direct us to make such copies thereof as we deem necessary in order to facilitate the posting and storage of the User Content on the Site. By posting User Content to any part of the Site, you automatically grant, and you represent and warrant that you have the right to grant, to the Company an irrevocable, perpetual, non-exclusive, transferable, fully paid, worldwide license (with the right to sublicense) to use, copy, publicly perform, publicly display, reformat, translate, excerpt (in whole or in part) and distribute such User Content for any purpose, commercial, advertising, or otherwise, on or in connection with the Site or the promotion thereof, to prepare derivative works of, or incorporate into other works, such User Content, and to grant and authorize sublicenses of the foregoing. You may remove your User Content from the Site at any time. If you choose to remove your User Content, the license granted above will automatically expire, however you acknowledge that the Company may retain archived copies of your User Content. Facebook does not assert any ownership over your User Content; rather, as between us and you, subject to the rights granted to us in these Terms, you retain full ownership of all of your User Content and any intellectual property rights or other proprietary rights associated with your User Content.  (http://www.facebook.com/terms.php).

According to this, “Facebook does not assert any ownership over your User Content.”  While this in many respects takes care of the legal side of things, it does not really address the practical issues of data ownership and control.  Legally I “own” my Facebook profile, but how do I “get” it?  How do I “save” it?  If Facebook servers went down tomorrow (perhaps an unlikely scenario), would I be able to retrieve my profile?  What about the hundreds of pictures that I have uploaded?  Or my messages?  So, while I do retain ownership of content I create, Facebook does not guarantee anything about accessing it.  On the other side of the coin, what if I want to remove some or all of my profile?  Let’s say I log in and delete some of my messages.  Are they really gone?  How many backup copies exist on Facebook servers?

Before going on, allow me to interject a couple of things.  First, I realize that the last paragraph is starting to sound a little conspiracy-theory-esque.  I do not think that Facebook is out to get us or that anyone is planning on using our profiles against us.  Nor is this intended to be a rant against Facebook; I do not have any issues with the way Facebook has handled my own content.  On the contrary, I imagine that the original creators of Facebook had no idea that its user base would become so incredibly massive, and questions of data ownership and control probably seemed relatively inconsequential in its early phases of development.  Further, the privacy controls of Facebook seem quite reasonable insofar as you can decide which people get to see what content.  The data policies of Facebook are in line with the data policies of almost every other service that hosts user-generated content.  In fact, you can make the same observations of many other sites, such as LinkedIn or your favorite blog site.

The underlying issue here is bigger than just control over your social networking profile.  What I am exploring here is whether we need a technological and cultural shift in the way we think about user-generated data–including who owns it, who controls it, how it is accessed, and where it is stored.  The typical approach for architecting a site that delivers user-generated content is for the site to host both the application and the data.  The reasons for this are many.  For one, there is much technological inertia in that direction.  It fits the typical design pattern for building a web site:  get a web server, get a database server, get them to talk, and presto–you are ready to go.  Having the data close to the application is perhaps the basic premise for ensuring efficiency of data operations.  Consider the fact that Facebook serves over 15 billion images per day.  On average, that’s over 170,000 images per second.  You absolutely have to have the data close at hand to get that kind of throughput.  Also, most users are not really interested in managing their own data to begin with.  And, if site developers wish to make a change to the application (such as adding a new field to the profile) they can do so with ease because they have control over both the application and the data schema.  So, there is clearly good reason for sites like Facebook to manage the data for you.
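For anyone who wants to check the per-second figure, the arithmetic is short.  This snippet only restates the 15 billion images per day quoted above and spreads it evenly over a day; real traffic would of course have peaks well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

images_per_day = 15_000_000_000  # the figure quoted in the paragraph above
images_per_second = images_per_day / SECONDS_PER_DAY

print(f"{images_per_second:,.0f} images per second")  # prints "173,611 images per second"
```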

But, let’s imagine another scenario.  Let’s say you are signing up for a new Facebook account.  After putting in some basic information, you are presented with a prompt:  “Where would you like to store your profile information and other user-generated content?”  You are then given a couple of choices:  1.  Have Facebook maintain my profile data.  2.  Allow Facebook to access my personal “cloud” storage area.  You select option 2.  At this point you provide Facebook with credentials to access part of your personal storage area “in the cloud.”  Facebook would then access your storage area and configure it as required for the application.  All of  your Facebook user data would be stored there and accessed by Facebook as needed.  To be clear, the user experience on the site would be no different than if Facebook stored all of your data locally.  But, in fact, your data is now sitting inside a storage area that you own and control.
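To make the scenario a little more concrete, here is a minimal sketch of how an application might be written against a storage interface rather than against its own database.  Everything in it is hypothetical: the names `ProfileStore`, `SiteHostedStore`, `UserCloudStore`, and `SocialSite` are mine, and a plain dictionary stands in for the user's remote storage area.  A real version would have to deal with authentication, latency, and schema changes, none of which appear here.

```python
from typing import Dict, Protocol


class ProfileStore(Protocol):
    """Anything that can save and load a piece of user content by key."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class SiteHostedStore:
    """Option 1: the site keeps the data on its own servers."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._rows[key] = value

    def get(self, key: str) -> bytes:
        return self._rows[key]


class UserCloudStore:
    """Option 2: the data lives in a storage area the user controls.

    The dict passed in is a stand-in for remote storage; the credentials
    are kept only to show where they would be used.
    """

    def __init__(self, credentials: str, user_storage: Dict[str, bytes]) -> None:
        self._credentials = credentials
        self._user_storage = user_storage  # owned by the user, not the site

    def put(self, key: str, value: bytes) -> None:
        self._user_storage[key] = value

    def get(self, key: str) -> bytes:
        return self._user_storage[key]


class SocialSite:
    """The application code is identical whichever store is plugged in."""

    def __init__(self, store: ProfileStore) -> None:
        self._store = store

    def post_status(self, user: str, text: str) -> None:
        self._store.put(f"{user}/status", text.encode("utf-8"))

    def show_status(self, user: str) -> str:
        return self._store.get(f"{user}/status").decode("utf-8")


# The user hands the site access to storage that the user owns.
my_storage: Dict[str, bytes] = {}
site = SocialSite(UserCloudStore("hypothetical-token", my_storage))
site.post_status("alice", "hello")
print(site.show_status("alice"))  # prints "hello"
print(sorted(my_storage))         # prints "['alice/status']"
```

The point of the sketch is the last two lines: the site behaves the same from the user's point of view, but the content ends up in `my_storage`, which the user can inspect, copy, or delete without asking the site.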

Is such a thing technically possible?  Would Facebook ever agree to it?  Is there really a need or a demand for this?  I have much more to say on this subject, but let’s leave it here for now.

Some related links:

http://www.eweek.com/c/a/Enterprise-Applications/Who-Owns-Your-Social-Data-You-Do-Sort-of/

http://www.dataportability.org/

Categories: Everything Else · Research

Categorizing the Web

December 5, 2008 · 1 Comment

A perfect intelligence would not confine itself to one order of thought, but would simultaneously regard a group of objects as classified in all the ways of which they are capable.  

Stanley W. Jevons, 1874.

We have known for some time that we humans tend to mentally categorize objects in order to facilitate our understanding of the world and to enable reasoning capabilities.  Linguists and cognitive scientists suggest that categories are useful because they allow us to infer unobserved properties from observed properties.  While it is certainly impossible for us to know everything about an object, we can observe only a few things (e.g., furry with four legs and barks) but still determine a lot about the object (e.g., this is a dog and most dogs are nice so I should not run away).

Meanwhile, the Web can be seen as another “world” of observable objects.  The massive scale and complexity of the Web makes it essentially impossible for humans (and artificial cognitive agents) to come to grips with all but a very small slice of the content available.  It is truly information overload.  As one way to cope, Web scientists envision a “Semantic Web” in which content is given explicit meaning—a level of semantics greater than that provided by HTML, and through which intelligent agents can reason about the content found there.  

The task of creating a more intelligent Web is one of categorization.  Just as humans tend to work in categories as the basis for higher level reasoning, cognitive agents rely on categories to reason about objects on the Web.  To be sure, much work has already been done along these lines to move us toward the Semantic Web vision.  The notion of ontology as a formal description of concepts has caught on like wildfire.  The first set of standardized ontology languages has already emerged (e.g., RDF, OWL) and researchers continue to work on improving the expressivity and reasoning capabilities of these ontology languages.  While the need for formalized ontologies has gained enormous acceptance, critics still question whether such an approach is even feasible given the scale of the Internet.  Who is going to classify the Web?  There are far too many objects and far too many categories for any individual, large corporation, or government to make sense of it all.

Meanwhile, another phenomenon that has recently gained popularity is tagging.  Tagging, at least as far as the Web is concerned, is simply the unstructured, unconstrained labeling of objects (e.g., web pages, books, people) by people on the Internet.  While tagging is certainly useful personally (in the same way that labeling jars of spices in your kitchen is useful), much of the power of tagging is based on the fact that a single item may be tagged by hundreds or thousands of people.  Presumably, many of those tags will agree, but there will of course be some outliers.  But, despite the informal, unconstrained nature of tagging, it is still a form of categorization designed to help us come to grips with the Web.
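One simple way to picture how agreement emerges from many independent taggers is to count votes.  The sketch below is only an illustration: the function name `consensus_tags`, the 50% threshold, and the sample tags are all made up, and real tagging systems are likely to weigh tags in more sophisticated ways.

```python
from collections import Counter


def consensus_tags(taggings, min_share=0.5):
    """Return the tags used by at least `min_share` of the people who tagged an item."""
    counts = Counter()
    for tags in taggings:
        # Normalize and de-duplicate so each person gets one vote per tag.
        counts.update({tag.strip().lower() for tag in tags})
    threshold = min_share * len(taggings)
    return sorted(tag for tag, votes in counts.items() if votes >= threshold)


# Four people tag the same web page; two of the tags are outliers.
taggings = [
    ["python", "programming"],
    ["Python", "tutorial"],
    ["python", "programming", "snake"],
    ["programming", "python"],
]
print(consensus_tags(taggings))  # prints "['programming', 'python']"
```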

Because each kind of categorization supports fundamentally different kinds of purposes and reasoning capabilities, it is unlikely that one scheme will emerge as the “end-all” mechanism for categorizing objects on the Web.  Instead, the Semantic Web will emerge as a synthesis of categorization schemes, where each scheme provides only the specific reasoning services that it is good at, and leaves the rest to other schemes.  The unanswered question, though, is how the different schemes will interact to form a fully integrated system.  In this article I present some possible interactions among categorization schemes found on the Web.

The full article is available here:  Categorizing the Web.

Categories: Research

Earth System Curator: Metadata Infrastructure for Climate Modeling

December 1, 2008 · Leave a Comment

The Earth System Curator is a National Science Foundation sponsored project developing a metadata formalism for describing the digital resources used in climate simulations. The primary motivating observation of the project is that a simulation/model’s source code plus the configuration parameters required for a model run are a compact representation of the dataset generated when the model is executed. The end goal of the project is a convergence of models and data where both resources are accessed uniformly from a single registry. In this paper we review the current metadata landscape of the climate modeling community, present our work on developing a metadata formalism for describing climate models, and reflect on technical challenges we have faced that require new research in the area of Earth Science Informatics.

Available on SpringerLink.

Categories: Research

Will cloud computing change the face of e-science?

November 21, 2008 · 2 Comments

First, a bit about cloud computing, and then some extrapolative thinking on what its impact will be on e-science.

Cloud computing is a buzzword that we are hearing more and more recently.  It’s one of those terms that people latch onto because they know there is really something lurking there, but they can’t quite put their finger on what it actually is.  Wikipedia says cloud computing is “a style of computing in which IT-related capabilities are provided ‘as a service’, allowing users to access technology-enabled services from the Internet … without knowledge of, expertise with, or control over the technology infrastructure that supports them.”  The term cloud is presumably used as a metaphor for the Internet since it is usually depicted that way on network diagrams.

While I’m not sure that definition would jibe with everyone, it’s in line with Amazon’s Elastic Compute Cloud offering (EC2).  Amazon describes EC2 as “a web service that provides resizable compute capacity in the cloud.”  Essentially, you can design the computational architecture that you want and Amazon will provide it to you as a service on a pay-as-you-go basis.  Need 100 Linux nodes, but only for a week?  No problem–you only pay for what you use, and when you are done, just terminate your nodes and forget about them.  You choose the machine image that you want, the software, the memory size, and the required storage capacity.  Apparently, it can be configured very quickly, so you can scale your computational capacity at a very small incremental cost.  I admit the EC2 model is very impressive if it works as they state on the home page.

Assuming cloud computing services such as this come into the mainstream, there will be huge impacts in many domains that rely on IT infrastructure.  E-science is one area that might be radically transformed.

Much of what impedes scientific progress is the set of computational and technical issues involved in conducting large scale simulations.  Incompatibilities among computational environments hinder the sharing of experiments and results.  Repeatability, a key tenet of the scientific method, is nearly impossible with respect to e-science computations (at least repeatability by other scientists in other labs using a different computational environment).  Portability is therefore a prerequisite to repeatability in the realm of e-science.  The cloud provides a needed layer of abstraction so that scientists can think about science and not about computer science.

In almost all domains of e-science, results are disseminated by scientific publications in conferences, journals, and the like.  While many journals have moved to an electronic format, the underlying paradigm is still the same:  results are presented in a summarized format (e.g., plots and averages), but little information is provided on how to reproduce the computations that led up to the results.  And this is understandable.  It might take literally months of tweaking configurations followed by months of processor time followed by months of post-processing and analysis before the results are finally in.  How could you possibly provide enough information for someone else to reproduce the same experiment?  And even if you could, how do you get around the fact that everyone’s computational environment is different and your code might not even run on another platform?

The cloud computing platform sees the computational environment (e.g., operating system + compiler + processor + software + …) as a first class object that can be created, registered, shared, searched, and otherwise manipulated.  For example, Amazon’s EC2 service provides a registry of “Amazon Machine Images” (AMIs) that anyone can access and instantiate.  Custom AMIs can be added to the registry.  This is a paradigmatic shift because what used to be the “infrastructure” has been ripped out and parameterized.  (Imagine being able to change the foundation of a building with ease.)  The computational environment becomes another configuration parameter to set along with your experiment’s scientific parameters.  In this sense, the cloud computing platform can be viewed as the “meta-infrastructure.”  Sure, it is an infrastructure at the same time, but for the first time it is an infrastructure that we can safely ignore.
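Here is a rough sketch of what “the environment as just another parameter” might look like if you wrote an experiment description down as data.  The class names, field names, and the `ami-PLACEHOLDER` identifier are all invented for illustration; nothing here talks to EC2 or reflects any existing metadata format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class Environment:
    """The computational environment, described as plain data."""

    machine_image: str  # placeholder for a registered image identifier
    node_count: int
    memory_gb: int


@dataclass
class Experiment:
    """Scientific parameters and the environment, side by side."""

    name: str
    environment: Environment
    science_parameters: Dict[str, float] = field(default_factory=dict)


def to_shareable(experiment: Experiment) -> str:
    """Serialize the whole experiment so that someone else can re-run it."""
    return json.dumps(asdict(experiment), indent=2, sort_keys=True)


def from_shareable(text: str) -> Experiment:
    """Rebuild the experiment description from its shared form."""
    raw = json.loads(text)
    return Experiment(
        name=raw["name"],
        environment=Environment(**raw["environment"]),
        science_parameters=raw["science_parameters"],
    )


experiment = Experiment(
    name="toy-run",
    environment=Environment(machine_image="ami-PLACEHOLDER", node_count=100, memory_gb=8),
    science_parameters={"time_step_seconds": 1800.0, "resolution_degrees": 2.5},
)
shared = to_shareable(experiment)
print(from_shareable(shared) == experiment)  # prints "True"
```

The design choice worth noticing is that `Environment` sits inside `Experiment` as an ordinary field, so swapping the machine image is no different from changing the time step.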

The advantage to e-science?  With a parameterized infrastructure afforded by the cloud platform, we are well on the road to sharing much more than just scientific results.  Instead, we will share the experiments themselves–descriptions of scientific computations that anyone can execute and examine to validate results and extend them.  Admittedly, we have much work to do before this vision becomes a reality.  But, maturing cloud computing services like those offered by Amazon are a big first step toward a better way of doing science.

Categories: Research

Beyond the Deep Web

July 9, 2008 · 1 Comment

Modern search engines are best equipped to handle the so-called “surface Web.” However, sitting below the static content on the surface of the Internet is a wealth of information that is much harder to index. This body of information has been called the “deep Web” because much of it is hidden in databases that can only be accessed via online forms that–while easy for humans to fill out–present a challenge for automated agents such as web-crawlers that need to determine what information is hiding behind the form.

But even if a web-crawler could determine how to fill out a form and could extract and index the “deep” content from a site–would such an index contain the full information potential of the Web?

Contrary to what you might think, the end goal of submitting a query to a search engine is not to find a particular web page. The goal is an answer to a question. How do I get from my house to the store? What time is a certain film playing at my local theater? While some kinds of questions are getting easier and easier to answer, most questions are far too sophisticated to ask a search engine and expect to get an accurate answer. For example, try Googling “How many Starbucks are between 2020 Broadway and 1732 W. 53 Street?” You’re not going to get the result you are looking for. Nor will you be directed to a web page where you can easily find the answer.

It seems unreasonable to ask a search engine these kinds of questions. Why?

  • For one, most search engines are keyword-oriented and you cannot really think of a way to write down the question. We’ve been bred to think of searches as sets of keyword combinations. What words can I put together to find the pages with the information I seek? Unfortunately, most real questions cannot be formulated as a set of keywords.
  • Another issue is that search engines are designed to return web pages. But since our primary need is answers–not web pages–search engines should be “answer-oriented,” not web page-oriented. So, assuming we have solved the first problem–that is, specifying the question in a manner that the search engine can understand–we want the search engine to take the necessary steps to give us the answer we are looking for.

So, we see that the problem boils down to two measly problems: the input is wrong and the output is wrong! Yikes!

Let’s explore what we mean by “answer-oriented” a bit. One way of thinking about an “answer-oriented” search engine is the following. Assume my question is: How many movies has Francis Ford Coppola directed? Let’s say that using its web index, the search engine is able to find some relevant pages based on keywords in the question. Now, the obvious next step would be for the search engine to scrape the page for the number I am looking for (perhaps using the hint “how many”) and return to me that number. Now, this would be a helpful feature, but in reality it doesn’t do much for the searcher who could within a few seconds do a manual grep of the page and find the number he or she was looking for. But this entire scenario is still based on our current search paradigm–namely, that the results of searches are web pages.

Now here is another scenario. The user poses the question: How many Starbucks restaurants are between my house and my office? The first thing to note is that in all likelihood, no web page actually exists anywhere containing the answer we seek. It is also unlikely that there is a “deep Web” database somewhere with a row in it containing the needed information. But, it is highly likely that all the needed information is in fact available online. Certainly we could find the route from my house to the office using a mapping web site. And, we could find the addresses of Starbucks locations in the area. But what we need is more than information retrieval. We need information synthesis. Answering the question requires some computation. The hard question we’d like to answer is: can a search engine be smart enough to perform the needed computations (or outsource them) and then return the result?
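To show how small the “computation” part of this question really is once the pieces have been retrieved, here is a toy version.  It assumes the route has already come back from a mapping service as a list of (latitude, longitude) waypoints and that the store locations are already known; the coordinates are invented, and “between” is approximated as “within 200 meters of some waypoint,” which is my simplification rather than a real routing algorithm.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def count_near_route(route, places, max_km=0.2):
    """Count the places that lie within `max_km` of at least one route waypoint."""
    return sum(
        1
        for place in places
        if any(haversine_km(place, waypoint) <= max_km for waypoint in route)
    )


# Invented data: a short route heading east, and three store locations.
route = [(40.000, -105.000), (40.000, -104.990), (40.000, -104.980)]
stores = [
    (40.001, -104.990),   # roughly 110 m north of the second waypoint
    (40.050, -104.990),   # several kilometers off the route
    (40.000, -104.9805),  # a few dozen meters from the last waypoint
]
print(count_near_route(route, stores))  # prints "2"
```

The hard part, of course, is not these few lines.  It is getting a search engine to figure out that this is the computation the question calls for, and where to find the route and the store list in the first place.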

Some challenges must be overcome to achieve this:

  • The search engine must “understand” the user’s question. As it stands today, search engines don’t really accept questions–just words. The words are string matched against an index. There are hardly any semantics associated with the query, and therefore the search engine has a very shallow understanding of what the user really wants.
  • The search engine must index more than just web pages. It must also index services that can perform computations. The search engine must also understand how the services work, most likely by having a description of the service interface. Alternatively, the search engine could somehow outsource the finding of the needed service. (UDDI could be considered a rudimentary version of this, but it is a “registry” based technology where the service provider must actively register the service. Instead, services should be “discovered” dynamically by the search engine so that a massive index can be built just like the index of static HTML pages. Obviously UDDI has not really caught on. When is the last time you searched a UDDI registry?)
  • If the answer to the query is not found on a static web page, but requires a service invocation, the computational resources must be allocated for the service to run. The search engine itself may take responsibility for assigning resources, or the service request could be floated into the “cloud” where processing would be assigned in a distributed fashion and the result returned asynchronously when it has been computed. (A side issue is the ability to estimate the computational cost for answering the question. Lower cost questions could be answered quickly, perhaps by the search engine itself. Higher cost questions would require more resource allocation and the result may not be returned for some time. Good estimation is essential here.)
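The last two challenges can be sketched together as a tiny dispatcher: an index of services, each carrying a description and a cost estimator, with cheap requests answered right away and expensive ones set aside for later.  All of the names here (`Service`, `AnswerEngine`, `sum_range`) and the notion of a numeric “budget” are assumptions of mine.  The sketch also says nothing about the first challenge, understanding the question, or about how services would actually be discovered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Service:
    name: str
    description: str                        # stands in for an interface description
    estimate_cost: Callable[[dict], float]  # rough cost of answering a request
    run: Callable[[dict], object]           # performs the actual computation


class AnswerEngine:
    def __init__(self, budget: float) -> None:
        self.budget = budget  # the most it will spend answering immediately
        self.index: Dict[str, Service] = {}
        self.deferred: List[Tuple[str, dict]] = []

    def index_service(self, service: Service) -> None:
        """Add a service to the index (dynamic discovery is out of scope here)."""
        self.index[service.name] = service

    def ask(self, service_name: str, args: dict):
        """Answer now if the estimate fits the budget; otherwise defer the request."""
        service = self.index[service_name]
        if service.estimate_cost(args) <= self.budget:
            return ("answer", service.run(args))
        self.deferred.append((service_name, args))
        return ("deferred", None)

    def run_deferred(self) -> list:
        """Stand-in for asynchronous processing somewhere in the cloud."""
        results = [self.index[name].run(args) for name, args in self.deferred]
        self.deferred.clear()
        return results


engine = AnswerEngine(budget=1000)
engine.index_service(
    Service(
        name="sum_range",
        description="Sum the integers from 0 up to n - 1",
        estimate_cost=lambda args: args["n"],
        run=lambda args: sum(range(args["n"])),
    )
)
print(engine.ask("sum_range", {"n": 10}))    # prints "('answer', 45)"
print(engine.ask("sum_range", {"n": 5000}))  # prints "('deferred', None)"
print(engine.run_deferred())                 # prints "[12497500]"
```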

Categories: Research

Massively Parallel Collaboration

June 6, 2008 · 1 Comment

The face of science is changing as more and more experiments are moved out of the lab and onto the Grid. As the number of processors available for computation increases, scientists are able to simulate physical phenomena with higher spatial and temporal resolutions. But what is to become of all the data produced by computational Grids around the world? While much effort has been put into parallelization of computations for generating scientific data, there is much work left to be done on the other side of the fence where the data is analyzed.

Along with science, the rest of the world is changing, too. The Internet is becoming more dynamic than ever (see my Web 2.0 post) and the Web has become the place for social interactions. Folks such as James Surowiecki, author of “The Wisdom of Crowds,” have noticed the power and intelligence of large groups that–given the right set of circumstances–are able to solve problems, make decisions, and even predict the future much more accurately than an individual could.

You need not look far to find examples of the wisdom of crowds on the Web. A fairly obvious one is Wikipedia. This site is enormously popular for finding information about just about any topic, but it is not centrally maintained like a traditional encyclopedia. In fact, anyone can edit an entry as they please. And, maybe surprisingly, the result of thousands of people contributing in their own independent, unsupervised way is a very useful resource! Other sites such as Flickr, YouTube, del.icio.us, and Facebook also show the trend toward online collaboration of literally millions of people.

The question for e-science is: how do we leverage this technological and cultural trend toward massive collaboration? One possibility is to move much of the scientific analysis done by individual scientists out into Web space. As it stands today, the steps required to find some new trend in the data or some interesting plot are almost always done by a single scientist working at his or her own machine. The final results are published in a scientific journal, or presented at a conference for many to see, but by and large, the analysis itself is done by an individual or a very small group of individuals.

There is another way to think about scientific analysis. Consider a recent site I stumbled upon called Many Eyes. The idea of the site is simple: you upload your own data (in tabular format) and it can be visualized by anyone on the web using a large number of visualization types (bar chart, scatterplot, world map, pie chart, etc.). According to the Many Eyes website, the goal of the site is to “bet on the power of human visual intelligence to find patterns… to ‘democratize’ visualization and to enable a new social kind of data analysis.”

Check it out. Here are two examples of visualizations that I was able to create in a matter of minutes. The first one shows the number and total valuation of residential building permits issued for Boulder, Colorado from 1993 to 2003. This visualization uses a standard bar chart.

The second visualization is a tag cloud of all the content currently on my blog.

Once a dataset is uploaded, it is public. Users can view existing visualizations of datasets (like the ones I created) or they can create entirely new visualizations. The philosophy of Many Eyes is that you can tap into the “wisdom of crowds” by allowing many people to create their own kinds of visualizations of the same dataset. Users can elect to “watch” datasets or visualizations to be notified of new activity. Additionally, users can post public comments about datasets and visualizations.

Can this type of massively parallel collaboration be harnessed for sophisticated scientific analyses? I think so, and I think this is where we are heading. I had a conversation today with two students attending the Numerical Techniques for Global Atmospheric Models workshop at NCAR. When I proposed to them the idea of social scientific data analysis, they were very interested. In particular, I explained to them the Many Eyes concept and asked if such a site would be useful to atmospheric modelers if the site supported netCDF (a popular data format for atmospheric data) and more sophisticated visualizations. They agreed that such a site would be helpful if it could actually work over the Web. One of the students commented that a big win for such a site would be the ability for scientists to easily find and repeat the post-processing steps of another scientist. (See the El Nino scenario here.)

Surely, many questions remain to be answered. Will the Web infrastructure support sophisticated scientific analyses? Does the sheer size of datasets prevent scientists from working in online Web spaces? What are the cultural impacts of massively parallel collaborations? Would scientists even care to participate for fear of someone else “stealing” their discoveries?

At the end of the day, though, it is clear that the Web has enabled a whole new level of socialization and collaboration that was previously impossible. It’s up to us to determine whether science will embrace this cultural shift and the “wisdom of crowds” that comes with it.

Categories: Research

Sometimes It’s Time to Organize the Closet

May 22, 2008 · 2 Comments

Not surprisingly, life in academia often involves a lot of thinking. Sometimes you will sit for many minutes or more (hours?) just thinking. While I won’t go so far as to say you are “paid to think,” I think it’s important to just ponder your research every so often, bringing to mind the various things you are working on, or would like to work on, and trying to make connections.

One of the most fruitful results of such pondering is when you have an “ah ha!” moment. This occurs when a new connection is made somewhere in your brain. The new connection is exciting because it often means you have a whole wealth of new inferences to explore. For example, let’s say you have an “ah ha” and realize that idea A and idea B are connected in some way. You have never thought about A and B in the same context, but you realize that you should be. Then, you take everything you know about A and say “What does it mean for B?” Likewise, you take everything you know about B and say “What does this mean for A?”

So, what does this have to do with organizing the closet? Well, after a long day of thinking, you sometimes get the itch to quit thinking, and go do something productive! There’s something rewarding about getting the closet organized or pulling the weeds or painting the shed. And I think the same applies to your research. Some of us are “thinkers,” and some of us are “doers.” I will humble myself and admit to being too much of a thinker. If you are a thinker, sometimes you need to quit thinking and start doing.

But, I’m willing to bet that most of us in academia are wired the other way. My only data point is a book I read recently entitled “A Ph.D. Is Not Enough!” by Peter Feibelman. In this book, he points out that many researchers are too focused on techniques, methods, and certain technologies with little regard to how their products fit in with the big picture, or how they are helping to answer the “big questions”, or what the “big questions” are for that matter. Maybe the issue here is so much “doing” that we are forgetting to just stop and think a bit about all the “doing” and what it means. So, my advice to the thinkers is to start doing, and my advice to the doers is to stop and think every once in a while.

Well, enough of this for now. I’m going to organize the closet.

Categories: Research

How many languages do you speak?

May 7, 2008 · Leave a Comment

An essential problem facing all areas of computing is that of managing multiple ways of representing data. Recently, I’ve started wondering if there are too many languages for representing knowledge. Let me give you an idea of what I mean.

We are developing a prototype portal for finding and downloading datasets generated by climate models. The name of the system is CDP-Curator because it is an extension to an existing system called the Community Data Portal (CDP).

Just for kicks, I’m going to briefly outline all of the data representations I can think of that we have to deal with in hosting the climate model datasets. I will also list our motivations for using each one.

  • NetCDF – This is the Network Common Data Form developed at Unidata. It serves as a common data format for array-oriented scientific data. Although there are other similar representations, almost all of the datasets we are working with are already in NetCDF. In a sense, NetCDF is really outside of the CDP-Curator system boundary. We are pretty much forced to use this format because that’s what the climate modeling community is using and that’s the format of existing datasets. I should also point out that NetCDF files have a “header” containing metadata about the fields contained in the file.
  • XML – This is the eXtensible Markup Language. It is an extremely popular, tag-based syntax for data exchange, particularly among web-based systems. For now, XML serves as the syntax used for metadata crossing the system boundary. This simply means that when someone wants to submit a new dataset (or climate model description) we expect the metadata to be delivered in XML. Our motivations for using XML include its wide acceptance throughout the climate community, the fact that it is human and machine readable/writeable, and the maturity of tools and APIs for manipulating XML.
  • W3C XML Schema – The schema language constrains the XML by defining what elements and attributes we expect to appear in a given XML document. Clearly, an XML schema language of some sort is required in order to let data contributors know the expected format of the metadata. Our specific choice of W3C XML Schema is based on the fact that it has wide tool support and the fact that other community members are already comfortable with it. Another option would be the RELAX NG schema language.
  • RDF/OWL – Although technically distinct, I am treating RDF/OWL as one language. OWL (Web Ontology Language) is an ontology language built on top of RDF (Resource Description Framework). These two languages are (or will be, in theory) at the heart of the Semantic Web. The RDF layer describes “resources” using subject-predicate-object triples. OWL sits on top of RDF and is a full-blown ontology language with a theoretical basis in Description Logics. The metadata we receive in XML will be translated into RDF/OWL and stored in a Sesame triple store. Our motivations for using RDF/OWL: it is a “web-friendly” language (XML syntax, URIs as identifiers), it is good for representing lots of dense relationships (arbitrary graphs), it is conceptual in nature, it has good support for class hierarchies, and it seems to work well with our faceted search interface.
  • RDBMS – We also plan on integrating with an existing relational database (RDBMS) for long term storage of the metadata (but not the climate data itself). RDBMSs are very mature, reliable, and have been around for a while. They are highly scalable, very fast for most querying needs, connect well with Java and web-based programming languages, and have sophisticated backup and replication capabilities. This is a natural choice for ensuring that the metadata will not be lost.
  • UML – We are using UML (Unified Modeling Language) class diagrams to model the RDF/OWL ontology. Currently our process is a bit backwards because we make the change first in the RDF/OWL and then we go back and update our conceptual model in UML.
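To make the subject-predicate-object idea from the RDF/OWL item a bit more concrete, here is a minimal Python sketch that keeps a handful of triples as plain tuples and answers a simple pattern query. The namespace, the property names, and the `match` helper are all invented for illustration; a real deployment would use a triple store such as Sesame rather than an in-memory set.

```python
# Hypothetical namespace and terms, invented for this illustration only.
EX = "http://example.org/curator#"

triples = {
    (EX + "dataset1", EX + "producedBy", EX + "modelA"),
    (EX + "dataset1", EX + "hasVariable", EX + "surface_temperature"),
    (EX + "dataset2", EX + "producedBy", EX + "modelA"),
    (EX + "modelA", EX + "hasComponent", EX + "atmosphere"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Which datasets were produced by modelA?
datasets = [
    s for (s, _, _) in match(triples, predicate=EX + "producedBy", obj=EX + "modelA")
]
print(datasets)
```

Because every statement has the same three-part shape, new kinds of relationships can be added without changing any schema, which is roughly what makes the format attractive for dense, graph-like metadata.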

What I have been considering lately is the following question: What is the cost of having all of these languages in place in one system? Maybe a better question is: What metrics do we use to measure the cost of dealing with data in multiple languages?

Probably the biggest cost involved is language translation. For example, in CDP-Curator, our current thinking is to ingest XML, load it into a RDBMS, populate the triple store periodically (e.g., nightly) from the RDBMS, and have the interface query the triple store. This involves the following translations:

  • XML to relational. This involves parsing the XML and writing SQL statements to insert the data into the RDBMS. Some RDBMSs may take the XML directly and do the conversion internally. A possible tradeoff here is a lack of control over the translation process.
  • Relational to RDF/OWL. Certainly many folks have already done this, although it is probably not understood as well as XML/relational translations. The translation could be done programmatically by requesting data from the RDBMS using SQL and then writing out the corresponding RDF. However, it may be difficult to do this serially because of the graph nature (triples) of RDF. A more suitable option might be to use an RDF/OWL library such as Jena. Jena will create an in-memory object model of the RDF/OWL and it can then be written out serially.
  • RDF/OWL to XHTML/DHTML. This seems to be more of a second-class translation since the XHTML will not be stored–it is just generated dynamically for presentation purposes. Nonetheless, it is a translation that we cannot ignore. Many of the latest GUI widgets are using JSON to move bits of data around because it is Javascript friendly. So, we might go RDF/OWL –> JSON –> XHTML. Another aspect of the latest GUI packages is that more and more code is moving into Javascript. This means that we are writing less HTML and more Javascript calls (i.e., manipulating the DOM manually). There are data-enabled widgets (such as the YUI DataSource utility) that automatically link a GUI element to some datastore. Again, this hides but does not avoid the need for language translation.
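The chain of translations above can be sketched end to end in a few dozen lines of Python using only the standard library. This is a toy under stated assumptions, not the CDP-Curator implementation: the XML element names, the table layout, the namespace, and the three function names are all made up, SQLite stands in for the RDBMS, and a plain list of tuples stands in for the triple store.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata document; element and attribute names are invented.
XML_DOC = """
<datasets>
  <dataset id="ds1"><title>Control run</title><model>modelA</model></dataset>
  <dataset id="ds2"><title>Doubled CO2 run</title><model>modelB</model></dataset>
</datasets>
""".strip()

EX = "http://example.org/curator#"  # made-up namespace


def xml_to_relational(xml_text, conn):
    """Parse the XML and insert one row per <dataset> element."""
    conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT, model TEXT)")
    rows = [
        (d.get("id"), d.findtext("title"), d.findtext("model"))
        for d in ET.fromstring(xml_text).iter("dataset")
    ]
    conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)", rows)
    conn.commit()


def relational_to_triples(conn):
    """Turn each row into subject-predicate-object triples."""
    triples = []
    query = "SELECT id, title, model FROM dataset ORDER BY id"
    for ds_id, title, model in conn.execute(query):
        subject = EX + ds_id
        triples.append((subject, EX + "title", title))
        triples.append((subject, EX + "producedBy", EX + model))
    return triples


def triples_to_json(triples):
    """Group triples by subject into JSON that a GUI widget could consume."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {})[p] = o
    return json.dumps(grouped, indent=2, sort_keys=True)


conn = sqlite3.connect(":memory:")
xml_to_relational(XML_DOC, conn)
triples = relational_to_triples(conn)
payload = triples_to_json(triples)
print(payload)
```

Even in this tiny version, each hop needs its own mapping decisions (which column becomes which predicate, how subjects are named), which is one way to see where the translation cost comes from.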

I guess the point that I am getting at is that our choice of languages for data/knowledge representation is definitely non-trivial, but at the same time it is hard to quantify which languages are suitable for which purposes. It is also hard to measure the impact of using one language over another, or one combination of languages versus a different combination. In a future post, I’ll attempt to talk about what kinds of questions we should ask when choosing a data/knowledge representation language and what kinds of metrics we could imagine.

Categories: Research

“Standardization” and e-science

April 29, 2008 · Leave a Comment

Much of the work I have done on the Earth System Curator project is geared toward the standardization of a data model for describing climate modeling software and the output from climate simulations. (Okay, technically we are not creating a “standard” because we were not really chartered to do that nor do we wish to be prescriptive for the entire climate community. But, nonetheless, our task has been very much like a standardization effort.) For a moment, I want to step back from Curator and consider “standardization” itself.

Standardization is a task that leads us toward interoperability of systems. Although standardization is common in both industrial and scientific endeavors, it is interesting to consider what differences might arise between the standardization process for e-science vs. that of industry. The question I would like to answer is this: “What does standardization mean for e-science?” I contend that there are significant differences that affect how we should think about standardization in each arena.

This post is based on observations I have made while working on the Curator project. At the outset, our task was basically to create a common metadata formalism for describing climate models and output datasets. (I know this description of the project is far too short to be helpful, so please visit the website to read up on what we’re doing.) To be perfectly honest, the task of coming up with standardized metadata has proven to be very difficult. Lately I have been wondering whether standardization takes on a different meaning for e-science than for other kinds of communities (e.g., business-driven standardization).

Here are some observations that affect the way we look at standardization for e-science.

1. Users of scientific data are diverse and often anonymous.

This means that it is very difficult up front to say with certainty who exactly will be using scientific data once it is published (e.g., simulation output or observations from sensors). Certainly, there is an immediate set of users in mind before we begin collecting data for a scientific endeavor, but before long we realize that folks working in other domains might also benefit from the collected data.

So, in the name of interoperability, we set out to standardize our data so that when others acquire it, they can actually interpret it. However, this can be very challenging since we do not know exactly who will ultimately be using the data. Additionally, most scientific communities have developed their own “lingo,” and the word for describing a particular phenomenon depends on the “lingo” you are using. These “lingos” have deep roots, and we cannot ask that entire communities change vocabularies (even though many will admit the deficiencies in their own vernacular). For a real-life example of “lingo tension”, check out this thread in the CF Metadata mailing list archives.

Now, changing gears to an e-business perspective, you could argue that before a standardization effort even gets off the ground, there is a pretty clear idea of what players are involved and how they plan on using the resource being standardized. This makes (or should make) the whole process a bit more well-defined since we know the audience and the usage patterns up front.

2. Scientific data is often repurposed and applied in ways not intended by the data’s originator

The raw data collected or generated by a scientific community may be repurposed, used by scientists in other communities, and otherwise applied in new ways not intended by the data’s originator. In fact, science thrives in an environment where previous findings can be reapplied to new situations.

The impact on standardization is that it is not possible to know up front the context in which scientific data will be used. This points to a need to keep standards as general as possible while still being precise and informative. One way to resolve the tension between these two is to allow for customization through extension. In other words, the standard itself could serve as a framework allowing community members to provide domain-specific customizations and/or mappings to terms in other domains. The recent explosion of “tagging” might be one way to solicit terms from diverse community members. What is unclear is how the highly unstructured nature of tagging can be reconciled with the highly structured world of data standardization.

3. Complexity of “configuration” involved in scientific data collection

I have used the general term “configuration” here to refer to all of the many complexities involved in preparing to collect scientific data–either via simulation or observation. I have more experience on the simulation side of things, and I can say with confidence that there is an extreme amount of configuration involved before a large scale computer simulation is run. Everything is parameterized and all those parameters have to be set. For example, it is not uncommon for a shell script that kicks off a global climate simulation to be over 1500 lines long.

Now, say you are a scientist and you are planning on downloading some dataset over the Web and using it to inform your own research. You had better be very sure about what all went into creating that dataset. The best way to gain trust of a dataset is to know exactly how it was produced. This kind of metadata is often called “provenance.”

The sheer complexity of configuration bleeds over into the standardization process. In other words, you don’t just want to get a dataset in a standardized format, you also want a nice description of the configuration that took place leading up to the generation of that dataset. This kind of description is likely much more complex than a typical purchase order XML document. A scientific dataset should be accompanied by more than just a set of standard field names. It should include a “deep description” of what each field means, how it was generated, how it was post-processed, etc.
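As a rough illustration of what a “deep description” might carry beyond a list of field names, here is a small Python sketch using dataclasses. The class names, attribute names, and every value in the example record are hypothetical; this is not the Curator data model, just one way to picture the kind of information involved.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class FieldDescription:
    name: str            # standard field name
    meaning: str         # what the field represents
    units: str
    generated_by: str    # how the values were produced
    post_processing: list = field(default_factory=list)


@dataclass
class DatasetProvenance:
    dataset_id: str
    model: str
    configuration: dict  # parameter name -> value used for the run
    fields: list = field(default_factory=list)


# All values below are invented for illustration.
record = DatasetProvenance(
    dataset_id="ds1",
    model="modelA",
    configuration={"time_step_minutes": 30, "grid": "2x2.5 degree"},
    fields=[
        FieldDescription(
            name="surface_temperature",
            meaning="Air temperature at the surface",
            units="K",
            generated_by="atmosphere component, monthly mean",
            post_processing=["regridded", "time-averaged"],
        )
    ],
)

# asdict() flattens the nested record into plain dicts and lists,
# which could then be serialized to XML, JSON, or triples.
print(asdict(record)["fields"][0]["post_processing"])
```

A real record would have many more parameters than two, which is exactly the point: the configuration part of the description can easily dwarf the part that names the fields.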

Perhaps all of this is pointing to the fact that in a scientific setting, the process is just as important as (if not more important than!) the resulting data. Therefore, standardization efforts must be involved with the process part of doing science. The focus on recording process information seems less evident in other settings (e.g., it doesn’t make much sense to talk about how a purchase order was generated). Compounding the problem is the fact that the configuration process differs greatly among scientists even in the same domain. If we cannot standardize the configuration processes themselves, how can we at least describe them in a standardized way?

Categories: Research

What does Web 2.0 mean for e-science?

April 10, 2008 · 1 Comment

First, let me define a few terms up front. By “Web 2.0,” I mean the evolution of the Web from relatively static pages to highly responsive, dynamic online applications and the resulting changes in Web culture. There is some disagreement about why Web 2.0 has arrived now, but one thing many folks point to is the maturity of technologies for making web sites act more like regular applications. AJAX is certainly a player here, along with DHTML and sophisticated GUI toolkits such as Yahoo’s YUI. The result is a more interactive, collaborative, and dynamic Web (as evidenced by the recent extreme success of social networking sites). While I do not argue that technological advances are the only players leading to the advent of Web 2.0, I doubt many will argue that they are not a fundamental part.

By “e-science,” I mean networks of scientists in a community (or even cross-community) using highly advanced computing techniques (such as Grid computing) to accomplish the tasks of scientific research. An overwhelmingly large number of scientific communities have leveraged recent advances in network speed, processor speed, data storage, etc. to help them accomplish their research. For a few examples, see some of the following sites:

The question I want to consider is: “What does Web 2.0 mean for e-science?” My hypothesis is that there is a nice marriage between the two, although most e-science communities have yet to embrace Web 2.0. My argument is simply that science by nature is collaborative and therefore we should be building tools that facilitate collaboration among scientists.

As a first step in this direction, we have seen many scientific communities that have made very large repositories of datasets available online. Many of these can be freely downloaded by anyone in the world for their own personal exploration (or at least to a very large audience of registered users). Of the sites listed above, the only one I have personal experience with is the Earth System Grid. From ESG you can access the datasets used by the Intergovernmental Panel on Climate Change (IPCC) for their latest assessment report.

Similar things are happening in other domains. For example, you no longer have to have a telescope to take a peek at points in the sky. The US Virtual Observatory DataScope application allows you to input a particular point or region and with a simple mouse click you are looking at the requested location!

While I admit that this trend toward more accessibility of data is a huge step forward, I think that applying the Web 2.0 philosophy to e-science may help to increase interactivity by moving some of the “science” that happens on individual machines out into online collaborative spaces. For example, consider this scenario presented to me by a colleague of mine working in the climate modeling domain. He pointed out that many analysts download datasets to study the effects of El Nino. Once a dataset is retrieved (from a site like ESG) it must undergo a series of processing steps to isolate the correct region of the globe, the right time periods, and the right variables. What’s not surprising is that much of the same processing is repeated by every analyst that downloads the dataset. That’s because once you have the dataset locally, it has lost all connections with the site where you found it.

Now imagine a scenario where much of the processing has been moved onto the Web (perhaps by a set of Web Services for climate data?). When scientist B visits the site for her El Nino exploration, she finds that scientist A has already performed much of the needed post-processing and she grabs that dataset instead of the original. She also notices some comments made by scientist A that the El Nino phenomenon is best seen during a certain year. Finally, scientist A has posted some plots that scientist B compares with her own plots.
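Here is one way the scenario could be sketched in Python. Everything in it is an assumption made for illustration: the toy records, the `subset` function, the `SharedCatalog` class, the latitude/longitude box, and the comment text are all invented, and a real service would work on NetCDF files behind a web API rather than on a list of tuples in memory. The idea is simply that a processing “recipe” is hashed, so a second scientist asking for the same subset gets the stored result and any attached comments.

```python
import hashlib
import json

# Toy dataset: (year, latitude, longitude, variable, value). Values are invented.
RECORDS = [
    (1997, 0.0, 200.0, "sst", 29.1),
    (1997, 45.0, 10.0, "sst", 14.2),
    (1998, 0.0, 210.0, "sst", 28.7),
    (1998, 0.0, 210.0, "precip", 6.3),
]


def subset(records, variable, years, lat_range, lon_range):
    """Keep records for one variable inside a time window and a lat/lon box."""
    return [
        r for r in records
        if r[3] == variable
        and years[0] <= r[0] <= years[1]
        and lat_range[0] <= r[1] <= lat_range[1]
        and lon_range[0] <= r[2] <= lon_range[1]
    ]


class SharedCatalog:
    """Hypothetical server-side catalog of processed subsets and comments."""

    def __init__(self):
        self._results = {}
        self._comments = {}

    @staticmethod
    def key(recipe):
        # Identical recipes hash to the same key regardless of dict ordering.
        text = json.dumps(recipe, sort_keys=True)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, recipe, records):
        """Return (result, reused); compute only if no one has done so yet."""
        k = self.key(recipe)
        reused = k in self._results
        if not reused:
            self._results[k] = subset(records, **recipe)
        return self._results[k], reused

    def comment(self, recipe, text):
        self._comments.setdefault(self.key(recipe), []).append(text)

    def comments(self, recipe):
        return list(self._comments.get(self.key(recipe), []))


recipe = {
    "variable": "sst",
    "years": [1997, 1998],
    "lat_range": [-5.0, 5.0],
    "lon_range": [190.0, 240.0],
}
catalog = SharedCatalog()
result_a, reused_a = catalog.get_or_compute(recipe, RECORDS)        # scientist A
catalog.comment(recipe, "Signal is strongest in 1997.")
result_b, reused_b = catalog.get_or_compute(dict(recipe), RECORDS)  # scientist B
print(reused_a, reused_b, catalog.comments(recipe))
```

The interesting design question is what counts as “the same” recipe; hashing a canonical description is the simplest answer, but it only works if the community agrees on how recipes are described.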

So, in conclusion it seems that Web 2.0 philosophy and e-science could be good friends. It may be a few years in the making, but when it happens science will benefit from a whole new level of interactivity and collaboration that was previously not possible.

Categories: Research