Entries tagged as ‘e-science’
Earth System Curator: Metadata Infrastructure for Climate Modeling
December 1, 2008
The Earth System Curator is a National Science Foundation sponsored project developing a metadata formalism for describing the digital resources used in climate simulations. The primary motivating observation of the project is that a model’s source code plus the configuration parameters required for a model run are a compact representation of the dataset generated when the model is executed. The end goal of the project is a convergence of models and data where both resources are accessed uniformly from a single registry. In this paper we review the current metadata landscape of the climate modeling community, present our work on developing a metadata formalism for describing climate models, and reflect on technical challenges we have faced that require new research in the area of Earth Science Informatics.
Massively Parallel Collaboration
June 6, 2008
The face of science is changing as more and more experiments are moved out of the lab and onto the Grid. As the number of processors available for computation increases, scientists are able to simulate physical phenomena with higher spatial and temporal resolutions. But what is to become of all the data produced by computational Grids around the world? While much effort has been put into parallelization of computations for generating scientific data, there is much work left to be done on the other side of the fence where the data is analyzed.
Along with science, the rest of the world is changing, too. The Internet is becoming more dynamic than ever (see my Web 2.0 post) and the Web has become the place for social interactions. Folks such as James Surowiecki, author of “The Wisdom of Crowds,” have noticed the power and intelligence of large groups that–given the right set of circumstances–are able to solve problems, make decisions, and even predict the future much more accurately than an individual could.
You need not look far to find examples of the wisdom of crowds on the Web. A fairly obvious one is Wikipedia. This site is enormously popular for finding information on almost any topic, but it is not centrally maintained like a traditional encyclopedia. In fact, anyone can edit an entry as they please. And, maybe surprisingly, the result of thousands of people contributing in their own independent, unsupervised way is a very useful resource! Other sites such as Flickr, YouTube, del.icio.us, and Facebook also show the trend toward online collaboration among literally millions of people.
The question for e-science is: how do we leverage this technological and cultural trend toward massive collaboration? One possibility is to move much of the scientific analysis done by individual scientists out into Web space. As it stands today, the steps required to find some new trend in the data or to produce some interesting plot are almost always carried out by a single scientist working at his or her own machine. The final results are published in a scientific journal, or presented at a conference for many to see, but by and large, the analysis itself is done by an individual or a very small group of individuals.
There is another way to think about scientific analysis. Consider a recent site I stumbled upon called Many Eyes. The idea of the site is simple: you upload your own data (in tabular format) and it can be visualized by anyone on the web using a large number of visualization types (bar chart, scatterplot, world map, pie chart, etc.). According to the Many Eyes website, the goal of the site is to “bet on the power of human visual intelligence to find patterns… to ‘democratize’ visualization and to enable a new social kind of data analysis.”
Check it out. Here are two examples of visualizations that I was able to create in a matter of minutes. The first one shows the number and total valuation of residential building permits issued for Boulder, Colorado from 1993 to 2003. This visualization uses a standard bar chart.
The second visualization is a tag cloud of all the content currently on my blog.
Once a dataset is uploaded, it is public. Users can view existing visualizations of datasets (like the ones I created) or they can create entirely new visualizations. The philosophy of Many Eyes is that you can tap into the “wisdom of crowds” by allowing many people to create their own kinds of visualizations of the same dataset. Users can elect to “watch” datasets or visualizations to be notified of new activity. Additionally, users can post public comments about datasets and visualizations.
Can this type of massively parallel collaboration be harnessed for sophisticated scientific analyses? I think so, and I think this is where we are heading. I had a conversation today with two students attending the Numerical Techniques for Global Atmospheric Models workshop at NCAR. When I proposed to them the idea of social scientific data analysis, they were very interested. In particular, I explained to them the Many Eyes concept and asked if such a site would be useful to atmospheric modelers if the site supported netCDF (a popular data format for atmospheric data) and more sophisticated visualizations. They agreed that such a site would be helpful if it could actually work over the Web. One of the students commented that a big win for such a site would be the ability for scientists to easily find and repeat the post-processing steps of another scientist. (See the El Nino scenario here.)
Surely, many questions remain to be answered. Will the Web infrastructure support sophisticated scientific analyses? Does the sheer size of datasets prevent scientists from working in online Web spaces? What are the cultural impacts of massively parallel collaborations? Would scientists even care to participate for fear of someone else “stealing” their discoveries?
At the end of the day, though, it is clear that the Web has enabled a whole new level of socialization and collaboration that was previously impossible. It’s up to us to determine whether science will embrace this cultural shift and, with it, the “wisdom of crowds.”
Categories: Research
Tagged: collaboration, e-science, many eyes, web 2.0
“Standardization” and e-science
April 29, 2008
Much of the work I have done on the Earth System Curator project is geared toward the standardization of a data model for describing climate modeling software and the output from climate simulations. (Okay, technically we are not creating a “standard” because we were not really chartered to do that nor do we wish to be prescriptive for the entire climate community. But, nonetheless, our task has been very much like a standardization effort.) For a moment, I want to step back from Curator and consider “standardization” itself.
Standardization is a task that leads us toward interoperability of systems. Although standardization is common in both industrial and scientific endeavors, it is interesting to consider what differences might arise between the standardization process for e-science vs. that of industry. The question I would like to answer is this: “What does standardization mean for e-science?” I contend that there are significant differences that affect how we should think about standardization in each arena.
This post is based on observations I have made while working on the Curator project. At the outset, our task was basically to create a common metadata formalism for describing climate models and output datasets. (I know this description of the project is far too short to be helpful, so please visit the website to read up on what we’re doing.) To be perfectly honest, the task of coming up with standardized metadata has proven to be very difficult. Lately I have been wondering whether standardization takes on a different meaning for e-science than for other kinds of communities (e.g., business-driven standardization).
Here are some observations that affect the way we look at standardization for e-science.
1. Users of scientific data are diverse and often anonymous.
This means that it is very difficult to say up front with certainty who exactly will be using scientific data (e.g., simulation output or observations from sensors) once it is published. Certainly, there is an immediate set of users in mind before we begin collecting data for a scientific endeavor, but before long we realize that folks working in other domains might also benefit from the collected data.
So, in the name of interoperability, we set out to standardize our data so that when others acquire it, they can actually interpret it. However, this can be very challenging since we do not know exactly who will ultimately be using the data. Additionally, most scientific communities have developed their own “lingo,” and the word for describing a particular phenomenon depends on the “lingo” you are using. These “lingos” have deep roots, and we cannot ask that entire communities change vocabularies (even though many will admit the deficiencies in their own vernacular). For a real-life example of “lingo tension”, check out this thread in the CF Metadata mailing list archives.
Now, changing gears to an e-business perspective, you could argue that before a standardization effort even gets off the ground, there is a pretty clear idea of what players are involved and how they plan on using the resource being standardized. This makes (or should make) the whole process a bit more well-defined since we know the audience and the usage patterns up front.
2. Scientific data is often repurposed and applied in ways not intended by the data’s originator
The raw data collected or generated by a scientific community may be repurposed, used by scientists in other communities, and otherwise applied in new ways not intended by the data’s originator. In fact, science thrives in an environment where previous findings can be reapplied to new situations.
The impact on standardization is that it is not possible to know up front the context in which scientific data will be used. This points to a need to keep standards as general as possible while still being precise and informative. One way to resolve the tension between these two is to allow for customization through extension. In other words, the standard itself could serve as a framework allowing community members to provide domain-specific customizations and/or mappings to terms in other domains. The recent explosion of “tagging” might be one way to solicit terms from diverse community members. What is unclear is how the highly unstructured nature of tagging can be reconciled with the highly structured world of data standardization.
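To make "customization through extension" a little more concrete, here is a minimal sketch in Python. It is only an illustration, not something Curator implements: the class `ExtensibleVocabulary`, its methods, and the community names are all hypothetical, and the two core terms are borrowed purely as examples. The idea is that a small, precise core stays fixed while each community registers its own "lingo" as aliases that map back onto the core.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtensibleVocabulary:
    """A fixed core of standard terms plus per-community aliases (hypothetical design)."""

    core_terms: set[str]
    # aliases[community][local term, lower-cased] -> core term
    aliases: dict[str, dict[str, str]] = field(default_factory=dict)

    def add_alias(self, community: str, local_term: str, core_term: str) -> None:
        """Let a community map one of its own words onto an existing core term."""
        if core_term not in self.core_terms:
            raise ValueError(f"unknown core term: {core_term!r}")
        self.aliases.setdefault(community, {})[local_term.lower()] = core_term

    def resolve(self, term: str, community: Optional[str] = None) -> Optional[str]:
        """Return the core term for `term`, or None if nothing maps to it."""
        if term in self.core_terms:
            return term
        if community is not None:
            return self.aliases.get(community, {}).get(term.lower())
        return None


# Illustrative core terms and community aliases (assumptions, not a real standard).
vocab = ExtensibleVocabulary(core_terms={"sea_surface_temperature", "air_temperature"})
vocab.add_alias("oceanography", "SST", "sea_surface_temperature")
vocab.add_alias("fisheries", "surface water temp", "sea_surface_temperature")
```

Note that an alias only resolves in the context of the community that registered it. That is a deliberate choice in this sketch: the same word can mean different things in different "lingos," so the mapping stays scoped rather than global. Free-form tags could feed `add_alias`, but someone still has to decide which core term each tag belongs to, which is exactly the unresolved tension described above.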
3. Complexity of “configuration” involved in scientific data collection
I have used the general term “configuration” here to refer to all of the many complexities involved in preparing to collect scientific data, whether via simulation or observation. I have more experience on the simulation side of things, and I can say with confidence that there is an enormous amount of configuration involved before a large-scale computer simulation is run. Everything is parameterized and all those parameters have to be set. For example, it is not uncommon for a shell script that kicks off a global climate simulation to be over 1500 lines long.
Now, say you are a scientist and you are planning on downloading some dataset over the Web and using it to inform your own research. You had better be very sure about everything that went into creating that dataset. The best way to gain trust in a dataset is to know exactly how it was produced. This kind of metadata is often called “provenance.”
The sheer complexity of configuration bleeds over into the standardization process. In other words, you don’t just want to get a dataset in a standardized format, you also want a nice description of the configuration that took place leading up to the generation of that dataset. This kind of description is likely much more complex than a typical purchase order XML document. A scientific dataset should be accompanied by more than just a set of standard field names. It should include a “deep description” of what each field means, how it was generated, how it was post-processed, etc.
Perhaps all of this is pointing to the fact that in a scientific setting, the process is just as important as (if not more important than!) the resulting data. Therefore, standardization efforts must be involved with the process part of doing science. The focus on recording process information seems less evident in other settings (e.g., it doesn’t make much sense to talk about how a purchase order was generated). Compounding the problem is the fact that the configuration process differs greatly among scientists even in the same domain. If we cannot standardize the configuration processes themselves, how can we at least describe them in a standardized way?
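As a rough picture of what a “deep description” might carry beyond a field name, here is a small Python sketch. Everything in it is an assumption made for illustration: the classes `FieldDescription` and `ProcessingStep`, the model name, and the parameter values are hypothetical and are not the Curator schema. The point is only that the record travels with the data and answers the three questions above: what the field means, how it was generated, and how it was post-processed.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessingStep:
    """One post-processing operation and the parameters it was run with."""
    tool: str
    parameters: dict[str, object]


@dataclass
class FieldDescription:
    """Hypothetical 'deep description' of a single output field."""
    name: str
    meaning: str
    units: str
    generated_by: str  # which model or instrument produced the field
    configuration: dict[str, object] = field(default_factory=dict)
    post_processing: list[ProcessingStep] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested ProcessingStep entries.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only; the model name and settings are made up.
tas = FieldDescription(
    name="tas",
    meaning="near-surface air temperature",
    units="K",
    generated_by="example-climate-model 1.0 (hypothetical)",
    configuration={"horizontal_resolution_deg": 2.0, "timestep_s": 1800},
    post_processing=[ProcessingStep("monthly_mean", {"calendar": "noleap"})],
)

record = json.loads(tas.to_json())
```

The `configuration` entry is left as a free-form dictionary on purpose. That mirrors the open question in this post: the structure around the configuration can be standardized even when the configuration itself differs from one scientist to the next.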
Categories: Research
Tagged: e-science, standardization, tagging
What does Web 2.0 mean for e-science?
April 10, 2008
First, let me define a few terms up front. By “Web 2.0,” I mean the evolution of the Web from relatively static pages to highly responsive, dynamic online applications and the resulting changes in Web culture. There is some disagreement about why Web 2.0 has arrived now, but one thing many folks point to is the maturity of technologies for making web sites act more like regular applications. AJAX is certainly a player here, along with DHTML and sophisticated GUI toolkits such as Yahoo’s YUI. The result is a more interactive, collaborative, and dynamic Web (as evidenced by the recent extreme success of social networking sites). While I do not argue that technological advances are the only players leading to the advent of Web 2.0, I doubt many will argue that they are not a fundamental part.
By “e-science,” I mean networks of scientists in a community (or even cross-community) using highly advanced computing techniques (such as Grid computing) to accomplish the tasks of scientific research. An overwhelmingly large number of scientific communities have leveraged recent advances in network speed, processor speed, data storage, etc. to help them accomplish their research. For a few examples, see some of the following sites:
- US National Virtual Observatory
- Grid Physics Network
- Network for Earthquake Engineering Simulation
- Earth System Grid
The question I want to consider is: “What does Web 2.0 mean for e-science?” My hypothesis is that there is a nice marriage between the two, although most e-science communities have yet to embrace Web 2.0. My argument is simply that science by nature is collaborative and therefore we should be building tools that facilitate collaboration among scientists.
As a first step in this direction, we have seen many scientific communities make very large repositories of datasets available online. Many of these can be freely downloaded for personal exploration by anyone in the world (or at least by a very large audience of registered users). Of the sites listed above, the only one I have personal experience with is the Earth System Grid. From ESG you can access the datasets used by the Intergovernmental Panel on Climate Change (IPCC) for their latest assessment report.
Similar things are happening in other domains. For example, you no longer have to have a telescope to take a peek at points in the sky. The US Virtual Observatory DataScope application allows you to input a particular point or region and with a simple mouse click you are looking at the requested location!
While I admit that this trend toward more accessibility of data is a huge step forward, I think that applying the Web 2.0 philosophy to e-science may help to increase interactivity by moving some of the “science” that happens on individual machines out into online collaborative spaces. For example, consider this scenario presented to me by a colleague of mine working in the climate modeling domain. He pointed out that many analysts download datasets to study the effects of El Nino. Once a dataset is retrieved (from a site like ESG) it must undergo a series of processing steps to isolate the correct region of the globe, the right time periods, and the right variables. What’s not surprising is that much of the same processing is repeated by every analyst who downloads the dataset. That’s because once you have the dataset locally, it has lost all connections with the site where you found it.
Now imagine a scenario where much of the processing has been moved onto the Web (perhaps by a set of Web Services for climate data?). When scientist B visits the site for her El Nino exploration, she finds that scientist A has already performed much of the needed post-processing and she grabs that dataset instead of the original. She also notices some comments made by scientist A that the El Nino phenomenon is best seen during a certain year. Finally, scientist A has posted some plots that scientist B compares with her own plots.
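One way to picture what scientist A might leave behind is a small, shareable "recipe" that records the region, time period, and variables she selected, so that scientist B can rerun exactly the same subsetting. The Python below is only a sketch under stated assumptions: `Sample` and `SubsetRecipe` are hypothetical names, the data values are made-up toy numbers, and a real service would operate on gridded files rather than an in-memory list. The bounding box roughly follows the commonly used Niño 3.4 region (5°S to 5°N, 170°W to 120°W).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One data point of a toy dataset (stand-in for a gridded file)."""
    variable: str
    year: int
    lat: float   # degrees north
    lon: float   # degrees east, in [-180, 180)
    value: float


@dataclass(frozen=True)
class SubsetRecipe:
    """A recorded, repeatable description of one analyst's subsetting steps."""
    author: str
    variables: frozenset[str]
    years: tuple[int, int]           # inclusive range
    lat_range: tuple[float, float]   # inclusive, degrees north
    lon_range: tuple[float, float]   # inclusive, degrees east
    note: str = ""

    def apply(self, samples: list[Sample]) -> list[Sample]:
        """Keep only samples matching the recorded variables, years, and region."""
        (y0, y1), (la0, la1), (lo0, lo1) = self.years, self.lat_range, self.lon_range
        return [
            s for s in samples
            if s.variable in self.variables
            and y0 <= s.year <= y1
            and la0 <= s.lat <= la1
            and lo0 <= s.lon <= lo1
        ]


# Toy values, invented purely to exercise the filter.
samples = [
    Sample("sst", 1997, 0.0, -150.0, 29.1),     # matches everything
    Sample("sst", 1997, 40.0, -150.0, 15.2),    # outside the latitude band
    Sample("sst", 1985, 0.0, -150.0, 26.8),     # outside the year range
    Sample("precip", 1997, 0.0, -150.0, 7.3),   # variable not requested
]

recipe = SubsetRecipe(
    author="scientist A",
    variables=frozenset({"sst"}),
    years=(1997, 1998),
    lat_range=(-5.0, 5.0),
    lon_range=(-170.0, -120.0),
    note="Illustrative comment left for other analysts.",
)

subset = recipe.apply(samples)
```

Because the recipe is plain data, it could be stored next to the original dataset on the site, along with the author's comment. Scientist B would then start from `recipe.apply(...)` instead of rediscovering the same steps on her own machine.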
So, in conclusion, it seems that the Web 2.0 philosophy and e-science could be good friends. It may be a few years in the making, but when it happens science will benefit from a whole new level of interactivity and collaboration that was previously not possible.
Categories: Research
Tagged: collaboration, e-science, web 2.0


