# Local and Global Performance Optimization of Coupled Models

Load balancing is important for achieving good performance of coupled models: its purpose is to reduce processor idle time as much as possible. From a scientific productivity standpoint, the ultimate goal of performance optimization is to increase model throughput, i.e., how much can be simulated per unit of wall clock time.

A recent paper about performance engineering of the Community Earth System Model (CESM) discusses current approaches to load balancing and performance optimization for that model [1]. The authors describe a two-step process for performance optimization: First, empirical data is gathered about the performance of each constituent at different processor counts for the given platform and problem specification. Second, this local performance information is used to decide how to assign processors to maximize the coupled model's throughput. Like most climate models, CESM's constituent models are parallel applications in their own right, with a number of different parameters that can be tuned to optimize performance.

At the current time, climate model performance is dominated by computation, not communication. Therefore, model throughput is best improved by ensuring that the computations of the constituent models overlap as much as possible, so that processor idle time is reduced to a minimum. This tuning is done primarily by carefully choosing the number of processors allocated to each constituent. The whole process is more or less trial-and-error tuning guided by empirical performance data for each of the constituents.

For current processor counts, this “enlightened trial-and-error” procedure works relatively well. However, as we approach the exascale era, the performance characteristics of the models are expected to change significantly. At today's typical processor counts, the coupler does not introduce a significant performance penalty: a study by the Earth System Modeling Framework shows that for one typical configuration of CESM, time spent in the coupler is less than 3% of the total running time [2]. At very large processor counts, however, the number of messages required for both intra- and inter-model communication increases dramatically, and optimization of the coupler will become important to maintaining high model throughput. The last set of experiments in the CESM report supports this prediction: for the highest-resolution runs, the percentage of time spent inside the coupler increases dramatically compared to typical-resolution runs. At some processor counts, the coupler accounts for nearly half of the execution time [1].

The current performance tuning approach is convenient because it allows the user to focus on local optimizations while still achieving reasonably good global performance. Unfortunately, the changes introduced by the next generation of supercomputers will likely require a change in performance tuning strategies. Namely, more thought will have to be put into global performance optimization of the models, because it is less likely that executing locally optimized models in parallel will give the best model throughput. Much of this change will be driven by a shift from computation to communication as the performance bottleneck. Effective performance tuning will require the assistance of tools such as automatic load balancers. Such tools should take as input a set of requirements (e.g., the set of constituent models) and the local performance characteristics of each constituent, and generate a configuration that is globally optimized for maximum throughput. Unlike local optimizations, global optimizations take into account knowledge of the system as a whole, such as the pair-wise communication requirements among the constituent models. In some cases, the globally optimized configuration may not be locally optimal. For example, we might configure the decomposition of the land model to more closely match that of the atmosphere in order to reduce land/atmosphere communication. The land's decomposition might then reduce throughput for that constituent, even though it ensures higher throughput of the coupled system as a whole.

There are a number of “tunable” performance parameters for each constituent. For a global configuration of several constituents, the state space of possible solutions becomes too large to manage manually. Considering just one parameter, the number of processes, let's look at a global optimization problem for a fictitious coupled system with three models. The figure below shows the scaling curves of the three constituents: A, B, and C. The horizontal axis is the number of processes and the vertical axis is the number of hours of wall clock time required to simulate one year.

The optimization problem is this: how many processes should I assign to each model to maximize throughput? Note that if we draw a horizontal line, say at 2.5 hours, we automatically get a load-balanced model, because each model will take the same amount of time for its computation (assuming the empirical curves are good predictors, of course). Keep in mind, however, that this load-balanced configuration is not necessarily the optimal solution in terms of model throughput.
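To make the problem concrete, here is a minimal sketch of the search for the fully parallel case. Everything in it is hypothetical: the Amdahl-style curves (`make_curve`, with an assumed 10% serial fraction and made-up base times) stand in for the empirical scaling data, and `best_parallel_allocation` is a naive brute-force search, not an algorithm from the CESM paper.

```python
# Hypothetical Amdahl-style scaling curves: hours of wall clock needed to
# simulate one model year, as a function of process count. In practice these
# would be fit to empirical timing data for each constituent.
def make_curve(base_hours, serial_fraction=0.1):
    return lambda n: base_hours * (serial_fraction + (1 - serial_fraction) / n)

curves = {"A": make_curve(20.0), "B": make_curve(12.0), "C": make_curve(6.0)}

def best_parallel_allocation(total_procs):
    """Brute-force search over ways to split total_procs among A, B, and C
    (all running in parallel). The coupled model finishes when its slowest
    constituent does, so we minimize the maximum per-constituent time."""
    best_time, best_alloc = None, None
    for na in range(1, total_procs - 1):
        for nb in range(1, total_procs - na):
            nc = total_procs - na - nb
            t = max(curves["A"](na), curves["B"](nb), curves["C"](nc))
            if best_time is None or t < best_time:
                best_time, best_alloc = t, {"A": na, "B": nb, "C": nc}
    return best_time, best_alloc

hours, alloc = best_parallel_allocation(64)
```

The optimum found this way tends to sit near the "horizontal line" allocation described above, where all three curves read the same time, since any slack in one constituent could be reassigned to the slowest one.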

Another parameter we have to play with is whether models are executed sequentially or in parallel. With three constituents there are five different configurations. The notation AB means that A and B run sequentially, one after the other on the same set of processes. The notation A||B means that A and B run in parallel. Assume that the sequential operator takes precedence over ||; that is, AB||C means (AB)||C.

Possible configurations:

- A||B||C – fully parallel
- ABC – fully sequential
- AB||C – hybrid
- AC||B – hybrid
- A||BC – hybrid

As a simplification, we will assume that models executed sequentially have independent performance characteristics (i.e., we neglect any gains due to the potential for these constituents to share memory). This allows us to sum the scaling curves of two or more constituents that execute sequentially and treat the set as if it were a single constituent. (This assumption should eventually be relaxed!) We then wind up with one (fully sequential), two (hybrid), or three (fully parallel) sequential sets of constituents executing in parallel. The best throughput for a particular configuration is achieved by overlapping the computations of the parallel sets as much as possible. The best global configuration is then found by taking the best throughput among the five configurations.
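Under that independence assumption, the whole procedure can be sketched as a small search. The curves are again hypothetical Amdahl-style stand-ins, and the names (`seq_curve`, `best_global_config`) are mine, not from the paper; the point is only that summing curves reduces every configuration to the same "minimize the slowest parallel group" subproblem.

```python
from itertools import product

# Hypothetical Amdahl-style scaling curves (hours of wall clock per
# simulated year); real curves would be measured empirically.
def make_curve(base_hours, serial_fraction=0.1):
    return lambda n: base_hours * (serial_fraction + (1 - serial_fraction) / n)

curves = {"A": make_curve(20.0), "B": make_curve(12.0), "C": make_curve(6.0)}

def seq_curve(names):
    # Constituents that run sequentially share the same processes, so under
    # the independence assumption their scaling curves simply add.
    return lambda n: sum(curves[m](n) for m in names)

# The five configurations, each written as the sequential groups that
# execute in parallel with one another.
CONFIGS = {
    "A||B||C": [["A"], ["B"], ["C"]],
    "ABC":     [["A", "B", "C"]],
    "AB||C":   [["A", "B"], ["C"]],
    "AC||B":   [["A", "C"], ["B"]],
    "A||BC":   [["A"], ["B", "C"]],
}

def best_global_config(total_procs):
    """For each configuration, overlap the parallel groups as well as
    possible (minimize the slowest group's time per simulated year), then
    return the overall winner across all five configurations."""
    best = None
    for name, groups in CONFIGS.items():
        group_curves = [seq_curve(g) for g in groups]
        # Enumerate every way to split the processor budget among the groups.
        for split in product(range(1, total_procs + 1), repeat=len(groups)):
            if sum(split) != total_procs:
                continue
            t = max(c(n) for c, n in zip(group_curves, split))
            if best is None or t < best[0]:
                best = (t, name, split)
    return best

hours, config, split = best_global_config(32)
```

Even this toy version hints at the scaling problem raised below: the enumeration grows combinatorially with the processor budget and the number of constituents, before any further parameters are added.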

In addition to relaxing the assumption that sequentially executing constituents have independent performance characteristics, both additional parameters (e.g., decomposition strategy) and additional constraints (e.g., memory available on each node) need to be taken into account to improve the quality of the optimization.
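As one illustration of how such a constraint might enter the search, a per-node memory check could simply prune candidate allocations. All the numbers here are invented for the sketch; real footprints would come from profiling each constituent.

```python
# Illustrative constraint data, not measurements: resident memory per
# process for each constituent, and the (assumed) node configuration.
MEM_PER_PROC_GB = {"A": 2.5, "B": 0.8, "C": 0.5}
PROCS_PER_NODE = 16
NODE_MEM_GB = 32.0

def fits_in_memory(model, nprocs):
    """Reject candidate allocations whose per-node memory footprint exceeds
    node capacity, assuming a model's processes are packed onto nodes
    PROCS_PER_NODE at a time."""
    per_node_procs = min(nprocs, PROCS_PER_NODE)
    return per_node_procs * MEM_PER_PROC_GB[model] <= NODE_MEM_GB
```

A filter like this shrinks the search space, but each added parameter or constraint multiplies the number of candidate configurations, which is exactly what motivates the tractability question.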

My question: if we add a realistic set of parameters to the global coupled model optimization problem, is it still computationally tractable?

[1] Patrick H. Worley, Arthur A. Mirin, Anthony P. Craig, Mark A. Taylor, John M. Dennis, and Mariana Vertenstein. 2011. Performance of the Community Earth System Model. In *Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis* (SC ’11). ACM, New York, NY, USA, Article 54, 11 pages. DOI=10.1145/2063384.2063457 http://doi.acm.org/10.1145/2063384.2063457

[2] Peggy Li. ESMF Component Overhead in CCSM4.0.