An Autonomic Approach to Integrated HPC Grid and Cloud Usage

Clouds are rapidly joining high-performance Grids as viable computational platforms for scientific exploration and discovery, and it is clear that production computational infrastructures will integrate both paradigms in the near future. As a result, understanding usage modes that are meaningful in such a hybrid infrastructure is critical. For example, there are interesting application workflows that can benefit from such hybrid usage modes to, perhaps, reduce time to solution, reduce costs (in terms of currency or resource allocation), or handle unexpected runtime situations (e.g., delays in scheduling queues or failures). The primary goal of this paper is to experimentally investigate, from an applications perspective, how autonomics can enable interesting usage modes and scenarios for integrating HPC Grids and Clouds. Specifically, we used a reservoir characterization application workflow, based on Ensemble Kalman Filters (EnKF) for history matching, and the CometCloud autonomic Cloud engine on a hybrid platform consisting of the TeraGrid and Amazon EC2, to investigate three usage modes (or autonomic objectives) – acceleration, conservation and resilience.

Application Overview

A schematic overview of the CometCloud-based autonomic application management framework for enabling hybrid HPC Grids-Cloud usage modes is presented below.

The framework is composed of autonomic managers that coordinate using Comet coordination spaces, which span the integrated execution environment and can be accessed transparently across it. The key components of the management framework are described below.

Workflow Manager: The workflow manager is responsible for coordinating the execution of the overall application workflow, based on user-defined policies, using Comet spaces.

Estimators: The cost estimators are responsible for translating hints about computational complexity provided by the application into runtime and/or cost estimates on a specific resource.
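The estimator's role can be sketched as a simple mapping from an application-supplied complexity hint to per-resource estimates. The function name, the linear cost model, and the calibration numbers below are illustrative assumptions, not CometCloud's actual interface:

```python
# Hypothetical sketch of a cost estimator. The linear model and the
# per-resource calibration values are assumptions for illustration.

def estimate(complexity_hint, resource):
    """Translate an application-supplied complexity hint into a
    (runtime_seconds, dollar_cost) estimate for a given resource."""
    # Assumed calibration: seconds of runtime per unit of complexity,
    # and price per instance-hour (0 for an allocation-based Grid).
    profiles = {
        "teragrid":  {"sec_per_unit": 1.0,  "usd_per_hour": 0.0},
        "ec2_small": {"sec_per_unit": 16.0, "usd_per_hour": 0.10},
    }
    p = profiles[resource]
    runtime = complexity_hint * p["sec_per_unit"]
    cost = (runtime / 3600.0) * p["usd_per_hour"]
    return runtime, cost
```

The autonomic scheduler can then compare such estimates across platforms when deciding where to place a task.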

Autonomic Scheduler: The autonomic scheduler performs the key autonomic management tasks: driven by the selected objective and the estimators' runtime/cost estimates, it determines how tasks are distributed across the Grid and Cloud resources.

Grid/Cloud Agents: The Grid/Cloud agents are responsible for provisioning the resources on their specific platforms, configuring workers as execution agents on these resources, and appropriately assigning tasks to these workers.
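The coordination substrate the components above share is a tuple-space-style abstraction: managers insert tasks into the space and workers pull them out. The following is a minimal in-memory sketch of that pattern; the class and method names are illustrative and Comet's actual interface differs:

```python
# Minimal in-memory sketch of a tuple-space-style coordination space.
# Names (Space, out, take) are illustrative, not Comet's real API.
import threading

class Space:
    def __init__(self):
        self._tuples = []
        self._cv = threading.Condition()

    def out(self, tup):
        """Insert a tuple into the space (e.g., a task descriptor)."""
        with self._cv:
            self._tuples.append(tup)
            self._cv.notify_all()

    def take(self, match):
        """Block until a tuple satisfying `match` exists, then remove
        and return it -- how a worker would pull its next task."""
        with self._cv:
            while True:
                for t in self._tuples:
                    if match(t):
                        self._tuples.remove(t)
                        return t
                self._cv.wait()
```

Because insertion and retrieval are decoupled through the space, workers on the TG and on EC2 can consume tasks from the same pool without knowing about each other.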

Experiments and results

The goal of the experiments presented in this section is to investigate how possible usage modes for a hybrid HPC Grids-Cloud infrastructure can be supported by a simple policy-based autonomic scheduler. Specifically, we experimentally investigate implementations of three usage modes – acceleration, conservation and resilience – which are the different objectives of the autonomic scheduler.
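One way to picture the three objectives is as alternative task-partitioning policies in the scheduler. The sketch below is a deliberate simplification (the function name, task fields, and policy details are assumptions), intended only to show how a single policy switch can yield the three behaviors:

```python
# Illustrative policy dispatch for the three autonomic objectives.
# The partitioning rules are simplified assumptions, not the paper's
# actual scheduler logic.

def assign(tasks, objective, tg_allocation):
    """Return (tg_tasks, ec2_tasks) under a simple per-objective policy."""
    if objective == "acceleration":
        # Use the available TG allocation and push the overflow to EC2
        # so the stage finishes sooner than TG alone would allow.
        return tasks[:tg_allocation], tasks[tg_allocation:]
    if objective == "conservation":
        # Conserve the TG allocation: keep only tasks that need 16-way
        # parallelism on the TG, run everything else on EC2.
        big = [t for t in tasks if t["needs_parallel"]]
        small = [t for t in tasks if not t["needs_parallel"]]
        return big, small
    if objective == "resilience":
        # Plan everything on the TG, but hold EC2 as an initially empty
        # fallback pool for tasks hit by queue delays or failures.
        return list(tasks), []
    raise ValueError(f"unknown objective: {objective}")
```

In a real deployment the decision would also consult the estimators and runtime monitoring rather than a static task list.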

Our experiments use a single stage EnKF workflow with 128 ensemble members (tasks) with heterogeneous computational requirements. The heterogeneity is illustrated in Figure 3, which shows histograms of the runtimes of the 128 ensemble members within a stage on 1 node of a TG compute system (Ranger) and on 1 EC2 core (a small VM instance: 1.7 GB memory, 1 virtual core, 160 GB instance storage, 32-bit platform), respectively. The distribution of task runtimes is almost Gaussian, with a few significant exceptions. These plots also demonstrate the relative computational capabilities of the two platforms. Note that when a task is assigned to a TG compute node, it runs as a parallel application across the node's 16 cores with linear scaling. On an EC2 node, however, it runs as a sequential simulation, which (obviously) takes longer.
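The per-platform execution model described above (16-way parallel with linear scaling on a TG node, sequential on an EC2 core) can be captured in a small helper; the function and platform names are illustrative:

```python
# Sketch of the per-platform runtime model stated in the text:
# a task scales linearly across a 16-core TG node, but runs
# sequentially on a single EC2 core. Names are assumptions.

def node_runtime(seq_seconds, platform):
    """Estimated task runtime given its sequential runtime in seconds."""
    if platform == "tg_node":
        return seq_seconds / 16.0   # linear scaling over 16 cores
    if platform == "ec2_small":
        return seq_seconds          # one virtual core, sequential
    raise ValueError(f"unknown platform: {platform}")
```

Under this model, a task with a 1600-second sequential runtime finishes in 100 seconds on a TG node, which is the relative-capability gap the histograms in Figure 3 illustrate.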

We use two key metrics in our experiments. Total Time to Completion (TTC) is the wall-clock time for the entire (1-stage) EnKF workflow to complete, i.e., for all 128 ensemble members to finish and their results to be consumed by the KF stage; it may include both TG and EC2 execution. Total Cost of Completion (TCC) is the total EC2 cost for the entire EnKF workflow.
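The two metrics can be sketched from per-platform makespans. The function below assumes the TG and EC2 portions run concurrently and that EC2 instances are billed per started instance-hour; the names and the default price are assumptions for illustration:

```python
# Illustrative computation of the two experiment metrics.
# Assumes concurrent TG/EC2 execution and per-started-hour EC2
# billing; the hourly price is a placeholder, not measured data.
import math

def ttc_tcc(tg_makespan_s, ec2_makespan_s, ec2_instances,
            usd_per_hour=0.10):
    """Return (TTC seconds, TCC dollars) for one workflow stage."""
    # TTC: the two platforms run in parallel, so the slower dominates.
    ttc = max(tg_makespan_s, ec2_makespan_s)
    # TCC: each EC2 instance is billed for every started hour it runs.
    billed_hours = math.ceil(ec2_makespan_s / 3600.0) * ec2_instances
    tcc = billed_hours * usd_per_hour
    return ttc, tcc
```

For example, a stage whose TG portion takes 5000 s while 10 EC2 instances finish their share in 4000 s has a TTC of 5000 s and is billed for two started hours per instance.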

Our experiments are based on the assumption that for tasks that can use 16-way parallelism, the TG is the platform of choice for the application and gives the best performance, but is also the relatively more restricted resource. Furthermore, users have a fixed allocation on this expensive resource, which they might want to conserve for tasks that require greater node counts. EC2, on the other hand, is more freely available but less capable.

Note that the motivation for our experiments is to understand each of the usage scenarios and their feasibility, behaviors and benefits, and not to optimize the performance of any one scenario (or experiment). In other words, we are trying to establish a proof of concept rather than conduct a systematic performance analysis.