CometCloud [1] is an autonomic computing engine for cloud and Grid environments. It is based on the Comet [2] decentralized coordination substrate, and supports highly heterogeneous and dynamic cloud/Grid infrastructures, integration of public/private clouds and autonomic cloudbursts. A schematic overview of the architecture is presented in Figure 1.

Conceptually, CometCloud is composed of a programming layer, service layer, and infrastructure layer. The infrastructure layer uses the Chord self-organizing overlay [3], and the Squid [4] information discovery and content-based routing substrate build on top of Chord. The routing engine supports flexible content-based routing and complex querying using partial keywords, wildcards, or ranges. It also guarantees that all peer nodes with data elements that match a query/message will be located. Note that resources (nodes) in the overlay have different roles (and accordingly, access privileges) based on their credentials and capabilities. This layer also provides replication and load balancing services, and handles dynamic joins and leaves of nodes as well as node failures. Every node maintains a replica of its successor node’s state, and reflects changes to this replica whenever its successor notifies it of changes. If a node fails, the predecessor node merges the failed node’s replica into its own state and then makes a new replica of its new successor. If a new node joins, the joining node’s predecessor updates its replica to reflect the joining node’s state, and the successor gives its state information to the joining node. Note that, to maintain load balancing, load should be redistributed among the nodes whenever a node joins and leaves. The service layer provides a range of services to supports autonomics at the programming and application level. This layer supports a Linda-like [5] tuple space coordination model, and provides a virtual shared-space abstraction as well as associative access primitives. Dynamically constructed transient spaces are also supported to allow applications to explicitly exploit context locality to improve system performance. Asynchronous (publish/subscribe) messaging and event services are also provides by this layer. Finally, online clustering services support autonomic management and enable self-monitoring and control. Events describing the status or behavior of system components are clustered and the clustering is used to detect anomalous behaviors. The programming layer provides the basic framework for application development and management. It supports a range of paradigms including the master/worker/BOT. Masters generate tasks and workers consume them. Masters and workers can communicate via virtual shared space or using a direct connection. Scheduling and monitoring of tasks are supported by the application framework. The task consistency service handles lost/failed tasks. Even though replication is provided by the infrastructure layer, a task may be lost due to, for example, network congestion. In this case, since there is no failure, infrastructure level replication may not be able to handle it. These cases can be handled by the programming layer (e.g., the master), for example, by waiting for the result of each task for a pre-defined time interval, and if it does not receive the result back, regenerating the lost task. If the master receives duplicate results for a task, it selects one (the first) and ignores other (subsequent) results. Other supported paradigms include workflow-based applications as well as Mapreduce [6] (and Hadoop [7]).


Autonomic Cloud-bridging & Cloudbursts in CometCloud

The goal of autonomic cloudbursts is to seamlessly (and securely) bridge private enterprise clouds and data-centers with public utility clouds on-demand, to provide an abstraction of resizable computing capacity that is driven by user-defined high-level policies. It enables the dynamic deployment of application components, which typically run on internal organizational compute resources, onto a public cloud (i.e., cloudburst) to address dynamic workloads, spikes in demands, economic/budgetary issues, and other extreme requirements. Furthermore, given the increasing application and infrastructure scales, as well as their cooling, operation and management costs, typical over-provisioning strategies are no longer feasible. Autonomic cloudbursts can leverage utility clouds to provide on-demand scale-out and scale-in capabilities based on a range of metrics.

The overall support for autonomic cloudbursts in CometCloud is presented in Figure 2. CometCloud considers three types of clouds based on perceived security/trust and assigns capabilities accordingly. The first is a highly trusted, robust and secure cloud, usually composed of trusted/secure nodes within an enterprise, which is typically used to host masters and other key (management, scheduling, monitoring) roles. These nodes are also used to store state. In most applications, the privacy and integrity of critical data must be maintained, and as a result, tasks involving critical data should be limited to cloud nodes that have required credentials. The second type of cloud is one composed of nodes with such credentials, i.e., the cloud of secure workers. A privileged Comet space may span these two clouds and may contain critical data, tasks and other aspects of the application-logic/workflow. The final type of a cloud consists of casual workers. These workers are not part of the space but can access the space through the proxy to obtain (possibly encrypted) work units as long as they present required credentials and these credentials also define the nature of the access and type of data that can be accessed. Note that while nodes can be added or deleted from any of these clouds, autonomic cloudbursts primarily target worker nodes, and specifically worker nodes that do not host the Comet space as they are less expensive to add and delete.


References

[1] H. Kim, M. Parashar, D. J. Foran and L. Yang, “Investigating the use of cloudbursts for high-throughput medical image registration,” in the 10 th IEEE/ACM International Conference on Grid Computing (GRID2009), Banff, Canada , Oct. 2009.

[2] Z. Li and M. Parashar, “A computational infrastructure for grid-based aternational symposium on High performance distributed computing. New York , NY , USA : ACM, 2007, pp. 229–230.

[3] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup protocol for internet applications,” in ACM SIGCOMM, 2001, pp. 149–160.

[4] C. Schmidt and M. Parashar, “Squid: Enabling search in dhtbased systems,” J. Parallel Distrib. Comput., vol. 68, no. 7, pp. 962–975, 2008.

[5] N. Carriero and D. Gelernter, “Linda in context,” Commun. ACM, vol. 32, no. 4, pp. 444–458, 1989. [6] J. Dean and S. Ghemawat , “Mapreduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008. [7] Hadoop. http://hadoop.apache.org/core/.