Choosing A Cloud Software Partner - Page 4

By Jay Judkowitz (Profile)
Share
Friday, August 12th 2011
Advanced

Storage Management

Like compute, storage needs to be aggregated into large pools for access by end users so that they are hidden from the details of the different storage devices and what objects are placed on what storage device.

Unlike compute, there are widely varying capabilities and prices for different storage devices, so some aggregation system is needed to create pools with different service levels where customers can decide the capabilities they need and are wiling to pay for to store their storage objects. This customer decision then drives automated pool selection and ultimately, device selection.

Like with all other cloud resources, the end-user created storage objects need to be created through self-service workflows without administrator interaction, but must also be governed by a robust permission and delegation system governing which storage can be used by which users.

The storage objects created by end-users in those pools need to be managed independently of the instances that mount and access them. This way, creating, updating or deleting workloads does not affect the core information the customer needs to preserve over time. Workloads that create data can be killed, redeployed from an updated template and reattached to storage without impact to the storage object itself. Also, storage objects should be able to be cloned or snapshotted for use by future instances or for rollback processes without any interaction with the running instance accessing it.

Billing and Chargeback

Core to the economic model of cloud is the ability to have end customers either pay for their usage or at the very least to understand their impact on datacenter costs. To that end, there needs to be complete metering APIs and a chargeback or showback system for the cloud.

Hands-Off Infrastructure Management

Management of the physical infrastructure should be as low touch as possible. This includes many aspects:

  • Installation of the nodes: Node installation becomes a frequent operation in a big, fast-growing, and/or mature cloud where parts need to be replaced regularly. Manually installing or configuring servers will be too expensive and error-prone in this world. The only proper experience is for the servers to be racked and connected, then powered on – and nothing else. The cloud needs to auto-discover the server, install it, and make it ready to accept workloads.
  • Intelligent workload placement: Workloads should be automatically, and without administrator involvement, placed such that they are:
    • Loosely packed enough that bottlenecks and performance problems are not generated as dealing with those problems reactively will be problematic at scale.
    • Tightly packed enough that hardware, power, and cooling are not wasted.
    • Strategically placed so that related workloads cohabitate for enhanced inter-workload communication and that redundant workloads are separated to eliminate single points of failure for the service.
    • Placed based on constraints such as the requirement to be on a node with GPUs or a node that is certified PCI compliant.
  • Capacity tracking: There needs to be cloud-wide tracking of resources so that the datacenter operators are aware of cloud capacity and when they need to acquire more hardware.
  • Isolating and retiring equipment: All systems should have a lifetime and a health status associated with them (due to length of maintenance contract, expected lifetime of component parts, and/or length of lease). When that lifetime is exceeded or when the part is failing or has failed, it is automatically isolated from the cloud and flagged for replacement. The cloud should be aware of datacenter layout so that administrators never have a problem locating the equipment at replacement time.
  • Managing planned and unplanned downtimes within the datacenter: If you generally deploy cloud-ready applications (see last article in this series), most datacenter events should be transparent to the end users of the services. Scale-out applications can be scaled up to repopulate lost instances and chunks of huge compute jobs can be automatically respun. However, downtimes associated with persistent data need to be managed as well as compute or network downtimes that affect any of your more monolithic applications. The datacenter should recover whatever it can on its own, and for what it can’t, end users need to be alerted to upcoming planned downtime or recent unplanned downtime and made capable of adjusting their workload deployments accordingly.