Tuesday, February 16, 2010

A manifesto on swarm- and utility-based cloud computing

Introduction

So why is swarm and utility based cloud computing so interesting? The answer is that businesses wish to reduce cost and increase the scalability of their systems. The only way to achieve this is to remove the overhead of managing systems and of having them sit idle.


In my mind, the largest problem with distributed computing is that it demands a mindset shift. Right now we focus mainly on pushing data to the cloud, with data size and transfer left as an afterthought. My goal with a swarm-based framework is to focus on moving the process, not the data. I believe this can be achieved by structuring the storage strategy to optimise and limit the amount of data transferred, and only then moving the process. Execution of the process is not a consideration at this point, as a client CPU is rarely the limiting factor.


The goal of this blog post series is to hold an open and frank discussion about how best to achieve the goals outlined here, and how to avoid the pitfalls that may arise.



There are several major considerations, outlined below, that need to be addressed in a successful cloud or swarm framework.


Data Shards
The framework requires the data to be segmented into meaningful units of knowledge, or "Shards". A shard is simply a predefined set of data with an explicit relationship dictated by the programmer. For instance, in a banking home loan system, a person's contact details combined with their home loan information would belong to a particular shard of data.

A shard of data can belong to another shard. For example, the home loan data for a particular person can belong to a set of home loan data owned by a mortgage broker, as well as by a bank. This becomes particularly useful for null dressing as well as garbage collection, as you can explicitly pull information from the cloud, and delete objects that are referenced by shard. I will expand on this in a later post.
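To make the shard idea a little more concrete, here is a minimal sketch in TypeScript of what a nested shard might look like. The interface, field names and example values are purely my own illustrative assumptions, not part of any existing framework.

```typescript
// A minimal, hypothetical sketch of a data shard. The shape is an
// illustrative assumption, not an existing API.
interface Shard<T> {
  id: string;             // unique identifier for the shard
  owner: string;          // the programmer-defined owning entity
  parent?: Shard<any>;    // a shard may belong to another shard
  children: Shard<any>[]; // shards nested beneath this one
  data: T;                // the explicit set of data the programmer grouped
}

// Example: a person's contact details plus their home loan information
// form one shard, which in turn belongs to a mortgage broker's shard.
interface Contact { name: string; phone: string; }
interface HomeLoan { accountNumber: string; balance: number; }

const personLoanShard: Shard<{ contact: Contact; loan: HomeLoan }> = {
  id: "person-42-loan",
  owner: "person-42",
  children: [],
  data: {
    contact: { name: "Jane Doe", phone: "555-0100" },
    loan: { accountNumber: "HL-9917", balance: 350000 },
  },
};

const brokerShard: Shard<{ brokerName: string }> = {
  id: "broker-7",
  owner: "broker-7",
  children: [personLoanShard],
  data: { brokerName: "Acme Mortgages" },
};
personLoanShard.parent = brokerShard;
```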

Using Data Shards to reduce the movement of information
My thoughts on how shards can limit the movement of data go as follows:

Moving data has two dimensions of expense: the amount of data, and how expensive it is to move it. The former is self-explanatory, but the latter becomes interesting with Flex 4's capability to handle peer-to-peer communication between clients. Take an office situation: people who access similar types of data generally tend to cluster together, both geographically and from a network perspective. In a bank, for instance, insurance brokers work on the same network, sales staff work in the same building, and mortgage brokers do likewise.
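As a rough sketch of those two dimensions, the cost of moving a shard could be modelled as its size multiplied by the expense of the path it travels, with a peer on the local network assumed far cheaper than a round trip to the server. The weights and function names below are illustrative assumptions only.

```typescript
// Illustrative only: relative cost weights for different transfer paths.
// A peer on the same LAN is assumed far cheaper than the central server.
const PATH_COST = {
  localPeer: 1,  // peer-to-peer on the same network
  remotePeer: 5, // peer on another site
  server: 20,    // full round trip to the central server
} as const;

type Path = keyof typeof PATH_COST;

// Cost of moving a shard = amount of data * expense of the chosen path.
function transferCost(shardSizeBytes: number, path: Path): number {
  return shardSizeBytes * PATH_COST[path];
}

// Pulling 2 MB from a nearby peer vs. from the server:
console.log(transferCost(2_000_000, "localPeer")); // 2,000,000
console.log(transferCost(2_000_000, "server"));    // 40,000,000
```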

It makes no sense for all of these users to contact the server to pull down the same data. This is where the concept of data gravity comes in. When a client accesses a piece of data, a counter tracking the total number of times that particular shard has been accessed is incremented, and an implicit total is tallied down through the children from the root source shard.

This counter is used to rank the amount of "carpet wear" on a shard, making it more important, akin to increasing its mass. The client will cache data according to its level of importance, and will discard, and stop listening to changes on, the least important objects. Updating the data is another problem, but I think we will leave that discussion for later as well.
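A minimal sketch of that bookkeeping might look like the following, reusing the hypothetical Shard shape from the earlier sketch. I am assuming here that the implicit total means a parent shard's "mass" includes the accesses of its descendants; the class and method names are my own, not part of any framework.

```typescript
// Hypothetical "Data Gravity" bookkeeping: each access bumps the shard's
// counter and propagates the increment up towards the root, and the client
// evicts (and stops listening to) the shards with the least mass.
class GravityTracker {
  private accessCounts = new Map<string, number>();

  // Record one access and propagate the increment up to the root shard.
  recordAccess(shard: Shard<unknown>): void {
    let current: Shard<unknown> | undefined = shard;
    while (current) {
      const count = this.accessCounts.get(current.id) ?? 0;
      this.accessCounts.set(current.id, count + 1);
      current = current.parent;
    }
  }

  // The shard's importance, akin to its accumulated mass.
  mass(shard: Shard<unknown>): number {
    return this.accessCounts.get(shard.id) ?? 0;
  }

  // Client-side cache policy: keep the heaviest shards, return the least
  // important ones as candidates for eviction.
  evictionCandidates(cached: Shard<unknown>[], keep: number): Shard<unknown>[] {
    return [...cached]
      .sort((a, b) => this.mass(b) - this.mass(a))
      .slice(keep);
  }
}
```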

I have thought about naming this algorithm Data Gravity, as the growing importance of an object is akin to a planet increasing its mass and clustering objects around it in space.

This will make further sense in a future blog post, when I expand on my ideas by incorporating hyperplanes, neural networks and convex hull structures into Data Shard clustering, and introduce the concept of a Gravity Well.

Private Clouds and Data Security

A large sales barrier to cloud computing is data security and integrity. Private clouds seem to be gathering momentum; however, the ideal should be a mix of private and public clouds.

Non-sensitive data can be pushed to third-party cloud vendors, while sensitive information should be kept on, and prioritised for, the company's own infrastructure.

What you pay for should be used as close to maximum capacity as possible, while there is always the option to load-shed processes such as serving public web pages or public document storage to an external provider should your resources become overloaded.
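A sketch of that load-shedding decision might look like this; the 0.9 threshold and the names are arbitrary assumptions of mine rather than anything prescribed here.

```typescript
// Hypothetical load-shedding rule: run everything on owned infrastructure
// until it approaches capacity, then push public-facing work to an
// external provider. The 0.9 threshold is an arbitrary illustration.
type Workload = { name: string; isPublicFacing: boolean };

function chooseHost(
  workload: Workload,
  internalUtilisation: number, // 0..1, current load on owned servers
): "internal" | "external" {
  if (internalUtilisation > 0.9 && workload.isPublicFacing) {
    return "external"; // shed public web pages, document storage, etc.
  }
  return "internal";   // keep owned hardware near maximum capacity
}

chooseHost({ name: "public website", isPublicFacing: true }, 0.95);   // "external"
chooseHost({ name: "loan processing", isPublicFacing: false }, 0.95); // "internal"
```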

Adding a security clearance level to a data shard allows the private cloud framework to manage the distribution of data shards across servers and members. For instance, a server inside the organisation may be instructed to store anything from top secret shards down to public access information, while, say, the Amazon or Google cloud may be set to store only public access information.
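A minimal sketch of how such a clearance level might gate where a shard is allowed to live follows; the enum values and store descriptions are illustrative assumptions.

```typescript
// Hypothetical clearance levels, ordered from least to most sensitive.
enum Clearance {
  Public = 0,
  Internal = 1,
  Secret = 2,
  TopSecret = 3,
}

interface ShardStore {
  name: string;
  maxClearance: Clearance; // the most sensitive data this store may hold
}

const internalServer: ShardStore = { name: "in-house server", maxClearance: Clearance.TopSecret };
const publicCloud: ShardStore = { name: "Amazon/Google cloud", maxClearance: Clearance.Public };

// A shard may only be placed on stores cleared for its level.
function allowedStores(shardClearance: Clearance, stores: ShardStore[]): ShardStore[] {
  return stores.filter(s => s.maxClearance >= shardClearance);
}

allowedStores(Clearance.Public, [internalServer, publicCloud]);    // both stores
allowedStores(Clearance.TopSecret, [internalServer, publicCloud]); // in-house only
```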

A more complex security policy might grant access to particular types of data shards. For instance, if you are the creator of a data shard, you would have access to it even if it is top secret. Furthermore, you might give other people direct access at this level, as well as transitive access through membership of a team or group.
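That richer policy might be sketched as follows, granting access to a shard's creator regardless of clearance, to users given direct access, and transitively to members of a group that has been granted access. All names here are my own assumptions, and the enum comes from the previous sketch.

```typescript
// Hypothetical access rule combining creator rights, direct grants,
// and transitive grants via team/group membership.
interface AccessPolicy {
  creator: string;
  directGrants: Set<string>; // user ids granted access explicitly
  groupGrants: Set<string>;  // group ids granted access
}

interface User {
  id: string;
  clearance: Clearance; // reusing the Clearance enum from the previous sketch
  groups: Set<string>;
}

function canAccess(user: User, shardClearance: Clearance, policy: AccessPolicy): boolean {
  if (user.id === policy.creator) return true;       // creators always see their own shards
  if (policy.directGrants.has(user.id)) return true; // explicit per-user grant
  for (const g of user.groups) {
    if (policy.groupGrants.has(g)) return true;      // transitive via team or group
  }
  return user.clearance >= shardClearance;           // otherwise, clearance decides
}
```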

Data Recovery Management

DRM is a huge area that costs organisations a very pretty penny. It is also an area where, if a pretty penny is not spent and a risk materialises into an issue, it becomes a significant unexpected expense, or worse.



Cloud has brought forward a solution that is more like fool's gold than anything else.



The problem is that small amounts of data can certainly be pushed to the cloud for backup purposes; for larger volumes, however, such as movie files or images, cloud storage is simply not an option.



Security is again a major issue here, as it is very foolish for an organisation to push something like its complete client list, or its accounts, to the cloud. Hence, these files tend either not to be backed up at all, or to be stored under a more traditional DRM strategy such as a file server or a copy to CD.



The problem here is that an organisation that moves its DRM reliance to the cloud will find its traditional DRM strategy atrophying. In my days working as a systems administrator, I came across many sites where my predecessor had diligently made backups to CDs or removable storage drives, but had never tested whether they worked.



In such a circumstance, it becomes a dangerous placebo.



A swarm-based DRM could be very interesting, in that the data movement and update solution can be applied to storing files, encrypted, on a peer-to-peer basis in a private swarm cloud. This way, if a client, or worse, a server, is knocked out by disaster, a recovery process can automatically reconstruct the data from the swarm of clients.
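As a rough sketch of that idea, and only a sketch, assuming simple replication rather than any particular erasure-coding scheme and using hypothetical names throughout: each file is encrypted, split into chunks, and each chunk is copied to several peers, so recovery only needs some subset of the swarm to survive.

```typescript
// Hypothetical swarm backup: replicate encrypted chunks across peers.
// Encryption and the peer transport are stubbed out; this only shows the
// placement and recoverability logic.
interface Peer {
  id: string;
  store(chunkId: string, bytes: Uint8Array): void;
}

function distribute(
  encryptedChunks: Uint8Array[], // assume the file is already encrypted and chunked
  peers: Peer[],
  replicas = 3,
): Map<number, string[]> {
  const placement = new Map<number, string[]>(); // chunk index -> peer ids holding it
  encryptedChunks.forEach((chunk, i) => {
    const holders: string[] = [];
    for (let r = 0; r < replicas; r++) {
      const peer = peers[(i + r) % peers.length]; // spread replicas across the swarm
      peer.store(`chunk-${i}`, chunk);
      holders.push(peer.id);
    }
    placement.set(i, holders);
  });
  return placement;
}

// After a disaster, the file can be reassembled (and then decrypted) as long
// as at least one surviving peer holds each chunk.
function recoverable(placement: Map<number, string[]>, alivePeers: Set<string>): boolean {
  for (const holders of placement.values()) {
    if (!holders.some(id => alivePeers.has(id))) return false;
  }
  return true;
}
```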



I will discuss this further in a later post.



Unidentified Issues or Benefits

At this point I would like to open the floor to any and all who read this post to provide thoughts and feedback on the ideas I have outlined above.



I feel that there are many untapped possibilities that have not been thought of yet in this space, and I would love to explore them with the community.
