Having worked with SAP landscapes on various IaaS platforms, I have come to a disturbing conclusion: they are damn hard to keep control of and manage on a medium- to long-term basis. This has become something of an elephant in the room for many of us Cloud evangelists, but I feel it must be addressed if Cloud environments are to progress from great finite-lifespan systems to systems that are fully integrated into normal landscapes. Discussed below are some of the major challenges that can affect Cloud projects and implementations.
One of the biggest selling points of IaaS environments is the level of flexibility that they provide. Through this flexibility, we have the ability to do things like:
- Cloning systems – creating a clone of a system takes just a few mouse clicks, and creating instances from that clone is just as easy. This is a double-edged sword, as creating snapshots of instances requires additional storage, which needs to be monitored, managed and paid for. By creating a clone we have doubled the amount of resources being used; if we then create an instance from that clone, we have tripled it. As you can see, it is very easy to increase the amount of resources being charged for by the IaaS provider.
- Allocating new infrastructure – creating or allocating new infrastructure is deceptively easy. Although it is easy to create an additional 100GB volume, it takes discipline and process to make sure it is labelled and catalogued properly to ease administration. The diagram below shows the nightmare that can be unleashed through a lack of discipline.
The graphs below show the month-by-month growth in the number of volumes against the number of servers for an implementation I managed recently. In July and August the system was implemented and stable; in September it underwent some DR testing, which increased both the number of servers and the number of volumes. Although this testing was complete in October, the number of volumes has not returned to the baseline. In fact it is not even close, even though the number of servers has dropped back to baseline.
The graph below shows in more detail the split between volumes that are Available and those that are In-Use. It confirms that in October the number of volumes not attached to servers increased. This indicates that although the servers were terminated, people are not deleting the associated storage, because “you never know if you’ll reuse it”.
- Creating new snapshots – snapshots are the “get out of jail free” card of data backups. Most IaaS platforms have a native snapshot capability which can be used as a replacement for normal backup applications. Like backup media, however, snapshots need to be managed and aged properly to make sure that they do not become an exponential mess. As with the volumes in the diagram above, this ease of creation means that people performing any change will snapshot a volume ‘just in case’ something goes wrong.
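To put a number on the unattached storage described above, a small report over the volume list is usually enough. The following is a minimal sketch, not a finished tool: it assumes the volume records have already been exported from the provider into plain dictionaries, and the field names (`VolumeId`, `State`, `Size` in GB) are modelled on what EC2 volume descriptions look like. The sample data is invented for illustration.

```python
from collections import Counter


def summarise_volumes(volumes):
    """Count volumes by state and pick out those not attached to any server."""
    by_state = Counter(v["State"] for v in volumes)
    # Assumption: "available" means the volume exists but is attached to nothing.
    orphans = [v for v in volumes if v["State"] == "available"]
    orphan_gb = sum(v["Size"] for v in orphans)
    return by_state, orphans, orphan_gb


# Hypothetical sample data standing in for an export from the IaaS console or API.
volumes = [
    {"VolumeId": "vol-001", "State": "in-use", "Size": 100},
    {"VolumeId": "vol-002", "State": "available", "Size": 100},
    {"VolumeId": "vol-003", "State": "available", "Size": 50},
]

by_state, orphans, orphan_gb = summarise_volumes(volumes)
for v in orphans:
    print(f"{v['VolumeId']}: {v['Size']} GB, not attached to any server")
print(f"Unattached storage still being paid for: {orphan_gb} GB")
```

Run monthly, a report like this would make the gap between Available and In-Use volumes visible before it shows up on the bill.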
Security has been, and continues to be, a worry for some on IaaS platforms, in my opinion a little unfairly. Many service providers offer deep and granular controls over their services; Amazon, for example, has IAM (Identity and Access Management). Within the AWS platform, each user gets a logon for the AWS console as well as an X.509 certificate for signing web service calls. This certificate can be used by any third-party application or service and carries the permissions defined in IAM. Often people focus on platform security without talking about the security of the OS and application layers. It is easy to hypothesize why this might be the case, and many articles have been written comparing IaaS security with on-premise security. Given the self-service nature of IaaS providers, their desire to make security as easy as possible and the “Jack of all trades – Master of none” approach taken by many IaaS practitioners, it is understandable that companies and people are wary of it. To provide good assurance, IaaS platform security must support auditing and inspection of configuration using existing deployed toolsets; security that is not transparent will never be fully trusted.
In order to move IaaS landscapes from temporary/finite systems to systems that are properly integrated into normal landscapes, they need to be manageable in the same way as those landscapes. This includes tasks like –
Backups – although it is possible to use the native snapshot ability on data volumes, this is not a great solution, because ageing the snapshots is difficult. It is not impossible, though: a service called Skeddly.com allows you to age and delete snapshots on a scheduled basis. For many operations people, using a properly managed and integrated backup product is still the right way to go.
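For anyone who wants to see what "ageing" snapshots involves, the rule itself is simple to express. This is a sketch under stated assumptions rather than a description of how Skeddly or any provider does it: the snapshot records are plain dictionaries with `SnapshotId`, `VolumeId` and `StartTime` fields (names modelled on EC2 snapshot descriptions), and the 14-day retention and "always keep the newest two per volume" rules are illustrative values I have chosen, not recommendations.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, keep_days=14, keep_minimum=2):
    """Return snapshots older than keep_days, but never the newest
    keep_minimum snapshots of any volume."""
    cutoff = now - timedelta(days=keep_days)

    # Group snapshots by the volume they were taken from.
    by_volume = {}
    for snap in snapshots:
        by_volume.setdefault(snap["VolumeId"], []).append(snap)

    expired = []
    for snaps in by_volume.values():
        # Newest first, so the first keep_minimum entries are always retained.
        snaps.sort(key=lambda s: s["StartTime"], reverse=True)
        for snap in snaps[keep_minimum:]:
            if snap["StartTime"] < cutoff:
                expired.append(snap)
    return expired


# Hypothetical sample data; the dates are arbitrary.
now = datetime(2020, 1, 31)
snapshots = [
    {"SnapshotId": "snap-a", "VolumeId": "vol-001", "StartTime": now - timedelta(days=1)},
    {"SnapshotId": "snap-b", "VolumeId": "vol-001", "StartTime": now - timedelta(days=10)},
    {"SnapshotId": "snap-c", "VolumeId": "vol-001", "StartTime": now - timedelta(days=20)},
    {"SnapshotId": "snap-d", "VolumeId": "vol-001", "StartTime": now - timedelta(days=30)},
    {"SnapshotId": "snap-e", "VolumeId": "vol-002", "StartTime": now - timedelta(days=60)},
]

expired = snapshots_to_delete(snapshots, now)
print([s["SnapshotId"] for s in expired])
```

The function only decides what is eligible for deletion; the deletion itself is deliberately left out so the list can be reviewed first. Note that `snap-e` survives despite its age, because it is the only snapshot of its volume.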
Startup/Shutdown – in order to achieve the savings quoted by many people, systems should run only for the periods in which they are required. This means that instances need to be started and stopped according to a defined schedule; my own template systems, for example, run between 6am and 10pm. To achieve this, something needs to run the start and stop scripts. Two options exist:
- Run a single instance 24x7 to run command line tools that start and stop the other instances. This goes against the principle of what we want to achieve, but the instance can be used for other purposes as well.
- Use a web-based service to start and stop the instances remotely. For me this is an attractive option, and I have used a service called Skeddly.com to perform scheduled actions on my AWS EC2 landscape.
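Whichever option runs it, the scheduling logic itself is small. The sketch below shows one way the 6am to 10pm window could be turned into start and stop decisions. It is only the decision step: the instance records and their `id` and `state` fields are hypothetical, and actually issuing the start or stop call to the provider is left out.

```python
from datetime import time

# Operating window taken from the text: systems run between 6am and 10pm.
START = time(6, 0)
STOP = time(22, 0)


def desired_state(now, start=START, stop=STOP):
    """Return the state an instance should be in at wall-clock time `now`."""
    return "running" if start <= now < stop else "stopped"


def plan_actions(instances, now):
    """Compare actual and desired state and return (action, instance_id) pairs."""
    want = desired_state(now)
    actions = []
    for inst in instances:
        if want == "running" and inst["state"] == "stopped":
            actions.append(("start", inst["id"]))
        elif want == "stopped" and inst["state"] == "running":
            actions.append(("stop", inst["id"]))
        # Any other state (for example, mid-transition) is left alone.
    return actions


# Hypothetical instance records for illustration.
instances = [
    {"id": "i-001", "state": "running"},
    {"id": "i-002", "state": "stopped"},
]

print(plan_actions(instances, time(5, 30)))  # before the window opens
print(plan_actions(instances, time(6, 0)))   # window has just opened
```

Comparing desired against actual state, rather than blindly firing "start" at 6am and "stop" at 10pm, means the same check can be run repeatedly without side effects and will correct an instance someone started by hand outside the window.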
The biggest bugbear that I, and everyone I have spoken to, have is the lack of a toolset that lets system owners and maintainers quickly and easily find out how every resource is connected and utilised. All the information is present in every management interface, but in every one I have used the infrastructure components are on different pages – see the diagram below.
As you can see from the above, I can see the status of all my instances, but if I want to see the attached volumes I need to go to a different page. Even then, this assumes that I have correctly populated the metadata tags from the instances page so I can determine what each volume is attached to (see the volume storage nightmare picture above).
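The single view I am missing amounts to a join between the instance list and the volume list. As a rough sketch of the idea, assuming both lists have been exported into plain dictionaries (the `InstanceId`, `Name`, `State`, `VolumeId` and `AttachedTo` fields here are simplified, hypothetical names, not any console's actual export format):

```python
def build_inventory(instances, volumes):
    """Join instances and volumes into one view keyed by instance id.

    Returns the per-instance inventory plus a list of volumes that cannot be
    matched to a known instance (unattached, or attached to something that
    no longer exists).
    """
    inventory = {
        inst["InstanceId"]: {
            # Untagged instances are exactly the ones that cause trouble later.
            "name": inst.get("Name", "(untagged)"),
            "state": inst["State"],
            "volumes": [],
        }
        for inst in instances
    }
    unaccounted = []
    for vol in volumes:
        owner = vol.get("AttachedTo")
        if owner in inventory:
            inventory[owner]["volumes"].append(vol["VolumeId"])
        else:
            unaccounted.append(vol["VolumeId"])
    return inventory, unaccounted


# Hypothetical sample data for illustration.
instances = [
    {"InstanceId": "i-001", "Name": "SAP-PRD-DB", "State": "running"},
    {"InstanceId": "i-002", "State": "stopped"},
]
volumes = [
    {"VolumeId": "vol-001", "AttachedTo": "i-001"},
    {"VolumeId": "vol-002", "AttachedTo": "i-001"},
    {"VolumeId": "vol-003", "AttachedTo": None},
    {"VolumeId": "vol-004", "AttachedTo": "i-999"},
]

inventory, unaccounted = build_inventory(instances, volumes)
for instance_id, info in inventory.items():
    print(instance_id, info["name"], info["state"], info["volumes"])
print("Unaccounted volumes:", unaccounted)
```

Nothing here is clever, which is rather the point: the data is already available, it is just never presented together.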
Several people have suggested applications like Chef or Puppet, which I have not had a chance to deploy as they are quite outside my core area of expertise, but I do know that RightScale uses Chef to manage customers’ infrastructures.
Ultimately, Cloud environments will always walk the fine line between flexibility and uncontrollability. This is simply because, if it were easy to provide a simple, flexible and controllable service, all hosting providers and data centres would offer one. In order to maximise the benefits of IaaS, there needs to be a clear consensus between the business and IT on what they want from each system. This will enable IT to create a flexible wrapper round these systems that provides solid management without too much overhead. The really good IT departments will drive this work themselves and automate as much as possible, so they can drive their own efficiencies whilst still serving the business. The explosion of IaaS services is partly because businesses got tired of IT departments telling them ‘No’, or ‘it’ll take 4 weeks to create that 10GB volume’.