By Birendra Gosai | February 4, 2011 07:30 AM EST
IT organizations today are experiencing pressure to not only adopt new and emerging technologies like virtualization, but also reduce costs and do more with fewer resources (thus reducing CapEx) - all while delivering assurance of capacity and performance to the business.
In the first part of this article, we provided a brief overview of the CA Technologies virtualization maturity lifecycle and focused on the server consolidation stage. Server consolidation promotes efficient use of available compute resources and reduces the total number of physical servers in the data center. Even so, organizations that have successfully consolidated their server environment and are progressing on their virtualization journey often find it difficult to virtualize tier 1 workloads or to run their hosts at a higher capacity, because they lack the confidence to move critical applications onto the virtual environment or to utilize servers to capacity.
In this second part of the article, we focus on building and maintaining a mature and optimized infrastructure that is essential for IT organizations to virtualize tier 1 workloads and achieve increased capacity utilization on the virtual hosts - thus helping them reap the true CapEx savings promised by virtualization.

Gain Visibility and Control
Organizations face significant challenges in trying to achieve the visibility and control necessary to optimize their virtual infrastructure. These include:
- Providing performance and Service Level Agreement (SLA) assurance to the business.
- Deploying and maintaining capacity on an automated basis.
- Securing access to the virtual environment and facilitating compliance.
- Providing business continuity in the event of a failure.
The following are the tasks and capabilities required to optimize the infrastructure and gain visibility and control into the availability and performance of the virtual environment.
Project Plan
The following is a high-level plan for an infrastructure optimization project. The timelines and tasks mentioned in Table 1 present a broad outline for a tier 1 infrastructure optimization project that targets setting up an optimized infrastructure and adding approximately 10 critical production workloads to about 40 virtual server hosts (with existing workloads) - thus resulting in an 80-90% capacity utilization on those servers. The 3-4 person implementation team suggested for the project is expected to be proficient in project management, virtualization design and deployment, and systems management.

Table 1: Infrastructure Optimization project plan
A successful infrastructure optimization project necessitates a structured approach that should consist of the following high-level tasks. For each of these tasks we will discuss the key objectives and possible challenges, articulate a successful outcome, and more.
Performance and Fault Monitoring
Prior to moving critical workloads onto the virtual environment, IT operations teams need to ensure that they have clear visibility and control into the availability and performance of the virtual environment. To foster this visibility and control, application / systems consultants should use performance management tools to:
- Discover the virtual environment and create an aggregate view of the virtual infrastructure. This discovery should be dynamic and not static - i.e., the aggregate view should automatically reflect changes in the virtual environment that result from actions such as vMotion. In addition, this discovery should not only reflect the virtual environment, but also components surrounding the virtual network.
- Set up event correlation. In a production environment where hundreds of events may be generated every second by the various components, event correlation is essential to cut through the noise and narrow down the root cause of active or potential problems.
- Enable real-time performance monitoring and historical trending. The performance monitoring should go beyond the basic metrics like CPU / memory consumption and provide insight into the traffic responsiveness across hosts. Trending capabilities are also essential to monitor and be cognizant of historical performance.
Capabilities like the ones mentioned above provide IT administrators and business/application owners the confidence to move critical production applications into the virtual environment.
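As an illustration of the event correlation step described above, the following is a minimal sketch that suppresses symptom events whose upstream dependency is also alarming, leaving only the probable root causes. The dependency map, the component names, and the `root_causes` function are hypothetical and are not taken from any particular monitoring product.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: component -> components it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "vm-app01": ["host-esx1"],
    "vm-db01": ["host-esx1"],
    "host-esx1": ["switch-a"],
    "switch-a": [],
}


def root_causes(alarming: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the alarming components that have no alarming upstream dependency."""

    def has_alarming_upstream(component: str, seen: Set[str]) -> bool:
        # Walk the dependency chain; `seen` guards against cycles.
        for parent in depends_on.get(component, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in alarming or has_alarming_upstream(parent, seen):
                return True
        return False

    return {c for c in alarming if not has_alarming_upstream(c, set())}


if __name__ == "__main__":
    events = {"vm-app01", "vm-db01", "host-esx1", "switch-a"}
    print(root_causes(events, DEPENDS_ON))
```

In this sketch, alarms on the two VMs and their host are treated as symptoms of the switch alarm. A real correlation engine would also need the dependency map to follow changes such as vMotion, which is why dynamic discovery comes first in the list above.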
Continuous Capacity Management
Critical applications depend on multiple components in the virtual environment. Given the dynamic nature of the virtual environment and the high volume of workloads processed by virtual servers, it's almost impossible for administrators to create and manage capacity plans on a project-by-project basis. Therefore, managing critical workloads requires automating the manual steps of capacity management, thus enabling continuous capacity management. A continuous capacity management environment should:
- Collect and correlate data from multiple data sources, update dashboards with the current state of utilization across virtual and physical infrastructure, and publish reports on the efficiency of resource utilization for each application / business service.
- Highlight opportunities for optimization, solve resource constraints, update baselines in predictive models, and utilize the predictive model to produce interactive illustrations of future conditions.
- Integrate with provisioning solutions for intelligent automation, and eco-governance solutions to help maintain compliance with environmental mandates.
The level of continuous capacity management described above, along with comprehensive analytic and simulation modeling capabilities, will allow the IT administrator to effectively manage the capacity of critical applications / services on an ongoing basis.
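To make the idea of a predictive model concrete, here is a minimal sketch that fits a straight-line trend to daily utilization samples and estimates how many days remain before a host crosses a utilization ceiling. The linear model, the function names, and the 90% ceiling are illustrative assumptions; commercial capacity management tools use considerably richer analytic and simulation models.

```python
from typing import List, Optional, Tuple


def linear_trend(samples: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * day + intercept over days 0..n-1."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples: List[float], ceiling: float) -> Optional[float]:
    """Days after the latest sample until the trend reaches `ceiling`.

    Returns None when utilization is flat or falling.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    crossing_day = (ceiling - intercept) / slope
    return max(0.0, crossing_day - (len(samples) - 1))


if __name__ == "__main__":
    daily_cpu_percent = [50.0, 52.0, 54.0, 56.0, 58.0]  # hypothetical samples
    print(days_until(daily_cpu_percent, ceiling=90.0))
```

Re-running such a fit each time new samples arrive is one simple way to keep baselines current instead of rebuilding a capacity plan per project.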
Change and Configuration Management (CCM)
Pre / post migration configuration discovery and testing is essential to enable successful server consolidation. However, IT organizations that support tier 1 workloads cannot afford to perform these activities on a one-time project basis. Optimized infrastructures need continuous CCM not only for the workloads, but also for the infrastructure. In a highly dynamic environment, erroneous virtual infrastructure configuration can have drastic effects on VM performance. Comprehensive CCM involves:
- Providing ongoing configuration compliance with system hardening guidelines from the Center for Internet Security (CIS), hypervisor vendors, etc.
- Tracking virtual machines, infrastructure components, applications, and the dependencies between them on a continuous basis.
- Monitoring virtual infrastructure configuration and its association with workload performance.
Implementing comprehensive CCM for the virtual environment will not only help avoid configuration drift and its impact on workload performance, but also facilitate compliance with vendor license agreements and regulatory mandates such as Payment Card Industry Data Security Standards (PCI DSS), Health Insurance Portability and Accountability Act (HIPAA), Sarbanes-Oxley Act (SOX), etc.
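Configuration drift detection, at its simplest, is a comparison of the current settings of a host against an approved baseline. The sketch below shows that comparison under stated assumptions: the setting names and values are hypothetical placeholders, not actual CIS or vendor hardening items, and `detect_drift` is an illustrative helper rather than part of any CCM product.

```python
from typing import Any, Dict, List, Tuple

MISSING = "<missing>"  # marker for a setting absent on one side


def detect_drift(
    baseline: Dict[str, Any], current: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (setting, expected, actual) for every deviation from the baseline."""
    drift = []
    for key in sorted(set(baseline) | set(current)):
        expected = baseline.get(key, MISSING)
        actual = current.get(key, MISSING)
        if expected != actual:
            drift.append((key, expected, actual))
    return drift


if __name__ == "__main__":
    # Hypothetical baseline and discovered host configuration.
    baseline = {"ssh_enabled": False, "ntp_server": "ntp.example.com", "lockdown": True}
    current = {"ssh_enabled": True, "ntp_server": "ntp.example.com", "debug_shell": True}
    for setting, expected, actual in detect_drift(baseline, current):
        print(f"{setting}: expected {expected!r}, found {actual!r}")
```

Running a check like this on a schedule, rather than once per migration, is what turns configuration discovery into continuous CCM.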
Workload Migration
The workload migration process is easily the most complex component of an organization's virtualization endeavor. The migration process refers to the "copying" of an entire server / application stack. IT organizations face many challenges during workload migration - most end up migrating only 80-85% of target workloads successfully, and even then with considerable problems. Some of the challenges include:
- Migration in a multi-hypervisor environment, and possible V2P and V2V scenarios.
- The flexibility of working with either full snapshots or performing granular data migration.
- In-depth migration support for critical applications such as AD, Exchange, and SharePoint.
- Application / system downtime during the migration process.
Some hypervisor vendors offer free migration tools, but these often work poorly and can require the system to be shut down for several hours during the conversion. They might also limit the amount of data supported or require running tests on storage to uncover and address bad storage blocks in advance. Backup, High Availability (HA), and IP-based replication tools are a very good option for workload migration: they not only help overcome or mitigate the above-mentioned challenges, but can also provide comprehensive Business Continuity and Disaster Recovery (BCDR) capabilities.
From a process standpoint, ensure that the migrations are performed per a pre-defined schedule and include acceptance testing and sign-off steps to complete the process. Make sure contingency plans are in place, and factor in a modest amount of troubleshooting time so that minor issues can be worked out in real time and the migration of a workload completed in its scheduled window rather than rescheduling downtime later.
Privileged-User Management and System Hardening
Privileged users enjoy much more leverage in the virtual environment as they have access to most virtual machines running on a host - hence tight control of privileged user entitlements is essential. This task should ensure that:
- Access to critical system passwords is only available to authorized users and programs.
- Passwords are stored in a secure password vault and not shared among users or hard coded in program scripts.
- Privileged user actions are audited and the audit-logs are stored in a tamper-proof location.
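One common way to make audit logs resistant to silent modification is to chain each entry to the hash of the previous one, so that any later edit breaks verification. The sketch below illustrates that technique only; it is tamper-evident rather than tamper-proof, the entry fields and function names are assumptions, and a real deployment would additionally ship the log to separately controlled storage.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, user: str, action: str) -> str:
    """SHA-256 over the previous hash plus this entry's content."""
    payload = json.dumps(
        {"prev": prev_hash, "user": user, "action": action}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, str]], user: str, action: str) -> None:
    """Append an audit entry linked to the hash of the preceding entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {
            "user": user,
            "action": action,
            "prev": prev_hash,
            "hash": _digest(prev_hash, user, action),
        }
    )


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash; any edited or reordered entry fails the check."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(prev_hash, entry["user"], entry["action"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    audit: List[Dict[str, str]] = []
    append_entry(audit, "admin1", "power off vm-db01")  # hypothetical actions
    append_entry(audit, "admin2", "snapshot vm-app01")
    print(verify_chain(audit))
```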
In addition to privileged user management, which protects against internal threats, IT organizations need to secure their servers against malicious external threats. This includes installing antivirus/anti-malware software and making sure that the systems conform to the comprehensive system hardening guidelines provided by the hypervisor vendors.
Business Continuity and Disaster Recovery (BCDR)
BCDR has long been an essential requirement for critical applications and services. This includes backup, high availability and disaster recovery capabilities. However, server virtualization has changed the way modern IT organizations view BCDR. Instead of the traditional methods of installing and maintaining backup agents on each virtual machine, IT organizations should utilize tools that integrate with snapshot and off-host backup capabilities provided by most hypervisor vendors - thus enabling backups without disrupting operations on the VM and offloading workload from production servers to proxy ones. Activities within this task should ensure that:
- Machines are backed up according to a pre-defined schedule, and granular restores using push button failback are possible.
- Critical applications and systems are highly available, and use automated V2V or V2P failover for individual systems / clusters.
- Non-disruptive recovery testing capabilities are available to administrators.
The one-week timeline scheduled for this task assumes the existence of comprehensive BCDR plans for the physical workloads, which then only need to be translated into the virtual environment.
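Verifying that machines really are backed up according to the pre-defined schedule can be reduced to comparing each machine's last successful backup time against a maximum allowed age. The following sketch shows that check; the VM names, timestamps, and the 24-hour window are hypothetical, and in practice the timestamps would come from the backup tool's own reporting interface.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def overdue_backups(
    last_backup: Dict[str, datetime], now: datetime, max_age: timedelta
) -> List[str]:
    """Return the names of machines whose last backup is older than `max_age`."""
    return sorted(vm for vm, taken in last_backup.items() if now - taken > max_age)


if __name__ == "__main__":
    # Hypothetical inventory of last successful backup times.
    inventory = {
        "vm-app01": datetime(2011, 2, 3, 23, 0),
        "vm-db01": datetime(2011, 2, 1, 23, 0),
    }
    check_time = datetime(2011, 2, 4, 7, 30)
    print(overdue_backups(inventory, check_time, max_age=timedelta(hours=24)))
```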
Production Testing and Final Deliverables
The breadth and depth of post-migration testing will vary according to the importance of the migrated workload; less critical workloads might require only basic acceptance tests, while critical ones might necessitate comprehensive QA tests. In addition, this task should include follow-up on any changes that the migration teams should have applied to a VM but were unable to perform because of timing or the need for additional change management approval. All such post-migration recommendations should be noted, as appropriate, within the post-test delivery document(s).
This final stage of the implementation process should include the delivery of documentation on the conversion and migration workflow and procedures for all workloads. Doing so will remove dependency on acquired tribal knowledge and allow staffing resources to be relatively interchangeable. These artifacts and related best practices documents will also allow the continuation of the migration process for additional workloads in an autonomous fashion in the future if desired.
Conclusion
In Part 2, we focused on building and maintaining a mature and optimized infrastructure that is essential for IT organizations to virtualize tier 1 workloads and achieve increased capacity utilization on the virtual hosts. In Part 3, we will focus on tackling problems such as "VM sprawl" (the problem of uncontrolled workloads), increased provisioning and configuration errors, and the lack of a detailed audit trail - all of which significantly increase the risk of service downtime.
Copyright © 2011 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Birendra Gosai
Birendra Gosai has a Master's degree in Computer Science and over ten years of experience in the enterprise software industry. He has worked extensively on data warehousing, network & systems management, and security management technologies. He currently works in the virtualization management business at CA Technologies. You can view his blogs at: http://community.ca.com/members/Birendra-Gosai.aspx, or follow him on Twitter @BirendraGosai.

