Tuesday, July 23, 2013

Understanding and Validating vC Ops Sizing Recommendations

I got the same (well, similar) question today from two different customers, so I thought a post explaining the vC Ops sizing recommendation views and reports was in order.  You may be familiar with the Oversized VM and Undersized VM reports available in vC Ops.  I typically caution that these are RECOMMENDATIONS, and some additional research should be done before making a final call on resizing.

Let's look at an example from my lab.  I have a SQL database server which is showing up as undersized in vC Ops.


Note that the "% CPU Undersized" is slight (less than 3%) and there is no undersized concern for memory.

Now, a frequent misunderstanding is that the Workload score for a VM can be used to verify the sizing recommendation - this is not the case.  The capacity analysis is an average of the CPU and memory utilization over the time period specified in your configuration - by default, a daily average over 30 days.  (You can change this by going to "Configuration > Manage Display Settings > Edit > Non-Trend Views" and adjusting the "Interval to use:" and "Number of intervals to use:" settings.)

On the other hand, the Workload badge reflects the current state of resource utilization.  For example, my vCMDB server shows that memory is the most utilized resource when looking at Operations > Details > Workload Badge.  Memory is at 15% workload and also shows as the "BOUND BY" resource.

Note that vC Ops displays the BOUND BY label for the highest-scoring workload resource - it is a good indicator of a bottleneck if the workload is high, but in this case it is nothing to be concerned about.  Note also that the Dynamic Threshold for CPU (highlighted in yellow) is a pretty wide band.  This usually indicates "peaky" behavior for that resource, and thus a hint that while CPU is pretty docile at the moment, it's been known to spike.

So, it is important to understand this key difference - Health/Workload/Anomalies/Faults are all real-time indicators (well, nearly real-time - actually 5-minute granularity), while the capacity reports are AVERAGES over a given time period (by default, 30 days).
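
To make that difference concrete, here's a quick Python sketch (purely illustrative - the sample values are made up, and this is not the actual vC Ops math) showing how a VM can look calm on a 30-day average while its latest 5-minute sample is spiking:

    # Illustrative only: made-up samples, not the actual vC Ops analytics.
    # One CPU-demand sample every 5 minutes for 30 days.
    SAMPLES_PER_DAY = 24 * 60 // 5

    # A mostly idle VM that spikes to 95% in the most recent interval.
    samples = [12.0] * (30 * SAMPLES_PER_DAY - 1) + [95.0]

    capacity_view = sum(samples) / len(samples)  # what the sizing report averages
    workload_view = samples[-1]                  # what the Workload badge reflects

    print(f"30-day average (capacity): {capacity_view:.1f}%")       # ~12.0%
    print(f"latest 5-min sample (workload): {workload_view:.1f}%")  # 95.0%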

Given that, how can I validate the sizing recommendations?

This is where the Operations > All Metrics feature comes into play.  From here, I can chart any metric collected from vCenter for any object over a given time range.  For my purposes, I will chart the vCMDB server's CPU Demand and Memory Demand metrics over the past 30 days.
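
(If you'd rather pull the raw numbers than eyeball a chart, a sketch along these lines will fetch similar data straight from vCenter with pyVmomi.  The hostname, credentials, and VM name are placeholders, and whether the daily rollup retains the cpu.demand.average counter depends on your vCenter statistics level.)

    # Sketch: pull 30 days of CPU demand for a VM from vCenter via pyVmomi.
    # Hostname, credentials, and VM name are placeholders; counter retention
    # at the daily rollup depends on your configured statistics level.
    import ssl
    from datetime import datetime, timedelta

    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret",
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    perf = content.perfManager

    # Map "group.counter.rollup" names to vCenter counter IDs.
    counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
                for c in perf.perfCounter}

    # Find the VM by name (vCMDB in my lab).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "vCMDB")

    spec = vim.PerformanceManager.QuerySpec(
        entity=vm,
        metricId=[vim.PerformanceManager.MetricId(
            counterId=counters["cpu.demand.average"], instance="")],
        startTime=datetime.utcnow() - timedelta(days=30),
        intervalId=86400)  # daily rollups, matching the capacity report

    for result in perf.QueryPerf(querySpec=[spec]):
        for series in result.value:
            print(series.value)  # demand samples in MHz

    view.Destroy()
    Disconnect(si)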




First, notice that there is a blue line showing the metric's five-minute samples and a grey area behind the blue line.  The grey area is the Dynamic Threshold, and I have displayed it for the entire 30-day period by selecting the option indicated by the red arrow in the image.  The blue arrow indicates the default of displaying only the 24-hour Dynamic Threshold.

Now as you look at this, you should notice that CPU demand (the top graph) regularly and predictably hits 103% - that correlates very nicely with the sizing recommendation.  Recall that the report showed us undersized by about 2%-3%.

Another thing to note is that memory usage is pretty consistent at somewhere between 1.5GB and 2GB (with a notable exception around July 15th, where a server reboot caused an anomaly).  Thus, the recommendation that memory is not undersized is spot on as well.

Finally, the Dynamic Threshold for memory makes a really nice, granular shadow behind the blue line - the vC Ops analytics have pretty well nailed "normal" behavior for memory, and that's a good thing.  On the other hand, the CPU threshold seems to have been recalculated somewhere around July 18th - why would this happen?  Well, if you look closely at the graph, you may notice that the lower end of CPU utilization dropped after the server reboot on July 15th.  After about three days of this consistent "new" normal behavior, the analytics engine in vC Ops decided it was time to adopt this new normal and reset the Dynamic Threshold for this metric to a broader high/low band while the detailed granularity is worked out.
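
For a rough intuition of what's going on (my own simplification, not the actual vC Ops analytics), think of a dynamic threshold as a band built from recent history - when the underlying behavior shifts, the band is rebuilt from the new data and starts out wide:

    # Toy dynamic threshold (a simplification, not the real vC Ops analytics):
    # a band of mean +/- k standard deviations over a trailing window.  After
    # a behavior change, the window mixes old and new samples, so the band
    # widens until enough "new normal" data accumulates.
    from statistics import mean, stdev

    def dynamic_threshold(samples, window=288, k=3.0):
        """Return (low, high) bands for each sample based on trailing history."""
        bands = []
        for i in range(window, len(samples)):
            recent = samples[i - window:i]
            m, s = mean(recent), stdev(recent)
            bands.append((m - k * s, m + k * s))
        return bands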

A couple of final thoughts:

  - Always validate the recommendations against the metric graphs, and I recommend using the "Demand" based metrics, since they show what a VM would LIKE to use versus what it is actually USING.  Very important difference!  (See the sketch after this list.)
  - Consider your SLA.  For example, even though I'm undersized during the peak workload, that may not impact the ability to deliver the required service (think about a report that runs after hours, or a DB update - if it still completes within the SLA window, why add resources?).
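
On that first point, once you've exported a demand series and a usage series (for example cpu.demand.average and cpu.usagemhz.average, both in MHz), a hypothetical helper like this flags the intervals where the VM wanted noticeably more CPU than it actually got:

    # Hypothetical helper: flag intervals where demand exceeded usage by more
    # than `threshold` (e.g., 10%) - a hint the VM wanted CPU it didn't get.
    def contention_intervals(demand_mhz, usage_mhz, threshold=0.10):
        for i, (d, u) in enumerate(zip(demand_mhz, usage_mhz)):
            if u > 0 and (d - u) / u > threshold:
                yield i

    # Example with made-up samples:
    demand = [800, 950, 1200, 700]
    usage = [790, 900, 900, 700]
    print(list(contention_intervals(demand, usage)))  # [2]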

