Monday, December 5, 2016

The Self-Healing Data Center Part 5: Configure an Alert to Trigger the vRO Workflow

Now that we have configured the Translation Shim and vRealize Orchestrator, the next task before testing is to configure vR Ops to send an alert to the shim.  This is the final post in this series.  Once you complete it, you will be able to automate the response to any alert you choose by creating the appropriate workflow and alert settings.

I am using vR Ops version 6.4 in the steps below, but this should work with any vR Ops version 6.0 or higher.  We are also using Endpoint Operations to monitor the state of a service on a Linux OS.  Endpoint Operations is NOT required to use this shim; I am only using it because it provides an easy way to trigger an alert by stopping a service on a monitored OS.  It also shows that any automated remediation or activity is possible, not just automation of virtual infrastructure.



Monitor a Service with Endpoint Operations

In this case, I'll assume you have a Linux OS with the Endpoint Operations agent installed.  If you need help installing the agent, see the Endpoint Operations documentation.  In this example, we will monitor the VMtools service on the OS.  To set this up, browse to the OS resource in vR Ops and select Actions > Monitor OS Object > Monitor Process.


The Monitor Process form appears, and you just need to add the display name and process.query values.  I also open Advanced Settings and set the Collection Interval to 1 minute.  This is optional; I only do it here to speed up testing.  Otherwise, you'll wait for the normal 5-minute collection interval before the alert fires.
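Endpoint Operations inherits its process.query syntax from the Process Table Query Language (PTQL) used by Hyperic/SIGAR.  The exact value from the original screenshot isn't reproduced here, so treat a query like State.Name.eq=vmtoolsd as an assumption.  The sketch below is a deliberately simplified, hypothetical matcher for a handful of State.Name operators.  It is only meant to help reason about which processes a query selects; it is not how the agent evaluates queries.

```python
import re

# Hypothetical, simplified matcher for a single PTQL-style clause of the
# form "State.Name.<op>=<value>". Real PTQL supports far more than this.
OPERATORS = {
    "eq": lambda name, value: name == value,            # exact match
    "ne": lambda name, value: name != value,            # not equal
    "sw": lambda name, value: name.startswith(value),   # starts with
    "ew": lambda name, value: name.endswith(value),     # ends with
    "ct": lambda name, value: value in name,            # contains
    "re": lambda name, value: re.search(value, name) is not None,  # regex
}


def match_processes(query: str, process_names: list[str]) -> list[str]:
    """Return the process names selected by one State.Name clause."""
    # Split on the first "=" only, so regex values may contain "=".
    left, _, value = query.partition("=")
    parts = left.split(".")
    if len(parts) != 3 or parts[0] != "State" or parts[1] != "Name":
        raise ValueError(f"unsupported query: {query!r}")
    op = parts[2]
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    return [name for name in process_names if OPERATORS[op](name, value)]
```

An empty result for the eq query is, roughly, the situation the "Not running" symptom used later in this post is meant to catch.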


Now you have a new resource named vmtoolsd.


Create REST Alert Notification and Alert

Now that we have an object to monitor, we can alert on it.  First, we need to set up a REST notification plugin instance so that the alert can be routed to the shim (and then to Orchestrator).  Browse to Administration > Outbound Settings and click the green plus sign to add a notification instance.  You need two pieces of information here.  The first is the IP address of the host running the Translation Shim, which we installed in the second post of the series.  The second is the ID of the workflow we added to Orchestrator in the fourth post, which restarts a service on a VM.  If you imported the example workflow, its ID should be a0b8b820-b8c9-454e-b128-c4eabb9d0015, but double-check this in Orchestrator.

Plugin Type = Rest Notification Plugin
Instance Name = Orchestrator Shim Hello World (or whatever you wish)
Url = http://{IP or hostname of your shim}:5001/endpoint/vro/{ID of workflow to run}
User Name = leave blank
Password = leave blank
Content Type = application/json
Certificate thumbprint = leave blank
Connection count = leave at default (20)
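A typo in the workflow ID is an easy mistake to make here, and vR Ops won't catch it for you.  Here is a small sketch that assembles the Url value from the shim host and workflow ID and rejects an ID that isn't a well-formed UUID.  The host address 192.0.2.10 in the test is a placeholder assumption; the port 5001 and the /endpoint/vro/ path come from the setting above.

```python
import uuid

SHIM_PORT = 5001  # port the Translation Shim listens on in this series


def shim_url(shim_host: str, workflow_id: str) -> str:
    """Build the REST Notification Plugin Url for a given vRO workflow ID."""
    # uuid.UUID raises ValueError for a malformed ID, catching copy/paste
    # mistakes before they end up in the outbound settings.
    normalized = str(uuid.UUID(workflow_id))
    return f"http://{shim_host}:{SHIM_PORT}/endpoint/vro/{normalized}"
```

Paste the returned string into the Url field.  Normalizing through uuid.UUID also lowercases the ID, which is harmless if your workflow ID is already lowercase.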


Click the Save button.  If you click Test, it will report a failure.  The test does not work correctly with the Translation Shim, so you can safely ignore it.
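Because the built-in Test button isn't useful here, you may prefer to send a request to the shim yourself.  The sketch below uses only the standard library to prepare a JSON POST.  The sample alert fields are hypothetical and are not the documented vR Ops payload; the shim may reject them or pass odd values to the workflow.  Replacing them with a payload captured from the shim's own log is the safer option.  It also assumes the shim accepts a POST on the notification URL.

```python
import json
import urllib.request

# Hypothetical sample alert: field names and values are assumptions.
SAMPLE_ALERT = {
    "alertId": "test-0001",
    "alertName": "vmtoolsd not running",
    "resourceName": "vmtoolsd",
    "status": "ACTIVE",
}


def build_test_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to the shim endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_alert(url: str, payload: dict) -> int:
    """Send the request and return the HTTP status code.

    Careful: if the shim accepts it, this really runs the workflow.
    """
    req = build_test_request(url, payload)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping the request construction separate from sending makes it easy to inspect what would go over the wire before you actually trigger the workflow.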

Now we can create an alert to trigger when a monitored service is not running.  Browse to Content > Alert Definitions and click the green plus sign to add an alert definition.  Use the information that follows in the appropriate sections within the alert definition.

Section 1. Name and Description
  Name = Whatever you wish
  Description = Whatever you wish



Section 2. Base Object Type
  Base Object Type = MultiProcess (just type MultiProcess into the form and it will find the type)


Section 3. Alert Impact (these are suggestions)
  Impact = Risk
  Criticality = Symptom Based
  Alert Type and Subtype = Virtualization/Hypervisor : Compliance
  Wait Cycle = 1 
  Cancel Cycle = 1
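Wait Cycle and Cancel Cycle are counted in collection cycles, which is why the 1-minute collection interval set earlier shortens the test.  The following is a simplified model of that behavior under my own assumptions.  The alert becomes active once its symptom has been true for wait_cycle consecutive collections and cancels once it has been false for cancel_cycle consecutive collections.  It ignores symptom-level wait/cancel settings and everything else vR Ops does internally.

```python
def simulate_alert(symptom_by_cycle: list[bool],
                   wait_cycle: int = 1,
                   cancel_cycle: int = 1) -> list[str]:
    """Return the modeled alert state after each collection cycle."""
    states = []
    active = False
    true_run = false_run = 0  # consecutive cycles with symptom true / false
    for symptom in symptom_by_cycle:
        if symptom:
            true_run += 1
            false_run = 0
        else:
            false_run += 1
            true_run = 0
        if not active and true_run >= wait_cycle:
            active = True   # symptom held long enough: alert triggers
        elif active and false_run >= cancel_cycle:
            active = False  # symptom cleared long enough: alert cancels
        states.append("active" if active else "inactive")
    return states
```

For example, with both cycles set to 1, a service that is stopped for two collections yields ["inactive", "active", "active", "inactive"] for the inputs [False, True, True, False].  Multiplying the cycle counts by the collection interval gives a rough idea of how long the alert takes to appear and to clear.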

Section 4. Add Symptom Definitions
  Defined On = Self
  Symptom Definition Type = Metric / Property
  Symptom = use the filter to find "Not running" and drag this over to the Symptom work area


Section 5. Add Recommendation
  Do nothing.

Save the alert definition.  The last thing we need to do is create a notification rule.  Click Notifications, then click the green plus icon to create a new notification rule.  Use the following information.

Name = Whatever you wish
Method = Rest Notification Plugin > Orchestrator Shim Hello World (or whatever you named this)
Scope = Object : vmtoolsd
Notification Trigger = Alert Definition : (whatever you named the alert definition)
Advanced Filters > Alert Status = New (we only want this to trigger once)

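Conceptually, the notification rule is a filter: an alert is sent to the shim only when every condition in the rule matches.  The sketch below is a simplified, hypothetical model of the four settings above, not how vR Ops implements rules.  The status labels New and Updated are assumed from the Advanced Filters options, and the definition name "vmtoolsd Not Running" in the test is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    definition: str   # name of the alert definition that fired
    object_name: str  # object the alert was raised on
    status: str       # alert status label, e.g. "New" (assumed values)


@dataclass(frozen=True)
class NotificationRule:
    definition: str                 # Notification Trigger: alert definition
    scope_object: str               # Scope: the monitored object
    allowed_statuses: frozenset     # Advanced Filters > Alert Status


def should_notify(rule: NotificationRule, alert: Alert) -> bool:
    """Return True only if the alert satisfies every rule condition."""
    return (
        alert.definition == rule.definition
        and alert.object_name == rule.scope_object
        and alert.status in rule.allowed_statuses
    )
```

Limiting the allowed statuses to New is what, in this model, keeps later updates to the same alert from re-running the workflow.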

Save the rule.  Make sure the alert definition is enabled in the policy that applies to the monitored object so that it will be enforced.  I'm using the default policy, which is the effective policy for my Linux OS monitored resource.

Testing the Automated Remediation

Now we can trigger the alert by stopping the service on the monitored OS.  The vmtools service is currently running, so let's stop it.


Watch the vR Ops UI, refreshing for a few minutes, until the alert triggers.


The alert triggered.  Looking at the shim, we can see that it received the alert, translated it, and forwarded it to Orchestrator successfully.
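The "translation" step amounts to turning the vR Ops alert JSON into a request that Orchestrator's REST API understands.  The vRO REST API starts a workflow with a POST to /vco/api/workflows/{id}/executions whose body wraps each input as a typed parameter.  The sketch below shows that shape for string inputs.  Which alert fields the real shim forwards, and the input names the example workflow expects, are assumptions here, as is the 8281 default port; check your own workflow and deployment.

```python
def to_vro_parameters(alert: dict, keys: list[str]) -> dict:
    """Wrap selected alert fields as vRO string input parameters.

    Keys missing from the alert are skipped. Which keys to forward is a
    hypothetical choice; match them to your workflow's input names.
    """
    params = []
    for key in keys:
        if key in alert:
            params.append({
                "name": key,
                "type": "string",
                "scope": "local",
                "value": {"string": {"value": str(alert[key])}},
            })
    return {"parameters": params}


def execution_url(vro_host: str, workflow_id: str, port: int = 8281) -> str:
    """URL that starts a new execution of the given workflow (port assumed)."""
    return f"https://{vro_host}:{port}/vco/api/workflows/{workflow_id}/executions"
```

Seeing the body laid out this way can help when reading the shim's log: each forwarded alert field should appear as one entry in the parameters list.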


Looking at Orchestrator, we can see that the workflow did indeed run and was successful at starting the service.


If this wasn't successful for you, check the workflow logs in Orchestrator and note the reason.  The most likely cause is an authentication failure for the SSH command.  In that case, edit the workflow and set the SSH username and password attributes to an account that has permission to SSH to the target OS and start a service there.
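Before editing the workflow, it can save time to confirm by hand that the account can actually log in and start the service.  Here is a sketch that builds and optionally runs an equivalent ssh command.  The remote command systemctl start is an assumption; use whatever command the workflow runs on your distribution.  The account name svcadmin and address 192.0.2.20 in the test are placeholders.

```python
import shlex
import subprocess


def ssh_start_command(user: str, host: str, service: str) -> list[str]:
    """Build an ssh argv that starts a service on the target OS.

    The remote command is an assumption, not necessarily what the
    example workflow runs.
    """
    remote = f"systemctl start {shlex.quote(service)}"
    return ["ssh", f"{user}@{host}", remote]


def try_start(user: str, host: str, service: str) -> int:
    """Run the command interactively and return ssh's exit code.

    ssh prompts for the password here; a non-zero exit code points to
    the same credential or permission problem the workflow would hit.
    """
    result = subprocess.run(ssh_start_command(user, host, service))
    return result.returncode
```

If try_start succeeds with an account, that account is a reasonable candidate for the workflow's SSH attributes.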

Let's check back on the OS to make sure VMtools is running again.


Looks good.  What about vR Ops?  Has the alert cleared itself?


We can see that it did.

Summary

Congratulations!  If you followed each post in this series, you now have the tools to automate practically any alert response using vRealize Orchestrator and the Translation Shim.  There are many other use cases for this, for example:

- Automatically fix vSphere hardening guide alerts
- Open service desk tickets
- Update CMDB when configuration values change
- Notify someone if a service restart fails, in addition to attempting the restart

I would like to hear about your own use cases.  Share them here in the comments or on Twitter (mention me @johnddias if you do).
