Link to part 2
Link to part 3

Alerts v Alarms

In the world of vROps we often spend time talking about ‘Alerts’ and the need to respond to them, tune them, disable them, create custom ones and so on. What I don’t read much about is Alarms. Are they the same thing?

The answer is NO.

To be honest, I am not 100% clear on everything that generates an Alarm, but my understanding is that Alarms are actually Symptoms that have triggered after meeting the condition they are configured for.

I am sure you are aware that Alerts are made up of Symptoms (possibly many) and that a Symptom can be linked to multiple Alerts. Symptoms can also be tuned, created, copied and so on. Not all of them indicate a problem; some exist to identify simple things, such as an object being a member of a group. Symptoms are essentially what set the rules for when an Alarm is triggered or cancelled.

In most cases you probably don’t need to spend time worrying about the volume of active or cancelled alarms, but there are times when it becomes very important.

Large Environments

Large environments can often generate millions of alarms per month, and this creates a lot of data in the database. Once again, this is not typically an issue:

  • They will be groomed from the database once they reach their expiry date.
  • The expiry date is controlled from within the Global Settings.
  • Prior to Version 6.5 the default retention period for Alerts and Alarms was 90 Days.
  • I believe the new default for greenfield 6.5 deployments is 45 days.
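
As a rough illustration of what the retention setting means for table size: in a steady state, the number of stored rows is roughly the monthly volume multiplied by the retention period in months. The sketch below assumes a made-up figure of 2 million alarms per month and a 30-day month; `estimate_rows` is a hypothetical helper, not something that ships with vROps.

```shell
# estimate_rows ALARMS_PER_MONTH RETENTION_DAYS
# Rough steady-state row count, assuming a constant alarm rate
# and a 30-day month (both simplifying assumptions).
estimate_rows() {
    echo $(( $1 * $2 / 30 ))
}

estimate_rows 2000000 90   # assumed volume, 90-day retention
estimate_rows 2000000 45   # assumed volume, 45-day retention
```

With these assumed numbers the estimate comes out at 6,000,000 rows for 90 days and 3,000,000 for 45 days, so halving the retention period roughly halves what the database has to hold.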

However, sometimes the volume of alarms can start to have a very negative impact on the performance of your cluster, and I have experienced this first hand. In fact, although I did not witness it myself, it eventually resulted in the UI and Admin UI becoming unavailable.

The root cause was initially identified from the excessive IOPS that each node was experiencing (>7k) and regular ‘stop the world garbage collection’ notifications.

How do I know if the alarms are growing to excessive levels and what can I do to proactively deal with it?

If you have not yet enabled the vROps self-performance monitoring dashboards, I recommend you do. They are very useful!

Once you have enabled them, you will notice that the ‘Self Cluster Statistics’ dashboard shows, amongst other information, the volume of alarms. Selecting this also gives you a chart showing the trend over time.

Post6-Img1

What this shows is actually just the ACTIVE alarms, and it is perfectly acceptable to have MANY MANY active alarms. I have seen environments with 600,000 active alarms at any one time, sitting consistently around that figure.

What next?

In my experience so far, the best way to identify that performance is starting to drop is by looking at the IOPS for each of the nodes. In the chart below you can see the trend: the IOPS grow and grow, then drop, which is when we cleared the Alarms out of the database. They then start to grow again. This was over a period of about 5 months.

Post6-Img2
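
If you want to sanity check the charted IOPS from the node itself, one option is to take two copies of /proc/diskstats a known number of seconds apart and divide the change in completed reads and writes by the interval. This is a different technique from the dashboard chart above, and a minimal sketch only: the helper names and the device name `sda` are assumptions, and the demo snapshots are made-up numbers.

```shell
# ops_total SNAPSHOT DEVICE
# Prints reads completed + writes completed for DEVICE from a saved copy of
# /proc/diskstats (field 4 = reads completed, field 8 = writes completed).
ops_total() {
    awk -v d="$2" '$3 == d { print $4, $8 }' "$1" | {
        read -r reads writes
        echo $(( reads + writes ))
    }
}

# iops_between SNAP1 SNAP2 SECONDS DEVICE
# Average I/O operations per second between the two snapshots.
iops_between() {
    before=$(ops_total "$1" "$4")
    after=$(ops_total "$2" "$4")
    echo $(( (after - before) / $3 ))
}

# Demo with two made-up snapshots taken 10 seconds apart.
tmp=$(mktemp -d)
echo '8 0 sda 1000 0 0 0 2000 0 0 0 0 0 0' > "$tmp/t0"
echo '8 0 sda 31000 0 0 0 42000 0 0 0 0 0 0' > "$tmp/t1"
iops_between "$tmp/t0" "$tmp/t1" 10 sda
rm -r "$tmp"
```

On a real node the two snapshots would come from something like `cat /proc/diskstats > t0`, a `sleep 10`, then `cat /proc/diskstats > t1`, with the device name adjusted to whichever disk holds the database.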

So what about all the closed/cancelled alarms?

The alarms are stored in one of the databases on each node and to get the total count you need to SSH to each node and run a query against the database. From there you can also clear down the Alarms and Alerts.

**WARNING – I take no responsibility for any actions you take in your environment. Always consult your support provider if you are not sure!

Connect and log in to vROps using an SSH client and run the commands detailed below.

Get Alert Count

  • su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -d vcopsdb -A -t -c 'select count(*) from alert'"

Get Alarm Count

  • su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -d vcopsdb -A -t -c 'select count(*) from alarm'"

*You will need to run this on each node and add up the results to get the total count.
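
Adding up the per-node results can be sketched as a small helper that reads one count per line. `sum_counts` is a hypothetical name, and the three numbers in the example are made up purely to show the shape of the input.

```shell
# sum_counts: read one row count per line on stdin (one line per node)
# and print the cluster-wide total.
sum_counts() {
    total=0
    while read -r count; do
        [ -n "$count" ] || continue   # skip blank lines
        total=$(( total + count ))
    done
    echo "$total"
}

# Example with three made-up per-node counts.
printf '1200000\n950000\n1100000\n' | sum_counts
```

In practice you would paste in (or collect over SSH) the number each node returned from the count query, one per line.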

Clear down Alarm and Alert history

Connect and log in to vROps using an SSH client and run the commands detailed below.

*Warning
– I take no responsibility for the impact of the commands detailed below!

– Running these commands as-is will remove all the Alarm and Alert history, which is not particularly desirable in my opinion.

– You can actually choose to truncate alarms only – VMware support confirmed this was safe to do.

– DO NOT TRUNCATE ALERTS ONLY! If you are truncating alerts, you must also truncate alarms to prevent issues.

To delete the Alert and Alarm History follow this guide

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2123921

service vpostgres start
su - postgres -c "/opt/vmware/vpostgres/9.3/bin/psql -U vcops -d vcopsdb"
truncate table alert cascade;
truncate table alarm cascade;
\q
service vpostgres stop

*You will need to repeat on each node.
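
The rule from the warnings above (alarms on their own are fine, alerts on their own are not) can be captured in a small guard that only ever prints a safe set of statements. This is a sketch under that stated rule; `truncate_sql` and its mode names are hypothetical, and it only prints SQL rather than running anything.

```shell
# truncate_sql MODE
#   alarms -> truncate the alarm table only
#   all    -> truncate alert and alarm together
# Any other mode (including alerts only) is refused.
truncate_sql() {
    case "$1" in
        alarms)
            echo 'truncate table alarm cascade;'
            ;;
        all)
            echo 'truncate table alert cascade;'
            echo 'truncate table alarm cascade;'
            ;;
        *)
            echo "refusing mode '$1': use 'alarms' or 'all'" >&2
            return 1
            ;;
    esac
}

truncate_sql alarms
```

The printed statements are the same ones used in the psql session above, so the alarms-only variant simply leaves out the alert line.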

As I said, this is not particularly desirable as you lose all the history. Additionally, you are only masking the problem and buying yourself some time.

So what else can I do? I will cover this in part 2.
