Using Seyren/Graphite to Monitor Cluster

Asked by RafaelRP

Hi,

In our environment, Graphite receives metrics from various nodes running CollectD. The metrics sent by CollectD have the following format: collectd.service_<name>.cluster_<name>.<hostname>.<metric-name>
For example, system cpu percentage: collectd.service_helloworld.cluster_hello.host1.cpu-0.cpu-system

I'm using Seyren as the alerting dashboard and would like to set up an alert when the average CPU usage for a machine exceeds a certain threshold. One way of accomplishing this is to define a Seyren check for each host as follows:

averageSeries(collectd.service_A.cluster_A.hostname_1.cpu-*.cpu-system)
averageSeries(collectd.service_A.cluster_A.hostname_2.cpu-*.cpu-system)
averageSeries(collectd.service_A.cluster_A.hostname_3.cpu-*.cpu-system)
averageSeries(collectd.service_A.cluster_A.hostname_4.cpu-*.cpu-system)

However, as you can see, if you have a large cluster then defining and maintaining such checks becomes unwieldy.

So my question is, is it possible to formulate one graphite query that I could feed in as a target to a Seyren check that would accomplish what I described above? Has anyone else using Seyren/Graphite faced something similar? If so, how did you resolve it?

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Anatoliy Dobrosynets

Best Anatoliy Dobrosynets (anatolijd) said :
#1

It's not very clear what exactly you want, but have you tried using

averageSeriesWithWildcards(collectd.service_A.cluster_A.*.cpu-*.cpu-system,4)

or

averageSeriesWithWildcards(nonNegativeDerivative(collectd.service_A.cluster_A.*.cpu-*.cpu-system),4)

?
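
For readers unfamiliar with averageSeriesWithWildcards, here is a rough Python sketch of what the query above does conceptually: the node at position 4 (the cpu-* node, counting from 0) is dropped from each series name, and series that end up with the same name are averaged point by point. This is only an illustration, not Graphite's actual implementation; the function name, the dict-of-lists input, and the sample values are all made up for the example, and it ignores missing (None) datapoints.

```python
from collections import defaultdict


def average_series_with_wildcards(series, position):
    """Illustrative stand-in for Graphite's averageSeriesWithWildcards.

    `series` maps a dotted metric name to a list of values (hypothetical
    input shape). The node at `position` is removed from each name and
    series sharing the resulting name are averaged pointwise.
    """
    groups = defaultdict(list)
    for name, values in series.items():
        nodes = name.split(".")
        key = ".".join(n for i, n in enumerate(nodes) if i != position)
        groups[key].append(values)

    # zip(*rows) walks the grouped series one timestamp at a time
    return {
        key: [sum(col) / len(col) for col in zip(*rows)]
        for key, rows in groups.items()
    }


# Made-up sample data: two hosts with two cores each
series = {
    "collectd.service_A.cluster_A.hostname_1.cpu-0.cpu-system": [10.0, 20.0],
    "collectd.service_A.cluster_A.hostname_1.cpu-1.cpu-system": [30.0, 40.0],
    "collectd.service_A.cluster_A.hostname_2.cpu-0.cpu-system": [50.0, 70.0],
    "collectd.service_A.cluster_A.hostname_2.cpu-1.cpu-system": [70.0, 90.0],
}

per_host = average_series_with_wildcards(series, 4)
for name, values in sorted(per_host.items()):
    print(name, values)
```

The result is one averaged series per host, which is why a single target can stand in for the per-host averageSeries checks listed in the question.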

RafaelRP (rpolan01) said :
#2

Hi Anatoliy,

Essentially, I would like a query that evaluates to multiple targets. In the example I gave above, each averageSeries function would be a target for one Seyren check for a given host. So in this case, I'll have 4 Seyren checks corresponding to 4 machines.

I just experimented with your query, averageSeriesWithWildcards(collectd.service_A.cluster_A.*.cpu-*.cpu-system,4), and this is exactly what I was looking for. Thanks.

RafaelRP (rpolan01) said :
#3

Thanks Anatoliy Dobrosynets, that solved my question.

Jason Dixon (jason-dixongroup) said :
#4

Are you sure you don't want something like maxSeries() instead? It makes me sad to think you're alerting on the average of anything.

RafaelRP (rpolan01) said :
#5

Hi Jason, I'm looking to monitor a cluster that runs a business-critical service. As a first step, I decided to set up some very basic alerts for each machine, such as CPU utilization (across all cores). My thinking was that an alert when the average system CPU usage rises above a certain threshold (~90%) would warrant taking a look at the workload and taking some action (e.g. adding more resources).

What's wrong with alerting on the average?

RafaelRP (rpolan01) said :
#6

Moreover, I don't want to be alerted when there is a spike in utilization on one of the cores of the machine but only when there is a sustained spike upwards.
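
To make the "sustained" part concrete, here is a small Python sketch of the rule I have in mind: only treat the host as breaching when the most recent few datapoints are all above the threshold. The function name, the window size, and the sample values are hypothetical, and this is not how Seyren or Graphite evaluate checks internally; it only illustrates the idea.

```python
def sustained_breach(values, threshold, min_points):
    """Return True only if the last `min_points` samples all exceed `threshold`.

    `values` is a list of datapoints, oldest first (hypothetical input shape).
    A missing sample (None) in the window counts as "not breaching".
    """
    if min_points <= 0 or len(values) < min_points:
        return False
    recent = values[-min_points:]
    return all(v is not None and v > threshold for v in recent)


# A single spike followed by recovery does not trigger
print(sustained_breach([10.0, 95.0, 20.0, 30.0], 90.0, 3))

# Three consecutive high samples do trigger
print(sustained_breach([50.0, 91.0, 95.0, 97.0], 90.0, 3))
```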

Jason Dixon (jason-dixongroup) said :
#7
RafaelRP (rpolan01) said :
#8

Thanks. I could have Googled it myself, but I was looking forward to reading your explanation/opinion on the subject (blogpost suggestion?) ;)