Is there a Threshold plugin for Graphite?

Asked by Rich Francis

We are currently moving from Ganglia to Graphite. Under Ganglia, I had written a plugin for gmetad-python that simply generated an alert whenever a metric value breached a threshold.

It looks like this isn't currently possible in Graphite, but I know I can't be the only person who wants to do something like this.

I just wondered what strategies people out there use to achieve this?

Is there a native way to do this in Graphite (which I've missed)?

Is there a plugin architecture planned (or one I've missed)?

Where would be a good starting point? I'm thinking carbon would be the most logical place, although I can see you can add a threshold line to a graph. Could something be extended here?

Thanks,
Rich

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee

Aman Gupta (tmm1) said :
#1

This belongs at a layer above Graphite, so that the alerting can take advantage of functions such as summarize. The URL API supports returning raw data as JSON or CSV instead of a graph.
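As a rough illustration of that approach, here is a minimal Python sketch that asks the render API for JSON and flags any series whose latest non-null value is above a limit. The host name, the metric name, and the helper names (render_url, breaches, check) are placeholders made up for this example, not anything Graphite ships.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPHITE_URL = "http://graphite.example.com"  # assumption: placeholder host


def render_url(target, window="-5min", base=GRAPHITE_URL):
    """Build a render API URL that returns raw JSON instead of a graph."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base}/render?{query}"


def breaches(series_list, threshold):
    """Map each target to its latest non-null value if that value exceeds threshold.

    series_list is the decoded JSON from the render API: a list of
    {"target": name, "datapoints": [[value, timestamp], ...]} entries,
    where value may be null (None) for intervals with no data.
    """
    result = {}
    for series in series_list:
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values and values[-1] > threshold:
            result[series["target"]] = values[-1]
    return result


def check(target, threshold):
    """Poll Graphite once and return the breaching series (needs network access)."""
    with urlopen(render_url(target), timeout=10) as resp:
        return breaches(json.load(resp), threshold)
```

Something like check("servers.web01.load", 4.0) could then run from cron or a monitoring check, with the result handed to whatever sends the alerts.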

Rich Francis (rf-4) said :
#2

Thanks Aman. I can see two issues there. First, you will only see new values each time you poll, which will be some time after a breach has been reported (e.g. 1 minute after). Second, you are reliant on the web front end running in order to get the metrics. What do other people do? Is this a common pattern?

I've read in a few places that people are running collectd and Graphite in parallel. Is this to keep threshold alerting?

Nicholas Leskiw (nleskiw) said :
#3

First, Graphite is not a monitoring tool. It's a graphing tool. It just happens that the data used to make graphs is usually the same data people want to alarm on ;)

Second, this is a very common pattern. Most people quickly find out that simple "if X goes above Y" alerts are too chatty: they just spam your alarm console and send useless mails. The SLA (Service Level Agreement) at my organization for detecting problems is 5 minutes, so we compare 5-minute averages against the average for the same time last week, the week before, and the week before that (and sometimes further back) to detect problems. Cry wolf too many times and people stop reading the alarms coming from your system.
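To make the week-over-week comparison concrete, here is a small Python sketch of that kind of check. It is not the code we run. The function names and the 50% tolerance are invented for the example, and fetching the 5-minute windows from Graphite is left out.

```python
from statistics import mean


def window_average(values):
    """Average the non-null points of one window, or None if it has no data."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


def deviates(current, previous_weeks, tolerance=0.5):
    """Return True if the current window's average differs from the baseline
    (the mean of the same window's average in earlier weeks) by more than
    the given fraction. tolerance=0.5 is an arbitrary illustrative value.
    """
    now = window_average(current)
    history = [a for a in (window_average(w) for w in previous_weeks)
               if a is not None]
    if now is None or not history:
        return False  # not enough data to judge, so stay quiet
    baseline = mean(history)
    if baseline == 0:
        return now != 0
    return abs(now - baseline) / abs(baseline) > tolerance
```

Staying quiet when data is missing is a deliberate choice in this sketch. Whether a gap in the data should itself raise an alarm is a separate decision.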

As for being 'reliant on the web frontend running', we're talking about Apache here. It's pretty rock-solid, and I've never seen it crash, not even after requesting the past 5 minutes on 5,000 metrics as a test. You may get some errors if you're stringing together some strange functions, but that just makes a single request fail, not the whole system. If we were using Nagios, we'd be 'reliant on the email system' and 'reliant on the SNMP system'.

I don't use collectd, so I can't speak to that.

Michael Leinartas (mleinartas) said :
#4

You may want to check out https://github.com/wayfair/Graphite-Tattle, which was just introduced at ESC conf. It's a PHP webapp that supports dashboarding and alerting based on Graphite data.

chrismd (chrismd) said :
#5

It is useful to think of Graphite as a big time-series database that happens to have a UI (because that's pretty much all it is). If you were storing your datapoints in a MySQL database, would you implement your alarming using SELECT statements? For alarming logic that needs to look at history, compare many metrics, or calculate trends, that might be the way to go. For simple threshold comparisons, though, I would handle that before I even store the data in the database (or at least separately). An approach I've taken in the past is to aggregate all realtime monitoring data in one place (well, one sharded service) for the purposes of realtime analysis and then stream that aggregated data into Graphite. I have also used cron job scripts that pull a larger dataset to do the fancier analysis mentioned above.
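For the simple threshold case, a sketch of "check it before you store it" might look like the Python below: each datapoint is compared against a limit and then forwarded to carbon using its plaintext line format ("metric value timestamp"). The THRESHOLDS table, the metric name, and the function names are hypothetical, and the host and port are assumed to be a default carbon plaintext listener on localhost:2003.

```python
import socket

# Hypothetical per-metric limits; a real setup would load these from config.
THRESHOLDS = {"servers.web01.load": 4.0}


def format_line(metric, value, timestamp):
    """Format one datapoint in carbon's plaintext protocol."""
    return f"{metric} {value} {int(timestamp)}\n"


def check_threshold(metric, value, thresholds=THRESHOLDS):
    """Return True if the metric has a configured limit and value exceeds it."""
    limit = thresholds.get(metric)
    return limit is not None and value > limit


def handle(metric, value, timestamp, send, alert):
    """Alert on a breach first, then forward the datapoint for storage."""
    if check_threshold(metric, value):
        alert(f"{metric} = {value} exceeds {THRESHOLDS[metric]}")
    send(format_line(metric, value, timestamp))


def send_to_carbon(line, host="localhost", port=2003):
    """Send one line to carbon (assumes a plaintext listener at host:port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling handle("servers.web01.load", 5.5, 1300000000, send_to_carbon, print) would print the alert and then forward the point. Passing send and alert in as callables keeps the threshold logic testable without a running carbon daemon.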
