Collecting system metrics

Asked by Pete Emerson

In order to collect generic system metrics (CPU usage, load averages, CPU I/O wait, memory usage, network bytes in / out, TCP connections, et cetera) I'm using gmond from the Ganglia project to send that data to a gmond collector. I then have a script that runs every minute and talks directly to the carbon agent running on port 2003.

I'd like to get rid of the dependency on gmond if possible, and I'm wondering what you recommend to pull this sort of data out of a server and get it into Graphite.

Thanks,
Pete

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
chrismd (chrismd) said:
#1

Unfortunately I have always used custom-built monitoring agents for this purpose, and so far I have been unsuccessful in convincing my various employers to let me release them publicly. I have heard of some people working on integrating with collectd (http://collectd.org/) but I have never tried it personally.

I've found that building a monitoring agent generally isn't too hard and can be quite educational; it's also fun because you can do it one piece at a time. Plugin-based is always the way to go, and simplicity and flexibility are the keys to making a good plugin API. The tool I am currently using is a Python daemon that executes plugin scripts once a minute and simply captures their stdout, which is a pickled dict (this is very flexible but tied to Python; JSON would be more general-purpose) containing lists of metrics, errors, and descriptive information about the application the plugin is monitoring. My plugins use a simple library of convenience functions for common operations (outputting their results, scraping a log, saving/loading their state, etc.). My agents send the data collected by their plugins to a central server, which applies alarming logic, computes synthetic metrics, and ultimately forwards the data on to Graphite.
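
To make the shape of it concrete, here is a very rough sketch of that kind of plugin runner. This is not my actual agent; the plugin directory, the dict keys, and the forwarding step are all made up for the example:

import glob
import pickle
import subprocess
import time

PLUGIN_DIR = '/etc/metric-agent/plugins'  # hypothetical location

def run_plugins():
    results = []
    for plugin in sorted(glob.glob(PLUGIN_DIR + '/*')):
        proc = subprocess.Popen([plugin], stdout=subprocess.PIPE)
        output, _ = proc.communicate()
        if proc.returncode != 0:
            continue  # a real agent would record this as a plugin error
        # each plugin prints a pickled dict, e.g. {'metrics': [...], 'errors': [...]}
        results.append(pickle.loads(output))
    return results

while True:
    for result in run_plugins():
        print(result)  # stand-in for forwarding to the central server / Graphite
    time.sleep(60)

Swap pickle for JSON in the plugins and the loads call, and the same structure becomes language-agnostic.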

If you or anyone else is interested in working on such a tool, I would be more than happy to help.

chrismd (chrismd) said:
#2

Another approach, for any shell scripting ninjas out there: you could conceivably do something as simple as this:

vmstat 60 | awk '{big-line-of-awk-code}' | nc graphiteserver 2003

And so on for sar, iostat, etc...

Pete Emerson (pete-appnexus) said:
#3

How far have you seen direct TCP connections to the carbon agent on port 2003 scale?

Moving to this model means all of the servers hit that port directly, instead of having gmond collect the data and a single process iterate through all of the hosts at once. I can see advantages and disadvantages to both.

Pete

chrismd (chrismd) said:
#4

It should scale well up to at least a few thousand clients. I haven't tested with more than ~2,000 or so, but at that level CPU usage was still nominal. This is assuming you're on Linux and are using the Twisted epoll reactor. Try this code from a Python interpreter:

from twisted.internet import epollreactor
epollreactor.install()

If you get no errors then you are good to go. epoll is far more scalable than the default reactor; carbon attempts to use epoll but will fall back to the default select reactor if it is not available. The select reactor does not work above approximately 1,024 clients and is much more CPU-intensive.

Also, this assumes you're on at least version 0.9.6, where carbon-cache.py is built on the Twisted framework.

Pete Emerson (pete-appnexus) said:
#5

When you tested with a few thousand clients, were your whisper files on a RAM disk or on a platter? We're currently using a RAM disk, but I'd like to use a SCSI RAID setup if possible so that in the event of a hard crash I don't have to restore from a nightly backup.

chrismd (chrismd) said:
#6

I've never actually tried using a RAM disk with Graphite; I've always used RAID arrays of disks. I've heard good things about solid-state drives but never tried them myself. In theory they should be ideal because they don't have the seek-time bottleneck that traditional hard drives do. But as I said, I've never tried it personally and I don't know of any hard data that shows Graphite's performance on such a system.

If you're using epoll then the number of clients really isn't much of a factor; it is extremely efficient at handling the network traffic (the efficiency is thanks to the kernel). If the data you are sending to carbon isn't changing, that is, you're sending the same number of metrics just from more clients, then you shouldn't notice much difference from carbon's perspective. However, if you are increasing the number of metrics then that is different.

I can give you some more advice if you tell me a bit more about your system and the change you are making.
What is your current rate of datapoints being sent to carbon? What do you estimate the new rate to be?
How much CPU does carbon-cache currently use on average?
Are you using the plain-text protocol (port 2003) or the pickle protocol (port 2004)? If carbon-cache is using too much CPU, switching from the plain-text protocol to the pickle protocol can save you a bunch of overhead.
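
For reference, a minimal pickle-protocol sender looks roughly like this (the metric path and server name are placeholders; the listener expects a four-byte length header followed by a pickled list of (path, (timestamp, value)) tuples):

import pickle
import socket
import struct
import time

# one datapoint per tuple: (metric path, (timestamp, value))
metrics = [('servers.web01.loadavg.1min', (int(time.time()), 0.42))]

payload = pickle.dumps(metrics, protocol=2)
header = struct.pack('!L', len(payload))  # four-byte big-endian length prefix

sock = socket.socket()
sock.connect(('graphiteserver', 2004))  # carbon's pickle listener
sock.sendall(header + payload)
sock.close()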

Pete Emerson (pete-appnexus) said:
#7

I'm still running 0.9.4 and am starting in on migrating to 0.9.6.

We're currently running a single server, using gmond / gmetric to collect our stats; a cron job then whips through all of those stats on the metrics box and opens TCP sockets to the carbon agent on port 2003. A small set of metrics go direct to carbon.

I'm migrating to 0.9.6 with federated servers (one per datacenter), and having stats sent directly to the carbon agent. I'm not sure whether I'll pickle or not; I suppose if pickling saves overhead I'd do that.

I'd love to help provide a collection process per Question 136096; I'll see if I can get permission to open-source what I write.

Currently the cron job is updating over 100k metrics every minute (it runs in about 6 seconds) to a 15GB tmpfs (not solid-state drives, so if the box goes down, we lose all the metrics).

The hardware underneath is currently a Dell PowerEdge M600 with a quad-core Xeon L5420 (2.5GHz, 2x6MB cache), 16GB of 667MHz memory, and 2x94GB 10k RPM SAS drives in RAID-1.

I can move this to 8 disks in a RAID-10 configuration if I need to, and *might* be able to get dual-proc quad cores as well.

In addition, the server is running Xen, so there is a bit of overhead there. If I really need to, I can probably get this layer stripped out, but I'd definitely rather leave it in.

Pete

Pete Emerson (pete-appnexus) said:
#8

As far as a collection agent is concerned, I'd love to contribute back to Graphite in this way. My Python is very raw, however (I'm probably writing very Perlish Python). I'd also need to get it approved by the powers-that-be.

It seems that something as simple as:

import commands  # Python 2 stdlib; captures the exit status and output of a shell command
import glob

for file in glob.glob('path/to/plugins/*'):
    exit_code, result = commands.getstatusoutput(file)
    if exit_code != 0:
        print "Failed to get info from " + file
        continue
    # SEND DATA TO GRAPHITE

would be a minimum viable product. Add a configuration file via ConfigParser for settings (like debug mode, Graphite server / port, et cetera) and it would be in good shape. Add daemon support for those who don't want to run it from cron and it'd be even better.
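
For the "SEND DATA TO GRAPHITE" step I'm picturing something like this minimal sketch, assuming each plugin already prints lines in the plain-text "metric value timestamp" format (the server name is a placeholder):

import socket

def send_to_graphite(lines, server='graphiteserver', port=2003):
    # lines is a list of strings already formatted as 'metric value timestamp'
    sock = socket.socket()
    sock.connect((server, port))
    sock.sendall('\n'.join(lines) + '\n')
    sock.close()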

Pete

kraig (kamador) said:
#9

I've written a plugin-based poller that we use to collect JMX data, performance from URLs, regex values off of URLs, and probably some other things I am forgetting. It builds its configuration from a host:port list of targets taken from our VIPs plus a manually maintained list of hosts that are not VIP targets. From there the ports are mapped back to a service-definition YAML which is used to track port allocations; in there, the different plugins are configured for each service.

I'll see if I can open-source that, although I imagine the VIP configuration parsing will be useless to most people, and potentially the service configuration file structure as well.

In the meantime, I committed a simple collector I wrote which you can run to collect system metrics. Some of it is a little kludgy; you will find it in the contrib directory.

http://bazaar.launchpad.net/~graphite-dev/graphite/main/annotate/head:/contrib/demo-collector.py

--
Kraig Amador

Matt O'Keefe (matthewokeefe) said:
#10

Even better than SSDs behind a RAID card would be something like a Fusion-io PCIe card. Expensive, but very fast. Might save $$ in terms of a reduced number of servers.

-Matt
