Can't seem to be able to tune carbon for AMQP

Asked by Patrick O'Brien

Hey all,

We've been trying out different solutions for getting data from collectd to Graphite, and it looks like we've hit a wall hooking AMQP into carbon. No matter what I try, I can't get carbon to consume anywhere near as fast as we produce metrics. I've tried various combos of both small and large caches with both small and large writes_per_second and creates_per_minute, but no matter what we do the message queue fills up a lot faster than carbon can consume.

Right now we're seeing about 34920 messages backlogged per minute in our message queue (582 per second).

The carbon host has four 2 TB FC LUNs (it's 3PAR, so it's a lot of 15k FC drives backing it, with the files spread out over the entire drive cluster) striped on the host using LVM, for a total of ~8 TB. Regardless of what combo of cache settings we put in, we only see about 0.7% IO wait on the carbon host, and LVM is set up to spread the load across all 4 LUNs, so I don't think that's the issue. From what I've read, the sweet spot should be around ~50% IO wait on the disks.

I've also tried this going to local disks (15k SAS in RAID1) and we get the same results.

We're using ext4 with the deadline scheduler on RHEL6. The carbon host itself is a Dell M610, 8-core Xeon X5570 @ 2.93GHz, with 48GB of RAM. While carbon is off, the IO utilization is 0%.

Our rabbitmq cluster is 2 hosts with the same specs sans the FC LUNs.

Here is what we are using to get collectd data => message queue: https://github.com/poblahblahblah/collectd-http-carbon

Here is our storage-schemas.conf file: https://gist.github.com/e88bc325926940d300d6

Here is our carbon.conf file: https://gist.github.com/be63c1beae01b067600d

Here is a bonnie++ run: https://gist.github.com/001eee920613aa30b42a

If we kill the unicorn process that sends the processed metrics to rabbitmq, carbon eventually catches up and the graphs are updated as expected. Are we doing too many updates, or do we just need to look into some intelligent way to split the data up amongst multiple servers, with different patterns per carbon-cache process?

Let me know if there is any other data which would help out. I am sure I am just doing something dumb with carbon.

Question information

Language:
English
Status:
Solved
For:
Graphite
Solved by:
chrismd
Best chrismd (chrismd) said :
#1

Unfortunately there is a major performance bottleneck in the txamqp library carbon uses to speak AMQP, at least there was last time I tested it several months ago. Basically it's because the serialization/framing of the AMQP protocol is all done in pure Python, so it is not very CPU efficient. You'll probably see carbon using 100% CPU all the time even when you send it only a couple thousand metrics.

My suggestion is to send datapoints via the plaintext protocol, or, if that too becomes a CPU bottleneck, via the pickle protocol (described in a few past questions on this answers forum). If you want to press on with AMQP, though, I would suggest setting AMQP_METRIC_NAME_IN_BODY = True in your carbon.conf and changing the format of each message to be the same as the plaintext protocol (i.e. metric value timestamp\n, rinse and repeat). This will let you pack more datapoints into each message to reduce overhead.
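To illustrate the format chrismd describes: each datapoint is a `metric value timestamp` line, and many lines can be batched into one TCP write (or into one AMQP message body when AMQP_METRIC_NAME_IN_BODY = True). A minimal sketch, assuming carbon-cache is listening on its default plaintext port 2003; the hostname and metric names here are made up for illustration:

```python
import socket
import time

def format_plaintext(metrics):
    """Render (name, value, timestamp) tuples as carbon plaintext lines.

    The same newline-delimited format works as an AMQP message body
    when AMQP_METRIC_NAME_IN_BODY = True is set in carbon.conf.
    """
    return "".join("%s %s %d\n" % (name, value, ts) for name, value, ts in metrics)

def send_plaintext(metrics, host="carbon.example.com", port=2003):
    """Batch many datapoints into a single TCP write to carbon's plaintext listener."""
    payload = format_plaintext(metrics).encode("ascii")
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(payload)
    finally:
        sock.close()

if __name__ == "__main__":
    now = int(time.time())
    send_plaintext([
        ("collectd.web01.load.shortterm", 0.42, now),
        ("collectd.web01.load.midterm", 0.35, now),
    ])
```

Batching several datapoints per write (or per message) is what cuts the per-metric protocol overhead.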

chrismd (chrismd) said :
#2

Also, you could write a script that pulls your metrics out of the AMQP queues using a more efficient AMQP library and then feeds them into carbon through plaintext or pickle. The reason carbon itself cannot switch to a more efficient AMQP library is that no other implementations exist for the Twisted framework, which carbon is built on.
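A rough sketch of that out-of-band drain, using pika (a non-Twisted AMQP client) to consume messages and relay each body to carbon's plaintext port. The queue name, hostnames, and ports are assumptions for illustration, and the calls shown are the pika 1.x API (older pika versions used a different `basic_consume` signature):

```python
import socket

def with_newline(body):
    """Carbon's plaintext reader is line-delimited; ensure a trailing newline."""
    return body if body.endswith(b"\n") else body + b"\n"

def drain(amqp_host="rabbit.example.com", queue="graphite",
          carbon_host="carbon.example.com", carbon_port=2003):
    """Consume from RabbitMQ and relay each message to carbon over plaintext."""
    import pika  # deferred so the helper above works without pika installed

    carbon = socket.create_connection((carbon_host, carbon_port))
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=amqp_host))
    channel = conn.channel()

    def on_message(ch, method, properties, body):
        carbon.sendall(with_newline(body))
        # Ack only after the relay succeeds, so a crash doesn't drop datapoints.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    drain()
```

This keeps the CPU-heavy AMQP framing out of the carbon process entirely; carbon only sees cheap plaintext input.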

Patrick O'Brien (obrien-patrickl) said :
#3

Thanks chrismd, that solved my question.