help tuning carbon

Asked by harabk

We are running one server with carbon on it. It is a 2-core, 2 GB Linux VM, and storage is on an SSD SAN. We use the following configuration:

CACHE_WRITE_STRATEGY = max
MAX_CACHE_SIZE = inf
USE_FLOW_CONTROL = True
WHISPER_FALLOCATE_CREATE = True
MAX_CREATES_PER_MINUTE = 100
MAX_UPDATES_PER_SECOND = 1000
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = True
LOG_LISTENER_CONNECTIONS = False
LOG_UPDATES = False
ENABLE_LOGROTATION = True
WHISPER_AUTOFLUSH = False

We are seeing carbon crash multiple times a day because it runs out of RAM. Additionally, we are seeing 100% of one core utilized (with the other core sitting idle). Other relevant stats: iostat utilization is below 5%, avgUpdateTime is increasing, and metricsReceived is between 60k and 150k.

The latest change we have tried is setting CACHE_WRITE_STRATEGY = naive, which has helped CPU utilization (dropped to 15%) and increased iostat utilization to between 5% and 30%. We haven't seen a crash so far, but the change was made only recently.

We have been reading through what we think are relevant forum posts in trying to tune this configuration and would really appreciate any insight/tips on how to move forward. We would also like to avoid dropping data points, if possible.

Question information

Language: English
Status: Expired
For: Graphite
Assignee: No assignee
Denis Zhdanov (denis-zhdanov) said :
#1

Just a short reply from my phone: stick with the naive write strategy (max is
very CPU-intensive and unusable because of that) and limit MAX_CACHE_SIZE
to some sane value (it depends on your load, e.g. enough to hold 10 or 20-30
minutes of metrics flow). That will help with RAM consumption.
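
A rough sketch of that sizing arithmetic, in case it helps. MAX_CACHE_SIZE is counted in datapoints, so the value is simply the inflow rate multiplied by the window you want to buffer. This assumes the metricsReceived figure from the question is a per-minute count, which the thread does not state; the function name is made up for illustration.

```python
def max_cache_size(points_per_minute: int, minutes_to_hold: int) -> int:
    """Datapoints the cache must hold to buffer the given window of inflow."""
    return points_per_minute * minutes_to_hold


# Upper end of metricsReceived from the question (assumed to be per minute).
peak_points_per_minute = 150_000

for minutes in (10, 20, 30):
    size = max_cache_size(peak_points_per_minute, minutes)
    print(f"{minutes} min of flow -> MAX_CACHE_SIZE = {size}")
```

Whether the resulting number of datapoints fits in 2 GB of RAM has to be checked on the box itself, since the per-datapoint memory cost is not given here.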

On Thu, 8 Oct 2015 at 22:53, harabk <email address hidden>
wrote:

> New question #272229 on Graphite:
> https://answers.launchpad.net/graphite/+question/272229

Jason Dixon (jason-dixongroup) said :
#2

As Denis says, you need to limit MAX_CACHE_SIZE. You're attempting to write to a SAN, which is a big no-no; your IOPS are almost certainly inadequate, which is causing your cache memory to fill with datapoints and carbon-cache to crash.

You need to determine how many IOPS your SAN is actually capable of delivering over the wire, then set MAX_UPDATES_PER_SECOND somewhere below that. If your disk cannot deliver IOPS at the same rate as datapoints arrive in Carbon, then your goal should be to increase the number of datapoints batched per write (pointsPerUpdate). By lowering MAX_UPDATES_PER_SECOND you should be able to achieve a higher batch rate.
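
To make the batching relationship concrete, here is a minimal sketch of the arithmetic. It treats one update as one write and assumes the 150k metricsReceived figure from the question is per minute; both are assumptions, and the function name is illustrative only.

```python
def required_points_per_update(points_per_second: float,
                               max_updates_per_second: float) -> float:
    """Average datapoints each write must carry for writes to keep up with inflow."""
    return points_per_second / max_updates_per_second


# 150k datapoints per minute (assumed unit) is 2500 per second.
inflow_per_second = 150_000 / 60

for limit in (1000, 500, 250):
    batch = required_points_per_update(inflow_per_second, limit)
    print(f"MAX_UPDATES_PER_SECOND = {limit}: need pointsPerUpdate >= {batch}")
```

If the measured pointsPerUpdate stays below the required value at a given limit, the cache grows until it reaches MAX_CACHE_SIZE.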

harabk (kharabasz) said :
#3

Thanks Denis - switching to naive has really helped the situation. We are seeing low CPU utilization, low memory consumption, and iostat utilization shows that we are writing substantially more data to storage. We are a bit concerned with how spiky the iostat utilization graph is: every hour or so we see large spikes from 5-30% to around 90-95%.

We have switched a parallel environment to CACHE_WRITE_STRATEGY = sorted to compare - iostat utilization looks much better here - without sharp spikes, and the memory consumption is even lower than with naive.

Is sorted preferred over naive?

Some additional questions:
Is there a way to monitor lost data points? (Is this what cacheOverflow is for?) We are looking for a good strategy to figure out MAX_CACHE_SIZE for our use case.

Is there a way to have carbon use more than one core? Should we be worried about the carbon process competing with the OS for the first core?
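
On the monitoring question, one possible approach is sketched below. It assumes carbon-cache publishes a cache overflow counter among its self-reported metrics (something like carbon.agents.<instance>.cache.overflow; the exact name should be verified on your install) and that you fetch it from graphite-web's render API with format=json. The sample payload and host name are invented for illustration; only the parsing is shown, not the HTTP request.

```python
import json


def total_overflow(render_json: str) -> float:
    """Sum all non-null values across the series in a render API JSON response."""
    total = 0.0
    for series in json.loads(render_json):
        # Each datapoint is a [value, timestamp] pair; value is null when missing.
        for value, _timestamp in series["datapoints"]:
            if value is not None:
                total += value
    return total


# Hypothetical response for an overflow series (values are made up).
sample = (
    '[{"target": "carbon.agents.host-a.cache.overflow", '
    '"datapoints": [[0, 1444340000], [null, 1444340060], [12, 1444340120]]}]'
)
print(total_overflow(sample))
```

A non-zero total over a time window would suggest the cache hit its limit during that window, which is one signal to use when picking MAX_CACHE_SIZE.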

Launchpad Janitor (janitor) said :
#4

This question was expired because it remained in the 'Open' state without activity for the last 15 days.