Missing Metrics in Carbon Cache

Asked by Ryan Addington

We are seeing some missing metrics (nulls) when we query from graphite-web, while other data points are retrieved from the cache and displayed correctly. Once everything is flushed to disk, all metrics appear. Both members of the cluster exhibit the same behavior.

We have a new Graphite cluster of two servers; each server runs carbon-relay, one carbon-cache instance, and graphite-web, all sitting behind a load balancer. We are using consistent-hashing with a replication factor of 2 so that both servers have the same data.

Carbon-cache settles between 100k and 300k metrics in cache; the fewer metrics in the cache, the fewer metrics are missing. Restarting carbon-cache masks the issue for a few hours while the cache grows back. We are writing ~275k metrics per minute.

The server referenced below uploads metrics every 5 seconds, matching what is set in storage-schemas.conf. We have other servers that upload metrics every 30 seconds, and the behavior is similar.

Attached below are some of the configs and logs illustrating the issue. If any other info is needed, please let me know; any assistance is greatly appreciated.

tail -f query.log
07/11/2017 12:40:43 :: [127.0.0.1:36494] cache query for "web.servers.web.HOST1.perf.processor.pct_processor_time" returned 7 values

Below, the oldest ~16 points have been persisted to disk and all show up correctly in graphite-web. We then have ~21 null points that are not being pulled from the cache, though I presume they are there, since they get written to disk a few moments later. Finally, the newest 7 points are pulled from the cache and shown correctly in graphite-web.

curl "http://SERVER1/render/?target=web.servers.web.HOST1.perf.processor.pct_processor_time&format=json"
[7.0, 1510079840], [6.0, 1510079845], [4.0, 1510079850], [5.0, 1510079855], [5.0, 1510079860], [8.0, 1510079865], [5.0, 1510079870], [6.0, 1510079875], [9.0, 1510079880], [5.0, 1510079885], [4.0, 1510079890], [3.0, 1510079895], [4.0, 1510079900], [3.0, 1510079905], [null, 1510079910], [null, 1510079915], [null, 1510079920], [null, 1510079925], [null, 1510079930], [null, 1510079935], [null, 1510079940], [null, 1510079945], [null, 1510079950], [null, 1510079955], [null, 1510079960], [null, 1510079965], [null, 1510079970], [null, 1510079975], [null, 1510079980], [null, 1510079985], [null, 1510079990], [null, 1510079995], [null, 1510080000], [null, 1510080005], [4.0, 1510080010], [8.0, 1510080015], [6.0, 1510080020], [6.0, 1510080025], [4.0, 1510080030], [5.0, 1510080035], [4.0, 1510080040]],
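To double-check whether the "missing" points really are sitting in the cache, one can query carbon-cache directly on its cache query port. The sketch below is a hypothetical debugging helper, not part of Graphite; it mimics the length-prefixed pickle exchange graphite-web's carbonlink client performs, and the host, port and metric name are assumptions to adjust for your setup.

#!/usr/bin/env python
# Hypothetical debugging helper (not shipped with Graphite): ask carbon-cache
# directly, over its cache query port, which datapoints it currently holds
# for a single metric.
import pickle
import socket
import struct

HOST, PORT = '127.0.0.1', 7002   # CACHE_QUERY_INTERFACE / CACHE_QUERY_PORT from carbon.conf
METRIC = 'web.servers.web.HOST1.perf.processor.pct_processor_time'


def recv_exactly(sock, n):
    # Read exactly n bytes, raising if the peer closes the connection early.
    data = b''
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError('connection closed before full response was read')
        data += chunk
    return data


request = pickle.dumps({'type': 'cache-query', 'metric': METRIC}, protocol=2)

sock = socket.create_connection((HOST, PORT), timeout=5)
try:
    # Both request and response are 4-byte length-prefixed pickles.
    sock.sendall(struct.pack('!L', len(request)) + request)
    (length,) = struct.unpack('!L', recv_exactly(sock, 4))
    response = pickle.loads(recv_exactly(sock, length))
finally:
    sock.close()

# 'datapoints' lists the (timestamp, value) pairs still held in the cache.
for timestamp, value in sorted(response.get('datapoints', [])):
    print(timestamp, value)

If the timestamps that render as null are returned here, the cache has the data and the problem is in the carbonlink lookup from graphite-web; if they are absent, the points are delayed somewhere between the relay and the cache.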

python ~/whisper-info.py /opt/graphite/storage/whisper/web/servers/web/HOST1/perf/processor/pct_processor_time.wsp
maxRetention: 63072000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 10172212

Archive 0
retention: 2592000
secondsPerPoint: 5
points: 518400
size: 6220800
offset: 52

Archive 1
retention: 15552000
secondsPerPoint: 60
points: 259200
size: 3110400
offset: 6220852

Archive 2
retention: 63072000
secondsPerPoint: 900
points: 70080
size: 840960
offset: 9331252
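For reference, the three archives above correspond to a retention of 5s:30d, 1m:180d, 15m:730d. A matching storage-schemas.conf entry would look roughly like the following; the section name and pattern are placeholders, not our actual config.

storage-schemas.conf (illustrative)
[web_servers]
pattern = ^web\.servers\.
retentions = 5s:30d,1m:180d,15m:730d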

carbon.conf
[cache]
DATABASE = whisper
ENABLE_LOGROTATION = True
USER =
MAX_CACHE_SIZE = 5000000
MAX_UPDATES_PER_SECOND = 750
MAX_CREATES_PER_MINUTE = 500
MIN_TIMESTAMP_RESOLUTION = 1
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2004
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2004
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2005
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CREATES = False
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
GRAPHITE_URL = http://127.0.0.1:80
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = 127.0.0.1:2005, 10.1.1.12:2005
MAX_QUEUE_SIZE = 100000
MAX_DATAPOINTS_PER_MESSAGE = 500
QUEUE_LOW_WATERMARK_PCT = 0.8
TIME_TO_DEFER_SENDING = 0.0001
USE_FLOW_CONTROL = True
USE_RATIO_RESET=False
MIN_RESET_STAT_FLOW=1000
MIN_RESET_RATIO=0.9
MIN_RESET_INTERVAL=121
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

local_settings.py
SECRET_KEY = 'Edited'
CLUSTER_SERVERS = ["10.1.1.12:80"]
REMOTE_FIND_TIMEOUT = 3.0 # Timeout for metric find requests
REMOTE_FETCH_TIMEOUT = 3.0 # Timeout to fetch series data
REMOTE_RETRY_DELAY = 60.0 # Time before retrying a failed remote webapp

Question information

Language: English
Status: Expired
For: Graphite
Assignee: No assignee
Denis Zhdanov (deniszhdanov) said :
#1

Hello Ryan,

That behavior is quite odd. Graphite does not require that a metric be flushed to disk - metrics should be available directly from the carbon cache via the carbonlink protocol (CARBONLINK_HOSTS = ['127.0.0.1:7002'] by default).
But the cache lookup in your case is probably broken, because you're using '127.0.0.1' as a destination, while the consistent-hashing algorithm uses the *hostname* as part of the hashing key. In theory it shouldn't matter with a replication factor of 2, but...
So, please change:

 DESTINATIONS = 127.0.0.1:2005, 10.1.1.12:2005

to

 DESTINATIONS = <IP of the node>:2005, 10.1.1.12:2005

This line should be the same (including order) for all cache daemons.
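To illustrate why the exact destination string matters, here is a simplified sketch in the spirit of carbon's md5-based consistent hashing - not its actual ConsistentHashRing code. The ring position of each destination is derived from the configured string, so '127.0.0.1:2005' and '10.1.1.11:2005' land in different places even though they name the same machine, and the relays and webapp only agree if they are configured with the same strings in the same order.

# Simplified illustration only - not carbon's actual hashing code.
from hashlib import md5

def ring_position(key):
    # Take the first two bytes of the md5 digest as an integer in [0, 65535].
    return int(md5(key.encode('utf-8')).hexdigest()[:4], 16)

for destination in ('127.0.0.1:2005', '10.1.1.11:2005'):
    print(destination, ring_position(destination))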

Try also to add
CARBONLINK_HOSTS=['127.0.0.1:7002']
explicitly to local_settings.py on both servers.

Ryan Addington (jryana) said :
#2

Thanks for the reply, Denis.
I changed the destinations and carbonlink hosts to match your recommendation and restarted all the services. However, we are still seeing the issue.

DESTINATIONS = 10.1.1.11:2005, 10.1.1.12:2005
CARBONLINK_HOSTS = ["127.0.0.1:7002"]

Any other suggestions?

Thanks,
Ryan

Denis Zhdanov (deniszhdanov) said :
#3

Hi Ryan,

No, unfortunately I have no other ideas - assuming I understand your issue correctly, which perhaps I don't.
So, you have a sequence of nulls in your metric, right?
And that sequence is in the middle of the metric and never "heals" later? Or does it?

Ryan Addington (jryana) said :
#4

I have a sequence of nulls as well as valid data returned from the cache, but all points are valid once they are flushed to disk.

Denis Zhdanov (deniszhdanov) said :
#5

That's really strange; to be honest, I have no idea how that can happen. Your symptoms look as if the cache is not working for some reason and Graphite is reading data from disk only, which is of course slow. But in that case you should see nulls at the end of the metric, whereas you have fresh data at the end - how could that be?

Launchpad Janitor (janitor) said :
#6

This question was expired because it remained in the 'Open' state without activity for the last 15 days.