What's the deal with server->next_retry and server->server_failure_counter

Asked by Massive Media

Hi all!

I'm trying to get a handle on the logic behind server->next_retry and server->server_failure_counter, especially in the function memcached_mark_server_for_timeout.

What I noticed in our environment is that as soon as a connection issue occurs, the function memcached_mark_server_for_timeout is called. This function (in 1.0.17) does the following:

1) set the next_retry to whatever timeout is configured (default 2 seconds)
2) mark the server as MEMCACHED_SERVER_STATE_IN_TIMEOUT
3) increment the failure counter if it's a different query_id (or operation)
4) update the internal logging of the last disconnected host
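
For readers following along, the four steps above can be sketched in a few lines of C. This is a simplified illustration, not the libmemcached 1.0.17 source: every type, field, and function name prefixed with `sketch_` is hypothetical, and the layout of the real structures is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical, simplified stand-ins for the library's internal types. */
enum sketch_server_state {
    SKETCH_STATE_NEW,
    SKETCH_STATE_IN_TIMEOUT
};

struct sketch_server {
    enum sketch_server_state state;
    time_t   next_retry;              /* earliest time the server may be used again */
    uint32_t server_failure_counter;  /* failures, counted at most once per query */
    uint64_t failure_query_id;        /* query id of the last counted failure */
    char     hostname[64];
};

struct sketch_client {
    time_t   retry_timeout;           /* seconds; the post cites a default of 2 */
    uint64_t query_id;                /* id of the operation in progress */
    char     last_disconnected_host[64];
};

/* Mirrors the four listed steps; an illustration, not the real function. */
static void sketch_mark_server_for_timeout(struct sketch_client *c,
                                           struct sketch_server *s,
                                           time_t now)
{
    assert(c != NULL && s != NULL);

    /* 1) push next_retry out by the configured timeout */
    s->next_retry = now + c->retry_timeout;

    /* 2) mark the server as being in timeout */
    s->state = SKETCH_STATE_IN_TIMEOUT;

    /* 3) count the failure only once per query id */
    if (s->failure_query_id != c->query_id) {
        s->server_failure_counter++;
        s->failure_query_id = c->query_id;
    }

    /* 4) remember which host disconnected last */
    snprintf(c->last_disconnected_host, sizeof c->last_disconnected_host,
             "%s", s->hostname);
}
```

Note that in this sketch steps 1 and 2 run on every call, regardless of the value of the failure counter, which is the behaviour being questioned here.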

Why does the server go into timeout directly, and not after failing server_failure_limit times? I can see it makes sense before marking it as DEAD, but when not in _is_auto_eject_host mode it makes no sense to me.

What I believe is the right thing to do (so as not to cause an internal denial of service or flood) is to back off for the next_retry time (default 2 seconds) once we have had server_failure_limit issues, and then start over again. Marking the server as MEMCACHED_SERVER_STATE_IN_TIMEOUT on every single failed query, and keeping it in that state for the next_retry amount of time, makes no sense to me.
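
A minimal sketch of that idea in C follows: tolerate a number of failures, back off once the budget is exhausted, then start counting again. All names here (`backoff_policy`, `failures_before_timeout`, and so on) are hypothetical and are not libmemcached identifiers. Resetting the count on a successful operation is an extra assumption that the description above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct backoff_policy {
    uint32_t failures_before_timeout; /* hypothetical knob; 0 = back off on the first failure */
    time_t   retry_timeout;           /* seconds to stay in timeout */
};

struct backoff_state {
    uint32_t failures;    /* failures since the last success or backoff */
    bool     in_timeout;
    time_t   next_retry;  /* only meaningful while in_timeout is true */
};

/* Record one failed operation.
 * Returns true if this failure pushed the server into timeout. */
static bool backoff_on_failure(const struct backoff_policy *p,
                               struct backoff_state *s, time_t now)
{
    assert(p != NULL && s != NULL);

    s->failures++;
    if (s->failures <= p->failures_before_timeout) {
        return false;             /* still within budget: keep using the server */
    }

    s->in_timeout = true;
    s->next_retry = now + p->retry_timeout;
    s->failures = 0;              /* start over once the backoff has elapsed */
    return true;
}

/* Assumption: a successful operation clears the failure count. */
static void backoff_on_success(struct backoff_state *s)
{
    s->failures = 0;
}

/* May the server be used at time `now`? Leaves timeout once next_retry has passed. */
static bool backoff_may_use(struct backoff_state *s, time_t now)
{
    if (s->in_timeout && now >= s->next_retry) {
        s->in_timeout = false;
    }
    return !s->in_timeout;
}
```

With `failures_before_timeout` set to 0, the sketch goes into timeout on the very first failure, which matches the behaviour described for 1.0.17.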

Is it intended to operate like this (and what's the reason behind it in this case), or is it a bug?

Please let me know. If it's intended, I'd propose a new behaviour (I'm happy to provide the patch) where we retry x times before we back off. We're running memcached on a high-traffic website with thousands of memcached calls per second. We run PHP with persistent memcached connections, and a single glitch causes a 2-second backoff, during which many requests cannot access our cache and fall through to database calls. (Of course we tuned the retry time, but right now only whole-second precision is allowed.)

Thanks for your positive feedback!

Best,
Nicolas

Question information

Language: English
Status: Expired
For: libmemcached
Assignee: No assignee
Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#1

I added a merge request https://code.launchpad.net/~493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli/libmemcached/feature-server_timeout/+merge/195974

This adds another counter before going into MEMCACHED_SERVER_STATE_IN_TIMEOUT mode. Since the default is 0 it won't harm other users, but at least we can retry before failing the whole connection.

Launchpad Janitor (janitor) said :
#2

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#3

Bump

Brian Aker (brianaker) said :
#4

Let me take a fresh look at this today.

Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#5

Thanks, we're running this in production now and it's going great. We were able to reduce our memcached errors by 95%. Now only real network glitches or timeouts disable servers from the pool.

Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#6

Hi Brian,

I'd love to see this upstream and get your thoughts on it. Right now the behaviour is weird, especially with persistent connections.

We're running PHP with the PECL memcached extension (https://github.com/php-memcached-dev/php-memcached). The biggest bottleneck is opening and closing connections, so we enable persistent connections on our side. The problem is that as soon as a single error occurs (and we set our timeouts tight, since we have a goal of 100ms server-side page generation), the instance gets disabled and marked as TIMEOUT. If we do 10 requests per second and the timeout value is left at the default 2 seconds, the next 20 requests will see a memcached server that is still marked as in timeout. Of course we want to avoid slamming the memcached server, and this mechanism is great for preventing an internal denial of service against it, but we want the server to fail multiple times before it is put in timeout, especially when we're using persistent connections.
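
The arithmetic in the paragraph above can be written out as a short C helper. It is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes evenly spaced requests and a server that stays in timeout for the full retry period.

```c
#include <assert.h>
#include <limits.h>

/* Number of requests that arrive while a server is still in timeout,
 * assuming a constant request rate and no early recovery. */
static unsigned long requests_hitting_timeout(unsigned long requests_per_second,
                                              unsigned long retry_timeout_seconds)
{
    /* Guard against overflow of the multiplication below. */
    assert(retry_timeout_seconds == 0 ||
           requests_per_second <= ULONG_MAX / retry_timeout_seconds);

    return requests_per_second * retry_timeout_seconds;
}
```

For example, `requests_hitting_timeout(10, 2)` gives the 20 requests mentioned above; at higher request rates the number grows linearly.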

I hope this clarifies our need for this a bit.

Best regards,
Nicolas

Launchpad Janitor (janitor) said :
#7

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#8

Any thoughts on this, Brian?

Launchpad Janitor (janitor) said :
#9

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Brian Aker (brianaker) said :
#10

(still need to answer)

Launchpad Janitor (janitor) said :
#11

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Massive Media (493pocbrcycmdw7yksonho9o2qzz-o18bz-d18ecat4t1b76tkfi3vttrkfngli) said :
#12

Bump

Launchpad Janitor (janitor) said :
#13

This question was expired because it remained in the 'Open' state without activity for the last 15 days.