swift-dispersion-tool slow with unreachable storage nodes

Asked by Walter Huf on 2013-01-31

If a Swift storage node is down, the swift-dispersion-tool reports the host as unreachable once for every object and container stored on that host, multiplied by the retry setting. The tool therefore runs very slowly, and the Nagios check based on it then fails.
This patch stops the swift-dispersion-tool from contacting hosts that have already been found to be down. A useful accompanying setting is retries = 1 in /etc/swift/dispersion.conf.

--- swift-dispersion-report 2012-10-11 14:19:43.000000000 +0000
+++ swift-dispersion-report.new 2013-01-31 16:38:48.959641951 +0000
@@ -33,6 +33,7 @@
 from swift.common.utils import compute_eta, get_time_units, TRUE_VALUES

+unreachable = []
 unmounted = []
 notfound = []
 json_output = False
@@ -60,6 +61,9 @@
                 stderr.flush()
         if not hasattr(msg_or_exc, 'http_status') or \
                 msg_or_exc.http_status not in (404, 507):
+            ip = prefix.split(':')[0]
+            if ip not in unreachable:
+                unreachable.append(ip)
             print >>stderr, 'ERROR: %s: %s' % (prefix, msg_or_exc)
             stderr.flush()
     return error_log
@@ -85,6 +89,8 @@
     def direct(container, part, nodes):
         found_count = 0
         for node in nodes:
+            if node['ip'] in unreachable:
+                continue
             error_log = get_error_log('%(ip)s:%(port)s/%(device)s' % node)
             try:
                 attempts, _junk = direct_client.retry(
@@ -183,6 +189,8 @@
     def direct(obj, part, nodes):
         found_count = 0
         for node in nodes:
+            if node['ip'] in unreachable:
+                continue
             error_log = get_error_log('%(ip)s:%(port)s/%(device)s' % node)
             try:
                 attempts, _junk = direct_client.retry(
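
The idea behind the patch can be tried in isolation. The following is a minimal standalone sketch, not Swift code: `check_nodes`, `fake_probe`, and the sample addresses are hypothetical stand-ins for the `direct()` loops and the `direct_client` calls, and a plain `ConnectionError` is assumed to represent "host unreachable". It mirrors the patch by keeping a module-level list of IPs that have already failed.

```python
# Standalone illustration of "remember unreachable hosts"; not Swift code.
unreachable = []  # IPs that have already failed, as in the patch


def check_nodes(nodes, probe):
    """Return (found_count, probe_calls), skipping IPs already known to be down.

    `probe` is a hypothetical callable standing in for the direct_client
    request; it is assumed to raise ConnectionError for an unreachable host.
    """
    found = 0
    calls = 0
    for node in nodes:
        if node['ip'] in unreachable:
            continue  # this host already failed once; do not wait on it again
        calls += 1
        try:
            probe(node)
            found += 1
        except ConnectionError:
            if node['ip'] not in unreachable:
                unreachable.append(node['ip'])
    return found, calls


def fake_probe(node):
    # Pretend that 10.0.0.2 is down.
    if node['ip'] == '10.0.0.2':
        raise ConnectionError('host unreachable')


nodes = [
    {'ip': '10.0.0.1', 'port': 6000, 'device': 'sdb1'},
    {'ip': '10.0.0.2', 'port': 6000, 'device': 'sdb1'},
    {'ip': '10.0.0.2', 'port': 6000, 'device': 'sdc1'},
    {'ip': '10.0.0.3', 'port': 6000, 'device': 'sdb1'},
]
result = check_nodes(nodes, fake_probe)
print(result)
```

With these four sample nodes, only three probes are made: the second device on 10.0.0.2 is skipped because the first attempt against that IP already failed.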

Question information

Language: English
Status: Answered
For: OpenStack Object Storage (swift)
Assignee: No assignee
Last query: 2013-01-31
Last reply: 2013-01-31

Samuel Merritt (torgomatic) said : #1

Seems like a fairly reasonable change.

Two things:

1) make the variable "unreachable" a set, not a list (it's a micro-optimization, but still...)

2) please submit this as a changeset in Gerrit. You'll have to sign the Contributor License Agreement (CLA), which is a bit tedious, but at least it only has to be done once. See http://wiki.openstack.org/HowToContribute#If_you.27re_a_developer.2C_start_here: for details.
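
Point 1 could look roughly like the following. This is only a sketch: `mark_unreachable` is a hypothetical helper name used for illustration and is not part of swift-dispersion-report; the prefix format 'ip:port/device' is taken from the patch.

```python
unreachable = set()  # a set gives constant-time membership tests and no duplicates


def mark_unreachable(prefix):
    """Record the IP part of an 'ip:port/device' prefix (hypothetical helper)."""
    ip = prefix.split(':')[0]
    unreachable.add(ip)  # add() is idempotent, so no "if ip not in ..." guard is needed
    return ip


# Two devices on the same host result in a single entry.
mark_unreachable('10.0.0.2:6000/sdb1')
mark_unreachable('10.0.0.2:6000/sdc1')
print(sorted(unreachable))
```

The `if node['ip'] in unreachable` checks in the two `direct()` functions would stay unchanged, since `in` works the same way on a set.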
