Unable to PUT objects after reliability test has run for a while

Asked by offline

Hello,

I've been running a reliability test on Swift which consists of a number of threads repeatedly putting and getting objects from containers. After a while Swift gets itself into a tangle that it can't get out of, and we can't see why. The trigger appears to be one node reporting that it has run out of space, even though it should be nowhere near its limit.

The problem appears as 404 Not Found returned on every attempt to PUT an object into any container, although the containers do exist. Examination of the Swift nodes shows the following errors, which are reported repeatedly; these are just typical examples of what is going on.

1. Data node inexplicably runs out of space:
a. Jun 25 08:08:15 swiftdatadev3 object-server ERROR __call__ error with PUT /sdb/81/AUTH_f81d644f94ce4048bb708a87f4a247cf/9-1/10734 : [Errno 28] No space left on device: '/srv/node/sdb/tmp/tmpTINoba.tmp' (txn: txfd2c8c406dff4d208ab3e1ce81a4e7f9)
b. Jun 25 08:08:16 swiftdatadev3 object-updater UNCAUGHT EXCEPTION
Traceback (most recent call last):
  File "/usr/bin/swift-object-updater", line 23, in <module>
    run_daemon(ObjectUpdater, conf_file, **options)
  File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 88, in run_daemon
    klass(conf).run(once=once, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 54, in run
    self.run_forever(**kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/updater.py", line 87, in run_forever
    self.object_sweep(os.path.join(self.devices, device))
  File "/usr/lib/python2.6/site-packages/swift/obj/updater.py", line 150, in object_sweep
    self.process_object_update(update_path, device)
  File "/usr/lib/python2.6/site-packages/swift/obj/updater.py", line 197, in process_object_update
    write_pickle(update, update_path, os.path.join(device, 'tmp'))
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 859, in write_pickle
    fd, tmppath = mkstemp(dir=tmp, suffix='.tmp')
  File "/usr/lib64/python2.6/tempfile.py", line 293, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags)
  File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner
    fd = _os.open(file, flags, 0600)
OSError: [Errno 28] No space left on device: '/srv/node/sdb/tmp/tmptHZair.tmp'

2. Then the account-replicator can’t access its database file(s)
a. Jun 25 08:08:18 swiftdatadev3 account-replicator ERROR reading db /srv/node/sdb/accounts/259/565/40c77d708513caaf9440333ed0892565/40c77d708513caaf9440333ed0892565.db:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/swift/common/db_replicator.py", line 342, in _replicate_object
    time.time() - (self.reclaim_age * 2))
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 1397, in reclaim
    conn.commit()
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 318, in get
    self.possibly_quarantine(*sys.exc_info())
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 310, in get
    yield conn
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 1384, in reclaim
    ''', (container_timestamp,))
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 81, in execute
    return self._timeout(lambda: sqlite3.Connection.execute(
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 74, in _timeout
    return call()
  File "/usr/lib/python2.6/site-packages/swift/common/db.py", line 82, in <lambda>
    self, *args, **kwargs))
OperationalError: unable to open database file

3. These errors propagate through all nodes (possibly independently) until eventually the container replicator on every node is timing out waiting for responses from the others:
a. Jun 26 08:33:14 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 4, 'weight': 100.0, 'ip': '10.2.1.4', 'id': 3, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
b. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 6, 'weight': 100.0, 'ip': '10.2.1.6', 'id': 5, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
c. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 4, 'weight': 100.0, 'ip': '10.2.1.4', 'id': 3, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
d. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 6, 'weight': 100.0, 'ip': '10.2.1.6', 'id': 5, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
e. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 4, 'weight': 100.0, 'ip': '10.2.1.4', 'id': 3, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
f. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 6, 'weight': 100.0, 'ip': '10.2.1.6', 'id': 5, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
g. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 4, 'weight': 100.0, 'ip': '10.2.1.4', 'id': 3, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)
h. Jun 26 08:33:15 swiftdatadev3 container-replicator ERROR reading HTTP response from {'zone': 6, 'weight': 100.0, 'ip': '10.2.1.6', 'id': 5, 'meta': '', 'device': 'sdb', 'port': 6001}: Timeout (10s)

4. On inspection the disk is not full; it looks as though it filled up and then emptied again, but we don't know why.

A restart of all nodes stops the timeout issues, but we have no idea why the data nodes run out of space or why the space is later freed. Is there any reason Swift would temporarily consume a lot of disk space?
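
To try to catch the transient fill-up in the act, we are thinking of running a small sampler like the one below on each data node. This is just a sketch: the device path comes from the log messages above, and the choice of tmp/, async_pending/ and quarantined/ as the directories to watch is our assumption about where the growth might happen, not something we have confirmed.

# Disk-usage sampler for one data node (sketch only).
import os
import time

DEVICE = '/srv/node/sdb'
# Directories we *suspect* might grow temporarily -- an assumption, not confirmed.
SUSPECT_DIRS = ['tmp', 'async_pending', 'quarantined']

def dir_kbytes(path):
    """Total size (KB) of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files can vanish between listing and stat
    return total // 1024

while True:
    stats = os.statvfs(DEVICE)
    free_kb = stats.f_bavail * stats.f_frsize // 1024
    usage = ' '.join('%s=%dKB' % (d, dir_kbytes(os.path.join(DEVICE, d)))
                     for d in SUSPECT_DIRS if os.path.isdir(os.path.join(DEVICE, d)))
    print('%s free=%dKB %s' % (time.strftime('%Y-%m-%d %H:%M:%S'), free_kb, usage))
    time.sleep(5)

The idea is simply to log free space and the size of those directories every few seconds so we can see what grows just before the ENOSPC errors appear.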

The test consists of 50 threads, each doing a PUT of an object and then a GET on it some time later. Each thread eventually PUTs 30,000 objects into each of its 2 containers. The data objects are very small, e.g.,
    "Content of object 11234 in container 15-1 \n"
The test is rate limited to 2,100 HTTP requests (GETs or PUTs) per minute, which is the traffic rate we want it to support. The problem appears at about object 18,000 of 30,000 in the first of the 2 containers.
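
For reference, each test thread boils down to roughly the following. This is a simplified sketch assuming python-swiftclient; the credentials and the crude per-thread sleep-based rate limiting are illustrative, not our actual harness.

import time
from swiftclient import client as swift_client

REQS_PER_MIN = 2100.0 / 50               # per-thread share of the 2,100 req/min cap
REQUEST_INTERVAL = 60.0 / REQS_PER_MIN   # ~1.4 s between requests in each thread

def run_thread(thread_id, authurl, user, key, objects_per_container=30000):
    # authurl/user/key are placeholders for our test account
    conn = swift_client.Connection(authurl=authurl, user=user, key=key)
    for container in ('%d-1' % thread_id, '%d-2' % thread_id):
        conn.put_container(container)
        for i in range(objects_per_container):
            body = 'Content of object %d in container %s \n' % (i, container)
            conn.put_object(container, str(i), contents=body)   # small object PUT
            time.sleep(REQUEST_INTERVAL)
            conn.get_object(container, str(i))   # in the real test the GET happens somewhat later
            time.sleep(REQUEST_INTERVAL)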

The Swift cluster consists of a load balancer in front of 2 Swift proxies, in turn connected to 6 Swift data nodes. All of these systems are VMs in a managed cluster of physical servers and so may compete for physical resources, but we believe they are adequately provisioned for this phase of testing.
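
While the cluster is in the broken state, a probe along these lines reproduces the symptom from the client side. Again this is only a sketch assuming python-swiftclient; the endpoint and credentials are placeholders, and '9-1' is just one of the test containers seen in the logs.

from swiftclient import client as swift_client
from swiftclient.exceptions import ClientException

# Endpoint and credentials below are placeholders, not our real settings.
conn = swift_client.Connection(authurl='http://<load-balancer>:8080/auth/v1.0',
                               user='<account>:<user>', key='<key>')
container = '9-1'
try:
    conn.head_container(container)      # the container is still there
    conn.put_object(container, 'probe', contents='probe\n')
    print('PUT succeeded')
except ClientException as err:
    # While the cluster is wedged, every PUT ends up here with 404 Not Found
    print('failed: %s %s' % (err.http_status, err.http_reason))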

Inspecting disk use shows the following.

[root@swiftproxydev1 ~]# for n in 1 2 3 4 5 6 ; do echo $n ; ssh root@swiftdatadev$n df /srv/node/sdb; done
1
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 3851464 6624056 37% /srv/node/sdb
2
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 4035788 6439732 39% /srv/node/sdb
3
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 3996576 6478944 39% /srv/node/sdb
4
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 3946076 6529444 38% /srv/node/sdb
5
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 4258268 6217252 41% /srv/node/sdb
6
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 10475520 3932064 6543456 38% /srv/node/sdb

[root@swiftproxydev1 ~]# for n in 1 2 3 4 5 6 ; do echo $n ; ssh root@swiftdatadev$n df -i /srv/node/sdb; done
1
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 8114024 1489273 6624751 19% /srv/node/sdb
2
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 8161356 1487630 6673726 19% /srv/node/sdb
3
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 7969216 1488848 6480368 19% /srv/node/sdb
4
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 8251796 1488203 6763593 19% /srv/node/sdb
5
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 7934648 1486239 6448409 19% /srv/node/sdb
6
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 8032216 1487262 6544954 19% /srv/node/sdb
[root@swiftproxydev1 ~]#

Question information

Language: English
Status: Expired
For: OpenStack Object Storage (swift)
Assignee: No assignee
gholt (gholt) said :
#1

I wonder if the problems you are experiencing are due to this bug:

https://bugs.launchpad.net/swift/+bug/1012714

The fix just got committed so you might retry using master.

Launchpad Janitor (janitor) said :
#2

This question was expired because it remained in the 'Open' state without activity for the last 15 days.