Simultaneous Post requests of large objects could fail with "503 Service Unavailable"

Asked by Balachandra S T

When uploading multiple 5 GB files in parallel, some of the POST requests fail with "503 Service Unavailable".
I am seeing this on Ubuntu: when I send 10 POST requests in parallel, 7 or 8 requests succeed and the remaining 2 or 3 fail with a 503 error.

I would like to know if this is expected behaviour.
Also, please suggest whether there is a way to increase the number of parallel uploads of large objects.

Thanks for any help.

Question information

Language: English
Status: Expired
For: OpenStack Object Storage (swift)
Assignee: No assignee

Peter Portante (peter-a-portante) said :
#1

Can you provide any "Traceback" messages from the various server log files related to the 503s?

What version of Swift are you using?

Balachandra S T (balu9463) said :
#2

I am using Swift Grizzly (1.81) on a 2-node setup.

The following are the traces from /var/log/messages for two such requests.

Nov 11 07:28:20 testnode1 account-server XX.XX.XX.X1 - - [11/Nov/2013:14:28:20 +0000] "HEAD /sdb1/217909/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e" 204 - "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" "-" 0.0014 ""
Nov 11 07:28:21 testnode1 container-server XX.XX.XX.X1 - - [11/Nov/2013:14:28:21 +0000] "HEAD /sdb3/77424/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2" 204 - "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" "-" 0.4304
Nov 11 07:28:42 testnode1 container-server XX.XX.XX.X2 - - [11/Nov/2013:14:28:42 +0000] "PUT /sdb3/77424/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2/obj-0-2-3" 201 - "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" "-" 0.0004
Nov 11 07:28:42 testnode1 container-server XX.XX.XX.X2 - - [11/Nov/2013:14:28:42 +0000] "PUT /sdb1/77424/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2/obj-0-2-3" 201 - "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" "-" 0.0004
Nov 11 07:28:52 testnode1 swift-proxy ERROR with Object server XX.XX.XX.X1:6011/sdb2 re: Trying to get final status of PUT to /v1/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2/obj-0-2-3: Timeout (10s) (txn: tx89711b77dbda4bc4bc2c1e2a3b9137fe) (client_ip: 10.42.2.3)
Nov 11 07:28:52 testnode1 swift-proxy ERROR with Object server XX.XX.XX.X1:6011/sdb2 re: Trying to write to /v1/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-3/obj-0-3-3: ChunkWriteTimeout (10s)
Nov 11 07:29:04 testnode1 container-server XX.XX.XX.X1 - - [11/Nov/2013:14:29:04 +0000] "PUT /sdb2/77424/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2/obj-0-2-3" 201 - "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" "-" 0.0004
Nov 11 07:29:04 testnode1 object-server XX.XX.XX.X1 - - [11/Nov/2013:14:29:04 +0000] "PUT /sdb2/6480/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-2/obj-0-2-3" 201 - "-" "tx89711b77dbda4bc4bc2c1e2a3b9137fe" "-" 42.3358
.
.
.
Nov 11 07:29:31 testnode1 swift-proxy ERROR with Object server XX.XX.XX.X1:6011/sdb2 re: Trying to get final status of PUT to /v1/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-3/obj-0-3-4: Timeout (10s) (txn: tx37b8f33623144ad8a89b13d9a4efc7b7) (client_ip: 10.42.2.3)
Nov 11 07:29:31 testnode1 swift-proxy Object PUT returning 503 for [201, 503, 503] (txn: tx37b8f33623144ad8a89b13d9a4efc7b7) (client_ip: 10.42.2.3)
Nov 11 07:29:38 testnode1 container-server XX.XX.XX.X1 - - [11/Nov/2013:14:29:38 +0000] "PUT /sdb2/153194/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-3/obj-0-3-4" 201 - "tx37b8f33623144ad8a89b13d9a4efc7b7" "-" "-" 0.0005
Nov 11 07:29:48 testnode1 object-server XX.XX.XX.X1 - - [11/Nov/2013:14:29:48 +0000] "PUT /sdb3/245538/AUTH_f70bca6b4ad0465aa9e547b0ee16e89e/cont-0-3/obj-0-3-4" 499 - "-" "tx37b8f33623144ad8a89b13d9a4efc7b7" "-" 45.9426

Thanks for any help.
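
The proxy line "Object PUT returning 503 for [201, 503, 503]" in the log above suggests that one of three replica writes succeeded while the other two were counted as failures. As a rough illustration only, the sketch below models a majority rule in which a PUT succeeds when more than half of the backend writes return a 2xx status. The function names are hypothetical and this is a simplified model, not Swift's actual code (which, among other things, also handles 4xx responses).

```python
def quorum_size(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def overall_put_status(backend_statuses):
    """Simplified model: succeed only if a majority of backends returned 2xx.

    Assumption: anything short of a 2xx majority is reported as 503.
    """
    quorum = quorum_size(len(backend_statuses))
    successes = [s for s in backend_statuses if 200 <= s < 300]
    if len(successes) >= quorum:
        return max(successes)
    return 503


if __name__ == "__main__":
    # Backend statuses taken from the proxy log line quoted in this thread.
    print(overall_put_status([201, 503, 503]))
```

Under this model, a single slow object server is tolerated, but two slow replicas for the same object are enough to fail the whole request.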

Peter Portante (peter-a-portante) said :
#3

It appears that your object servers are taking a long time to respond: 42.3 and 45.9 seconds respectively in the log output you provided. The proxy server appears to be configured to wait 10 seconds for an object server to respond, so it has already returned the 503 to the client by the time the object server finishes. (In your version the proxy keeps the connection to the object server open after the timeout; this is fixed in Havana.)

Are there any logs from the other node?

And have you checked the load on the systems? Sometimes the auditor is configured to run too often and can really bog down a system for a given disk.

How are your disks configured?
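
One way to find the slow object-server responses mentioned above is to scan the log for object-server lines whose final field (the request duration in seconds) exceeds the proxy's 10-second wait. The following is a minimal sketch, assuming the log lines have the same layout as the ones quoted in this thread; the regular expression, the function name `slow_requests`, and the `NODE_TIMEOUT_S` constant are illustrative, not part of Swift.

```python
import re

# Assumption: matches the "Timeout (10s)" reported in the proxy errors.
NODE_TIMEOUT_S = 10.0

# Matches object-server access lines shaped like the ones in this thread:
# ... object-server <ip> - - [<date>] "<METHOD> <path>" <status> ... <seconds>
LINE_RE = re.compile(
    r'object-server \S+ - - \[[^\]]+\] "(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) .* (?P<seconds>\d+\.\d+)\s*$'
)


def slow_requests(lines, timeout=NODE_TIMEOUT_S):
    """Return (method, path, status, seconds) for requests slower than timeout."""
    slow = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an object-server access line
        seconds = float(match.group("seconds"))
        if seconds > timeout:
            slow.append((
                match.group("method"),
                match.group("path"),
                int(match.group("status")),
                seconds,
            ))
    return slow


if __name__ == "__main__":
    import sys

    # Usage: python slow_requests.py < /var/log/messages
    for method, path, status, seconds in slow_requests(sys.stdin):
        print(f"{seconds:8.2f}s {status} {method} {path}")
```

Comparing the timestamps of the flagged lines with disk and CPU load on the same node may show whether the delays line up with auditor runs or with contention on a particular disk.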

Launchpad Janitor (janitor) said :
#4

This question was expired because it remained in the 'Needs information' state without activity for the last 15 days.

Balachandra S T (balu9463) said :
#5

Apologies for the late reply. I had been testing for performance numbers on a 2-node default Swift configuration. The results were not encouraging, though.
I am using 3 local disks on each server, with the default auditor configuration. Logs from the other node did not give any information on the error.

With 10 concurrent users, each uploading 15 objects of mixed sizes varying from 1 KB to 5 GB, I see a failure rate of 10%, mainly 503 errors.
Increasing the timeout to 60 seconds or more only marginally reduced the failure rate.
I would like to know whether these numbers are expected or whether there are configuration tweaks that would improve the success rate for multiple parallel requests.
Any performance results for Swift that you could point me to would be of great help.

Launchpad Janitor (janitor) said :
#6

This question was expired because it remained in the 'Open' state without activity for the last 15 days.