PUT costs a lot of time

Asked by kalrey on 2013-08-02

When I put a large number of 4K files, the performance is terrible.

My testing configuration:
[bench]
concurrency = 200
object_size = 4096
num_objects = 200000
num_containers = 200

Now I have traced the code of the PUT operation and found some problems.
It looks like three steps account for most of the time in a PUT.

In the function ObjectController.PUT:

for chunk in iter(lambda: reader(self.network_chunk_size), ''):
    upload_size += len(chunk)
    if time.time() > upload_expiration:
        self.logger.increment('PUT.timeouts')
        return HTTPRequestTimeout(request=request)
    etag.update(chunk)
    while chunk:
        written = os.write(fd, chunk)
        chunk = chunk[written:]
    sleep()

Each reader() call costs about 600 ms when I PUT 4 KB files, and eventlet.sleep() costs about 400 ms.
Finally, fsync() and async_update() cost about 400 ms.
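To break a figure like that down further, the phases of the loop can be timed separately. The sketch below is illustrative, not Swift's actual code: `timed_put` is a made-up helper that mimics the read/hash/write structure of the loop above while accumulating per-phase timings.

```python
import hashlib
import os
import time

def timed_put(reader, fd, chunk_size=65536):
    """Copy chunks from reader() to fd, timing each phase separately."""
    timings = {'read': 0.0, 'hash': 0.0, 'write': 0.0}
    etag = hashlib.md5()
    upload_size = 0
    while True:
        t0 = time.time()
        chunk = reader(chunk_size)          # network read
        timings['read'] += time.time() - t0
        if not chunk:
            break
        upload_size += len(chunk)
        t0 = time.time()
        etag.update(chunk)                  # checksum
        timings['hash'] += time.time() - t0
        t0 = time.time()
        while chunk:                        # disk write; os.write may be partial
            written = os.write(fd, chunk)
            chunk = chunk[written:]
        timings['write'] += time.time() - t0
    return upload_size, etag.hexdigest(), timings
```

Instrumenting each phase like this makes it clear whether the time is going to the network read, the hashing, or the disk write, rather than attributing it all to one call.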

Has anyone else faced this problem? How can I fix it?

Question information

Language:
English
Status:
Solved
For:
OpenStack Object Storage (swift)
Assignee:
No assignee
Solved by:
Chuck Thier
Solved:
2013-08-12
Last query:
2013-08-12
Last reply:
2013-08-06
Samuel Merritt (torgomatic) said : #1

The reader is what's pulling data off the network. That's expected to take a long time.

The fsync() and async_update() will take a long time because they're doing a bunch of disk IO. fsync() makes sure any data in buffer cache is written out to disk, so if that's taking a while, it's because that call makes a bunch of disk heads move around and wait for spinning platters.

eventlet.sleep() may take a while, but that's because it's yielding to another greenlet. The only way the timing measurement there will accurately measure the eventlet overhead is if you have exactly one request being processed. Otherwise, what looks like wasted time is actually time that is spent on running another greenlet.
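Eventlet isn't needed to see the principle being described here. A minimal cooperative round-robin scheduler built from plain generators (purely illustrative, not how eventlet is implemented) shows how one task's yield point, analogous to eventlet.sleep(), lets another task make progress instead of being wasted time:

```python
from collections import deque

def worker(name, steps, log):
    """A cooperative task: do one unit of work, then yield control."""
    for i in range(steps):
        log.append((name, i))   # one unit of "work"
        yield                   # like eventlet.sleep(): let others run

def run(tasks):
    """Round-robin scheduler: resume each task until it yields or finishes."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
            queue.append(task)
        except StopIteration:
            pass

log = []
run([worker('a', 3, log), worker('b', 3, log)])
# log interleaves: a and b alternate instead of 'a' finishing first
```

Timed from inside task 'a', each yield looks like lost time, but the scheduler is spending it running 'b', which is the point Samuel is making about measuring eventlet overhead under concurrency.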

kalrey (249524324-q) said : #2

Thank you for your help. But it doesn't seem like a good idea to hold up the current request in order to run another greenlet. The proxy will wait longer for the response because the request is being held up. I don't understand why Swift sleeps the current greenlet.

Samuel Merritt (torgomatic) said : #3

It's basically because disk IO won't trigger a context switch, so when disk is slower than network, you get starvation.

See https://github.com/openstack/swift/commit/783f16035a8e251d2138eb5bbaa459e9e4486d90 for details.

kalrey (249524324-q) said : #4

Thank you for your answer~
In fact, GET performance is very good: it reaches a stable 5000/s. But PUT eventually settles at a stable 100/s. The object size is 4 KB.
What may cause the performance degradation of PUT while GET works well?

Samuel Merritt (torgomatic) said : #5

Well, you've got a bunch of tiny objects, so they're probably being read out of buffer cache on GET requests, whereas on PUT, you're guaranteed some actual disk IO.

kalrey (249524324-q) said : #6

As far as I know, the object will be dropped from the buffer cache immediately after a GET.

There is code to drop the cache in DiskFile.put() (swift/obj/server.py):

if 'Content-Length' in metadata:
    self.drop_cache(fd, 0, int(metadata['Content-Length']))

So I think the object is not cached.

Samuel Merritt (torgomatic) said : #7

> As far as I know, the object will be dropped from the buffer cache immediately after a GET.

Nope. There's a threshold (5 MiB by default) for dropping cache on reads; objects under that threshold aren't flushed from cache.

See https://github.com/openstack/swift/blob/master/swift/obj/server.py#L496 for details.
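The underlying mechanism is posix_fadvise(POSIX_FADV_DONTNEED). A simplified sketch of the pattern (function names here are illustrative, not Swift's; the 5 MiB threshold is the default mentioned above):

```python
import os

KEEP_CACHE_SIZE = 5 * 1024 * 1024  # 5 MiB default threshold

def drop_buffer_cache(fd, offset, length):
    """Advise the kernel to evict this file range from the page cache.

    posix_fadvise is POSIX-specific; on platforms without it this is a no-op.
    """
    if hasattr(os, 'posix_fadvise'):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)

def maybe_drop_cache_after_read(fd, total_read):
    """Only reads at or above the threshold get flushed from cache."""
    if total_read >= KEEP_CACHE_SIZE:
        drop_buffer_cache(fd, 0, total_read)
        return True
    return False
```

So a 4 KB object served by GET stays in the page cache after the read, which is consistent with GET being much faster than PUT in this benchmark.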

kalrey (249524324-q) said : #8

Sorry, I got that wrong. I wrote:
"As far as I know, the object will be dropped from the buffer cache immediately after a GET."
It should be: "As far as I know, the object will be dropped from the buffer cache immediately after a PUT."

So there is no cache after a PUT.
And every object fetched by GET was different, so I think those objects should not be cached either.

Chuck Thier (cthier) said : #9

So, getting back to PUT performance: GET has always been much faster than PUT, since there is more going on with PUTs (updating containers, fsync, etc.).

To really get down to the bottom of this, we will need to know more information about your config. From your email to openstack-dev I gather that you have 1 proxy, 5 storage nodes, each with 4TB drives.

Can you please provide the following information:

1. What concurrency are you running each service at (posting your configs would be great)?
2. Are you using a RAID controller with a cache? Are the drives in a RAID configuration?
3. Are the drive caches on or off?
4. What Filesystem are you using (XFS, EXT4, etc.) and what params are you using to format?
5. What is the partition power of your ring?

You mention in the email that you start getting errors like 503 when the performance starts lagging. Looking at the logs will help you figure out what is going on (look for things like timeouts, errors, etc.).

Also, how many objects do you PUT before the performance starts degrading?

Thanks,

--
Chuck

kalrey (249524324-q) said : #10

> From your email to openstack-dev I gather that you have 1 proxy, 5 storage nodes, each with 4TB drives.
Yes, 1 proxy and 5 storage nodes, but each storage node has 4 x 4 TB drives.

> 1. What concurrency are you running each service at (posting your configs would be great)?
A: proxy workers = 32, object-server workers = 32, account workers = 16, container workers = 32

> 2. Are you using a RAID controller with a cache? Are the drives in a RAID configuration?
A: NO

> 3. Are the drive caches on or off?
A: How do I find out whether they are on or off?
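One common way to check on Linux, assuming plain (non-RAID) SATA drives and that the hdparm tool is installed; both commands need root, and the device name is illustrative:

```shell
# Query the drive's write-caching flag (look for "write-caching")
sudo hdparm -W /dev/sda

# Disable the drive's write cache for testing
sudo hdparm -W0 /dev/sda
```

For drives behind a RAID controller the controller's own CLI has to be used instead; hdparm only talks to drives it can see directly.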

> 4. What Filesystem are you using (XFS, EXT4, etc.) and what params are you using to format?
A: XFS, inode size = 1024 and other params like "type xfs (rw,noatime,nodiratime,nobarrier,logbufs=8)".

> 5. What is the partition power of your ring?
A: partition power is 18. Would it affect performance?

The errors include 'ConnectTimeout' and 'get final status timeout(10s)'.

> Also, how many objects do you PUT before the performance starts degrading?
In fact, it started degrading from the beginning, dropping steadily from 1900/s to 100/s.

some bench logs:
2013-08-04 02:32:57,281 INFO 3527 PUTS [0 failures], 1763.3/s current:1763.4/s
2013-08-04 02:33:12,282 INFO 22057 PUTS [0 failures], 1297.4/s current:1235.3/s
2013-08-04 02:33:27,284 INFO 39651 PUTS [0 failures], 1239.0/s current:1172.7/s
2013-08-04 02:33:42,291 INFO 48089 PUTS [0 failures], 1023.0/s current:562.3/s
2013-08-04 02:33:57,292 INFO 64144 PUTS [0 failures], 1034.4/s current:1070.3/s
2013-08-04 02:34:12,298 INFO 75191 PUTS [0 failures], 976.3/s current:736.2/s
and then it stayed between 500/s and 900/s for a while.
Then it looked like:
2013-08-04 05:33:06,339 INFO 5416916 PUTS [0 failures], 501.1/s current:683.1/s
2013-08-04 05:33:21,339 INFO 5420003 PUTS [0 failures], 500.6/s current:205.8/s
2013-08-04 05:33:36,350 INFO 5429288 PUTS [0 failures], 500.8/s current:618.6/s
2013-08-04 05:33:51,381 INFO 5433035 PUTS [0 failures], 500.5/s current:249.3/s

then:
2013-08-04 10:41:24,565 INFO 12095430 PUTS [0 failures], 412.7/s current:368.1/s
2013-08-04 10:41:39,604 INFO 12099253 PUTS [0 failures], 412.6/s current:254.2/s
2013-08-04 10:41:54,606 INFO 12105207 PUTS [0 failures], 412.6/s current:396.9/s
2013-08-04 10:42:09,615 INFO 12108471 PUTS [0 failures], 412.5/s current:217.5/s

At last:
2013-08-04 13:58:32,020 INFO 15393452 PUTS [0 failures], 374.2/s current:336.8/s
2013-08-04 13:58:47,160 INFO 15395591 PUTS [0 failures], 374.1/s current:141.3/s
2013-08-04 13:59:02,172 INFO 15399491 PUTS [2 failures], 374.1/s current:259.8/s
2013-08-04 13:59:17,177 INFO 15402982 PUTS [2 failures], 374.0/s current:232.7/s
A few errors appeared and the speed settled stably at 100 to 200/s.

I also removed the fsync operation for testing, and the result looked the same as above.

Best answer: Chuck Thier (cthier) said : #11

So a couple of thoughts:

The Timeout errors lead me to believe that you are running out of disk io.

Your object-server worker count is a lot higher than I would run on a server with 4 drives; that's 8 workers per drive. I would cut it back to 4-8, and if things are stable at that point, look at increasing workers incrementally with further testing.
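As a sketch, that change would go in the object server config; the path and exact value are illustrative:

```ini
# /etc/swift/object-server.conf
[DEFAULT]
# roughly 1-2 workers per disk is a sane starting point for a 4-drive node
workers = 8
```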

I would also look at turning your disk caches off; they are likely masking the issue at the beginning. The method varies by OS and by whether you are using a RAID controller, but it is easy to look up.

The other thing I would check is whether the background processes are running too aggressively. You could try running with replication/auditors/etc. turned off, just as a test, to see if that has any effect. If so, it would be worth tuning their concurrency and run delays.

That should be a good start.

--
Chuck

kalrey (249524324-q) said : #12

Thanks Chuck Thier, that solved my question.