Question about Swift Storage Nodes CPU usage

Asked by Fatih Güçlü Akkaya

Hi all,

For the past 3 months we have been using OpenStack Swift in our test environment. While monitoring CPU usage and I/O traffic on the storage nodes, we noticed that even at night, when no one is making requests, CPU usage sits at around 40 to 50% on the servers and I/O throughput is higher than expected.

Test environment:

1 node (swift-proxy + keystone)
5 storage nodes (account, container and object servers)

Server specifications:

VMware virtual machine
OS: Ubuntu 10.04 LTS (64-bit)
CPU: 4 cores
RAM: 6 GB

On the storage nodes, the auditor, replicator, updater and server processes are running for each of the account, container and object services.
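A quick way we use to confirm which of these daemons account for the idle-time load is to sort processes by CPU and watch the disks alongside (a rough sketch; it assumes the standard swift-* process names and the sysstat package for iostat):

# Busiest Swift daemons on a storage node
ps -eo pcpu,pid,args --sort=-pcpu | grep swift- | head -15

# Disk utilisation while the daemons run (iostat comes from the sysstat package)
iostat -x 5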

Configuration files:

account-server.conf

[DEFAULT]
bind_ip = 10.1.1.152
workers = 2
log_facility = LOG_LOCAL1

[pipeline:main]
pipeline = account-server

[app:account-server]
use = egg:swift#account

[account-replicator]

[account-auditor]

[account-reaper]

container-server.conf

[DEFAULT]
bind_ip = 10.1.1.152
workers = 2
log_facility = LOG_LOCAL2

[pipeline:main]
pipeline = container-server

[app:container-server]
use = egg:swift#container

[container-replicator]

[container-updater]

[container-auditor]

object-server.conf

[DEFAULT]
bind_ip = 10.1.1.152
workers = 2
log_facility = LOG_LOCAL3

[pipeline:main]
pipeline = object-server

[app:object-server]
use = egg:swift#object

[object-replicator]

[object-updater]

[object-auditor]

rsyncd.conf

uid = swift
gid = swift
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
address = 10.1.1.152

[account]
max connections = 2
path = /srv/node/
read only = false
lock file = /var/lock/account.lock

[container]
max connections = 2
path = /srv/node/
read only = false
lock file = /var/lock/container.lock

[object]
max connections = 2
path = /srv/node/
read only = false
lock file = /var/lock/object.lock

Since we have not run a performance test yet, I think this amount of resource usage is too high and will cause problems during our tests.
From the logs I can only tell that the replicators run repeatedly for short periods of time (see the snippet below for how we are checking). What can you recommend for improving our environment?
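For reference, this is roughly how we are reading the replicator activity out of the logs (the log_facility settings above route them to syslog on a default rsyslog setup; the exact message wording may differ between Swift versions):

# How often the replicators start a pass and how long each pass takes
grep -iE "replication (run|pass|complete)" /var/log/syslog | tail -20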

Thanks

Question information

Language: English
Status: Answered
For: OpenStack Object Storage (swift)
Assignee: No assignee
#1 Launchpad Janitor (janitor) said:

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

#2 Fatih Güçlü Akkaya (gucluakkaya) said:

Update:

After increasing run_pause from 30 to 900 in the [account-replicator], [container-replicator] and [object-replicator] sections of the configuration files, CPU usage has decreased considerably. However, increasing run_pause means the replication process sleeps for that long between passes. What are the drawbacks of raising this value with respect to reliability, and what run_pause value would you recommend?

Updated configuration:

account-server.conf

[DEFAULT]
bind_ip = 0.0.0.0
workers = 8

[pipeline:main]
pipeline = account-server

[app:account-server]
use = egg:swift#account

[account-replicator]
run_pause = 900

[account-auditor]

[account-reaper]

container-server.conf

[DEFAULT]
bind_ip = 0.0.0.0
workers = 8

[pipeline:main]
pipeline = container-server

[app:container-server]
use = egg:swift#container

[container-replicator]
run_pause = 900

[container-updater]

[container-auditor]

object-server.conf

[DEFAULT]
bind_ip = 0.0.0.0
workers = 8

[pipeline:main]
pipeline = object-server

[app:object-server]
use = egg:swift#object

[object-replicator]
run_pause = 1500
ring_check_interval = 900

[object-updater]

[object-auditor]
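If I understand run_pause correctly, it is only the sleep between passes, so the main drawback should be that a missing or stale replica can go unnoticed for up to an extra run_pause seconds on top of the pass time itself. The new values also only take effect once the replicators are restarted; this is what we run on each storage node (a sketch assuming the standard swift-init tooling):

# Restart the replicators so the new run_pause values take effect
swift-init account-replicator restart
swift-init container-replicator restart
swift-init object-replicator restart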

#3 steve A (dafridgie) said:

Hi, I've also seen this with Swift. Another set of values you may want to play with are the object auditor settings, as follows:

/etc/swift/object-server/1.conf

[DEFAULT]
devices = /srv/node/
bind_port = 6066
user = swift
workers = 2
log_facility = LOG_LOCAL2
mount_check = true

[pipeline:main]
pipeline = recon object-server

[app:object-server]
use = egg:swift#object

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
recon_lock_path = /var/lock

[object-replicator]
concurrency = 4
vm_test_mode = no
run_pause = 60
recon_enable = yes
recon_cache_path = /var/cache/swift

[object-updater]
concurrency = 4

recon_enable = yes
recon_cache_path = /var/cache/swift

[object-auditor]
files_per_second = 5           # adjusted down from the default of 20
bytes_per_second = 2500000     # adjusted down from the default of 10000000
concurrency = 25
recon_enable = yes
recon_cache_path = /var/cache/swift

I had experienced the occasional write failure as well as the odd chunk write timeout. Adjusting the settings above reduced the overall read workload the auditor processes placed on the storage nodes.
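To put numbers on that trade-off, here is a rough back-of-envelope calculation of how long one full audit pass takes at the two rates (the 200 GB of object data per node is just an assumed figure):

# Approximate time for one full audit pass at the default vs. throttled rate
awk 'BEGIN {
    data = 200e9                                  # assumed: ~200 GB of object data on the node
    printf "default   (10 MB/s) : %.1f hours\n", data / 10000000 / 3600
    printf "throttled (2.5 MB/s): %.1f hours\n", data / 2500000 / 3600
}'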

Also, I've found that increasing the run_pause setting from its default of 30 in the account, container and object replicator sections reduced both CPU and I/O workload, particularly when large chunked file uploads are done.

Here are all my account/container storage node config files if they are of help:

/etc/swift/account-server/1.conf

[DEFAULT]
devices = /srv/node/
bind_port = 6068
user = swift
workers = 2
log_facility = LOG_LOCAL2
mount_check = true

[pipeline:main]
pipeline = recon account-server

[app:account-server]
use = egg:swift#account
set log_address = /dev/log

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

[account-replicator]
concurrency = 4
run_pause = 45

recon_cache_path = /var/cache/swift
vm_test_mode = no

[account-auditor]
recon_cache_path = /var/cache/swift

[account-reaper]
concurrency = 25

/etc/swift/container-server/1.conf
[DEFAULT]
devices = /srv/node/
bind_port = 6067
user = swift
workers = 2
log_facility = LOG_LOCAL2
mount_check = true

[pipeline:main]
pipeline = recon container-server

[app:container-server]
use = egg:swift#container

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

[container-replicator]
concurrency = 8
run_pause = 60

recon_cache_path = /var/cache/swift
vm_test_mode = no

[container-updater]
concurrency = 4

recon_cache_path = /var/cache/swift

[container-auditor]
recon_cache_path = /var/cache/swift

[container-sync]
interval = 300
container_time = 60

Overall, I've found that spending time tuning the rings and then testing with swift-bench plus a script of sequential uploads provides the performance feedback needed to get the I/O and CPU load on the storage nodes to a more manageable level.
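In case it helps, the sequential upload part doesn't need to be anything fancy; a bare-bones loop against the proxy looks something like this (the storage URL and token are placeholders to fill in from your own auth setup):

#!/bin/bash
# Minimal sequential-upload test against the proxy (placeholders: fill in for your auth setup)
STORAGE_URL="http://<proxy-ip>:8080/v1/<account>"   # placeholder endpoint
TOKEN="<auth-token>"                                # placeholder token from keystone/tempauth
CONTAINER=benchtest
COUNT=100

dd if=/dev/zero of=/tmp/testobj bs=1M count=10 2>/dev/null
curl -s -X PUT -H "X-Auth-Token: $TOKEN" "$STORAGE_URL/$CONTAINER" >/dev/null
for i in $(seq 1 $COUNT); do
    curl -s -X PUT -H "X-Auth-Token: $TOKEN" -T /tmp/testobj \
         "$STORAGE_URL/$CONTAINER/obj_$i" >/dev/null
done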

Hope this helps

Steve a

#4 David Hill (david-hill-ubisoft) said:

I can see the same behavior here. Every 30 minutes, for approximately 30 minutes, we have 100% usage. Did you get any more input on this?

Dave

#5 clayg (clay-gerrard) said:

I think that maybe some of the runtime tuning for the auditor and replicator is not fully documented. I see run_pause mentioned in this bug. The new multi-disk concurrent auditor should allow you to do less auditing I/O per disk without reducing the overall cycle time of the auditor. You can also independently tune the disk_chunk_size for the auditor, which would allow it to do larger reads and use less CPU.
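If you want to see what a particular auditor setting does in isolation, one simple approach (a sketch; it assumes swift-init's "once" mode and the sysstat package) is to kick off a single audit pass and watch the read load it generates:

# Run one audit pass on its own and watch per-disk utilisation while it runs
swift-init object-auditor once &
iostat -x 5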
