Comment 416 for bug 131094

Christian Ehrhardt (paelzer) wrote:

It might not be good to stir up such an old bug, but it gets regularly updated with new complaints, so maybe a new approach might help.

So let us make one thing clear: IMHO, if something overloads your machine with disk I/O, it is bound to stall it.
So the solution paths are more like this:
a) beat it with more processing / I/O HW
b) mitigate the effect as far as possible
c) avoid the overload before it starts

The issue is a common one - so I'll keep my explanations general and not specific to trackerd or any other case that was mentioned before.

### a) beat it with more processing / I/O HW ###
There are far more expensive machines out there which can handle way more I/O without being slowed down. The reason is that they have more I/O cards, virtual functions to spread the handling over several CPUs, and at the high end entirely different I/O IRQ designs.
We should agree that on cheap/slow or even medium machines I/O overload just *IS* an issue for responsiveness.
But that isn't the point - the question is what a normal user can do about it, and spending x000000 $ on a machine isn't the solution.

### b) mitigate the effect as far as possible ###
Regarding mitigation, some approaches were already discussed in this bug,
like using ionice and several dirty ratio tunings, but none of these prevent the I/O overload.
E.g. if you submit the overload only in the "best effort" I/O class, the only difference it makes is that "other I/O" might pass faster, but your system is still fairly busy => unresponsive.
Dirty ratios, on the other hand, come down to spending the process's remaining time slice on cleaning up dirty memory as soon as a certain level is reached. You can configure higher ratios (at the price of endangering integrity), but that won't stop the burst of I/O either; instead it allows even more data to dirty the page cache and thereby indirectly creates more I/O, overloading the system again.
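
For reference, these mitigations look roughly like this; the values and the tar command are just placeholders, and the effect is as limited as described above:

# run an I/O heavy task in the idle I/O scheduling class (honoured by the CFQ scheduler)
ionice -c 3 tar czf /tmp/backup.tar.gz /home/paelzer
# or demote an already running process to the lowest best-effort priority (1234 = placeholder PID)
ionice -c 2 -n 7 -p 1234
# allow more dirty page cache before writeback kicks in (trade-off: more unsynced data at risk)
sysctl -w vm.dirty_background_ratio=10
sysctl -w vm.dirty_ratio=40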

### c) avoid the overload before it starts ###
It must be said that, since this bug dates back to 2007 and a lot of the reports are related to I/O+*sync, various filesystem and general kernel improvements have been made just for sync & journaling. Several posts in this bug already confirm this.
Now, what I haven't seen yet is people trying to throttle the processes that overload the system.
Throttling is documented at => https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
As with any approach, this one has certain limitations, but it is a new way to tackle the overall issue.
It also needs certain cgroup and filesystem features (like accounting for writeback through the page cache) which might only be available in modern Ubuntu releases.
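
Before trying it you can check whether your kernel and cgroup setup provide the needed bits; the paths below assume the usual cgroup v1 layout of current Ubuntu releases:

# the throttle policy needs CONFIG_BLK_DEV_THROTTLING in the kernel
grep BLK_DEV_THROTTLING /boot/config-$(uname -r)
# the blkio controller has to be known and mounted (usually done by default)
grep blkio /proc/cgroups
mount | grep /sys/fs/cgroup/blkio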

### Experiment ###
As an experiment to demonstrate the approach I use the tools fio and latencytop to compare:
1. no background load, checking latencytop
2. running a multithreaded random read/write fio job in the background, checking latencytop
3. running the same fio job throttled in the background, checking latencytop

# Background Load #
A fio job file like this:

[global]
# mixed random reads/writes, block sizes split 25% 1k / 50% 4k / 25% 64k
ioengine=libaio
rw=randrw
bssplit=1k/25:4k/50:64k/25
size=512m
directory=/home/paelzer/latencytest
iodepth=8

# 8 jobs doing direct I/O (bypassing the page cache)
[dio]
direct=1
numjobs=8

# 8 jobs doing buffered I/O (through the page cache)
[pgc]
direct=0
numjobs=8
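
Assuming the job file is saved as causelatency.fiojob (the name used in the cgexec call further below), the comparison is simply latencytop in one terminal and fio in another:

# terminal 1: watch where the kernel makes processes wait
sudo latencytop
# terminal 2: the (unthrottled) background load of case 2
fio causelatency.fiojob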

# Case 1 - No background load => almost no latency
Cause Maximum Percentage
Waiting for event (select) 5,0 msec 39,7 %
Waiting for event (poll) 5,0 msec 33,9 %
Userspace lock contention 4,8 msec 25,7 %
[do_wait] 2,7 msec 0,4 %
[ep_poll] 2,4 msec 0,2 %
Reading from file 0,9 msec 0,0 %
Reading EXT3 directory htree 0,2 msec 0,0 %
[hrtimer_nanosleep] 0,1 msec 0,0 %

# Case 2 - Unrestricted background load overloading the I/O subsystem shows massive impact
- ext4 data/log writes
- memory management due to thrashing the page cache
...
=> the fio job itself runs fast:
Jobs: 16 (f=16): [m(16)] [6.7% done] [92482KB/99.50MB/0KB /s] [6302/6483/0 iops] [eta 01m:51s]

Cause Maximum Percentage
[ext4_file_write_iter] 91,8 msec 0,3 %
[wait_transaction_locked] 63,4 msec 0,1 %
Marking inode dirty 61,2 msec 0,9 %
[SyS_io_destroy] 46,3 msec 0,3 %
[lru_add_drain_all] 18,0 msec 0,1 %
[__block_write_begin] 16,8 msec 38,5 %
[__lock_page_killable] 16,2 msec 34,7 %
[read_events] 5,0 msec 21,2 %
Waiting for event (poll) 5,0 msec 1,9 %

# Case 3 - Now the same workload but contained in a blkio throttled cgroup
mkdir /sys/fs/cgroup/blkio/limitbgload
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 29,3G 0 disk
├─sda1 8:1 0 28,3G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 1021M 0 part
# Limit to 4 MB/s write and 8 MB/s read speed
echo 8:0 $((1024*1024*4)) > /sys/fs/cgroup/blkio/limitbgload/blkio.throttle.write_bps_device
echo 8:0 $((1024*1024*8)) > /sys/fs/cgroup/blkio/limitbgload/blkio.throttle.read_bps_device
cgexec -g blkio:limitbgload fio causelatency.fiojob
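
The same group can also be applied to an already running process instead of starting it via cgexec; the PID below is just a placeholder, e.g. for an indexer like trackerd:

# move a running process (and all its threads) into the limited group
echo 1234 > /sys/fs/cgroup/blkio/limitbgload/cgroup.procs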

The workload shows throttling is working:
Jobs: 16 (f=16): [m(16)] [22.0% done] [6724KB/8915KB/0KB /s] [577/598/0 iops] [eta 09m:25s]

But we can also see its desired effect: it avoids overloading the system with I/O.
Cause Maximum Percentage
[__lock_page_killable] 132,2 msec 46,5 %
[__block_write_begin] 131,4 msec 47,9 %
fsync() on a file (type 'F' for details) 30,7 msec 0,0 %
Marking inode dirty 21,5 msec 0,1 %
[ext4_file_write_iter] 5,2 msec 0,0 %
Waiting for event (select) 5,0 msec 1,4 %
Userspace lock contention 5,0 msec 1,0 %
Waiting for event (poll) 5,0 msec 1,7 %
[read_events] 4,9 msec 1,3 %

=> this shows almost only the stalls caused by the throttling itself, which are intended
=> the dirtying and filesystem latencies are way smaller now
=> the system "feels" right regarding responsiveness

### TL;DR ###
- huge machines just beat I/O overload with more HW or a better I/O architecture
- code keeps improving to mitigate the effects, but can never be perfect for *ALL* users at once (especially in the default config)
- try throttling the processes that overload I/O if you do not need their results asap (see the sketch below)
=> Let us discuss if that would be an option, and if so let us close this bug and open a separate one requesting configurable throttling for each applicable component, like trackerd and the many other I/O heavy background tasks.
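
As a rough sketch of what such per-component configuration could look like (assuming the component runs as a systemd service and the legacy blkio controller is in use; unit and device names are placeholders), a drop-in like the following applies the same kind of limit without cgexec, after a systemctl daemon-reload and a restart of the service:

# /etc/systemd/system/example-indexer.service.d/iolimit.conf
[Service]
BlockIOAccounting=true
BlockIOReadBandwidth=/dev/sda 8M
BlockIOWriteBandwidth=/dev/sda 4M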