Applies to SUSE Linux Enterprise Desktop 12

11 Tuning I/O Performance

I/O scheduling controls how input/output operations are submitted to storage. SUSE Linux Enterprise Desktop offers various I/O algorithms, called elevators, that suit different workloads. Elevators can help to reduce seek operations, can prioritize I/O requests, or can make sure that an I/O request is carried out before a given deadline.

Choosing the best-suited I/O elevator depends not only on the workload, but also on the hardware. Single ATA disk systems, SSDs, RAID arrays, or network storage systems, for example, each require different tuning strategies.

11.1 Switching I/O Scheduling

SUSE Linux Enterprise Desktop lets you set a default I/O scheduler at boot time, which can be changed on the fly per block device. This makes it possible to set different algorithms for, for example, the device hosting the system partition and the device hosting a database.

By default the CFQ (Completely Fair Queuing) scheduler is used. Change this default by entering the boot parameter

elevator=SCHEDULER

where SCHEDULER is one of cfq, noop, or deadline. See Section 11.2, “Available I/O Elevators” for details.
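To confirm that the running kernel was actually booted with this parameter, the kernel command line can be inspected. The following is a minimal sketch; the function name boot_elevator and its optional file argument are hypothetical and only serve the illustration:

```shell
# Print the value of the elevator= boot parameter, if one was given.
# The optional argument names the file to inspect. It defaults to
# /proc/cmdline and exists only so the function can be tried on a sample file.
boot_elevator() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split the command line into one parameter per line, then strip the key.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^elevator=//p'
}
```

Calling boot_elevator without arguments prints nothing if the system was booted without an elevator parameter, in which case the default scheduler is in effect.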

To change the elevator for a specific device in the running system, run the following command:

echo SCHEDULER > /sys/block/DEVICE/queue/scheduler

Here, SCHEDULER is one of cfq, noop, or deadline and DEVICE is the block device (sda for example).
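The command above can be wrapped in a small helper that checks the device path before writing and prints the resulting setting. This is only a sketch: the function name set_scheduler and the optional SYSFS_ROOT argument are assumptions made for illustration, and writing to /sys requires root privileges.

```shell
# Switch the I/O scheduler of one block device and print the result.
# Usage: set_scheduler DEVICE SCHEDULER [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys. It is a hypothetical extra parameter that
# allows trying the function against a scratch directory.
set_scheduler() {
    sched_file="${3:-/sys}/block/$1/queue/scheduler"
    if [ ! -w "$sched_file" ]; then
        echo "cannot write $sched_file (wrong device name or not root?)" >&2
        return 1
    fi
    echo "$2" > "$sched_file" || return 1
    # On a real system, the active scheduler is now shown in brackets.
    cat "$sched_file"
}
```

For example, set_scheduler sda deadline switches the device sda to the DEADLINE elevator.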

11.2 Available I/O Elevators

The elevators available on SUSE Linux Enterprise Desktop are listed below. Each elevator has a set of tunable parameters, which can be set with the following command:

echo VALUE > /sys/block/DEVICE/queue/iosched/TUNABLE

where VALUE is the desired value for the TUNABLE and DEVICE the block device.

To find out which elevator is currently active for a device, read its scheduler file as shown in the following command. The currently selected scheduler is listed in brackets:

jupiter:~ # cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
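In scripts it is often more convenient to obtain only the bracketed entry. A minimal sketch, assuming the output format shown above; the function name active_scheduler and the optional SYSFS_ROOT argument are hypothetical:

```shell
# Print only the active scheduler of a device, that is, the bracketed entry.
# Usage: active_scheduler DEVICE [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys and exists only for trying the function
# against a scratch directory.
active_scheduler() {
    # Keep the text between "[" and "]" and discard everything else.
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${2:-/sys}/block/$1/queue/scheduler"
}
```

With the output shown above, active_scheduler sda would print cfq.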

11.2.1 CFQ (Completely Fair Queuing)

CFQ is a fairness-oriented scheduler and is used by default on SUSE Linux Enterprise Desktop. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of I/O throughput. It also allows assigning tasks I/O priorities which are taken into account during scheduling decisions (see man 1 ionice). The CFQ scheduler has the following tunable parameters:

/sys/block/<device>/queue/iosched/slice_idle

When a task has no more I/O to submit in its time slice, the I/O scheduler waits for a while before scheduling the next thread to improve locality of I/O. For media where locality does not play a big role (SSDs, SANs with lots of disks) setting /sys/block/<device>/queue/iosched/slice_idle to 0 can improve the throughput considerably.

/sys/block/<device>/queue/iosched/quantum

This option limits the maximum number of requests that are being processed at once by the device. The default value is 4. For storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance. However, it can also cause latency of certain I/O operations to increase because more requests are buffered inside the storage. When changing this value, you can also consider tuning /sys/block/<device>/queue/iosched/slice_async_rq (the default value is 2). This limits the maximum number of asynchronous requests (usually write requests) that are submitted in one time slice.

/sys/block/<device>/queue/iosched/low_latency

When enabled (which is the default on SUSE Linux Enterprise Desktop) the scheduler may dynamically adjust the length of the time slice by aiming to meet a tuning parameter called the target_latency. Time slices are recomputed to meet this target_latency and ensure that processes get fair access within a bounded length of time.

/sys/block/<device>/queue/iosched/target_latency

Contains an estimated latency time for the CFQ. CFQ will use it to calculate the time slice used for every task.
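The slice_idle recommendation for media without seek penalty can be applied to several devices at once. The sketch below relies on the rotational flag that the kernel exposes in each device's queue directory, and treats a value of 0 as "locality does not matter". That is an assumption: the flag may not reflect the real characteristics of SAN or RAID volumes. The function name tune_slice_idle and the optional SYSFS_ROOT argument are hypothetical.

```shell
# Set slice_idle to 0 on every non-rotational device that currently uses CFQ.
# Usage: tune_slice_idle [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys and exists only for trying the function
# against a scratch directory. Writing to /sys requires root.
tune_slice_idle() {
    sysfs="${1:-/sys}"
    for queue in "$sysfs"/block/*/queue; do
        [ -r "$queue/rotational" ] || continue
        # rotational is 0 for devices the kernel considers non-rotating.
        [ "$(cat "$queue/rotational")" = "0" ] || continue
        # slice_idle is only present while CFQ is the active elevator.
        [ -w "$queue/iosched/slice_idle" ] || continue
        echo 0 > "$queue/iosched/slice_idle"
        echo "$queue: slice_idle set to 0"
    done
}
```

As with any tuning change, measure the workload before and after running such a helper.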

Example 11.1: Increasing individual thread throughput using CFQ

In SUSE Linux Enterprise Desktop 12 the low_latency tuning parameter is enabled by default to ensure that processes get fair access within a bounded length of time. Note that this parameter was not enabled in versions prior to SUSE Linux Enterprise 12.

This is usually preferred in a server scenario where processes are executing I/O as part of transactions, because the time to complete each transaction will be predictable. The exception is if the performance metric of interest is the peak performance of a single process when there is I/O contention. Another exception is if a workload must complete as quickly as possible and there are multiple sources of I/O. In the latter example, unfair treatment from the I/O scheduler may complete the transactions faster as processes take their full slice, exit quickly, and reduce overall contention.

To address this, there are two options: increase target_latency or disable low_latency. As with all tuning parameters, it is important to verify that your workload behaves as expected before and after the tuning modification. Take careful note of whether your workload depends on individual process peak performance or scales better with fairness. Note also that performance depends on the underlying storage, so the correct tuning option for one installation may not be correct for another.

Find below an example that does not control when I/O starts but is simple enough to just demonstrate the point. 32 processes are writing a small amount of data to disk in parallel. Using the SUSE Linux Enterprise Desktop default (enabling low_latency), the result looks as follows:

root # echo 1 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh 
10485760 bytes (10 MB) copied, 2.62464 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 3.29624 s, 3.2 MB/s
10485760 bytes (10 MB) copied, 3.56341 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.56908 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53043 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.57511 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53672 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.5433 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.65474 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.63694 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.90122 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88507 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.86135 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.84553 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88871 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.94943 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.12731 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.15106 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.21601 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.35004 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.33387 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.55434 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52283 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52682 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.56176 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.62727 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.78958 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.79772 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.78004 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.77994 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.86114 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.88062 s, 2.1 MB/s

real    0m4.978s
user    0m0.112s
sys     0m1.544s

Note that each process completes in a similar time. This is the CFQ scheduler meeting its target_latency and giving each process fair access to storage. The early processes completed faster because the start time of the processes was not identical. This could be controlled for, but not in this simple example.

This is what happens when low_latency is disabled:

root # echo 0 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh 
10485760 bytes (10 MB) copied, 0.813519 s, 12.9 MB/s
10485760 bytes (10 MB) copied, 0.788106 s, 13.3 MB/s
10485760 bytes (10 MB) copied, 0.800404 s, 13.1 MB/s
10485760 bytes (10 MB) copied, 0.816398 s, 12.8 MB/s
10485760 bytes (10 MB) copied, 0.959087 s, 10.9 MB/s
10485760 bytes (10 MB) copied, 1.09563 s, 9.6 MB/s
10485760 bytes (10 MB) copied, 1.18716 s, 8.8 MB/s
10485760 bytes (10 MB) copied, 1.27661 s, 8.2 MB/s
10485760 bytes (10 MB) copied, 1.46312 s, 7.2 MB/s
10485760 bytes (10 MB) copied, 1.55489 s, 6.7 MB/s
10485760 bytes (10 MB) copied, 1.64277 s, 6.4 MB/s
10485760 bytes (10 MB) copied, 1.78196 s, 5.9 MB/s
10485760 bytes (10 MB) copied, 1.87496 s, 5.6 MB/s
10485760 bytes (10 MB) copied, 1.9461 s, 5.4 MB/s
10485760 bytes (10 MB) copied, 2.08351 s, 5.0 MB/s
10485760 bytes (10 MB) copied, 2.28003 s, 4.6 MB/s
10485760 bytes (10 MB) copied, 2.42979 s, 4.3 MB/s
10485760 bytes (10 MB) copied, 2.54564 s, 4.1 MB/s
10485760 bytes (10 MB) copied, 2.6411 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 2.75171 s, 3.8 MB/s
10485760 bytes (10 MB) copied, 2.86162 s, 3.7 MB/s
10485760 bytes (10 MB) copied, 2.98453 s, 3.5 MB/s
10485760 bytes (10 MB) copied, 3.13723 s, 3.3 MB/s
10485760 bytes (10 MB) copied, 3.36399 s, 3.1 MB/s
10485760 bytes (10 MB) copied, 3.60018 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.58151 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.67385 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.69471 s, 2.8 MB/s
10485760 bytes (10 MB) copied, 3.66658 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.81495 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.10172 s, 2.6 MB/s
10485760 bytes (10 MB) copied, 4.0966 s, 2.6 MB/s

real    0m3.505s
user    0m0.160s
sys     0m1.516s

Note that the completion times are spread much wider because processes are not getting fair access. Some processes complete faster and exit, which allows the total workload to complete faster, and some processes measure higher apparent I/O performance. It is also important to note that this example may not behave similarly on all systems, as the results depend on the resources of the machine and the underlying storage.

It is important to emphasise that neither tuning option is inherently better than the other. Both are best in different circumstances and it is important to understand the requirements of your workload and tune accordingly.
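The dd-test.sh script used in this example is not reproduced here. A workload of the kind described (32 processes each writing 10 MB in parallel) could be sketched as follows. This is a hypothetical reconstruction, not the original script: the function name dd_test, its arguments, the file names, and the use of conv=fsync (a GNU dd option) are all assumptions.

```shell
# Start several dd writers in parallel and wait for all of them to finish.
# Usage: dd_test [PROCESSES] [MEGABYTES] [TARGET_DIR]
# Defaults (32 processes, 10 MB each) mirror the numbers in the example;
# the target directory defaults to the current directory.
dd_test() {
    procs="${1:-32}"
    mbytes="${2:-10}"
    dir="${3:-.}"
    i=1
    while [ "$i" -le "$procs" ]; do
        # conv=fsync makes dd flush the data to disk before it reports its
        # timing. Only the summary line of dd's statistics is kept.
        dd if=/dev/zero of="$dir/ddfile.$i" bs=1048576 count="$mbytes" \
            conv=fsync 2>&1 | tail -n 1 &
        i=$((i + 1))
    done
    wait    # block until every background writer has finished
}
```

Saved as a script whose last line calls dd_test, it can be run under time as in the example. Like the original, it does not synchronize the start of the writers.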

11.2.2 NOOP

A trivial scheduler that only passes down the I/O that comes to it. Useful for checking whether complex I/O scheduling decisions of other schedulers are causing I/O performance regressions.

In some cases, this scheduler can be helpful for devices that do I/O scheduling themselves, such as intelligent storage, or devices that do not depend on mechanical movement, like SSDs. Usually, the DEADLINE I/O scheduler is a better choice for these devices. However, NOOP creates less overhead and can therefore increase performance on certain workloads.

11.2.3 DEADLINE

DEADLINE is a latency-oriented I/O scheduler. Each I/O request is assigned a deadline. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) in which requests are sorted by deadline. As long as no request has timed out, the sector queue is used. When timeouts occur, requests from the deadline queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.

This scheduler can provide superior throughput over the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue, for example, for several parallel readers from a SAN and for databases (especially when using TCQ disks). The DEADLINE scheduler has the following tunable parameters:

/sys/block/<device>/queue/iosched/writes_starved

Controls how many reads can be sent to disk before it is possible to send writes. A value of 3 means that three read operations are carried out for one write operation.

/sys/block/<device>/queue/iosched/read_expire

Sets the deadline (current time plus the read_expire value) for read operations in milliseconds. The default is 500.

/sys/block/<device>/queue/iosched/write_expire

Sets the deadline (current time plus the write_expire value) for write operations in milliseconds. The default is 5000.
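Before changing any of these parameters, it helps to record their current values. The following sketch prints every file in a device's iosched directory; it works for whichever elevator is active on the device. The function name show_tunables and the optional SYSFS_ROOT argument are hypothetical.

```shell
# Print every tunable of a device's active elevator as "name=value".
# Usage: show_tunables DEVICE [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys and exists only for trying the function
# against a scratch directory.
show_tunables() {
    iosched_dir="${2:-/sys}/block/$1/queue/iosched"
    for tunable in "$iosched_dir"/*; do
        # Skip the unexpanded pattern if the directory is empty or missing.
        [ -f "$tunable" ] || continue
        echo "$(basename "$tunable")=$(cat "$tunable")"
    done
}
```

For example, show_tunables sda lists the parameters for sda. Saving the output before a tuning session makes it easy to restore the previous values.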

11.3 I/O Barrier Tuning

Most file systems (such as XFS, Ext3, Ext4, or reiserfs) send write barriers to disk after fsync or during transaction commits. Write barriers enforce proper ordering of writes, making volatile disk write caches safe to use (at some performance penalty). If your disks are battery-backed in one way or another, disabling barriers can safely improve performance.

Sending write barriers can be disabled using the barrier=0 mount option (for Ext3, Ext4, and reiserfs), or using the nobarrier mount option (for XFS).

Warning: Disabling Barriers Can Lead to Data Loss

Disabling barriers when disks cannot guarantee caches are properly written in case of power failure can lead to severe file system corruption and data loss.
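The mapping between file system type and mount option described in this section can be captured in a small helper, for example for use in a script that remounts a file system. This is a sketch only; the function name nobarrier_option is hypothetical, and the warning above applies to any use of it.

```shell
# Print the mount option that disables write barriers for a file system type,
# following the mapping given in this section.
# Returns a non-zero status for file system types not covered here.
nobarrier_option() {
    case "$1" in
        ext3|ext4|reiserfs) echo "barrier=0" ;;
        xfs)                echo "nobarrier" ;;
        *)                  return 1 ;;
    esac
}
```

A call such as mount -o remount,"$(nobarrier_option ext4)" /data, where /data is a placeholder mount point, would then disable barriers on an already mounted Ext4 file system.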
