MySQL Tuning Guides on CentOS

Checklist build MySQL from scratch

Application/Users Behavior
H/W Layer (RAID / RPM / Technology)
OS Lookup (Kernel, Filesystems support)
Kernel Focus on I/O part (which are Schedulers / Elevators support ?)
Filesystem / LVM (Blocksize depend of strip size)
Partition Layout (IO Alignment)
Tuned Kernel Parameters (I/O)
MySQL (Storage Engine/Partitioning)

Application/Users Behavior

At first, You must analyst the behavior of applications / systems to be running and because we are talking about mysql, so it must be a DBA perspective ! Web apps typically have a read/write ratio of somewhere between 80/20 and 90/10. If we can scale read capacity, we can solve a lot of situations

Disk R/W type of Query

– INSERT: Random reads/writes happen when inserting into indexes in random order
– UPDATE/DELETE: Random reads/writes happen when modifying records
– SELECT with indexes: Sequential reads on Indexes, Random reads on Tables for retrieve records
– SELECT FULL TABLE SCAN: Sequential reads happen when retrieve records

* All 4 cases above think out of Filesystem Caching / MySQL Logical Buffer & Caching

Bottleneck shifts from storage to CPU/Network

– Massively executed SQL statements should be migrated to NoSQL / HandlerSocket /etc

Separating tables

History tables (only recent data is accessed) fit with HDD because most of active data fit in memory
Rarely accessed tables can also be stored on HDD
Other tables on SSD

H/W Layer (RAID / RPM / Technology)

1.2.1 Basic Performance on R/W HDDs

Random Reads (base on IOPs specs)
Random Writes (base on IOPs specs)
Sequential Reads (base on Throughput MB/s disk specs)
Sequential Writes (base on Throughput MB/s disk specs)

Because of almost all database operations require random access, for example by query type {select, update,delete, modify index}

Selecting records by index scan
Updating records
Deleting records
Modifying indexes

1.2.2 General HDD/SSD Specification

Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow)
SSD : 2,000+ (writes) / 5,000+ (reads) per drive highly depending on SSDs and device drivers

Best Practice: Writes can be boosted by using BBWC/FBWC (Battery Backed up Write Cache) on RAID Controller, especially for REDO Logs (because it’s sequentially writes)

OS Lookup

* I choose CentOS, currently is 6.2 it is a great community and resources support.

Kernel Focus

* Used earliest version is preferable (think easier)

I/O scheduler

Note: RDBMS (especially InnoDB) also schedules I/O requests so theoretically Linux I/O scheduler is not needed
Linux has I/O schedulers

Tune to efficiently handle lots of I/O requests
“I/O scheduler type” and “Queue Size” matters

Types of I/O schedulers (introduced in 2.6.10: RHEL5)

noop: Sorting incoming i/o requests by logical block address, that’s all
deadlilne: Prioritize read (sync) requests rather than write requests (async) to some extent (to avoid “write-starving-reads” problem) — deadline recommend for RDBMS
cfq(default): Fairly scheduling i/o requests per i/o thread
anticipatory: Removed in 2.6.33 (bad scheduler. Don’t use it)

Default is cfq, but noop / deadline is better in many cases

– # echo noop > /sys/block/sdX/queue/scheduler

MyISAM does not optimize I/O requests internally

Highly depending on OS and storage
When inserting into indexes, massive random disk writes/reads happen

Increasing I/O queue size reduces disk seek overheads

# echo 100000 > /sys/block/sdX/queue/nr_requests

FileSystem

ext2 :

Faster for writes because ext2 doesn’t support journaling
It takes very long time for fsck
On active-active redundancy environment (i.e. MySQL Replication), in some cases ext2 is used to gain performance

ext3 :

By far the most widely used filesystem, But not always the best
Deleting large files takes long time (can fix by create hard-link to datafile and drop table, then remove hard-link later)

Internally has to do a lot of random disk i/o (slow on HDD)
In MySQL, if it takes long time to DROP table, all client threads will be blocked to open/close tables (by LOCK_open mutex)

Writing to a file is serialized

Serialized by “i-mutex”, allocated per i-node
Sometimes it is faster to allocate many files instead of single huge file
Less optimized for faster storage (like PCI-Express SSD)

Use “dir_index” to speed up searching files

xfs :

Fast for dropping files
Concurrent writes to a file is possible when using O_DIRECT
Not officially supported in current RHEL (Supported in SuSE)
Disable write barrier by setting “nobarrier”

For the good way to go is xfs, but in RAID+HDD environment it’s not differrent on ext3 and xfs. So ext3 is the choosen filesystem to go and you can tuned parameters to make ext3 like ext2 with mount options later.

File system plays very important role, it maps files logical address to physical address (at a certain level) so when writing a file, file system decides how to distribute writes properly so they make the best use of the underlying storage, it also ensures file starts in a proper position relative stripe size. The size of logical IO units also falls to the file system.

The goal is to write and read as little as possible. If you gonna be writing small (say 500 byte) files mostly, it’s best to use 512-byte blocks, for bigger files 4k may make more sense (you can’t use blocks bigger than 4k (page size) on Linux unless you are using HugePages). Some file systems let you set stripe width and stripe unit size so they can a proper alignment based on that. Mind however that different file systems (and different versions of them) might be using different units for these options so you should regard a manual on your system to be sure you’re doing the right thing.

Say we have 6-disk RAID5 (so 5 bearing disks) with 64k stripe unit size and 4k file system block size, here’s how we would create the file system:

xfs - mkfs.xfs -b 4k -d su=64k,sw=5 /dev/sdX

(alternatively you can specify sunit=X,swidth=Y as options when mounting the device)

ext2/3/4 - mke2fs -b 4096 -E stride=16,stripe-width=80 /dev/sdX

(some older versions of ext2/3 do not support stripe-with)

Partition Layout

Aligning IO on a hard disk RAID

Because flash storage is becoming more popular, IO alignment question keeps popping up more often than it used to when all we had were rotating disk drive. I think the reason is very simple– when systems only had one bearing hard disk drive (HDD) as in RAID1 or one hard disk drive at all, you couldn’t really have misaligned IO because HDDs operate in 512-byte sectors which’s also the smallest amount of disk IO that systems possible. NAND flash on the other hand can have a page size of 512-bytes, 2kbytes or 4kbytes (and often you don’t know what size it is really) so the IO alignment question becomes more relevant.

It was and still is, however, relevant with HDD RAID storage– technology we have been using for many years– when there’s striping like in RAID0, 5, 6 or any variation of them (5 +0, 1 +0, 1 +0 +0 etc.). While IO inside the RAID is perfectly aligned to disk sectors (again due to the fact operations are finished multiples of 512-bytes), outside of the RAID you want to align IO to a stripe element as you may otherwise end up reading or writing to more disks than you would like to. Recent systems like CentOS6 (not verified) and Ubuntu 10.04 (verified) would by default align to 1MB if storage does not support IO alignment hints which is good enough for most cases. Checking by use command fdisk -l it would be start at 2048th sector (previous block 0 – 2047th sector)

e.g., for the disk here so the RAID has 4 bearing disks (spindles) and if there’s a misalignment on the file system, you can see how RAID ends up doing 5 IO operations (or worth case to 6 IO operations) see through this link for Aligning IO on a hard disk RAID

Because HDD and SSD are mapped to OS via filesystem and divided into partition and mount over directory to usage. So we must design layout of partition carefully by following section:

Put sequentially written files on HDD

ibdata, ib_logfile, binary log files
HDD is fast enough for sequential writes
Write performance deterioration can be mitigated
Life expectancy of SSD will be longer

Put randomly accessed files on SSD

*ibd files, index files(MYI), data files(MYD)
SSD is 10x -100x faster for random reads/writes than HDD

Archive less active tables/records to HDD

SSD is still much expensive than HDD
Partitioning/Sharding data

Tuned Kernel Parameters

N/W card 1 Gbit Paramaters:

#ethtool -s eth0 speed 1000 duplex full

Edit the configuration for the network card:

#vi /etc/sysconfig/network-scripts/ifcfg-eth0

Add the following line:

ETHTOOL_OPTS='autoneg on speed 1000 duplex full'

Restart the network:

#service network restart

IPTABLES:

If you still need to keep iptables running you can simply adjust the following settings:
Check the current connections limit (only works if iptables is running):

#vi /etc/sysctl.conf

Change the connections limit, Add the following lines:

ip_conntrack_max = 196608

net.ipv4.ip_local_port_range = 1025 65535

net.ipv4.tcp_max_tw_buckets = 1000000

net.core.somaxconn = 10000

net.ipv4.tcp_max_syn_backlog = 2000

net.ipv4.tcp_fin_timeout = 30

#vi /etc/modprobe.conf

To change the buckets limit, Add the following lines:

options ip_conntrack hashsize=32768

To take effect must reboot

# reboot

Files Limit:

#vi /etc/security/limits.conf

Add the following lines:

soft nofile 8192

hard nofile 8192

MySQL

MySQL file location

Random I/O oriented:

Data Files (*.ibd)

Sequential reads if doing full table scan

Undo Log, Insert Buffer (ibdata)

UNDO tablespace (small in most cases, except for running long-running batch)
On-disk insert buffer space (small in most cases, except that InnoDB can not catch up with updating indexes)

Sequential Write oriented:

Doublewrite Buffer (ibdata0) — Write volume is equal to *ibd files. Huge
Binary log (mysql-bin.XXXXXX)
Redo log (ib_logfile)
Backup files

Useful commands and tools

iostat

Showing detailed I/O statistics per device
Very important tool because in most cases RDBMS becomes I/O bound
iostat -x
Check r/s, w/s, svctm, %util — IOPS is much more important than transfer size
Always %util = (r/s + w/s) * svctm (hard coded in the iostat source file) — Do not trust %util too much
Check svctm rather than %util — If your storage can handle 1000 IOPS, svctm should be less than 1.00 (ms) so you can send alerts if svctm is higher than 1.00 for a couple of minutes

mpstat

Per CPU core statistics
You can also check network bottlenecks (%irq, %soft) from mpstat
If you check only vmstat/top/iostat/sar you will not notice single threaded bottleneck

dstat

yum install dstat
Similar UI as vmstat, but it also shows network statistics
Disk and Net total is incorrect, if you use RAID/bonding (double counted) — Filter by disk/interface name that you want to trace
“mtstat” additionally supports mysql status outputs — https://launchpad.net/mtstat

oprofile

You can easily identify which functions consume CPU resources
Supporting both user space and system space profiling
Mainly used by database-internal developers
How to use
1. opcontrol –start (–no-vmlinux)
2. benchmarking
3. opcontrol –dump
4. opcontrol –shutdown
5. opreport -l /usr/local/bin/mysqld

Checking stalls

Database servers sometimes suffer from performance stalls

Not responding anything for 1-2 seconds

In many cases it’s database internal issue

Holding global mutex and doing expensive i/o
Buffer pool, redo log preflush, rollback segments (InnoDB mutex)
pthread_create()/clone() (LOCK_thread_count mutex)
Opening/closing tables (LOCK_open mutex)

Collecting statistics

Per-minute statistics (i.e. vmstat 60) is not helpful. Collect per-second statistics

gdb

Debugging tool for identifying why MySQL stalls
functionality to take thread stack dumps from a running process
Useful to identify where and why mysqld hangs up, slows down, etc — But you have to read MySQL source code
Debugging symbol is required on the target program

pmp (Poor Man’s Profiler)

A simple shell script (wrapper script) to summarize thread call stacks — Using gdb thread dump functionality
Useful for identifying why MySQL stalls
http://poormansprofiler.org/

gcore

Dumping core file
Dumping core file takes time, and blocks all threads during dump

Might take 1 hour when VIRT is 50GB+
Not useful on running systems, but can be used on a non- serving slave

If you have a program with DWARF symbols, target (running) process does not need symbols

You can run stripped mysqld on production, as long as you have debug-built mysqld

Capturing MySQL packets

Capturing MySQL packets is useful to trace slow queries / transactions / connection requests.

Modifying applications MySQL settings is not needed Starting from 5.1, you can dynamically enable/disable slow/general query logs

libpcap

library to capture network packets
Most of network capturing tools including tcpdump rely on libpcap
Packet loss often happens

tcpdump + mk-query-digest — It can trace how long each query takes, and it can summarize and sort by total execution time

[root#] tcpdump -i bond0 port 3306 -s 65535 -x -n -q -tttt -c 10000 > tcpdump.out

[root#] mk-query-digest --type=tcpdump --report-format=query_report tcpdump.out

MySlowTranCapture

https://github.com/yoshinorim/MySlowTranCapture
A tool to print “transactions” that take more than N milliseconds
Held locks are not released until COMMIT/ROLLBACK is executed

[mysql-server]# myslowtrancapture -i eth0

Resources