Checklist build MySQL from scratch

  1. Application/Users Behavior
  2. H/W Layer (RAID / RPM / Technology)
  3. OS Lookup (Kernel, Filesystems support)
  4. Kernel Focus on I/O part (which are Schedulers / Elevators support ?)
  5. Filesystem / LVM (Blocksize depend of strip size)
  6. Partition Layout (IO Alignment)
  7. Tuned Kernel Parameters (I/O)
  8. MySQL (Storage Engine/Partitioning)

Application/Users Behavior

At first, You must analyst the behavior of applications / systems to be running and because we are talking about mysql, so it must be a DBA perspective !  Web apps typically have a read/write ratio of somewhere between 80/20 and 90/10. If we can scale read capacity, we can solve a lot of situations

Disk R/W type of Query

– INSERT: Random reads/writes happen when inserting into indexes in random order
– UPDATE/DELETE: Random reads/writes happen when modifying records
– SELECT with indexes: Sequential reads on Indexes, Random reads on Tables for retrieve records
– SELECT FULL TABLE SCAN: Sequential reads happen when retrieve records

          * All 4 cases above think out of Filesystem Caching / MySQL Logical Buffer & Caching

Bottleneck shifts from storage to CPU/Network

– Massively executed SQL statements should be migrated to NoSQL / HandlerSocket /etc

Separating tables

  1. History tables (only recent data is accessed) fit with HDD because most of active data fit in memory
  2. Rarely accessed tables can also be stored on HDD
  3. Other tables on SSD

H/W Layer (RAID / RPM / Technology)

1.2.1 Basic Performance on R/W HDDs

  • Random Reads (base on IOPs specs)
  • Random Writes (base on IOPs specs)
  • Sequential Reads (base on Throughput MB/s disk specs)
  • Sequential Writes (base on Throughput MB/s disk specs)

Because of almost all database operations require random access, for example by query type {select, update,delete, modify index}

  • Selecting records by index scan
  • Updating records
  • Deleting records
  • Modifying indexes

1.2.2 General HDD/SSD Specification

  • Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow)
  • SSD : 2,000+ (writes) / 5,000+ (reads) per drive  highly depending on SSDs and device drivers

Best Practice: Writes can be boosted by using BBWC/FBWC (Battery Backed up Write Cache) on RAID Controller, especially for REDO Logs (because it’s sequentially writes)

OS Lookup

* I choose CentOS, currently is 6.2 it is a great community and resources support.

Kernel Focus

* Used earliest version is preferable (think easier)

I/O scheduler

  • Note: RDBMS (especially InnoDB) also schedules I/O requests so theoretically Linux I/O scheduler is not needed
  • Linux has I/O schedulers
    • Tune to efficiently handle lots of I/O requests
    • “I/O scheduler type” and “Queue Size” matters
  • Types of I/O schedulers (introduced in 2.6.10: RHEL5)
    • noop: Sorting incoming i/o requests by logical block address, that’s all
    • deadlilne: Prioritize read (sync) requests rather than write requests (async) to some extent (to avoid “write-starving-reads” problem) deadline recommend for RDBMS
    • cfq(default): Fairly scheduling i/o requests per i/o thread
    • anticipatory: Removed in 2.6.33 (bad scheduler. Don’t use it)
  • Default is cfq, but noop / deadline is better in many cases
    • – # echo noop > /sys/block/sdX/queue/scheduler
  • MyISAM does not optimize I/O requests internally
    • Highly depending on OS and storage
    • When inserting into indexes, massive random disk writes/reads happen
  • Increasing I/O queue size reduces disk seek overheads
    • # echo 100000 > /sys/block/sdX/queue/nr_requests


ext2 :

  • Faster for writes because ext2 doesn’t support journaling
  • It takes very long time for fsck
  • On active-active redundancy environment (i.e. MySQL Replication), in some cases ext2 is used to gain performance

ext3 :

  • By far the most widely used filesystem, But not always the best
  • Deleting large files takes long time (can fix by create hard-link to datafile and drop table, then remove hard-link later)
    • Internally has to do a lot of random disk i/o (slow on HDD)
    • In MySQL, if it takes long time to DROP table, all client threads will be blocked to open/close tables (by LOCK_open mutex)
  • Writing to a file is serialized
    • Serialized by “i-mutex”, allocated per i-node
    • Sometimes it is faster to allocate many files instead of single huge file
    • Less optimized for faster storage (like PCI-Express SSD)
  • Use “dir_index” to speed up searching files

xfs :

  • Fast for dropping files
  • Concurrent writes to a file is possible when using O_DIRECT
  • Not officially supported in current RHEL (Supported in SuSE)
  • Disable write barrier by setting “nobarrier”

For the good way to go is xfs, but in RAID+HDD environment it’s not differrent on ext3 and xfs. So ext3 is the choosen filesystem to go and you can tuned parameters to make ext3 like ext2 with mount options later.

File system plays very important role, it maps files logical address to physical address (at a certain level) so when writing a file, file system decides how to distribute writes properly so they make the best use of the underlying storage, it also ensures file starts in a proper position relative stripe size. The size of logical IO units also falls to the file system.

The goal is to write and read as little as possible. If you gonna be writing small (say 500 byte) files mostly, it’s best to use 512-byte blocks, for bigger files 4k may make more sense (you can’t use blocks bigger than 4k (page size) on Linux unless you are using HugePages). Some file systems let you set stripe width and stripe unit size so they can a proper alignment based on that. Mind however that different file systems (and different versions of them) might be using different units for these options so you should regard a manual on your system to be sure you’re doing the right thing.

Say we have 6-disk RAID5 (so 5 bearing disks) with 64k stripe unit size and 4k file system block size, here’s how we would create the file system:

Partition Layout

Aligning IO on a hard disk RAID

Because flash storage is becoming more popular, IO alignment question keeps popping up more often than it used to when all we had were rotating disk drive. I think the reason is very simple– when systems only had one bearing hard disk drive (HDD) as in RAID1 or one hard disk drive at all, you couldn’t really have misaligned IO because HDDs operate in 512-byte sectors which’s also the smallest amount of disk IO that systems possible. NAND flash on the other hand can have a page size of 512-bytes, 2kbytes or 4kbytes (and often you don’t know what size it is really) so the IO alignment question becomes more relevant.

It was and still is, however, relevant with HDD RAID storage– technology we have been using for many years– when there’s striping like in RAID0, 5, 6 or any variation of them (5 +0, 1 +0, 1 +0 +0 etc.). While IO inside the RAID is perfectly aligned to disk sectors (again due to the fact operations are finished multiples of 512-bytes), outside of the RAID you want to align IO to a stripe element as you may otherwise end up reading or writing to more disks than you would like to. Recent systems like CentOS6  (not verified) and Ubuntu 10.04 (verified) would by default align to 1MB if storage does not support IO alignment hints which is good enough for most cases. Checking by use command fdisk -l it would be start at 2048th sector (previous block 0 – 2047th sector)

e.g., for the disk here so the RAID has 4 bearing disks (spindles) and if there’s a misalignment on the file system, you can see how RAID ends up doing 5 IO operations (or worth case to 6 IO operations) see through this link for Aligning IO on a hard disk RAID

Because HDD and SSD are mapped to OS via filesystem and divided into partition and mount over directory to usage. So we must design layout of partition carefully by following section:

Put sequentially written files on HDD

  • ibdata, ib_logfile, binary log files
  • HDD is fast enough for sequential writes
  • Write performance deterioration can be mitigated
  • Life expectancy of SSD will be longer

Put randomly accessed files on SSD

  • *ibd files, index files(MYI), data files(MYD)
  • SSD is 10x -100x faster for random reads/writes than HDD

Archive less active tables/records to HDD

  • SSD is still much expensive than HDD
  • Partitioning/Sharding data

Tuned Kernel Parameters

N/W card 1 Gbit Paramaters:

Edit the configuration for the network card:

Restart the network:


If you still need to keep iptables running you can simply adjust the following settings:
Check the current connections limit (only works if iptables is running):


Files Limit:


MySQL file location

  • Random I/O oriented:
    • Data Files (*.ibd)
      • Sequential reads if doing full table scan
    • Undo Log, Insert Buffer (ibdata)
      • UNDO tablespace (small in most cases, except for running long-running batch)
      • On-disk insert buffer space (small in most cases, except that InnoDB can not catch up with updating indexes)
  • Sequential Write oriented:
    • Doublewrite Buffer (ibdata0) — Write volume is equal to *ibd files. Huge
    • Binary log (mysql-bin.XXXXXX)
    • Redo log (ib_logfile)
    • Backup files

Useful commands and tools


  • Showing detailed I/O statistics per device
  • Very important tool because in most cases RDBMS becomes I/O bound
  • iostat -x
  • Check r/s, w/s, svctm, %util — IOPS is much more important than transfer size
  • Always %util = (r/s + w/s) * svctm (hard coded in the iostat source file) — Do not trust %util too much
  • Check svctm rather than %util — If your storage can handle 1000 IOPS, svctm should be less than 1.00 (ms) so you can send alerts if svctm is higher than 1.00 for a couple of minutes


  • Per CPU core statistics
  • You can also check network bottlenecks (%irq, %soft) from mpstat
  • If you check only vmstat/top/iostat/sar you will not notice single threaded bottleneck


  • yum install dstat
  • Similar UI as vmstat, but it also shows network statistics
  • Disk and Net total is incorrect, if you use RAID/bonding (double counted) — Filter by disk/interface name that you want to trace
  • “mtstat” additionally supports mysql status outputs —


  • You can easily identify which functions consume CPU resources
  • Supporting both user space and system space profiling
  • Mainly used by database-internal developers
  • How to use
    1. opcontrol –start (–no-vmlinux)
    2. benchmarking
    3. opcontrol –dump
    4. opcontrol –shutdown
    5. opreport -l /usr/local/bin/mysqld

Checking stalls

  • Database servers sometimes suffer from performance stalls
    • Not responding anything for 1-2 seconds
  • In many cases it’s database internal issue
    • Holding global mutex and doing expensive i/o
    • Buffer pool, redo log preflush, rollback segments (InnoDB mutex)
    • pthread_create()/clone() (LOCK_thread_count mutex)
    • Opening/closing tables (LOCK_open mutex)
  • Collecting statistics
    • Per-minute statistics (i.e. vmstat 60) is not helpful. Collect per-second statistics


  • Debugging tool for identifying why MySQL stalls
  • functionality to take thread stack dumps from a running process
  • Useful to identify where and why mysqld hangs up, slows down, etc — But you have to read MySQL source code
  • Debugging symbol is required on the target program

pmp (Poor Man’s Profiler)

  • A simple shell script (wrapper script) to summarize thread call stacks — Using gdb thread dump functionality
  • Useful for identifying why MySQL stalls


  • Dumping core file
  • Dumping core file takes time, and blocks all threads during dump
    • Might take 1 hour when VIRT is 50GB+
    • Not useful on running systems, but can be used on a non- serving slave
  • If you have a program with DWARF symbols, target (running) process does not need symbols
    • You can run stripped mysqld on production, as long as you have debug-built mysqld

Capturing MySQL packets

  • Capturing MySQL packets is useful to trace slow queries / transactions / connection requests.
    • Modifying applications MySQL settings is not needed Starting from 5.1, you can dynamically enable/disable slow/general query logs
  • libpcap
    • library to capture network packets
    • Most of network capturing tools including tcpdump rely on libpcap
    • Packet loss often happens
  • tcpdump + mk-query-digest — It can trace how long each query takes, and it can summarize and sort by total execution time

  • MySlowTranCapture
    • A tool to print “transactions” that take more than N milliseconds
    • Held locks are not released until COMMIT/ROLLBACK is executed


MySQL Tuning Guides on CentOS