故障分析 | 是谁偷走了我的 IO
作者:孙桥
爱可生DBA团队成员,主要负责MySQL和DMP平台日常的维持工作,对数据库自动化运维存有浓厚兴趣。
本文来源:原创投稿
*爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。
前言
对于刚接触 MySQL 数据库不久的小伙伴来说,或多或少会遇到一些棘手的小问题;比如在数据库磁盘无损坏的情况下,我们经常会遇到数据库服务器的磁盘IO压力很大,但由于服务器上只有 MySQL 程序,我们第一想法肯定是某个连接用户在连到数据库后,背着我们在瞎搞着什么;然而直接通过 show processlist 命令,看到的结果不一定准确。下面模拟一个简单场景,让我们确切地定位问题所在。
场景模拟,制造 IO 压力
## 生成一张大表,用于测试[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 prepare
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Creating table 'sbtest1'...
Inserting 15000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...## 对生成的表进行写操作
[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 run
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 1
Report intermediate results every 2 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 2s ] thds: 1 tps: 24.47 qps: 148.33 (r/w/o: 0.00/98.89/49.44) lat (ms,95%): 86.00 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 53.02 qps: 318.62 (r/w/o: 0.00/212.58/106.04) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 1 tps: 60.50 qps: 363.50 (r/w/o: 0.00/242.50/121.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 1 tps: 65.50 qps: 391.02 (r/w/o: 0.00/260.01/131.01) lat (ms,95%): 39.65 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 1 tps: 55.50 qps: 334.51 (r/w/o: 0.00/223.51/111.00) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 1 tps: 57.00 qps: 340.99 (r/w/o: 0.00/227.00/114.00) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00问题分析
1、通过top命令发现io wait比较高,咦,What is going on?(假装不知道我在压测(●'◡'●))
top - 17:05:59 up 179 days, 7:05, 2 users, load average: 0.84, 0.70, 0.93
Tasks: 181 total, 1 running, 180 sleeping, 0 stopped, 0 zombie
%Cpu0 : 1.3 us, 0.7 sy, 0.0 ni, 95.3 id, 2.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 6.9 us, 3.8 sy, 0.0 ni, 60.3 id, 29.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.4 us, 2.0 sy, 0.0 ni, 90.8 id, 4.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 1.7 us, 1.3 sy, 0.0 ni, 95.3 id, 1.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 1.4 us, 1.0 sy, 0.0 ni, 95.6 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 3.1 us, 1.4 sy, 0.0 ni, 81.4 id, 14.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.7 us, 1.0 sy, 0.0 ni, 96.6 id, 1.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 1.0 us, 0.3 sy, 0.0 ni, 96.6 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.3 us, 1.0 sy, 0.0 ni, 94.6 id, 4.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 2.4 us, 1.0 sy, 0.0 ni, 91.2 id, 5.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 14.7 us, 2.7 sy, 0.0 ni, 82.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 2.0 us, 1.0 sy, 0.0 ni, 96.0 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12137024 total, 185984 free, 3987936 used, 7963104 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 7150088 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8858 actiont+ 20 0 7249940 1.1g 11892 S 30.0 9.3 210:38.75 mysqld
26447 tidb 20 0 4000364 1.0g 4740 S 5.3 9.0 10018:23 TiFlashMain
22761 actiont+ 20 0 2526348 36188 9756 S 4.0 0.3 1197:43 ustats
29004 root 20 0 3784084 103676 2012 S 2.0 0.9 1802:29 beam.s2、通过iostat命令进一步分析,我们看到磁盘在频繁的写操作,维持在26M/s左右(%uitl值比较大,一般来说,值大于75%是表明磁盘压力比较大了)
[root@tidb-tikv01 ~]# iostat -xm 1
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01) 04/22/2022 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.21 0.00 0.44 0.18 0.10 98.07
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.01 0.18 3.53 22.99 0.08 0.25 25.84 0.19 7.16 5.64 7.40 0.71 1.88
scd0 0.00 0.00 0.00 0.00 0.00 0.00 6.00 0.00 0.50 0.50 0.00 0.50 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
2.22 0.00 1.19 8.28 0.09 88.23
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 328.00 1800.00 5.12 26.24 30.17 3.18 1.49 4.68 0.91 0.41 88.30
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.63 0.00 1.03 6.68 0.09 90.57
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 352.00 1918.00 5.50 27.82 30.06 2.96 1.31 3.21 0.96 0.37 83.60
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
5.54 0.00 1.96 8.77 0.09 83.65
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 300.00 1682.00 4.69 25.68 31.38 3.09 1.55 3.68 1.17 0.45 88.60
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.003、通过iotop命令可以看到与MySQL相关线程ID 22233的IO消耗为52.59%,读为2.56M/s,同时线程ID 8879的写为27.31M/s(一般来说,IO列值越大,说明IO压力最有可能与其相关)
Total DISK READ : 5.29 M/s | Total DISK WRITE : 29.42 M/s
Actual DISK READ: 5.29 M/s | Actual DISK WRITE: 26.83 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
22233 be/4 actionte 2.56 M/s 473.07 K/s 0.00 % 52.59 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8879 be/4 actionte 0.00 B/s 27.31 M/s 0.00 % 17.76 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8887 be/4 actionte 0.00 B/s 0.00 B/s 0.00 % 10.10 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8907 be/4 actionte 116.33 K/s 0.00 B/s 0.00 % 6.27 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8930 be/4 actionte 7.76 K/s 0.00 B/s 0.00 % 3.45 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8910 be/4 actionte 1217.57 K/s 0.00 B/s 0.00 % 3.33 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
8911 be/4 actionte 372.25 K/s 0.00 B/s 0.00 % 1.53 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
2946 be/4 actionte 0.00 B/s 1985.34 B/s 0.00 % 1.52 % ustats
8909 be/4 actionte 418.78 K/s 0.00 B/s 0.00 % 1.08 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=33064、将上面获取的系统线程ID带入到MySQL中,我们便可以查到连接用户sun一直在频繁写操作,现在我很怀疑你,但我先保持沉默
[root@tidb-tikv01 ~]# /opt/mysql/base/8.0.21/bin/mysql -S /opt/mysql/data/3306/mysqld.sock -pxx
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 15041
Server version: 8.0.21 MySQL Community Server - GPL
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> select * from performance_schema.threads where THREAD_OS_ID=22233\G
*************************** 1. row ***************************
THREAD_ID: 14880
NAME: thread/sql/one_connection
TYPE: FOREGROUND
PROCESSLIST_ID: 14830 ### ID为148830连接在疯狂的进行写操作,show processlist可以看到正在进行的连接
PROCESSLIST_USER: sun
PROCESSLIST_HOST: 10.186.61.16
PROCESSLIST_DB: testdb
PROCESSLIST_COMMAND: Query
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: updating
PROCESSLIST_INFO: UPDATE sbtest1 SET k=k+1 WHERE id=7899611
PARENT_THREAD_ID: NULL
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: TCP/IP
THREAD_OS_ID: 22233 ### THREAD_OS_ID为linux系统中的线程ID
RESOURCE_GROUP: USR_default
1 row in set (0.00 sec)
mysql> select * from performance_schema.threads where THREAD_OS_ID=8879\G
*************************** 1. row ***************************
THREAD_ID: 21
NAME: thread/innodb/page_flush_coordinator_thread ### MySQL自身刷脏数据线程,因为用户的频繁写操作,导致该线程一直进行落盘操作
TYPE: BACKGROUND
PROCESSLIST_ID: NULL
PROCESSLIST_USER: NULL
PROCESSLIST_HOST: NULL
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: NULL
PROCESSLIST_TIME: 264523
PROCESSLIST_STATE: NULL
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: NULL
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 8879
RESOURCE_GROUP: SYS_default
1 row in set (0.00 sec)
mysql> show processlist;
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 264544 | Waiting on empty queue | NULL |
| 8 | universe_op | 127.0.0.1:38362 | NULL | Sleep | 0 | | NULL |
| 9 | universe_op | 127.0.0.1:38374 | NULL | Sleep | 0 | | NULL |
| 10 | universe_op | 127.0.0.1:38376 | NULL | Sleep | 0 | | NULL |
| 11 | universe_op | 127.0.0.1:38380 | NULL | Sleep | 0 | | NULL |
| 12 | universe_op | 127.0.0.1:38378 | NULL | Sleep | 0 | | NULL |
| 13 | universe_op | 127.0.0.1:38384 | NULL | Sleep | 0 | | NULL |
| 14 | universe_op | 127.0.0.1:38382 | NULL | Sleep | 0 | | NULL |
| 15 | universe_op | 10.186.61.45:50322 | NULL | Binlog Dump GTID | 264542 | Master has sent all binlog to slave; waiting for more updates | NULL |
| 17 | universe_op | 127.0.0.1:38454 | NULL | Sleep | 14 | | NULL |
| 18 | universe_op | 127.0.0.1:38456 | NULL | Sleep | 14 | | NULL |
| 19 | universe_op | 127.0.0.1:38458 | NULL | Sleep | 14 | | NULL |
| 20 | universe_op | 127.0.0.1:38460 | NULL | Sleep | 14 | | NULL |
| 21 | universe_op | 127.0.0.1:38462 | NULL | Sleep | 14 | | NULL |
| 22 | universe_op | 127.0.0.1:38464 | NULL | Sleep | 14 | | NULL |
| 23 | universe_op | 127.0.0.1:38452 | NULL | Sleep | 14 | | NULL |
| 24 | universe_op | 127.0.0.1:38466 | NULL | Sleep | 14 | | NULL |
| 142 | universe_op | 127.0.0.1:33390 | NULL | Sleep | 0 | | NULL |
| 14830 | sun | 10.186.61.16:60730 | testdb | Query | 0 | update | INSERT INTO sbtest1 (id, k, c, pad) VALUES (7493994, 6518751, '15795471662-35813531247-93886706048-2 |
| 15041 | root | localhost | NULL | Query | 0 | starting | show processlist |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
20 rows in set (0.00 sec)5、通过kill掉14830后,发现磁盘压力果然over了,你这“小阳人”( ̄ε(# ̄)☆╰╮o( ̄皿 ̄///)
mysql> kill 14830;
Query OK, 0 rows affected (0.00 sec)
mysql> exit
Bye
Segmentation fault
[root@tidb-tikv01 ~]# iostat -xm 1
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01) 04/22/2022 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.21 0.00 0.44 0.18 0.10 98.06
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.01 0.18 3.83 24.55 0.09 0.27 26.12 0.19 6.78 5.46 6.99 0.69 1.95
scd0 0.00 0.00 0.00 0.00 0.00 0.00 6.00 0.00 0.50 0.50 0.00 0.50 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.85 0.00 0.59 0.00 0.00 97.56
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 0.00 61.00 0.00 0.46 15.52 0.01 0.20 0.00 0.20 0.18 1.10
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
3.36 0.00 0.84 0.08 0.00 95.71
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 10.00 1.00 53.00 0.02 0.48 18.81 2.87 0.22 1.00 0.21 1.31 7.10
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00备注说明:
生产环境可不要随随便便kill哦,出问题是要被鞭尸的呦~
总结
对于一些新入门的小可爱来说,以上方法可以循序渐进的帮助你定位到是哪些连接线程把你的服务器IO折腾高的,可以让你更加了解Linux系统与MySQL之间的联动关系。老鸟嘛,emmm...
转载自:https://segmentfault.com/a/1190000041768191