Linux Performance Optimization: High Disk I/O Latency
Environment setup
Install bcc and docker.
Start docker:
service docker start
The runtime environment is as follows.
The case consists of three files:
io_app.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os
import uuid
import random
import shutil
from concurrent.futures import ThreadPoolExecutor
from flask import Flask, jsonify

app = Flask(__name__)

def validate(word, sentence):
    return word in sentence

def generate_article():
    s_nouns = [
        "A dude", "My mom", "The king", "Some guy", "A cat with rabies",
        "A sloth", "Your homie", "This cool guy my gardener met yesterday",
        "Superman"
    ]
    p_nouns = [
        "These dudes", "Both of my moms", "All the kings of the world",
        "Some guys", "All of a cattery's cats",
        "The multitude of sloths living under your bed", "Your homies",
        "Like, these, like, all these people", "Supermen"
    ]
    s_verbs = [
        "eats", "kicks", "gives", "treats", "meets with", "creates", "hacks",
        "configures", "spies on", "retards", "meows on", "flees from",
        "tries to automate", "explodes"
    ]
    infinitives = [
        "to make a pie.", "for no apparent reason.",
        "because the sky is green.", "for a disease.",
        "to be able to make toast explode.", "to know more about archeology."
    ]
    sentence = '{} {} {} {}'.format(
        random.choice(s_nouns), random.choice(s_verbs),
        random.choice(s_nouns).lower() or random.choice(p_nouns).lower(),
        random.choice(infinitives))
    return '\n'.join([sentence for i in range(50000)])

@app.route('/')
def hello_world():
    return 'hello world'

@app.route("/popularity/<word>")
def word_popularity(word):
    dir_path = '/tmp/{}'.format(uuid.uuid1())
    count = 0
    sample_size = 1000

    def save_to_file(file_name, content):
        with open(file_name, 'w') as f:
            f.write(content)

    try:
        # create the working directory first
        os.mkdir(dir_path)
        # save the articles to files
        for i in range(sample_size):
            file_name = '{}/{}.txt'.format(dir_path, i)
            article = generate_article()
            save_to_file(file_name, article)
        # count word popularity
        for root, dirs, files in os.walk(dir_path):
            for file_name in files:
                with open('{}/{}'.format(dir_path, file_name)) as f:
                    if validate(word, f.read()):
                        count += 1
    finally:
        # clean up the files
        shutil.rmtree(dir_path, ignore_errors=True)
    return jsonify({'popularity': count / sample_size * 100, 'word': word})

@app.route("/popular/<word>")
def word_popular(word):
    count = 0
    sample_size = 1000
    articles = []
    try:
        for i in range(sample_size):
            articles.append(generate_article())
        for article in articles:
            if validate(word, article):
                count += 1
    finally:
        pass
    return jsonify({'popularity': count / sample_size * 100, 'word': word})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=80)
The Dockerfile:
FROM python:alpine
LABEL maintainer="feiskyer@gmail.com"
RUN pip install flask
EXPOSE 80
ADD io_app.py /io_app.py
The Makefile:
.PHONY: run
run:
	docker run --name=io_app -p 10000:80 -itd feisky/word-pop
.PHONY: build
build:
	docker build -t feisky/word-pop -f Dockerfile .
.PHONY: push
push:
	docker push feisky/word-pop
.PHONY: clean
clean:
	docker rm -f io_app
Preparation before running the case
# build the docker image
make build
# run the case
make run
# check that the container is running
docker ps
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
88303172b050 feisky/word-pop "python /io_app.py" 1 hours ago Up 1 hours 0.0.0.0:10000->80/tcp io_app
Test that the service is up:
curl http://[IP]:10000/
hello world
Analyzing the problem
To keep a single run from finishing in an instant, put the request in a loop:
while true; do time curl http://[IP]:10000/popularity/word; sleep 1; done
# a single request returns:
{
"popularity": 0.0,
"word": "word"
}
Observing with top, iowait is quite high:
top
top - 17:53:37 up 12 days, 7:36, 6 users, load average: 0.65, 0.16, 0.05
Tasks: 90 total, 2 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.1 us, 21.1 sy, 0.0 ni, 0.0 id, 58.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1008936 total, 73040 free, 131788 used, 804108 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 718136 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27110 root 20 0 103304 21636 2600 S 38.2 2.1 1:29.64 python
34 root 20 0 0 0 0 S 2.0 0.0 0:01.79 kswapd0
10 root 20 0 0 0 0 R 0.3 0.0 0:51.73 rcu_sched
The iostat output shows that disk utilization has reached 100%, and write requests take about one second to complete (w_await is over 1,000 ms):
iostat -x -d 1
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 14.74 0.00 111.58 0.00 116517.89 2088.53 122.98 1064.30 0.00 1064.30 9.42 105.05
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 30.30 0.00 111.11 0.00 111385.86 2004.95 78.76 1078.08 0.00 1078.08 9.05 100.51
pidstat shows that it is the python process doing the writes:
pidstat -d 1
06:06:30 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
06:06:31 PM 0 349 0.00 149.49 0.00 jbd2/vda1-8
06:06:31 PM 0 27110 0.00 5886.87 0.00 python
06:06:31 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
06:06:32 PM 0 27110 0.00 152249.50 0.00 python
strace on the main process shows nothing but a pile of stat() calls:
strace -p 27110
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/serializer.py", {st_mode=S_IFREG|0644, st_size=8653, ...}) = 0
...
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/signer.py", {st_mode=S_IFREG|0644, st_size=6345, ...}) = 0
...
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/timed.py", {st_mode=S_IFREG|0644, st_size=5635, ...}) = 0
Run strace again, this time following the child threads as well (-ff) and narrowing the scope to file-related system calls.
With -e trace=open it becomes very obvious: the python process is constantly creating temporary files:
strace -p 27110 -ff -e trace=desc
[pid 27651] ioctl(6, TIOCGWINSZ, 0x7fd244c6eef0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 27651] lseek(6, 0, SEEK_CUR) = 0
[pid 27651] mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd243fd6000
[pid 27651] write(6, "A cat with rabies meets with thi"..., 4199999) = 4199999
[pid 27651] close(6) = 0
[pid 27651] mmap(NULL, 253952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd244af1000
[pid 27651] mmap(NULL, 3153920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd2440d6000
[pid 27651] open("/tmp/e6bc7e84-1d65-11e9-b8e3-0242ac120002/446.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27651] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
strace -p 27110 -ff -e trace=open
strace: Process 27110 attached with 3 threads
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/245.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/246.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/247.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/248.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/249.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/250.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/251.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
Observe with the tools from bcc:
filetop -C
TID COMM READS WRITES R_Kb W_Kb T FILE
2079 AliYunDun 4 0 31 0 R stat
27679 filetop 2 0 2 0 R loadavg
27681 python 0 1 0 3173 R 362.txt
27681 python 0 1 0 2978 R 359.txt
27681 python 0 1 0 2343 R 363.txt
27681 python 0 1 0 2929 R 361.txt
27681 python 0 1 0 2685 R 356.txt
27681 python 0 1 0 2734 R 355.txt
27681 python 0 1 0 2490 R 360.txt
27681 python 0 1 0 3759 R 358.txt
27681 python 0 1 0 3124 R 357.txt
18:44:53 loadavg: 0.45 0.25 0.21 6/156 27681
TID COMM READS WRITES R_Kb W_Kb T FILE
2079 AliYunDun 4 0 31 0 R stat
27679 filetop 2 0 2 0 R loadavg
27681 python 0 1 0 1757 R 402.txt
27681 python 0 1 0 3027 R 373.txt
27681 python 0 1 0 3076 R 404.txt
27681 python 0 1 0 2685 R 414.txt
27681 python 0 1 0 3955 R 392.txt
27681 python 0 1 0 2539 R 388.txt
27681 python 0 1 0 2490 R 403.txt
27681 python 0 1 0 3271 R 396.txt
27681 python 0 1 0 2539 R 397.txt
27681 python 0 1 0 2880 R 368.txt
27681 python 0 1 0 2587 R 367.txt
# find the process that thread 27681 belongs to
ps -efT | grep 27681
root 27110 27681 27090 35 18:44 pts/2 00:00:17 /usr/local/bin/python /io_app.py
root 27683 27683 27464 0 18:45 pts/5 00:00:00 grep --color=auto 27681
opensnoop
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/245.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/246.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/247.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/248.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/249.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/250.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/251.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/252.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/253.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/254.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/255.txt
27110 python 6 0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/256.txt
The source code shows that, for every request, this case application generates a batch of temporary files, reads them back into memory for processing, and finally deletes the whole directory.
This is a common technique for using disk space to process large amounts of data, but in this case the I/O requests are far too heavy, driving disk I/O utilization through the roof.
Solving this is really an algorithmic optimization problem. For example, when memory is plentiful, all of the data can be processed in memory, which avoids the disk I/O performance problem entirely.
Of course, this is only the first step, and the method is far from complete; further optimization is possible. In practice, though, this is how most of us work: apply the simplest fix first
to get the production issue resolved quickly, then keep thinking about better optimizations.
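The in-memory idea can be boiled down to a minimal sketch. Note that make_article below is a toy stand-in for the case's generate_article(), not code from the application:

```python
import random

def make_article():
    # Toy stand-in for generate_article(): every article mentions "pie".
    nouns = ["A dude", "My mom", "The king"]
    verbs = ["eats", "kicks", "meets with"]
    return '{} {} a pie.'.format(random.choice(nouns), random.choice(verbs))

def word_popularity_in_memory(word, sample_size):
    # Keep every generated article in a list instead of writing it to a
    # temp file, so the request path issues no disk I/O at all.
    articles = [make_article() for _ in range(sample_size)]
    count = sum(1 for article in articles if word in article)
    return count / sample_size * 100

print(word_popularity_in_memory('pie', 100))  # 100.0: every toy article contains "pie"
```

This is exactly the trade the /popular endpoint makes: it swaps 1,000 file writes and reads per request for a larger resident memory footprint.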
Switching to the in-memory version
time curl http://[IP]:10000/popular/word
curl: (52) Empty reply from server
real 0m29.176s
user 0m0.002s
sys 0m0.035s
This was a case of a word-popularity service with slow responses.
First, top and iostat were used to analyze the system's CPU and disk usage. That uncovered the disk I/O bottleneck and showed that the case application was causing it.
Next, strace was used to observe the process's system calls, but this time we were out of luck: no write system calls showed up at all.
Rerunning it as strace -ff -e trace=open made the cause of the problem obvious.
The filetop and opensnoop tools from the dynamic-tracing toolkit bcc confirmed that the root cause was heavy reading and writing of temporary files.
Once the problem is found, the optimization is relatively simple. When memory is sufficient, the easiest approach is to keep all the data in the much faster memory, which removes the disk I/O bottleneck. Going a step further, you can use a trie or other algorithms and data structures to make the word processing itself more efficient.
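As a sketch of that further step, here is a minimal trie (prefix tree) for word lookup. This illustrates the data structure mentioned above and is not code from the case application:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Walk down the tree one character at a time, creating nodes as needed.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # Lookup cost is O(len(word)), independent of how many words are stored.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for w in 'a cat with rabies eats pie'.split():
    trie.insert(w)
print(trie.contains('cat'))  # True
print(trie.contains('dog'))  # False
```

Instead of scanning every article with a substring search per request, the words of each article could be inserted into such a structure once and queried many times.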