These are my working notes from day-to-day CPU performance optimization. The structure is still somewhat messy and will be reorganized in the future.
In my experience, the two most effective ways to learn performance optimization are reading Brendan Gregg's articles and practicing on real systems.
/proc/[pid]/stack shows the kernel-mode stack of a process.
/proc/[PID]/task/ lists all threads of the process, and /proc/[PID]/task/[TID]/stack shows the kernel-mode stack of an individual thread.
# -d highlights the regions that changed between refreshes
$ watch -d cat /proc/interrupts
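As a quick sketch (the PID value is hypothetical, and reading these files generally requires root), you can dump the kernel-mode stack of every thread in one go:
# walk every thread of process $PID and print its kernel stack
PID=1234
for t in /proc/$PID/task/*; do
    echo "== TID ${t##*/}"
    cat "$t/stack"
done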
When facing a misbehaving program, a top-down strategy works best: first get an overall picture of the program's runtime statistics, then drill into specific areas. Diving straight into minutiae tends to make you miss the forest for the trees.
Some programs are slow because they do a lot of computation and spend most of their time on the CPU; these are CPU bound. Others are slow because they do too much I/O, in which case CPU utilization should be low; these are I/O bound. Tuning a CPU-bound program is quite different from tuning an I/O-bound one.
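A quick way to tell the two apart (a sketch assuming the sysstat package is installed; PID 1234 is hypothetical) is to watch per-process CPU and disk activity side by side:
# %CPU high with near-zero disk columns -> likely CPU bound
# %CPU low with high kB_rd/s or kB_wr/s -> likely I/O bound
pidstat -u -d -p 1234 1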
perf list
The perf list command lists every event that perf can use as a sampling point; perf list 'sched:*' lists only the scheduler-related tracepoints.
Different systems produce different lists. Even on a 2.6.35 kernel the list is already quite long, but however many entries there are, they fall into three categories: hardware events (PMU counters), software events, and tracepoints.
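For orientation, the output looks like this (abridged; exact entries vary by kernel and hardware):
$ perf list
  cpu-cycles OR cycles              [Hardware event]
  cache-misses                      [Hardware event]
  context-switches OR cs            [Software event]
  page-faults OR faults             [Software event]
  block:block_rq_issue              [Tracepoint event]
  sched:sched_switch                [Tracepoint event]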
man perf_event_open documents what each event means. Two that come up constantly:
cpu-clock: the usual default; a software timer event that samples on CPU time.
block:block_rq_issue: fires when the system issues block I/O.
perf stat
# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
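As a rough guide to reading perf stat output (illustrative numbers; the exact layout varies by version), insn per cycle (IPC) is often the first thing to check: low IPC points at stalls (memory, branch misses), high IPC at genuinely compute-bound code:
$ perf stat -a sleep 5

 Performance counter stats for 'system wide':

      20,014.57 msec cpu-clock          #    4.000 CPUs utilized
         12,345      context-switches
  9,526,373,264      cycles             #    0.476 GHz
  4,320,658,103      instructions       #    0.45  insn per cycle

    5.003677248 seconds time elapsed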
perf record
perf record saves the sampled data to a file (perf.data by default), which you then parse and display with perf report.
-g enables call-graph sampling, so you can see call chains.
-p <pid> specifies the process to profile.
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# (-c 1 sets sample_period to 1, as shown by -vv, i.e. capture every event)
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# (-vv prints verbose details)
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
perf report
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
Mixed
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
Special
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
Other usage
perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
perf kmem: kernel memory allocation analysis.
perf kvm: KVM virtual guest analysis.
perf lock: lock analysis.
perf mem: memory access analysis.
perf sched: kernel scheduler statistics.
For a containerized process, run perf record outside the container (on the host), then copy the data file into the container, where the symbols live, to analyze it:
perf record -g -p <pid>   # let it run for a while (say 15 seconds), then stop with Ctrl+C
docker cp perf.data phpfpm:/tmp
docker exec -i -t phpfpm bash
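Alternatively (a hedged sketch: --symfs support and behavior vary across perf versions), you can stay on the host and resolve symbols against the container's root filesystem through the /proc magic symlink:
# <pid> is the containerized process as seen from the host
perf report --symfs /proc/<pid>/root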
perf top
-g enables call-graph sampling, so you can see call chains.
-p <pid> specifies the process.
In the output, [.] marks a user-space executable or shared library, while [k] marks kernel space.
# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49
# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso
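For reference, the output looks roughly like this (illustrative only; symbols and percentages are invented):
Samples: 2K of event 'cpu-clock', Event count (approx.): 48734694
Overhead  Shared Object        Symbol
  21.37%  [kernel]             [k] do_syscall_64
  11.02%  libc-2.28.so         [.] malloc
   7.45%  mysqld               [.] row_search_mvcc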
# 1. Sample.
# -g tells perf record to also capture function call relationships (call chains).
# -e cpu-clock selects cpu-clock as the sampled metric.
# -p specifies the PID to record; -a means across all CPUs.
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script.
perf script -i perf.data &> perf.unfold
# The steps below use the flame graph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Collapse the stacks in perf.unfold.
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG.
xxx/flamegraph.pl perf.folded > perf.svg
Once stackcollapse-perf.pl has produced the folded file, you can use grep to carve out exactly the flame graph you want:
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads to focus on the code that actually consumes CPU
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz.
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
-D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
-ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
probe end { foreach (i in s+) { print_stack(i);
printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
> out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
perf script converts the file written by perf record into readable, record-by-record output.
Omitting frame pointers
Omitting frame pointers leaves perf with incomplete stacks. There are a few ways around this (a compile-and-record sketch follows the list):
1. Compile with -fno-omit-frame-pointer. Heavier optimization enables -fomit-frame-pointer, which breaks frame-pointer stack walking; GCC turns it on at -O1 and above on most targets, so add -fno-omit-frame-pointer explicitly rather than falling back to -O0.
2. Run perf record with --call-graph lbr; this requires CPU support for the last branch record (LBR) feature.
3. Run perf record with --call-graph dwarf (or -g dwarf), which unwinds stacks using DWARF debug info.
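A minimal sketch of option 1 (the file and program names are hypothetical):
# keep frame pointers even under -O2 so perf -g can walk the stack
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c
perf record -F 99 -g -p $(pgrep -nx myapp) -- sleep 10
perf report --stdio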
Being off-CPU means the process has entered the kernel through a system call and is blocked there. Off-CPU analysis measures which system calls a process blocks in and how much time those calls take.
As the diagram below shows, off-CPU analysis targets blocking system calls (sys CPU need not be high), where the process is waiting on some resource (such as I/O).
Off-CPU Tracing -------------------------------------------->
                     |                   |
                     B                   B
                     A                   A
        A(---------. .----------)
                    | |
            B(--------. .--)
                       | |
                       | |                    user-land
 - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                       | |                    kernel
                       X Off-CPU
                     block . . . . . interrupt
The simplest approach is the time command, which shows how much time went to sys. Below, the run took 50.8 seconds of wall time, yet user plus sys add up to only about 12.7 seconds; roughly 38.1 seconds have vanished. That is time spent off-CPU (here, file read/write I/O):
$ time tar cf archive.tar linux-4.15-rc2
real 0m50.798s
user 0m1.048s
sys 0m11.627s
bcc commands
# Summarize off-CPU time as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>
# Record off-CPU time, with user-mode and kernel-mode stack frames shown separately
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
# Count malloc() calls for a PID. Because BPF aggregates in kernel context, stackcount only prints its summary on Ctrl-C, avoiding a kernel/user mode switch on every malloc() call.
# Function names accept the * wildcard; e.g. c:*map matches 18 functions
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output above into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
--title="malloc() Flame Graph" --countname="calls" > out.svg
Most threads in a deployed service spend their time waiting (network threads for packets, worker threads for tasks). That noise gets in the way of direct off-CPU analysis. What we actually want is the off-CPU time a thread incurs while servicing, say, a single request, so from the offcputime output above we select only the useful stacks for analysis, as in the sketch below.
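A hedged sketch (which function names to drop is service-specific; these are examples):
# drop idle waits (event loops, thread-pool parking), keep request-handling stacks
grep -v -e epoll_wait -e pthread_cond_wait out.stacks \
  | ./flamegraph.pl --color=io --title="Off-CPU (request paths)" --countname=us > req.svg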
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words: when the CPUs are saturated, part of the measured off-CPU time is really time spent waiting in the run queue (scheduler latency).
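If you want to see run-queue latency by itself, bcc ships a tool for exactly that; a minimal sketch:
# histogram of run-queue (scheduler) latency for a PID, 1-second intervals, 10 times
/usr/share/bcc/tools/runqlat -p <pid> 1 10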
Involuntary context switches also take a thread off-CPU, but they are usually not the problem we are chasing. The option below restricts offcputime to D-state, i.e. TASK_UNINTERRUPTIBLE, blocking:
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
Higher synchronous I/O latency translates directly into higher application latency. We need a way to record the duration of each synchronous I/O system call and render it as a flame graph. Brendan wrote a tool for this, fileiostacks.py; once we have established that sync I/O is behind a performance problem, we can analyze it this way.
The commands below produce a flame graph of only those stacks that performed synchronous file-system I/O:
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
File-system I/O is not the only kind, though: there are also block devices and the network. Here is a block-device example.
Using perf:
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
--title="Block I/O Flame Graph" --countname="I/O" > out.svg
Using bcc:
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
Unlike the above, here we want to analyze all off-CPU events.
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
This flame graph focuses on the waker threads, so you can see exactly which thread woke a given process. That is genuinely useful: when lock overhead is high, for instance, it shows which process keeps waking the lock waiters. A kernel thundering-herd problem like the earlier wmb issue should be diagnosable this way (see also the offwaketime sketch after the commands below).
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
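bcc's offwaketime goes one step further and stitches the blocked thread's stack together with its waker's stack into a single graph; a hedged sketch (options mirror the tools above):
# folded output: waker stack merged with the woken (blocked) thread's stack, for 30 seconds
/usr/share/bcc/tools/offwaketime -f 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake Time Flame Graph" --countname=us < out.stacks > out.svg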
May those who read this later find something of value in it, too.
cd /usr/share/bcc/tools/
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
# Count "tcp_send*" kernel function, print output every second: funccount -i 1 'tcp_send*' # Count "vfs_*" calls for PID 185: funccount -p 185 'vfs_*' # Trace file names opened, using dynamic tracing of the kernel do_sys_open() function: trace 'p::do_sys_open "%s", arg2' # Same as before ("p:: is assumed if not specified): trace 'do_sys_open "%s", arg2' # Trace the return of the kernel do_sys_open() funciton, and print the retval: trace 'r::do_sys_open "ret: %d", retval' # Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces: trace -K 'do_nanosleep "mode: %d", arg2' # Trace do_nanosleep() mode by providing the prototype (no debuginfo required): trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode' # Trace do_nanosleep() with the task address (may be NULL), noting the dereference: trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task' # Frequency count tcp_sendmsg() size: argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size' # Summarize tcp_sendmsg() size as a power-of-2 histogram: argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size' # Frequency count stack traces that lead to the submit_bio() function (disk I/O issue): stackcount submit_bio # Summarize the latency (time taken) by the vfs_read() function for PID 181: # 统计函数延时,我们也可以用来做自己进程的性能优化 funclatency -p 181 -u vfs_read
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
Some of these tools have since been absorbed into bcc, but a few, such as iosnoop, are still only available in perf-tools.
tcpretrans captures only retransmitted TCP packets.