Real time does not necessarily mean fast. It means guaranteeing the time by which work completes: critical operations must finish within a guaranteed deadline.
Real-time requirements are commonly divided into hard real-time and soft real-time.
A real-time operating system (RTOS) schedules tasks in order, manages system resources, and provides a consistent foundation for developing applications.
When a task becomes runnable, an RTOS executes it immediately (within a short, bounded time) rather than after an arbitrary delay. This property guarantees that every task runs on time.
Every RTOS contains a real-time task scheduler. What most distinguishes it from the schedulers of other operating systems is that it allocates CPU time strictly by priority, and round-robin time slicing is optional rather than mandatory for a real-time scheduler.
The RTOS concept addresses at least two problems. First, task switching on early CPUs was expensive, and a real-time scheduler avoids wasting CPU time on frequent switches. Second, some applications demand that important tasks always be executed first.
Well-known examples from abroad include VxWorks, uC/OS, QNX, and WinCE. Domestic Chinese systems include RT-Thread, Delta OS (co-developed by the embedded real-time lab of the University of Electronic Science and Technology of China and CoreTek), Hopen OS from Kaisi (the "Nüwa" project), CASSPDA from the Beijing Software Engineering Center of the Chinese Academy of Sciences, and HBOS developed at Zhejiang University.
Because Linux was designed from the start as a GPOS (general-purpose operating system), it focuses on minimizing average response time and maximizing throughput across the whole feature set. The scheduler tries to share the available resources fairly among all processes that need the CPU, so stock Linux does not provide hard real-time guarantees.
Linux provides the POSIX scheduling policies:

SCHED_FIFO — first-in, first-out real-time scheduling.

SCHED_RR — round-robin real-time scheduling.

SCHED_OTHER [default] — the normal time-sharing policy. Although it lets processes use the CPU and other resources fairly, it cannot guarantee that time-critical or high-priority processes run before low-priority ones, which severely limits the system's real-time behavior.
Setting a real-time process's policy to SCHED_FIFO or SCHED_RR seems to give Linux priority-driven real-time scheduling. The problem is that while Linux supports preemptive scheduling in user space, the kernel historically did not fully support it: a task running in kernel mode (including system calls and interrupt handling) could not be preempted by a higher-priority task, which leads to priority inversion.
A Linux process runs in one of two modes: user mode or kernel mode.
While a process runs in user mode, a higher-priority process can preempt it; while it runs in kernel mode, even a higher-priority process cannot. When a process enters kernel mode through a system call, a real-time task must wait for that call to return before it can obtain system resources — directly at odds with the real-time requirement that the highest-priority task runs first.
The 2.6 kernel changed this: from Linux 2.6 onward the kernel is preemptive, meaning a process may be preempted whether it is in kernel mode or user mode.
Kernels from 2.6 onward offer three preemption models to choose from:
PREEMPT_NONE — no forced preemption. Average latency is low overall, but occasional long latencies occur. Best suited to workloads whose primary design goal is overall throughput.

PREEMPT_VOLUNTARY — the first stage of latency reduction. Extra explicit preemption points are placed at key locations in kernel code to lower latency, at some cost in overall throughput.

PREEMPT / PREEMPT_DESKTOP — the kernel becomes preemptible everywhere except critical sections. Suitable for applications that need soft real-time performance, such as audio and multimedia, again at some cost in throughput.

Separately, Linux disables interrupts while handling an interrupt, which lets the handler finish faster and more safely. But during that window the system cannot respond even to an interrupt on behalf of a higher-priority real-time process until the current handler completes. This inflates interrupt latency and scheduling latency and degrades Linux's real-time behavior.
The clock system is a vital component of a computer, the heartbeat of the whole operating system. The smallest time interval the system can provide is called the clock granularity, and process response latency is proportional to it: the coarser the granularity, the longer the latency. Smaller is not always better, though; on the same hardware, a finer granularity increases system overhead and lowers overall throughput.
In the Linux 2.6 kernel the timer interrupt frequency ranges from 50 to 1200 Hz, with a period no smaller than 0.8 ms — clearly inadequate for applications needing response precision of tens of microseconds. Embedded Linux systems usually set the tick to 100 Hz or 250 Hz to preserve overall throughput.
In addition, the system clock drives software timers; as software timers multiply, timer conflicts arise and system load grows.
Linux uses virtual memory, so a process can run in an address space far larger than physical memory. In a time-sharing system this mechanism works very well, but for a real-time system it is intolerable: frequent page-in and page-out can keep a process from finishing within its deadline. (It hurts determinism as well as performance.)
To address this, Linux provides memory locking, which prevents a real-time task's pages from being swapped out during real-time processing.
When multiple tasks access a shared resource under mutual exclusion, the data must be protected from corruption, and the usual mechanism is a semaphore. In a priority-scheduled real-time system, however, semaphores invite priority inversion: a low-priority task holds a resource a high-priority task needs, so the high-priority task cannot run.
Although since version 2.6.12 the Linux kernel can deliver soft real-time latencies under 10 ms on fast x86 processors, achieving predictable, repeatable microsecond-level latency — making Linux truly suitable for embedded real-time environments — requires modifying the system while preserving its functionality.
Two broad, reasonable approaches exist:
Under the GPL, modify the kernel source directly to turn Linux into a fully preemptible real-time system. The core changes are local and do not fundamentally alter the kernel; some can even be delivered through Linux's module loading — the module is loaded when the system must handle real-time tasks and dynamically unloaded when it does not.
Mainline kernels released on kernel.org do not yet support hard real time. To enable it, the source must be patched; the patches are published at www.kernel.org/pub/linux/kernel/projects/rt/
The patch adds a fourth preemption model, PREEMPT_RT (real-time preemption). It brings several important features into the kernel: spinlocks are replaced with preemptible mutexes, and involuntary preemption is enabled everywhere in the kernel except regions protected by preempt_disable(). This model markedly reduces jitter (variation in latency) and gives latency-critical real-time applications predictably low latency.
The drawback of this approach: it is very hard to guarantee, one hundred percent and under all circumstances, that GPOS code never obstructs the RTOS's real-time behavior. In other words, with a modified Linux kernel it is difficult to ensure that the execution of real-time processes is never disturbed by the unpredictable activities of non-real-time processes.
The essence of the dual-kernel approach is to run the standard Linux kernel as an ordinary process on top of another kernel. The key modification is an interrupt-control emulation layer inserted between Linux and the interrupt controller, which becomes part of the real-time kernel. The emulation keeps a flag recording whether Linux has interrupts enabled or disabled; interrupts are genuinely disabled only in the critical code that modifies core data structures, so interrupt response stays very fast. The advantages are that hard real time is achievable and that a new scheduling policy is easy to implement.
For convenience, the real-time kernel is usually supplied as a set of dynamically loadable modules, though it can also be compiled directly in the Linux source tree like any other subsystem. Common dual-kernel real-time patches are RTLinux/GPL, RTAI, and Xenomai; RTLinux/GPL only allows real-time applications in the form of kernel modules, whereas RTAI and Xenomai support running real-time programs in MMU-protected user space.
CPU preemption comes in two flavors: user preemption and kernel preemption. Linux 2.6 added the CONFIG_PREEMPT option; with it enabled, the kernel supports preemption of kernel code. In a preemptive kernel, even on the return from interrupt context to kernel-mode process context, scheduling can occur as long as the kernel code is not inside a critical section, so the highest-priority task gets to run. For example, a high-priority process can preempt a system call executing in kernel mode instead of waiting for the call to return to user space.
Have a look at the preempt-rt community wiki:
https://wiki.linuxfoundation.org/realtime/start
Before kernel preemption existed, once a process entered kernel mode no other process could preempt it — everyone had to wait for it to finish or leave kernel mode — which caused serious latency problems. Kernel-mode preemption has been supported since 2.6.
With the kernel's built-in trace facility you can follow the kernel function calls made by a specific process and measure each function's running time, which is quite helpful for optimizing system performance.
I wrote a short C program that prints a number every second:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void dy(int num) {
    printf("%d\n", num);
}

int main() {
    int i = 0;
    pid_t pid = getpid();
    printf("\n\nPid = %d\n\n", pid);
    while (i < 60) {
        i++;
        dy(i);
        sleep(1);
    }
    return 0;
}
Change into the tracing directory:
cd /sys/kernel/debug/tracing
Set the tracer type we want:
echo function > current_tracer
Run our C program:
./test
Output:
Pid = 12896
1
2
3
4
.... (rest omitted)
Set the process we want to trace:
echo 12896 > set_ftrace_pid   # 12896 is the Pid printed above
Remember to clear the trace buffer first:
echo > trace
Then turn on recording (check whether it was off beforehand):
echo 1 > tracing_on   # 0 = off, 1 = on
Then copy the log out for analysis:
cp trace /tmp/trace1.log
vim /tmp/trace1.log
There is far too much data, so look only at the first lines:
head -n40 /tmp/trace1.log
Result:
# tracer: function
#
# entries-in-buffer/entries-written: 2354/2354   #P:12
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          <idle>-0     [002] d...  9153.236787: switch_mm_irqs_off <-__schedule
          <idle>-0     [002] d...  9153.236789: load_new_mm_cr3 <-switch_mm_irqs_off
            test-12896 [002] d...  9153.236790: finish_task_switch <-__schedule
            test-12896 [002] ....  9153.236790: _cond_resched <-do_nanosleep
            test-12896 [002] ....  9153.236790: rcu_all_qs <-_cond_resched
            test-12896 [002] ....  9153.236790: hrtimer_try_to_cancel <-do_nanosleep
            test-12896 [002] ....  9153.236790: hrtimer_active <-hrtimer_try_to_cancel
            test-12896 [002] d...  9153.236790: fpregs_assert_state_consistent <-do_syscall_64
            test-12896 [002] d...  9153.236791: switch_fpu_return <-do_syscall_64
            test-12896 [002] d...  9153.236794: do_syscall_64 <-entry_SYSCALL_64_after_hwframe
            test-12896 [002] ....  9153.236794: __x64_sys_write <-do_syscall_64
            test-12896 [002] ....  9153.236794: ksys_write <-__x64_sys_write
            test-12896 [002] ....  9153.236795: __fdget_pos <-ksys_write
            test-12896 [002] ....  9153.236795: __fget_light <-__fdget_pos
            test-12896 [002] ....  9153.236795: vfs_write <-ksys_write
            test-12896 [002] ....  9153.236795: rw_verify_area <-vfs_write
            test-12896 [002] ....  9153.236795: security_file_permission <-rw_verify_area
            test-12896 [002] ....  9153.236796: extend_file_permission <-security_file_permission
            test-12896 [002] ....  9153.236796: __vfs_write <-vfs_write
            test-12896 [002] ....  9153.236796: tty_write <-__vfs_write
            test-12896 [002] ....  9153.236796: tty_paranoia_check <-tty_write
            test-12896 [002] ....  9153.236797: tty_ldisc_ref_wait <-tty_write
            test-12896 [002] ....  9153.236797: ldsem_down_read <-tty_ldisc_ref_wait
            test-12896 [002] ....  9153.236797: _cond_resched <-ldsem_down_read
            test-12896 [002] ....  9153.236797: rcu_all_qs <-_cond_resched
            test-12896 [002] ....  9153.236797: tty_write_lock <-tty_write
            test-12896 [002] ....  9153.236797: mutex_trylock <-tty_write_lock
            test-12896 [002] ....  9153.236798: __check_object_size <-tty_write
            test-12896 [002] ....  9153.236798: check_stack_object <-__check_object_size
Put simply, idle is a process whose pid is 0. Its predecessor is the first process the system creates, and it is the only process not produced by fork().
The idle task on the boot processor evolves from the original process (pid = 0). The idle tasks on secondary processors are forked from the init process, yet their pids are all 0.
The idle process has the lowest priority and does not take part in normal scheduling; it is dispatched only when the run queue is empty.
Idle loops waiting for need_resched to be set, using hlt by default to save power.
Boot flow: the system powers up into the BIOS self-test, loads the boot loader (LILO/GRUB) from the MBR, then loads and starts the Linux kernel, all the way until the designated shell is running and the user can operate Linux.
Roughly, the vmlinux entry point startup_32 (head.S) sets up the execution environment for the original pid-0 process, which then runs start_kernel() to complete kernel initialization: page tables, the interrupt vector table, system time, and so on.
It then creates the first process: kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);

This is the famous init process with pid 1. It continues the remaining initialization and then execve(/sbin/init), becoming the ancestor of every other process in the system. Leaving init aside for now, back to the pid-0 process: after creating init, it calls cpu_idle() and turns into the idle process.
On an SMP system, besides the idle process on the boot processor (the one that performs initialization) just described, there are idle processes on the secondary processors (those activated by the boot processor). How are they created? Look again at init: before it becomes /sbin/init it performs some initialization work, one piece of which is smp_prepare_cpus(), which initializes the SMP processors; while handling each secondary processor it calls
task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL, NULL, 0);
init_idle(task, cpu);
That is, a process is copied from init and initialized as an idle process (pid still 0). The secondary processor's idle process does some activation work and then executes cpu_idle().
In short: the original process (pid 0) creates init (pid 1) and then evolves into the idle process (pid 0); init creates one idle process (pid 0) for each secondary processor (run queue) and then evolves into /sbin/init.
The idle process has priority MAX_PRIO, the lowest. In earlier kernels idle took part in scheduling, so it was given the lowest priority and was dispatched only when no other process could run. In current kernels idle is not on the run queue at all; instead the run-queue structure contains an idle pointer to the idle process, which the scheduler brings in to run when it finds the run queue empty.
Simply put: in early versions, when nothing else was runnable, idle got scheduled; in later versions, when the queue has nothing to run, the queue's idle pointer is brought in to run.
idle is scheduled only when the system has no other ready process to execute.
Boot processor or secondary processor, both end up executing cpu_idle() (becoming the idle process), so let's look at what cpu_idle() does.
Since the idle process performs no meaningful task, two things usually matter: 1. power saving; 2. low exit latency. Its core code:
void cpu_idle(void)
{
    int cpu = smp_processor_id();

    current_thread_info()->status |= TS_POLLING;

    /* endless idle loop with no priority at all */
    while (1) {
        tick_nohz_stop_sched_tick(1);
        while (!need_resched()) {
            check_pgt_cache();
            rmb();

            if (rcu_pending(cpu))
                rcu_check_callbacks(cpu, 0);

            if (cpu_is_offline(cpu))
                play_dead();

            local_irq_disable();
            __get_cpu_var(irq_stat).idle_timestamp = jiffies;
            /* Don't trace irqs off for idle */
            stop_critical_timings();
            pm_idle();
            start_critical_timings();
        }
        tick_nohz_restart_sched_tick();
        preempt_enable_no_resched();
        schedule();
        preempt_disable();
    }
}
It loops testing need_resched to keep exit latency low, and calls the idle routine to save power.
The default idle implementation is the hlt instruction, which halts the CPU until a hardware interrupt wakes it, saving power — moving the processor from the C0 state to C1 (see the ACPI standard). This was also the main trick behind the various "CPU cooling" utilities on Windows years ago. The idle routine can also be defined in an ACPI or APM module, or even be custom (a nop, say).
Approach to the problem: memory leaks can be hunted with mtrace and valgrind.
DESCRIPTION
    mtrace is a Perl script used to interpret and provide human readable output of the trace log contained in the file mtracedata, whose contents were produced by mtrace(3). If binary is provided, the output of mtrace also contains the source file name with line number information for problem locations (assuming that binary was compiled with debugging information). For more information about the mtrace(3) function and mtrace script usage, see mtrace(3).
In short, it is a Perl script that interprets the trace log file.
#include <stdio.h>
#include <stdlib.h>
#include <mcheck.h>   /* declares mtrace() */

int main() {
    /* setenv points malloc tracing at a log file. */
    setenv("MALLOC_TRACE", "test.log", 1);
    /* mtrace() starts recording malloc/free events. */
    mtrace();
    int *p = (int *)malloc(2 * sizeof(int));
    return 0;
}
Compile with this command (the -g flag is essential):
gcc -g 1.c
Run it:
./a.out
Look at the generated log, test.log:
wanglei@wanglei-PC:~/ccode$ cat test.log
= Start
@ ./test:[0x5574215471c5] + 0x557422567890 0x8
@ /lib/x86_64-linux-gnu/libc.so.6:[0x7f974e92e83d] - 0x5574225672a0
@ /lib/x86_64-linux-gnu/libc.so.6:(tdestroy+0x36)[0x7f974e8b0996] - 0x557422567460
@ /lib/x86_64-linux-gnu/libc.so.6:[0x7f974e92e826] - 0x557422567480
Analyze it with mtrace:
wanglei@wanglei-PC:~/ccode$ mtrace a.out test.log
- 0x00005574225672a0 Free 3 was never alloc'd 0x7f974e92e83d
- 0x0000557422567460 Free 4 was never alloc'd 0x7f974e8b0996
- 0x0000557422567480 Free 5 was never alloc'd 0x7f974e92e826

Memory not freed:
-----------------
           Address     Size     Caller
0x0000557422567890      0x8  at 0x5574215471c5
Here a memory leak is found. The tutorial I followed also showed the source location of the leaking code, which my analysis lacks — I'm not sure why and will look into it.
Now fix the leak:
#include <stdio.h>
#include <stdlib.h>
#include <mcheck.h>   /* declares mtrace() */

int main() {
    /* setenv points malloc tracing at a log file. */
    setenv("MALLOC_TRACE", "test.log", 1);
    /* mtrace() starts recording malloc/free events. */
    mtrace();
    int *p = (int *)malloc(2 * sizeof(int));
    free(p);
    return 0;
}
And analyze again:
wanglei@wanglei-PC:~/ccode$ mtrace a.out test.log
- 0x00005558806c62a0 Free 4 was never alloc'd 0x7efdaabf483d
- 0x00005558806c6460 Free 5 was never alloc'd 0x7efdaab76996
- 0x00005558806c6480 Free 6 was never alloc'd 0x7efdaabf4826
No memory leaks.
Valgrind is a tool suite comprising the following:
The individual tools are invoked as `valgrind --tool=name program`; when the --tool parameter is omitted the default is --tool=memcheck.
The most used tool, memcheck detects memory problems in a program: every read and write of memory is checked, and every call to malloc, free, new, and delete is intercepted. It can therefore detect the following problems:
#include <stdlib.h>
#include <malloc.h>
#include <string.h>

void test() {
    int *ptr = malloc(sizeof(int) * 10);
    ptr[10] = 7;              // out-of-bounds write
    memcpy(ptr + 1, ptr, 5);  // overlapping copy (memory stomp)
    free(ptr);
    free(ptr);                // double free
    int *p1;
    *p1 = 1;                  // wild (uninitialized) pointer
}

int main() {
    test();
    return 0;
}
Compile:
gcc val.c -o val
Run the check:
valgrind --leak-check=full ./val
Result:
==30952== Memcheck, a memory error detector
==30952== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30952== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==30952== Command: ./val
==30952==
==30952== Invalid write of size 4                          # out-of-bounds write
==30952==    at 0x1091AB: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==  Address 0x4a59068 is 0 bytes after a block of size 40 alloc'd
==30952==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==30952==    by 0x10919E: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==
==30952== Invalid free() / delete / delete[] / realloc()   # double free
==30952==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==30952==    by 0x1091E4: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==  Address 0x4a59040 is 0 bytes inside a block of size 40 free'd
==30952==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==30952==    by 0x1091D8: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==  Block was alloc'd at
==30952==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==30952==    by 0x10919E: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==
==30952== Use of uninitialised value of size 8             # wild pointer
==30952==    at 0x1091E9: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==
==30952==
==30952== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30952==  Bad permissions for mapped region at address 0x1090A0
==30952==    at 0x1091E9: test (in /home/wanglei/ccode/val)
==30952==    by 0x109203: main (in /home/wanglei/ccode/val)
==30952==
==30952== HEAP SUMMARY:
==30952==     in use at exit: 0 bytes in 0 blocks
==30952==   total heap usage: 1 allocs, 2 frees, 40 bytes allocated
==30952==
==30952== All heap blocks were freed -- no leaks are possible
==30952==
==30952== Use --track-origins=yes to see where uninitialised values come from
==30952== For lists of detected and suppressed errors, rerun with: -s
==30952== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
Segmentation fault
A profiling tool similar to gprof, but richer in information. Unlike gprof, it needs no special compile-time options, though debug info is best included. Callgrind collects data while the program runs, builds the function call graph, and can optionally simulate the cache. On exit it writes the analysis data to a file; callgrind_annotate converts that file's contents into readable form.
Generating a visual graph requires downloading gprof2dot:
pip install gprof2dot
Callgrind can render a performance-analysis graph. First, a word about profiling tools in general: GNU's own gprof is the usual choice, and it is used by adding the -pg flag at compile time:
#include <stdio.h>
#include <unistd.h>   /* sleep() */

void test() {
    sleep(1);
}

void f() {
    int i;
    for (i = 0; i < 5; i++)
        test();
}

int main() {
    f();
    printf("process is over!\n");
    return 0;
}
Then compile:
gcc -pg -o calg calg.c
Run the program with ./calg; when it finishes, a gmon.out file appears in the current directory:
./calg
Download and unpack gprof2dot:
tar zxvf gprof2dot-2021.2.21.tar.gz
Enter the directory:
cd gprof2dot-2021.2.21/
Make the script executable:
chmod +x gprof2dot.py
Edit the .bashrc file (Bash loads .bashrc when it starts up):
vim .bashrc
Insert this line, then save and quit:
export PATH="/home/wanglei/下载/gprof2dot-2021.2.21:$PATH"
Make the change take effect:
source ~/.bashrc
If it complains that the dot command is missing:
sudo apt install graphviz
Then run:
gprof ./calg | gprof2dot.py |dot -Tpng -o report.png
This produces report.png:
(image: report.png — call graph produced by gprof | gprof2dot | dot)
That graph came from gprof; now generate one with Callgrind.
Run:
valgrind --tool=callgrind ./calg
This produces callgrind.out.35316, the analysis file.
You can print the results directly:
callgrind_annotate callgrind.out.35316
Or produce a graphical result:
gprof2dot.py -f callgrind callgrind.out.35316 |dot -Tpng -o report2.png
(image: report2.png — call graph produced from the callgrind output)
A very handy tool — good to have learned it.
The cache profiler: it simulates the CPU's level-1 caches (I1 and D1) and the last-level cache, pinpointing exactly where the program's cache misses and hits occur. If needed, it can also report miss counts, memory-reference counts, and instruction counts per line of code, per function, per module, and for the whole program — a great help when optimizing.
Using the same program as before:
valgrind --tool=cachegrind ./calg
Output:
==35929== Cachegrind, a cache and branch-prediction profiler
==35929== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==35929== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==35929== Command: ./calg
==35929==
--35929-- warning: L3 cache found, using its data for the LL simulation.
--35929-- warning: specified LL cache: line_size 64 assoc 16 total_size 12,582,912
--35929-- warning: simulated LL cache: line_size 64 assoc 24 total_size 12,582,912
process is over!
==35929==
==35929== I   refs:      203,786
==35929== I1  misses:      1,090
==35929== LLi misses:      1,068
==35929== I1  miss rate:    0.53%
==35929== LLi miss rate:    0.52%
==35929==
==35929== D   refs:       58,455  (46,268 rd + 12,187 wr)
==35929== D1  misses:      3,188  ( 2,531 rd +    657 wr)
==35929== LLd misses:      2,655  ( 2,057 rd +    598 wr)
==35929== D1  miss rate:     5.5% (   5.5%   +    5.4%  )
==35929== LLd miss rate:     4.5% (   4.4%   +    4.9%  )
==35929==
==35929== LL refs:         4,278  ( 3,621 rd +    657 wr)
==35929== LL misses:       3,723  ( 3,125 rd +    598 wr)
==35929== LL miss rate:      1.4% (   1.2%   +    4.9%  )
Helgrind mainly checks for race problems in multithreaded programs. It looks for regions of memory that are accessed by multiple threads without consistent locking — typically the places where threads lose synchronization, leading to hard-to-find bugs. Helgrind implements the race-detection algorithm known as "Eraser", with further refinements that reduce the number of reported errors. It is still considered experimental.
#include <stdio.h>
#include <pthread.h>

#define NLOOP 50

int counter = 0;  /* incremented by threads */

void *threadfn(void *);

int main(int argc, char **argv)
{
    pthread_t tid1, tid2, tid3;

    pthread_create(&tid1, NULL, &threadfn, NULL);
    pthread_create(&tid2, NULL, &threadfn, NULL);
    pthread_create(&tid3, NULL, &threadfn, NULL);

    /* wait for all three threads to terminate */
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    pthread_join(tid3, NULL);

    return 0;
}

void *threadfn(void *vptr)
{
    int i, val;

    for (i = 0; i < NLOOP; i++) {
        val = counter;  /* unlocked read-modify-write: a data race */
        printf("%x: %d\n", (unsigned int)pthread_self(), val + 1);
        counter = val + 1;
    }
    return NULL;
}
Compile:
gcc jz.c -o jz -lpthread
The final count sometimes reaches 150 and sometimes falls short: nothing is locked, so the threads race.
Run the analysis:
valgrind --tool=helgrind ./jz
Result:
==37656== Helgrind, a thread error detector
==37656== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==37656== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==37656== Command: ./jz
==37656==
5283700: 1
5283700: 2
5283700: 3
5283700: 4
5283700: 5
==37656== ---Thread-Announcement------------------------------------------
==37656==
==37656== Thread #3 was created
==37656==    at 0x49B0282: clone (clone.S:71)
==37656==    by 0x48732EB: create_thread (createthread.c:101)
==37656==    by 0x4874E0F: pthread_create@@GLIBC_2.2.5 (pthread_create.c:817)
==37656==    by 0x4842917: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x109224: main (in /home/wanglei/ccode/jz)
==37656==
==37656== ---Thread-Announcement------------------------------------------
==37656==
==37656== Thread #2 was created
==37656==    at 0x49B0282: clone (clone.S:71)
==37656==    by 0x48732EB: create_thread (createthread.c:101)
==37656==    by 0x4874E0F: pthread_create@@GLIBC_2.2.5 (pthread_create.c:817)
==37656==    by 0x4842917: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x109207: main (in /home/wanglei/ccode/jz)
==37656==
==37656== ----------------------------------------------------------------
==37656==
==37656== Possible data race during read of size 4 at 0x10C014 by thread #3   # a possible race is detected
==37656== Locks held: none
==37656==    at 0x1092AA: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==
==37656== This conflicts with a previous write of size 4 by thread #2
==37656== Locks held: none
==37656==    at 0x1092D9: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Address 0x10c014 is 0 bytes inside data symbol "counter"
==37656==
==37656== ---Thread-Announcement------------------------------------------
==37656==
==37656== Thread #4 was created
==37656==    at 0x49B0282: clone (clone.S:71)
==37656==    by 0x48732EB: create_thread (createthread.c:101)
==37656==    by 0x4874E0F: pthread_create@@GLIBC_2.2.5 (pthread_create.c:817)
==37656==    by 0x4842917: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x109241: main (in /home/wanglei/ccode/jz)
==37656==
==37656== ----------------------------------------------------------------
==37656==
==37656== Possible data race during write of size 1 at 0x5284190 by thread #4
==37656== Locks held: none
==37656==    at 0x48488CC: mempcpy (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x49207B1: _IO_new_file_xsputn (fileops.c:1236)
==37656==    by 0x49207B1: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Address 0x5284190 is 0 bytes inside a block of size 1,024 alloc'd
==37656==    at 0x483C893: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4912E83: _IO_file_doallocate (filedoalloc.c:101)
==37656==    by 0x492304F: _IO_doallocbuf (genops.c:347)
==37656==    by 0x49220AF: _IO_file_overflow@@GLIBC_2.2.5 (fileops.c:745)
==37656==    by 0x4920834: _IO_new_file_xsputn (fileops.c:1244)
==37656==    by 0x4920834: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Block was alloc'd by thread #2
==37656==
==37656== ----------------------------------------------------------------
==37656==
==37656== Possible data race during write of size 1 at 0x5284198 by thread #4
==37656== Locks held: none
==37656==    at 0x48488A6: mempcpy (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x49207B1: _IO_new_file_xsputn (fileops.c:1236)
==37656==    by 0x49207B1: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x4908165: __vfprintf_internal (vfprintf-internal.c:1719)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Address 0x5284198 is 8 bytes inside a block of size 1,024 alloc'd
==37656==    at 0x483C893: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4912E83: _IO_file_doallocate (filedoalloc.c:101)
==37656==    by 0x492304F: _IO_doallocbuf (genops.c:347)
==37656==    by 0x49220AF: _IO_file_overflow@@GLIBC_2.2.5 (fileops.c:745)
==37656==    by 0x4920834: _IO_new_file_xsputn (fileops.c:1244)
==37656==    by 0x4920834: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Block was alloc'd by thread #2
==37656==
6685700: 5
==37656== ----------------------------------------------------------------
==37656==
==37656== Possible data race during write of size 4 at 0x10C014 by thread #4
==37656== Locks held: none
==37656==    at 0x1092D9: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==
==37656== This conflicts with a previous read of size 4 by thread #3
==37656== Locks held: none
==37656==    at 0x1092AA: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Address 0x10c014 is 0 bytes inside data symbol "counter"
==37656==
6685700: 6
6685700: 7
==37656== ----------------------------------------------------------------
==37656==
==37656== Possible data race during write of size 1 at 0x5284196 by thread #2
==37656== Locks held: none
==37656==    at 0x48488A6: mempcpy (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x49207B1: _IO_new_file_xsputn (fileops.c:1236)
==37656==    by 0x49207B1: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==
==37656== This conflicts with a previous write of size 1 by thread #4
==37656== Locks held: none
==37656==    at 0x48488CC: mempcpy (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x49207B1: _IO_new_file_xsputn (fileops.c:1236)
==37656==    by 0x49207B1: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Address 0x5284196 is 6 bytes inside a block of size 1,024 alloc'd
==37656==    at 0x483C893: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4912E83: _IO_file_doallocate (filedoalloc.c:101)
==37656==    by 0x492304F: _IO_doallocbuf (genops.c:347)
==37656==    by 0x49220AF: _IO_file_overflow@@GLIBC_2.2.5 (fileops.c:745)
==37656==    by 0x4920834: _IO_new_file_xsputn (fileops.c:1244)
==37656==    by 0x4920834: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1197)
==37656==    by 0x490892C: __vfprintf_internal (vfprintf-internal.c:1687)
==37656==    by 0x48F2EBE: printf (printf.c:33)
==37656==    by 0x1092D2: threadfn (in /home/wanglei/ccode/jz)
==37656==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==37656==    by 0x4874608: start_thread (pthread_create.c:477)
==37656==    by 0x49B0292: clone (clone.S:95)
==37656==  Block was alloc'd by thread #2
(irrelevant output omitted)
==37656==
==37656== Use --history-level=approx or =none to gain increased speed, at
==37656== the cost of reduced accuracy of conflicting-access information
==37656== For lists of detected and suppressed errors, rerun with: -s
==37656== ERROR SUMMARY: 121 errors from 5 contexts (suppressed: 145 from 25)
The heap/stack profiler: it measures how much memory the program uses, reporting the sizes of heap blocks, heap administration blocks, and the stack. Massif helps us reduce memory usage, and on modern systems with virtual memory it can also speed the program up and reduce the chance of it lingering in swap.
Massif profiles memory allocation and deallocation. With it, a developer gains deep insight into the program's memory-usage behavior and can optimize accordingly. This is especially useful for C++, which has many hidden allocations and deallocations.
(omitted)

(omitted)
System-level performance optimization usually involves two phases:
Performance profiling, with the help of existing profiling tools such as perf.
Code optimization, which relies on the developer's experience: writing concise, efficient code, even down to choosing instructions sensibly at the assembly level and arranging their execution order.
Tracepoints are hooks scattered through the kernel source that fire when a particular piece of code is executed; this property can be exploited by all kinds of trace/debug tools.
perf records the events produced by tracepoints and generates reports; by analyzing these reports, a performance engineer can understand the kernel's behavior in detail while the program was running and diagnose performance symptoms accurately.
The sysfs nodes corresponding to these tracepoints live under /sys/kernel/debug/tracing/events.
perf

usage: perf [--version] [--help] COMMAND [ARGS]

The most commonly used perf commands are:

# Parse the perf.data file produced by perf record and display annotated code.
annotate        Read perf.data (created by perf record) and display annotated code
# Package all sampled ELF files according to the build-ids recorded in perf.data,
# so the recorded samples can be analyzed on any machine.
archive         Create archive with object files with build-ids found in perf.data file
# Built-in benchmarks; currently two suites targeting the scheduler and the
# memory-management subsystem.
bench           General framework for benchmark suites
# Manage perf's build-id cache. Every ELF file has a unique build-id, which perf
# uses to associate performance data with the ELF file.
buildid-cache   Manage build-id cache.
# List all build-ids recorded in a data file.
buildid-list    List the buildids in a perf.data file
# Compare two data files, showing per-symbol (per-function) differences in the
# hot-spot analysis.
diff            Read two perf.data files and display the differential profile
# Read the event stream recorded by perf record and redirect it to stdout; other
# events can be injected into the stream at any point in the analyzed code.
inject          Filter to augment the events stream with additional information
# Trace/measure the kernel memory (slab) subsystem.
kmem            Tool to trace/measure kernel memory(slab) properties
# Trace/measure a guest OS running in a KVM virtual machine.
kvm             Tool to trace/measure kvm guest os
# List all performance events the current system supports: hardware events,
# software events, and tracepoints.
list            List all symbolic event types
# Analyze kernel lock information, including contention and wait latency.
lock            Analyze lock events
# Define new dynamic tracepoints.
probe           Define new dynamic tracepoints
# Collect samples and record them in a data file for later analysis by other tools.
record          Run a command and record its profile into perf.data
# Read the data file created by perf record and show the hot-spot analysis.
report          Read perf.data (created by perf record) and display the profile
# Analysis tool for the scheduler subsystem.
sched           Tool to trace/measure scheduler properties (latencies)
# Run extension scripts written in Perl or Python, generate script skeletons,
# and read data from a data file.
script          Read perf.data (created by perf record) and display trace output
# Run a command and gather per-process performance statistics such as CPI and
# cache miss rate.
stat            Run a command and gather performance counter statistics
# Sanity-test the current hardware and software platform for full perf support.
test            Runs sanity tests.
# Visualize total system behavior during a workload.
timechart       Tool to visualize total system behavior during a workload
# Top-like system profiling tool.
top             System profiling tool.

See 'perf help COMMAND' for more information on a specific command.
I installed two packages on my machine:

sudo apt install linux-tools-generic
sudo apt install linux-cloud-tools-generic
perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
  duration_time                                      [Tool event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
perf stat ./XXX
 Performance counter stats for './XXX':

              1.67 msec task-clock            #    0.000 CPUs utilized
                60      context-switches      #    0.036 M/sec
                 0      cpu-migrations        #    0.000 K/sec
                57      page-faults           #    0.034 M/sec
         6,057,219      cycles                #    3.632 GHz
         2,487,302      instructions          #    0.41  insn per cycle
           447,745      branches              #  268.444 M/sec
            28,980      branch-misses         #    6.47% of all branches

      60.013507121 seconds time elapsed

       0.002326000 seconds user
       0.002326000 seconds sys
perf top
(screenshot of the interactive perf top display omitted)
sudo perf record
Records event statistics for the system or a piece of software over a period of time.
sudo perf report -f perf.data
Displays perf.data in a text interface.