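Time in Linux
Preface
These are my notes from studying Chapter 4 of 《图解Linux内核》.
The focus is on how Linux keeps time, not on timers (the timer list is implemented in the kernel on top of softirqs; its minimum resolution, the jiffy, is what this chapter discusses).
Some of the content came out of "discussions" with an AI and may be wrong; corrections are welcome.
Structures for keeping time
timekeeper and tk_read_base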
include/linux/timekeeper_internal.h
/**
* struct tk_read_base - base structure for timekeeping readout
* @clock: Current clocksource used for timekeeping.
* @read: Read function of @clock
* @mask: Bitmask for two's complement subtraction of non 64bit clocks
* @cycle_last: @clock cycle value at last update
* @mult: (NTP adjusted) multiplier for scaled math conversion
* @shift: Shift value for scaled math conversion
* @xtime_nsec: Shifted (fractional) nano seconds offset for readout
* @base: ktime_t (nanoseconds) base time for readout
* (note: ktime_t is essentially a signed 64-bit nanosecond count)
*
* This struct has size 56 byte on 64 bit. Together with a seqcount it
* occupies a single 64byte cache line.
*
* The struct is separate from struct timekeeper as it is also used
* for a fast NMI safe accessors.
*/
struct tk_read_base {
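struct clocksource *clock; // the selected clocksource
cycle_t (*read)(struct clocksource *cs); // reads the current cycle count from @clock
cycle_t mask; // bitmask applied to the cycle delta so counters narrower than 64 bits wrap correctly
cycle_t cycle_last; // cycle count at the last update
u32 mult; // multiplier of the cycles-to-nanoseconds conversion
u32 shift; // shift of the cycles-to-nanoseconds conversion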
u64 xtime_nsec;
ktime_t base;
};
/**
* struct timekeeper - Structure holding internal timekeeping values.
* @tkr_mono: The readout base structure for CLOCK_MONOTONIC
* @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
* @xtime_sec: Current CLOCK_REALTIME time in seconds
* @ktime_sec: Current CLOCK_MONOTONIC time in seconds
* @wall_to_monotonic: CLOCK_REALTIME to CLOCK_MONOTONIC offset
* @offs_real: Offset clock monotonic -> clock realtime
* @offs_boot: Offset clock monotonic -> clock boottime
* @offs_tai: Offset clock monotonic -> clock tai
* @tai_offset: The current UTC to TAI offset in seconds
* @raw_time: Monotonic raw base time in timespec64 format
* @cycle_interval: Number of clock cycles in one NTP interval
* @xtime_interval: Number of clock shifted nano seconds in one NTP
* interval.
* @xtime_remainder: Shifted nano seconds left over when rounding
* @cycle_interval
* @raw_interval: Raw nano seconds accumulated per NTP interval.
* @ntp_error: Difference between accumulated time and NTP time in ntp
* shifted nano seconds.
* @ntp_error_shift: Shift conversion between clock shifted nano seconds and
* ntp shifted nano seconds.
*
* Note: For timespec(64) based interfaces wall_to_monotonic is what
* we need to add to xtime (or xtime corrected for sub jiffie times)
* to get to monotonic time. Monotonic is pegged at zero at system
* boot time, so wall_to_monotonic will be negative, however, we will
* ALWAYS keep the tv_nsec part positive so we can use the usual
* normalization.
*
* wall_to_monotonic is moved after resume from suspend for the
* monotonic time not to jump. We need to add total_sleep_time to
* wall_to_monotonic to get the real boot based time offset.
*
* wall_to_monotonic is no longer the boot time, getboottime must be
* used instead.
*/
Side note: timespec64 is simply a struct of seconds and nanoseconds
struct timespec64 {
time64_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
struct timekeeper {
struct tk_read_base tkr_mono; // readout base for CLOCK_MONOTONIC
struct tk_read_base tkr_raw; // readout base for CLOCK_MONOTONIC_RAW
u64 xtime_sec; // current CLOCK_REALTIME time, in seconds
unsigned long ktime_sec; // current CLOCK_MONOTONIC time, in seconds
struct timespec64 wall_to_monotonic; // offset from CLOCK_REALTIME to CLOCK_MONOTONIC
ktime_t offs_real; // offset monotonic -> realtime
ktime_t offs_boot; // offset monotonic -> boottime
ktime_t offs_tai; // offset monotonic -> TAI (International Atomic Time)
s32 tai_offset; // current UTC to TAI offset, in seconds
struct timespec64 raw_time; // base time of the raw monotonic clock, as timespec64
/* The following members are for timekeeping internal use */
cycle_t cycle_interval; // clock cycles in one NTP interval
u64 xtime_interval; // shifted nanoseconds in one NTP interval
s64 xtime_remainder; // shifted nanoseconds left over when rounding @cycle_interval
u32 raw_interval; // raw nanoseconds accumulated per NTP interval
/* The ntp_tick_length() value currently being used.
* This cached copy ensures we consistently apply the tick
* length for an entire tick, as ntp_tick_length may change
* mid-tick, and we don't want to apply that new value to
* the tick in progress.
*/
u64 ntp_tick;
/* Difference between accumulated time and NTP time in ntp
* shifted nano seconds. */
s64 ntp_error;
u32 ntp_error_shift;
u32 ntp_err_mult;
};
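The struct has many members, but they are all fairly easy to understand. struct timekeeper is what maintains time, and it holds two "timelines": tkr_mono, which is NTP-adjusted, and tkr_raw, which is the raw one. Beyond that it records the offsets between the various clocks and the NTP adjustment parameters, all of which are indispensable for keeping time.
Clocksource: struct clocksource
This part matches the book, so here is the source: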
include/linux/clocksource.h
/**
* struct clocksource - hardware abstraction for a free running counter
* Provides mostly state-free accessors to the underlying hardware.
* This is the structure used for system time.
*
* @name: ptr to clocksource name
* @list: list head for registration
* @rating: rating value for selection (higher is better)
* To avoid rating inflation the following
* list should give you a guide as to how
* to assign your clocksource a rating
* 1-99: Unfit for real use
* Only available for bootup and testing purposes.
* 100-199: Base level usability.
* Functional for real use, but not desired.
* 200-299: Good.
* A correct and usable clocksource.
* 300-399: Desired.
* A reasonably fast and accurate clocksource.
* 400-499: Perfect
* The ideal clocksource. A must-use where
* available.
* @read: returns a cycle value, passes clocksource as argument
* @enable: optional function to enable the clocksource
* @disable: optional function to disable the clocksource
* @mask: bitmask for two's complement
* subtraction of non 64 bit counters
* @mult: cycle to nanosecond multiplier
* @shift: cycle to nanosecond divisor (power of two)
* @max_idle_ns: max idle time permitted by the clocksource (nsecs)
* @maxadj: maximum adjustment value to mult (~11%)
* @max_cycles: maximum safe cycle value which won't overflow on multiplication
* @flags: flags describing special properties
* @archdata: arch-specific data
* @suspend: suspend function for the clocksource, if necessary
* @resume: resume function for the clocksource, if necessary
* @owner: module reference, must be set by clocksource in modules
*/
struct clocksource {
/*
* Hotpath data, fits in a single cache line when the
* clocksource itself is cacheline aligned.
*/
cycle_t (*read)(struct clocksource *cs);
cycle_t mask;
u32 mult;
u32 shift;
u64 max_idle_ns;
u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
u64 max_cycles;
const char *name;
struct list_head list;
int rating;
int (*enable)(struct clocksource *cs);
void (*disable)(struct clocksource *cs);
unsigned long flags;
void (*suspend)(struct clocksource *cs);
void (*resume)(struct clocksource *cs);
/* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
/* Watchdog related data, used by the framework */
struct list_head wd_list;
cycle_t cs_last;
cycle_t wd_last;
#endif
struct module *owner;
} ____cacheline_aligned;
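Fields worth noting:
| Field | Type | Description |
|---|---|---|
| flags | unsigned long | Flags describing the clocksource |
| mult and shift | u32 | Same role as the fields of the same name in tk_read_base: converting clock cycles to nanoseconds (the copy in tk_read_base is additionally adjusted by NTP) |
| read | callback | Reads the clocksource's current cycle count |
| rating | int | Grade of the clocksource, indicating its precision; higher is better |
- How does the read method in tk_read_base differ from the read method in clocksource?
It does not; it is the very same function pointer, copied when the timekeeper is bound to a clocksource: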
kernel/time/timekeeping.c
static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock) {
// ... other initialization ...
tk->tkr_mono.clock = clock; // bind the clocksource
tk->tkr_mono.read = clock->read; // the read pointer is copied directly
tk->tkr_raw.clock = clock; // bind the same clocksource
tk->tkr_raw.read = clock->read; // copied directly here as well
// ... other initialization ...
}
Reading the time
From 《图解Linux内核》:
For the gettimeofday system call, the kernel ends up executing ktime_get_real_ts64.
Simplified to show only what it does:
kernel/time/timekeeping.h and timekeeping.c
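void ktime_get_real_ts64(struct timespec64 *ts) {
    /* simplified: the seqcount retry loop of the real code is omitted */
    struct timekeeper *tk = &tk_core.timekeeper; /* the global timekeeper */
    ts->tv_sec = tk->xtime_sec; /* CLOCK_REALTIME seconds as of the last update */
    ts->tv_nsec = 0;
    /* add the nanoseconds read from the tkr_mono timeline */
    timespec64_add_ns(ts, timekeeping_get_ns(&tk->tkr_mono));
}
u64 timekeeping_get_ns(const struct tk_read_base *tkr) {
    struct clocksource *clock = READ_ONCE(tkr->clock); /* clocksource bound to this tkr */
    cycle_t cycle_now = clock->read(clock); /* current cycle count */
    cycle_t cycle_delta = (cycle_now - tkr->cycle_last) & tkr->mask; /* cycles since the last update */
    u64 nsec = cycle_delta * tkr->mult + tkr->xtime_nsec; /* scale to shifted nanoseconds */
    nsec >>= tkr->shift; /* drop the fractional bits */
    return nsec;
}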
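gettimeofday reads the tkr_mono timeline in the timekeeper. "mono" stands for monotonic: it only moves forward, with no jumps and no going back. The mult member of a tkr controls how fast clock cycles turn into nanoseconds, and NTP adjustment acts on mult. If, for example, the kernel learns through NTP that the clock has gained 10 seconds over a month, it lowers the mult of tkr_mono so that nanoseconds accumulate "a bit slower" from then on. tkr_raw never receives this correction, and that is the difference between tkr_mono and tkr_raw. (From a discussion with an AI.)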
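Also, both tkr_mono and tkr_raw work on clock cycles: as long as the oscillator is running, the cycle count keeps accumulating and never jumps. What can jump is CLOCK_REALTIME. When the clock is stepped (by settimeofday, or by an NTP daemon that steps instead of slewing), the kernel rewrites the REALTIME base, that is xtime_sec together with the sub-second xtime_nsec offset, and shifts wall_to_monotonic by the same amount so that CLOCK_MONOTONIC stays continuous. A step therefore shows up as a jump in the value gettimeofday returns, while the nanoseconds computed from the cycle delta since the last update are unaffected. (From a discussion with an AI.)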
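The conversion done by timekeeping_get_ns can be tried out in user space. The following is a minimal sketch, not kernel code: the function names and the 1 MHz counter with shift = 10 used in the note below are assumptions, chosen so the arithmetic is easy to follow by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed since the last update. The mask makes the subtraction
 * wrap correctly for counters narrower than 64 bits. */
uint64_t demo_cycle_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Scaled conversion: ns = (delta * mult + xtime_nsec) >> shift.
 * xtime_nsec is the shifted nanosecond remainder kept from the last update. */
uint64_t demo_cycles_to_ns(uint64_t delta, uint32_t mult, uint32_t shift,
                           uint64_t xtime_nsec)
{
    assert(shift < 64);
    return (delta * mult + xtime_nsec) >> shift;
}
```

For a hypothetical 1 MHz counter (1000 ns per cycle) with shift = 10, mult is 1000 << 10 = 1024000, and 5 cycles convert to 5000 ns. Lowering mult by 1024 (999 ns per cycle) turns the same 5 cycles into 4995 ns, which is how a rate correction makes the clock run slightly slower without any jump. A 32-bit counter that wraps from 0xFFFFFFF0 to 0x10 still yields a delta of 32 cycles thanks to the mask.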
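Choosing a clocksource
The book says that rating in struct clocksource is the clocksource's grade: the higher the rating, the higher the precision, and the kernel picks the highest-rated clocksource for the timekeeper. So how do you see how many clocksources a system has?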
ls /sys/devices/system/clocksource
clocksource0 power uevent
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
arch_sys_counter
So this STM32MP157 board has a single clocksource, arch_sys_counter, and the kernel has nothing to choose from. Here is a virtual machine:
fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
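There are indeed several clocksources here. How do you see their ratings? (Unresolved: the AI's answers were made up, so I am leaving this for now.)
What is certain is that each struct clocksource corresponds to one hardware clocksource: it is called mxc_timer1 on the I.MX6ULL and arch_sys_counter on the STM32MP157.
The book says the kernel picks a continuous clocksource that does not itself need to be monitored and uses it to monitor the others. If a clocksource's error exceeds the acceptable range, its state is set to CLOCK_SOURCE_UNSTABLE and its rating field is set to 0.
- What is a continuous clocksource that does not need to be monitored?
One whose flags have CLOCK_SOURCE_IS_CONTINUOUS set and CLOCK_SOURCE_MUST_VERIFY clear.
Among the clocksources that qualify, the highest-rated one normally becomes the watchdog.
On the board: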
root@lubancat:/home# dmesg | grep -i "clocksource: Switched to clocksource"
[ 0.362019] clocksource: Switched to clocksource arch_sys_counter
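The two selection rules described in this section (highest rating for timekeeping, and a continuous source that needs no verification as the watchdog) can be sketched as follows. This is a simplified model and not the kernel's code: the struct, the function names, and the DEMO_ flag values are hypothetical, and the ratings used in the note below are assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the kernel has its own CLOCK_SOURCE_* constants. */
#define DEMO_IS_CONTINUOUS 0x01u
#define DEMO_MUST_VERIFY   0x02u

struct demo_clocksource {
    const char *name;
    int rating;
    unsigned int flags;
};

/* Timekeeping source: simply the highest rating. */
const struct demo_clocksource *
demo_pick_best(const struct demo_clocksource *cs, size_t n)
{
    const struct demo_clocksource *best = NULL;

    assert(cs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!best || cs[i].rating > best->rating)
            best = &cs[i];
    return best;
}

/* Watchdog: highest rating among continuous sources that need no verification. */
const struct demo_clocksource *
demo_pick_watchdog(const struct demo_clocksource *cs, size_t n)
{
    const struct demo_clocksource *wd = NULL;

    assert(cs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!(cs[i].flags & DEMO_IS_CONTINUOUS) ||
            (cs[i].flags & DEMO_MUST_VERIFY))
            continue; /* not eligible to monitor others */
        if (!wd || cs[i].rating > wd->rating)
            wd = &cs[i];
    }
    return wd;
}
```

With an assumed list of tsc (rating 300, must be verified), hpet (250) and acpi_pm (200), demo_pick_best returns tsc and demo_pick_watchdog returns hpet, which is consistent with the TSC being monitored instead of doing the monitoring.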
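Structures for the timer interrupt
The book calls devices that keep time clocksources, and devices that deliver time events timer interrupt devices (clock event devices).
struct clock_event_device
This struct describes a device that can raise interrupts; the device's timer interrupt is handled in its event_handler.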
include/linux/clockchips.h
/**
* struct clock_event_device - clock event device descriptor
* @event_handler: Assigned by the framework to be called by the low
* level handler of the event source
* @set_next_event: set next event function using a clocksource delta
* @set_next_ktime: set next event function using a direct ktime value
* @next_event: local storage for the next event in oneshot mode
* @max_delta_ns: maximum delta value in ns
* @min_delta_ns: minimum delta value in ns
* @mult: nanosecond to cycles multiplier
* @shift: nanoseconds to cycles divisor (power of two)
* @mode: operating mode, relevant only to ->set_mode(), OBSOLETE
* @state: current state of the device, assigned by the core code
* @features: features
* @retries: number of forced programming retries
* @set_mode: legacy set mode function, only for modes <= CLOCK_EVT_MODE_RESUME.
* @set_state_periodic: switch state to periodic, if !set_mode
* @set_state_oneshot: switch state to oneshot, if !set_mode
* @set_state_shutdown: switch state to shutdown, if !set_mode
* @tick_resume: resume clkevt device, if !set_mode
* @broadcast: function to broadcast events
* @min_delta_ticks: minimum delta value in ticks stored for reconfiguration
* @max_delta_ticks: maximum delta value in ticks stored for reconfiguration
* @name: ptr to clock event name
* @rating: variable to rate clock event devices
* @irq: IRQ number (only for non CPU local devices)
* @bound_on: Bound on CPU
* @cpumask: cpumask to indicate for which CPUs this device works
* @list: list head for the management code
* @owner: module reference
*/
struct clock_event_device {
void (*event_handler)(struct clock_event_device *); /* callback that handles the timer interrupt when it arrives; installed by the framework */
int (*set_next_event)(unsigned long evt, struct clock_event_device *); /* program the next timer interrupt */
int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
ktime_t next_event;
u64 max_delta_ns;
u64 min_delta_ns;
u32 mult; /* same idea as mult and shift in the timekeeper, but converting nanoseconds to cycles */
u32 shift;
enum clock_event_mode mode;
enum clock_event_state state;
unsigned int features; /* capabilities of the device, CLOCK_EVT_FEAT_XXX */
unsigned long retries;
/*
* State transition callback(s): Only one of the two groups should be
* defined:
* - set_mode(), only for modes <= CLOCK_EVT_MODE_RESUME.
* - set_state_{shutdown|periodic|oneshot}(), tick_resume().
*/
void (*set_mode)(enum clock_event_mode mode, struct clock_event_device *);
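int (*set_state_periodic)(struct clock_event_device *); /* switch the device state; called by the kernel when it needs to change the device mode, implemented by the driver */
int (*set_state_oneshot)(struct clock_event_device *); /* switch the device state */
int (*set_state_shutdown)(struct clock_event_device *); /* switch the device state */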
int (*tick_resume)(struct clock_event_device *);
void (*broadcast)(const struct cpumask *mask);
void (*suspend)(struct clock_event_device *);
void (*resume)(struct clock_event_device *);
unsigned long min_delta_ticks;
unsigned long max_delta_ticks;
const char *name;
int rating; /* grade of the device */
int irq;
int bound_on;
const struct cpumask *cpumask;
struct list_head list;
struct module *owner;
} ____cacheline_aligned;
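Measuring time
Two terms: HZ and jiffies. HZ is the frequency of the timer interrupt, and jiffies is incremented by 1 on every such interrupt.
HZ
A compile-time constant set by the CONFIG_HZ build option; many modules use it as their time base.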
#ifndef __ASM_GENERIC_PARAM_H
#define __ASM_GENERIC_PARAM_H
#include <uapi/asm-generic/param.h>
# undef HZ
# define HZ CONFIG_HZ /* Internal kernel timer frequency */
# define USER_HZ 100 /* some user interfaces are */
# define CLOCKS_PER_SEC (USER_HZ) /* in "ticks" like times() */
#endif /* __ASM_GENERIC_PARAM_H */
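Since HZ is the tick frequency and jiffies counts ticks, both the length of one tick and the time span covered by a jiffies delta follow directly from HZ. A minimal sketch (the helper names are made up for this example and are not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Length of one tick in nanoseconds for a given HZ. */
uint64_t demo_tick_period_ns(unsigned int hz)
{
    assert(hz > 0);
    return DEMO_NSEC_PER_SEC / hz;
}

/* Milliseconds covered by a jiffies delta, rounded down. */
uint64_t demo_jiffies_to_ms(uint64_t jiffies_delta, unsigned int hz)
{
    assert(hz > 0);
    return jiffies_delta * 1000ULL / hz;
}
```

With HZ=100 one tick lasts 10 ms, with HZ=250 it lasts 4 ms, and 250 jiffies at HZ=250 add up to one second.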
jiffies
A global variable, much like SysTick on a microcontroller
linux/kernel/time/jiffies.c
EXPORT_SYMBOL(jiffies);
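To read it, use jiffies directly or call get_jiffies_64(). A 32-bit machine has no atomic instruction for reading 64 bits, so the 64-bit value has to be read through the function, which takes the lock for you.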
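x86 clock hardware
- RTC (Real-Time Clock)
- PIT (Programmable Interval Timer)
- TSC (Time Stamp Counter): programs can read its value with the RDTSC instruction
- HPET (High Precision Event Timer)
- APIC (Advanced Programmable Interrupt Controller): every CPU has a local one
Two of these, TSC and HPET, already showed up earlier (acpi_pm in the list is the ACPI power-management timer, not the APIC timer):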
fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
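The RTC is familiar from microcontrollers; a clock device that can raise interrupts is nothing special.
The machine picks the TSC as its clocksource by default, but the TSC cannot serve as the watchdog; on the contrary, it is monitored by the watchdog.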
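Computing time
- REALTIME (WALL TIME)
Maintained by the kernel; also called wall time, xtime, or system time. At boot the kernel reads the RTC to initialize REALTIME, and the two are independent afterwards. settimeofday changes only REALTIME, not the RTC (the change is not saved and is lost on reboot).
- MONOTONIC
Non-suspended time from boot until now. It starts at 0 and increases monotonically, and stops advancing while the system is suspended.
- BOOTTIME
Time from boot until now. It starts at 0 and increases monotonically, and keeps advancing during suspend.
- RAW MONOTONIC
Not affected by NTP.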
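From user space these clocks are available through clock_gettime(), using the clock IDs CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME and CLOCK_MONOTONIC_RAW (the last two are Linux-specific). A small sketch; the helper name demo_read_clock_ns is made up for this example:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes the Linux-specific clock IDs in <time.h> */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read one of the kernel clocks in nanoseconds; returns -1 on failure. */
int64_t demo_read_clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Reading CLOCK_MONOTONIC twice never goes backwards, whereas two reads of CLOCK_REALTIME can go backwards if the wall clock is set in between.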
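The timer interrupt
The book's account of the timer interrupt is simple: when it fires, the event_handler pointer of struct clock_event_device is eventually called, the CPU time used by the current process is accounted, the TIF_NEED_RESCHED flag is set if a reschedule is needed, and on interrupt exit that flag triggers process scheduling.
In other words, a process that has held the CPU too long and used up its time slice is forcibly switched out by the kernel, just like task switching in the SysTick + PendSV interrupts of an RTOS.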
Taking the I.MX6ULL kernel as an example:
// kernel/time/tick-common.c
/*
* Event handler for periodic ticks
*/
void tick_handle_periodic(struct clock_event_device *dev)
{
int cpu = smp_processor_id();
ktime_t next = dev->next_event;
// ====================== the core ======================
// All the time-related work happens here:
// - update jiffies
// - raise the TIMER_SOFTIRQ softirq
// - scheduler tick
// =======================================================
tick_periodic(cpu);
// If the clock event device is not in ONESHOT state, return right away (foreshadowing)
if (dev->state != CLOCK_EVT_STATE_ONESHOT)
return;
// Only a ONESHOT device continues here
for (;;) {
next = ktime_add(next, tick_period);
if (!clockevents_program_event(dev, next, false))
return;
if (timekeeping_valid_for_hres())
tick_periodic(cpu);
}
}
核心是tick_periodic(cpu):
/*
* Periodic tick: core handler of the periodic timer interrupt
*/
static void tick_periodic(int cpu)
{
// ==========================================
// 1. Always true on the single-core I.MX6ULL:
// only the designated CPU (CPU0 by default) updates global time
// ==========================================
if (tick_do_timer_cpu == cpu) {
// Writer side of the jiffies seqlock; even on a single core the sequence count lets an interrupted reader notice the update and retry
write_seqlock(&jiffies_lock);
// Record when the next tick period starts
tick_next_period = ktime_add(tick_next_period, tick_period);
// ==========================================
// The heart of it: the jiffies update
// do_timer(1) does jiffies_64 += 1
// jiffies aliases the low 32 bits of jiffies_64
// ==========================================
do_timer(1);
write_sequnlock(&jiffies_lock);
// Update wall time (the time the date command shows)
update_wall_time();
}
// ==========================================
// 2. Runs on every CPU:
// account process time and raise the timer softirq
// ==========================================
update_process_times(user_mode(get_irq_regs()));
// Used by the profiling tools; unrelated to time and timers
profile_tick(CPU_PROFILING);
}
/*
* Must hold jiffies_lock
*/
void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
calc_global_load(ticks);
}
/*
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times(int user_tick)
{
struct task_struct *p = current;
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
run_local_timers(); /* this also raises TIMER_SOFTIRQ */
rcu_check_callbacks(user_tick);
#ifdef CONFIG_IRQ_WORK
if (in_irq())
irq_work_tick();
#endif
scheduler_tick();
run_posix_cpu_timers(p);
}
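This matches the book, with one small detail the book leaves out: run_local_timers raises TIMER_SOFTIRQ, which means that once the interrupt exits, the kernel timer list built on TIMER_SOFTIRQ gets processed as well:
kernel/time/timer.c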
/*
* Called by the local, per-CPU timer interrupt on SMP.
*/
void run_local_timers(void)
{
hrtimer_run_queues();
raise_softirq(TIMER_SOFTIRQ);
}
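Is that all?
Up to here everything fit together nicely: the hardware timer is configured in periodic mode (PERIODIC), fires at HZ, enters tick_handle_periodic, and then jiffies is incremented, process time is checked, a process whose time slice ran out is switched away, TIMER_SOFTIRQ is raised... all very smooth. Then I came across hrtimer, and the 正点原子 I.MX6ULL kernel does have it enabled, so there was nothing for it but to keep reading.
Introducing hrtimer
hrtimer is short for High Resolution Timer. It requires the clock event device to work in oneshot mode, which conflicts with the periodic tick interrupt described above, because the comment on tick_handle_periodic is unambiguous:
/* Event handler for periodic ticks */
The high-resolution timer does not work in PERIODIC mode at all. That was frustrating: the board has a single timer, and I had assumed all along that it ran in periodic mode and gave the system its periodic interrupt, yet the high-resolution timer is evidently active too:
On the board: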
root@ATK-IMX6U:/home/tmodel# cat /proc/timer_list
Timer List Version: v0.7
HRTIMER_MAX_CLOCK_BASES: 4
now at 513407679694534 nsecs
cpu: 0
clock 0:
.base: 97b913f8
.index: 0
.resolution: 1 nsecs
.get_time: ktime_get
.offset: 0 nsecs
active timers:
#0: <97b91650>, tick_sched_timer, S:01
# expires at 513407680000000-513407680000000 nsecs [in 305466 to 305466 nsecs]
#1: def_rt_bandwidth, sched_rt_period_timer, S:01
# expires at 513408000000000-513408000000000 nsecs [in 320305466 to 320305466 nsecs]
#2: <94abfb78>, hrtimer_wakeup, S:01
# expires at 513408103155868-513408104145867 nsecs [in 423461334 to 424451333 nsecs]
#3: <947d5f40>, hrtimer_wakeup, S:01
# expires at 513408223075201-513408223075201 nsecs [in 543380667 to 543380667 nsecs]
#4: <94d01b78>, hrtimer_wakeup, S:01
# expires at 513411658126531-513411688126527 nsecs [in 3978431997 to 4008431993 nsecs]
#5: <94065ae0>, hrtimer_wakeup, S:01
# expires at 513412199754534-513412204754529 nsecs [in 4520060000 to 4525059995 nsecs]
#6: <94ad7ae0>, hrtimer_wakeup, S:01
# expires at 513412742455534-513412752455525 nsecs [in 5062761000 to 5072760991 nsecs]
#7: <947db018>, it_real_fn, S:01
# expires at 513434065199534-513434065199534 nsecs [in 26385505000 to 26385505000 nsecs]
#8: <946f1f40>, hrtimer_wakeup, S:01
# expires at 513440093633531-513440093683531 nsecs [in 32413938997 to 32413988997 nsecs]
#9: <94925b78>, hrtimer_wakeup, S:01
# expires at 513463000534868-513463056509866 nsecs [in 55320840334 to 55376815332 nsecs]
#10: sched_clock_timer, sched_clock_poll, S:01
# expires at 513964419879838-513964419879838 nsecs [in 556740185304 to 556740185304 nsecs]
#11: <94adbf40>, hrtimer_wakeup, S:01
# expires at 514811460020273-514811460070273 nsecs [in 1403780325739 to 1403780375739 nsecs]
clock 1:
.base: 97b91430
.index: 1
.resolution: 1 nsecs
.get_time: ktime_get_real
.offset: 1733880881385632000 nsecs
active timers:
clock 2:
.base: 97b91468
.index: 2
.resolution: 1 nsecs
.get_time: ktime_get_boottime
.offset: 0 nsecs
active timers:
clock 3:
.base: 97b914a0
.index: 3
.resolution: 1 nsecs
.get_time: ktime_get_clocktai
.offset: 1733880881385632000 nsecs
active timers:
.expires_next : 513407680000000 nsecs
.hres_active : 1
.nr_events : 14674035
.nr_retries : 443
.nr_hangs : 0
.max_hang_time : 0 nsecs
.nohz_mode : 2
.last_tick : 513407657000000 nsecs
.tick_stopped : 0
.idle_jiffies : 513107657
.idle_calls : 16072536
.idle_sleeps : 10951328
.idle_entrytime : 513407658005868 nsecs
.idle_waketime : 513407658005868 nsecs
.idle_exittime : 513407658084201 nsecs
.idle_sleeptime : 509908760813518 nsecs
.iowait_sleeptime: 771037983 nsecs
.last_jiffies : 513107656
.next_jiffies : 513107658
.idle_expires : 513407658000000 nsecs
jiffies: 513107679
Tick Device: mode: 1
Broadcast device
Clock Event Device: <NULL>
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000
Tick Device: mode: 1
Per CPU device: 0
Clock Event Device: mxc_timer1
max_delta_ns: 1431655752223
min_delta_ns: 85000
mult: 6442451
shift: 31
mode: 3
next_event: 513407680000000 nsecs
set_next_event: v2_set_next_event
set_mode: mxc_set_mode
event_handler: hrtimer_interrupt
retries: 0
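Strange: if it does not run in periodic mode, how does it work? The name is the clue. Being high resolution, it must resolve time more finely than the periodic tick. With HZ=100 the tick resolution is 10 ms, with HZ=250 it is 4 ms, and with HZ=1000 it is 1 ms. Even then there are timing needs of 0.5 ms or 0.1 ms, and HZ cannot be raised without limit: however high HZ is, a still higher HZ would give still finer resolution.
So it is no surprise that the high-resolution timer works in oneshot mode. Whatever the nearest deadline the kernel needs, it simply tells the timer hardware: if an interrupt is needed 10 ms from now, program the hardware for 10 ms; if the next one is needed 1 ms from now, program it for 1 ms. Neat. All the kernel has to do is maintain the data structure properly and not miscompute the next expiry.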
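The oneshot idea (find the nearest deadline, convert it to device cycles, program the hardware) can be sketched as below. The struct and function names are hypothetical and this is a simplified model, not the kernel's code; the numbers in the note are the mxc_timer1 values from the /proc/timer_list dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a clock event device. */
struct demo_clockevent {
    uint32_t mult;          /* nanoseconds -> cycles multiplier */
    uint32_t shift;         /* nanoseconds -> cycles shift */
    uint64_t min_delta_ns;  /* shortest interval the hardware accepts */
    uint64_t max_delta_ns;  /* longest interval the hardware accepts */
};

/* Earliest expiry among n pending timers (n must be > 0). */
uint64_t demo_next_expiry(const uint64_t *expires_ns, size_t n)
{
    assert(expires_ns != NULL && n > 0);
    uint64_t next = expires_ns[0];

    for (size_t i = 1; i < n; i++)
        if (expires_ns[i] < next)
            next = expires_ns[i];
    return next;
}

/* Cycles to program so the device fires at expires_ns, given the current time. */
uint64_t demo_program_cycles(const struct demo_clockevent *dev,
                             uint64_t now_ns, uint64_t expires_ns)
{
    assert(dev != NULL);
    uint64_t delta = expires_ns > now_ns ? expires_ns - now_ns : 0;

    if (delta < dev->min_delta_ns)
        delta = dev->min_delta_ns;   /* too close: use the hardware minimum */
    if (delta > dev->max_delta_ns)
        delta = dev->max_delta_ns;   /* too far: use the hardware maximum */
    return (delta * dev->mult) >> dev->shift;
}
```

With mult = 6442451 and shift = 31 (about 3 cycles per microsecond), a deadline 1 ms away becomes 3000 cycles, and a deadline only 10 us away is clamped to min_delta_ns = 85000 ns, which is 255 cycles.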
The callback actually installed in struct clock_event_device is therefore hrtimer_interrupt:
kernel/time/hrtimer.c
/*
* High resolution timer interrupt
* Called with interrupts disabled
*/
void hrtimer_interrupt(struct clock_event_device *dev)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
ktime_t expires_next, now, entry_time, delta;
int i, retries = 0;
BUG_ON(!cpu_base->hres_active);
cpu_base->nr_events++;
dev->next_event.tv64 = KTIME_MAX;
raw_spin_lock(&cpu_base->lock);
entry_time = now = hrtimer_update_base(cpu_base);
retry:
cpu_base->in_hrtirq = 1;
/*
* We set expires_next to KTIME_MAX here with cpu_base->lock
* held to prevent that a timer is enqueued in our queue via
* the migration code. This does not affect enqueueing of
* timers which run their callback and need to be requeued on
* this CPU.
*/
cpu_base->expires_next.tv64 = KTIME_MAX;
for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
struct hrtimer_clock_base *base;
struct timerqueue_node *node;
ktime_t basenow;
if (!(cpu_base->active_bases & (1 << i)))
continue;
base = cpu_base->clock_base + i;
basenow = ktime_add(now, base->offset);
while ((node = timerqueue_getnext(&base->active))) {
struct hrtimer *timer;
timer = container_of(node, struct hrtimer, node);
/*
* The immediate goal for using the softexpires is
* minimizing wakeups, not running timers at the
* earliest interrupt after their soft expiration.
* This allows us to avoid using a Priority Search
* Tree, which can answer a stabbing querry for
* overlapping intervals and instead use the simple
* BST we already have.
* We don't add extra wakeups by delaying timers that
* are right-of a not yet expired timer, because that
* timer will have to trigger a wakeup anyway.
*/
if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer))
break;
__run_hrtimer(timer, &basenow);
}
}
/* Reevaluate the clock bases for the next expiry */
expires_next = __hrtimer_get_next_event(cpu_base);
/*
* Store the new expiry value so the migration code can verify
* against it.
*/
cpu_base->expires_next = expires_next;
cpu_base->in_hrtirq = 0;
raw_spin_unlock(&cpu_base->lock);
/* Reprogramming necessary ? */
if (expires_next.tv64 == KTIME_MAX ||
!tick_program_event(expires_next, 0)) {
cpu_base->hang_detected = 0;
return;
}
/*
* The next timer was already expired due to:
* - tracing
* - long lasting callbacks
* - being scheduled away when running in a VM
*
* We need to prevent that we loop forever in the hrtimer
* interrupt routine. We give it 3 attempts to avoid
* overreacting on some spurious event.
*
* Acquire base lock for updating the offsets and retrieving
* the current time.
*/
raw_spin_lock(&cpu_base->lock);
now = hrtimer_update_base(cpu_base);
cpu_base->nr_retries++;
if (++retries < 3)
goto retry;
/*
* Give the system a chance to do something else than looping
* here. We stored the entry time, so we know exactly how long
* we spent here. We schedule the next event this amount of
* time away.
*/
cpu_base->nr_hangs++;
cpu_base->hang_detected = 1;
raw_spin_unlock(&cpu_base->lock);
delta = ktime_sub(now, entry_time);
if (delta.tv64 > cpu_base->max_hang_time.tv64)
cpu_base->max_hang_time = delta;
/*
* Limit it to a sensible value as we enforce a longer
* delay. Give the CPU at least 100ms to catch up.
*/
if (delta.tv64 > 100 * NSEC_PER_MSEC)
expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
else
expires_next = ktime_add(now, delta);
tick_program_event(expires_next, 1);
printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
ktime_to_ns(delta));
}
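So is the old tick_periodic function abandoned? Not quite. The hrtimer_interrupt callback exists to implement hrtimers, so why not let an hrtimer do the work the periodic tick_periodic used to do? Say HZ=250, so the period is 4 ms: register a callback that does the job of tick_periodic as an hrtimer with a 4 ms period, and the logic is the same as before. jiffies still advances at HZ, process time slices are still checked at HZ, and scheduling is still triggered at HZ.
That does seem to be how it works. Each CPU has a variable of type struct tick_sched, whose first member is struct hrtimer sched_timer: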
kernel/time/tick-sched.h
/**
* struct tick_sched - sched tick emulation and no idle tick control/stats
* @sched_timer: hrtimer to schedule the periodic tick in high
* resolution mode
* @last_tick: Store the last tick expiry time when the tick
* timer is modified for nohz sleeps. This is necessary
* to resume the tick timer operation in the timeline
* when the CPU returns from nohz sleep.
* @tick_stopped: Indicator that the idle tick has been stopped
* @idle_jiffies: jiffies at the entry to idle for idle time accounting
* @idle_calls: Total number of idle calls
* @idle_sleeps: Number of idle calls, where the sched tick was stopped
* @idle_entrytime: Time when the idle call was entered
* @idle_waketime: Time when the idle was interrupted
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
* @sleep_length: Duration of the current idle sleep
* @do_timer_lst: CPU was the last one doing do_timer before going idle
*/
struct tick_sched {
struct hrtimer sched_timer;
unsigned long check_clocks;
enum tick_nohz_mode nohz_mode;
ktime_t last_tick;
int inidle;
int tick_stopped;
unsigned long idle_jiffies;
unsigned long idle_calls;
unsigned long idle_sleeps;
int idle_active;
ktime_t idle_entrytime;
ktime_t idle_waketime;
ktime_t idle_exittime;
ktime_t idle_sleeptime;
ktime_t iowait_sleeptime;
ktime_t sleep_length;
unsigned long last_jiffies;
unsigned long next_jiffies;
ktime_t idle_expires;
int do_timer_last;
};
The callback of hrtimer sched_timer is as follows:
kernel/time/tick-sched.c
/*
* High-resolution mode: hrtimer callback that emulates the classic periodic tick
* Role: takes over from tick_handle_periodic() and does the system's core periodic work
* Context: hard interrupt context (same as the classic timer interrupt)
*/
static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
// Recover the enclosing tick_sched from the hrtimer object
struct tick_sched *ts =
container_of(timer, struct tick_sched, sched_timer);
// Saved CPU registers (used to tell user mode from kernel mode)
struct pt_regs *regs = get_irq_regs();
// Current timestamp in nanoseconds, read from the clocksource (mxc_timer1)
ktime_t now = ktime_get();
// ==============================================
// Core 1: update the global jiffies
// Equivalent to do_timer(1) in periodic mode; jiffies keeps advancing at HZ
// and stays in step with the system clock
// ==============================================
tick_sched_do_timer(now);
/*
* Run the rest only in a valid interrupt context with valid registers,
* to avoid acting in an unexpected context
*/
if (regs)
// ==============================================
// Core 2: account process time and raise the timer softirq
// Equivalent to update_process_times() in periodic mode:
// 1. charge the tick to the current process as user or system time
// 2. raise TIMER_SOFTIRQ so ordinary timer_list timers get processed
// 3. run the scheduler's tick logic
// ==============================================
tick_sched_handle(ts, regs);
// If the CPU is idle and its periodic tick has been stopped, do not re-arm the timer
if (unlikely(ts->tick_stopped))
return HRTIMER_NORESTART;
// ==============================================
// Core 3: re-arm at a fixed period
// tick_period is the period that corresponds to HZ (HZ=100 -> 10 ms),
// so the timer keeps firing at the system HZ
// ==============================================
hrtimer_forward(timer, now, tick_period);
// Restart the hrtimer to keep emulating the classic periodic tick
return HRTIMER_RESTART;
}
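Now it all adds up: everything said earlier about HZ and jiffies still holds, only it is implemented on top of hrtimer. hrtimer is also closely tied to scheduling, so this is a thread left hanging; I expect to meet it again when I get to process scheduling.
Summary
Kernel data structures related to time
- struct clocksource
Describes a hardware clocksource; able to read the cycle count from it
- struct timekeeper
As the name says, the keeper of time: the computation from clock cycles to CLOCK_REALTIME and CLOCK_MONOTONIC all lives here. It maintains two tk_read_base timelines, tkr_mono and tkr_raw. tkr_mono is affected by NTP (its rate is adjusted, it does not jump), while tkr_raw is the raw value.
- struct clock_event_device
A timer device that can raise interrupts (events); it holds the interrupt handler and the methods that set the device mode (periodic or oneshot). hrtimer is built on the interrupt handler in here.
Linux kernel timers
Two timers have appeared so far: the timer list and hrtimer
Strictly speaking both are software timers, since neither works without software bookkeeping, but hrtimer is the "harder" of the two because it is handled in interrupt context.
- timer list:
A simple timer built on TIMER_SOFTIRQ. Its precision is bound by jiffies and HZ: whether a timer has expired is decided by comparing against jiffies, so its resolution can never be finer than one jiffy.
- hrtimer:
The high-resolution timer, built on the callback of struct clock_event_device. It requires a hardware timer that supports interrupts and oneshot (single-shot) mode. From what others have written, the timers are kept in a tree; my guess is that the hrtimer with the earliest expiry is kept at the top of the tree so it can be fetched quickly, and fetching it gives the next timeout to program. I have seen this approach before, also at the application layer, where the timeout of epoll_wait() stands in for the clock and the library maintains such a tree of timers itself (libev, ev_timer). I expect the principle is much the same.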
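Updating time
With hrtimer, a handler is registered on sched_timer with the period given by HZ, and it updates jiffies, checks process time slices, and triggers scheduling and TIMER_SOFTIRQ. Without hrtimer, all of this sits in the periodic callback tick_handle_periodic, which is simpler still. sched_timer is also closely tied to scheduling, so hrtimer will probably come up again in the notes on process scheduling.
Other details
Part of this chapter came out of "discussions" with an AI, especially the hrtimer part at the end, which I pieced together into a consistent story without carefully tracing the source, so some details may be off; corrections are welcome. Most of it should be correct, though, and should be easy for beginners to follow.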