Time in Linux


用不完的肝 · Published 2026-04-22 18:02:31

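Preface

These are my notes from studying Chapter 4 of 《图解Linux内核》 (Illustrated Linux Kernel).

The focus here is how Linux keeps time, not timers (the timer list, which the kernel implements on top of a softirq; its finest resolution, the jiffy, is discussed in this chapter).

Parts of this were worked out by "discussing" with an AI and may contain errors. Corrections are welcome.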

Structures related to timekeeping

timekeeper and tk_read_base

include/linux/timekeeper_internal.h
/**
 * struct tk_read_base - base structure for timekeeping readout
 * @clock:	Current clocksource used for timekeeping.
 * @read:	Read function of @clock
 * @mask:	Bitmask for two's complement subtraction of non 64bit clocks
 * @cycle_last: @clock cycle value at last update
 * @mult:	(NTP adjusted) multiplier for scaled math conversion
 * @shift:	Shift value for scaled math conversion
 * @xtime_nsec: Shifted (fractional) nano seconds offset for readout
 * @base:	ktime_t (nanoseconds) base time for readout
 *		(note: ktime_t is essentially a signed 64-bit integer)
 *
 * This struct has size 56 byte on 64 bit. Together with a seqcount it
 * occupies a single 64byte cache line.
 *
 * The struct is separate from struct timekeeper as it is also used
 * for a fast NMI safe accessors.
 */
struct tk_read_base {
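	struct clocksource	*clock;		// the selected clocksource
	cycle_t			(*read)(struct clocksource *cs);	// reads the cycle counter of @clock
	cycle_t			mask;		// masks the cycle delta so that counters narrower than 64 bits wrap correctly
	cycle_t			cycle_last;	// cycle value at the last update
	u32			mult;		// cycles -> ns conversion: multiplier
	u32			shift;		// cycles -> ns conversion: right shift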
	u64					xtime_nsec;
	ktime_t			base;
};

/**
 * struct timekeeper - Structure holding internal timekeeping values.
 * @tkr_mono:		The readout base structure for CLOCK_MONOTONIC
 * @tkr_raw:		The readout base structure for CLOCK_MONOTONIC_RAW
 * @xtime_sec:		Current CLOCK_REALTIME time in seconds
 * @ktime_sec:		Current CLOCK_MONOTONIC time in seconds
 * @wall_to_monotonic:	CLOCK_REALTIME to CLOCK_MONOTONIC offset
 * @offs_real:		Offset clock monotonic -> clock realtime
 * @offs_boot:		Offset clock monotonic -> clock boottime
 * @offs_tai:		Offset clock monotonic -> clock tai
 * @tai_offset:		The current UTC to TAI offset in seconds
 * @raw_time:		Monotonic raw base time in timespec64 format
 * @cycle_interval:	Number of clock cycles in one NTP interval
 * @xtime_interval:	Number of clock shifted nano seconds in one NTP
 *			interval.
 * @xtime_remainder:	Shifted nano seconds left over when rounding
 *			@cycle_interval
 * @raw_interval:	Raw nano seconds accumulated per NTP interval.
 * @ntp_error:		Difference between accumulated time and NTP time in ntp
 *			shifted nano seconds.
 * @ntp_error_shift:	Shift conversion between clock shifted nano seconds and
 *			ntp shifted nano seconds.
 *
 * Note: For timespec(64) based interfaces wall_to_monotonic is what
 * we need to add to xtime (or xtime corrected for sub jiffie times)
 * to get to monotonic time.  Monotonic is pegged at zero at system
 * boot time, so wall_to_monotonic will be negative, however, we will
 * ALWAYS keep the tv_nsec part positive so we can use the usual
 * normalization.
 *
 * wall_to_monotonic is moved after resume from suspend for the
 * monotonic time not to jump. We need to add total_sleep_time to
 * wall_to_monotonic to get the real boot based time offset.
 *
 * wall_to_monotonic is no longer the boot time, getboottime must be
 * used instead.
 */
Side note: timespec64 is simply a struct holding seconds and nanoseconds
struct timespec64 {
	time64_t	tv_sec;			/* seconds */
	long		tv_nsec;		/* nanoseconds */
};

struct timekeeper {
	struct tk_read_base	tkr_mono;		// readout base for CLOCK_MONOTONIC
	struct tk_read_base	tkr_raw;		// readout base for CLOCK_MONOTONIC_RAW
	u64			xtime_sec;		// current CLOCK_REALTIME time, in seconds
	unsigned long		ktime_sec;		// current CLOCK_MONOTONIC time, in seconds
	struct timespec64	wall_to_monotonic;	// offset from CLOCK_REALTIME to CLOCK_MONOTONIC
	ktime_t			offs_real;		// offset: monotonic -> realtime
	ktime_t			offs_boot;		// offset: monotonic -> boottime
	ktime_t			offs_tai;		// offset: monotonic -> TAI (International Atomic Time)
	s32			tai_offset;		// current UTC to TAI offset, in seconds
	struct timespec64	raw_time;		// monotonic raw base time, in timespec64 format

	/* The following members are for timekeeping internal use */
	cycle_t			cycle_interval;		// clock cycles in one NTP interval
	u64			xtime_interval;		// shifted nanoseconds in one NTP interval
	s64			xtime_remainder;	// shifted nanoseconds left over when rounding @cycle_interval
	u32			raw_interval;		// raw nanoseconds accumulated per NTP interval
	/* The ntp_tick_length() value currently being used.
	 * This cached copy ensures we consistently apply the tick
	 * length for an entire tick, as ntp_tick_length may change
	 * mid-tick, and we don't want to apply that new value to
	 * the tick in progress.
	 */
	u64			ntp_tick;
	/* Difference between accumulated time and NTP time in ntp
	 * shifted nano seconds. */
	s64			ntp_error;
	u32			ntp_error_shift;
	u32			ntp_err_mult;
};

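There are many members, but they are easy to follow. struct timekeeper maintains the time and holds two "timelines": tkr_mono, which is NTP-adjusted, and tkr_raw, which is not. Beyond that, it records the offsets between the various clocks and the NTP correction parameters, all of which timekeeping cannot do without.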

The clocksource structure

This part matches the book, so here is the source:

include/linux/clocksource.h
/**
 * struct clocksource - hardware abstraction for a free running counter
 *	Provides mostly state-free accessors to the underlying hardware.
 *	This is the structure used for system time.
 *
 * @name:		ptr to clocksource name
 * @list:		list head for registration
 * @rating:		rating value for selection (higher is better)
 *			To avoid rating inflation the following
 *			list should give you a guide as to how
 *			to assign your clocksource a rating
 *			1-99: Unfit for real use
 *				Only available for bootup and testing purposes.
 *			100-199: Base level usability.
 *				Functional for real use, but not desired.
 *			200-299: Good.
 *				A correct and usable clocksource.
 *			300-399: Desired.
 *				A reasonably fast and accurate clocksource.
 *			400-499: Perfect
 *				The ideal clocksource. A must-use where
 *				available.
 * @read:		returns a cycle value, passes clocksource as argument
 * @enable:		optional function to enable the clocksource
 * @disable:		optional function to disable the clocksource
 * @mask:		bitmask for two's complement
 *			subtraction of non 64 bit counters
 * @mult:		cycle to nanosecond multiplier
 * @shift:		cycle to nanosecond divisor (power of two)
 * @max_idle_ns:	max idle time permitted by the clocksource (nsecs)
 * @maxadj:		maximum adjustment value to mult (~11%)
 * @max_cycles:		maximum safe cycle value which won't overflow on multiplication
 * @flags:		flags describing special properties
 * @archdata:		arch-specific data
 * @suspend:		suspend function for the clocksource, if necessary
 * @resume:		resume function for the clocksource, if necessary
 * @owner:		module reference, must be set by clocksource in modules
 */
struct clocksource {
	/*
	 * Hotpath data, fits in a single cache line when the
	 * clocksource itself is cacheline aligned.
	 */
	cycle_t (*read)(struct clocksource *cs);
	cycle_t mask;
	u32 mult;
	u32 shift;
	u64 max_idle_ns;
	u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64 max_cycles;
	const char *name;
	struct list_head list;
	int rating;
	int (*enable)(struct clocksource *cs);
	void (*disable)(struct clocksource *cs);
	unsigned long flags;
	void (*suspend)(struct clocksource *cs);
	void (*resume)(struct clocksource *cs);

	/* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
	/* Watchdog related data, used by the framework */
	struct list_head wd_list;
	cycle_t cs_last;
	cycle_t wd_last;
#endif
	struct module *owner;
} ____cacheline_aligned;

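Fields worth noting:

Field	Type	Description
flags	unsigned long	Flags describing the clocksource
mult and shift	u32	Same meaning as the identically named fields in timekeeper's tk_read_base: they convert clock cycles to nanoseconds
read	callback	Reads the current cycle count of the clocksource
rating	int	Quality rating of the clocksource; higher is better

How does the read method in tk_read_base differ from the one in clocksource?
They are identical: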

kernel/time/timekeeping.c
static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock) {
    // ... other initialization ...
    tk->tkr_mono.clock = clock;  // bind the clocksource
    tk->tkr_mono.read  = clock->read;  // the read pointer is copied directly
    tk->tkr_raw.clock  = clock;  // bind the same clocksource
    tk->tkr_raw.read   = clock->read;  // copied directly here as well
    // ... other initialization ...
}

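Getting the time

From 《图解Linux内核》:
For the gettimeofday system call, the kernel ends up in ktime_get_real_ts64.
The listing below is simplified and only describes what each step does:

kernel/time/timekeeping.h and timekeeping.c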
kernel/time/timekeeping.h和timekeeping.c
void ktime_get_real_ts64(struct timespec64 *ts) {
	/* tk = the global timekeeper */

	/* seconds: CLOCK_REALTIME seconds as of the last update */
	ts->tv_sec = tk->xtime_sec;

	/* nanoseconds: read from the tkr_mono timeline, then folded into ts */
	nsecs = timekeeping_get_ns(&tk->tkr_mono);
}

u64 timekeeping_get_ns(const struct tk_read_base *tkr) {
	struct clocksource *clock = READ_ONCE(tkr->clock);	/* clocksource of this tkr */

	cycle_now = clock->read(clock);				/* read the cycle counter */
	cycle_delta = (cycle_now - tkr->cycle_last) & tkr->mask;	/* cycles since the last update */

	nsec = cycle_delta * tkr->mult + tkr->xtime_nsec;	/* scaled math: cycles to nanoseconds */
	nsec >>= tkr->shift;

	return nsec;
}
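
To make the masked subtraction and the mult/shift arithmetic concrete, here is a small user-space sketch. It is not kernel code: the helper names cycles_delta and cycles_to_ns are made up for illustration, and the 24 MHz counter mentioned below is an assumed example rate, not a value taken from any board in these notes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two reads of a counter that is
 * only `bits` wide. The mask throws away the borrow when the counter wraps. */
static uint64_t cycles_delta(uint64_t now, uint64_t last, unsigned bits)
{
	assert(bits >= 1 && bits <= 64);
	uint64_t mask = (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
	return (now - last) & mask;
}

/* Hypothetical helper: ns = (delta * mult + frac) >> shift.
 * `frac` plays the role of xtime_nsec: nanoseconds already shifted left by `shift`. */
static uint64_t cycles_to_ns(uint64_t delta, uint32_t mult, uint32_t shift,
			     uint64_t frac)
{
	return (delta * mult + frac) >> shift;
}
```

For an assumed 24 MHz counter with shift = 24, mult would be round(10^9 * 2^24 / 24000000) = 699050667, and 24000000 cycles then convert to 1000000000 ns. A 32-bit counter that wrapped from 0xFFFFFFF0 to 5 yields a delta of 21 cycles rather than a huge bogus value.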

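gettimeofday reads the tkr_mono timeline in timekeeper. "mono" stands for monotonic: the value only increases, never jumps and never goes backwards. tkr has a member called mult, which sets how fast clock cycles turn into nanoseconds, and NTP adjustment acts on it. Suppose NTP finds that the clock gained 10 seconds over a month: the kernel then lowers mult in tkr_mono so that nanoseconds accumulate a little more slowly from then on. tkr_raw keeps the unadjusted mult, and that is the difference between tkr_mono and tkr_raw. (Worked out with an AI.)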
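
The "gained 10 seconds in a month" case can be put into numbers. Taking a month as 30 days, 10 s out of 2592000 s is about 3858 parts per billion. The sketch below scales mult by that amount and shows that the same number of cycles then converts to fewer nanoseconds. It only illustrates the idea: slew_mult and elapsed_ns are hypothetical helpers, the 24 MHz rate is assumed, and the kernel's real adjustment logic is more involved and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale mult by (1 + ppb / 1e9).
 * A negative ppb makes the clock run slower, a positive one faster. */
static uint32_t slew_mult(uint32_t mult, int64_t ppb)
{
	int64_t adj = (int64_t)mult * ppb / 1000000000;
	int64_t out = (int64_t)mult + adj;

	assert(out > 0 && out <= UINT32_MAX);
	return (uint32_t)out;
}

/* Hypothetical helper: plain scaled-math conversion without a fractional base. */
static uint64_t elapsed_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With mult = 699050667 and shift = 24 (the assumed 24 MHz counter), one second's worth of cycles converts to exactly 1000000000 ns before the adjustment and to slightly less after slew_mult(mult, -3858). The time never steps; it just advances a little more slowly.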

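Also, both tkr_mono and tkr_raw are driven by clock cycles: as long as the oscillator is running, the cycle count keeps accumulating and never jumps. What can jump is CLOCK_REALTIME. When the wall clock is stepped (by settimeofday, or by an NTP daemon that steps instead of slewing), the kernel rewrites xtime_sec together with the sub-second part xtime_nsec, and shifts wall_to_monotonic by the opposite amount so that CLOCK_MONOTONIC stays continuous. A step therefore shows up as a jump in the value returned by gettimeofday, while the cycle-based delta added on top keeps advancing smoothly. (Worked out with an AI.)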

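Choosing a clocksource

The book says the rating field of clocksource ranks the clocksource: the higher the rating, the better its precision, and the kernel picks the highest-rated one for timekeeper. So how do we see which clocksources a system has?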

ls /sys/devices/system/clocksource
clocksource0  power  uevent

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
arch_sys_counter

So my STM32MP157 board has a single clocksource, arch_sys_counter, and the kernel has nothing to choose from. Let's look at a virtual machine:

fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc

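There are indeed several clocksources. How can their ratings be inspected? (Unresolved: the AI's answers were wrong, so I am leaving this for now. As far as I can tell, sysfs does not export the rating itself; it is set in each driver's struct clocksource definition, and available_clocksource appears to list the entries from highest to lowest rating.)
What is certain is that each struct clocksource corresponds to one hardware clocksource. On I.MX6ULL it is called mxc_timer1; on STM32MP157 it is arch_sys_counter.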
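
The same sysfs files can be read from a program. The following is a minimal sketch that assumes the sysfs paths used in the shell commands above; read_first_line and print_clocksources are helper names invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

#define CLOCKSOURCE_DIR "/sys/devices/system/clocksource/clocksource0/"

/* Print the active and the available clocksources.
 * Returns 0 if both files could be read, -1 otherwise. */
static int print_clocksources(void)
{
	char cur[128], avail[256];

	if (read_first_line(CLOCKSOURCE_DIR "current_clocksource", cur, sizeof(cur)) ||
	    read_first_line(CLOCKSOURCE_DIR "available_clocksource", avail, sizeof(avail)))
		return -1;
	printf("current:   %s\navailable: %s\n", cur, avail);
	return 0;
}
```

Calling print_clocksources() from main() on the virtual machine above should report tsc as the current clocksource; on a system without these sysfs files it simply returns -1.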

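The book says the kernel picks a continuous clocksource that does not itself need monitoring and uses it to watch the other clocksources. If a clocksource drifts beyond the acceptable range, it is flagged CLOCK_SOURCE_UNSTABLE and its rating is set to 0.

What is a continuous clocksource that does not need monitoring?
In terms of flags: CLOCK_SOURCE_MUST_VERIFY is not set, and CLOCK_SOURCE_IS_CONTINUOUS is set.
Among the clocksources that qualify, the highest-rated one normally becomes the watchdog.

On the board: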
root@lubancat:/home# dmesg | grep -i "clocksource: Switched to clocksource"
[    0.362019] clocksource: Switched to clocksource arch_sys_counter

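Structures related to the clock interrupt

The book calls a device that keeps time a clocksource, and a device that signals time events a clock interrupt device (clock event device).

struct clock_event_device

This structure describes a device that can raise interrupts. Its event_handler is where that device's clock interrupt is handled.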

include/linux/clockchips.h
/**
 * struct clock_event_device - clock event device descriptor
 * @event_handler:	Assigned by the framework to be called by the low
 *			level handler of the event source
 * @set_next_event:	set next event function using a clocksource delta
 * @set_next_ktime:	set next event function using a direct ktime value
 * @next_event:		local storage for the next event in oneshot mode
 * @max_delta_ns:	maximum delta value in ns
 * @min_delta_ns:	minimum delta value in ns
 * @mult:		nanosecond to cycles multiplier
 * @shift:		nanoseconds to cycles divisor (power of two)
 * @mode:		operating mode, relevant only to ->set_mode(), OBSOLETE
 * @state:		current state of the device, assigned by the core code
 * @features:		features
 * @retries:		number of forced programming retries
 * @set_mode:		legacy set mode function, only for modes <= CLOCK_EVT_MODE_RESUME.
 * @set_state_periodic:	switch state to periodic, if !set_mode
 * @set_state_oneshot:	switch state to oneshot, if !set_mode
 * @set_state_shutdown:	switch state to shutdown, if !set_mode
 * @tick_resume:	resume clkevt device, if !set_mode
 * @broadcast:		function to broadcast events
 * @min_delta_ticks:	minimum delta value in ticks stored for reconfiguration
 * @max_delta_ticks:	maximum delta value in ticks stored for reconfiguration
 * @name:		ptr to clock event name
 * @rating:		variable to rate clock event devices
 * @irq:		IRQ number (only for non CPU local devices)
 * @bound_on:		Bound on CPU
 * @cpumask:		cpumask to indicate for which CPUs this device works
 * @list:		list head for the management code
 * @owner:		module reference
 */
struct clock_event_device {
	void			(*event_handler)(struct clock_event_device *);	/* called back when the clock interrupt arrives */
	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);	/* programs the next clock interrupt */
	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t			next_event;
	u64			max_delta_ns;
	u64			min_delta_ns;
	u32			mult;	/* same scaled math as in timekeeper, but here mult and shift convert nanoseconds to cycles */
	u32			shift;
	enum clock_event_mode	mode;
	enum clock_event_state	state;
	unsigned int		features;	/* device capabilities, CLOCK_EVT_FEAT_XXX */
	unsigned long		retries;

	/*
	 * State transition callback(s): Only one of the two groups should be
	 * defined:
	 * - set_mode(), only for modes <= CLOCK_EVT_MODE_RESUME.
	 * - set_state_{shutdown|periodic|oneshot}(), tick_resume().
	 */
	void			(*set_mode)(enum clock_event_mode mode, struct clock_event_device *);
	int			(*set_state_periodic)(struct clock_event_device *);	/* state switch: called by the core when it changes the device mode; the implementation is driver-specific */
	int			(*set_state_oneshot)(struct clock_event_device *);	/* state switch */
	int			(*set_state_shutdown)(struct clock_event_device *);	/* state switch */
	int			(*tick_resume)(struct clock_event_device *);

	void			(*broadcast)(const struct cpumask *mask);
	void			(*suspend)(struct clock_event_device *);
	void			(*resume)(struct clock_event_device *);
	unsigned long		min_delta_ticks;
	unsigned long		max_delta_ticks;

	const char		*name;
	int			rating;	/* rating of the device */
	int			irq;
	int			bound_on;
	const struct cpumask	*cpumask;
	struct list_head	list;
	struct module		*owner;
} ____cacheline_aligned;

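Measuring time

Two terms: HZ and jiffies. HZ is the frequency of the timer interrupt, and jiffies is incremented by 1 on each tick.

HZ

HZ is fixed at build time through the CONFIG_HZ option and serves as the time base for many subsystems.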

#ifndef __ASM_GENERIC_PARAM_H
#define __ASM_GENERIC_PARAM_H

#include <uapi/asm-generic/param.h>

# undef HZ
# define HZ		CONFIG_HZ	/* Internal kernel timer frequency */
# define USER_HZ	100		/* some user interfaces are */
# define CLOCKS_PER_SEC	(USER_HZ)       /* in "ticks" like times() */
#endif /* __ASM_GENERIC_PARAM_H */

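jiffies

A global variable, similar to the SysTick counter on a microcontroller.

linux/kernel/time/jiffies.c

EXPORT_SYMBOL(jiffies);

To read it, use jiffies directly or call get_jiffies_64(). A 32-bit machine cannot read a 64-bit value atomically, so the full 64-bit count has to be read through the function, which takes care of the locking.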
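
Two things follow from jiffies being a plain tick counter: converting it to real time depends on HZ, and comparisons have to survive the counter wrapping around. The user-space sketch below illustrates both. The helper names are made up for this example; tick_after follows the idea of the kernel's time_after() macro (compare through the signed difference), and ms_to_ticks rounds up in the spirit of msecs_to_jiffies() without claiming to match its implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" for a 32-bit tick counter:
 * the comparison goes through the signed difference. */
static int tick_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Convert a tick count to milliseconds for a given tick rate. */
static uint64_t ticks_to_ms(uint64_t ticks, unsigned hz)
{
	assert(hz > 0);
	return ticks * 1000 / hz;
}

/* Convert milliseconds to ticks, rounding up so a timeout never fires early. */
static uint64_t ms_to_ticks(uint64_t ms, unsigned hz)
{
	assert(hz > 0);
	return (ms * hz + 999) / 1000;
}
```

With HZ=100, an 11 ms timeout becomes 2 ticks (20 ms), which is why a jiffies-based timer cannot be finer than one tick. tick_after(5, 0xFFFFFFF0) is true even though 5 is numerically smaller, because the counter has wrapped in between.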

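x86 clock hardware

RTC (Real-Time Clock)
PIT (Programmable Interval Timer)
TSC (Time Stamp Counter): programs can read it with the RDTSC instruction
HPET (High Precision Event Timer)
APIC (Advanced Programmable Interrupt Controller): each CPU has its own local APIC timer

TSC and HPET already appeared in the virtual machine's list (the third entry there, acpi_pm, is the ACPI power management timer, not the APIC):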

fylfly@ubuntu2204:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

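The RTC is familiar from microcontrollers; a clock device that can raise interrupts is nothing unusual.
The machine picks TSC as its clocksource by default, but TSC cannot serve as the watchdog; it is the one being monitored by the watchdog.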

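How time is computed

REALTIME (wall time)
The kernel-maintained wall-clock time, also called xtime or system time. At boot the kernel reads the RTC to initialize REALTIME; after that the two run independently. settimeofday changes only REALTIME, not the RTC (the change is not saved and is lost on reboot).
MONOTONIC
Time since boot excluding suspend. It starts at 0 and increases monotonically, and stops advancing while the system is suspended.
BOOTTIME
Time since boot. It starts at 0, increases monotonically, and keeps advancing during suspend.
MONOTONIC RAW
Not affected by NTP.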
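
From user space these four clocks can be read with clock_gettime(). The sketch below assumes Linux with glibc, where CLOCK_BOOTTIME and CLOCK_MONOTONIC_RAW are available; read_clock_ns and print_clocks are helper names chosen for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read one clock and return its value in nanoseconds, or -1 on error. */
static int64_t read_clock_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return -1;
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Print the four clocks discussed above, one per line, in nanoseconds. */
static void print_clocks(void)
{
	printf("CLOCK_REALTIME      %lld\n", (long long)read_clock_ns(CLOCK_REALTIME));
	printf("CLOCK_MONOTONIC     %lld\n", (long long)read_clock_ns(CLOCK_MONOTONIC));
	printf("CLOCK_BOOTTIME      %lld\n", (long long)read_clock_ns(CLOCK_BOOTTIME));
	printf("CLOCK_MONOTONIC_RAW %lld\n", (long long)read_clock_ns(CLOCK_MONOTONIC_RAW));
}
```

Calling print_clocks() before and after a suspend/resume cycle is one way to observe the difference described above: BOOTTIME should have advanced by the time spent suspended, while MONOTONIC should not.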

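The clock interrupt

The book's account of the clock interrupt is simple: when it fires, the event_handler pointer of struct clock_event_device is eventually called and the CPU time used by the current process is accounted. If a reschedule is needed, the TIF_NEED_RESCHED flag is set, and on interrupt exit that flag triggers the scheduler.

In other words, a process that has held the CPU too long and used up its time slice is forcibly switched out by the kernel, much like task switching in an RTOS from the SysTick and PendSV interrupts.

Taking the I.MX6ULL kernel as an example: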

// kernel/time/tick-common.c
/*
 * Event handler for periodic ticks
 */
void tick_handle_periodic(struct clock_event_device *dev)
{
	int cpu = smp_processor_id();
	ktime_t next = dev->next_event;

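	// ==========================================================
	// The core of the handler. All time-related work happens here:
	//    - update jiffies
	//    - raise the TIMER_SOFTIRQ softirq
	//    - run the scheduler tick
	// ==========================================================
	tick_periodic(cpu);

	// If the clock event device is not in ONESHOT state, return here (foreshadowing)
	if (dev->state != CLOCK_EVT_STATE_ONESHOT)
		return;

	// Only a ONESHOT device continues past this point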
	for (;;) {
		next = ktime_add(next, tick_period);
		if (!clockevents_program_event(dev, next, false))
			return;
		if (timekeeping_valid_for_hres())
			tick_periodic(cpu);
	}
}

The core is tick_periodic(cpu):

/*
 * Periodic tick: the core handler of the periodic clock interrupt
 */
static void tick_periodic(int cpu)
{
	// ==========================================
	// 1. Always true on the single-core I.MX6ULL
	// Only the CPU designated by tick_do_timer_cpu updates global time
	// ==========================================
	if (tick_do_timer_cpu == cpu) {
		// Seqlock protecting jiffies_64 against concurrent access
		write_seqlock(&jiffies_lock);

		// Record the time of the next tick
		tick_next_period = ktime_add(tick_next_period, tick_period);

		// ==========================================
		// The jiffies update itself
		// do_timer(1) does: jiffies_64 += 1
		// On 32-bit, jiffies aliases the low 32 bits of jiffies_64
		// ==========================================
		do_timer(1);

		write_sequnlock(&jiffies_lock);
		// Update the wall time (what the date command shows)
		update_wall_time();
	}

	// ==========================================
	// 2. Runs on every CPU
	// Charge the tick to the current process and raise the timer softirq
	// ==========================================
	update_process_times(user_mode(get_irq_regs()));

	// For profiling tools; unrelated to timekeeping and timers
	profile_tick(CPU_PROFILING);
}
/*
 * Must hold jiffies_lock
 */
void do_timer(unsigned long ticks)
{
	jiffies_64 += ticks;
	calc_global_load(ticks);
}

/*
 * Called from the timer interrupt handler to charge one tick to the current
 * process.  user_tick is 1 if the tick is user time, 0 for system.
 */
void update_process_times(int user_tick)
{
	struct task_struct *p = current;

	/* Note: this timer irq context must be accounted for as well. */
	account_process_tick(p, user_tick);
	run_local_timers();	/* this also raises TIMER_SOFTIRQ */
	rcu_check_callbacks(user_tick);
#ifdef CONFIG_IRQ_WORK
	if (in_irq())
		irq_work_tick();
#endif
	scheduler_tick();
	run_posix_cpu_timers(p);
}

This matches the book, with one detail the book leaves out: run_local_timers raises TIMER_SOFTIRQ, which means that once the interrupt exits, the kernel timer list built on TIMER_SOFTIRQ gets processed too:

kernel/time/timer.c
/*
 * Called by the local, per-CPU timer interrupt on SMP.
 */
void run_local_timers(void)
{
	hrtimer_run_queues();
	raise_softirq(TIMER_SOFTIRQ);
}

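Is that all?

Up to here everything fit together nicely: the hardware timer is configured in periodic mode (PERIODIC), fires a timer interrupt HZ times per second, and enters tick_handle_periodic, where jiffies is incremented, process time is checked, processes that have used up their time slice are switched out, and TIMER_SOFTIRQ is raised. Then I ran into hrtimer, and the 正点原子 (ALIENTEK) I.MX6ULL kernel does have it enabled, so there was nothing for it but to keep reading.

Introduction to hrtimer

The high resolution timer (hrtimer) requires the clock event device to work in oneshot mode. That conflicts with the periodic tick interrupt described above, because the comment on tick_handle_periodic is explicit:
/* Event handler for periodic ticks */
A high resolution timer does not run in PERIODIC mode at all. This was puzzling: the board has only one timer (mxc_timer1), and I had assumed it was running in periodic mode to give the system its periodic interrupt, yet the high resolution timer turns out to be active:

On the board: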
root@ATK-IMX6U:/home/tmodel# cat /proc/timer_list
Timer List Version: v0.7
HRTIMER_MAX_CLOCK_BASES: 4
now at 513407679694534 nsecs

cpu: 0
 clock 0:
  .base:       97b913f8
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <97b91650>, tick_sched_timer, S:01
 # expires at 513407680000000-513407680000000 nsecs [in 305466 to 305466 nsecs]
 #1: def_rt_bandwidth, sched_rt_period_timer, S:01
 # expires at 513408000000000-513408000000000 nsecs [in 320305466 to 320305466 nsecs]
 #2: <94abfb78>, hrtimer_wakeup, S:01
 # expires at 513408103155868-513408104145867 nsecs [in 423461334 to 424451333 nsecs]
 #3: <947d5f40>, hrtimer_wakeup, S:01
 # expires at 513408223075201-513408223075201 nsecs [in 543380667 to 543380667 nsecs]
 #4: <94d01b78>, hrtimer_wakeup, S:01
 # expires at 513411658126531-513411688126527 nsecs [in 3978431997 to 4008431993 nsecs]
 #5: <94065ae0>, hrtimer_wakeup, S:01
 # expires at 513412199754534-513412204754529 nsecs [in 4520060000 to 4525059995 nsecs]
 #6: <94ad7ae0>, hrtimer_wakeup, S:01
 # expires at 513412742455534-513412752455525 nsecs [in 5062761000 to 5072760991 nsecs]
 #7: <947db018>, it_real_fn, S:01
 # expires at 513434065199534-513434065199534 nsecs [in 26385505000 to 26385505000 nsecs]
 #8: <946f1f40>, hrtimer_wakeup, S:01
 # expires at 513440093633531-513440093683531 nsecs [in 32413938997 to 32413988997 nsecs]
 #9: <94925b78>, hrtimer_wakeup, S:01
 # expires at 513463000534868-513463056509866 nsecs [in 55320840334 to 55376815332 nsecs]
 #10: sched_clock_timer, sched_clock_poll, S:01
 # expires at 513964419879838-513964419879838 nsecs [in 556740185304 to 556740185304 nsecs]
 #11: <94adbf40>, hrtimer_wakeup, S:01
 # expires at 514811460020273-514811460070273 nsecs [in 1403780325739 to 1403780375739 nsecs]
 clock 1:
  .base:       97b91430
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1733880881385632000 nsecs
active timers:
 clock 2:
  .base:       97b91468
  .index:      2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset:     0 nsecs
active timers:
 clock 3:
  .base:       97b914a0
  .index:      3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset:     1733880881385632000 nsecs
active timers:
  .expires_next   : 513407680000000 nsecs
  .hres_active    : 1
  .nr_events      : 14674035
  .nr_retries     : 443
  .nr_hangs       : 0
  .max_hang_time  : 0 nsecs
  .nohz_mode      : 2
  .last_tick      : 513407657000000 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 513107657
  .idle_calls     : 16072536
  .idle_sleeps    : 10951328
  .idle_entrytime : 513407658005868 nsecs
  .idle_waketime  : 513407658005868 nsecs
  .idle_exittime  : 513407658084201 nsecs
  .idle_sleeptime : 509908760813518 nsecs
  .iowait_sleeptime: 771037983 nsecs
  .last_jiffies   : 513107656
  .next_jiffies   : 513107658
  .idle_expires   : 513407658000000 nsecs
jiffies: 513107679

Tick Device: mode:     1
Broadcast device
Clock Event Device: <NULL>
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000

Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: mxc_timer1
 max_delta_ns:   1431655752223
 min_delta_ns:   85000
 mult:           6442451
 shift:          31
 mode:           3
 next_event:     513407680000000 nsecs
 set_next_event: v2_set_next_event
 set_mode:       mxc_set_mode
 event_handler:  hrtimer_interrupt
 retries:        0

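Strange. If it is not running in periodic mode, how does it work? The name is the clue: being high resolution, it has to resolve time more finely than a PERIODIC tick can. With HZ=100 the tick resolution is 10 ms, with HZ=250 it is 4 ms, and with HZ=1000 it is 1 ms. Even then there are timing needs of 0.5 ms or 0.1 ms, and HZ cannot be raised without limit: however high HZ is, a higher one would still give finer resolution.

So it is no surprise that high resolution timers work in oneshot mode. The kernel simply tells the timer hardware when the nearest upcoming expiry is: if an interrupt is needed 10 ms from now, it programs a 10 ms timeout; if the next one is needed 1 ms later, it programs 1 ms. As long as the kernel maintains this data structure correctly and computes the next expiry correctly, that is all it takes.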
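
The oneshot idea can be modelled in a few lines of user-space C. This is a toy under stated assumptions, not the kernel's implementation: it uses a small array instead of a tree, and every name in it (toy_timer, toy_queue, toy_add, toy_next_event, toy_interrupt, NO_EVENT) is invented for the example. What it shows is the loop described above: run whatever has expired, then work out the earliest remaining expiry, which is the value that would be programmed into the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_EVENT   UINT64_MAX	/* nothing pending, nothing to program */
#define MAX_TIMERS 16

struct toy_timer {
	uint64_t expires_ns;	/* absolute expiry time */
	int      pending;	/* 1 while the timer is armed */
	int      fired;		/* how often the callback would have run */
};

struct toy_queue {
	struct toy_timer t[MAX_TIMERS];
	size_t n;
};

/* Arm a new timer; returns its index, or -1 if the queue is full. */
static int toy_add(struct toy_queue *q, uint64_t expires_ns)
{
	if (q->n >= MAX_TIMERS)
		return -1;
	q->t[q->n] = (struct toy_timer){ .expires_ns = expires_ns, .pending = 1 };
	return (int)q->n++;
}

/* Earliest pending expiry: the value that would go to the oneshot device. */
static uint64_t toy_next_event(const struct toy_queue *q)
{
	uint64_t next = NO_EVENT;

	for (size_t i = 0; i < q->n; i++)
		if (q->t[i].pending && q->t[i].expires_ns < next)
			next = q->t[i].expires_ns;
	return next;
}

/* "Interrupt handler": run everything that has expired by now_ns, then
 * return the next expiry so the caller can reprogram the device. */
static uint64_t toy_interrupt(struct toy_queue *q, uint64_t now_ns)
{
	assert(q != NULL);
	for (size_t i = 0; i < q->n; i++) {
		if (q->t[i].pending && q->t[i].expires_ns <= now_ns) {
			q->t[i].pending = 0;
			q->t[i].fired++;
		}
	}
	return toy_next_event(q);
}
```

With timers armed at 10 ms, 1 ms and 0.5 ms, the first value to program is 0.5 ms. An interrupt handled at 1 ms runs the two short timers and returns 10 ms as the next expiry. The linear scan is only for brevity; a real implementation keeps the timers ordered so that finding the earliest one is cheap.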

The following is the callback actually installed in struct clock_event_device, hrtimer_interrupt:

kernel/time/hrtimer.c
/*
 * High resolution timer interrupt
 * Called with interrupts disabled
 */
void hrtimer_interrupt(struct clock_event_device *dev)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	ktime_t expires_next, now, entry_time, delta;
	int i, retries = 0;

	BUG_ON(!cpu_base->hres_active);
	cpu_base->nr_events++;
	dev->next_event.tv64 = KTIME_MAX;

	raw_spin_lock(&cpu_base->lock);
	entry_time = now = hrtimer_update_base(cpu_base);
retry:
	cpu_base->in_hrtirq = 1;
	/*
	 * We set expires_next to KTIME_MAX here with cpu_base->lock
	 * held to prevent that a timer is enqueued in our queue via
	 * the migration code. This does not affect enqueueing of
	 * timers which run their callback and need to be requeued on
	 * this CPU.
	 */
	cpu_base->expires_next.tv64 = KTIME_MAX;

	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
		struct hrtimer_clock_base *base;
		struct timerqueue_node *node;
		ktime_t basenow;

		if (!(cpu_base->active_bases & (1 << i)))
			continue;

		base = cpu_base->clock_base + i;
		basenow = ktime_add(now, base->offset);

		while ((node = timerqueue_getnext(&base->active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing querry for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don't add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(timer, &basenow);
		}
	}
	/* Reevaluate the clock bases for the next expiry */
	expires_next = __hrtimer_get_next_event(cpu_base);
	/*
	 * Store the new expiry value so the migration code can verify
	 * against it.
	 */
	cpu_base->expires_next = expires_next;
	cpu_base->in_hrtirq = 0;
	raw_spin_unlock(&cpu_base->lock);

	/* Reprogramming necessary ? */
	if (expires_next.tv64 == KTIME_MAX ||
	    !tick_program_event(expires_next, 0)) {
		cpu_base->hang_detected = 0;
		return;
	}

	/*
	 * The next timer was already expired due to:
	 * - tracing
	 * - long lasting callbacks
	 * - being scheduled away when running in a VM
	 *
	 * We need to prevent that we loop forever in the hrtimer
	 * interrupt routine. We give it 3 attempts to avoid
	 * overreacting on some spurious event.
	 *
	 * Acquire base lock for updating the offsets and retrieving
	 * the current time.
	 */
	raw_spin_lock(&cpu_base->lock);
	now = hrtimer_update_base(cpu_base);
	cpu_base->nr_retries++;
	if (++retries < 3)
		goto retry;
	/*
	 * Give the system a chance to do something else than looping
	 * here. We stored the entry time, so we know exactly how long
	 * we spent here. We schedule the next event this amount of
	 * time away.
	 */
	cpu_base->nr_hangs++;
	cpu_base->hang_detected = 1;
	raw_spin_unlock(&cpu_base->lock);
	delta = ktime_sub(now, entry_time);
	if (delta.tv64 > cpu_base->max_hang_time.tv64)
		cpu_base->max_hang_time = delta;
	/*
	 * Limit it to a sensible value as we enforce a longer
	 * delay. Give the CPU at least 100ms to catch up.
	 */
	if (delta.tv64 > 100 * NSEC_PER_MSEC)
		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
	else
		expires_next = ktime_add(now, delta);
	tick_program_event(expires_next, 1);
	printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
		    ktime_to_ns(delta));
}

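So is the original tick_periodic function dead? Not quite. The hrtimer_interrupt callback exists to implement hrtimer, so why not hand the work that the periodic tick_periodic used to do over to an hrtimer as well? Say HZ=250, so the period is 4 ms: register a callback that does the same work as tick_periodic as an hrtimer with a 4 ms period, and the logic is the same as before. jiffies still advances at HZ, time slices are still checked at HZ, and scheduling is still triggered at HZ.

That does appear to be the case. There is a per-CPU structure, struct tick_sched, whose first member is struct hrtimer sched_timer: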

kernel/time/tick-sched.h
/**
 * struct tick_sched - sched tick emulation and no idle tick control/stats
 * @sched_timer:	hrtimer to schedule the periodic tick in high
 *			resolution mode
 * @last_tick:		Store the last tick expiry time when the tick
 *			timer is modified for nohz sleeps. This is necessary
 *			to resume the tick timer operation in the timeline
 *			when the CPU returns from nohz sleep.
 * @tick_stopped:	Indicator that the idle tick has been stopped
 * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
 * @idle_calls:		Total number of idle calls
 * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
 * @idle_entrytime:	Time when the idle call was entered
 * @idle_waketime:	Time when the idle was interrupted
 * @idle_exittime:	Time when the idle state was left
 * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
 * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
 * @sleep_length:	Duration of the current idle sleep
 * @do_timer_lst:	CPU was the last one doing do_timer before going idle
 */
struct tick_sched {
	struct hrtimer			sched_timer;
	unsigned long			check_clocks;
	enum tick_nohz_mode		nohz_mode;
	ktime_t				last_tick;
	int				inidle;
	int				tick_stopped;
	unsigned long			idle_jiffies;
	unsigned long			idle_calls;
	unsigned long			idle_sleeps;
	int				idle_active;
	ktime_t				idle_entrytime;
	ktime_t				idle_waketime;
	ktime_t				idle_exittime;
	ktime_t				idle_sleeptime;
	ktime_t				iowait_sleeptime;
	ktime_t				sleep_length;
	unsigned long			last_jiffies;
	unsigned long			next_jiffies;
	ktime_t				idle_expires;
	int				do_timer_last;
};

The callback of the sched_timer hrtimer is:

kernel/time/tick-sched.c
/*
 * High resolution mode: the hrtimer callback that emulates the classic periodic tick
 * Purpose: takes over from tick_handle_periodic() and does the core periodic work
 * Context: hard interrupt context (the same as the classic clock interrupt)
 */
static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
	// Recover the enclosing tick_sched from the embedded hrtimer
	struct tick_sched *ts =
		container_of(timer, struct tick_sched, sched_timer);
	// Saved CPU registers (used to tell user mode from kernel mode)
	struct pt_regs *regs = get_irq_regs();
	// Current monotonic time in nanoseconds, derived from the clocksource (mxc_timer1 here)
	ktime_t now = ktime_get();

	// ==============================================
	// Core 1: update the global jiffies
	// Plays the role of do_timer(1) in periodic mode, so jiffies keeps
	// advancing at the HZ rate
	// ==============================================
	tick_sched_do_timer(now);

	/*
	 * Run the rest only when called from interrupt context with valid registers
	 */
	if (regs)
		// ==============================================
		// Core 2: account process time and raise the timer softirq
		// Plays the role of update_process_times() in periodic mode:
		// 1. charge user/system time to the current process
		// 2. raise TIMER_SOFTIRQ so that timer_list timers get processed
		// 3. run the scheduler tick
		// ==============================================
		tick_sched_handle(ts, regs);

	// If the CPU is idle and the periodic tick was stopped, do not rearm the timer
	if (unlikely(ts->tick_stopped))
		return HRTIMER_NORESTART;

	// ==============================================
	// Core 3: rearm the timer one period ahead
	// tick_period is the period that corresponds to HZ (e.g. HZ=100 -> 10 ms),
	// so the timer keeps firing at the HZ rate
	// ==============================================
	hrtimer_forward(timer, now, tick_period);

	// Restart the hrtimer to keep emulating the periodic tick
	return HRTIMER_RESTART;
}

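Now it all makes sense: what was said about HZ and jiffies still holds, it is just implemented on top of hrtimer. hrtimer is also closely tied to scheduling, so I expect to meet it again when I get to process scheduling.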

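Summary

Kernel data structures related to time

struct clocksource
Describes a hardware clocksource and provides the ability to read its cycle counter.
struct timekeeper
As the name says, the keeper of time: the conversion from clock cycles to CLOCK_REALTIME and CLOCK_MONOTONIC all lives here. It maintains two tk_read_base timelines, tkr_mono and tkr_raw. tkr_mono is affected by NTP adjustment (its rate is tuned, but it does not jump), while tkr_raw is the raw value.
struct clock_event_device
A timer device that can raise interrupts (events). It holds the interrupt handler and the methods for setting the device mode (periodic or oneshot). hrtimer is built on the interrupt handler installed here.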

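Linux kernel timers

Two timers have shown up so far: the timer list and hrtimer.
Strictly speaking both are software timers, since both depend on software bookkeeping, but hrtimer is the "harder" of the two because it is handled in interrupt context.

timer list:
A simple software timer built on TIMER_SOFTIRQ. Its precision is bounded by jiffies and HZ, because expiry is decided by comparing against jiffies; it cannot be finer than one jiffy.
hrtimer:
The high resolution timer, built on the callback of struct clock_event_device. The clock event device must support interrupts and oneshot mode. From what others have written, the timers are kept in a tree (the timerqueue seen in hrtimer_interrupt above). Presumably the hrtimer with the nearest expiry is kept where it can be fetched quickly, and fetching it yields the next timeout to program. I have seen the same approach before, including in user space, where the timeout of epoll_wait() stands in for the clock event device and the application maintains its own timer tree (libev, ev_timer). I expect the principle is similar.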

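Updating the time

With hrtimer, a handler is registered on sched_timer with the period given by HZ, and it updates jiffies, checks process time slices, and triggers scheduling and TIMER_SOFTIRQ. Without hrtimer, the same work is done in the periodic callback tick_handle_periodic, which is simpler still. sched_timer is also closely tied to scheduling, so hrtimer may come up again when I write up process scheduling.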

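Other details

Part of this chapter was worked out by "discussing" with an AI, especially the final hrtimer section, which is my own attempt at a consistent explanation. I have not traced the source carefully, so details may be off and corrections are welcome. Most of it should be correct, though, and should be easy for beginners to follow.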
