Common Kernel Panic Issues on Journey 6X

Posted by: Horizon Developer  Date: 2025-11-14  Source: Engineer

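1. Overview

Kernel panic covers many kinds of kernel exceptions, including but not limited to: null or invalid pointer access, HungTask, RCU Stall, softlockup, hardlockup, OOM, and BUG_ON.

The figure below shows the path each panic type follows:

[Figure: panic paths by exception type]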

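2. General Approach

kpanic-class exceptions are all exceptions that the kernel software can detect. After the kernel finishes the panic flow, bl31 performs a WarmReset, so the pstore log is available for every panic scene.
Because these are software exceptions, pstore always contains the call stack, registers, and related information for the failure, which is enough for an initial analysis of 70% of the problems.
BUG_ON is a panic that software triggers deliberately, so checking the code logic directly is enough.
For panics that ultimately depend on timing, add more logs or use ftrace while reproducing the issue, to trace the timing (race) that leads to the exception.
For complex panics, also enable the ramdump feature and capture a dump for analysis.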
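
As a first triage step on a pstore log, the panic type can often be guessed from its signature line. The sketch below is only an illustration, not a tool from the Journey 6 SDK: the function name `classify_panic` is hypothetical, and the signature strings are the ones mainline kernels print, which a vendor kernel may word differently.

```python
import re

# Signature strings follow mainline kernel messages; a vendor kernel may
# word them differently, so treat this table as an assumption.
SIGNATURES = [
    ("bad pointer access", re.compile(r"Unable to handle kernel")),
    ("HungTask", re.compile(r"blocked for more than \d+ seconds")),
    ("RCU Stall", re.compile(r"rcu: INFO: \w+ detected stalls|RCU Stall")),
    ("softlockup", re.compile(r"soft lockup")),
    ("hardlockup", re.compile(r"hard LOCKUP", re.IGNORECASE)),
    ("OOM", re.compile(r"Out of memory")),
    ("BUG_ON", re.compile(r"kernel BUG at")),
]


def classify_panic(log_text):
    """Return the names of all panic categories whose signature appears."""
    return [name for name, pattern in SIGNATURES if pattern.search(log_text)]
```

Assuming pstore is mounted at the usual `/sys/fs/pstore`, the text of each file there could be passed to `classify_panic` to decide which of the sections below to read first.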

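3. Typical Problems

3.1. Invalid Pointer Access

This type of problem means the kernel accessed a null pointer, an unmapped address, or an address it has no permission to access, which triggers a mem abort.

Example: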

<4>[ 1486.084782] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
<4>[ 1486.090809] Hardware name: Horizon AI Technologies, Inc. HOBOT Sigi-P Matrix (DT)
<4>[ 1486.091730] pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
<4>[ 1486.092492] pc : __memcpy+0x48/0x180
<4>[ 1486.092953] lr : kfifo_copy_in+0x68/0x94
<4>[ 1486.093456] sp : ffff8000124cbd40
<4>[ 1486.093876] pmr_save: 000000e0
<4>[ 1486.094265] x29: ffff8000124cbd40 x28: ffff0001950ae200
<4>[ 1486.094942] x27: ffff0001765b8800 x26: 0000000000000000
<4>[ 1486.095619] x25: 0000000000000340 x24: 0000000000000018
<4>[ 1486.096295] x23: 00000000f97f0681 x22: ffff8000124cbe18
<4>[ 1486.096970] x21: ffff0001950ae280 x20: 00000000ffff0001
<4>[ 1486.097646] x19: 00000000f97f0681 x18: 0000000000000000
<4>[ 1486.098321] x17: 0000000000000000 x16: 0000000000000000
<4>[ 1486.098997] x15: 000000000000000a x14: 0000000035693bc0
<4>[ 1486.099672] x13: ffff800011feecd1 x12: ffffffffffffffff
<4>[ 1486.100348] x11: ffffffffffffffff x10: 0000000000000020
<4>[ 1486.101024] x9 : ffff800011feecbc x8 : 0000000005f5e100
<4>[ 1486.101699] x7 : ffff00027ee2f8b8 x6 : 00000001239ae000
<4>[ 1486.102376] x5 : ffff0001950ae280 x4 : 0000000000000008
<4>[ 1486.103052] x3 : 0000000100000025 x2 : 00000000f97f0679
<4>[ 1486.103728] x1 : ffff8000124cbe20 x0 : 00000001239ae000
<4>[ 1486.104405] Call trace:
<4>[ 1486.104717] __memcpy+0x48/0x180
<4>[ 1486.105133] __kfifo_in+0x3c/0x5c
<4>[ 1486.105558] bpu_core_tasklet+0x1b0/0x300 [bpu_cores]
<4>[ 1486.106214] tasklet_action_common.isra.0+0x90/0xd4
<4>[ 1486.106837] tasklet_action+0x28/0x34
<4>[ 1486.107306] _stext+0x1e0/0x220
<4>[ 1486.107709] __irq_exit_rcu+0x80/0xd0
<6>[ 1486.107789] [S1][V0]subdev_balance_lost_next_frame lost_this = 0x1, lost_next = 0x1,
<6>[ 1486.107803] [S2][V0]subdev_balance_lost_next_frame lost_this = 0x1, lost_next = 0x1,
<6>[ 1486.107813] [S3][V0]subdev_balance_lost_next_frame lost_this = 0x1, lost_next = 0x1,
<6>[ 1486.107827] [S0][V0]subdev_balance_lost_next_frame lost_this = 0x0, lost_next = 0x1,
<6>[ 1486.107921] [S4][V0]cim_balance_lost_next_frame lost_this = 0x0, lost_next = 0x1,
<4>[ 1486.108179] irq_exit+0x10/0x20
<4>[ 1486.113403] __handle_domain_irq+0x70/0xa0
<4>[ 1486.113928] gic_handle_irq+0x2d0/0x30c
<4>[ 1486.114419] el1_irq+0xcc/0x180
<4>[ 1486.114822] arch_cpu_idle+0x30/0x38
<4>[ 1486.115278] default_idle_call+0x30/0x3c
<4>[ 1486.115779] do_idle+0x130/0x258
<4>[ 1486.116194] cpu_startup_entry+0x24/0x54
<4>[ 1486.116695] rest_init+0xd4/0xe4
<4>[ 1486.117108] arch_call_rest_init+0x10/0x1c
<4>[ 1486.117631] start_kernel+0x6f0/0x734

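For this type of problem, the call stack of the CPU that took the mem abort is visible, so the function that accessed the bad address can be identified immediately. If that function is simple, the variable holding the bad pointer can be found quickly as well.
If the function is complex, tools such as gdb and addr2line can be used with the symbol table to analyze the disassembly, locate the code position, and identify the variable from the offset.
Once the variable holding the bad pointer and its origin are confirmed, check the following possible causes:

Review the code in the call stack and check whether the business or calling logic could introduce a wrong pointer.
Review the code logic around the variable and consider whether a race is possible.
If neither a race nor a wrongly introduced pointer can explain the bad address, consider whether the memory was overwritten; see the memory corruption section.
Check whether devices crash randomly and whether the problem is specific to a single unit. If several devices crash randomly, a DDR software or hardware configuration problem is also possible.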
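
To feed a call trace into gdb or addr2line, the `symbol+offset` pairs first have to be pulled out of the log, skipping unrelated messages that get interleaved with the trace (as in the example above). The following is a minimal sketch of that step; `extract_frames` and `FRAME_RE` are hypothetical names, and the pattern assumes the `symbol+0xoff/0xsize [module]` frame layout seen in the example log.

```python
import re

# Matches a frame such as "[ 1486.105558] bpu_core_tasklet+0x1b0/0x300 [bpu_cores]".
# Lines like "pc : ..." or driver messages do not match and are skipped.
FRAME_RE = re.compile(
    r"\]\s+(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def extract_frames(log_lines):
    """Return (symbol, offset, module) for every call-trace frame line."""
    frames = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (match["symbol"], int(match["offset"], 16), match["module"])
            )
    return frames
```

Each resulting pair can then be resolved against a `vmlinux` or module `.ko` built with debug information, for example with gdb's `list *(symbol+0xoffset)`. The module field tells you which binary to load.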

3.2. HungTask

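khungtaskd is the kernel thread that scans for processes in the D state. When a kernel process or thread stays in the D state for a long time, hungtask is triggered. On Journey 6 systems the hungtask timeout is set to 120 s and CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1, so once hungtask detects a process that has been in the D state for more than 120 s, it triggers a panic directly.

Example: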

[ 605.147501] INFO: task ipi7_thread:3826 blocked for more than 120 seconds.
[ 605.147532] Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
[ 605.147538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.147543] task:ipi7_thread state:D stack: 0 pid: 3826 ppid: 2 flags:0x00000028
[ 605.147559] Call trace:
[ 605.147562] __switch_to+0xf8/0x160
[ 605.147583] __schedule+0x268/0x800
[ 605.147594] schedule+0x84/0xf8
[ 605.147602] cimdma_swap_buffer+0x244/0x338 [hobot_cim_dma]
[ 605.147631] kthread+0x160/0x188
[ 605.147640] ret_from_fork+0x10/0x18
[ 605.147649] INFO: task ipi4_thread:3830 blocked for more than 120 seconds.
[ 605.147654] Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
[ 605.147658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.147661] task:ipi4_thread state:D stack: 0 pid: 3830 ppid: 2 flags:0x00000028
[ 605.147669] Call trace:
[ 605.147671] __switch_to+0xf8/0x160
[ 605.147678] __schedule+0x268/0x800
[ 605.147685] schedule+0x84/0xf8
[ 605.147692] cimdma_swap_buffer+0x244/0x338 [hobot_cim_dma]
[ 605.147710] kthread+0x160/0x188
[ 605.147716] ret_from_fork+0x10/0x18
[ 605.147723] INFO: task ipi6_thread:3831 blocked for more than 120 seconds.
[ 605.147728] Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
[ 605.147732] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.147735] task:ipi6_thread state:D stack: 0 pid: 3831 ppid: 2 flags:0x00000028
[ 605.147743] Call trace:
[ 605.147745] __switch_to+0xf8/0x160
[ 605.147751] __schedule+0x268/0x800
[ 605.147758] schedule+0x84/0xf8
[ 605.147765] cimdma_swap_buffer+0x244/0x338 [hobot_cim_dma]
[ 605.147783] kthread+0x160/0x188
[ 605.147789] ret_from_fork+0x10/0x18
[ 605.147796] INFO: task ipi5_thread:3832 blocked for more than 120 seconds.
[ 605.147801] Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
[ 605.147805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.147808] task:ipi5_thread state:D stack: 0 pid: 3832 ppid: 2 flags:0x00000028
[ 605.147816] Call trace:
[ 605.147818] __switch_to+0xf8/0x160
[ 605.147824] __schedule+0x268/0x800
[ 605.147831] schedule+0x84/0xf8
[ 605.147838] cimdma_swap_buffer+0x244/0x338 [hobot_cim_dma]
[ 605.147856] kthread+0x160/0x188
[ 605.147862] ret_from_fork+0x10/0x18

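For this type of problem, khungtaskd prints the stacks of all processes that have been in the D state for a long time before it triggers the crash, so the relevant call information can be located quickly in pstore.
When driver code stays in the D state for a long time, it is usually waiting for a resource it cannot obtain, and the business flow needs to be analyzed. For resources whose wait time is unpredictable, the *_interruptible or *_timeout family of functions can be used as an optimization.
The other kind of hungtask is caused by deadlock. Finding the root cause requires code analysis that combines all the D-state process stacks printed in the log with the stacks of the running processes. For complex deadlocks that involve a large number of D-state processes, capturing a ramdump is the only practical option.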
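
When many tasks are reported at once, grouping them by the function they are blocked in makes a shared wait point stand out. The sketch below illustrates this under the assumption that the log follows the layout of the example above; `group_hung_tasks` is a hypothetical helper, and the set of scheduler frames it skips is taken from that example only.

```python
import re
from collections import defaultdict

TASK_RE = re.compile(r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than")
FRAME_RE = re.compile(r"\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Frames that every sleeping task shows; they say nothing about the cause.
SCHEDULER_FRAMES = {"__switch_to", "__schedule", "schedule"}


def group_hung_tasks(log_lines):
    """Map each blocking function to the hung tasks waiting inside it."""
    blocked_in = defaultdict(list)
    current = None  # task whose call trace is currently being read
    for line in log_lines:
        task = TASK_RE.search(line)
        if task:
            current = f"{task['name']}:{task['pid']}"
            continue
        frame = FRAME_RE.search(line)
        if current and frame and frame["symbol"] not in SCHEDULER_FRAMES:
            blocked_in[frame["symbol"]].append(current)
            current = None  # only the first non-scheduler frame matters
    return dict(blocked_in)
```

Applied to the example log, all four ipi threads end up under `cimdma_swap_buffer`, which suggests they are waiting on the same resource rather than on four unrelated ones.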

3.3. RCU Stall

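RCU (Read-Copy-Update) is named after how it works: read, copy, then update. Readers of a data structure protected by RCU can access it without taking any lock. A writer first makes a copy, modifies the copy, and then replaces the pointer to the original data with a pointer to the modified copy. A callback mechanism frees the original data at a suitable time, once no reader is still using it.

Freeing the original resources is handled by the RCU softirq and the rcu thread. RCU Stall is checked in the tick interrupt: if the thread responsible for freeing never gets to run within the RCU Stall timeout (30 s), an RCU Stall panic occurs.

Example: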

<3>[122235.050769] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
<3>[122235.050778] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-5):
<4>[122235.050789] (detected by 4, t=7502 jiffies, g=34171837, q=56419)
<3>[122235.050796] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4325451296-4325451295), jiffies_till_next_fqs=1, root ->qsmask 0x0
<0>[122235.050805] Kernel panic - not syncing:
<4>[122235.050809] RCU Stall
<4>[122235.050813] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P O 6.1.134-rt51-04836-g0b3e5cbe5431 #13
<4>[122235.050822] Hardware name: Horizon AI Technologies, Inc. HOBOT Sigi-P Matrix (DT)
<4>[122235.050828] Call trace:
<4>[122235.050830] dump_backtrace+0x0/0x1cc
<4>[122235.050846] show_stack+0x18/0x24
<4>[122235.050853] dump_stack+0xcc/0x12c
<4>[122235.050863] panic+0xcc/0x344
<4>[122235.050870] panic_on_rcu_stall+0x24/0x28
<4>[122235.050881] rcu_sched_clock_irq+0x830/0x900
<4>[122235.050889] update_process_times+0x60/0x88
<4>[122235.050898] tick_sched_handle.isra.0+0x50/0x68
<4>[122235.050907] tick_sched_timer+0x4c/0x90
<4>[122235.050914] __hrtimer_run_queues+0x108/0x19c
<4>[122235.050922] hrtimer_interrupt+0xb0/0x1b8
<4>[122235.050930] arch_timer_handler_phys+0x2c/0x44
<4>[122235.050940] handle_percpu_devid_irq+0x5c/0x104
<4>[122235.050951] generic_handle_irq+0x24/0x3c
<4>[122235.050958] __handle_domain_irq+0x9c/0xa0
<4>[122235.050965] gic_handle_irq+0x2d0/0x30c
<4>[122235.050975] el1_irq+0xcc/0x180
<4>[122235.050982] arch_cpu_idle+0x30/0x38
<4>[122235.050991] default_idle_call+0x30/0x3c
<4>[122235.050999] do_idle+0x144/0x270
<4>[122235.051008] cpu_startup_entry+0x24/0x54
<4>[122235.051015] secondary_start_kernel+0x1a4/0x1bc
<6>[122236.028018] [S0][V0]pym_subdev_stop
<6>[122236.028038] pym_check_stop_state cnt 10 pym->irq_status = 4
<6>[122236.046863] pym_check_exit_state cnt 9 pym->irq_status = 4
<2>[122236.061751] SMP: stopping secondary CPUs
<0>[122236.062276] Kernel Offset: disabled
<0>[122236.062730] CPU features: 0x0040426,2a00aa38
<0>[122236.063284] Memory Limit: none

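An RCU Stall therefore usually points to a scheduling problem:

Hard or soft interrupts are disabled for a long time, or an interrupt storm occurs, so the RCU softirq cannot run.
Preemption is disabled for a long time.
A high-priority RT process occupies the CPU, so the rcu thread cannot run.
In the RT version, softirqs are handled by ksoftirqd at cfs:19 priority, so RCU softirq work can be preempted by various higher-priority tasks until it times out.

The RCU Stall log shows the core on which the stall occurred. Check the processes on that core and how they were scheduled to locate the problem. For this kind of issue it is best to enable ftrace, or at least CONFIG_SCHED_LOGGER, to capture scheduling information for every core and speed up the analysis.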
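
The stall report also records which CPU detected the stall and how long the grace period had been pending, in jiffies. A small sketch of extracting both follows; `parse_stall` is a hypothetical helper, and the tick rate must be supplied by the caller because the log does not print it. The value 250 used in the note after the code is an assumption, chosen only because it makes the example's 7502 jiffies line up with the 30 s timeout.

```python
import re

STALL_RE = re.compile(r"\(detected by (?P<cpu>\d+), t=(?P<jiffies>\d+) jiffies")


def parse_stall(log_text, hz):
    """Return (detecting CPU, stall duration in seconds), or None if absent.

    hz is the kernel tick rate (CONFIG_HZ); it is not printed in the log,
    so the caller has to supply it.
    """
    match = STALL_RE.search(log_text)
    if match is None:
        return None
    return int(match["cpu"]), int(match["jiffies"]) / hz
```

With an assumed `hz` of 250, the example line `(detected by 4, t=7502 jiffies, ...)` gives CPU 4 and 7502 / 250 = 30.008 s. Note that the detecting CPU is the one that noticed the stall, so the other cores still need to be checked.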

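Ways to resolve an RCU Stall:

Check whether a heavily loaded RT process keeps being scheduled. If so, evaluate whether it can be changed to a non-RT process or given a lower priority.
Adjust the RCU Stall timeout by modifying /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout.
Check the counts of all interrupts in /proc/interrupts to rule out an interrupt storm.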
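
One way to check for an interrupt storm is to take two snapshots of /proc/interrupts a short interval apart and compare the counts. The sketch below illustrates the comparison on the snapshot text; `parse_interrupts` and `busiest_irqs` are hypothetical names, and the parsing assumes the usual layout of a CPU header row followed by one `label: count count ... description` row per interrupt.

```python
from itertools import takewhile


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: total count over all CPUs}."""
    lines = text.strip().splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:cpu_count]
        # Rows such as "ERR:" carry fewer counters than there are CPUs,
        # so stop at the first field that is not a number.
        counts = [int(field) for field in takewhile(str.isdigit, fields)]
        totals[label.strip()] = sum(counts)
    return totals


def busiest_irqs(before, after, top=3):
    """Return the IRQs with the largest count increase between two snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    deltas = {irq: new[irq] - old.get(irq, 0) for irq in new}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]
```

An IRQ whose delta is far above the others over a short interval is a candidate for the storm. What counts as too many depends on the device, so no threshold is built in.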

For example, in the log below the evt_thread process occupies the CPU for about 1 s each time it is scheduled:

[122233.054801] [6] : swapper/6(0) -> evt_thread0(26886)
[122234.026784] [6] : evt_thread0(26886) -> swapper/6(0)
[122234.028451] [6] : swapper/6(0) -> grep(14873)
[122234.028515] [6] : grep(14873) -> swapper/6(0)
[122234.051471] [6] : swapper/6(0) -> grep(14873)
[122234.051541] [6] : grep(14873) -> swapper/6(0)
[122234.054793] [6] : swapper/6(0) -> evt_thread0(26886)
[122235.030776] [6] : evt_thread0(26886) -> swapper/6(0)
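
Run times like these can be computed automatically from the switch records instead of being read off by eye. The following is a minimal sketch, assuming each record has the `[timestamp] [cpu] : prev(pid) -> next(pid)` layout shown above; `run_durations` is a hypothetical helper, not part of any scheduler logging tool.

```python
import re
from collections import defaultdict

SWITCH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[(?P<cpu>\d+)\]\s*:\s*"
    r"(?P<prev>\S+)\(\d+\)\s*->\s*(?P<next>\S+)\(\d+\)"
)


def run_durations(log_lines):
    """Return {task: [seconds per time slice]} from context-switch records."""
    running_since = {}  # cpu -> (task, timestamp at which it was switched in)
    durations = defaultdict(list)
    for line in log_lines:
        match = SWITCH_RE.search(line)
        if not match:
            continue
        ts, cpu = float(match["ts"]), match["cpu"]
        # A slice is only counted when the switch-in of the outgoing task
        # was seen earlier on the same CPU.
        if cpu in running_since and running_since[cpu][0] == match["prev"]:
            durations[match["prev"]].append(round(ts - running_since[cpu][1], 6))
        running_since[cpu] = (match["next"], ts)
    return dict(durations)
```

For the records above, this gives two slices of about 0.97 s each for `evt_thread0`, against well under a millisecond for `grep`.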
