Kernel panic when using a Ray cluster

A kernel panic and server crash occurred on one of my k8s worker nodes. The server went down, and the syslog below was saved.
Did 'ray::ImplicitFu' cause the kernel panic? Why did this happen?

Sep 30 07:34:33 dgx kernel: [247957.120605] INFO: task ray::ImplicitFu:3887929 blocked for more than 120 seconds.
Sep 30 07:34:33 dgx kernel: [247957.129324]       Tainted: P        W  OE     5.4.0-80-generic #90-Ubuntu
Sep 30 07:34:33 dgx kernel: [247957.137236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 30 07:34:33 dgx kernel: [247957.146338] ray::ImplicitFu D    0 3887929 2673759 0x00004000
Sep 30 07:34:33 dgx kernel: [247957.153045] Call Trace:
Sep 30 07:34:33 dgx kernel: [247957.156315]  __schedule+0x2e3/0x740
Sep 30 07:34:33 dgx kernel: [247957.160424]  ? try_to_wake_up+0x224/0x6a0
Sep 30 07:34:33 dgx kernel: [247957.165141]  schedule+0x42/0xb0
Sep 30 07:34:33 dgx kernel: [247957.168852]  schedule_preempt_disabled+0xe/0x10
Sep 30 07:34:33 dgx kernel: [247957.174161]  __mutex_lock.isra.0+0x178/0x4d0
Sep 30 07:34:33 dgx kernel: [247957.179344]  ? _nv028582rm+0x178/0x240 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.184653]  __mutex_lock_slowpath+0x13/0x20
Sep 30 07:34:33 dgx kernel: [247957.189666]  mutex_lock+0x2e/0x40
Sep 30 07:34:33 dgx kernel: [247957.193692]  uvm_gpu_release+0x1a/0x40 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.206569]  uvm_va_space_register_gpu_va_space+0x3b4/0x630 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.221812]  uvm_api_register_gpu_va_space+0x3d/0x60 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.236273]  uvm_ioctl+0xa00/0x12c0 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.249216]  ? os_release_spinlock+0x1a/0x20 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.262737]  ? _nv037035rm+0xa1/0x190 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.275474]  ? _nv033630rm+0x67/0x100 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.287919]  ? os_acquire_spinlock+0x12/0x20 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.300918]  ? os_release_spinlock+0x1a/0x20 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.313877]  ? _nv037035rm+0xa1/0x190 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.325859]  ? rm_ioctl+0x63/0xb0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.337137]  ? thread_context_non_interrupt_add+0x109/0x1d0 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.351317]  uvm_unlocked_ioctl+0x36/0x60 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.363603]  uvm_unlocked_ioctl_entry+0x8d/0xb0 [nvidia_uvm]
Sep 30 07:34:33 dgx kernel: [247957.376307]  do_vfs_ioctl+0x407/0x670
Sep 30 07:34:33 dgx kernel: [247957.386606]  ksys_ioctl+0x67/0x90
Sep 30 07:34:33 dgx kernel: [247957.396318]  __x64_sys_ioctl+0x1a/0x20
Sep 30 07:34:33 dgx kernel: [247957.406331]  do_syscall_64+0x57/0x190
Sep 30 07:34:33 dgx kernel: [247957.416153]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 30 07:34:33 dgx kernel: [247957.427381] RIP: 0033:0x7fcdd7330317
Sep 30 07:34:33 dgx kernel: [247957.436728] Code: Bad RIP value.
Sep 30 07:34:33 dgx kernel: [247957.445483] RSP: 002b:00007fc7b8cddb68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 30 07:34:33 dgx kernel: [247957.459170] RAX: ffffffffffffffda RBX: 00007fc7b8cddbb0 RCX: 00007fcdd7330317
Sep 30 07:34:33 dgx kernel: [247957.472278] RDX: 00007fc7b8cddbb0 RSI: 0000000000000019 RDI: 000000000000003d
Sep 30 07:34:33 dgx kernel: [247957.485311] RBP: 00007fc7b8cddbb0 R08: 0000561fb8057270 R09: 00007fc7b8cddb0c
Sep 30 07:34:33 dgx kernel: [247957.498247] R10: 00007fc7b8cddaf0 R11: 0000000000000246 R12: 0000000000000019
Sep 30 07:34:33 dgx kernel: [247957.511081] R13: 000000000000003d R14: 0000000000000000 R15: 0000561fb8f89030
Sep 30 07:34:33 dgx kernel: [247957.523866] NMI backtrace for cpu 7
Sep 30 07:34:33 dgx kernel: [247957.532642] CPU: 7 PID: 1814 Comm: khungtaskd Kdump: loaded Tainted: P        W  OE     5.4.0-80-generic #90-Ubuntu
Sep 30 07:34:33 dgx kernel: [247957.554084] Hardware name: NVIDIA DGXA100
Sep 30 07:34:33 dgx kernel: [247957.568574] Call Trace:
Sep 30 07:34:33 dgx kernel: [247957.576514]  dump_stack+0x6d/0x8b
Sep 30 07:34:33 dgx kernel: [247957.585413]  ? lapic_can_unplug_cpu+0x80/0x80
Sep 30 07:34:33 dgx kernel: [247957.595508]  nmi_cpu_backtrace.cold+0x14/0x53
Sep 30 07:34:33 dgx kernel: [247957.605574]  nmi_trigger_cpumask_backtrace+0xe8/0xf0
Sep 30 07:34:33 dgx kernel: [247957.616352]  arch_trigger_cpumask_backtrace+0x19/0x20
Sep 30 07:34:33 dgx kernel: [247957.627250]  watchdog+0x32e/0x390
Sep 30 07:34:33 dgx kernel: [247957.636107]  kthread+0x104/0x140
Sep 30 07:34:33 dgx kernel: [247957.644772]  ? hungtask_pm_notify+0x40/0x40
Sep 30 07:34:33 dgx kernel: [247957.654545]  ? kthread_park+0x90/0x90
Sep 30 07:34:33 dgx kernel: [247957.663672]  ret_from_fork+0x22/0x40
Sep 30 07:34:33 dgx kernel: [247957.672669] Sending NMI from CPU 7 to CPUs 0-6,8-255:
Sep 30 07:34:33 dgx kernel: [247957.683421] NMI backtrace for cpu 135 skipped: idling at native_safe_halt+0xe/0x10
...
Sep 30 07:34:33 dgx kernel: [247957.683598] NMI backtrace for cpu 158
Sep 30 07:34:33 dgx kernel: [247957.683599] CPU: 158 PID: 3784984 Comm: ray::run() Kdump: loaded Tainted: P        W  OE     5.4.0-80-generic #90-Ubuntu
Sep 30 07:34:33 dgx kernel: [247957.683600] Hardware name: NVIDIA DGXA100 
Sep 30 07:34:33 dgx kernel: [247957.683600] RIP: 0010:do_task_stat+0x848/0xd60
Sep 30 07:34:33 dgx kernel: [247957.683601] Code: c6 de d5 dd b8 4c 89 e7 e8 d5 0e f9 ff 49 8b 97 38 01 00 00 48 c7 c6 de d5 dd b8 4c 89 e7 e8 bf 0e f9 ff 49 8b 97 40 01 00 00 <48> c7 c6 de d5 dd b8 4c 89 e7 e8 a9 0e f9 ff 49 8b 97 48 01 00 00
Sep 30 07:34:33 dgx kernel: [247957.683601] RSP: 0018:ffffa92f3437bcc8 EFLAGS: 00000282
Sep 30 07:34:33 dgx kernel: [247957.683602] RAX: 000000000000000f RBX: ffff9427ce4b0000 RCX: 00000000000036b0
Sep 30 07:34:33 dgx kernel: [247957.683603] RDX: 00007ffefa4a9051 RSI: 0000000000000033 RDI: 000000000000000e
Sep 30 07:34:33 dgx kernel: [247957.683603] RBP: ffffa92f3437bdd0 R08: ffffffffb8ddd5de R09: 8080808080808080
Sep 30 07:34:33 dgx kernel: [247957.683603] R10: ffff94053ea2800f R11: 0000000000000ff9 R12: ffff94085dcb7880
Sep 30 07:34:33 dgx kernel: [247957.683604] R13: 0000000000000001 R14: 000000000000009e R15: ffff9421d3f93300
Sep 30 07:34:33 dgx kernel: [247957.683604] FS:  00007f4d0eede700(0000) GS:ffff94094fb80000(0000) knlGS:0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 07:34:33 dgx kernel: [247957.683605] CR2: 00007f4d4ff59150 CR3: 000000198c29a000 CR4: 0000000000340ee0
Sep 30 07:34:33 dgx kernel: [247957.683605] Call Trace:
Sep 30 07:34:33 dgx kernel: [247957.683605]  proc_tgid_stat+0x14/0x20
Sep 30 07:34:33 dgx kernel: [247957.683605]  proc_single_show+0x53/0x90
Sep 30 07:34:33 dgx kernel: [247957.683605]  seq_read+0xdc/0x490
Sep 30 07:34:33 dgx kernel: [247957.683606]  __vfs_read+0x1b/0x40
Sep 30 07:34:33 dgx kernel: [247957.683606]  vfs_read+0xab/0x160
Sep 30 07:34:33 dgx kernel: [247957.683606]  ksys_read+0x67/0xe0
Sep 30 07:34:33 dgx kernel: [247957.683606]  __x64_sys_read+0x1a/0x20
Sep 30 07:34:33 dgx kernel: [247957.683606]  do_syscall_64+0x57/0x190
Sep 30 07:34:33 dgx kernel: [247957.683607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 30 07:34:33 dgx kernel: [247957.683607] RIP: 0033:0x7f584afd6474
Sep 30 07:34:33 dgx kernel: [247957.683608] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
Sep 30 07:34:33 dgx kernel: [247957.683608] RSP: 002b:00007f4d0eedcb60 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683609] RAX: ffffffffffffffda RBX: 00000000000000ed RCX: 00007f584afd6474
Sep 30 07:34:33 dgx kernel: [247957.683609] RDX: 0000000000002000 RSI: 00007f580c03f5e0 RDI: 00000000000000ed
Sep 30 07:34:33 dgx kernel: [247957.683609] RBP: 00007f580c03f5e0 R08: 0000000000000000 R09: 0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683610] R10: 00007f580c00ff50 R11: 0000000000000246 R12: 0000000000002000
Sep 30 07:34:33 dgx kernel: [247957.683610] R13: 00000000000000ed R14: 00007f580c03f5e0 R15: 00007f58352dc1a0
Sep 30 07:34:33 dgx kernel: [247957.683611] NMI backtrace for cpu 21
Sep 30 07:34:33 dgx kernel: [247957.683613] CPU: 21 PID: 3551814 Comm: ray::run() Kdump: loaded Tainted: P        W  OE     5.4.0-80-generic #90-Ubuntu
Sep 30 07:34:33 dgx kernel: [247957.683613] Hardware name: NVIDIA DGXA100 
Sep 30 07:34:33 dgx kernel: [247957.683614] RIP: 0010:strlen+0x3/0x20
Sep 30 07:34:33 dgx kernel: [247957.683615] Code: 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 80 3f 00 <74> 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 c0 c3 0f
Sep 30 07:34:33 dgx kernel: [247957.683615] RSP: 0018:ffffa92f1a777c98 EFLAGS: 00000202
Sep 30 07:34:33 dgx kernel: [247957.683616] RAX: 0000000000000005 RBX: ffff94064aa03380 RCX: 00000000000000c8
Sep 30 07:34:33 dgx kernel: [247957.683617] RDX: 00000000ffffffff RSI: ffffffffb8d789f0 RDI: ffffffffb8d789f0
Sep 30 07:34:33 dgx kernel: [247957.683617] RBP: ffffa92f1a777cb8 R08: ffffffffb8da931a R09: 0000000001001000
Sep 30 07:34:33 dgx kernel: [247957.683618] R10: 0000000000000008 R11: 0000000000000000 R12: ffff94064aa03380
Sep 30 07:34:33 dgx kernel: [247957.683618] R13: ffffffffb8d789f0 R14: 000000000000009e R15: ffff93e2cc3e6600
Sep 30 07:34:33 dgx kernel: [247957.683619] FS:  00007fa8894cf700(0000) GS:ffff94094f540000(0000) knlGS:0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 07:34:33 dgx kernel: [247957.683620] CR2: 00007fdc84023160 CR3: 0000001c17cda000 CR4: 0000000000340ee0
Sep 30 07:34:33 dgx kernel: [247957.683620] Call Trace:
Sep 30 07:34:33 dgx kernel: [247957.683620]  ? seq_puts+0x1c/0x60
Sep 30 07:34:33 dgx kernel: [247957.683621]  do_task_stat+0x3d1/0xd60
Sep 30 07:34:33 dgx kernel: [247957.683621]  proc_tgid_stat+0x14/0x20
Sep 30 07:34:33 dgx kernel: [247957.683622]  proc_single_show+0x53/0x90
Sep 30 07:34:33 dgx kernel: [247957.683622]  seq_read+0xdc/0x490
Sep 30 07:34:33 dgx kernel: [247957.683622]  __vfs_read+0x1b/0x40
Sep 30 07:34:33 dgx kernel: [247957.683623]  vfs_read+0xab/0x160
Sep 30 07:34:33 dgx kernel: [247957.683623]  ksys_read+0x67/0xe0
Sep 30 07:34:33 dgx kernel: [247957.683623]  __x64_sys_read+0x1a/0x20
Sep 30 07:34:33 dgx kernel: [247957.683624]  do_syscall_64+0x57/0x190
Sep 30 07:34:33 dgx kernel: [247957.683624]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 30 07:34:33 dgx kernel: [247957.683624] RIP: 0033:0x7fb3fc8ec474
Sep 30 07:34:33 dgx kernel: [247957.683625] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
Sep 30 07:34:33 dgx kernel: [247957.683626] RSP: 002b:00007fa8894cdb60 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683626] RAX: ffffffffffffffda RBX: 0000000000000113 RCX: 00007fb3fc8ec474
Sep 30 07:34:33 dgx kernel: [247957.683627] RDX: 0000000000002000 RSI: 00007fb3bc030910 RDI: 0000000000000113
Sep 30 07:34:33 dgx kernel: [247957.683627] RBP: 00007fb3bc030910 R08: 0000000000000000 R09: 0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.683627] R10: 00007fb3bc000b40 R11: 0000000000000246 R12: 0000000000002000
Sep 30 07:34:33 dgx kernel: [247957.683628] R13: 0000000000000113 R14: 00007fb3bc030910 R15: 00007fb3e52dbfc0
Sep 30 07:34:33 dgx kernel: [247957.683631] NMI backtrace for cpu 190 skipped: idling at native_safe_halt+0xe/0x10
Sep 30 07:34:33 dgx kernel: [247957.683632] NMI backtrace for cpu 62 skipped: idling at native_safe_halt+0xe/0x10
...
<repeat Kdump>
...
Sep 30 07:34:33 dgx kernel: [247957.687900] CPU: 249 PID: 3620328 Comm: ray::run() Kdump: loaded Tainted: P        W  OE     5.4.0-80-generic #90-Ubuntu
Sep 30 07:34:33 dgx kernel: [247957.687901] Hardware name: NVIDIA DGXA100 
Sep 30 07:34:33 dgx kernel: [247957.687901] RIP: 0010:_nv030350rm+0x12/0x40 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687903] Code: d2 0e 31 c0 e8 bf 0e 8b ff e8 ea 1b f9 ff 31 c0 48 83 c4 08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04 88 <48> 83 c4 08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 11 0a bf 0a ad
Sep 30 07:34:33 dgx kernel: [247957.687903] RSP: 0018:ffffa92eba67f9d8 EFLAGS: 00000212
Sep 30 07:34:33 dgx kernel: [247957.687904] RAX: 0000000000010072 RBX: ffff9469038cc008 RCX: 0000000000042807
Sep 30 07:34:33 dgx kernel: [247957.687905] RDX: ffff9469038cd2c0 RSI: ffff9469038cc008 RDI: ffff94694bec3008
Sep 30 07:34:33 dgx kernel: [247957.687905] RBP: ffff942105265740 R08: 0000000000000020 R09: ffff942105265758
Sep 30 07:34:33 dgx kernel: [247957.687905] R10: ffffffffc1867710 R11: ffff94694bec3008 R12: 000000000010a01c
Sep 30 07:34:33 dgx kernel: [247957.687906] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9469038cd2c0
Sep 30 07:34:33 dgx kernel: [247957.687906] FS:  00007fee33fff700(0000) GS:ffff94e8ffa40000(0000) knlGS:0000000000000000
Sep 30 07:34:33 dgx kernel: [247957.687907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 07:34:33 dgx kernel: [247957.687907] CR2: 00007f3c380220b0 CR3: 00000019165d0000 CR4: 0000000000340ee0
Sep 30 07:34:33 dgx kernel: [247957.687908] Call Trace:
Sep 30 07:34:33 dgx kernel: [247957.687908]  ? _nv019751rm+0x116/0x190 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687908]  ? _nv017707rm+0x50/0x60 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687908]  ? _nv032646rm+0x58/0x2e0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687909]  ? _nv032613rm+0xb5/0x1b0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687909]  ? _nv032369rm+0x6f/0xe0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687909]  ? _nv032633rm+0x2ca/0x340 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687909]  ? _nv032633rm+0x291/0x340 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687909]  ? _nv006023rm+0xbc/0x150 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687910]  ? _nv006023rm+0x81/0x150 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687910]  ? _nv009616rm+0xd6/0x1b0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687910]  ? _nv032193rm+0x18c/0x3b0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687910]  ? _nv033513rm+0x1ff/0x210 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687911]  ? _nv035011rm+0x1f4/0x260 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687911]  ? _nv008140rm+0x329/0x3f0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687911]  ? _nv033629rm+0x4d/0x90 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687911]  ? _nv033630rm+0x57/0x100 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687911]  ? _nv033630rm+0x37/0x100 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687912]  ? _nv007227rm+0x55/0xa0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687912]  ? _nv007227rm+0x34/0xa0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687912]  ? _nv000747rm+0x4ff/0x940 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687912]  ? rm_ioctl+0x54/0xb0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687912]  ? __check_object_size+0x51/0x150
Sep 30 07:34:33 dgx kernel: [247957.687913]  ? nvidia_ioctl+0x5b1/0x8a0 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687913]  ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
Sep 30 07:34:33 dgx kernel: [247957.687913]  ? do_vfs_ioctl+0x407/0x670
Sep 30 07:34:33 dgx kernel: [247957.687913]  ? ksys_ioctl+0x67/0x90
Sep 30 07:34:33 dgx kernel: [247957.687914]  ? __x64_sys_ioctl+0x1a/0x20
Sep 30 07:34:33 dgx kernel: [247957.687914]  ? do_syscall_64+0x57/0x190
Sep 30 07:34:33 dgx kernel: [247957.687914]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 30 07:34:33 dgx kernel: [247957.687914] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 3.718 msecs
Sep 30 07:34:33 dgx kernel: [247957.688996] Kernel panic - not syncing: hung_task: blocked tasks

server: NVIDIA DGX A100
os: Ubuntu (kernel 5.4.0-80-generic)
env: k8s 1.20
ray: rayproject/ray-ml:1.6.0-py38-gpu (installed using Helm)
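
In case it helps anyone triaging a similar log, here is a minimal sketch for pulling the hung-task reports and the final panic reason out of a syslog file. It assumes the line format shown in the log above; the function name `summarize_hung_tasks` and the regular expressions are my own illustration, not part of Ray or any kernel tooling.

```python
import re
import sys

# Matches the khungtaskd report line, for example:
# "... INFO: task ray::ImplicitFu:3887929 blocked for more than 120 seconds."
# The task name may itself contain colons, so the PID is taken as the last
# ":<digits>" group directly before " blocked".
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Matches the final panic line, for example:
# "... Kernel panic - not syncing: hung_task: blocked tasks"
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.*)")


def summarize_hung_tasks(lines):
    """Return (hung_tasks, panic_reason) found in an iterable of syslog lines.

    hung_tasks is a list of (task_name, pid, blocked_seconds) tuples;
    panic_reason is the text after "not syncing:" or None if no panic line
    was seen.
    """
    hung = []
    panic = None
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            hung.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
            continue
        m = PANIC_RE.search(line)
        if m:
            panic = m.group("reason").strip()
    return hung, panic


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as f:
        tasks, reason = summarize_hung_tasks(f)
    for comm, pid, secs in tasks:
        print(f"task {comm} (pid {pid}) blocked for more than {secs}s")
    print(f"panic reason: {reason}")
```

Run against the log above, this would report the blocked `ray::ImplicitFu` task and the `hung_task: blocked tasks` panic reason, which makes it easier to see which task the hung-task detector flagged before the panic.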