Project

General

Profile

Actions

Bug #45609

closed

kclient: kclient node get stuck when the async create/unlink enabled

Added by Xiubo Li almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Crash signature (v1):
Crash signature (v2):

Description

This issue was initially from Zheng Yan, and locally I have reproduced it by running for 5 hours:

Steps:

1, enable async create/unlink ops by specifying the `-o nowsync` when mounting
2, in one bash terminal run:

   # for i in {0..1000}; do pkill -9 ceph-mds; sleep 30; ./bin/ceph-mds -i a -c /data/ceph/build/ceph.conf; sleep 20; done

3, in another bash terminal run:
   # for i in {0..1000}; do for i in {0..10}; do mkdir /mnt/cephfs/dir_$i; touch /mnt/cephfs/dir_$i/$i; touch /mnt/cephfs/dir_$i/$i_1; echo "create /mnt/cephfs/dir_$i/$i"; done; sleep 2; for i in {0..10}; do rm -rf /mnt/cephfs/dir_$i; echo "unlinking /mnt/cephfs/dir_$i"; done; done

<4>[15912.549018] libceph: mds0 (1)10.72.36.245:6827 socket closed (con state OPEN)
<4>[15912.623341] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<4>[15913.327196] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<4>[15914.351224] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<4>[15916.335132] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<4>[15920.366979] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<4>[15928.494697] libceph: mds0 (1)10.72.36.245:6827 socket error on write
<6>[15949.636088] ceph: mds0 reconnect start
<6>[15949.636099] ceph: session 00000000f871e47b state reconnecting
<6>[15949.641863] ceph: mds0 reconnect success
<6>[15952.723364] ceph: mds0 recovery completed
<6>[15981.740583] libceph: mon2 (1)10.72.36.245:40083 session lost, hunting for new mon
<3>[16097.960293] INFO: task kworker/18:1:32111 blocked for more than 122 seconds.
<3>[16097.960860]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16097.961332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16097.961868] kworker/18:1    D    0 32111      2 0x80004080
<6>[16097.962151] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16097.962188] Call Trace:
<4>[16097.962353]  ? __schedule+0x276/0x6e0
<4>[16097.962359]  ? schedule+0x40/0xb0
<4>[16097.962364]  ? schedule_preempt_disabled+0xa/0x10
<4>[16097.962368]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16097.962460]  ? kick_requests+0x21/0x100 [ceph]
<4>[16097.962485]  ? ceph_mdsc_handle_mdsmap+0x19c/0x5f0 [ceph]
<4>[16097.962503]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16097.962523]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16097.962580]  ? dispatch+0x77/0x930 [libceph]
<4>[16097.962602]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16097.962619]  ? __switch_to_asm+0x40/0x70
<4>[16097.962623]  ? __switch_to_asm+0x34/0x70
<4>[16097.962627]  ? __switch_to_asm+0x40/0x70
<4>[16097.962631]  ? __switch_to_asm+0x34/0x70
<4>[16097.962635]  ? __switch_to_asm+0x40/0x70
<4>[16097.962654]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16097.962713]  ? process_one_work+0x1ad/0x370
<4>[16097.962717]  ? worker_thread+0x30/0x390
<4>[16097.962722]  ? create_worker+0x1a0/0x1a0
<4>[16097.962737]  ? kthread+0x112/0x130
<4>[16097.962742]  ? kthread_park+0x80/0x80
<4>[16097.962747]  ? ret_from_fork+0x35/0x40
<3>[16097.962758] INFO: task kworker/25:1:1747 blocked for more than 122 seconds.
<3>[16097.963233]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16097.963792] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16097.964298] kworker/25:1    D    0  1747      2 0x80004080
<6>[16097.964325] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16097.964331] Call Trace:
<4>[16097.964340]  ? __schedule+0x276/0x6e0
<4>[16097.964344]  ? schedule+0x40/0xb0
<4>[16097.964347]  ? schedule_preempt_disabled+0xa/0x10
<4>[16097.964351]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16097.964376]  ? handle_reply+0x33f/0x6f0 [ceph]
<4>[16097.964407]  ? dispatch+0xa6/0xbc0 [ceph]
<4>[16097.964429]  ? read_partial_message+0x214/0x770 [libceph]
<4>[16097.964449]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16097.964454]  ? __switch_to_asm+0x40/0x70
<4>[16097.964458]  ? __switch_to_asm+0x34/0x70
<4>[16097.964461]  ? __switch_to_asm+0x40/0x70
<4>[16097.964465]  ? __switch_to_asm+0x34/0x70
<4>[16097.964470]  ? __switch_to_asm+0x40/0x70
<4>[16097.964489]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16097.964494]  ? process_one_work+0x1ad/0x370
<4>[16097.964498]  ? worker_thread+0x30/0x390
<4>[16097.964501]  ? create_worker+0x1a0/0x1a0
<4>[16097.964506]  ? kthread+0x112/0x130
<4>[16097.964511]  ? kthread_park+0x80/0x80
<4>[16097.964516]  ? ret_from_fork+0x35/0x40
<3>[16097.964542] INFO: task kworker/27:2:25505 blocked for more than 122 seconds.
<3>[16097.965033]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16097.965521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16097.966053] kworker/27:2    D    0 25505      2 0x80004080
<6>[16097.966090] Workqueue: events delayed_work [ceph]
<4>[16097.966093] Call Trace:
<4>[16097.966100]  ? __schedule+0x276/0x6e0
<4>[16097.966104]  ? __switch_to_asm+0x40/0x70
<4>[16097.966107]  ? schedule+0x40/0xb0
<4>[16097.966137]  ? schedule_preempt_disabled+0xa/0x10
<4>[16097.966140]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16097.966144]  ? __switch_to_asm+0x40/0x70
<4>[16097.966148]  ? __switch_to_asm+0x40/0x70
<4>[16097.966152]  ? __switch_to_asm+0x34/0x70
<4>[16097.966176]  ? delayed_work+0x2e/0x290 [ceph]
<4>[16097.966181]  ? process_one_work+0x1ad/0x370
<4>[16097.966186]  ? worker_thread+0x30/0x390
<4>[16097.966189]  ? create_worker+0x1a0/0x1a0
<4>[16097.966194]  ? kthread+0x112/0x130
<4>[16097.966198]  ? kthread_park+0x80/0x80
<4>[16097.966203]  ? ret_from_fork+0x35/0x40
<3>[16220.835757] INFO: task kworker/18:1:32111 blocked for more than 245 seconds.
<3>[16220.836212]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16220.836682] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16220.837213] kworker/18:1    D    0 32111      2 0x80004080
<6>[16220.837253] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16220.837256] Call Trace:
<4>[16220.837270]  ? __schedule+0x276/0x6e0
<4>[16220.837274]  ? schedule+0x40/0xb0
<4>[16220.837277]  ? schedule_preempt_disabled+0xa/0x10
<4>[16220.837281]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16220.837310]  ? kick_requests+0x21/0x100 [ceph]
<4>[16220.837335]  ? ceph_mdsc_handle_mdsmap+0x19c/0x5f0 [ceph]
<4>[16220.837372]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16220.837386]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16220.837409]  ? dispatch+0x77/0x930 [libceph]
<4>[16220.837429]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16220.837437]  ? __switch_to_asm+0x40/0x70
<4>[16220.837441]  ? __switch_to_asm+0x34/0x70
<4>[16220.837444]  ? __switch_to_asm+0x40/0x70
<4>[16220.837448]  ? __switch_to_asm+0x34/0x70
<4>[16220.837452]  ? __switch_to_asm+0x40/0x70
<4>[16220.837472]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16220.837478]  ? process_one_work+0x1ad/0x370
<4>[16220.837482]  ? worker_thread+0x30/0x390
<4>[16220.837486]  ? create_worker+0x1a0/0x1a0
<4>[16220.837491]  ? kthread+0x112/0x130
<4>[16220.837496]  ? kthread_park+0x80/0x80
<4>[16220.837501]  ? ret_from_fork+0x35/0x40
<3>[16220.837516] INFO: task kworker/25:1:1747 blocked for more than 245 seconds.
<3>[16220.838157]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16220.838702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16220.839220] kworker/25:1    D    0  1747      2 0x80004080
<6>[16220.839248] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16220.839251] Call Trace:
<4>[16220.839258]  ? __schedule+0x276/0x6e0
<4>[16220.839262]  ? schedule+0x40/0xb0
<4>[16220.839266]  ? schedule_preempt_disabled+0xa/0x10
<4>[16220.839270]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16220.839296]  ? handle_reply+0x33f/0x6f0 [ceph]
<4>[16220.839318]  ? dispatch+0xa6/0xbc0 [ceph]
<4>[16220.839337]  ? read_partial_message+0x214/0x770 [libceph]
<4>[16220.839357]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16220.839363]  ? __switch_to_asm+0x40/0x70
<4>[16220.839367]  ? __switch_to_asm+0x34/0x70
<4>[16220.839371]  ? __switch_to_asm+0x40/0x70
<4>[16220.839375]  ? __switch_to_asm+0x34/0x70
<4>[16220.839378]  ? __switch_to_asm+0x40/0x70
<4>[16220.839397]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16220.839402]  ? process_one_work+0x1ad/0x370
<4>[16220.839406]  ? worker_thread+0x30/0x390
<4>[16220.839409]  ? create_worker+0x1a0/0x1a0
<4>[16220.839414]  ? kthread+0x112/0x130
<4>[16220.839418]  ? kthread_park+0x80/0x80
<4>[16220.839423]  ? ret_from_fork+0x35/0x40
<3>[16220.839436] INFO: task kworker/27:2:25505 blocked for more than 245 seconds.
<3>[16220.839982]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16220.840488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16220.841036] kworker/27:2    D    0 25505      2 0x80004080
<6>[16220.841066] Workqueue: events delayed_work [ceph]
<4>[16220.841075] Call Trace:
<4>[16220.841081]  ? __schedule+0x276/0x6e0
<4>[16220.841085]  ? __switch_to_asm+0x40/0x70
<4>[16220.841089]  ? schedule+0x40/0xb0
<4>[16220.841104]  ? schedule_preempt_disabled+0xa/0x10
<4>[16220.841114]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16220.841123]  ? __switch_to_asm+0x40/0x70
<4>[16220.841130]  ? __switch_to_asm+0x40/0x70
<4>[16220.841138]  ? __switch_to_asm+0x34/0x70
<4>[16220.841161]  ? delayed_work+0x2e/0x290 [ceph]
<4>[16220.841170]  ? process_one_work+0x1ad/0x370
<4>[16220.841174]  ? worker_thread+0x30/0x390
<4>[16220.841181]  ? create_worker+0x1a0/0x1a0
<4>[16220.841186]  ? kthread+0x112/0x130
<4>[16220.841194]  ? kthread_park+0x80/0x80
<4>[16220.841199]  ? ret_from_fork+0x35/0x40
<3>[16343.711131] INFO: task kworker/18:1:32111 blocked for more than 368 seconds.
<3>[16343.711610]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16343.712057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16343.712586] kworker/18:1    D    0 32111      2 0x80004080
<6>[16343.712642] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16343.712647] Call Trace:
<4>[16343.712713]  ? __schedule+0x276/0x6e0
<4>[16343.712718]  ? schedule+0x40/0xb0
<4>[16343.712721]  ? schedule_preempt_disabled+0xa/0x10
<4>[16343.712725]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16343.712758]  ? kick_requests+0x21/0x100 [ceph]
<4>[16343.712783]  ? ceph_mdsc_handle_mdsmap+0x19c/0x5f0 [ceph]
<4>[16343.712799]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16343.712813]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16343.712835]  ? dispatch+0x77/0x930 [libceph]
<4>[16343.712855]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16343.712865]  ? __switch_to_asm+0x40/0x70
<4>[16343.712869]  ? __switch_to_asm+0x34/0x70
<4>[16343.712873]  ? __switch_to_asm+0x40/0x70
<4>[16343.712876]  ? __switch_to_asm+0x34/0x70
<4>[16343.712880]  ? __switch_to_asm+0x40/0x70
<4>[16343.712899]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16343.712906]  ? process_one_work+0x1ad/0x370
<4>[16343.712909]  ? worker_thread+0x30/0x390
<4>[16343.712918]  ? create_worker+0x1a0/0x1a0
<4>[16343.712949]  ? kthread+0x112/0x130
<4>[16343.712954]  ? kthread_park+0x80/0x80
<4>[16343.712959]  ? ret_from_fork+0x35/0x40
<3>[16343.712970] INFO: task kworker/25:1:1747 blocked for more than 368 seconds.
<3>[16343.713489]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16343.714003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16343.714498] kworker/25:1    D    0  1747      2 0x80004080
<6>[16343.714536] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16343.714538] Call Trace:
<4>[16343.714543]  ? __schedule+0x276/0x6e0
<4>[16343.714547]  ? schedule+0x40/0xb0
<4>[16343.714550]  ? schedule_preempt_disabled+0xa/0x10
<4>[16343.714554]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16343.714580]  ? handle_reply+0x33f/0x6f0 [ceph]
<4>[16343.714604]  ? dispatch+0xa6/0xbc0 [ceph]
<4>[16343.714626]  ? read_partial_message+0x214/0x770 [libceph]
<4>[16343.714646]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16343.714651]  ? __switch_to_asm+0x40/0x70
<4>[16343.714654]  ? __switch_to_asm+0x34/0x70
<4>[16343.714658]  ? __switch_to_asm+0x40/0x70
<4>[16343.714662]  ? __switch_to_asm+0x34/0x70
<4>[16343.714665]  ? __switch_to_asm+0x40/0x70
<4>[16343.714685]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16343.714689]  ? process_one_work+0x1ad/0x370
<4>[16343.714693]  ? worker_thread+0x30/0x390
<4>[16343.714697]  ? create_worker+0x1a0/0x1a0
<4>[16343.714701]  ? kthread+0x112/0x130
<4>[16343.714706]  ? kthread_park+0x80/0x80
<4>[16343.714710]  ? ret_from_fork+0x35/0x40
<3>[16343.714724] INFO: task kworker/27:2:25505 blocked for more than 368 seconds.
<3>[16343.715235]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16343.715724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16343.716274] kworker/27:2    D    0 25505      2 0x80004080
<6>[16343.716304] Workqueue: events delayed_work [ceph]
<4>[16343.716311] Call Trace:
<4>[16343.716319]  ? __schedule+0x276/0x6e0
<4>[16343.716327]  ? __switch_to_asm+0x40/0x70
<4>[16343.716331]  ? schedule+0x40/0xb0
<4>[16343.716338]  ? schedule_preempt_disabled+0xa/0x10
<4>[16343.716342]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16343.716349]  ? __switch_to_asm+0x40/0x70
<4>[16343.716353]  ? __switch_to_asm+0x40/0x70
<4>[16343.716361]  ? __switch_to_asm+0x34/0x70
<4>[16343.716397]  ? delayed_work+0x2e/0x290 [ceph]
<4>[16343.716405]  ? process_one_work+0x1ad/0x370
<4>[16343.716409]  ? worker_thread+0x30/0x390
<4>[16343.716413]  ? create_worker+0x1a0/0x1a0
<4>[16343.716417]  ? kthread+0x112/0x130
<4>[16343.716422]  ? kthread_park+0x80/0x80
<4>[16343.716431]  ? ret_from_fork+0x35/0x40
<3>[16466.586567] INFO: task kworker/18:1:32111 blocked for more than 491 seconds.
<3>[16466.586820]       Tainted: G            E     5.7.0-rc5+ #80
<3>[16466.587063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[16466.587275] kworker/18:1    D    0 32111      2 0x80004080
<6>[16466.587311] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[16466.587318] Call Trace:
<4>[16466.587374]  ? __schedule+0x276/0x6e0
<4>[16466.587376]  ? schedule+0x40/0xb0
<4>[16466.587378]  ? schedule_preempt_disabled+0xa/0x10
<4>[16466.587380]  ? __mutex_lock.isra.8+0x2b5/0x4a0
<4>[16466.587407]  ? kick_requests+0x21/0x100 [ceph]
<4>[16466.587424]  ? ceph_mdsc_handle_mdsmap+0x19c/0x5f0 [ceph]
<4>[16466.587448]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16466.587461]  ? extra_mon_dispatch+0x34/0x40 [ceph]
<4>[16466.587472]  ? dispatch+0x77/0x930 [libceph]
<4>[16466.587489]  ? try_read+0x78b/0x11e0 [libceph]
<4>[16466.587499]  ? __switch_to_asm+0x40/0x70
<4>[16466.587537]  ? __switch_to_asm+0x34/0x70
<4>[16466.587545]  ? __switch_to_asm+0x40/0x70
<4>[16466.587553]  ? __switch_to_asm+0x34/0x70
<4>[16466.587561]  ? __switch_to_asm+0x40/0x70
<4>[16466.587576]  ? ceph_con_workfn+0x130/0x5e0 [libceph]
<4>[16466.587581]  ? process_one_work+0x1ad/0x370
<4>[16466.587583]  ? worker_thread+0x30/0x390
<4>[16466.587585]  ? create_worker+0x1a0/0x1a0
<4>[16466.587589]  ? kthread+0x112/0x130
<4>[16466.587591]  ? kthread_park+0x80/0x80
<4>[16466.587594]  ? ret_from_fork+0x35/0x40
Actions

Also available in: Atom PDF