Project

General

Profile

Actions

Bug #56115

open

Crimson: OSD crashed when random write with 4 OSDs

Added by Jianxin Li almost 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

INFO  2022-06-17 16:30:17,010 [shard 0] osd - osd.3: now active
INFO  2022-06-17 16:30:17,091 [shard 0] ms - [osd.3(client) v2:10.239.241.50:6813/2554329 >> client.4144 10.239.241.50:0/381747147@48482] execute_ready(): fault at READY on lossy channel, going to CLOSING -- std::system_error (error crimson::net:4, read eof)
INFO  2022-06-17 16:30:17,091 [shard 0] ms - [osd.3(client) v2:10.239.241.50:6813/2554329 >> client.4144 10.239.241.50:0/381747147@48482] closing: reset yes, replace no
WARN  2022-06-17 16:30:17,091 [shard 0] osd - ms_handle_reset
INFO  2022-06-17 16:30:17,236 [shard 0] osd - on_activate_complete: requesting backfill
INFO  2022-06-17 16:30:17,237 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped exit Started/Primary/Active/Activating 0.235579 5 0.006484
INFO  2022-06-17 16:30:17,237 [shard 0] osd - Exiting state: Started/Primary/Active/Activating, entered at 1655454617.0021074, 0.006484804 spent on 5 events
INFO  2022-06-17 16:30:17,237 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped enter Started/Primary/Active/WaitLocalBackfillReserved
INFO  2022-06-17 16:30:17,237 [shard 0] osd - Entering state: Started/Primary/Active/WaitLocalBackfillReserved
INFO  2022-06-17 16:30:17,237 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped+backfill_wait exit Started/Primary/Active/WaitLocalBackfillReserved 0.000051 1 0.000079
INFO  2022-06-17 16:30:17,237 [shard 0] osd - Exiting state: Started/Primary/Active/WaitLocalBackfillReserved, entered at 1655454617.2377176, 7.9777e-05 spent on 1 events
INFO  2022-06-17 16:30:17,237 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped+backfill_wait enter Started/Primary/Active/WaitRemoteBackfillReserved
INFO  2022-06-17 16:30:17,237 [shard 0] osd - Entering state: Started/Primary/Active/WaitRemoteBackfillReserved
INFO  2022-06-17 16:30:17,241 [shard 0] osd - on_activate_complete: requesting backfill
INFO  2022-06-17 16:30:17,242 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped+backfill_wait exit Started/Primary/Active/WaitRemoteBackfillReserved 0.004347 1 0.000030
INFO  2022-06-17 16:30:17,242 [shard 0] osd - Exiting state: Started/Primary/Active/WaitRemoteBackfillReserved, entered at 1655454617.2377834, 3.0454e-05 spent on 1 events
INFO  2022-06-17 16:30:17,242 [shard 0] osd -  pg_epoch 18 pg[1.24( v 15'8229 (15'2100,15'8229] local-lis/les=17/18 n=8229 ec=8/8 lis/c=17/14 les/c/f=18/15/0 sis=17) [2]/[3] backfill=[2] r=0 lpr=17 pi=[14,17)/1 crt=15'8229 mlcod 0'0 active+remapped+backfill_wait enter Started/Primary/Active/Backfilling
INFO  2022-06-17 16:30:17,242 [shard 0] osd - Entering state: Started/Primary/Active/Backfilling
WARN  2022-06-17 16:30:17,242 [shard 0] osd - Enqueuing: bi is old, rescanning of local backfill_info
ERROR 2022-06-17 16:30:17,242 [shard 0] osd - load_metadata: object -1:30f0e826:::osdmap.13:0 present but missing object info
ERROR 2022-06-17 16:30:17,242 [shard 0] none - ../src/crimson/common/errorator.h:837 : In function 'void crimson::errorator<AllowedErrors>::assert_all::operator()(ErrorT&&) [with ErrorT = const crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<84>))>; typename std::enable_if<contains_once_v<typename std::decay<_Functor>::type>, crimson::errorator<AllowedErrors>::Enabler>::type ...<anonymous> = {}; AllowedErrors = {crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<84>))>}]', abort(%s)
abort() called
INFO  2022-06-20 11:03:38,348 [shard 0] osd - osd.2: now active
INFO  2022-06-20 11:03:38,377 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.091261864s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6515 mlcod 0'0 remapped NOTIFY pruub 68.149230957s@ exit Started/Stray 1.029334 5 2022-06-20T11:03:37.348534+0800
INFO  2022-06-20 11:03:38,377 [shard 0] osd - Exiting state: Started/Stray, entered at 1655694217.3480396, 1655694217.348534 spent on 5 events
INFO  2022-06-20 11:03:38,377 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.091261864s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6515 mlcod 0'0 remapped NOTIFY pruub 68.149230957s@ enter Started/ReplicaActive
INFO  2022-06-20 11:03:38,377 [shard 0] osd - Entering state: Started/ReplicaActive
INFO  2022-06-20 11:03:38,377 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.091261864s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6515 mlcod 0'0 remapped NOTIFY pruub 68.149230957s@ enter Started/ReplicaActive/RepNotRecovering
INFO  2022-06-20 11:03:38,377 [shard 0] osd - Entering state: Started/ReplicaActive/RepNotRecovering
WARN  2022-06-20 11:03:38,643 [shard 0] io - oversized request (length 8118272 > 4194304) submitted.
INFO  2022-06-20 11:03:38,661 [shard 0] osd -  pg_epoch 15 pg[1.33( v 12'6488 lc 0'0 (0'0,12'6488] local-lis/les=0/0 n=6488 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.093381882s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6488 mlcod 0'0 remapped NOTIFY pruub 68.151412964s@ exit Started/Stray 1.313261 5 2022-06-20T11:03:37.348970+0800
INFO  2022-06-20 11:03:38,661 [shard 0] osd - Exiting state: Started/Stray, entered at 1655694217.3485901, 1655694217.3489702 spent on 5 events
INFO  2022-06-20 11:03:38,661 [shard 0] osd -  pg_epoch 15 pg[1.33( v 12'6488 lc 0'0 (0'0,12'6488] local-lis/les=0/0 n=6488 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.093381882s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6488 mlcod 0'0 remapped NOTIFY pruub 68.151412964s@ enter Started/ReplicaActive
INFO  2022-06-20 11:03:38,661 [shard 0] osd - Entering state: Started/ReplicaActive
INFO  2022-06-20 11:03:38,661 [shard 0] osd -  pg_epoch 15 pg[1.33( v 12'6488 lc 0'0 (0'0,12'6488] local-lis/les=0/0 n=6488 ec=8/8 lis/c=10/10 les/c/f=11/11/0 sis=14 pruub=8.093381882s) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 crt=12'6488 mlcod 0'0 remapped NOTIFY pruub 68.151412964s@ enter Started/ReplicaActive/RepNotRecovering
INFO  2022-06-20 11:03:38,661 [shard 0] osd - Entering state: Started/ReplicaActive/RepNotRecovering
INFO  2022-06-20 11:03:38,923 [shard 0] ms - [osd.2(client) v2:10.239.241.50:6809/2986987 >> client.4144 10.239.241.50:0/1677218398@50320] execute_ready(): fault at READY on lossy channel, going to CLOSING -- std::system_error (error crimson::net:4, read eof)
INFO  2022-06-20 11:03:38,923 [shard 0] ms - [osd.2(client) v2:10.239.241.50:6809/2986987 >> client.4144 10.239.241.50:0/1677218398@50320] closing: reset yes, replace no
WARN  2022-06-20 11:03:38,923 [shard 0] osd - ms_handle_reset
INFO  2022-06-20 11:03:38,951 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=14/10 les/c/f=15/11/0 sis=14) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 luod=0'0 crt=12'6515 mlcod 0'0 active+remapped exit Started/ReplicaActive/RepNotRecovering 0.574542 3 0.011799
INFO  2022-06-20 11:03:38,952 [shard 0] osd - Exiting state: Started/ReplicaActive/RepNotRecovering, entered at 1655694218.3774402, 0.011799805 spent on 3 events
INFO  2022-06-20 11:03:38,952 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=14/10 les/c/f=15/11/0 sis=14) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 luod=0'0 crt=12'6515 mlcod 0'0 active+remapped enter Started/ReplicaActive/RepWaitRecoveryReserved
INFO  2022-06-20 11:03:38,952 [shard 0] osd - Entering state: Started/ReplicaActive/RepWaitRecoveryReserved
INFO  2022-06-20 11:03:38,952 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=14/10 les/c/f=15/11/0 sis=14) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 luod=0'0 crt=12'6515 mlcod 0'0 active+remapped exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000012 1 0.000048
INFO  2022-06-20 11:03:38,952 [shard 0] osd - Exiting state: Started/ReplicaActive/RepWaitRecoveryReserved, entered at 1655694218.9520128, 4.8621e-05 spent on 1 events
INFO  2022-06-20 11:03:38,952 [shard 0] osd -  pg_epoch 15 pg[1.21( v 12'6515 lc 0'0 (0'0,12'6515] local-lis/les=0/0 n=6515 ec=8/8 lis/c=14/10 les/c/f=15/11/0 sis=14) [2]/[3] r=-1 lpr=14 pi=[8,14)/1 luod=0'0 crt=12'6515 mlcod 0'0 active+remapped enter Started/ReplicaActive/RepRecovering
INFO  2022-06-20 11:03:38,952 [shard 0] osd - Entering state: Started/ReplicaActive/RepRecovering
ERROR 2022-06-20 11:03:39,002 [shard 0] seastore - SeaStore::_do_transaction_step: bad op 39
ERROR 2022-06-20 11:03:39,002 [shard 0] seastore - SeaStore::on_error:  transaction dump:

ERROR 2022-06-20 11:03:39,002 [shard 0] seastore - SeaStore::on_error: {
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "1.21_head",
            "oid": "#1:84d6a16b:::benchmark_data_liangdev2_2987174_object7:head#" 
        },

Crimson SeaStore, run 4 OSD on same device(different partition)
Rados bench random write, 1 OSD crashed after writing 50s.
2 kinds of logs above facing same situation.

No data to display

Actions

Also available in: Atom PDF