Bug #50503
closedtasks/libcephsqlite throws "std::out_of_range"
0%
Description
rados/basic/{ceph clusters/{fixed-2 openstack} mon_election/classic msgr-failures/many msgr/async-v1only objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{rhel_8} tasks/libcephsqlite}
2021-04-23T19:54:40.360 INFO:tasks.workunit.client.0.smithi092.stderr:+ sqlite3 -cmd '.output /dev/null' -cmd '.load libcephsqlite.so' -cmd 'pragma journal_mode = PERSIST' -cmd '.open file:///cephsqlite:test_libcephsqlite.sh/baz.db?vfs=ceph' -cmd '.output stdout' 2021-04-23T19:55:00.357 INFO:tasks.workunit.client.0.smithi092.stdout:[1]+ 38807 Running sqlite3 -cmd '.output /dev/null' -cmd '.load libcephsqlite.so' -cmd 'pragma journal_mode = PERSIST' -cmd ".open file:///$pool:$ns/baz.db?vfs=ceph" -cmd '.output stdout ' <<< "$a" & 2021-04-23T19:55:00.358 INFO:tasks.workunit.client.0.smithi092.stderr:+ jobs -l 2021-04-23T19:55:00.359 INFO:tasks.workunit.client.0.smithi092.stderr:++ jobs -p 2021-04-23T19:55:00.360 INFO:tasks.workunit.client.0.smithi092.stderr:+ kill -STOP -- 38807 2021-04-23T19:55:00.360 INFO:tasks.workunit.client.0.smithi092.stderr:+ sleep 45 2021-04-23T19:55:45.362 INFO:tasks.workunit.client.0.smithi092.stderr:+ date 2021-04-23T19:55:45.362 INFO:tasks.workunit.client.0.smithi092.stdout:Fri Apr 23 19:55:45 UTC 2021 2021-04-23T19:55:45.363 INFO:tasks.workunit.client.0.smithi092.stderr:++ jobs -p 2021-04-23T19:55:45.363 INFO:tasks.workunit.client.0.smithi092.stderr:+ kill -CONT -- 38807 2021-04-23T19:55:45.363 INFO:tasks.workunit.client.0.smithi092.stderr:+ sleep 10 2021-04-23T19:55:45.373 INFO:tasks.workunit.client.0.smithi092.stderr:Error: near line 3: disk I/O error 2021-04-23T19:55:45.374 INFO:tasks.workunit.client.0.smithi092.stderr:terminate called after throwing an instance of 'std::out_of_range' 2021-04-23T19:55:45.374 INFO:tasks.workunit.client.0.smithi092.stderr: what(): map::at
/a/kchai-2021-04-23_14:12:53-rados-wip-kefu-testing-2021-04-23-2026-distro-basic-smithi/6068437
i reran the failed test in hope to reproduce it, but it didn't surface again: https://pulpito.ceph.com/kchai-2021-04-24_04:07:09-rados-wip-kefu-testing-2021-04-23-2026-distro-basic-smithi/6070020/
Updated by Kefu Chai about 3 years ago
Patrick, could you take a look? not sure how reproducible this issue is though.
Updated by Patrick Donnelly about 3 years ago
So the test "succeeded" in that we expect things errors but not an Abort/core dump:
2021-04-23T19:55:45.373 INFO:tasks.workunit.client.0.smithi092.stderr:Error: near line 3: disk I/O error 2021-04-23T19:55:45.374 INFO:tasks.workunit.client.0.smithi092.stderr:terminate called after throwing an instance of 'std::out_of_range' 2021-04-23T19:55:45.374 INFO:tasks.workunit.client.0.smithi092.stderr: what(): map::at 2021-04-23T19:55:55.365 INFO:tasks.workunit.client.0.smithi092.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_libcephsqlite.sh: line 129: 38807 Aborted (core dumped) sqlite3 -cmd '.output /dev/null' -cmd '.load libcephsqlite.so' -cmd 'pragma journal_mode = PERSIST' -cmd ".open file:///$pool:$ns/baz.db?vfs=ceph" -cmd '.output stdout' <<< "$a"
The client log has:
2021-04-23T19:55:45.370+0000 7f02c77fe700 -1 client.4614: SimpleRADOSStriper: lock_keeper_main: baz.db: lock renewal failed: (2) No such file or directory 2021-04-23T19:55:45.370+0000 7f02edf02740 1 -- 172.21.15.92:0/1739936527 --> v1:172.21.15.92:6812/33373 -- osd_op(unknown.0.0:16177 3.19 3:9d4380ad:test_libcephsqlite.sh::baz.db.000000000000000f:head [write 2797568~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e21) v8 -- 0x55d703bef830 con 0x7f02cc073350 2021-04-23T19:55:45.370+0000 7f02edf02740 10 client.4614: SimpleRADOSStriper: write: baz.db: dirty size: 65716224 2021-04-23T19:55:45.370+0000 7f02edf02740 5 cephsqlite: Write: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] = 4096 2021-04-23T19:55:45.370+0000 7f02edf02740 5 cephsqlite: Write: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 65716224~4096 2021-04-23T19:55:45.370+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: write: baz.db: 65716224~4096 2021-04-23T19:55:45.370+0000 7f02edf02740 5 cephsqlite: Write: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] write failed: (108) Cannot send after transport endpoint shutdown 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 <== mon.1 v1:172.21.15.194:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 2172502797 0 0) 0x7f02bc010080 con 0x55d7034d36e0 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 --> v1:172.21.15.194:6789/0 -- auth(proto 2 148 bytes epoch 0) v1 -- 0x55d7034cd110 con 0x55d7034d36e0 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: DeviceCharacteristics: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Close: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db-journal] 2021-04-23T19:55:45.371+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: ~SimpleRADOSStriper: baz.db-journal: 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Unlock: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 0 2021-04-23T19:55:45.371+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: unlock: baz.db: 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Unlock: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] failed: -108 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 <== mon.1 v1:172.21.15.194:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 249+0+0 (unknown 304898747 0 0) 0x55d7034cd110 con 0x55d7034d36e0 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 >> v1:172.21.15.92:6790/0 conn(0x55d7034d6ff0 legacy=0x7f02bc0053e0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 >> v1:172.21.15.92:6789/0 conn(0x7f02bc007020 legacy=0x7f02bc0094b0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 --> v1:172.21.15.194:6789/0 -- mon_subscribe({config=0+,mgrmap=0+,monmap=2+,osdmap=22}) v3 -- 0x7f02bc012620 con 0x55d7034d36e0 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 <== mon.1 v1:172.21.15.194:6789/0 3 ==== mon_map magic: 0 v1 ==== 308+0+0 (unknown 2178755549 0 0) 0x7f02c8001400 con 0x55d7034d36e0 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: FileControl: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 20, 0x7ffc1ef2a684 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: DeviceCharacteristics: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Unlock: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 0 2021-04-23T19:55:45.371+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: unlock: baz.db: 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Unlock: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] failed: -108 2021-04-23T19:55:45.371+0000 7f02edf02740 5 cephsqlite: Close: (client.4614) [cephsqlite:test_libcephsqlite.sh/baz.db] 2021-04-23T19:55:45.371+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: ~SimpleRADOSStriper: baz.db: 2021-04-23T19:55:45.371+0000 7f02edf02740 5 client.4614: SimpleRADOSStriper: unlock: baz.db: 2021-04-23T19:55:45.371+0000 7f02d5ffb700 1 -- 172.21.15.92:0/1739936527 <== mon.1 v1:172.21.15.194:6789/0 4 ==== config(0 keys) v1 ==== 4+0+0 (unknown 0 0 0) 0x7f02c80057c0 con 0x55d7034d36e0
So the lock keeper noticed it lost the lock and "self blocklists". The weird thing here is libcephsqlite.cc/SimpleRADOSStriper.* do not use std::map so this failure must be in librados somewhere? Unfortunately I suspect sqlite3 installs a signal handler for SIGABRT so we don't get the typical dump in the client log. Hard to tell...
Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Fix Under Review
- Target version set to v17.0.0
- Source set to Q/A
- Backport set to pacific
- Pull request ID set to 42035
Probably fixed by this PR.
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #51372: pacific: libcephsqlite: segmentation fault added
Updated by Kefu Chai almost 3 years ago
- Related to deleted (Bug #51372: pacific: libcephsqlite: segmentation fault)
Updated by Kefu Chai almost 3 years ago
- Has duplicate Bug #51372: pacific: libcephsqlite: segmentation fault added
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51431: pacific: tasks/libcephsqlite throws "std::out_of_range" added
Updated by Patrick Donnelly about 1 year ago
- Status changed from Pending Backport to Resolved