Project

General

Profile

Bug #41618

14.2.1->14.2.2 ceph-mgr hard segfault. devicehealth?

Added by Harry Coin 7 months ago. Updated 6 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
% Done:

0%

Source:
Tags:
ceph-mgr segfault
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hard segfault on ceph-mgr at load.
"Thread 33 "devicehealth" received signal SIGSEGV, Segmentation fault."
Suspect possibly related to either:
one pg_num change from 4 to 16 on a metadata volume.
or
turning on disk health checking.

either way, in a kvm VM running linux 5.2.0-15-generic #16-Ubuntu SMP Fri Aug 23 20:16:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux, vm set to conroe/core2 dual processor, latest ubuntu/eoan, ceph 14.2.2
See attached gdb log, with debug symbols and backtraces.

100% reproducible. Kills ceph-mgr.

ceph-mgr-crashlog.txt View (114 KB) Harry Coin, 09/03/2019 03:58 PM

ceph-mgr-crashlog.txt View (237 KB) Harry Coin, 09/03/2019 05:36 PM

log.txt View (67 KB) Harry Coin, 09/04/2019 12:20 AM


Related issues

Duplicates RADOS - Bug #42082: pybind/rados: set_omap() crash on py3 Resolved 09/27/2019

History

#1 Updated by Harry Coin 7 months ago

Here attached is another run showing the same failure, but this time with src available and all the python3.7 debug/dev symbols as well.

#2 Updated by Harry Coin 7 months ago

The only other difference of note that happened just before the manager failure was discovered was:

Prior, there were a few osds that didn't have the ceph-bluestore-repair done in order to enable the new finer grain correct df. The repair process had just been completed on all the osds. The warning that some of the osds needed updating was no longer showing.

Here are the most immediate changes:

1) exploring device health monitoring.
2) Just prior to that was changing a pg group number from 4 to 16 I think maybe 64 on a metadata volume.
3) Just prior to that was running the bluestore repair on the last of the osds to get the more accurate df.

#3 Updated by Harry Coin 7 months ago

WORKAROUND:
The command
ceph config-key set config/mgr/mgr/devicehealth/enable_monitoring false
allowed normal operation of ceph-mgr but without the device health support.

#4 Updated by Harry Coin 7 months ago

Would someone that knows how change the severity of this to 'normal' now that there is a workaround?

#5 Updated by Brad Hubbard 7 months ago

  • Severity changed from 1 - critical to 3 - minor

Done. Would you be able to upload a coredump please (use ceph-post-file if it's large)? Assuming you are using binaries from download.ceph.com. Forgive my ignorance but which Ubuntu release is this? Xenial?

#6 Updated by Harry Coin 7 months ago

Canonical calls it 'eoan', it's their next point release. Presently somewhere between what Debian would call 'stable' and 'testing'. I compiled it from their source 14.2.2_ubuntu2. They since went up a rev disabling spdk to relieve the core-i7 dependency and fixing the py3 ceph-crash bug ('r'->'rb' in open(...)) I'll revert to the release I posted all the above in and change the config to generate a core dump.

#7 Updated by Harry Coin 7 months ago

See the zipped core.gz via gunzip core.gz at ceph-post-file: 7b0b4def-46e0-43a2-b819-0aafffd57b25

It was generated by the commands below. I started this instance, then I stopped the ceph-mgr service on the other nodes until crashy one woke up...which then crashed generating the 843M core.

Shell log of the run that generated the core attached.

#8 Updated by Brad Hubbard 7 months ago

Hmmm... the problem here is that I need a binary compatible environment in which to debug the core dump (IOW I need all of the same binaries that were in use when the core dump was produced plus the relevant debuginfo packages). Is this OS generally available as a container image?

#9 Updated by Brad Hubbard 7 months ago

My immediate suggestion would be to add debugging to the put_device_metrics function in src/pybind/mgr/devicehealth/module.py to see how sane omap calls are.

I'd also be inclined to reproduce this and capture a log with '--debug_mgr 20' as that should include logging from the module itself.

#10 Updated by Brad Hubbard 7 months ago

OK, I guess this is just Ubuntu 19. The use of "Canonical" threw me :P

#11 Updated by Brad Hubbard 7 months ago

From the core file.

$ eu-unstrip -n --core core |head -1
0x564f55435000+0x3f7000 6da3a059f4257d5fb849cbe8abfc251da7df4b21@0x564f554353a0 - - /usr/bin/ceph-mgr

ceph-mgr from ceph-mgr_14.2.2-0ubuntu2_amd64.deb

# eu-unstrip -n -e ./usr/bin/ceph-mgr
0+0x3f6698 c14b0e797e7887cbabe1b5b6979004fb423559c4@0x3a0 ./usr/bin/ceph-mgr -

ceph-mgr from ceph-mgr_14.2.2-0ubuntu3_amd64.deb

# eu-unstrip -n -e ./usr/bin/ceph-mgr
0+0x3f6698 47007a165b5fb8d9926325c36dd1302d722cf0ca@0x3a0 ./usr/bin/ceph-mgr -

I need the exact same binaries to be able to debug this.

#12 Updated by Harry Coin 7 months ago

Here's the ceph-mgr that generate it.
ceph-post-file: 28fc3ac1-9918-40eb-93fb-2fe88cba6bb6

#13 Updated by Harry Coin 7 months ago

You can build it easily enough yourself by taking an install of the latest ubuntu 'eoan' then:

apt-get build-dep ceph
apt-get source ceph
cd ceph
dpkg-buildpackage -b --no-sign
cd ..
dpkg -i *.deb *.ddeb

#14 Updated by Brad Hubbard 7 months ago

gdb, etc. use the buildid of the binary/core to look for the relevant debuginfo information. The buildid is unique to a specific build. I'll see what I can get using the ceph-mgr binary you uploaded (is it stripped?) but that's really only a small part of the picture. Do you only see this crash with the version you built?

#15 Updated by Harry Coin 7 months ago

I just installed the latest 14.2.2 binary from ubuntu, it disables spdk to remove the corei7 dependency and fixes the ceph_crash crash bugs. It behaves exactly like the above. It works in all other respects. When run
ceph config-key set config/mgr/mgr/devicehealth/enable_monitoring false
normal operations recur.

So, the answer is no, it's not unique to the binaries I built from the downloaded sources.

#16 Updated by Brad Hubbard 7 months ago

So does that mean you can provide a core dump produced by a binary that the packages and debuginfo are easily available for? That would make the debugging process less painful although it's looking increasingly likely that the problem is in the devicehealth python module which is not my area. I'm just trying to confirm the rados function was passed bad arguments by the python code.

#17 Updated by Harry Coin 7 months ago

The core file here:
ceph-post-file: 323c64e2-fee3-4932-bd00-eee0af46897f
was generated by the version of ceph now available as a binary from ubuntu v eoan.
Specifically: ceph-mgr/eoan,now 14.2.2-0ubuntu3 amd64

#18 Updated by Brad Hubbard 7 months ago

Thanks, that makes things much easier.

Core was generated by `/usr/bin/ceph-mgr -d --cluster ceph --id nocsupport4 --setuser ceph --setgroup'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fe61b7f0700 (LWP 2571))]
(gdb) bt 7
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000055ddf8a8dd1e in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:326
#2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007fe62851f368 in _rados_write_op_omap_set (write_op=0x55ddff0d1020, keys=keys@entry=0x55ddfb443678, vals=vals@entry=0x0, lens=lens@entry=0x55ddfb443620, num=num@entry=1) at ./src/librados/librados_c.cc:3308
#5  0x00007fe6286d6a85 in __pyx_pf_5rados_5Ioctx_100set_omap (__pyx_v_self=<optimized out>, __pyx_v_values=<optimized out>, __pyx_v_keys=<optimized out>, __pyx_v_write_op=<optimized out>) at ./obj-x86_64-linux-gnu/src/pybind/rados3/pyrex/rados.c:45051
#6  __pyx_pw_5rados_5Ioctx_101set_omap (__pyx_v_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at ./obj-x86_64-linux-gnu/src/pybind/rados3/pyrex/rados.c:44785
(More stack frames follow...)
(gdb) f 4
#4  0x00007fe62851f368 in _rados_write_op_omap_set (write_op=0x55ddff0d1020, keys=keys@entry=0x55ddfb443678, vals=vals@entry=0x0, lens=lens@entry=0x55ddfb443620, num=num@entry=1) at ./src/librados/librados_c.cc:3308
3308        bl.append(vals[i], lens[i]);
(gdb) l
3303      tracepoint(librados, rados_write_op_omap_set_enter, write_op, num);
3304      std::map<std::string, bufferlist> entries;
3305      for (size_t i = 0; i < num; ++i) {
3306        tracepoint(librados, rados_write_op_omap_set_entry, keys[i], vals[i], lens[i]);
3307        bufferlist bl(lens[i]);
3308        bl.append(vals[i], lens[i]);
3309        entries[keys[i]] = bl;
3310      }
3311      ((::ObjectOperation *)write_op)->omap_set(entries);
3312      tracepoint(librados, rados_write_op_omap_set_exit);

So we are indexing into 'vals' and 'lens' so we can append to the buffer. That makes sense given the instruction we faulted on.

(gdb) x/i $pc
=> 0x7fe62851f368 <_rados_write_op_omap_set(rados_write_op_t, char const* const*, char const* const*, size_t const*, size_t)+312>:      mov    (%rax,%r14,8),%rsi
(gdb) i r rax r14
rax            0x0                 0
r14            0x0                 0

The assembly is saying start at the address in %r14 and index %rax times in increments of 8 (sizeof(char*)).

(gdb) p i
$2 = 0
(gdb) p lens[i]
$3 = 13997
(gdb) p vals[i]
Cannot access memory at address 0x0

So 'vals', which was passed in to the function, is invalid which we can also see from frame 5.

(gdb) 
#5  0x00007fe6286d6a85 in __pyx_pf_5rados_5Ioctx_100set_omap (__pyx_v_self=<optimized out>, __pyx_v_values=<optimized out>, __pyx_v_keys=<optimized out>, __pyx_v_write_op=<optimized out>) at ./obj-x86_64-linux-gnu/src/pybind/rados3/pyrex/rados.c:45051
45051             rados_write_op_omap_set(__pyx_v__write_op->write_op, ((char const **)__pyx_v__keys), ((char const **)__pyx_v__values), ((size_t const *)__pyx_v__lens), __pyx_v_key_num);
(gdb) p __pyx_v__keys
$1 = (char **) 0x55ddfb443678
(gdb) p *__pyx_v__keys
$2 = 0x7fe6187927d0 "20190905-021608" 
(gdb) p *__pyx_v__values
Cannot access memory at address 0x0

If we look at the code for the 'devicehealth' module.

src/pybind/mgr/devicehealth/module.py:
385     def put_device_metrics(self, ioctx, devid, data):
...
410         key = datetime.utcnow().strftime(TIME_FORMAT)                                                                                                                                                                                    
411         self.log.debug('put_device_metrics device %s key %s = %s, erase %s' %                                                                                                                                                            
412                        (devid, key, data, erase))                                                                                                                                                                                        
413         with rados.WriteOpCtx() as op:                                                                                                                                                                                                   
414             ioctx.set_omap(op, (key,), (str(json.dumps(data)),))

Seems like line 414 is the only candidate. This is why I made the suggestions in comment 9 and they remain the logical next steps IMHO. As I'd like to get an understanding of what str(json.dumps(data)) evaluates to. In the meantime I'll see if I can reproduce the behaviour with some minimal test code.

#19 Updated by Harry Coin 7 months ago

I re-ran it with full debug logging. I added debug prints to the above, but the code never got that far. The debug log and new core file is here:
ceph-post-file: 399b946f-6b48-4dc5-b2fc-abc7341e09ad

#20 Updated by Harry Coin 7 months ago

I'm suspicious of this bit:

2019-09-05 09:04:54.512 7fd79b796700 5 --2- [fc00:1002:c7::b4]:0/3036297886 >> [v2:[fc00:1002:c7::42]:6808/4984,v1:[fc00:1002:c7::42]:6809/4984] conn(0x5627bfe47a80 0x5627bf852100 crc :-1 s=READ_MESSAGE_COMPLETE pgs=280 cs=0 l=1 rx=0 tx=0).handle_message received message m=0x5627bfb85440 seq=1 from=osd.9 type=43 osd_op_reply(1 ST1000DM003-1ER162_Z4YA57ZB [omap-get-keys] v0'0 uv0 ondisk = 2 ((2) No such file or directory)) v8
2019-09-05 09:04:54.512 7fd79b796700 1 -
[fc00:1002:c7::b4]:0/3036297886 <== osd.9 v2:[fc00:1002:c7::42]:6808/4984 1 ==== osd_op_reply(1 ST1000DM003-1ER162_Z4YA57ZB [omap-get-keys] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 171+0+0 (crc 0 0 0) 0x5627bfb85440 con 0x5627bfe47a80

#21 Updated by Harry Coin 7 months ago

Changing module.py for devicehealth to:
with rados.WriteOpCtx() as op:
+print("Data: "+str(data))
+print("JSON Data: "+str(json.dumps(data)))
ioctx.set_omap(op, (key,), (str(json.dumps(data)),))
if len(erase):
ioctx.remove_omap_keys(op, tuple(erase))
ioctx.operate_write_op(op, devid)

Gives:
ceph-post-file: 3f2d7352-7dc7-4c0d-a77e-541972bc4b73

#22 Updated by Harry Coin 7 months ago

Last line (wrapped badly in the copy/paste but ok in the file):

JSON Data: {"ata_sct_capabilities": {"data_table_supported": false, "error_recovery_control_supported": false, "feature_control_supported": false, "value": 4229}, "ata_smart_attributes": {"revision": 10, "table": [{"flags": {
"auto_keep": false, "error_rate": true, "event_count": false, "performance": true, "prefailure": true, "string": "POSR-- ", "updated_online": true, "value": 15}, "id": 1, "name": "Raw_Read_Error_Rate", "raw": {"string": "2174
89112", "value": 217489112}, "thresh": 6, "value": 119, "when_failed": "", "worst": 99}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": false, "performance": false, "prefailure": true, "string": "PO---- ",
"updated_online": true, "value": 3}, "id": 3, "name": "Spin_Up_Time", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 98, "when_failed": "", "worst": 97}, {"flags": {"auto_keep": true, "error_rate": false, "event_c
ount": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 4, "name": "Start_Stop_Count", "raw": {"string": "3116", "value": 3116}, "thresh": 20, "value": 97, "whe
n_failed": "", "worst": 97}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": true, "string": "PO--CK ", "updated_online": true, "value": 51}, "id": 5, "name": "Reall
ocated_Sector_Ct", "raw": {"string": "0", "value": 0}, "thresh": 10, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": false, "error_rate": true, "event_count": false, "performance": true, "prefailure":
true, "string": "POSR-- ", "updated_online": true, "value": 15}, "id": 7, "name": "Seek_Error_Rate", "raw": {"string": "508210978", "value": 508210978}, "thresh": 30, "value": 87, "when_failed": "", "worst": 60}, {"flags": {"
auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 9, "name": "Power_On_Hours", "raw": {"string": "29664", "
value": 29664}, "thresh": 0, "value": 67, "when_failed": "", "worst": 67}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": true, "performance": false, "prefailure": true, "string": "PO--C- ", "updated_onlin
e": true, "value": 19}, "id": 10, "name": "Spin_Retry_Count", "raw": {"string": "0", "value": 0}, "thresh": 97, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count":
true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 12, "name": "Power_Cycle_Count", "raw": {"string": "41", "value": 41}, "thresh": 20, "value": 100, "when_faile
d": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 183, "name": "Runtime_
Bad_Block", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "
string": "-O--CK ", "updated_online": true, "value": 50}, "id": 184, "name": "End-to-End_Error", "raw": {"string": "0", "value": 0}, "thresh": 99, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true,
"error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 187, "name": "Reported_Uncorrect", "raw": {"string": "0", "value": 0}, "thr
esh": 0, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value":
50}, "id": 188, "name": "Command_Timeout", "raw": {"string": "4 4 4", "value": 17180131332}, "thresh": 0, "value": 100, "when_failed": "", "worst": 99}, {"flags": {"auto_keep": true, "error_rate": true, "event_count": true,
"performance": false, "prefailure": false, "string": "-O-RCK ", "updated_online": true, "value": 58}, "id": 189, "name": "High_Fly_Writes", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 100, "when_failed": "", "wo
rst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": false, "performance": false, "prefailure": false, "string": "-O---K ", "updated_online": true, "value": 34}, "id": 190, "name": "Airflow_Temperatur
e_Cel", "raw": {"string": "26 (Min/Max 24/29)", "value": 488112154}, "thresh": 45, "value": 74, "when_failed": "", "worst": 66}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "p
refailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 191, "name": "G-Sense_Error_Rate", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updated_online": true, "value": 50}, "id": 192, "name": "Power-Off_Retract_Count", "raw": {"strin
g": "4", "value": 4}, "thresh": 0, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--CK ", "updat
ed_online": true, "value": 50}, "id": 193, "name": "Load_Cycle_Count", "raw": {"string": "25030", "value": 25030}, "thresh": 0, "value": 88, "when_failed": "", "worst": 88}, {"flags": {"auto_keep": true, "error_rate": false,
"event_count": false, "performance": false, "prefailure": false, "string": "-O---K ", "updated_online": true, "value": 34}, "id": 194, "name": "Temperature_Celsius", "raw": {"string": "26 (0 17 0 0 0)", "value": 73014444058},
"thresh": 0, "value": 26, "when_failed": "", "worst": 40}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": true, "performance": false, "prefailure": false, "string": "-O--C- ", "updated_online": true, "val
ue": 18}, "id": 197, "name": "Current_Pending_Sector", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 100, "when_failed": "", "worst": 100}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": true,
"performance": false, "prefailure": false, "string": "----C- ", "updated_online": false, "value": 16}, "id": 198, "name": "Offline_Uncorrectable", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 100, "when_failed":
"", "worst": 100}, {"flags": {"auto_keep": true, "error_rate": true, "event_count": true, "performance": true, "prefailure": false, "string": "-OSRCK ", "updated_online": true, "value": 62}, "id": 199, "name": "UDMA_CRC_Error
_Count", "raw": {"string": "0", "value": 0}, "thresh": 0, "value": 200, "when_failed": "", "worst": 200}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": false, "performance": false, "prefailure": false, "s
tring": "------ ", "updated_online": false, "value": 0}, "id": 240, "name": "Head_Flying_Hours", "raw": {"string": "29454h+21m+28.929s", "value": 5535907901895438}, "thresh": 0, "value": 100, "when_failed": "", "worst": 253}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": false, "performance": false, "prefailure": false, "string": "------ ", "updated_online": false, "value": 0}, "id": 241, "name": "Total_LBAs_Written", "raw": {"string": "62634320021", "value": 62634320021}, "thresh": 0, "value": 100, "when_failed": "", "worst": 253}, {"flags": {"auto_keep": false, "error_rate": false, "event_count": false, "performance": false, "prefailure": false
, "string": "------ ", "updated_online": false, "value": 0}, "id": 242, "name": "Total_LBAs_Read", "raw": {"string": "8326007212329", "value": 8326007212329}, "thresh": 0, "value": 100, "when_failed": "", "worst": 253}]}, "at
a_smart_data": {"capabilities": {"attribute_autosave_enabled": true, "conveyance_self_test_supported": true, "error_logging_supported": true, "exec_offline_immediate_supported": true, "gp_logging_supported": true, "offline_is
_aborted_upon_new_cmd": false, "offline_surface_scan_supported": true, "selective_self_test_supported": true, "self_tests_supported": true, "values": [123, 3]}, "offline_data_collection": {"completion_seconds": 72, "status": {"passed": true, "string": "was completed without error", "value": 130}}, "self_test": {"polling_minutes": {"conveyance": 2, "extended": 103, "short": 1}, "status": {"passed": true, "string": "completed without error", "value
": 0}}}, "ata_smart_error_log": {"summary": {"count": 0, "revision": 1}}, "ata_smart_selective_self_test_log": {"flags": {"remainder_scan_enabled": false, "value": 0}, "power_up_scan_resume_minutes": 0, "revision": 1, "table"
: [{"lba_max": 0, "lba_min": 0, "status": {"string": "Not_testing", "value": 0}}, {"lba_max": 0, "lba_min": 0, "status": {"string": "Not_testing", "value": 0}}, {"lba_max": 0, "lba_min": 0, "status": {"string": "Not_testing",
"value": 0}}, {"lba_max": 0, "lba_min": 0, "status": {"string": "Not_testing", "value": 0}}, {"lba_max": 0, "lba_min": 0, "status": {"string": "Not_testing", "value": 0}}]}, "ata_smart_self_test_log": {"standard": {"count":
21, "error_count_outdated": 0, "error_count_total": 0, "revision": 1, "table": [{"lifetime_hours": 5262, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value"
: 1}}, {"lifetime_hours": 592, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 568, "status": {"passed": true, "string": "Compl
eted without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 544, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "val
ue": 1}}, {"lifetime_hours": 520, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 496, "status": {"passed": true, "string": "Co
mpleted without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 474, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Extended offline"
, "value": 2}}, {"lifetime_hours": 472, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 448, "status": {"passed": true, "string
": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 424, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offli
ne", "value": 1}}, {"lifetime_hours": 400, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 376, "status": {"passed": true, "str
ing": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 352, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short of
fline", "value": 1}}, {"lifetime_hours": 328, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 306, "status": {"passed": true, "
string": "Completed without error", "value": 0}, "type": {"string": "Extended offline", "value": 2}}, {"lifetime_hours": 304, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Sh
ort offline", "value": 1}}, {"lifetime_hours": 280, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 256, "status": {"passed": t
rue, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 232, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string":
"Short offline", "value": 1}}, {"lifetime_hours": 208, "status": {"passed": true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}, {"lifetime_hours": 184, "status": {"passed"
: true, "string": "Completed without error", "value": 0}, "type": {"string": "Short offline", "value": 1}}]}}, "ata_version": {"major_value": 1008, "minor_value": 31, "string": "ACS-2, ACS-3 T13/2161-D revision 3b"}, "device"
: {"info_name": "/dev/sdc [SAT]", "name": "/dev/sdc", "protocol": "ATA", "type": "sat"}, "firmware_version": "CC46", "form_factor": {"ata_value": 2, "name": "3.5 inches"}, "in_smartctl_database": true, "interface_speed": {"cu
rrent": {"bits_per_unit": 100000000, "sata_value": 2, "string": "3.0 Gb/s", "units_per_second": 30}, "max": {"bits_per_unit": 100000000, "sata_value": 14, "string": "6.0 Gb/s", "units_per_second": 60}}, "json_format_version":
[1, 0], "local_time": {"asctime": "Thu Sep 5 09:22:32 2019 CDT", "time_t": 1567693352}, "logical_block_size": 512, "model_family": "Seagate Barracuda 7200.14 (AF)", "model_name": "ST1000DM003-1ER162", "nvme_smart_health_inf
ormation_add_log_error": "nvme returned an error: sudo: exit status: 1", "nvme_smart_health_information_add_log_error_code": -22, "nvme_vendor": "lvm", "physical_block_size": 4096, "power_cycle_count": 41, "power_on_time": {"
hours": 29664}, "rotation_rate": 7200, "sata_version": {"string": "SATA 3.1", "value": 126}, "serial_number": "Z4YA57ZB", "smart_status": {"passed": true}, "smartctl": {"argv": ["smartctl", "-a", "--json", "/dev/sdc"], "build
_info": "(local build)", "exit_status": 0, "platform_info": "x86_64-linux-5.2.0-15-generic", "svn_revision": "4883", "version": [7, 0]}, "temperature": {"current": 26}, "user_capacity": {"blocks": 1953525168, "bytes": 1000204
886016}, "wwn": {"id": 2267086309, "naa": 5, "oui": 3152}}2019-09-05 09:22:34.504 7f975f79d700 0 log_channel(audit) log [DBG] : from='client.3824576 -' entity='client.admin' cmd=[{"prefix": "crash post", "target": ["mgr", ""
]}]: dispatch
  • Caught signal (Segmentation fault) **

#23 Updated by Harry Coin 7 months ago

So, the original crash line you noted is only reached when debug logging is off. The JSON for your test case is above. When debug logging is on, the crash happens sooner, the debug write is never reached.
That's about all I can do for you/ceph. Hope this helps.
Harry Coin

#24 Updated by Brad Hubbard 7 months ago

Thanks, appreciate it. Hopefully that will be enough.

#25 Updated by Brad Hubbard 7 months ago

Unfortunately in both logs mgr debugging appears to be '1/ 5 mgr'. In mgr_error_log there is logging at 20 but that's coming from the messenger, not the manager which is unfortunate.

# gawk '/--- logging levels ---/,/--- end dump of recent events ---/' mgr_log |grep "mgr$" 
   1/ 5 mgr
   1/ 5 mgr
# gawk '/--- logging levels ---/,/--- end dump of recent events ---/' mgr_error_log |grep "mgr$" 
   1/ 5 mgr
   1/ 5 mgr
# gawk '/--- logging levels ---/,/--- end dump of recent events ---/' mgr_error_log |grep "20 " 
  20/20 ms
  20/20 ms

I'll see what I can do with what we have anyway.

#26 Updated by Brad Hubbard 7 months ago

(gdb) f 4
#4  0x00007f97681c5368 in _rados_write_op_omap_set (write_op=0x55f2d3a6f100, keys=keys@entry=0x55f2cfdd5438, vals=vals@entry=0x0, lens=lens@entry=0x55f2cfdd53e0, num=num@entry=1) at ./src/librados/librados_c.cc:3308
3308        bl.append(vals[i], lens[i]);
(gdb) p lens[0]
$1 = 13997

If I extract the json data to a file by itself and print it without the new line I get.

# echo $(cat tmp.txt)|wc 
      1    1485   13997

So we are definitely on the right track.

#27 Updated by Brad Hubbard 6 months ago

  • Duplicated by Bug #42082: pybind/rados: set_omap() crash on py3 added

#28 Updated by Brad Hubbard 6 months ago

  • Status changed from New to Duplicate

#29 Updated by Brad Hubbard 6 months ago

  • Duplicated by deleted (Bug #42082: pybind/rados: set_omap() crash on py3)

#30 Updated by Brad Hubbard 6 months ago

  • Duplicates Bug #42082: pybind/rados: set_omap() crash on py3 added

Also available in: Atom PDF