Bug #53184 (closed)

failed to start new osd due to SIGSEGV in BlueStore::read()

Added by Satoru Takeuchi over 2 years ago. Updated 12 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
container
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A new OSD failed to start due to SIGSEGV. Here is the backtrace.

```
debug -3> 2021-11-08T07:06:17.324+0000 7f1c5d789f00 -1 *** Caught signal (Segmentation fault) **
in thread 7f1c5d789f00 thread_name:ceph-osd

ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f1c5de333c0]
2: (operator<<(std::ostream&, coll_t const&)+0x10) [0x560b9ba4a160]
3: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x5b0) [0x560b9bba5170]
4: (OSD::read_superblock()+0x1ff) [0x560b9b5a768f]
5: (OSD::init()+0xb90) [0x560b9b5fbab0]
6: main()
7: __libc_start_main()
8: _start()
```

Restarting the osd didn't help and resulted in the same SIGSEGV.

software versions:
- ceph: v16.2.6
- rook: v1.7.5

attached files:
- osd.2.log.tar.xz: the exported bluefs
- osd-2.tar.xz: the boot log of osd (debug level is 20)


Files

osd-2.tar.xz (14.5 KB) osd-2.tar.xz Satoru Takeuchi, 11/08/2021 07:19 AM
osd.2.log.tar.xz (60.5 KB) osd.2.log.tar.xz Satoru Takeuchi, 11/08/2021 07:21 AM
osd-mkfs.log.tar.xz (32.3 KB) osd-mkfs.log.tar.xz Satoru Takeuchi, 11/30/2021 02:15 AM
ceph-volume.tar.gz (104 KB) ceph-volume.tar.gz Yuma Ogami, 03/09/2022 07:17 AM
osd.6.export.tar.gz (235 KB) osd.6.export.tar.gz Yuma Ogami, 03/09/2022 07:17 AM
osd.prepare.set1-data-8ngzwp.tar.gz (78.9 KB) osd.prepare.set1-data-8ngzwp.tar.gz Yuma Ogami, 03/09/2022 07:17 AM
osd.6.tar.gz (382 KB) osd.6.tar.gz Yuma Ogami, 03/09/2022 07:17 AM
ceph-volume.tar.gz (104 KB) ceph-volume.tar.gz Yuma Ogami, 03/23/2022 07:33 AM
osd.4.export.tar.gz (434 KB) osd.4.export.tar.gz Yuma Ogami, 03/23/2022 07:33 AM
osd.4.tar.gz (38.1 KB) osd.4.tar.gz Yuma Ogami, 03/23/2022 07:33 AM
osd.prepare.set1-data-585z6z.tar.gz (74.6 KB) osd.prepare.set1-data-585z6z.tar.gz Yuma Ogami, 03/23/2022 07:33 AM
ceph-volume.tar.gz (110 KB) ceph-volume.tar.gz Yuma Ogami, 04/05/2022 04:20 AM
osd.5.export.tar.gz (272 KB) osd.5.export.tar.gz Yuma Ogami, 04/05/2022 04:20 AM
osd.prepare.set1-data-3f85sf.tar.gz (79.9 KB) osd.prepare.set1-data-3f85sf.tar.gz Yuma Ogami, 04/05/2022 04:20 AM
osd.5.tar.gz (461 KB) osd.5.tar.gz Yuma Ogami, 04/05/2022 04:20 AM

Related issues 1 (1 open, 0 closed)

Related to bluestore - Bug #54707: crash: BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) - New

Actions #1

Updated by Igor Fedotov over 2 years ago

Looks like the list of the OSD's collections is empty. I don't know the root cause, but I'm getting pretty much the same effect on a custom build which bypasses collection loading.

To check the relevant content in the DB, could you please run
ceph-kvstore-tool bluestore-kv <path-to-osd> list C >out
and share the content of 'out', if any.

Curious if you still have the OSD's mkfs log? Please share it if available. Or maybe you can reproduce the issue with a different new OSD? Then it would be great to have the mkfs log with debug_bluestore set to 20.
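
For reference, a minimal shell sketch of both requests; the data path and the osd id are placeholders, and `--debug-bluestore 20/20` is just one way to raise the log level:

```
# Dump the collection ('C' prefix) keys from the OSD's RocksDB
# (the data path is a placeholder; use the failing OSD's directory):
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 list C > out

# Run mkfs for a new OSD with verbose BlueStore logging; the osd id and data
# path are placeholders, and the remaining arguments (monmap, keyfile, uuid)
# are normally supplied by ceph-volume as in the log quoted below:
ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 \
    --osd-data /var/lib/ceph/osd/ceph-2/ --debug-bluestore 20/20
```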

Actions #2

Updated by Satoru Takeuchi over 2 years ago

To check the relevant content in the DB, could you please run ceph-kvstore-tool bluestore-kv <path-to-osd> list C >out and share the content of 'out', if any.

Here is the result of ceph-kvstore-tool.

```
$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 list C >out
    2021-11-09T12:24:43.969+0000 7f69011aa240 0 bluestore(/var/lib/ceph/osd/ceph-2) _open_db_and_around read-only:0 repair:0
    2021-11-09T12:24:43.969+0000 7f69011aa240 1 bdev(0x55f20c8d4800 /var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
    2021-11-09T12:24:43.970+0000 7f69011aa240 1 bdev(0x55f20c8d4800 /var/lib/ceph/osd/ceph-2/block) open size 10737418240 (0x280000000, 10 GiB) block_size 4096 (4 KiB) rotational discard supported
    2021-11-09T12:24:43.970+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 data 0.06
    2021-11-09T12:24:43.970+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
    2021-11-09T12:24:43.971+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) open size 10737418240 (0x280000000, 10 GiB) block_size 4096 (4 KiB) rotational discard supported
    2021-11-09T12:24:43.971+0000 7f69011aa240 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-2/block size 10 GiB
    2021-11-09T12:24:43.971+0000 7f69011aa240 1 bluefs mount
    2021-11-09T12:24:43.972+0000 7f69011aa240 1 bluefs _init_alloc shared, id 1, capacity 0x280000000, block size 0x10000
    2021-11-09T12:24:43.990+0000 7f69011aa240 1 bluefs mount shared_bdev_used = 0
    2021-11-09T12:24:43.991+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _prepare_db_environment set db_paths to db,10200547328 db.slow,10200547328
    2021-11-09T12:24:44.010+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824
    2021-11-09T12:24:44.011+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta old nid_max 0
    2021-11-09T12:24:44.011+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta old blobid_max 0
    2021-11-09T12:24:44.011+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta freelist_type bitmap
    2021-11-09T12:24:44.011+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta ondisk_format 4 compat_ondisk_format 3
    2021-11-09T12:24:44.011+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_super_meta min_alloc_size 0x1000
    2021-11-09T12:24:44.013+0000 7f69011aa240 1 freelist init
    2021-11-09T12:24:44.013+0000 7f69011aa240 1 freelist _read_cfg
    2021-11-09T12:24:44.013+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _init_alloc opening allocation metadata
    2021-11-09T12:24:44.014+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _init_alloc loaded 10 GiB in 1 extents, allocator type hybrid, capacity 0x280000000, block size 0x1000, free 0x27fffe000, fragmentation 0
    2021-11-09T12:24:44.015+0000 7f69011aa240 1 bluefs umount
    2021-11-09T12:24:44.015+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) close
    2021-11-09T12:24:44.241+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
    2021-11-09T12:24:44.242+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) open size 10737418240 (0x280000000, 10 GiB) block_size 4096 (4 KiB) rotational discard supported
    2021-11-09T12:24:44.243+0000 7f69011aa240 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-2/block size 10 GiB
    2021-11-09T12:24:44.243+0000 7f69011aa240 1 bluefs mount
    2021-11-09T12:24:44.243+0000 7f69011aa240 1 bluefs _init_alloc shared, id 1, capacity 0x280000000, block size 0x10000
    2021-11-09T12:24:44.255+0000 7f69011aa240 1 bluefs mount shared_bdev_used = 91750400
    2021-11-09T12:24:44.255+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _prepare_db_environment set db_paths to db,10200547328 db.slow,10200547328
    2021-11-09T12:24:44.293+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824
    2021-11-09T12:24:44.297+0000 7f69011aa240 1 bluestore(/var/lib/ceph/osd/ceph-2) umount
    2021-11-09T12:24:44.308+0000 7f69011aa240 1 bluefs umount
    2021-11-09T12:24:44.311+0000 7f69011aa240 1 bdev(0x55f20c8d4c00 /var/lib/ceph/osd/ceph-2/block) close
    2021-11-09T12:24:44.581+0000 7f69011aa240 1 freelist shutdown
    2021-11-09T12:24:44.581+0000 7f69011aa240 1 bdev(0x55f20c8d4800 /var/lib/ceph/osd/ceph-2/block) close
```

The above output went to stderr. "out" was empty.

Curious if you still have the OSD's mkfs log? Please share it if available.

Here it is.

```
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-2/ --osd-uuid 22cb4447-b487-486e-9a2c-559aabb15542 --setuser ceph --setgroup ceph
 stderr: 2021-11-05T06:38:29.243+0000 7fcef6301f00 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _read_fsid unparsable uuid
```

Or maybe you can reproduce the issue with a different new OSD? Then it would be great to have the mkfs log with debug_bluestore set to 20.

I'll try it later.

Actions #3

Updated by Satoru Takeuchi over 2 years ago

Or maybe you can reproduce the issue with a different new OSD? Then it would be great to have the mkfs log with debug_bluestore set to 20.

I'll try it later.

Sorry, I found that my test script, with which I originally hit this problem, was wrong. I'll confirm whether this problem still happens with the updated script.

Actions #4

Updated by Satoru Takeuchi over 2 years ago

Sorry, I found that my test script, with which I originally hit this problem, was wrong. I'll confirm whether this problem still happens with the updated script.

This problem still happened even after fixing the test script. I attached the mkfs log with debug information.

Actions #5

Updated by Igor Fedotov over 2 years ago

Satoru Takeuchi wrote:

Sorry, I found that my test script, with which I originally hit this problem, was wrong. I'll confirm whether this problem still happens with the updated script.

This problem still happened even after fixing the test script. I attached the mkfs log with debug information.

Satoru, thanks for the update.
Curious whether you run the Ceph commands following mkfs in the same container/pod? Could you please run the ceph-kvstore-tool list command as above immediately after mkfs, in exactly the same container?
Also, I'm wondering whether you have any non-default Ceph settings?

Actions #6

Updated by Satoru Takeuchi over 2 years ago

Curious whether you run the Ceph commands following mkfs in the same container/pod? Could you please run the ceph-kvstore-tool list command as above immediately after mkfs, in exactly the same container?

It's difficult to run ceph-kvstore-tool immediately after mkfs. However, it's possible to run this command in a situation close to that.

The OSD initialization process in Rook is as follows.

1. In the OSD provisioning container: run "ceph-volume raw prepare"; mkfs is called inside it.
2. In the OSD expand container: run "ceph-bluestore-tool bluefs-bdev-expand". This container expands the OSD when the size of the underlying block device grows; it should be a no-op just after creating the OSD.
3. In the OSD daemon container: run the OSD daemon. When this problem happens, this daemon keeps failing forever.

IIUC, in step 3, BlueStore's internal structure hasn't been changed since mkfs. So I guess it would be sufficient to suspend the OSD daemon and run ceph-kvstore-tool. Does this idea make sense?
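
As a rough illustration of that idea (assuming a Rook cluster in the rook-ceph namespace and a hypothetical deployment name rook-ceph-osd-2 for the failing OSD; the tool has to run somewhere the OSD's data directory is visible):

```
# Stop the failing OSD so nothing else touches the store:
kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0

# With the OSD's data directory accessible, dump the collection keys:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 list C > out
```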

Also, I'm wondering whether you have any non-default Ceph settings?

Here is my ceph.conf.

```
[mgr]
; Suppress the following warning.
;
; ```
; health: HEALTH_WARN
; 1 pools have many more objects per pg than average
; ```
mon_pg_warn_max_object_skew = 0
[client]
rgw enable ops log = true
debug rgw = 20/20
```

In addition, when gathering the debug logs, the following parameters are also set. They are not necessary to reproduce this problem; they are just for debugging.

```
[global]
debug_bluefs = 20/20
debug_bluestore = 20/20
log_to_stderr = true
err_to_stderr = true
log_file =
```

Actions #7

Updated by Igor Fedotov over 2 years ago

Satoru Takeuchi wrote:

Curious whether you run the Ceph commands following mkfs in the same container/pod? Could you please run the ceph-kvstore-tool list command as above immediately after mkfs, in exactly the same container?

It's difficult to run ceph-kvstore-tool immediately after mkfs. However, it's possible to run this command in a situation close to that.

The OSD initialization process in Rook is as follows.

1. In the OSD provisioning container: run "ceph-volume raw prepare"; mkfs is called inside it.
2. In the OSD expand container: run "ceph-bluestore-tool bluefs-bdev-expand". This container expands the OSD when the size of the underlying block device grows; it should be a no-op just after creating the OSD.
3. In the OSD daemon container: run the OSD daemon. When this problem happens, this daemon keeps failing forever.

IIUC, in step 3, BlueStore's internal structure hasn't been changed since mkfs. So I guess it would be sufficient to suspend the OSD daemon and run ceph-kvstore-tool. Does this idea make sense?

I believe the OSD/DB is already in an invalid state at step 3. Hence calling ceph-kvstore-tool makes little sense at that point, but you can definitely try...

So far I have two hypotheses; the first one looks more likely to me:
1) DB data isn't flushed properly after mkfs at step 1. Hence it might be inaccessible outside of that container - just speculating, I don't know much about container environments. So running ceph-kvstore-tool in this first container would reveal whether the DB content is valid there. Maybe you can add it as the last "sub-step" after ceph-volume raw prepare?

2) DB data is overwritten at some point, e.g. at step 2. Try to remove it, or run ceph-kvstore-tool as the first action at step 2 before the expand attempt? Or introduce a "step 1.5" which runs ceph-kvstore-tool in its own container...

As a last resort you might want to collect both the mkfs and the subsequent OSD startup logs - for the same experiment (!) - so we can try to learn what the differences in the BlueFS log are at stage 3.

Actions #8

Updated by Satoru Takeuchi over 2 years ago

Understood. I'll try to modify Ceph (and Rook?) to collect the requested information.

Actions #9

Updated by Igor Fedotov over 2 years ago

Curious if this might be relevant: https://github.com/ceph/ceph/pull/41889 (Pacific backport - to be published in 16.2.7 - is https://github.com/ceph/ceph/pull/43417)

Actions #10

Updated by Satoru Takeuchi over 2 years ago

Thank you! I'll run the reproducer with this fix (probably v16.2.7-rc?).

Actions #11

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Need More Info

@Satoru - any updates?

Actions #12

Updated by Satoru Takeuchi over 2 years ago

I'll update my Ceph clusters to 16.2.7 next week or the week after (I'm upgrading Rook for now). I'll run the reproducer and report the result.

Actions #13

Updated by Satoru Takeuchi about 2 years ago

Reproduction test in progress.

Actions #14

Updated by Yuma Ogami about 2 years ago

I'm sat's colleague.

We experimented in an updated environment and confirmed the same result. Could you please check the attached logs?

Software environment:
- Ceph: v16.2.7
- Rook: v1.8.3

```
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fe6a8eb83c0]
2: (operator<<(std::ostream&, coll_t const&)+0x10) [0x5587d0427ad0]
3: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x5b0) [0x5587d0582c40]
4: (OSD::read_superblock()+0x1ff) [0x5587cff8469f]
5: (OSD::init()+0xb90) [0x5587cffd9130]
6: main()
7: __libc_start_main()
8: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```

mkfs is called from within Rook, and its log is included in osd.prepare.set1-data-8ngzwp.log.

Please tell me if you want any additional logs.

Actions #15

Updated by Igor Fedotov about 2 years ago

Hi Yuma,
thanks a lot for the information. It's really helpful.

May I ask some questions, please?
1) How easily can you reproduce this issue?
2) Can you try setting bluefs-buffered-io to false and check whether it's still reproducible?
3) Is osd.6 main volume encrypted?
4) Am I correct that this ceph-volume prepare stuff (which includes the mkfs call) is run from a standalone container instance? Do subsequent ceph-osd calls involve different instances or not?

For now I can see that the last mkfs-related BlueStore operation(s) (namely superblock creation) aren't present in the BlueFS log on the subsequent OSD startup, although they were applied during mkfs. It looks like they weren't properly flushed to disk and hence have been overwritten during the next OSD run.

So apparently some data flush or racing issue...

Actions #16

Updated by Yuma Ogami about 2 years ago

Thank you for your reply.

1) How easily can you reproduce this issue?

We could reproduce this issue within about 100 trials (12 recreations of the Ceph cluster * 9 OSDs).

2) Can you try setting bluefs-buffered-io to false and check whether it's still reproducible?

Yes, we can. We will try it.
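
For reference, a minimal sketch of setting this, either through the central config store or in ceph.conf (the exact mechanism depends on how Rook injects Ceph settings in your cluster):

```
# Via the monitor config database:
ceph config set osd bluefs_buffered_io false

# Or in ceph.conf:
[osd]
bluefs_buffered_io = false
```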

3) Is osd.6 main volume encrypted?

Yes, it is. We use dm-crypt. The output of lsblk is as follows.

```
$ lsblk
NAME                    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
...
vdc                     253:32   0   50G  0 disk
`-crypt-vdc             254:2    0   50G  0 crypt
...
```

We assigned crypt-vdc to the OSD directly, without LVM.

4) Am I correct that this ceph-volume prepare stuff (which includes the mkfs call) is run from a standalone container instance? Do subsequent ceph-osd calls involve different instances or not?

We checked the log of the prepare pod (osd.prepare.set1-data-8ngzwp.tar.gz). The ceph-osd command and the ceph-volume command are called in the same container.

```
2022-03-07 11:21:20.158694 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/set1-data-8ngzwp --format json
...
2022-03-07 11:21:21.181519 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/set1-data-8ngzwp raw prepare --bluestore --data /dev/mapper/crypt-vdc
...
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 6 --monmap /var/lib/ceph/osd/ceph-6/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-6/ --osd-uuid de04cbaf-1a8f-4ff5-8b42-5c16193a1607 --setuser ceph --setgroup ceph
```

Actions #17

Updated by Igor Fedotov about 2 years ago

Yuma Ogami wrote:

4) Am I correct that this ceph-volume prepare stuff (which includes the mkfs call) is run from a standalone container instance? Do subsequent ceph-osd calls involve different instances or not?

We checked the log of the prepare pod (osd.prepare.set1-data-8ngzwp.tar.gz). The ceph-osd command and the ceph-volume command are called in the same container.

Yeah, ceph-osd's mkfs and ceph-volume are called from within a single container instance. But what about subsequent operations, e.g. the first regular OSD start-up?
The osd.6 log you shared is pretty distant in time from the osd prepare log. The latter ends at 11:21:24. What happened afterwards? Was it run from another container instance?

Actions #18

Updated by Yuma Ogami about 2 years ago

The contents of osd.6.log come from a different pod than the prepare pod.

Yeah, ceph-osd's mkfs and ceph-volume are called from within a single container instance. But what about subsequent operations, e.g. the first regular OSD start-up?

We did not get that log and will get it now.

The osd.6 log you shared is pretty distant in time from the osd prepare log. The latter ends at 11:21:24. What happened afterwards? Was it run from another container instance?

After the prepare pod started, the OSD pod repeatedly failed to start and retried, so the time has jumped. Because of this situation, we were not able to capture the corresponding prepare pod log. osd.6 is the log of the latest restart; there is no log of the first attempt.

Actions #19

Updated by Yuma Ogami about 2 years ago

2) Can you try setting bluefs-buffered-io to false and check whether it's still reproducible?

Yes, we can. We will try it.

We couldn't reproduce this in about 5400 trials (600 times * 9 OSDs). We are now going to reproduce the problem by setting `bluefs-buffered-io=true` again and collect the log of the OSD pod immediately after the phenomenon occurs.

Actions #20

Updated by Yuma Ogami about 2 years ago

We are now going to reproduce the problem by setting `bluefs-buffered-io=true` again and collect the log of the OSD pod immediately after the phenomenon occurs.

We got the logs of the OSD pod immediately after the phenomenon occurred. We could reproduce this issue within 27 trials this time (recreating the Ceph cluster 3 times * 9 OSDs). The Ceph version and other configurations are the same as in #53184-14.

Actions #21

Updated by Igor Fedotov about 2 years ago

Hi Yuma,
thanks a lot for new logs.

But it still looks like there is some gap between mkfs and OSD startup during which we are unaware of what has been happening.
Look, the 'osd prepare' log ends at 04:41:02:
stderr: 2022-03-22T04:41:02.282+0000 7f1628839f00 0 created object store /var/lib/ceph/osd/ceph-4/ for osd.4 fsid 0b52af16-1117-4227-b3d1-005d26812929

"OSD startup" logs starts on 04:42:52:
2022-03-22T04:42:52Z {container="osd", filename="/var/log/pods/ceph-hdd_rook-ceph-osd-4-6d5c4895b7-w96bn_e99000c1-fc05-4c71-84fe-1b1a7370da2f/osd/0.log", stream="stderr"} debug 2022-03-22T04:42:52.322+0000 7fb0862c8f00 0 set uid:gid to 64045:64045 (ceph:ceph)

and in this log one can see BlueFS inodes created in the questionable period (i.e. from 04:41:02 till 04:42:52);
note ino 27 with mtime 2022-03-22T04:41:18.359858+0000:

2022-03-22T04:42:53Z {container="osd", filename="/var/log/pods/ceph-hdd_rook-ceph-osd-4-6d5c4895b7-w96bn_e99000c1-fc05-4c71-84fe-1b1a7370da2f/osd/0.log", stream="stderr"} debug 2022-03-22T04:42:53.445+0000 7fb0862c8f00 20 bluefs _replay 0x27000: op_file_update file(ino 27 size 0x0 mtime 2022-03-22T04:41:18.359858+0000 allocated 0 extents [])

Curious lines in the osd.4 log refer to BlueFS expansion:
2022-03-22T04:42:50Z {container="expand-bluefs", filename="/var/log/pods/ceph-hdd_rook-ceph-osd-4-6d5c4895b7-w96bn_e99000c1-fc05-4c71-84fe-1b1a7370da2f/expand-bluefs/0.log", stream="stdout"} inferring bluefs devices from bluestore path
2022-03-22T04:42:50Z {container="expand-bluefs", filename="/var/log/pods/ceph-hdd_rook-ceph-osd-4-6d5c4895b7-w96bn_e99000c1-fc05-4c71-84fe-1b1a7370da2f/expand-bluefs/0.log", stream="stdout"} 1 : device size 0xc7fe00000 : using 0x4c2000(4.8 MiB)
2022-03-22T04:42:50Z {container="expand-bluefs", filename="/var/log/pods/ceph-hdd_rook-ceph-osd-4-6d5c4895b7-w96bn_e99000c1-fc05-4c71-84fe-1b1a7370da2f/expand-bluefs/0.log", stream="stdout"} Expanding DB/WAL...

I was about to blame that stuff, but its timestamp doesn't match the ino 27 one; the former happened later in time.
Although I'm still curious why you need that BlueFS expansion at all?

And the general question is still the same: what was happening between 04:41:02 and 04:42:52?
It still looks like we lose DB content during that period, as the OSD startup misses some BlueFS operations present in the osd prepare log.
Hence my working hypothesis is the following: some unknown stuff (which is BlueStore/BlueFS aware) running in the questionable period is unable to see the latest BlueFS updates from the prepare stage and overwrites the BlueFS log with its own operations, which effectively eliminates the OSD superblock update and finally causes the crash on OSD startup.

Besides the above questions, you might also want to set debug-bdev to 10 for your next experiments. I'd like to verify whether the device flush is performed properly on shutdown during the prepare stage.
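
A minimal sketch of that setting, following the same pattern as the debug options quoted earlier in this issue (assuming it goes into the same [global] section used for the experiments):

```
[global]
debug_bdev = 10/10
```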

Actions #22

Updated by Yuma Ogami about 2 years ago

Thank you for your investigation. That hypothesis seems to be correct. I understand that more detailed logs are needed to confirm it.

Besides the above questions, you might also want to set debug-bdev to 10 for your next experiments.

We will try it next week.

Actions #23

Updated by Yuma Ogami about 2 years ago

We will try it next week.

We tried it and attached the logs.

But it still looks like there is some gap between mkfs and OSD startup during which we are unaware of what has been happening.

It turned out to be our mistake; we had collected the logs in the wrong way. This time, the gap between the end of the prepare job and the start of the OSD is 4 seconds, so it looks good.

Actions #24

Updated by Yuma Ogami almost 2 years ago

If you need additional information, please comment.

Actions #25

Updated by Igor Fedotov over 1 year ago

  • Related to Bug #54707: crash: BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) added
Actions #26

Updated by Neha Ojha over 1 year ago

Hey Igor, do we have any more insight on this issue?

Actions #27

Updated by Shinya Hayashi over 1 year ago

Hi Igor,

I'm working with Satoru and Yuma, and I have been trying to reproduce the problem with Ceph v17.2.5 and Rook v1.10.5.
I repeated creating and deleting the Ceph cluster for two weeks, but the problem did not occur.

We now think that we can close this issue. If we encounter a similar problem in the future, we will create a new issue.
Thank you for helping us a lot!

Actions #28

Updated by Igor Fedotov 12 months ago

  • Status changed from Need More Info to Closed
