Bug #51418

closed

[pwl] segment fault on syncpoint stack

Added by CONGMIN YIN almost 3 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Segmentation fault; the stack is extremely deep.

Thread 137 "tp_pwl" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff9f7fe700 (LWP 807360)]
0x00007ffff50d7f38 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff74dbc108, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
730 _M_pi->_M_release();
(gdb) bt
#0 0x00007ffff50d7f38 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff74dbc108, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#1 0x00007fffb2133118 in std::__shared_ptr<librbd::cache::pwl::SyncPointLogEntry, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff74dbc100, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#2 0x00007fffb2133138 in std::shared_ptr<librbd::cache::pwl::SyncPointLogEntry>::~shared_ptr (this=0x7fff74dbc100, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#3 0x00007fffb217dd6c in librbd::cache::pwl::SyncPointLogEntry::~SyncPointLogEntry (this=0x7fff74dbc080, __in_chrg=<optimized out>) at ../src/librbd/cache/pwl/LogEntry.h:97
#4 0x00007fffb216dbf5 in __gnu_cxx::new_allocator<librbd::cache::pwl::SyncPointLogEntry>::destroy<librbd::cache::pwl::SyncPointLogEntry> (this=0x7fff74dbc080, __p=0x7fff74dbc080) at /usr/include/c++/9/ext/new_allocator.h:153
#5 0x00007fffb216da0d in std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPointLogEntry> >::destroy<librbd::cache::pwl::SyncPointLogEntry> (__a=..., __p=0x7fff74dbc080) at /usr/include/c++/9/bits/alloc_traits.h:497
#6 0x00007fffb216ca59 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPointLogEntry, std::allocator<librbd::cache::pwl::SyncPointLogEntry>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fff74dbc070) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#7 0x00007ffff50d8cd0 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff74dbc070) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#8 0x00007ffff50d7f3d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff72397b28, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#9 0x00007fffb2133118 in std::__shared_ptr<librbd::cache::pwl::SyncPointLogEntry, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff72397b20, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#10 0x00007fffb2133138 in std::shared_ptr<librbd::cache::pwl::SyncPointLogEntry>::~shared_ptr (this=0x7fff72397b20, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#11 0x00007fffb217dd6c in librbd::cache::pwl::SyncPointLogEntry::~SyncPointLogEntry (this=0x7fff72397aa0, __in_chrg=<optimized out>) at ../src/librbd/cache/pwl/LogEntry.h:97
#12 0x00007fffb216dbf5 in __gnu_cxx::new_allocator<librbd::cache::pwl::SyncPointLogEntry>::destroy<librbd::cache::pwl::SyncPointLogEntry> (this=0x7fff72397aa0, __p=0x7fff72397aa0) at /usr/include/c++/9/ext/new_allocator.h:153
#13 0x00007fffb216da0d in std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPointLogEntry> >::destroy<librbd::cache::pwl::SyncPointLogEntry> (__a=..., __p=0x7fff72397aa0) at /usr/include/c++/9/bits/alloc_traits.h:497
#14 0x00007fffb216ca59 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPointLogEntry, std::allocator<librbd::cache::pwl::SyncPointLogEntry>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fff72397a90) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#15 0x00007ffff50d8cd0 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff72397a90) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#16 0x00007ffff50d7f3d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff724b8398, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#17 0x00007fffb2133118 in std::__shared_ptr<librbd::cache::pwl::SyncPointLogEntry, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff724b8390, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#18 0x00007fffb2133138 in std::shared_ptr<librbd::cache::pwl::SyncPointLogEntry>::~shared_ptr (this=0x7fff724b8390, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#19 0x00007fffb217dd6c in librbd::cache::pwl::SyncPointLogEntry::~SyncPointLogEntry (this=0x7fff724b8310, __in_chrg=<optimized out>) at ../src/librbd/cache/pwl/LogEntry.h:97
#20 0x00007fffb216dbf5 in __gnu_cxx::new_allocator<librbd::cache::pwl::SyncPointLogEntry>::destroy<librbd::cache::pwl::SyncPointLogEntry> (this=0x7fff724b8310, __p=0x7fff724b8310) at /usr/include/c++/9/ext/new_allocator.h:153
#21 0x00007fffb216da0d in std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPointLogEntry> >::destroy<librbd::cache::pwl::SyncPointLogEntry> (__a=..., __p=0x7fff724b8310) at /usr/include/c++/9/bits/alloc_traits.h:497
#22 0x00007fffb216ca59 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPointLogEntry, std::allocator<librbd::cache::pwl::SyncPointLogEntry>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fff724b8300) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#23 0x00007ffff50d8cd0 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff724b8300) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#24 0x00007ffff50d7f3d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fffaff28ca8, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#25 0x00007fffb2133118 in std::__shared_ptr<librbd::cache::pwl::SyncPointLogEntry, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffaff28ca0, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#26 0x00007fffb2133138 in std::shared_ptr<librbd::cache::pwl::SyncPointLogEntry>::~shared_ptr (this=0x7fffaff28ca0, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#27 0x00007fffb217dd6c in librbd::cache::pwl::SyncPointLogEntry::~SyncPointLogEntry (this=0x7fffaff28c20, __in_chrg=<optimized out>) at ../src/librbd/cache/pwl/LogEntry.h:97
#28 0x00007fffb216dbf5 in __gnu_cxx::new_allocator<librbd::cache::pwl::SyncPointLogEntry>::destroy<librbd::cache::pwl::SyncPointLogEntry> (this=0x7fffaff28c20, __p=0x7fffaff28c20) at /usr/include/c++/9/ext/new_allocator.h:153
#29 0x00007fffb216da0d in std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPointLogEntry> >::destroy<librbd::cache::pwl::SyncPointLogEntry> (__a=..., __p=0x7fffaff28c20) at /usr/include/c++/9/bits/alloc_traits.h:497
#30 0x00007fffb216ca59 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPointLogEntry, std::allocator<librbd::cache::pwl::SyncPointLogEntry>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fffaff28c10) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#31 0x00007ffff50d8cd0 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fffaff28c10) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#32 0x00007ffff50d7f3d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff7208ce18, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#33 0x00007fffb2133118 in std::__shared_ptr<librbd::cache::pwl::SyncPointLogEntry, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff7208ce10, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#34 0x00007fffb2133138 in std::shared_ptr<librbd::cache::pwl::SyncPointLogEntry>::~shared_ptr (this=0x7fff7208ce10, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
--Type <RET> for more, q to quit, c to continue without paging--
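
In the backtrace, frames #0 to #7 repeat with the same eight functions for each SyncPointLogEntry. That is the pattern produced when objects linked through a shared_ptr member are destroyed: each destructor releases the next object before it returns, so the nesting depth grows with the length of the chain. The sketch below reproduces the pattern in isolation. `Entry`, `link`, and the two counters are hypothetical stand-ins for illustration, not the real SyncPointLogEntry layout, and the chain is kept short so that it does not exhaust the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Counters used only to observe how deeply destructors nest (illustrative).
std::size_t g_depth = 0;      // destructors currently on the stack
std::size_t g_max_depth = 0;  // deepest nesting seen so far

// Hypothetical stand-in for a log entry that owns a reference to a neighbour.
struct Entry {
  std::shared_ptr<Entry> link;  // assumed member name, not taken from Ceph

  ~Entry() {
    ++g_depth;
    if (g_depth > g_max_depth) g_max_depth = g_depth;
    // Releasing the link destroys the next entry while this destructor is
    // still active, which is what makes the teardown recursive.
    link.reset();
    --g_depth;
  }
};

// Builds a chain of n entries, drops the only external reference,
// and reports how many destructors were nested at the deepest point.
std::size_t nested_destructor_depth(std::size_t n) {
  g_depth = 0;
  g_max_depth = 0;
  std::shared_ptr<Entry> head;
  for (std::size_t i = 0; i < n; ++i) {
    auto e = std::make_shared<Entry>();
    e->link = std::move(head);
    head = std::move(e);
  }
  head.reset();  // teardown of the whole chain starts here
  return g_max_depth;
}
```

Calling `nested_destructor_depth(n)` returns `n`: the nesting depth equals the chain length, so a long enough chain can exhaust the thread's stack.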


Related issues 3 (0 open, 3 closed)

Related to rbd - Bug #52258: [pwl] The write back time of cache is too long (Closed, CONGMIN YIN)
Related to rbd - Bug #52465: [pwl ssd] assert in AbstractWriteLog::handle_flushed_sync_point() (Resolved, CONGMIN YIN)
Copied to rbd - Backport #52092: pacific: [pwl] segment fault on syncpoint stack (Resolved, Deepika Upadhyay)
Actions #1

Updated by Ilya Dryomov almost 3 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Hualong Feng
  • Pull request ID set to 42149
Actions #2

Updated by Ilya Dryomov almost 3 years ago

  • Subject changed from [pwl ssd] segment fault on syncpoint stack to [pwl] segment fault on syncpoint stack
Actions #3

Updated by CONGMIN YIN almost 3 years ago

https://github.com/ceph/ceph/pull/42149 supplements the cleanup of the sync point. We still do not understand the mechanism, though. In testing, the workload segfaults without the patch and runs fine for an extended period with it.
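
For reference, a generic way to tear down a shared_ptr-linked chain without deep recursion is to detach each link before dropping the object that holds it. The sketch below shows that technique only. It is not a description of what the pull request does, and `Node`, `earlier`, `g_live`, `make_chain`, and `release_chain` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

std::size_t g_live = 0;  // number of Node objects currently alive (illustrative)

// Hypothetical node that keeps a reference to the node created before it.
struct Node {
  std::shared_ptr<Node> earlier;  // assumed member name, not taken from Ceph
  Node() { ++g_live; }
  ~Node() { --g_live; }
};

// Builds a chain of n nodes; the returned pointer is the only external owner.
std::shared_ptr<Node> make_chain(std::size_t n) {
  std::shared_ptr<Node> head;
  for (std::size_t i = 0; i < n; ++i) {
    auto node = std::make_shared<Node>();
    node->earlier = std::move(head);
    head = std::move(node);
  }
  return head;
}

// Walks the chain in a loop. Each node has its link moved out before the
// node itself is released, so no destructor has anything left to recurse into.
std::size_t release_chain(std::shared_ptr<Node> head) {
  std::size_t visited = 0;
  while (head) {
    std::shared_ptr<Node> next = std::move(head->earlier);  // detach first
    head = std::move(next);  // drops the previous node, whose link is now empty
    ++visited;
  }
  return visited;
}
```

With this loop the stack depth stays constant regardless of chain length, so `release_chain(make_chain(200000))` completes where a recursive teardown of the same chain might not.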

Actions #4

Updated by CONGMIN YIN almost 3 years ago

To reproduce, run fio under gdb:

gdb fio
set args test.conf
run

# cat test.conf
[global]
ioengine=rbd
clientname=admin
rw=randwrite
#bs=1m
bs=16k
time_based=1
runtime=3h
iodepth=16
group_reporting

[volumes]
pool=test
rbdname=image10

When fio hits the segmentation fault, run 'bt' to show the stack. The stack is very deep, more than 220,000 frames.
Note: the bug does not reproduce on every run, but it is likely to occur. If it has not occurred after half an hour, restart the run.

Actions #5

Updated by Ilya Dryomov almost 3 years ago

How big is the cache (rbd_persistent_cache_size)?

Actions #6

Updated by CONGMIN YIN almost 3 years ago

The cache is at the default size, 1 GB.

Actions #7

Updated by Ilya Dryomov over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific
Actions #8

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52092: pacific: [pwl] segment fault on syncpoint stack added
Actions #9

Updated by Deepika Upadhyay over 2 years ago

  • Related to Bug #52258: [pwl] The write back time of cache is too long added
Actions #10

Updated by Ilya Dryomov over 2 years ago

  • Related to Bug #52465: [pwl ssd] assert in AbstractWriteLog::handle_flushed_sync_point() added
Actions #11

Updated by Ilya Dryomov about 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
