Bug #10399

kvm die with assert(m_seed < old_pg_num)

Added by Mehdi Abaakouk almost 4 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version: -
Start date: 12/19/2014
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: infernalis,hammer,firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

Hi, when I increase the pg_num on a pool, all the kvm processes that use the pool die with the following backtrace:

osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned int, std::set<pg_t>*) const' thread 7f09351f4700 time 2014-12-19 14:57:39.577364
osd/osd_types.cc: 411: FAILED assert(m_seed < old_pg_num)
ceph version 0.87-73-g70a5569 (70a5569e34786d4124e37561473f1aa02c80f779)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0x7f095afd8f86]
2: (()+0x36cad1) [0x7f095b09fad1]
3: (pg_interval_t::is_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, unsigned int, unsigned int, pg_t)+0xc0) [0x7f095b09fba0]
4: (Objecter::_calc_target(Objecter::op_target_t*, bool)+0x84d) [0x7f095d4c82dd]
5: (Objecter::_recalc_linger_op_target(Objecter::LingerOp*, RWLock::Context&)+0x45) [0x7f095d4dae55]
6: (Objecter::_scan_requests(Objecter::OSDSession*, bool, bool, std::map<unsigned long, Objecter::Op*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Objecter::Op*> > >&, std::list<Objecter::LingerOp*, std::allocator<Objecter::LingerOp*> >&, std::map<unsigned long, Objecter::CommandOp*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Objecter::CommandOp*> > >&)+0x21a) [0x7f095d4db67a]
7: (Objecter::handle_osd_map(MOSDMap*)+0x6f1) [0x7f095d4ddb51]
8: (Objecter::ms_dispatch(Message*)+0x1df) [0x7f095d4e389f]
9: (DispatchQueue::entry()+0x66c) [0x7f095b0f29bc]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f095b16c7bd]
11: (()+0x80a4) [0x7f09566e70a4]
12: (clone()+0x6d) [0x7f095641bccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'

kvm version is 2.1.2
ceph version is 0.87-73-g70a5569
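
For reference, the failing check requires that a PG's seed (its index within the pool) be computed against the old pg_num it is being compared with. Below is a standalone sketch (not the Ceph source; stable_mod only mirrors the shape of Ceph's ceph_stable_mod) of how a seed derived from the new pg_num of 1024 can land at or above the old pg_num of 512 and trip exactly this assert:

// Standalone illustration (not Ceph code) of why assert(m_seed < old_pg_num)
// fires when the seed is derived from the wrong pg_num.
#include <cassert>
#include <cstdint>
#include <cstdio>

// Same shape as Ceph's ceph_stable_mod(): map a raw object hash onto a pool
// with b PGs, where bmask is (next power of two >= b) - 1.
static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  return (x & bmask) < b ? (x & bmask) : (x & (bmask >> 1));
}

int main() {
  const uint32_t raw_hash = 0x3c2a939f;             // arbitrary object hash
  const uint32_t old_pg_num = 512,  old_mask = 511;
  const uint32_t new_pg_num = 1024, new_mask = 1023;

  uint32_t seed_old = stable_mod(raw_hash, old_pg_num, old_mask);  // 415
  uint32_t seed_new = stable_mod(raw_hash, new_pg_num, new_mask);  // 927

  printf("seed(old)=%u seed(new)=%u\n", (unsigned)seed_old, (unsigned)seed_new);

  assert(seed_old < old_pg_num);     // always holds: correct seed for the check
  // assert(seed_new < old_pg_num);  // 927 < 512 is false -> the FAILED assert
  return 0;
}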

Let me know if you need more information.

Cheers,
sileht

ceph-report.gz (2.11 MB) Mehdi Abaakouk, 12/19/2014 12:20 PM


Related issues

Duplicated by Ceph - Bug #10543: "FAILED assert(m_seed < old_pg_num)" in upgrade:giant-x-hammer-distro-basic-vps run Duplicate 01/14/2015
Copied to Ceph - Backport #12751: kvm die with assert(m_seed < old_pg_num) Resolved 12/19/2014
Copied to Ceph - Backport #12752: kvm die with assert(m_seed < old_pg_num) Resolved 12/19/2014
Copied to Ceph - Backport #12881: kvm die with assert(m_seed < old_pg_num) Rejected

Associated revisions

Revision f20f7a23 (diff)
Added by Jason Dillaman about 3 years ago

Objecter: pg_interval_t::is_new_interval needs pgid from previous pool

When increasing the pg_num of a pool, an assert would fail since the
calculated pgid seed would be for the pool's new pg_num value instead
of the previous pg_num value.

Fixes: #10399
Backport: infernalis, hammer, firefly
Signed-off-by: Jason Dillaman <>
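
In other words (a hedged sketch with illustrative names, not the actual Objecter diff): when a new OSDMap raises a pool's pg_num, the seed passed to the split/new-interval check must be recomputed against the previous epoch's pg_num, otherwise it can violate the m_seed < old_pg_num invariant:

// Illustrative sketch of the fix's idea; names are hypothetical and the real
// change lives in Objecter::_calc_target / pg_interval_t::is_new_interval.
#include <cassert>
#include <cstdint>

struct pool_epoch {
  uint32_t pg_num;       // pg_num in this OSDMap epoch
  uint32_t pg_num_mask;  // (next power of two >= pg_num) - 1
};

static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  return (x & bmask) < b ? (x & bmask) : (x & (bmask >> 1));
}

// Stand-in for pg_t::is_split(), simplified for a power-of-two old pg_num:
// PG `seed` splits if its first child (seed + old_pg_num) exists under the
// new pg_num.
static bool pg_is_split(uint32_t seed, uint32_t old_pg_num, uint32_t new_pg_num) {
  assert(seed < old_pg_num);  // the invariant that failed in this report
  return new_pg_num > old_pg_num && seed + old_pg_num < new_pg_num;
}

// The fix in spirit: derive the seed from prev (the previous pg_num), not
// from cur (the just-increased pg_num), before asking about the split.
bool op_target_was_split(uint32_t raw_hash,
                         const pool_epoch& prev, const pool_epoch& cur) {
  uint32_t seed = stable_mod(raw_hash, prev.pg_num, prev.pg_num_mask);  // was cur.*
  return pg_is_split(seed, prev.pg_num, cur.pg_num);
}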

Revision bee86660 (diff)
Added by Jason Dillaman about 3 years ago

Objecter: pg_interval_t::is_new_interval needs pgid from previous pool

When increasing the pg_num of a pool, an assert would fail since the
calculated pgid seed would be for the pool's new pg_num value instead
of the previous pg_num value.

Fixes: #10399
Backport: infernalis, hammer, firefly
Signed-off-by: Jason Dillaman <>
(cherry picked from commit f20f7a23e913d09cc7fc22fb3df07f9938ddc144)

Conflicts: (hobject_t sort order not backported, trivial resolution)
src/osdc/Objecter.cc
src/osdc/Objecter.h

Revision d3c94698 (diff)
Added by Jason Dillaman about 3 years ago

Objecter: pg_interval_t::is_new_interval needs pgid from previous pool

When increasing the pg_num of a pool, an assert would fail since the
calculated pgid seed would be for the pool's new pg_num value instead
of the previous pg_num value.

Fixes: #10399
Backport: infernalis, hammer, firefly
Signed-off-by: Jason Dillaman <>
(cherry picked from commit f20f7a23e913d09cc7fc22fb3df07f9938ddc144)

Conflicts: (hobject_t sort order not backported, trivial resolution)
src/osdc/Objecter.cc
src/osdc/Objecter.h

History

#1 Updated by Loic Dachary almost 4 years ago

Could you please send the output of ceph report? It would also be great to have (if possible) a set of steps to reproduce, even if only in theory.

#2 Updated by Sage Weil almost 4 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Community (user)

#3 Updated by Mehdi Abaakouk almost 4 years ago

I have attached the ceph-report file.

The crashed kvm processes have a block device on the pool 'r2', the one whose pg_num I changed.
The pg_num before the crash was 512.
I used "ceph osd pool set r2 pg_num 1024" to change the pg_num.
The kvm processes then stopped, before Ceph had finished creating the new PGs and before I had changed the pgp_num.

#4 Updated by Laurent GUERBY almost 4 years ago

If this helps: the cluster was in HEALTH_OK state when the command was launched (ceph-report.gz was captured during the recovery triggered by the pg_num change).

#5 Updated by Sage Weil over 3 years ago

  • Status changed from New to Need More Info

What version are you running now, and have you seen this since? This version was a random development version shortly after giant, and we do a lot of split testing in our QA, so I would expect to see this if it were still present...

#6 Updated by Sage Weil over 3 years ago

  • Status changed from Need More Info to Can't reproduce

#7 Updated by Roy Keene about 3 years ago

I just ran into this issue as well, while growing pg_num and pgp_num.


2015-07-29 16:53:16.018+0000: starting up libvirt version: 1.2.16, qemu version: 2.3.0
LC_ALL=C PATH=/bin:/usr/bin QEMU_AUDIO_DRV=none /bin/qemu-system-x86_64 -name one-30 -S -machine pc-i440fx-2.2,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid cb6683fc-d1d1-42ad-aa5c-bcb8a0d322c4 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-30.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:rbd/one-16-30-0:auth_supported=none:mon_host=aurae-storage-1\:6789\;aurae-storage-2\:6789\;aurae-storage-3\:6789\;aurae-storage-4\:6789\;aurae-storage-5\:6789\;aurae-storage-6\:6789,if=none,id=drive-ide0-0-0,format=raw,cache=writeback -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/var/lib/one//datastores/0/30/disk.2,if=none,id=drive-ide0-0-1,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=rbd:rbd/one-21-30-1:auth_supported=none:mon_host=aurae-storage-1\:6789\;aurae-storage-2\:6789\;aurae-storage-3\:6789\;aurae-storage-4\:6789\;aurae-storage-5\:6789\;aurae-storage-6\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=12,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=02:00:0a:50:00:13,bus=pci.0,addr=0x3 -vnc 0.0.0.0:30 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox on -msg timestamp=on
osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned int, std::set<pg_t>*) const' thread 7f9f895a8700 time 2015-08-13 14:57:29.443302
osd/osd_types.cc: 459: FAILED assert(m_seed < old_pg_num)
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x11dfe8) [0x7f9f8f607fe8]
 2: (()+0x1e99e1) [0x7f9f8f6d39e1]
 3: (()+0x1e9abd) [0x7f9f8f6d3abd]
 4: (()+0x8f939) [0x7f9f8f579939]
 5: (()+0xa6c73) [0x7f9f8f590c73]
 6: (()+0xa74ba) [0x7f9f8f5914ba]
 7: (()+0xa89f2) [0x7f9f8f5929f2]
 8: (()+0xae8ff) [0x7f9f8f5988ff]
 9: (()+0x2887aa) [0x7f9f8f7727aa]
 10: (()+0x2b501d) [0x7f9f8f79f01d]
 11: (()+0x8354) [0x7f9f8d909354]
 12: (clone()+0x6d) [0x7f9f8d64871d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
2015-08-13 14:57:29.766+0000: shutting down

#8 Updated by Josh Durgin about 3 years ago

  • Status changed from Can't reproduce to Need More Info
  • Regression set to No

Which version of librados was this, Roy? How many VMs saw this crash, and how many are there in total (trying to figure out how hard it is to reproduce in tests)?

#9 Updated by Roy Keene about 3 years ago

This was Ceph 0.94.1 on Linux/x86_64, so whatever version of librados came with that release.

About 2 of 10 VMs died this way simultaneously.

The QEMU version used is 2.3.0.

#10 Updated by Wei-Chung Cheng about 3 years ago

I have the same issue with ceph-0.94.2.
Ubuntu version: 12.04.5
Kernel version: 3.13.0-35
QEMU version: 2.0.0

8 of 20 VMs shut off (under normal I/O stress).

Without VMs it is hard to reproduce (I have not managed to reproduce it so far).

I am still trying to figure out whether m_seed or old_pg_num is wrong.

Thanks!

#11 Updated by Sage Weil about 3 years ago

Is this the same as #10543?

#12 Updated by Wei-Chung Cheng about 3 years ago

Sage Weil wrote:

Is this the same as #10543?

Yes, I think so.

The stack trace and environment look very similar.

#13 Updated by Jason Dillaman about 3 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Jason Dillaman

Easily repeatable when "rbd bench-write" is running in the background while you double the number of PGs.
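
For anyone scripting a reproducer: a minimal write loop against an RBD image via the librbd C API can stand in for "rbd bench-write" (pool "rbd" and image "bench-img" below are placeholders; create the image first, e.g. with rbd create). Run it and double pg_num on the pool from another shell; on affected versions the client should abort on the same assert.

// Minimal stand-in for "rbd bench-write" (assumed pool/image names): keep
// writes in flight while pg_num is doubled on the pool from another shell.
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  rados_t cluster;
  rados_ioctx_t io;
  rbd_image_t image;
  char buf[4096];
  memset(buf, 0xab, sizeof(buf));

  if (rados_create(&cluster, NULL) < 0) return 1;
  rados_conf_read_file(cluster, NULL);               // default ceph.conf search
  if (rados_connect(cluster) < 0) return 1;
  if (rados_ioctx_create(cluster, "rbd", &io) < 0) return 1;
  if (rbd_open(io, "bench-img", &image, NULL) < 0) return 1;

  // Write in a loop; before the fix the client aborts on the is_split assert
  // once the OSDMap with the larger pg_num arrives.
  for (uint64_t off = 0;; off = (off + sizeof(buf)) % (64ULL << 20)) {
    if (rbd_write(image, off, sizeof(buf), buf) < 0) {
      fprintf(stderr, "write failed\n");
      break;
    }
  }

  rbd_close(image);
  rados_ioctx_destroy(io);
  rados_shutdown(cluster);
  return 0;
}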

#14 Updated by Jason Dillaman about 3 years ago

  • Backport set to hammer,giant,firefly

#15 Updated by Jason Dillaman about 3 years ago

  • Status changed from In Progress to Need Review

#16 Updated by Loic Dachary about 3 years ago

  • Backport changed from hammer,giant,firefly to hammer,firefly

Giant is retired.

#17 Updated by Jason Dillaman about 3 years ago

  • Backport changed from hammer,firefly to infernalis,hammer,firefly

#18 Updated by Jason Dillaman about 3 years ago

  • Status changed from Need Review to Pending Backport

#19 Updated by Nathan Cutler almost 3 years ago

  • Status changed from Pending Backport to Resolved

#20 Updated by Nathan Cutler almost 3 years ago

Already in infernalis. Hammer and firefly backports have been merged.
