Bug #10399

kvm die with assert(m_seed < old_pg_num)

Added by Mehdi Abaakouk over 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: Jason Dillaman
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Backport: infernalis,hammer,firefly
Regression: No
Severity: 3 - minor

Description

Hi, when I increase the pg_num on a pool, all the kvm processes that use the pool die with the following backtrace:

osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned int, std::set<pg_t>*) const' thread 7f09351f4700 time 2014-12-19 14:57:39.577364
osd/osd_types.cc: 411: FAILED assert(m_seed < old_pg_num)
ceph version 0.87-73-g70a5569 (70a5569e34786d4124e37561473f1aa02c80f779)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0x7f095afd8f86]
2: (()+0x36cad1) [0x7f095b09fad1]
3: (pg_interval_t::is_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, unsigned int, unsigned int, pg_t)+0xc0) [0x7f095b09fba0]
4: (Objecter::_calc_target(Objecter::op_target_t*, bool)+0x84d) [0x7f095d4c82dd]
5: (Objecter::_recalc_linger_op_target(Objecter::LingerOp*, RWLock::Context&)+0x45) [0x7f095d4dae55]
6: (Objecter::_scan_requests(Objecter::OSDSession*, bool, bool, std::map<unsigned long, Objecter::Op*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Objecter::Op*> > >&, std::list<Objecter::LingerOp*, std::allocator<Objecter::LingerOp*> >&, std::map<unsigned long, Objecter::CommandOp*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Objecter::CommandOp*> > >&)+0x21a) [0x7f095d4db67a]
7: (Objecter::handle_osd_map(MOSDMap*)+0x6f1) [0x7f095d4ddb51]
8: (Objecter::ms_dispatch(Message*)+0x1df) [0x7f095d4e389f]
9: (DispatchQueue::entry()+0x66c) [0x7f095b0f29bc]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f095b16c7bd]
11: (()+0x80a4) [0x7f09566e70a4]
12: (clone()+0x6d) [0x7f095641bccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
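
For context: the failing check is in pg_t::is_split() in osd/osd_types.cc, reached from the librados Objecter via pg_interval_t::is_new_interval() while it processes a new OSDMap (frames 3-7 above). A simplified sketch of the shape of that check, paraphrased from the backtrace rather than quoted from the Ceph source:

// Simplified sketch of the failing check; names mirror osd/osd_types.cc,
// but this is a paraphrase, not the verbatim Ceph source.
#include <cassert>
#include <set>

struct pg_t {
  unsigned m_seed;  // placement seed; by construction always < the pool's pg_num

  bool is_split(unsigned old_pg_num, unsigned new_pg_num,
                std::set<pg_t> *children) const {
    // If the caller passes an old_pg_num smaller than the pg_num the seed
    // was computed under (a stale view of the pool during a pg_num
    // increase), this check fails. In Ceph the assert throws
    // ceph::FailedAssertion which, uncaught, terminates the whole client
    // process, here the kvm process; plain assert() here aborts directly.
    assert(m_seed < old_pg_num);
    // ... the real code computes the set of child PGs this PG splits into ...
    return new_pg_num > old_pg_num;  // placeholder for the real split logic
  }
};

int main() {
  pg_t pg{600};  // hypothetical seed, computed while pg_num was 1024
  // Passing the stale pre-split pg_num (512) trips the assert, matching
  // the crash above: 600 < 512 is false.
  pg.is_split(512, 1024, nullptr);
}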

kvm version is 2.1.2
ceph version is 0.87-73-g70a5569

Let me know if you need more information.

Cheers,
sileht


Files

ceph-report.gz (2.11 MB), Mehdi Abaakouk, 12/19/2014 12:20 PM

Related issues: 4 (0 open, 4 closed)

Has duplicate: Ceph - Bug #10543: "FAILED assert(m_seed < old_pg_num)" in upgrade:giant-x-hammer-distro-basic-vps run (Duplicate, 01/14/2015)
Copied to: Ceph - Backport #12751: kvm die with assert(m_seed < old_pg_num) (Resolved, Josh Durgin, 12/19/2014)
Copied to: Ceph - Backport #12752: kvm die with assert(m_seed < old_pg_num) (Resolved, Josh Durgin, 12/19/2014)
Copied to: Ceph - Backport #12881: kvm die with assert(m_seed < old_pg_num) (Rejected, Jason Dillaman)
Actions #1

Updated by Loïc Dachary over 9 years ago

Could you please send the output of `ceph report`? It would also be great to have, if possible, a set of steps to follow to reproduce, even if only in theory.

Actions #2

Updated by Sage Weil over 9 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Community (user)
Actions #3

Updated by Mehdi Abaakouk over 9 years ago

I have attached the ceph-report file.

The crashed kvm processes each have a block device on the pool 'r2', the one where I changed the pg_num.
The pg_num before the crash was 512.
I used "ceph osd pool set r2 pg_num 1024" to change the pg_num.
The kvm processes then stopped, before ceph had finished creating the new PGs and before I had changed the pgp_num.

Actions #4

Updated by Laurent GUERBY over 9 years ago

If this helps: the cluster was in HEALTH_OK state when the command was launched (the attached ceph-report.gz was captured during the recovery triggered by the pg_num change).

Actions #5

Updated by Sage Weil about 9 years ago

  • Status changed from New to Need More Info

What version are you running now, and have you seen this since? This version was a random development version shortly after giant, and we do lots of split testing in our QA, so I would expect us to see this if it were still present...

Actions #6

Updated by Sage Weil about 9 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #7

Updated by Roy Keene over 8 years ago

I just ran into this issue as well, while growing pg_num and pgp_num.


2015-07-29 16:53:16.018+0000: starting up libvirt version: 1.2.16, qemu version: 2.3.0
LC_ALL=C PATH=/bin:/usr/bin QEMU_AUDIO_DRV=none /bin/qemu-system-x86_64 -name one-30 -S -machine pc-i440fx-2.2,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid cb6683fc-d1d1-42ad-aa5c-bcb8a0d322c4 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-30.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:rbd/one-16-30-0:auth_supported=none:mon_host=aurae-storage-1\:6789\;aurae-storage-2\:6789\;aurae-storage-3\:6789\;aurae-storage-4\:6789\;aurae-storage-5\:6789\;aurae-storage-6\:6789,if=none,id=drive-ide0-0-0,format=raw,cache=writeback -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/var/lib/one//datastores/0/30/disk.2,if=none,id=drive-ide0-0-1,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=rbd:rbd/one-21-30-1:auth_supported=none:mon_host=aurae-storage-1\:6789\;aurae-storage-2\:6789\;aurae-storage-3\:6789\;aurae-storage-4\:6789\;aurae-storage-5\:6789\;aurae-storage-6\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=12,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=02:00:0a:50:00:13,bus=pci.0,addr=0x3 -vnc 0.0.0.0:30 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox on -msg timestamp=on
osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned int, std::set<pg_t>*) const' thread 7f9f895a8700 time 2015-08-13 14:57:29.443302
osd/osd_types.cc: 459: FAILED assert(m_seed < old_pg_num)
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x11dfe8) [0x7f9f8f607fe8]
 2: (()+0x1e99e1) [0x7f9f8f6d39e1]
 3: (()+0x1e9abd) [0x7f9f8f6d3abd]
 4: (()+0x8f939) [0x7f9f8f579939]
 5: (()+0xa6c73) [0x7f9f8f590c73]
 6: (()+0xa74ba) [0x7f9f8f5914ba]
 7: (()+0xa89f2) [0x7f9f8f5929f2]
 8: (()+0xae8ff) [0x7f9f8f5988ff]
 9: (()+0x2887aa) [0x7f9f8f7727aa]
 10: (()+0x2b501d) [0x7f9f8f79f01d]
 11: (()+0x8354) [0x7f9f8d909354]
 12: (clone()+0x6d) [0x7f9f8d64871d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
2015-08-13 14:57:29.766+0000: shutting down

Actions #8

Updated by Josh Durgin over 8 years ago

  • Status changed from Can't reproduce to Need More Info
  • Regression set to No

Which version of librados was this, Roy? How many VMs saw this crash, and how many are there in total (trying to figure out how hard it is to reproduce in tests)?

Actions #9

Updated by Roy Keene over 8 years ago

This was Ceph 0.94.1 on Linux/x86_64, so whatever version of librados came with that release.

About 2 of 10 VMs died this way simultaneously.

The QEMU version used is 2.3.0.

Actions #10

Updated by Wei-Chung Cheng over 8 years ago

I have the same issue with ceph-0.94.2.
Ubuntu version: 12.04.5
kernel version: 3.13.0-35
QEMU version: 2.0.0

8 of 20 VMs shut off (under normal I/O stress).

Without the VMs it is hard to reproduce (I have not managed to reproduce it so far).

I am still trying to figure out whether m_seed or old_pg_num holds the wrong value.

Thanks!
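
For what it's worth: the placement seed is produced by folding the object's hash into the range [0, pg_num) with ceph_stable_mod(), so a seed is always smaller than the pg_num it was computed under. That suggests the assert fires because old_pg_num is stale, not because m_seed is corrupt. A small self-contained illustration (ceph_stable_mod is paraphrased from memory of Ceph's src/include/rados.h, and the hash and mask values are illustrative assumptions):

// ceph_stable_mod folds x into [0, b); bmask is the next power of two at
// or above b, minus 1. Paraphrased from memory of src/include/rados.h.
#include <cassert>
#include <cstdio>

static unsigned ceph_stable_mod(unsigned x, unsigned b, unsigned bmask) {
  if ((x & bmask) < b)
    return x & bmask;
  else
    return x & (bmask >> 1);
}

int main() {
  const unsigned old_pg_num = 512,  old_mask = 511;   // pre-split pool
  const unsigned new_pg_num = 1024, new_mask = 1023;  // after doubling pg_num
  const unsigned raw_hash = 0xdeadbeefu;              // illustrative object hash

  unsigned old_seed = ceph_stable_mod(raw_hash, old_pg_num, old_mask);
  unsigned new_seed = ceph_stable_mod(raw_hash, new_pg_num, new_mask);

  // A seed is always consistent with the pg_num it was folded under:
  assert(old_seed < old_pg_num);
  assert(new_seed < new_pg_num);

  // The crash mode: a seed computed under the new pg_num (up to 1023 here)
  // checked against the old pg_num (512) can violate seed < old_pg_num.
  std::printf("old_seed=%u new_seed=%u (new_seed >= old_pg_num: %s)\n",
              old_seed, new_seed, new_seed >= old_pg_num ? "yes" : "no");
  return 0;
}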

Actions #11

Updated by Sage Weil over 8 years ago

Is this the same as #10543?

Actions #12

Updated by Wei-Chung Cheng over 8 years ago

Sage Weil wrote:

Is this the same as #10543?

Yes, I think so.

The stack trace and environment look very similar.

Actions #13

Updated by Jason Dillaman over 8 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Jason Dillaman

Easily reproducible when "rbd bench-write" is running in the background while you double the number of PGs.

Actions #14

Updated by Jason Dillaman over 8 years ago

  • Backport set to hammer,giant,firefly
Actions #15

Updated by Jason Dillaman over 8 years ago

  • Status changed from In Progress to Fix Under Review
Actions #16

Updated by Loïc Dachary over 8 years ago

  • Backport changed from hammer,giant,firefly to hammer,firefly

giant is retired

Actions #17

Updated by Jason Dillaman over 8 years ago

  • Backport changed from hammer,firefly to infernalis,hammer,firefly
Actions #18

Updated by Jason Dillaman over 8 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #19

Updated by Nathan Cutler over 8 years ago

  • Status changed from Pending Backport to Resolved
Actions #20

Updated by Nathan Cutler over 8 years ago

Already in infernalis. Hammer and firefly backports have been merged.
