Bug #13545 (closed)

QEMU processes using librbd crashing with FAILED assert(m_ictx->owner_lock.is_locked())

Added by Edmund Rhudy over 8 years ago. Updated over 8 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I upgraded Ceph from 0.94.2 to 0.94.4 in a lab environment (we had been waiting for 0.94.4 to resolve the bug described in http://tracker.ceph.com/issues/10399). During post-upgrade testing, some QEMU instances crashed with the following stack trace:

librbd/LibrbdWriteback.cc: In function 'virtual ceph_tid_t librbd::LibrbdWriteback::write(const object_t&, const object_locator_t&, uint64_t, uint64_t, const SnapContext&, const bufferlist&, utime_t, uint64_t, __u32, Context*)' thread 7f28edffb700 time 2015-10-20 11:49:08.120786
librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked())
 ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
 1: (()+0x17258b) [0x7f291798858b]
 2: (()+0xa9573) [0x7f29178bf573]
 3: (()+0x3a90ca) [0x7f2917bbf0ca]
 4: (()+0x3b583d) [0x7f2917bcb83d]
 5: (()+0x7212c) [0x7f291788812c]
 6: (()+0x9590f) [0x7f29178ab90f]
 7: (()+0x969a3) [0x7f29178ac9a3]
 8: (()+0x4782a) [0x7f291785d82a]
 9: (()+0x56599) [0x7f291786c599]
 10: (()+0x7284e) [0x7f291788884e]
 11: (()+0x162b7e) [0x7f2917978b7e]
 12: (()+0x163c10) [0x7f2917979c10]
 13: (()+0x8182) [0x7f2910e66182]
 14: (clone()+0x6d) [0x7f2910b9347d]

The commit that introduced the assert (https://github.com/ceph/ceph/commit/a38f9e5104a6e08e130dc4f15ad19a06d9e63719) is present in both 0.94.2 and 0.94.4, so I don't think this is a bug newly introduced in 0.94.4 itself, but I would like to understand the circumstances that are leading to the crashes.
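
For reference, my reading of the failing check: LibrbdWriteback::write() expects its caller to already hold the image context's owner_lock and aborts the process when it does not. The sketch below is not the actual librbd code, just a minimal illustration of that precondition pattern under my assumptions; TrackedLock, ImageCtx and WritebackHandler are made-up names standing in for the real types.

#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for librbd's owner_lock: a mutex that can report
// whether anyone currently holds it, so callees can assert the precondition.
class TrackedLock {
public:
  void lock()   { m_mtx.lock(); m_held.store(true); }
  void unlock() { m_held.store(false); m_mtx.unlock(); }
  bool is_locked() const { return m_held.load(); }
private:
  std::mutex m_mtx;
  std::atomic<bool> m_held{false};
};

// Hypothetical image context and writeback handler, mirroring the shape of
// the code in the stack trace: write() requires that the caller already
// holds owner_lock and aborts ("FAILED assert") if it does not.
struct ImageCtx {
  TrackedLock owner_lock;
};

struct WritebackHandler {
  explicit WritebackHandler(ImageCtx *ictx) : m_ictx(ictx) {}

  void write() {  // the real signature takes object id, offsets, data, snap context, ...
    assert(m_ictx->owner_lock.is_locked());  // the check that fails in my trace
    // ... issue the actual write here ...
  }

  ImageCtx *m_ictx;
};

int main() {
  ImageCtx ictx;
  WritebackHandler wb(&ictx);

  ictx.owner_lock.lock();
  wb.write();                // fine: the caller holds owner_lock
  ictx.owner_lock.unlock();

  wb.write();                // aborts on the assert, like the crashed QEMU processes
  return 0;
}

If that reading is right, the crash just means some writeback path reached write() without owner_lock held; which caller is doing that after the upgrade is what I can't tell from the trace alone.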

Timeline of upgrade:

  • uploaded about 20 copies of the same Ubuntu 14.04 cloud image in RAW format via OpenStack Glance to pool "images"
  • launched approximately 50 instances in OpenStack across pools "vms" and "volumes-ssd"; the RADOS block device backing each instance is a COW clone of a randomly selected image from the images pool
  • set noout
  • upgraded each mon one at a time, along with the OSDs hosted alongside it; also upgraded QEMU on those hosts from 2.2+dfsg-5expubuntu9.3 to 2.2+dfsg-5expubuntu9.4
  • upgraded remaining OSD-only servers and upgraded QEMU on them as well
  • verified cluster health (cluster was in HEALTH_WARN due to suboptimal PG settings on some pools but was not in recovery, all OSDs were up and in)
  • unset noout

After the upgrades were complete, I began testing for bug #10399 by creating an image named "test" in the vms pool and writing to it with rbd bench-write while repeatedly increasing pg_num/pgp_num on the vms pool, going from 128 PGs to 512 PGs in steps of 128. I then restarted all QEMU instances; most of the instances using the vms pool had already crashed by this point because they were still running against librbd 0.94.2. Instances that had crashed were started again via OpenStack Nova, and instances that had survived were soft rebooted. I then repeated the same test, increasing the PG count on the vms pool from 512 to 1024, again in steps of 128.

At this point I noticed that a pair of instances on one hypervisor were showing as Shutoff in Nova. I restarted them again and checked the QEMU instance logs, where I found the stack trace posted above. Both instances had previously crashed due to #10399 and restarted normally, but crashed again within a few minutes.

I checked other hypervisors for similar crashes. Since the upgrade, I have seen occasional spontaneous crashes, with this same assertion failing, in test instances launched by our monitoring. We checked a production cluster running Ceph 0.94.2 and QEMU 2.2+dfsg-5expubuntu9.3 and found no evidence of any instance ever crashing on this assertion, so something that changed in this upgrade is occasionally causing instances to crash.

I have preserved the crime scene and can provide more information as needed.

#1 - Updated by Jason Dillaman over 8 years ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)
#2 - Updated by Jason Dillaman over 8 years ago

  • Status changed from New to Duplicate

Marking duplicate of #13559
