Bug #3075
rados python tests occasionally hang with ms failures
Status:
Resolved
Priority:
High
Assignee:
-
Category:
librados
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
    kernel: &id001
      branch: testing
    kdb: true
    nuke-on-error: true
    overrides:
      ceph:
        branch: wip-3070
        conf:
          client:
            debug client: 20
            debug monc: 20
          global:
            debug ms: 20
            ms inject socket failures: 500
          mds:
            debug mds: 20
        fs: btrfs
        log-whitelist:
        - slow request
    roles:
    - - mon.a
      - mon.c
      - osd.0
      - osd.1
      - osd.2
    - - mon.b
      - mds.a
      - osd.3
      - osd.4
      - osd.5
    - - client.0
    targets:
      ubuntu@plana54.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC60+3L8IN2WBpWHY94YAuCOMVdKs3xUqYQpO1ie127fBomk7fiEhR0RovhmDHWIzNr3qvvNkIp9Y+MHUcpZ7C4MFFMYsy2+zq026Ag3XLEOyZWDSyPfMapd5+nmuvxJqEvAx4wAWBhYVEB3aPFmDmz4mayZ9aSYoA1lhsClxfYpAHZ0zRWX3kY1KxXlk6UrZy0igYGvKIvmubkYcmFzOPsI3aWpgWU1rEXGWsFHOlwaor0KJPnpEsZYTrlPyLZqJcKbI/EcHgti0ak22vsDT7LVMKoyPXeUFL5ZGUEpuqQ+IMiECCMKa8X8vPG2MN9V6DK3gQezF+lo5CRCAu7DYdn
      ubuntu@plana55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdrzGTR0Fbl6sedYlwlX+FlmF6fuE3l/RTu2kzOkmG47rPEn5CI37Injb7Epc50RXCbUIfzmDqtEY6uZT3YssYrE4jvhQlynPndbn1KmiTbgxTyuumGXv7O4OOntezighA1W49phUNZys1DhdEEO8VSQAIdHrBgBLhY9DDgC4LAhrP4BSbDTN0rUXtYYHBj4aa3sJV0o3sKjpsyjjlieEQnto6JkjK6EGZCSuY+AyMZyLJjFTgMwJ9i4aC5eZoWZAWSDfDsxo8PtFR+kjUmz5uiheyn5lAzKBxmd4ZNojf7wOhSGia0ghbtUeQkdoRZXZhP2ourNn3uAguf1xt43kX
      ubuntu@plana57.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuMOcu2XPQovy/Qzmwyvc9tvGP9JZVJ6cqiJ3RPOSGgAifKLTxe2ramHpD8AKcdthu8VAfouFpZK4CtBWKJowurR+4yZKgEugzvYuZ/nK/np56vreBQmRBWD1vLPtxPsTT3YGu5qx+ixdSwrSxexxc0/7+EW9x1D6knL+OGUNWksoGIRlXxjh9qafbw/1XKeQQF28vxBXHofXUFY8USMUcq5HDuaFfmgKzufH6vk84oqyr/jtGej6b4g6tbGiHPYR+o5tmTQHyxpOxqLZP2RFFqHlQ/QaOmRvSNIoOo+1UbqdcWsLk16/lXIS1mI+BZsZouk1H+fGeMTEUDGktiPW7
    tasks:
    - internal.lock_machines: 3
    - internal.save_config: null
    - internal.check_lock: null
    - internal.connect: null
    - internal.check_conflict: null
    - kernel: *id001
    - internal.base: null
    - internal.archive: null
    - internal.coredump: null
    - internal.syslog: null
    - internal.timer: null
    - chef: null
    - clock: null
    - ceph: null
    - ceph-fuse: null
    - workunit:
        clients:
          client.0:
          - rados/test_python.sh
I got 2 hangs out of ~200 runs.
Attaching gdb to python didn't say much, except that it was blocking in RadosClient::pool_delete in one case and
    #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
    #1  0x00007f6bfc279196 in Wait (mutex=..., this=0x7fff537bf790) at ./common/Cond.h:55
    #2  librados::RadosClient::pool_create (this=<optimized out>, name=..., auid=0, crush_rule=0 '\000') at librados/RadosClient.cc:413
    #3  0x00007f6bfc26c839 in rados_pool_create (cluster=0x12faf10, name=<optimized out>) at librados/librados.cc:1652
in another. This is probably an issue in the monclient/mon interaction...
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-fuse2/14739
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-fuse2/14564
Updated by Sage Weil over 11 years ago
    void Objecter::maybe_request_map(epoch_t epoch)
    {
      int flag = 0;
      if (osdmap->test_flag(CEPH_OSDMAP_FULL)) {
        ldout(cct, 10) << "maybe_request_map subscribing (continuous) to next osd map (FULL flag is set)" << dendl;
      } else {
        ldout(cct, 10) << "maybe_request_map subscribing (onetime) to next osd map" << dendl;
        flag = CEPH_SUBSCRIBE_ONETIME;
      }
      if (!epoch) {
        epoch = osdmap->get_epoch() ? osdmap->get_epoch()+1 : 0;
      }
      if (monc->sub_want("osdmap", epoch, flag))
        monc->renew_subs();
    }
This is broken: the epoch we send with the subscription is the start epoch, not the epoch we (think we) want.
Updated by Sage Weil over 11 years ago
- Status changed from New to Resolved