Bug #2454 (closed)

"rbd info xyz" hanging forever sometimes

Added by Simon Frerichs almost 12 years ago. Updated almost 12 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)

Description

We're running ceph with 3 mons and 21 osds to host about 80 KVM VMs.
Sometimes "rbd info" hangs forever without any clear reason, and the objecter log shows something like "tid 1 on osd.X is laggy". Sometimes the request gets directed to another osd and the problem doesn't occur. The problem isn't bound to a specific osd.
The whole cluster performs well while this happens, and disk load on the "laggy" osd is low.

2012-05-20 18:14:24.558379    pg v2743437: 2112 pgs: 2112 active+clean; 2466 GB data, 4896 GB used, 31304 GB / 36201 GB avail
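
For context, the line above is the pgmap summary printed by the standard cluster status command; a minimal way to capture the same state when the hang occurs (assuming nothing beyond the stock ceph CLI) is:

# overall cluster state; the "pg v2743437: 2112 pgs ..." line above presumably comes from this output
ceph -s
# quick health summary, to rule out down or flapping osds
ceph health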

We're running the current next branch (git rev 4277d4d3378dde4264e2b8d211371569219c6e4b). The problem isn't new and has already occurred with the last 3 stable versions.

Here is some log output from running rbd with "--debug-rbd 20 --debug-objecter 20 --debug-ms=1":

rbd info kvm1148  --debug-rbd 20 --debug-objecter 20 --debug-ms=1
2012-05-20 17:05:16.678000 7f9209606760  1 -- :/0 messenger.start
2012-05-20 17:05:16.678509 7f9209606760  1 -- :/1024998 --> 10.100.1.1:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7f9200001750 con 0x7f92000013f0
2012-05-20 17:05:16.679178 7f92095fe700  1 -- 10.100.1.16:0/1024998 learned my addr 10.100.1.16:0/1024998
2012-05-20 17:05:16.680364 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 1 ==== mon_map v1 ==== 751+0+0 (4218198165 0 0) 0x27b6300 con 0x7f92000013f0
2012-05-20 17:05:16.680458 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 2 ==== auth_reply(proto 2 0 Success) v1 ==== 33+0+0 (3384519404 0 0) 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.680628 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x27b60d0 con 0x7f92000013f0
2012-05-20 17:05:16.681371 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 3 ==== auth_reply(proto 2 0 Success) v1 ==== 206+0+0 (1788477452 0 0) 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.681469 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.682348 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 4 ==== auth_reply(proto 2 0 Success) v1 ==== 409+0+0 (696677561 0 0) 0x27b7250 con 0x7f92000013f0
2012-05-20 17:05:16.682410 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f9200001950 con 0x7f92000013f0
2012-05-20 17:05:16.682562 7f9209606760 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:16.682595 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=0}) v2 -- ?+0 0x7f9200001860 con 0x7f92000013f0
2012-05-20 17:05:16.682615 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=0}) v2 -- ?+0 0x7f9200001d00 con 0x7f92000013f0
2012-05-20 17:05:16.683203 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 5 ==== mon_map v1 ==== 751+0+0 (4218198165 0 0) 0x27b6de0 con 0x7f92000013f0
2012-05-20 17:05:16.683262 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x27b7960 con 0x7f92000013f0
2012-05-20 17:05:16.683819 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 7 ==== osd_map(6883..6883 src has 6383..6883) v3 ==== 11213+0+0 (2197969998 0 0) 0x27b6de0 con 0x7f92000013f0
2012-05-20 17:05:16.683861 7f92061ee700  3 client.11755.objecter handle_osd_map got epochs [6883,6883] > 0
2012-05-20 17:05:16.683871 7f92061ee700  3 client.11755.objecter handle_osd_map decoding full epoch 6883
2012-05-20 17:05:16.684043 7f92061ee700 20 client.11755.objecter dump_active .. 0 homeless
2012-05-20 17:05:16.684077 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f9200002e20 con 0x7f92000013f0
2012-05-20 17:05:16.684210 7f9209606760 20 librbd: open_image: ictx =  0x7f920000b8d0 name =  'kvm1148' snap_name = ''
2012-05-20 17:05:16.684231 7f9209606760 20 librbd: ictx_refresh 0x7f920000b8d0
2012-05-20 17:05:16.684298 7f9209606760 10 client.11755.objecter recalc_op_target tid 1 pgid 2.7d848fd9 acting [2,15]
2012-05-20 17:05:16.684355 7f9209606760 20 client.11755.objecter  note: not requesting commit
2012-05-20 17:05:16.684357 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 9 ==== osd_map(6883..6883 src has 6383..6883) v3 ==== 11213+0+0 (2197969998 0 0) 0x7f920000c560 con 0x7f92000013f0
2012-05-20 17:05:16.684375 7f9209606760 10 client.11755.objecter op_submit oid kvm1148.rbd @2 [read 0~4096] tid 1 osd.2
2012-05-20 17:05:16.684388 7f9209606760 15 client.11755.objecter send_op 1 to osd.2
2012-05-20 17:05:16.684397 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- osd_op(client.11755.0:1 kvm1148.rbd [read 0~4096] 2.7d848fd9) v4 -- ?+0 0x7f920000d6b0 con 0x7f920000ce90
2012-05-20 17:05:16.684414 7f9209606760  5 client.11755.objecter 1 unacked, 0 uncommitted
2012-05-20 17:05:16.684427 7f92061ee700  3 client.11755.objecter handle_osd_map ignoring epochs [6883,6883] <= 6883
2012-05-20 17:05:16.684431 7f92061ee700 20 client.11755.objecter dump_active .. 0 homeless
2012-05-20 17:05:16.684433 7f92061ee700 20 client.11755.objecter 1 2.7d848fd9 osd.2 kvm1148.rbd [read 0~4096]
2012-05-20 17:05:16.684448 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f920000c8c0 con 0x7f92000013f0
2012-05-20 17:05:21.682650 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:26.682758 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:31.682864 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:31.682881 7f91fffff700  2 client.11755.objecter  tid 1 on osd.2 is laggy
2012-05-20 17:05:31.682889 7f91fffff700 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:31.682898 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=6884}) v2 -- ?+0 0x27b8100 con 0x7f92000013f0
2012-05-20 17:05:31.682937 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x27b8390 con 0x7f920000ce90
2012-05-20 17:05:31.683522 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 11 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f920000c8c0 con 0x7f92000013f0
2012-05-20 17:05:36.683040 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:36.683058 7f91fffff700  2 client.11755.objecter  tid 1 on osd.2 is laggy
2012-05-20 17:05:36.683062 7f91fffff700 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:36.683067 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x7f920000d800 con 0x7f920000ce90

This sequence keeps looping forever.
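
The objecter log above shows the read for kvm1148.rbd mapping to pg 2.7d848fd9 with acting set [2,15]. If the "osd map" subcommand is available in this build, that mapping can be double-checked from any client; the pool name below is a placeholder for whatever pool id 2 is called in this cluster:

# should report the same pg (2.7d848fd9) and acting set ([2,15]) as the client log
ceph osd map <poolname> kvm1148.rbd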

Looking at the osd.2 admin socket for in-flight ops shows nothing:

ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}


Files

cephlog1.txt (65.9 KB) - Simon Frerichs, 05/20/2012 10:34 PM
_rbd.log (90.3 KB) - Simon Frerichs, 05/22/2012 12:36 AM
_osd2.log (44.6 MB) - Simon Frerichs, 05/22/2012 12:36 AM
6886 (10.9 KB) - Simon Frerichs, 05/29/2012 10:57 PM
valgrind.log (5.57 KB) - Simon Frerichs, 05/30/2012 08:07 AM
rbd.log (179 KB) - Simon Frerichs, 05/30/2012 08:07 AM
osd.log.gz (8.49 MB) - Simon Frerichs, 05/30/2012 08:07 AM
osd2.log.gz (8.48 MB) - Simon Frerichs, 05/30/2012 08:22 AM
osd.log.gz (11 MB) - Simon Frerichs, 05/30/2012 09:37 AM
valgrind.log (6.89 KB) - Simon Frerichs, 05/30/2012 09:37 AM
osdmap.6924__0_0A051208 (10.9 KB) - Simon Frerichs, 05/30/2012 10:03 AM
osdmap.6925__0_0A0513D8 (10.9 KB) - Simon Frerichs, 05/30/2012 10:03 AM
6939 (11.4 KB) - Simon Frerichs, 05/30/2012 10:12 AM
osdmap.6939__0_0A052908 (11.4 KB) - Simon Frerichs, 05/30/2012 10:12 AM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #8515: rbd ls or rbd info hangs for ever on specific client (Rejected, Sage Weil, 06/03/2014)
