Project

General

Profile

Bug #2454

"rbd info xyz" hanging forever sometimes

Added by Simon Frerichs almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're running ceph with 3 mon and 21 osds to host about 80 KVM VMs.
Sometimes "rbd info" is hanging forever without any clear reason and shows something like "tid 1 on osd.X is laggy". Sometimes the request gets directed to another osd and the problem doesn't occur. This problem isn't bound to a specific osd.
The whole cluster is performing well when this happens and disk load on the "laggy" osd is low.

2012-05-20 18:14:24.558379    pg v2743437: 2112 pgs: 2112 active+clean; 2466 GB data, 4896 GB used, 31304 GB / 36201 GB avail

We're running current next branch ( git rev 4277d4d3378dde4264e2b8d211371569219c6e4b ). The problem isn't new and already happened with last 3 stable versions.

Here is some log when running rbd with "--debug-rbd 20 --debug-objecter 20 --debug-ms=1"

rbd info kvm1148  --debug-rbd 20 --debug-objecter 20 --debug-ms=1
2012-05-20 17:05:16.678000 7f9209606760  1 -- :/0 messenger.start
2012-05-20 17:05:16.678509 7f9209606760  1 -- :/1024998 --> 10.100.1.1:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7f9200001750 con 0x7f92000013f0
2012-05-20 17:05:16.679178 7f92095fe700  1 -- 10.100.1.16:0/1024998 learned my addr 10.100.1.16:0/1024998
2012-05-20 17:05:16.680364 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 1 ==== mon_map v1 ==== 751+0+0 (4218198165 0 0) 0x27b6300 con 0x7f92000013f0
2012-05-20 17:05:16.680458 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 2 ==== auth_reply(proto 2 0 Success) v1 ==== 33+0+0 (3384519404 0 0) 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.680628 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x27b60d0 con 0x7f92000013f0
2012-05-20 17:05:16.681371 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 3 ==== auth_reply(proto 2 0 Success) v1 ==== 206+0+0 (1788477452 0 0) 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.681469 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x27b6670 con 0x7f92000013f0
2012-05-20 17:05:16.682348 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 4 ==== auth_reply(proto 2 0 Success) v1 ==== 409+0+0 (696677561 0 0) 0x27b7250 con 0x7f92000013f0
2012-05-20 17:05:16.682410 7f92061ee700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f9200001950 con 0x7f92000013f0
2012-05-20 17:05:16.682562 7f9209606760 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:16.682595 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=0}) v2 -- ?+0 0x7f9200001860 con 0x7f92000013f0
2012-05-20 17:05:16.682615 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=0}) v2 -- ?+0 0x7f9200001d00 con 0x7f92000013f0
2012-05-20 17:05:16.683203 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 5 ==== mon_map v1 ==== 751+0+0 (4218198165 0 0) 0x27b6de0 con 0x7f92000013f0
2012-05-20 17:05:16.683262 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x27b7960 con 0x7f92000013f0
2012-05-20 17:05:16.683819 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 7 ==== osd_map(6883..6883 src has 6383..6883) v3 ==== 11213+0+0 (2197969998 0 0) 0x27b6de0 con 0x7f92000013f0
2012-05-20 17:05:16.683861 7f92061ee700  3 client.11755.objecter handle_osd_map got epochs [6883,6883] > 0
2012-05-20 17:05:16.683871 7f92061ee700  3 client.11755.objecter handle_osd_map decoding full epoch 6883
2012-05-20 17:05:16.684043 7f92061ee700 20 client.11755.objecter dump_active .. 0 homeless
2012-05-20 17:05:16.684077 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f9200002e20 con 0x7f92000013f0
2012-05-20 17:05:16.684210 7f9209606760 20 librbd: open_image: ictx =  0x7f920000b8d0 name =  'kvm1148' snap_name = ''
2012-05-20 17:05:16.684231 7f9209606760 20 librbd: ictx_refresh 0x7f920000b8d0
2012-05-20 17:05:16.684298 7f9209606760 10 client.11755.objecter recalc_op_target tid 1 pgid 2.7d848fd9 acting [2,15]
2012-05-20 17:05:16.684355 7f9209606760 20 client.11755.objecter  note: not requesting commit
2012-05-20 17:05:16.684357 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 9 ==== osd_map(6883..6883 src has 6383..6883) v3 ==== 11213+0+0 (2197969998 0 0) 0x7f920000c560 con 0x7f92000013f0
2012-05-20 17:05:16.684375 7f9209606760 10 client.11755.objecter op_submit oid kvm1148.rbd @2 [read 0~4096] tid 1 osd.2
2012-05-20 17:05:16.684388 7f9209606760 15 client.11755.objecter send_op 1 to osd.2
2012-05-20 17:05:16.684397 7f9209606760  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- osd_op(client.11755.0:1 kvm1148.rbd [read 0~4096] 2.7d848fd9) v4 -- ?+0 0x7f920000d6b0 con 0x7f920000ce90
2012-05-20 17:05:16.684414 7f9209606760  5 client.11755.objecter 1 unacked, 0 uncommitted
2012-05-20 17:05:16.684427 7f92061ee700  3 client.11755.objecter handle_osd_map ignoring epochs [6883,6883] <= 6883
2012-05-20 17:05:16.684431 7f92061ee700 20 client.11755.objecter dump_active .. 0 homeless
2012-05-20 17:05:16.684433 7f92061ee700 20 client.11755.objecter 1 2.7d848fd9 osd.2 kvm1148.rbd [read 0~4096]
2012-05-20 17:05:16.684448 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f920000c8c0 con 0x7f92000013f0
2012-05-20 17:05:21.682650 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:26.682758 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:31.682864 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:31.682881 7f91fffff700  2 client.11755.objecter  tid 1 on osd.2 is laggy
2012-05-20 17:05:31.682889 7f91fffff700 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:31.682898 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.1:6789/0 -- mon_subscribe({monmap=9+,osdmap=6884}) v2 -- ?+0 0x27b8100 con 0x7f92000013f0
2012-05-20 17:05:31.682937 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x27b8390 con 0x7f920000ce90
2012-05-20 17:05:31.683522 7f92061ee700  1 -- 10.100.1.16:0/1024998 <== mon.0 10.100.1.1:6789/0 11 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (496062740 0 0) 0x7f920000c8c0 con 0x7f92000013f0
2012-05-20 17:05:36.683040 7f91fffff700 10 client.11755.objecter tick
2012-05-20 17:05:36.683058 7f91fffff700  2 client.11755.objecter  tid 1 on osd.2 is laggy
2012-05-20 17:05:36.683062 7f91fffff700 10 client.11755.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-20 17:05:36.683067 7f91fffff700  1 -- 10.100.1.16:0/1024998 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x7f920000d800 con 0x7f920000ce90

This progress keeps looping forever.

Looking at the osd2 admin socket for in flight ops shows nothing:

ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}

cephlog1.txt View (65.9 KB) Simon Frerichs, 05/20/2012 10:34 PM

_rbd.log View (90.3 KB) Simon Frerichs, 05/22/2012 12:36 AM

_osd2.log View (44.6 MB) Simon Frerichs, 05/22/2012 12:36 AM

6886 (10.9 KB) Simon Frerichs, 05/29/2012 10:57 PM

valgrind.log View (5.57 KB) Simon Frerichs, 05/30/2012 08:07 AM

rbd.log View (179 KB) Simon Frerichs, 05/30/2012 08:07 AM

osd.log.gz (8.49 MB) Simon Frerichs, 05/30/2012 08:07 AM

osd2.log.gz (8.48 MB) Simon Frerichs, 05/30/2012 08:22 AM

osd.log.gz (11 MB) Simon Frerichs, 05/30/2012 09:37 AM

valgrind.log View (6.89 KB) Simon Frerichs, 05/30/2012 09:37 AM

osdmap.6924__0_0A051208 (10.9 KB) Simon Frerichs, 05/30/2012 10:03 AM

osdmap.6925__0_0A0513D8 (10.9 KB) Simon Frerichs, 05/30/2012 10:03 AM

6939 (11.4 KB) Simon Frerichs, 05/30/2012 10:12 AM

osdmap.6939__0_0A052908 (11.4 KB) Simon Frerichs, 05/30/2012 10:12 AM


Related issues

Related to Ceph - Bug #8515: rbd ls or rbd info hangs for ever on specific client Rejected 06/03/2014

Associated revisions

Revision 519fadbc (diff)
Added by Sage Weil over 11 years ago

osd: fix rewewight_by_utilization

Update the incremental, not the in-memory OSDMap!

Fixes: #2454
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil almost 12 years ago

  • Priority changed from Normal to High

#2 Updated by Sage Weil almost 12 years ago

  • Status changed from New to Need More Info

How reproducible is this? If you can reproduce with 'debug ms = 20' on the client side, we can be sure it's the osd's fault.

If it does look like the OSD, if you're doing it on a specific image, we can then bump the logging on osd.2 (or whichever it is) to see what happens on that end... but first let's make sure it's not the client's fault.

#3 Updated by Simon Frerichs almost 12 years ago

I wrote a little script which calls "rbd ls" and then loops through "rbd info $x" for every volume.
We've about 80 volumes and the bug is happening on about ~5. So it's "easy" to reproduce.

I reproduced the bug with "debug ms = 20" and extracted some loop part. ( full log is attached )

...
2012-05-21 07:27:58.158249 7ff19e4e3700 10 client.11820.objecter tick
2012-05-21 07:27:58.158265 7ff19e4e3700  2 client.11820.objecter  tid 1 on osd.2 is laggy
2012-05-21 07:27:58.158268 7ff19e4e3700 10 client.11820.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-21 07:27:58.158273 7ff19e4e3700  1 -- 10.100.1.16:0/1016781 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x7ff190003d80 con 0x7ff19800b5c0
2012-05-21 07:27:58.158281 7ff19e4e3700 20 -- 10.100.1.16:0/1016781 submit_message ping v1 remote 10.100.1.11:6802/30752
2012-05-21 07:27:58.158303 7ff19d4e1700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:27:58.158325 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer encoding 5 0x7ff190003d80 ping v1
2012-05-21 07:27:58.158332 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer sending 5 0x7ff190003d80
2012-05-21 07:27:58.158336 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).write_message 0x7ff190003d80
2012-05-21 07:27:58.158358 7ff19d4e1700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:27:58.158368 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer sleeping
2012-05-21 07:27:58.365138 7ff19d3e0700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).reader got ACK
2012-05-21 07:27:58.365163 7ff19d3e0700 15 reader got ack seq 5
2012-05-21 07:27:58.365164 7ff19d3e0700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).reader reading tag...
2012-05-21 07:28:01.153723 7ff19f5e6700 20 -- 10.100.1.16:0/1016781 send_keepalive con 0x7ff198001520, have pipe.
2012-05-21 07:28:01.153762 7ff1a39f8700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.1:6789/0 pipe(0x7ff1980012b0 sd=3 pgs=253 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:28:01.153779 7ff1a39f8700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.1:6789/0 pipe(0x7ff1980012b0 sd=3 pgs=253 cs=1 l=1).write_keepalive
2012-05-21 07:28:01.153803 7ff1a39f8700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.1:6789/0 pipe(0x7ff1980012b0 sd=3 pgs=253 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:28:01.153812 7ff1a39f8700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.1:6789/0 pipe(0x7ff1980012b0 sd=3 pgs=253 cs=1 l=1).writer sleeping
2012-05-21 07:28:03.158374 7ff19e4e3700 10 client.11820.objecter tick
2012-05-21 07:28:03.158390 7ff19e4e3700  2 client.11820.objecter  tid 1 on osd.2 is laggy
2012-05-21 07:28:03.158393 7ff19e4e3700 10 client.11820.objecter maybe_request_map subscribing (onetime) to next osd map
2012-05-21 07:28:03.158398 7ff19e4e3700  1 -- 10.100.1.16:0/1016781 --> 10.100.1.11:6802/30752 -- ping v1 -- ?+0 0x7ff190004ab0 con 0x7ff19800b5c0
2012-05-21 07:28:03.158406 7ff19e4e3700 20 -- 10.100.1.16:0/1016781 submit_message ping v1 remote 10.100.1.11:6802/30752
2012-05-21 07:28:03.158422 7ff19d4e1700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:28:03.158448 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer encoding 6 0x7ff190004ab0 ping v1
2012-05-21 07:28:03.158458 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer sending 6 0x7ff190004ab0
2012-05-21 07:28:03.158463 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).write_message 0x7ff190004ab0
2012-05-21 07:28:03.158539 7ff19d4e1700 10 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer: state = 2 policy.server=0
2012-05-21 07:28:03.158550 7ff19d4e1700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).writer sleeping
2012-05-21 07:28:03.365087 7ff19d3e0700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).reader got ACK
2012-05-21 07:28:03.365111 7ff19d3e0700 15 reader got ack seq 6
2012-05-21 07:28:03.365113 7ff19d3e0700 20 -- 10.100.1.16:0/1016781 >> 10.100.1.11:6802/30752 pipe(0x7ff19800b350 sd=5 pgs=1566 cs=1 l=1).reader reading tag...
2012-05-21 07:28:08.158500 7ff19e4e3700 10 client.11820.objecter tick
2012-05-21 07:28:08.158515 7ff19e4e3700  2 client.11820.objecter  tid 1 on osd.2 is laggy
...

#4 Updated by Greg Farnum almost 12 years ago

Yep, the OSD gets the message, acks it, and acks several following pings. Looks like some strange bug on the OSD side of things, from what's come up so far.

Can you capture this happening with "debug osd = 20" output for the OSD?

#5 Updated by Sage Weil almost 12 years ago

  • Category set to OSD
  • Priority changed from High to Urgent

#6 Updated by Simon Frerichs almost 12 years ago

I've attached rbd and osd log. The hanging rbd was started at 09:10:33 and ended at 09:12:18.

One thing came to my attention when scrolling through the osd log:

May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773771 7fc66fe34700 20 osd.2 6886 _share_map_incoming client.11845 46.19.94.16:0/1020342 6886
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773744 7fc66fe34700 20 osd.2 6886 _dispatch 0x2efbd80 osd_op(client.11845.0:1 kvm1148.rbd [read 0~4096] 2.7d848fd9) v4
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773766 7fc66fe34700 15 osd.2 6886 require_same_or_newer_map 6886 (i am 6886) 0x2efbd80
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773771 7fc66fe34700 20 osd.2 6886 _share_map_incoming client.11845 46.19.94.16:0/1020342 6886
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773780 7fc66fe34700  7 osd.2 6886 hit non-existent pg 2.1d9
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773791 7fc66fe34700  7 osd.2 6886 don't have sender's osdmap; assuming it was valid and that client will resend
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773780 7fc66fe34700  7 osd.2 6886 hit non-existent pg 2.1d9
May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773791 7fc66fe34700  7 osd.2 6886 don't have sender's osdmap; assuming it was valid and that client will resend

Seems like the the requests get directed to the wrong osd. Sometimes the rbd info gets directed to another osd and works fine.

ceph pg dump|grep 2.1d9
2.1d9    1352    0    0    0    5216599152    124370    124370    active+clean    2012-05-20 16:43:11.461383    6889'677645    6856'723376    [3,15]    [3,15]    6740'430330    2012-05-14 06:38:38.406414

Seems like this pg is placed on osd3 + 15? Not sure if i'm right :)

#7 Updated by Greg Farnum almost 12 years ago

  • Project changed from Ceph to Linux kernel client
  • Category deleted (OSD)
  • Assignee set to Alex Elder

Argh, you're right. This looks like a problem with a CRUSH mismatch between the userspace and kernelspace implementations. It was fixed recently and I suspect a cherry-pick of the right commit into your kernel would fix it...
Alex? Sage?

#8 Updated by Josh Durgin almost 12 years ago

  • Assignee deleted (Alex Elder)

Greg, this has nothing to do with the kernel - it's the rbd command line tool. I can't seem to change the tracker back to Ceph though.

#9 Updated by Sage Weil almost 12 years ago

  • Project changed from Linux kernel client to Ceph

#10 Updated by Greg Farnum almost 12 years ago

Oh geeze, my bad!

#11 Updated by Simon Frerichs almost 12 years ago

I had some spare time to do some testing, today.
As mentioned on my initial post we're running KVM VPS on Ceph.
I rebooted the VPS running on volume 'kvm1148' which is hanging in rbd info sometimes several times today.
It's also hanging on KVM boot every second or third attempt so i'm sure this problem doesn't belong to the 'rbd' command.

#12 Updated by Simon Frerichs almost 12 years ago

Some additional information:
I'm running Kernel 3.3.6 from kernel.org compiled two days ago from kernel.org on the test system:

root@fcnode06:/home/frerichs# uname -a
Linux fcnode06 3.3.6 #1 SMP Sun May 20 16:17:36 CEST 2012 x86_64 GNU/Linux

#13 Updated by Greg Farnum almost 12 years ago

I notice that the pgid output on these lines don't match, even though they're using the same output function and the value in question (while encoded and decoded between the client and the OSD) should be the same:

2012-05-22 09:10:33.771956 7f1e3a4b6760 10 client.11845.objecter recalc_op_target tid 1 pgid 2.7d848fd9 acting [2,15]

2012-05-22 09:10:33.773780 7fc66fe34700  7 osd.2 6886 hit non-existent pg 2.1d9

Does this give you any ideas, Sage? That client output line should be of the sanitized (not raw) pgid, which makes me think that maybe the client is mistakenly using the wrong value.

#14 Updated by Simon Frerichs over 11 years ago

if you need more information, feel free to ping me.

#15 Updated by Sage Weil over 11 years ago

Well, i see one (unrelated) oddity in this code, but it doesn't explain how the non-existent message is coming up. (It does offer some hope for #2022, though!) Do you have a copy of osdmap epoch 6886? Or can you reproduce again, and this time also attach the osdmap epoch that appears in the messages?

#16 Updated by Simon Frerichs over 11 years ago

Found a copy of the old osdmap at one monitor. I've attached it.

#17 Updated by Sage Weil over 11 years ago

Ok, the puzzling bit is that it isn't able to find the pg in question. It appears a bit hgher up in th elog:

May 22 09:10:30 fcnode01 ceph-osd: 2012-05-22 09:10:30.502471 7f94ea584700 0 osd.3 6886 pg[2.1d9( v 6886'676932 (6883'675931,6886'676932] n=1352 ec=1 les/c 6882/6883 6882/6882/6856) [3,15] r=0 lpr=6882 mlcod 6886'676931 active+clean] watch: ctx->obc=0x1288340 cookie=1 oi.ver

but later, in the same epoch, it is gone:

May 22 09:10:33 fcnode01 ceph-osd: 2012-05-22 09:10:33.773780 7fc66fe34700 7 osd.2 6886 hit non-existent pg 2.1d9

which suggests to me that there is some heap corruption going on. Would it be possible for you to run osd.2 through valgrind?

valgrind --tool=memcheck --log-file=/tmp/foo ceph-osd -i 2 --debug-osd 20 --debug-ms 1

and attach the osd, valgrind, and client logs?

#18 Updated by Simon Frerichs over 11 years ago

Here we go.
I've started the hanging rbd info process at 2012-05-30 16:46:35.281623

#19 Updated by Simon Frerichs over 11 years ago

I've reattached the osd log because the first version hat some old data in it.

#20 Updated by Simon Frerichs over 11 years ago

Simon Frerichs wrote:

I've reattached the osd log because the first version hat some old data in it.

hum, lost the file i had attached. so another post... i'm sorry for this mess.

#21 Updated by Sage Weil over 11 years ago

Simon Frerichs wrote:

Simon Frerichs wrote:

I've reattached the osd log because the first version hat some old data in it.

hum, lost the file i had attached. so another post... i'm sorry for this mess.

Is it possible to attach the whole osd2 log? Or a section that includes an osd map update. e.g., do something like 'ceph osd setmaxosd 22' (does nothing) and include that section the osd log. It's not clear why the PG is showing up periodically in the logs, but isn't being found when that request is sent.

Is this a long-running ceph-osd process, or has it been restarted recently? (If it hasn't, let's not restart it just yet.)

#22 Updated by Simon Frerichs over 11 years ago

Sage Weil wrote:

Is it possible to attach the whole osd2 log? Or a section that includes an osd map update. e.g., do something like 'ceph osd setmaxosd 22' (does nothing) and include that section the osd log. It's not clear why the PG is showing up periodically in the logs, but isn't being found when that request is sent.

Is this a long-running ceph-osd process, or has it been restarted recently? (If it hasn't, let's not restart it just yet.)

I restarted it when you asked for the valgrind data.

So here is what i did this time:

I had to restart the osd to enable logging again:

1. restarted the osd with logging enabled at 18:19
2. waited until pgs are active clean 18:21:53
3. ran rbd info kvm1148 --debug-rbd 20 --debug-objecter 20 --debug-ms=20 --debug-osd=20 and killed it after i noticed it's hanging at 18:24
4. ran ceph setmaxosd 22 ( 2012-05-30 18:25:14.167179 osd e6939: 21 osds: 21 up, 21 in )
5. ran rbd info again and it's hanging again.

full osd log from 18:15 to 18:26 is attached

#23 Updated by Sage Weil over 11 years ago

Just realized those were osd.3 messages mixed into the log, that's why it was confusing. However, osd.3 seems to think 2.1d9 maps to 3,15 instead of 2,15. Can you grab the osdmap epochs 6924 and 6925 from osd.2's data directory? (current/meta/osdmap_full_6924 or similar.. the names are a bit mangled, though).

Also, what version of librbd is in use?

#24 Updated by Simon Frerichs over 11 years ago

Sage Weil wrote:

Just realized those were osd.3 messages mixed into the log, that's why it was confusing. However, osd.3 seems to think 2.1d9 maps to 3,15 instead of 2,15. Can you grab the osdmap epochs 6924 and 6925 from osd.2's data directory? (current/meta/osdmap_full_6924 or similar.. the names are a bit mangled, though).

Also, what version of librbd is in use?

I've attached the maps. Not 100% sure if these are the files you want.

On the 'client' where I'm running 'rbd info' I use:

- Linux fcnode06 3.3.6 #1 SMP Sun May 20 16:17:36 CEST 2012 x86_64 GNU/Linux - compiled from kernel.org source
- ceph 'stable' branch

On the 'server' where the osds are running I use:

- Linux fcnode01 3.2.0-2-amd64 #1 SMP Sun Apr 15 16:47:38 UTC 2012 x86_64 GNU/Linux - debian wheezy kernel
- ceph 'stable' branch

The hanging rbd problem happens on both, client and server.

#25 Updated by Sage Weil over 11 years ago

Aha, ok I think we're on the right track. osd.2's map for 6924 says [3,15], but the client sent it to [2,15].

Can you get 6939 from the monitor and from the osd, and attach those both? (That's the epoch for the failure in the most recent log.)

And can you confirm the librbd client version? ('dpkg -l librbd1', and/or 'rbd -v')

#26 Updated by Simon Frerichs over 11 years ago

Sage Weil wrote:

Aha, ok I think we're on the right track. osd.2's map for 6924 says [3,15], but the client sent it to [2,15].

Can you get 6939 from the monitor and from the osd, and attach those both? (That's the epoch for the failure in the most recent log.)

And can you confirm the librbd client version? ('dpkg -l librbd1', and/or 'rbd -v')

- file 6939 is from mon.a
- file osdmap.6939__0_0A052908 is from osd.2

root@fcnode06:/home/frerichs/ceph# rbd -v
ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)

#27 Updated by Sage Weil over 11 years ago

progress:

fatty:2454 10:18 AM $ ~/src/ceph/src/osdmaptool --test-map-pg 2.1d9 6939
/home/sage/src/ceph/src/osdmaptool: osdmap file '6939'
 parsed '2.1d9' -> 2.1d9
2.1d9 raw [2,15] up [2,15] acting [2,15]
fatty:2454 10:18 AM $ ~/src/ceph/src/osdmaptool --test-map-pg 2.1d9 osdmap.6939__0_0A052908
/home/sage/src/ceph/src/osdmaptool: osdmap file 'osdmap.6939__0_0A052908'
 parsed '2.1d9' -> 2.1d9
2.1d9 raw [3,15] up [3,15] acting [3,15]

#28 Updated by Sage Weil over 11 years ago

Ok, the problem is that the weights are different on the osd's and mon's map. I'm not sure how that could have happened. Can you look for the oldest map on the monitor, do osdmaptool -p <filename>, and see if the weights are all 1 or some of them are small? Same goes for osd.2. (Currently the osd's map has weights of 1, but the monitor does not for osd's 3-10.)

Also, do you know how those weights got changed? They have kind very small values, which is a bit unusual...

#29 Updated by Simon Frerichs over 11 years ago

Sage Weil wrote:

Ok, the problem is that the weights are different on the osd's and mon's map. I'm not sure how that could have happened. Can you look for the oldest map on the monitor, do osdmaptool -p <filename>, and see if the weights are all 1 or some of them are small? Same goes for osd.2. (Currently the osd's map has weights of 1, but the monitor does not for osd's 3-10.)

Also, do you know how those weights got changed? They have kind very small values, which is a bit unusual...

oldest map on mon.a is 6442:

...
osd.3 up   in  weight 0.000259399 up_from 5803 up_thru 6424 down_at 5802 last_clean_interval [5796,5799) 46.19.94.11:6820/1742 46.19.94.11:6821/1742 46.19.94.11:6822/1742 exists,up
osd.4 up   in  weight 0.000564575 up_from 5805 up_thru 6438 down_at 5804 last_clean_interval [5798,5799) 46.19.94.11:6823/1859 46.19.94.11:6824/1859 46.19.94.11:6825/1859 exists,up
osd.5 up   in  weight 0.00927734 up_from 5786 up_thru 6433 down_at 5785 last_clean_interval [5665,5784) 46.19.94.12:6800/13511 46.19.94.12:6801/13511 46.19.94.12:6802/13511 exists,up
osd.6 up   in  weight 0.414642 up_from 5790 up_thru 6436 down_at 5788 last_clean_interval [5668,5784) 46.19.94.12:6806/13746 46.19.94.12:6807/13746 46.19.94.12:6808/13746 exists,up
osd.7 up   in  weight 0.0010376 up_from 5780 up_thru 6406 down_at 5777 last_clean_interval [5755,5771) 46.19.94.12:6803/13155 46.19.94.12:6804/13155 46.19.94.12:6805/13155 exists,up
osd.8 up   in  weight 0.131973 up_from 5814 up_thru 6418 down_at 5813 last_clean_interval [5742,5806) 46.19.94.13:6810/29431 46.19.94.13:6812/29431 46.19.94.13:6813/29431 exists,up
osd.9 up   in  weight 9.15527e-05 up_from 5811 up_thru 6397 down_at 5810 last_clean_interval [5644,5806) 46.19.94.13:6806/29276 46.19.94.13:6807/29276 46.19.94.13:6809/29276 exists,up
osd.10 up   in  weight 0.420349 up_from 5808 up_thru 6294 down_at 5807 last_clean_interval [5643,5806) 46.19.94.13:6803/29109 46.19.94.13:6804/29109 46.19.94.13:6805/29109 exists,up
...

oldest map on osd.2 is 6440 and every weight value is 1.

I used reweight-by-utilization one or two times when 0.43 was stable because of very different disk usage on the osds. This didn't help so I don't tried it again. I think i saw this weight values when running this command.

Any chance to fix this?

#30 Updated by Sage Weil over 11 years ago

That's too bad.. i'll futz with it some here and see if i can reproduce.

You can fix your cluster with:

for f in `seq 3 10`; do ceph osd reweight $f 1 ; done

#31 Updated by Simon Frerichs over 11 years ago

Sage Weil wrote:

for f in `seq 3 10`; do ceph osd reweight $f 1 ; done

Thank you very much.
I'm unable to produce the rbd hang / hanging kvm on boot now.

We'll add some more osd's and lot's of new KVM on top next week.

#32 Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to Resolved

reweight-by-utilization was broken (modified in-memory OSDMap instead of updating pending incremental). fixed by 519fadbc96bc496b7da7920531d19472e6c43786

Also available in: Atom PDF