Project

General

Profile

Bug #1127

RBD got silent after 1 month

Added by Yoshi Tamura over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

RBD got silent after about 1 month running.
Although I restarted the daemons, the symptom doesn't go away.
Attached a log from /var/log/ceph/osd.0.log. It seems that mon got
wrong at 2011-05-30 09:08:19.
Before then, it had no problem.

Using the Debian package version, 0.26-1~bpo60+1.

2011-05-30 09:05:51.258591 7f0213209700 osd0 2 OSD::ms_handle_reset()
2011-05-30 09:05:51.258620 7f0213209700 osd0 2 OSD::ms_handle_reset()
s=0xafeac60
2011-05-30 09:08:19.143404 7f0213209700 monclient: hunting for new mon
2011-05-30 09:08:19.153778 7f0213209700 osd0 2 OSD::ms_handle_reset()
2011-05-30 09:08:19.157018 7f0211205700 -- 192.168.100.7:6801/27764 >>
192.168.100.7:6789/0 pipe(0xac4ea00 sd=12 pgs=0 cs=0 l=0).fault first
fault
2011-05-30 09:08:20.220057 7f0213209700 osd0 2 OSD::ms_handle_reset()
2011-05-30 09:08:20.220086 7f0213209700 osd0 2 OSD::ms_handle_reset()
s=0xafead80
ceph version 0.26.commit: 9981ff90968398da43c63106694d661f5e3d07d5.
process: cosd. pid: 1120
2011-05-30 09:12:12.346156 7fdf06fa8760 filestore(/data/osd0) mount
FIEMAP ioctl is NOT supported
2011-05-30 09:12:12.346260 7fdf06fa8760 filestore(/data/osd0) mount
did NOT detect btrfs
2011-05-30 09:12:12.353443 7fdf06fa8760 filestore(/data/osd0) mount
found snaps <>
2011-05-30 09:12:12.384327 7fdf06fa8760 filestore(/data/osd0) mount:
WRITEAHEAD journal mode explicitly enabled in conf
2011-05-30 09:12:12.384359 7fdf06fa8760 filestore(/data/osd0) mount
WARNING: not btrfs or ext3; data may be lost
2011-05-30 09:12:12.384438 7fdf06fa8760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 09:12:12.423439 7fdf06fa8760 journal read_entry 727867392 :
seq 555687 1045 bytes
2011-05-30 09:12:12.423522 7fdf06fa8760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 16:07:09.388801 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
2011-05-30 16:07:09.388843 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
s=0x2b247e0
2011-05-30 16:09:18.108682 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
2011-05-30 16:09:18.108706 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
s=0x2b24a20
2011-05-30 16:13:08.438687 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
2011-05-30 16:13:08.438713 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
s=0x2b24b40
2011-05-30 16:15:01.244245 7fdefbab9700 monclient: hunting for new mon
2011-05-30 16:15:01.244379 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
2011-05-30 16:15:01.244449 7fdef9ab5700 -- 192.168.100.7:6801/1120 >>
192.168.100.7:6789/0 pipe(0x2b28780 sd=12 pgs=0 cs=0 l=0).fault first
fault
2011-05-30 16:15:02.593141 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
2011-05-30 16:15:02.593170 7fdefbab9700 osd0 6 OSD::ms_handle_reset()
s=0x2b24240
ceph version 0.26.commit: 9981ff90968398da43c63106694d661f5e3d07d5.
process: cosd. pid: 2665
2011-05-30 16:15:05.177239 7f6375dab760 filestore(/data/osd0) mount
FIEMAP ioctl is NOT supported
2011-05-30 16:15:05.177340 7f6375dab760 filestore(/data/osd0) mount
did NOT detect btrfs
2011-05-30 16:15:05.177377 7f6375dab760 filestore(/data/osd0) mount
found snaps <>
2011-05-30 16:15:05.177411 7f6375dab760 filestore(/data/osd0) mount:
WRITEAHEAD journal mode explicitly enabled in conf
2011-05-30 16:15:05.177423 7f6375dab760 filestore(/data/osd0) mount
WARNING: not btrfs or ext3; data may be lost
2011-05-30 16:15:05.177468 7f6375dab760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 16:15:05.203112 7f6375dab760 journal read_entry 727900160 :
seq 555691 1757 bytes
2011-05-30 16:15:05.203206 7f6375dab760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 16:15:29.368764 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
2011-05-30 16:15:29.368793 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
s=0x1c9e5a0
2011-05-30 16:16:58.408591 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
2011-05-30 16:16:58.408627 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
s=0x1c9e7e0
2011-05-30 16:17:27.008678 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
2011-05-30 16:17:27.008697 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
s=0x1c9e900
2011-05-30 16:19:42.595055 7f636a8bc700 monclient: hunting for new mon
2011-05-30 16:19:42.595112 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
2011-05-30 16:19:42.595246 7f63688b8700 -- 192.168.100.7:6801/2665 >>
192.168.100.7:6789/0 pipe(0x1ca4780 sd=12 pgs=0 cs=0 l=0).fault first
fault
2011-05-30 16:19:43.691896 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
2011-05-30 16:19:43.691920 7f636a8bc700 osd0 9 OSD::ms_handle_reset()
s=0x1c9e6c0
ceph version 0.26.commit: 9981ff90968398da43c63106694d661f5e3d07d5.
process: cosd. pid: 3470
2011-05-30 16:22:38.347927 7f7658be4760 filestore(/data/osd0) mount
FIEMAP ioctl is NOT supported
2011-05-30 16:22:38.348029 7f7658be4760 filestore(/data/osd0) mount
did NOT detect btrfs
2011-05-30 16:22:38.348067 7f7658be4760 filestore(/data/osd0) mount
found snaps <>
2011-05-30 16:22:38.348099 7f7658be4760 filestore(/data/osd0) mount:
WRITEAHEAD journal mode explicitly enabled in conf
2011-05-30 16:22:38.348119 7f7658be4760 filestore(/data/osd0) mount
WARNING: not btrfs or ext3; data may be lost
2011-05-30 16:22:38.348169 7f7658be4760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 16:22:38.395563 7f7658be4760 journal read_entry 727924736 :
seq 555694 1909 bytes
2011-05-30 16:22:38.395646 7f7658be4760 journal _open
/data/osd0/journal fd 11: 1048576000 bytes, block size 4096 bytes,
directio = 1
2011-05-30 16:22:52.468693 7f764d6f5700 osd0 11 OSD::ms_handle_reset()
2011-05-30 16:22:52.468728 7f764d6f5700 osd0 11 OSD::ms_handle_reset()
s=0xdf35a0
2011-05-30 16:31:46.888685 7f764d6f5700 osd0 12 OSD::ms_handle_reset()
2011-05-30 16:31:46.888710 7f764d6f5700 osd0 12 OSD::ms_handle_reset()
s=0xdf37e0

History

#1 Updated by Josh Durgin over 9 years ago

  • Category changed from librbd to OSD

Looked into this a bit on irc yesterday. This part of the osd log looks problematic - there's only one osd, so the pg should exist (hit when doing 'rbd ls'):

2011-06-01 11:00:29.000224 7f2eb8984700 -- 192.168.100.7:6801/13315 <== client1721498 192.168.100.7:0/1013358 1 ==== osd_op(client1721498.0:1 rbd_directory [read 0~0] 3.8c1c) v2 ==== 128+0+0 (3850352068 0 0) 0x1a156c0 con 0x1a04000
2011-06-01 11:00:29.000255 7f2eb8984700 osd0 26 _dispatch 0x1a156c0 osd_op(client1721498.0:1 rbd_directory [read 0~0] 3.8c1c) v2
2011-06-01 11:00:29.000273 7f2eb8984700 osd0 26 require_same_or_newer_map 26 (i am 26) 0x1a156c0
2011-06-01 11:00:29.000290 7f2eb8984700 osd0 26 _share_map_incoming client1721498 192.168.100.7:0/1013358 26
2011-06-01 11:00:29.000305 7f2eb8984700 osd0 26 hit non-existent pg 3.1c, waiting
2011-06-01 11:00:29.217670 7f2ebf191700 osd0 26 tick
2011-06-01 11:00:29.217731 7f2ebf191700 osd0 26 scrub_should_schedule loadavg 0.08 < max 0.5 = yes
2011-06-01 11:00:29.217745 7f2ebf191700 osd0 26 sched_scrub
2011-06-01 11:00:29.217753 7f2ebf191700 osd0 26 sched_scrub done
2011-06-01 11:00:29.495798 7f2eb3c76700 osd0 26 OSD::ms_verify_authorizer name=client.admin auid=18446744073709551615
2011-06-01 11:00:29.495834 7f2eb3c76700 osd0 26  new session 0x19db480 con=0x1a04500 addr=192.168.100.7:0/1011731
2011-06-01 11:00:29.496126 7f2eb8984700 -- 192.168.100.7:6801/13315 <== client1721306 192.168.100.7:0/1011731 1 ==== osd_op(client1721306.0:1 volume-00000053.rbd [stat 0~0] 3.9da6 RETRY) v2 ==== 134+0+0 (2181691772 0 0) 0x1a15900 con 0x1a04500
2011-06-01 11:00:29.496150 7f2eb8984700 osd0 26 _dispatch 0x1a15900 osd_op(client1721306.0:1 volume-00000053.rbd [stat 0~0] 3.9da6 RETRY) v2
2011-06-01 11:00:29.496163 7f2eb8984700 osd0 26 require_same_or_newer_map 26 (i am 26) 0x1a15900
2011-06-01 11:00:29.496176 7f2eb8984700 osd0 26 _share_map_incoming client1721306 192.168.100.7:0/1011731 26
2011-06-01 11:00:29.496194 7f2eb8984700 osd0 26 hit non-existent pg 3.26, waiting
2011-06-01 11:00:29.718199 7f2eb487b700 osd0 26 update_osd_stat osd_stat(15209 MB used, 131 GB avail, 146 GB total, peers []/[])
2011-06-01 11:00:29.718231 7f2eb487b700 osd0 26 heartbeat: stat(2011-06-01 11:00:29.718146 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2011-06-01 11:00:29.718240 7f2eb487b700 osd0 26 heartbeat: osd_stat(15209 MB used, 131 GB avail, 146 GB total, peers []/[])

#2 Updated by Sage Weil over 9 years ago

  • Target version set to v0.30

Yoshi, can you attach 'ceph osd dump -o - 26', 'ceph osd dump -o -', and 'ceph pg dump -o -' outputs?

#3 Updated by Josh Durgin over 9 years ago

  • Status changed from New to 4

Hi Yoshi, I put a fix in the filestore_debugging branch, which will generate a new package in about 1/2 an hour.

Let me know if it works for you. I was able to reproduce and fix this on a local xfs filesystem.

#4 Updated by Josh Durgin over 9 years ago

  • Assignee set to Josh Durgin

#5 Updated by Yoshi Tamura over 9 years ago

Hi Josh, Fix confirmed. Please commit the fix to the upstream.

#6 Updated by Sage Weil over 9 years ago

  • Status changed from 4 to Resolved

Also available in: Atom PDF