Bug #7744 (closed): osd: assert(last_e.version.version < e.version.version)

Added by Kevinsky Dy about 10 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags: Dumpling
Severity: 2 - major

Description

I currently have 2 OSDs that won't start, and this is preventing my
cluster from running my VMs.
My cluster is running:

  ceph -v
  ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)

My OSDs are on different hosts, as per the default CRUSH map rules.
The logs from the problem OSDs are as follows:

osd.1:
-10> 2014-03-16 11:51:25.183093 7f50923e4780 20 read_log 12794'1377363 (12794'1377357) modify 5ee8f77d/rb.0.1f7b.238e1f29.0000000118fb/head//2 by client.1077470.0:376704 2014-03-15 12:37:43.175319
-9> 2014-03-16 11:51:25.183122 7f50923e4780 20 read_log 12794'1377364 (12794'1377362) modify af8f477d/rb.0.24f4.238e1f29.00000000e378/head//2 by client.1074837.0:1295631 2014-03-15 12:38:03.613604
-8> 2014-03-16 11:51:25.183146 7f50923e4780 20 read_log 12794'1377365 (12794'1377348) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1396709 2014-03-15 12:38:33.720354
-7> 2014-03-16 11:51:25.183179 7f50923e4780 20 read_log 12794'1377366 (12794'1377365) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1396712 2014-03-15 12:38:33.726419
-6> 2014-03-16 11:51:25.183207 7f50923e4780 20 read_log 12794'1377367 (12794'1377366) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1397305 2014-03-15 12:39:09.863260
-5> 2014-03-16 11:51:25.183231 7f50923e4780 20 read_log 12794'1377368 (12794'1377367) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1398903 2014-03-15 12:40:13.096258
-4> 2014-03-16 11:51:25.183258 7f50923e4780 20 read_log 12794'1377369 (12794'1377363) modify 5ee8f77d/rb.0.1f7b.238e1f29.0000000118fb/head//2 by client.1077470.0:377159 2014-03-15 12:40:13.105469
-3> 2014-03-16 11:51:25.183282 7f50923e4780 20 read_log 12794'1377370 (12794'1377360) modify 19463f7d/rb.0.ecb29.238e1f29.000000000101/head//2 by client.1058212.1:358750 2014-03-15 12:40:24.998076
-2> 2014-03-16 11:51:25.183309 7f50923e4780 20 read_log 12794'1377371 (12794'1377368) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1399253 2014-03-15 12:40:28.134624
-1> 2014-03-16 11:51:25.183333 7f50923e4780 20 read_log 13740'1377371 (12605'1364064) modify 94c6137d/rb.0.d07d.2ae8944a.000000002524/head//2 by client.1088028.0:10498 2014-03-16 00:06:12.643968
0> 2014-03-16 11:51:25.185685 7f50923e4780 -1 osd/PGLog.cc: In function 'static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7f50923e4780 time 2014-03-16 11:51:25.183350
osd/PGLog.cc: 677: FAILED assert(last_e.version.version < e.version.version)

osd.2:
-10> 2014-03-16 11:28:45.015366 7fe5a7539780 20 read_log 12794'1377363 (12794'1377357) modify 5ee8f77d/rb.0.1f7b.238e1f29.0000000118fb/head//2 by client.1077470.0:376704 2014-03-15 12:37:43.175319
-9> 2014-03-16 11:28:45.015381 7fe5a7539780 20 read_log 12794'1377364 (12794'1377362) modify af8f477d/rb.0.24f4.238e1f29.00000000e378/head//2 by client.1074837.0:1295631 2014-03-15 12:38:03.613604
-8> 2014-03-16 11:28:45.015394 7fe5a7539780 20 read_log 12794'1377365 (12794'1377348) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1396709 2014-03-15 12:38:33.720354
-7> 2014-03-16 11:28:45.015405 7fe5a7539780 20 read_log 12794'1377366 (12794'1377365) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1396712 2014-03-15 12:38:33.726419
-6> 2014-03-16 11:28:45.015418 7fe5a7539780 20 read_log 12794'1377367 (12794'1377366) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1397305 2014-03-15 12:39:09.863260
-5> 2014-03-16 11:28:45.015428 7fe5a7539780 20 read_log 12794'1377368 (12794'1377367) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1398903 2014-03-15 12:40:13.096258
-4> 2014-03-16 11:28:45.015441 7fe5a7539780 20 read_log 12794'1377369 (12794'1377363) modify 5ee8f77d/rb.0.1f7b.238e1f29.0000000118fb/head//2 by client.1077470.0:377159 2014-03-15 12:40:13.105469
-3> 2014-03-16 11:28:45.015452 7fe5a7539780 20 read_log 12794'1377370 (12794'1377360) modify 19463f7d/rb.0.ecb29.238e1f29.000000000101/head//2 by client.1058212.1:358750 2014-03-15 12:40:24.998076
-2> 2014-03-16 11:28:45.015464 7fe5a7539780 20 read_log 12794'1377371 (12794'1377368) modify cbb2fb7d/rb.0.2355.2ae8944a.00000000849e/head//2 by client.1077379.0:1399253 2014-03-15 12:40:28.134624
-1> 2014-03-16 11:28:45.015475 7fe5a7539780 20 read_log 13740'1377371 (12605'1364064) modify 94c6137d/rb.0.d07d.2ae8944a.000000002524/head//2 by client.1088028.0:10498 2014-03-16 00:06:12.643968
0> 2014-03-16 11:28:45.016656 7fe5a7539780 -1 osd/PGLog.cc: In function 'static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7fe5a7539780 time 2014-03-16 11:28:45.015497
osd/PGLog.cc: 677: FAILED assert(last_e.version.version < e.version.version)

It seems that the latest version of the object taken from the omap is not
newer than the ones already represented in the log, and this failed assert
is preventing those 2 OSDs from starting. In the traces above, the last two
entries carry the same version counter, 1377371, in different epochs
(12794'1377371 followed by 13740'1377371), which trips the strictly-increasing
check.

Here's a link to the code.
https://github.com/ceph/ceph/blob/dumpling/src/osd/PGLog.cc#L667
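
To make the check concrete, here is a minimal, self-contained sketch of the invariant that read_log enforces; the struct names are simplified stand-ins rather than the real Ceph definitions, and the sample entries are taken from the log tail above:

  #include <cassert>
  #include <cstdint>
  #include <vector>

  // Simplified stand-ins for Ceph's eversion_t / pg_log_entry_t; the real
  // types live in src/osd/osd_types.h.
  struct eversion_t {
    uint64_t epoch;    // the part before the apostrophe, e.g. 13740
    uint64_t version;  // the part after it, e.g. 1377371
  };

  struct pg_log_entry_t {
    eversion_t version;
  };

  // The invariant read_log asserts while replaying the on-disk PG log:
  // each entry's version counter must be strictly greater than the
  // previous one's, regardless of epoch.
  void check_log(const std::vector<pg_log_entry_t>& log) {
    for (size_t i = 1; i < log.size(); ++i) {
      const pg_log_entry_t& last_e = log[i - 1];
      const pg_log_entry_t& e = log[i];
      assert(last_e.version.version < e.version.version);
    }
  }

  int main() {
    // The tail of the log above: 12794'1377371 is followed by 13740'1377371.
    // The version counters are equal, so the assert fires even though the
    // epochs differ.
    std::vector<pg_log_entry_t> log = {
      {{12794, 1377370}},
      {{12794, 1377371}},
      {{13740, 1377371}},
    };
    check_log(log);  // aborts here, mirroring the OSD crash
    return 0;
  }

Note that only the counter after the apostrophe is compared; the epoch before it plays no role in this particular check.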


Files

ceph.log.1.gz (721 KB) - Kevinsky Dy, 03/16/2014 12:54 PM
ceph-osd.37.log.1.gz (247 KB) - One of the affected OSDs. Kevinsky Dy, 03/16/2014 12:54 PM
ceph-osd.67.log.1.gz (1.08 MB) - One of the affected OSDs. Kevinsky Dy, 03/16/2014 12:54 PM
ceph.log.gz (207 KB) - Jake Young, 03/29/2014 04:18 PM
ceph-osd.8.log.gz (54.6 MB) - Jake Young, 03/29/2014 04:18 PM
ceph-osd.10.log.gz (52.5 MB) - Jake Young, 03/29/2014 04:18 PM
Update #1

Updated by Sage Weil about 10 years ago

  • Priority changed from Normal to High

Update #2

Updated by Jake Young about 10 years ago

Two of my osds are crashing with the same signature:

osd/PGLog.cc: 672: FAILED assert(last_e.version.version < e.version.version)

I turned on "debug osd = 20" and restarted the osds again.
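
For reference, this logging level can be set persistently in ceph.conf or injected into a running daemon; the OSD id used below is only an example:

  # ceph.conf - persistent, takes effect when the OSD restarts
  [osd]
      debug osd = 20

  # or, injected into a running OSD without a restart
  ceph tell osd.8 injectargs '--debug-osd 20'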

I had a hard power failure yesterday on this single-node lab Ceph cluster. The system has a BBU RAID card, so I'm surprised to find corrupt XFS file systems.

Osd 10 was actually working after the hard crash.

Osd 8's XFS file system could not be mounted. I booted off a recovery CD and ran xfs_restore on it. After this, Osd 8 came up.

Things were looking good for a little while until Osd 8 and Osd 10 started crashing after I did a "stop ceph-all" while the cluster was recovering (it wasn't rebalancing; I had run "ceph osd set noout" while I tried to recover some other disks).

Hopefully these logs help. Let me know if there's anything else you'd like to know.

ceph@ceph1:/tmp$ ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Update #3

Updated by Ian Colle about 10 years ago

  • Project changed from devops to Ceph
Update #4

Updated by Sage Weil about 10 years ago

  • Subject changed from OSD fails to start due a failure to assert(last_e.version.version < e.version.version) to osd: assert(last_e.version.version < e.version.version)
  • Status changed from New to 12
  • Assignee deleted (Sage Weil)
Update #5

Updated by Brian Cline almost 10 years ago

I'm getting this as well on a single OSD (osd.3) that mounts fine but does not start. It's the same precise assertion failure, and it always happens when it reaches the same PG in the log (7.35a).

That same PG has been stuck in active+scrubbing for the last week or so. Restarting the OSD would reset its status, but it'd get stuck again in subsequent attempts.

The pool for that particular PG was single-replica (it was a test that required a lot of space). Last night I decided I was finished with the test rbd image in that pool (the only thing in the pool), so I ran rbd rm pool/myimage.

I let it run for about 6 hours, eventually reaching 40% completion, before I was unexpectedly disconnected from my network. After reconnecting and re-running the same command, it quickly counted back up to 40%, but then got stuck there and never moved again.

After checking ceph health and seeing that 7.35a was still stuck in active+scrubbing, I stopped the rbd rm, set noout, and restarted osd.3. At that point I started seeing this error, and osd.3 would no longer boot.

Hope this helps. If, in my case, I can simply remove the data dirs for 7.35a, let me know. As I said, pool 7 was a single-replica test pool that's no longer needed, so I'm not concerned about losing it; I'm just concerned about osd.3's integrity at this point.

Using same version as above, 0.72.2 (Emperor).

Update #6

Updated by Sage Weil almost 10 years ago

  • Status changed from 12 to Need More Info

We need a full log to see how this happens on Dumpling.

Brian, to work around this and get your OSD up, you need to run this branch: wip-dumpling-log-assert.
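
The branch itself is not quoted in this ticket, so the following is only a guess at the general shape of such a workaround (hypothetical code, not the actual wip-dumpling-log-assert change): downgrade the hard assert in read_log to a warning so the OSD can start and the damaged log can be dealt with afterwards.

  #include <cstdint>
  #include <iostream>

  // Hypothetical relaxation, reusing the simplified eversion_t from the
  // sketch in the description; NOT the real wip-dumpling-log-assert code.
  struct eversion_t { uint64_t epoch; uint64_t version; };

  // Warn and keep going instead of aborting when an entry's version
  // counter fails to increase.
  bool check_entry(const eversion_t& last_e, const eversion_t& e) {
    if (last_e.version < e.version)
      return true;
    std::cerr << "read_log: expected version > " << last_e.epoch << "'"
              << last_e.version << ", got " << e.epoch << "'" << e.version
              << "; tolerating out-of-order entry\n";
    return false;
  }

  int main() {
    eversion_t last_e{12794, 1377371}, e{13740, 1377371};
    check_entry(last_e, e);  // prints a warning instead of asserting
    return 0;
  }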

Update #7

Updated by Sage Weil over 9 years ago

  • Status changed from Need More Info to Can't reproduce
