Project

General

Profile

Bug #14213

osd/OSD.cc: 2940: FAILED assert(0 == "Missing map in load_pgs")

Added by Sage Weil about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-01-02T01:52:54.277 INFO:tasks.ceph.osd.2.smithi009.stderr:2016-01-02 04:52:54.278578 7fcf24bd27c0 -1 osd.2 370 load_pgs: have pgid 0.7f at epoch 255, but missing map.  Crashing.
2016-01-02T01:52:54.279 INFO:tasks.ceph.osd.2.smithi009.stderr:osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7fcf24bd27c0 time 2016-01-02 04:52:54.278586
2016-01-02T01:52:54.280 INFO:tasks.ceph.osd.2.smithi009.stderr:osd/OSD.cc: 2940: FAILED assert(0 == "Missing map in load_pgs")
2016-01-02T01:52:54.280 INFO:tasks.ceph.osd.2.smithi009.stderr: ceph version 10.0.1-1040-g6479fc8 (6479fc8a7c28c5eb3723fa2268d596d4588906aa)
2016-01-02T01:52:54.281 INFO:tasks.ceph.osd.2.smithi009.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fcf255e96a5]
2016-01-02T01:52:54.281 INFO:tasks.ceph.osd.2.smithi009.stderr: 2: (OSD::load_pgs()+0x1eed) [0x7fcf24f5e11d]
2016-01-02T01:52:54.281 INFO:tasks.ceph.osd.2.smithi009.stderr: 3: (OSD::init()+0xf56) [0x7fcf24f6d296]
2016-01-02T01:52:54.282 INFO:tasks.ceph.osd.2.smithi009.stderr: 4: (main()+0x2adf) [0x7fcf24ed54ff]
2016-01-02T01:52:54.282 INFO:tasks.ceph.osd.2.smithi009.stderr: 5: (__libc_start_main()+0xf5) [0x7fcf21d34af5]
2016-01-02T01:52:54.282 INFO:tasks.ceph.osd.2.smithi009.stderr: 6: (()+0x32f6c9) [0x7fcf24f1e6c9]

/a/sage-2016-01-01_13:26:03-rados-wip-sage-testing---basic-smithi/11334

Related issues

Related to Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) Resolved 12/05/2015

Associated revisions

Revision a0b0ae00 (diff)
Added by David Zafman about 8 years ago

Revert "osd: do not keep ref of old osdmap in pg"

This reverts commit 8978cd6fcf40076dbd944a2938b335d26c62b8d5.

Fixes: #14213

History

#1 Updated by Sage Weil about 8 years ago

2016-01-02T01:51:34.108 INFO:teuthology.orchestra.run.smithi009:Running: 'sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --journal-path /var/lib/ceph/osd/ceph-2/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op import --file /home/ubuntu/cephtest/data/exp.0.cf.2'
2016-01-02T01:51:34.157 INFO:teuthology.orchestra.run.smithi009.stderr:warning: line 35: 'osd_pool_default_size' in section 'global' redefined
2016-01-02T01:51:34.193 INFO:teuthology.orchestra.run.smithi009.stdout:Importing pgid 0.cf
2016-01-02T01:51:34.194 INFO:teuthology.orchestra.run.smithi009.stdout:Import successful
2016-01-02T01:51:34.197 INFO:teuthology.orchestra.run.smithi009:Running: 'rm -f /home/ubuntu/cephtest/data/exp.0.cf.2'

#2 Updated by David Zafman about 8 years ago

I can't find the original wip-sage-tesitng branch starting at 6479fc8a7c28c5eb3723fa2268d596d4588906aa

The crash involved pgid 0.7f on osd.2, but the only import on osd.2 was 0.cf and it happened as an export and immediate import, which is the safest import scenario.

I'm not clear on how maps are pruned, but it looks to me like mon.1 told osd.1 to remove what it thought were old maps and osd.1 told osd.2 to do the same. Eventually, a load_pgs() wanted map 255 and it wasn't there anymore.

osd.1
2016-01-02 04:48:55.490618 7fdd328d7700 1 -- 172.21.15.9:6812/17916 <== mon.1 172.21.15.34:6789/0 102 ==== osd_map(349..350 src has 264..350) v3 ==== 560+0+0 (2862957661 0 0) 0x7fdd51fadf80 con 0x7fdd526e2e80
2016-01-02 04:48:55.490641 7fdd328d7700 10 osd.1 349 do_waiters -- start
2016-01-02 04:48:55.490643 7fdd328d7700 10 osd.1 349 do_waiters -- finish
2016-01-02 04:48:55.490645 7fdd328d7700 20 osd.1 349 _dispatch 0x7fdd51fadf80 osd_map(349..350 src has 264..350) v3
2016-01-02 04:48:55.490667 7fdd328d7700 3 osd.1 349 handle_osd_map epochs [349,350], i have 349, src has [264,350]
2016-01-02 04:48:55.490671 7fdd328d7700 10 osd.1 349 handle_osd_map got inc map for epoch 350
2016-01-02 04:48:55.490792 7fdd328d7700 20 osd.1 349 got_full_map 350, nothing requested
2016-01-02 04:48:55.490840 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 241
2016-01-02 04:48:55.490847 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 242
2016-01-02 04:48:55.490852 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 243
2016-01-02 04:48:55.490855 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 244
2016-01-02 04:48:55.490862 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 245
2016-01-02 04:48:55.490866 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 246
2016-01-02 04:48:55.490869 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 247
2016-01-02 04:48:55.490873 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 248
2016-01-02 04:48:55.490877 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 249
2016-01-02 04:48:55.490881 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 250
2016-01-02 04:48:55.490884 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 251
2016-01-02 04:48:55.490898 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 252
2016-01-02 04:48:55.490903 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 253
2016-01-02 04:48:55.490907 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 254
2016-01-02 04:48:55.490910 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 255
2016-01-02 04:48:55.490914 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 256
2016-01-02 04:48:55.490922 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 257
2016-01-02 04:48:55.490926 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 258
2016-01-02 04:48:55.490930 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 259
2016-01-02 04:48:55.490933 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 260
2016-01-02 04:48:55.490940 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 261
2016-01-02 04:48:55.490943 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 262
2016-01-02 04:48:55.490947 7fdd328d7700 20 osd.1 349 removing old osdmap epoch 263
2016-01-02 04:48:55.490951 7fdd328d7700 10 osd.1 349 advance to epoch 350 (<= newest 350)
2016-01-02 04:48:55.490955 7fdd328d7700 7 osd.1 350 advance_map epoch 350
2016-01-02 04:48:55.490959 7fdd328d7700 10 osd.1 350 write_superblock sb(8a7b4c6c-6dbb-4d98-ae0a-f9898086181a osd.1 bf9515a1-77d2-4d6c-b637-05cf7863143a e350 [264,350] lci=[316,350])

osd.2
2016-01-02 04:48:55.492251 7fe87e451700 1 -- 172.21.15.9:6805/25874 <== osd.1 172.21.15.9:6813/17916 1232 ==== osd_map(350..350 src has 264..350) v3 ==== 238+0+0 (384319974 0 0) 0x7fe89faf81c0 con 0x7fe89ed51480
2016-01-02 04:48:55.492288 7fe87e451700 10 osd.2 349 do_waiters -- start
2016-01-02 04:48:55.492291 7fe87e451700 10 osd.2 349 do_waiters -- finish
2016-01-02 04:48:55.492292 7fe87e451700 20 osd.2 349 _dispatch 0x7fe89faf81c0 osd_map(350..350 src has 264..350) v3
2016-01-02 04:48:55.492316 7fe87e451700 3 osd.2 349 handle_osd_map epochs [350,350], i have 349, src has [264,350]
2016-01-02 04:48:55.492319 7fe87e451700 10 osd.2 349 handle_osd_map got inc map for epoch 350
2016-01-02 04:48:55.492459 7fe87e451700 20 osd.2 349 got_full_map 350, nothing requested
2016-01-02 04:48:55.492511 7fe87e451700 20 osd.2 349 removing old osdmap epoch 241
2016-01-02 04:48:55.492520 7fe87e451700 20 osd.2 349 removing old osdmap epoch 242
2016-01-02 04:48:55.492533 7fe87e451700 20 osd.2 349 removing old osdmap epoch 243
2016-01-02 04:48:55.492538 7fe87e451700 20 osd.2 349 removing old osdmap epoch 244
2016-01-02 04:48:55.492543 7fe87e451700 20 osd.2 349 removing old osdmap epoch 245
2016-01-02 04:48:55.492548 7fe87e451700 20 osd.2 349 removing old osdmap epoch 246
2016-01-02 04:48:55.492553 7fe87e451700 20 osd.2 349 removing old osdmap epoch 247
2016-01-02 04:48:55.492557 7fe87e451700 20 osd.2 349 removing old osdmap epoch 248
2016-01-02 04:48:55.492561 7fe87e451700 20 osd.2 349 removing old osdmap epoch 249
2016-01-02 04:48:55.492566 7fe87e451700 20 osd.2 349 removing old osdmap epoch 250
2016-01-02 04:48:55.492570 7fe87e451700 20 osd.2 349 removing old osdmap epoch 251
2016-01-02 04:48:55.492575 7fe87e451700 20 osd.2 349 removing old osdmap epoch 252
2016-01-02 04:48:55.492580 7fe87e451700 20 osd.2 349 removing old osdmap epoch 253
2016-01-02 04:48:55.492584 7fe87e451700 20 osd.2 349 removing old osdmap epoch 254
2016-01-02 04:48:55.492589 7fe87e451700 20 osd.2 349 removing old osdmap epoch 255
2016-01-02 04:48:55.492594 7fe87e451700 20 osd.2 349 removing old osdmap epoch 256
2016-01-02 04:48:55.492613 7fe87e451700 20 osd.2 349 removing old osdmap epoch 257
2016-01-02 04:48:55.492618 7fe87e451700 20 osd.2 349 removing old osdmap epoch 258
2016-01-02 04:48:55.492623 7fe87e451700 20 osd.2 349 removing old osdmap epoch 259
2016-01-02 04:48:55.492627 7fe87e451700 20 osd.2 349 removing old osdmap epoch 260
2016-01-02 04:48:55.492632 7fe87e451700 20 osd.2 349 removing old osdmap epoch 261
2016-01-02 04:48:55.492636 7fe87e451700 20 osd.2 349 removing old osdmap epoch 262
2016-01-02 04:48:55.492641 7fe87e451700 20 osd.2 349 removing old osdmap epoch 263
2016-01-02 04:48:55.492646 7fe87e451700 10 osd.2 349 advance to epoch 350 (<= newest 350)
2016-01-02 04:48:55.492650 7fe87e451700 7 osd.2 350 advance_map epoch 350
2016-01-02 04:48:55.492655 7fe87e451700 10 osd.2 350 write_superblock sb(8a7b4c6c-6dbb-4d98-ae0a-f9898086181a osd.2 0eb2eec1-a1e9-4cf4-b264-a6290ad6301a e350 [264,350] lci=[3,350])
2016-01-02 04:48:55.492672 7fe87e451700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions existing 0x7fe89e52e580 osr(meta 0x7fe89e468190)
2016-01-02 04:48:55.492718 7fe87e451700 10 journal prepare_entry 0x7fe8a18d6540
2016-01-02 04:48:55.492827 7fe87e451700 10 journal len 12905 -> 16384 (head 40 pre_pad 0 bl 12905 post_pad 3399 tail 40) (bl alignment -1)
2016-01-02 04:48:55.492905 7fe87e451700 10 journal op_submit_start 12612
2016-01-02 04:48:55.492910 7fe87e451700 5 filestore(/var/lib/ceph/osd/ceph-2) queue_transactions (writeahead) 12612 0x7fe8a18d6540
2016-01-02 04:48:55.492913 7fe87e451700 10 journal op_journal_transactions 12612
2016-01-02 04:48:55.492914 7fe87e451700 5 journal submit_entry seq 12612 len 16384 (0x7fe89ee49810)
2016-01-02 04:48:55.492923 7fe87e451700 10 journal op_submit_finish 12612
2016-01-02 04:48:55.492932 7fe8875e4700 20 journal write_thread_entry woke up
2016-01-02 04:48:55.492938 7fe87e451700 7 osd.2 350 consume_map version 350

2016-01-02 04:48:55.499628 7fe8865e2700 10 filestore(/var/lib/ceph/osd/ceph-2) remove meta/-1/c67cdba2/inc_osdmap.254/0 = 0
2016-01-02 04:48:55.499636 7fe8865e2700 15 filestore(/var/lib/ceph/osd/ceph-2) remove meta/-1/ac96ce05/osdmap.255/0
2016-01-02 04:48:55.499670 7fe8865e2700 20 filestore(/var/lib/ceph/osd/ceph-2) lfn_unlink: clearing omap on -1/ac96ce05/osdmap.255/0 in cid meta

2016-01-02 04:52:54.278507 7fcf24bd27c0 20 osd.2 0 get_map 255 - loading and decoding 0x7fcf3067d680
2016-01-02 04:52:54.278512 7fcf24bd27c0 15 filestore(/var/lib/ceph/osd/ceph-2) read meta/-1/ac96ce05/osdmap.255/0 0~0
2016-01-02 04:52:54.278557 7fcf24bd27c0 10 filestore(/var/lib/ceph/osd/ceph-2) error opening file /var/lib/ceph/osd/ceph-2/current/meta/DIR_5/osdmap.255__0_AC96CE05__none with flags=2: (2) No such file or directory
2016-01-02 04:52:54.278567 7fcf24bd27c0 10 filestore(/var/lib/ceph/osd/ceph-2) FileStore::read(meta/-1/ac96ce05/osdmap.255/0) open error: (2) No such file or directory
2016-01-02 04:52:54.278578 7fcf24bd27c0 -1 osd.2 370 load_pgs: have pgid 0.7f at epoch 255, but missing map. Crashing.

#3 Updated by David Zafman about 8 years ago

  • Assignee changed from David Zafman to Sage Weil

Not sure what changes were being tested in branch at 6479fc8a7c28c5eb3723fa2268d596d4588906aa, so I'm assigning back to Sage.

#4 Updated by Sage Weil about 8 years ago

/a/sage-2016-01-28_12:50:51-rados-wip-sage-testing3---basic-mira/48261

this is master, plus somnaht's ObjectStore::Transaction change. pretty sure it's not related to that.

#5 Updated by Kefu Chai about 8 years ago

/a/sage-2016-02-01_16:19:44-rados-wip-sage-testing2-distro-basic-mira/1453

#6 Updated by Sage Weil about 8 years ago

/a/sage-2016-02-01_16:19:44-rados-wip-sage-testing2-distro-basic-mira/1453

#7 Updated by Sage Weil about 8 years ago

  • Assignee changed from Sage Weil to David Zafman

/a/sage-2016-02-04_12:56:11-rados-wip-sage-testing---basic-smithi/6933

definitely in master

#8 Updated by David Zafman about 8 years ago

  • Related to Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added

#9 Updated by Kefu Chai about 8 years ago

quote from sam

we need to hold the reference to prevent the map from being trimmed. that reference is the map epoch which the backing store has persisted as the pg epoch

#10 Updated by David Zafman about 8 years ago

  • Status changed from New to Resolved

a0b0ae00f25706e702377a915188f7e9a316272f

Also available in: Atom PDF