Actions
Bug #21142
closedOSD crashes when loading pgs with "FAILED assert(interval.last > last)"
Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
after upgrading to luminous 12.1.4 rc
we saw several osds crashing with below logs.
the cluster was unhealthy when upgraded due to high memory consumption. and was subject to a lot of oom kills before the upgrade.
the upgrade did solve the memory issue. but introduced this bug.
the pg seams empty, and probably its there because there was so many osd flapping.
logs:
-34> 2017-08-26 00:38:00.505114 7f14556b4d00 0 osd.299 1085665 load_pgs opened 455 pgs -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e needs 1050342-1085230 -32> 2017-08-26 00:38:00.505814 7f14556b4d00 1 osd.299 1085665 build_past_intervals_parallel over 1050342-1085230 -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 - loading and decoding 0x7f14b3dfb0c0 -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 1050342 780781 bytes -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80] up [80], same_interval_since = 1050342 -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050343 -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 - loading and decoding 0x7f14b3dfad80 -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 1050343 781371 bytes -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050344 -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 - loading and decoding 0x7f14b3dfb740 -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 1050344 782446 bytes -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050345 -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 - loading and decoding 0x7f14b3dfba80 -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 1050345 782071 bytes -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050346 -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 - loading and decoding 0x7f14b79c4000 -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 1050346 781955 bytes -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050347 -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 - loading and decoding 0x7f14b79c4340 -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 1050347 781930 bytes -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050348 -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 - loading and decoding 0x7f14b79c4680 -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 1050348 784883 bytes -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050349 -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 - loading and decoding 0x7f14b79c49c0 -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 1050349 785081 bytes -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050350 -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 - loading and decoding 0x7f14b79c4d00 -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 1050350 785948 bytes -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050351 -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 - loading and decoding 0x7f14b79c5040 -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 1050351 788650 bytes 0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 00:38:00.532119 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last) ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f145612f420] 2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2] 3: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0] 4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f] 5: (OSD::load_pgs()+0x503) [0x7f1455bbef13] 6: (OSD::init()+0x2179) [0x7f1455bd7779] 7: (main()+0x2def) [0x7f1455add56f] 8: (__libc_start_main()+0xf5) [0x7f1451d14b35] 9: (()+0x4ac8a6) [0x7f1455b7b8a6] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 20/20 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.299.log --- end dump of recent events --- 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) ** in thread 7f14556b4d00 thread_name:ceph-osd ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc) 1: (()+0xa21a01) [0x7f14560f0a01] 2: (()+0xf370) [0x7f1452cfe370] 3: (gsignal()+0x37) [0x7f1451d281d7] 4: (abort()+0x148) [0x7f1451d298c8] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594] 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2] 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0] 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f] 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13] 10: (OSD::init()+0x2179) [0x7f1455bd7779] 11: (main()+0x2def) [0x7f1455add56f] 12: (__libc_start_main()+0xf5) [0x7f1451d14b35] 13: (()+0x4ac8a6) [0x7f1455b7b8a6] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- 0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) ** in thread 7f14556b4d00 thread_name:ceph-osd ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc) 1: (()+0xa21a01) [0x7f14560f0a01] 2: (()+0xf370) [0x7f1452cfe370] 3: (gsignal()+0x37) [0x7f1451d281d7] 4: (abort()+0x148) [0x7f1451d298c8] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594] 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2] 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0] 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f] 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13] 10: (OSD::init()+0x2179) [0x7f1455bd7779] 11: (main()+0x2def) [0x7f1455add56f] 12: (__libc_start_main()+0xf5) [0x7f1451d14b35] 13: (()+0x4ac8a6) [0x7f1455b7b8a6] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 20/20 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.299.log --- end dump of recent events ---
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid 19.f1e { "pgid": "19.f1e", "last_update": "0'0", "last_complete": "0'0", "log_tail": "0'0", "last_user_version": 0, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 1084817, "epoch_pool_created": 1084817, "last_epoch_started": 1085232, "last_interval_started": 1085230, "last_epoch_clean": 1050342, "last_interval_clean": 1050342, "last_epoch_split": 0, "last_epoch_marked_full": 1061015, "same_up_since": 1085230, "same_interval_since": 1085230, "same_primary_since": 1085230, "last_scrub": "960114'865853", "last_scrub_stamp": "2017-08-25 17:32:06.181006", "last_deep_scrub": "952725'861179", "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006", "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006" }, "stats": { "version": "0'0", "reported_seq": "424", "reported_epoch": "1085650", "state": "active+undersized+degraded", "last_fresh": "2017-08-25 18:52:46.520078", "last_change": "2017-08-25 18:38:16.356266", "last_active": "2017-08-25 18:52:46.520078", "last_peered": "2017-08-25 18:52:46.520078", "last_clean": "2017-08-25 17:32:06.181006", "last_became_active": "2017-08-25 18:38:16.356266", "last_became_peered": "2017-08-25 18:38:16.356266", "last_unstale": "2017-08-25 18:52:46.520078", "last_undegraded": "2017-08-25 18:38:16.304877", "last_fullsized": "2017-08-25 18:38:16.304877", "mapping_epoch": 1085230, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 1084817, "last_epoch_clean": 1050342, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "960114'865853", "last_scrub_stamp": "2017-08-25 17:32:06.181006", "last_deep_scrub": "952725'861179", "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006", "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_whiteouts": 0, "num_read": 0, "num_read_kb": 0, "num_write": 0, "num_write_kb": 0, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [ 299 ], "acting": [ 299 ], "blocked_by": [], "up_primary": 299, "acting_primary": 299 }, "empty": 1, "dne": 0, "incomplete": 0, "last_epoch_started": 1085232, "hit_set_history": { "current_last_update": "0'0", "history": [] } }
ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/ total 0 -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13 ceph pg 19.f1e query . . "blocked": "peering is blocked due to down osds", "down_osds_we_would_probe": [ 299 ], "peering_blocked_by": [ { "osd": 299, "current_lost_at": 0, "comment": "starting or marking this osd lost may let us proceed" } ] . . .
removing the pg with ceph-objectstore-tool did the trick. but i am not sure if that will happen to a pg with real data in it or not.
Files
Actions