Project

General

Profile

Actions

Bug #21142

closed

OSD crashes when loading pgs with "FAILED assert(interval.last > last)"

Added by Ali chips over 6 years ago. Updated over 4 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

after upgrading to luminous 12.1.4 rc
we saw several osds crashing with below logs.
the cluster was unhealthy when upgraded due to high memory consumption. and was subject to a lot of oom kills before the upgrade.
the upgrade did solve the memory issue. but introduced this bug.
the pg seams empty, and probably its there because there was so many osd flapping.

logs:

   -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665 load_pgs opened 455 pgs
   -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e needs 1050342-1085230
   -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665 build_past_intervals_parallel over 1050342-1085230
   -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342
   -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 - loading and decoding 0x7f14b3dfb0c0
   -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 1050342 780781 bytes
   -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80] up [80], same_interval_since = 1050342
   -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050343
   -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 - loading and decoding 0x7f14b3dfad80
   -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 1050343 781371 bytes
   -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050344
   -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 - loading and decoding 0x7f14b3dfb740
   -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 1050344 782446 bytes
   -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050345
   -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 - loading and decoding 0x7f14b3dfba80
   -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 1050345 782071 bytes
   -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050346
   -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 - loading and decoding 0x7f14b79c4000
   -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 1050346 781955 bytes
   -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050347
   -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 - loading and decoding 0x7f14b79c4340
   -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 1050347 781930 bytes
   -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050348
   -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 - loading and decoding 0x7f14b79c4680
   -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 1050348 784883 bytes
    -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050349
    -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 - loading and decoding 0x7f14b79c49c0
    -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 1050349 785081 bytes
    -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050350
    -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 - loading and decoding 0x7f14b79c4d00
    -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 1050350 785948 bytes
    -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050351
    -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 - loading and decoding 0x7f14b79c5040
    -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 1050351 788650 bytes
     0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 00:38:00.532119
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f145612f420]
 2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 3: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 6: (OSD::init()+0x2179) [0x7f1455bd7779]
 7: (main()+0x2def) [0x7f1455add56f]
 8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 9: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---
2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
 in thread 7f14556b4d00 thread_name:ceph-osd

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (()+0xa21a01) [0x7f14560f0a01]
 2: (()+0xf370) [0x7f1452cfe370]
 3: (gsignal()+0x37) [0x7f1451d281d7]
 4: (abort()+0x148) [0x7f1451d298c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 10: (OSD::init()+0x2179) [0x7f1455bd7779]
 11: (main()+0x2def) [0x7f1455add56f]
 12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 13: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
 in thread 7f14556b4d00 thread_name:ceph-osd

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (()+0xa21a01) [0x7f14560f0a01]
 2: (()+0xf370) [0x7f1452cfe370]
 3: (gsignal()+0x37) [0x7f1451d281d7]
 4: (abort()+0x148) [0x7f1451d298c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 10: (OSD::init()+0x2179) [0x7f1455bd7779]
 11: (main()+0x2def) [0x7f1455add56f]
 12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 13: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---


ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid 19.f1e
{
    "pgid": "19.f1e",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "last_backfill_bitwise": 0,
    "purged_snaps": [],
    "history": {
        "epoch_created": 1084817,
        "epoch_pool_created": 1084817,
        "last_epoch_started": 1085232,
        "last_interval_started": 1085230,
        "last_epoch_clean": 1050342,
        "last_interval_clean": 1050342,
        "last_epoch_split": 0,
        "last_epoch_marked_full": 1061015,
        "same_up_since": 1085230,
        "same_interval_since": 1085230,
        "same_primary_since": 1085230,
        "last_scrub": "960114'865853",
        "last_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_deep_scrub": "952725'861179",
        "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006" 
    },
    "stats": {
        "version": "0'0",
        "reported_seq": "424",
        "reported_epoch": "1085650",
        "state": "active+undersized+degraded",
        "last_fresh": "2017-08-25 18:52:46.520078",
        "last_change": "2017-08-25 18:38:16.356266",
        "last_active": "2017-08-25 18:52:46.520078",
        "last_peered": "2017-08-25 18:52:46.520078",
        "last_clean": "2017-08-25 17:32:06.181006",
        "last_became_active": "2017-08-25 18:38:16.356266",
        "last_became_peered": "2017-08-25 18:38:16.356266",
        "last_unstale": "2017-08-25 18:52:46.520078",
        "last_undegraded": "2017-08-25 18:38:16.304877",
        "last_fullsized": "2017-08-25 18:38:16.304877",
        "mapping_epoch": 1085230,
        "log_start": "0'0",
        "ondisk_log_start": "0'0",
        "created": 1084817,
        "last_epoch_clean": 1050342,
        "parent": "0.0",
        "parent_split_bits": 0,
        "last_scrub": "960114'865853",
        "last_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_deep_scrub": "952725'861179",
        "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
        "log_size": 0,
        "ondisk_log_size": 0,
        "stats_invalid": false,
        "dirty_stats_invalid": false,
        "omap_stats_invalid": false,
        "hitset_stats_invalid": false,
        "hitset_bytes_stats_invalid": false,
        "pin_stats_invalid": false,
        "stat_sum": {
            "num_bytes": 0,
            "num_objects": 0,
            "num_object_clones": 0,
            "num_object_copies": 0,
            "num_objects_missing_on_primary": 0,
            "num_objects_missing": 0,
            "num_objects_degraded": 0,
            "num_objects_misplaced": 0,
            "num_objects_unfound": 0,
            "num_objects_dirty": 0,
            "num_whiteouts": 0,
            "num_read": 0,
            "num_read_kb": 0,
            "num_write": 0,
            "num_write_kb": 0,
            "num_scrub_errors": 0,
            "num_shallow_scrub_errors": 0,
            "num_deep_scrub_errors": 0,
            "num_objects_recovered": 0,
            "num_bytes_recovered": 0,
            "num_keys_recovered": 0,
            "num_objects_omap": 0,
            "num_objects_hit_set_archive": 0,
            "num_bytes_hit_set_archive": 0,
            "num_flush": 0,
            "num_flush_kb": 0,
            "num_evict": 0,
            "num_evict_kb": 0,
            "num_promote": 0,
            "num_flush_mode_high": 0,
            "num_flush_mode_low": 0,
            "num_evict_mode_some": 0,
            "num_evict_mode_full": 0,
            "num_objects_pinned": 0,
            "num_legacy_snapsets": 0
        },
        "up": [
            299
        ],
        "acting": [
            299
        ],
        "blocked_by": [],
        "up_primary": 299,
        "acting_primary": 299
    },
    "empty": 1,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 1085232,
    "hit_set_history": {
        "current_last_update": "0'0",
        "history": []
    }
}

ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
total 0
-rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13

ceph pg 19.f1e query
.
.
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                299
            ],
            "peering_blocked_by": [
                {
                    "osd": 299,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed" 
                }
            ]
.
.
.

removing the pg with ceph-objectstore-tool did the trick. but i am not sure if that will happen to a pg with real data in it or not.


Files

strace.osd-2.log (14.5 KB) strace.osd-2.log Josef Zelenka, 01/11/2018 09:43 AM
osd-2.log (162 KB) osd-2.log Josef Zelenka, 01/11/2018 09:43 AM
coredump.strace-osd2.log (7.95 KB) coredump.strace-osd2.log Josef Zelenka, 01/11/2018 09:43 AM
osd-1.log (176 KB) osd-1.log Josef Zelenka, 01/11/2018 09:43 AM
trace_end (934 KB) trace_end Zdenek Janda, 01/11/2018 10:16 AM
osd_log_debug_osd_20.tar.gz (234 KB) osd_log_debug_osd_20.tar.gz Xiaoxi Chen, 06/21/2018 09:10 AM
osd.log.zip (75.3 KB) osd.log.zip Dexter John Genterone, 08/07/2018 07:24 PM

Related issues 2 (0 open2 closed)

Related to RADOS - Bug #24588: osd: may get empty info at recoveryResolved06/20/2018

Actions
Copied to RADOS - Backport #38256: luminous: OSD crashes when loading pgs with "FAILED assert(interval.last > last)"DuplicateActions
Actions

Also available in: Atom PDF