
Bug #21142

OSD crashes when loading pgs with "FAILED assert(interval.last > last)"

Added by Ali chips almost 2 years ago. Updated 5 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
Start date:
08/26/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

After upgrading to luminous 12.1.4 RC we saw several OSDs crashing with the logs below.
The cluster was unhealthy when upgraded due to high memory consumption, and had been subject to a lot of OOM kills before the upgrade.
The upgrade did solve the memory issue, but introduced this bug.
The PG seems empty; it is probably there because there was so much OSD flapping.

logs:

   -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665 load_pgs opened 455 pgs
   -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e needs 1050342-1085230
   -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665 build_past_intervals_parallel over 1050342-1085230
   -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342
   -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 - loading and decoding 0x7f14b3dfb0c0
   -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 1050342 780781 bytes
   -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80] up [80], same_interval_since = 1050342
   -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050343
   -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 - loading and decoding 0x7f14b3dfad80
   -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 1050343 781371 bytes
   -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050344
   -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 - loading and decoding 0x7f14b3dfb740
   -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 1050344 782446 bytes
   -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050345
   -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 - loading and decoding 0x7f14b3dfba80
   -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 1050345 782071 bytes
   -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050346
   -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 - loading and decoding 0x7f14b79c4000
   -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 1050346 781955 bytes
   -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050347
   -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 - loading and decoding 0x7f14b79c4340
   -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 1050347 781930 bytes
   -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050348
   -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 - loading and decoding 0x7f14b79c4680
   -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 1050348 784883 bytes
    -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050349
    -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 - loading and decoding 0x7f14b79c49c0
    -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 1050349 785081 bytes
    -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050350
    -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 - loading and decoding 0x7f14b79c4d00
    -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 1050350 785948 bytes
    -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 build_past_intervals_parallel epoch 1050351
    -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 - loading and decoding 0x7f14b79c5040
    -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 1050351 788650 bytes
     0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 00:38:00.532119
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f145612f420]
 2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 3: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 6: (OSD::init()+0x2179) [0x7f1455bd7779]
 7: (main()+0x2def) [0x7f1455add56f]
 8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 9: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---
2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
 in thread 7f14556b4d00 thread_name:ceph-osd

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (()+0xa21a01) [0x7f14560f0a01]
 2: (()+0xf370) [0x7f1452cfe370]
 3: (gsignal()+0x37) [0x7f1451d281d7]
 4: (abort()+0x148) [0x7f1451d298c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 10: (OSD::init()+0x2179) [0x7f1455bd7779]
 11: (main()+0x2def) [0x7f1455add56f]
 12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 13: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
 in thread 7f14556b4d00 thread_name:ceph-osd

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (()+0xa21a01) [0x7f14560f0a01]
 2: (()+0xf370) [0x7f1452cfe370]
 3: (gsignal()+0x37) [0x7f1451d281d7]
 4: (abort()+0x148) [0x7f1451d298c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7f145612f594]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x7f1455e030b2]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x7f1455de8ab0]
 8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
 9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
 10: (OSD::init()+0x2179) [0x7f1455bd7779]
 11: (main()+0x2def) [0x7f1455add56f]
 12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
 13: (()+0x4ac8a6) [0x7f1455b7b8a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---


ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid 19.f1e
{
    "pgid": "19.f1e",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "last_backfill_bitwise": 0,
    "purged_snaps": [],
    "history": {
        "epoch_created": 1084817,
        "epoch_pool_created": 1084817,
        "last_epoch_started": 1085232,
        "last_interval_started": 1085230,
        "last_epoch_clean": 1050342,
        "last_interval_clean": 1050342,
        "last_epoch_split": 0,
        "last_epoch_marked_full": 1061015,
        "same_up_since": 1085230,
        "same_interval_since": 1085230,
        "same_primary_since": 1085230,
        "last_scrub": "960114'865853",
        "last_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_deep_scrub": "952725'861179",
        "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006" 
    },
    "stats": {
        "version": "0'0",
        "reported_seq": "424",
        "reported_epoch": "1085650",
        "state": "active+undersized+degraded",
        "last_fresh": "2017-08-25 18:52:46.520078",
        "last_change": "2017-08-25 18:38:16.356266",
        "last_active": "2017-08-25 18:52:46.520078",
        "last_peered": "2017-08-25 18:52:46.520078",
        "last_clean": "2017-08-25 17:32:06.181006",
        "last_became_active": "2017-08-25 18:38:16.356266",
        "last_became_peered": "2017-08-25 18:38:16.356266",
        "last_unstale": "2017-08-25 18:52:46.520078",
        "last_undegraded": "2017-08-25 18:38:16.304877",
        "last_fullsized": "2017-08-25 18:38:16.304877",
        "mapping_epoch": 1085230,
        "log_start": "0'0",
        "ondisk_log_start": "0'0",
        "created": 1084817,
        "last_epoch_clean": 1050342,
        "parent": "0.0",
        "parent_split_bits": 0,
        "last_scrub": "960114'865853",
        "last_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_deep_scrub": "952725'861179",
        "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
        "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
        "log_size": 0,
        "ondisk_log_size": 0,
        "stats_invalid": false,
        "dirty_stats_invalid": false,
        "omap_stats_invalid": false,
        "hitset_stats_invalid": false,
        "hitset_bytes_stats_invalid": false,
        "pin_stats_invalid": false,
        "stat_sum": {
            "num_bytes": 0,
            "num_objects": 0,
            "num_object_clones": 0,
            "num_object_copies": 0,
            "num_objects_missing_on_primary": 0,
            "num_objects_missing": 0,
            "num_objects_degraded": 0,
            "num_objects_misplaced": 0,
            "num_objects_unfound": 0,
            "num_objects_dirty": 0,
            "num_whiteouts": 0,
            "num_read": 0,
            "num_read_kb": 0,
            "num_write": 0,
            "num_write_kb": 0,
            "num_scrub_errors": 0,
            "num_shallow_scrub_errors": 0,
            "num_deep_scrub_errors": 0,
            "num_objects_recovered": 0,
            "num_bytes_recovered": 0,
            "num_keys_recovered": 0,
            "num_objects_omap": 0,
            "num_objects_hit_set_archive": 0,
            "num_bytes_hit_set_archive": 0,
            "num_flush": 0,
            "num_flush_kb": 0,
            "num_evict": 0,
            "num_evict_kb": 0,
            "num_promote": 0,
            "num_flush_mode_high": 0,
            "num_flush_mode_low": 0,
            "num_evict_mode_some": 0,
            "num_evict_mode_full": 0,
            "num_objects_pinned": 0,
            "num_legacy_snapsets": 0
        },
        "up": [
            299
        ],
        "acting": [
            299
        ],
        "blocked_by": [],
        "up_primary": 299,
        "acting_primary": 299
    },
    "empty": 1,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 1085232,
    "hit_set_history": {
        "current_last_update": "0'0",
        "history": []
    }
}

ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
total 0
-rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13

ceph pg 19.f1e query
.
.
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                299
            ],
            "peering_blocked_by": [
                {
                    "osd": 299,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed" 
                }
            ]
.
.
.

Removing the PG with ceph-objectstore-tool did the trick, but I am not sure whether the same would work for a PG with real data in it.
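For readers unfamiliar with the assert: pi_compact_rep::add_interval requires each past interval appended to a PG's history to end strictly after the one before it. A toy Python model of just that invariant (the class and names here are ours for illustration; the real code is C++ in src/osd/osd_types.cc):

```python
class PastIntervalsSketch:
    """Toy model of the invariant behind the crash; not the real
    pi_compact_rep, which lives in src/osd/osd_types.cc."""

    def __init__(self):
        self.last = 0        # end epoch of the most recently added interval
        self.intervals = []

    def add_interval(self, first, last):
        # Mirrors the failing check: every new interval must end strictly
        # after the previous one ("FAILED assert(interval.last > last)").
        assert last > self.last, "interval.last > last"
        self.intervals.append((first, last))
        self.last = last


pi = PastIntervalsSketch()
pi.add_interval(1050342, 1050400)      # strictly increasing: fine
try:
    pi.add_interval(1050401, 1050350)  # ends at or before the previous: trips
except AssertionError:
    print("this is the condition that aborts the OSD during load_pgs")
```

So the crash means the on-disk history led the OSD to construct a past interval that does not extend beyond the one already recorded, i.e. the interval bookkeeping is inconsistent, not that the PG data itself is necessarily damaged.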

strace.osd-2.log View (14.5 KB) Josef Zelenka, 01/11/2018 09:43 AM

osd-2.log View (162 KB) Josef Zelenka, 01/11/2018 09:43 AM

coredump.strace-osd2.log View (7.95 KB) Josef Zelenka, 01/11/2018 09:43 AM

osd-1.log View (176 KB) Josef Zelenka, 01/11/2018 09:43 AM

trace_end (934 KB) Zdenek Janda, 01/11/2018 10:16 AM

osd_log_debug_osd_20.tar.gz (234 KB) Xiaoxi Chen, 06/21/2018 09:10 AM

osd.log.zip (75.3 KB) Dexter John Genterone, 08/07/2018 07:24 PM


Related issues

Related to RADOS - Bug #24588: osd: may get empty info at recovery Resolved 06/20/2018
Copied to RADOS - Backport #38256: luminous: OSD crashes when loading pgs with "FAILED assert(interval.last > last)" Need More Info

History

#1 Updated by Ali chips almost 2 years ago

I uploaded more logs and info files with ceph-post-file:
f27fb8a5-baae-4f04-8353-d3b2b314c61a

#2 Updated by John Spray over 1 year ago

  • Project changed from Ceph to RADOS
  • Subject changed from OSD crashs when loading pgs with "FAILED assert(interval.last > last)" to OSD crashes when loading pgs with "FAILED assert(interval.last > last)"
  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Component(RADOS) OSD added

#3 Updated by Josef Zelenka over 1 year ago

We are also affected by this bug. We are running luminous 12.2.2 on Ubuntu 16.04: a 3-node cluster, 8 HDDs per node, with the block DB on SSDs. Our cluster stores a lot of small images, hence a lot of objects. The cluster failed after rebooting one of the nodes, which caused OSDs on other nodes to go down and later be marked out of the cluster, which killed it. The OSDs can't be brought up due to the aforementioned FAILED assert(interval.last > last) bug; I'm attaching logs from the OSDs plus an strace of one of them. Please let me know what other info you may need. This is quite a critical bug because, as we can see, it can cause a complete unrecoverable failure.

#4 Updated by Josef Zelenka over 1 year ago

also adding our current ceph -s/ceph osd tree state:

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF 
-1       22.45805 root default                           
-2        7.46782     host ceph1                         
 0   hdd  0.90959         osd.0      up  1.00000 1.00000 
 1   hdd  0.93689         osd.1    down        0 1.00000 
 2   hdd  0.93689         osd.2    down        0 1.00000 
 3   hdd  0.93689         osd.3    down  1.00000 1.00000 
 4   hdd  0.93689         osd.4      up  1.00000 1.00000 
 5   hdd  0.93689         osd.5      up  1.00000 1.00000 
 6   hdd  0.93689         osd.6    down        0 1.00000 
 7   hdd  0.93689         osd.7      up  1.00000 1.00000 
-3        7.49512     host ceph2                         
 8   hdd  0.93689         osd.8    down        0 1.00000 
 9   hdd  0.93689         osd.9      up  1.00000 1.00000 
10   hdd  0.93689         osd.10     up  1.00000 1.00000 
11   hdd  0.93689         osd.11     up  1.00000 1.00000 
12   hdd  0.93689         osd.12     up  1.00000 1.00000 
13   hdd  0.93689         osd.13     up  1.00000 1.00000 
14   hdd  0.93689         osd.14     up  1.00000 1.00000 
23   hdd  0.93689         osd.23   down        0 1.00000 
-4        7.49512     host ceph3                         
15   hdd  0.93689         osd.15   down  1.00000 1.00000 
16   hdd  0.93689         osd.16     up  1.00000 1.00000 
17   hdd  0.93689         osd.17     up  1.00000 1.00000 
18   hdd  0.93689         osd.18   down        0 1.00000 
19   hdd  0.93689         osd.19   down        0 1.00000 
20   hdd  0.93689         osd.20   down  1.00000 1.00000 
21   hdd  0.93689         osd.21   down  1.00000 1.00000 
22   hdd  0.93689         osd.22   down  1.00000 1.00000 

  cluster:
    id:     63e1eb6b-570d-43f5-9930-1cc78ab88d7b
    health: HEALTH_WARN
            5 osds down
            1 pools have many more objects per pg than average
            8917535/181086414 objects misplaced (4.924%)
            112945/90543207 objects unfound (0.125%)
            Reduced data availability: 20736 pgs inactive, 13827 pgs down, 7 pgs peering, 896 pgs stale
            Degraded data redundancy: 39304150/181086414 objects degraded (21.705%), 26637 pgs unclean, 5874 pgs degraded, 4831 pgs undersized

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph3=up:active}, 2 up:standby
    osd: 24 osds: 12 up, 17 in; 10867 remapped pgs
    rgw: 2 daemons active

  data:
    pools:   14 pools, 53312 pgs
    objects: 88421k objects, 2565 GB
    usage:   8476 GB used, 3007 GB / 11484 GB avail
    pgs:     12.941% pgs unknown
             25.955% pgs not active
             39304150/181086414 objects degraded (21.705%)
             8917535/181086414 objects misplaced (4.924%)
             112945/90543207 objects unfound (0.125%)
             17246 active+clean
             12970 down
             9429  active+clean+remapped
             6899  unknown
             3828  active+undersized+degraded
             857   stale+down
             709   active+undersized+degraded+remapped+backfill_wait
             537   active+recovery_wait+degraded+remapped
             504   active+recovery_wait+degraded
             167   active+recovery_wait+undersized+degraded
             75    active+recovery_wait+undersized+degraded+remapped
             39    stale+active+undersized+degraded
             27    active+remapped+backfill_wait
             11    active+undersized+degraded+remapped
             5     peering
             3     remapped
             2     remapped+peering
             2     active+undersized+degraded+remapped+backfilling
             2     active+degraded+remapped+backfilling

  io:
    recovery: 976 kB/s, 32 objects/s

#5 Updated by Zdenek Janda over 1 year ago

Adding the last 10000 lines of an strace of an OSD affected by the bug.

The SIGABRT is generated right after:


     0.007869 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"..., 12288, 908493324288) = 12288
     0.004220 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"..., 12288, 908499615744) = 12288
     0.009143 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"..., 12288, 908500926464) = 12288
     0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"..., 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
/build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)

#6 Updated by Zdenek Janda over 1 year ago

I have succeeded in identifying the faulty PG:

 -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333
 -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1 osd.15 15340 build_past_intervals_parallel over 13939-15333
 -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939
 -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map 13939 - loading and decoding 0x55d39deefb80
 -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl 13939 27475 bytes
 -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting [21,9] up [21,9], same_interval_since = 13939
 -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13940
 -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map 13940 - loading and decoding 0x55d39deef800
 -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl 13940 27475 bytes
....
    -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 15087
    -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map 15087 - loading and decoding 0x55d3f9e7e700
    -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl 15087 11409 bytes
     0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1 /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
/build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)

Let's see what can be done about this PG.

#7 Updated by Zdenek Janda over 1 year ago

More info on the affected PG:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15/ --op info --pgid 12.62d
{
    "pgid": "12.62d",
    "last_update": "15105'34977",
    "last_complete": "15105'34977",
    "log_tail": "6328'33477",
    "last_user_version": 0,
    "last_backfill": "MIN",
    "last_backfill_bitwise": 1,
    "purged_snaps": [],
    "history": {
        "epoch_created": 132,
        "epoch_pool_created": 132,
        "last_epoch_started": 15334,
        "last_interval_started": 15333,
        "last_epoch_clean": 6662,
        "last_interval_clean": 6661,
        "last_epoch_split": 0,
        "last_epoch_marked_full": 0,
        "same_up_since": 15333,
        "same_interval_since": 15333,
        "same_primary_since": 15199,
        "last_scrub": "11919'34743",
        "last_scrub_stamp": "2018-01-02 22:32:44.598392",
        "last_deep_scrub": "11919'34743",
        "last_deep_scrub_stamp": "2018-01-02 22:32:44.598392",
        "last_clean_scrub_stamp": "2018-01-02 22:32:44.598392" 
    },
    "stats": {
        "version": "0'0",
        "reported_seq": "0",
        "reported_epoch": "0",
        "state": "unknown",
        "last_fresh": "0.000000",
        "last_change": "0.000000",
        "last_active": "0.000000",
        "last_peered": "0.000000",
        "last_clean": "0.000000",
        "last_became_active": "0.000000",
        "last_became_peered": "0.000000",
        "last_unstale": "0.000000",
        "last_undegraded": "0.000000",
        "last_fullsized": "0.000000",
        "mapping_epoch": 0,
        "log_start": "0'0",
        "ondisk_log_start": "0'0",
        "created": 0,
        "last_epoch_clean": 0,
        "parent": "0.0",
        "parent_split_bits": 0,
        "last_scrub": "0'0",
        "last_scrub_stamp": "0.000000",
        "last_deep_scrub": "0'0",
        "last_deep_scrub_stamp": "0.000000",
        "last_clean_scrub_stamp": "0.000000",
        "log_size": 0,
        "ondisk_log_size": 0,
        "stats_invalid": false,
        "dirty_stats_invalid": false,
        "omap_stats_invalid": false,
        "hitset_stats_invalid": false,
        "hitset_bytes_stats_invalid": false,
        "pin_stats_invalid": false,
        "stat_sum": {
            "num_bytes": 0,
            "num_objects": 0,
            "num_object_clones": 0,
            "num_object_copies": 0,
            "num_objects_missing_on_primary": 0,
            "num_objects_missing": 0,
            "num_objects_degraded": 0,
            "num_objects_misplaced": 0,
            "num_objects_unfound": 0,
            "num_objects_dirty": 0,
            "num_whiteouts": 0,
            "num_read": 0,
            "num_read_kb": 0,
            "num_write": 0,
            "num_write_kb": 0,
            "num_scrub_errors": 0,
            "num_shallow_scrub_errors": 0,
            "num_deep_scrub_errors": 0,
            "num_objects_recovered": 0,
            "num_bytes_recovered": 0,
            "num_keys_recovered": 0,
            "num_objects_omap": 0,
            "num_objects_hit_set_archive": 0,
            "num_bytes_hit_set_archive": 0,
            "num_flush": 0,
            "num_flush_kb": 0,
            "num_evict": 0,
            "num_evict_kb": 0,
            "num_promote": 0,
            "num_flush_mode_high": 0,
            "num_flush_mode_low": 0,
            "num_evict_mode_some": 0,
            "num_evict_mode_full": 0,
            "num_objects_pinned": 0,
            "num_legacy_snapsets": 0
        },
        "up": [],
        "acting": [],
        "blocked_by": [],
        "up_primary": -1,
        "acting_primary": -1
    },
    "empty": 0,
    "dne": 0,
    "incomplete": 1,
    "last_epoch_started": 15334,
    "hit_set_history": {
        "current_last_update": "0'0",
        "history": []
    }
}

#8 Updated by Michal Cila over 1 year ago

Also, several OSDs (as you can see in the ceph osd tree output) are getting dropped out of the CRUSH map. After putting them back in, they hold on for several minutes and then get dropped again.

#9 Updated by Zdenek Janda over 1 year ago

Recovery of non-starting OSDs in this case is as follows. Run the OSD with debug logging:

/usr/bin/ceph-osd -f --cluster ceph --id 18 --setuser ceph --setgroup ceph --debug_osd 10

Inspect the resulting log after it crashes; the end should usually look like this:
    -2> 2018-01-11 13:02:36.379141 7f09cd155e00 10 osd.19 15198 build_past_intervals_parallel epoch 15087
    -1> 2018-01-11 13:02:36.384966 7f09cd155e00 10 osd.19 0 add_map_bl 15087 11409 bytes
     0> 2018-01-11 13:02:36.391019 7f09cd155e00 -1 /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f09cd155e00 time 2018-01-11 13:02:36.385347

Inspect the log above that, at the beginning of build_past_intervals_parallel, to locate the faulty PG:
 -4205> 2018-01-11 13:02:25.544829 7f09cd155e00 10 osd.19 15198 1.429 needs 12988-15159
 -4204> 2018-01-11 13:02:25.551423 7f09cd155e00  1 osd.19 15198 build_past_intervals_parallel over 12988-15159
 -4203> 2018-01-11 13:02:25.551430 7f09cd155e00 10 osd.19 15198 build_past_intervals_parallel epoch 12988
 -4202> 2018-01-11 13:02:25.578353 7f09cd155e00 10 osd.19 0 add_map_bl 12988 39278 bytes
 -4201> 2018-01-11 13:02:25.578937 7f09cd155e00 10 osd.19 15198 build_past_intervals_parallel epoch 12988 pg 1.429 first map, acting [21,12] up [21,12], same_interval_since = 12988
 -4200> 2018-01-11 13:02:25.578947 7f09cd155e00 10 osd.19 15198 build_past_intervals_parallel epoch 12989

Remove the past intervals of the faulting PG; if more PGs are listed, remove the intervals of all of them:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19/ --op rm-past-intervals --pgid 1.429

Remove the log file (often many GB at debug 10) and restart the OSD daemon; it should come up successfully, fixing the issue for that single OSD. Repeat for all affected OSDs and the cluster is OK again.
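The log spelunking in the steps above can be partly automated. A rough sketch, assuming the 12.2.x log format shown in this ticket (the `<pgid> needs <first>-<last>` line that opens each build_past_intervals_parallel run); `faulty_pg_candidates` and `NEEDS_RE` are our names, not part of Ceph:

```python
import re

# Match lines like:
#   ... osd.19 15198 1.429 needs 12988-15159
# which precede each build_past_intervals_parallel run in the debug log.
NEEDS_RE = re.compile(r"osd\.\d+ \d+ (\d+\.[0-9a-f]+) needs (\d+)-(\d+)")

def faulty_pg_candidates(log_lines):
    """Return [(pgid, first, last), ...] for every 'needs' line seen."""
    hits = []
    for line in log_lines:
        m = NEEDS_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return hits

sample = [
    " -4205> 2018-01-11 13:02:25.544829 7f09cd155e00 10 osd.19 15198 "
    "1.429 needs 12988-15159",
]
print(faulty_pg_candidates(sample))  # [('1.429', 12988, 15159)]
```

The PG reported closest to the crash is the one to feed to ceph-objectstore-tool --op rm-past-intervals as described above; treat this as a helper sketch, not a supported tool.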

#10 Updated by Tim Niemueller about 1 year ago

We frequently experience this with 12.2.3 running Ceph in a Kubernetes cluster, cf. https://github.com/ceph/ceph-container/issues/1040.

Is there anything to make the cluster robust against nodes dropping out for limited (but somewhat prolonged) periods?

#11 Updated by Dexter John Genterone about 1 year ago

We are experiencing this too. The majority of the OSDs went down. We tried removing the intervals; it works on some OSDs but fails on others. Sometimes the failing OSDs also affect the running ones. We are on 12.2.0 running on Kubernetes.

#12 Updated by Sage Weil about 1 year ago

Tim, Dexter, is this something that is reproducible in your environment? I haven't seen this one, which makes me very concerned that multiple users are hitting it in the field. With osd logs I think it shouldn't be hard to track down.

#13 Updated by Sage Weil about 1 year ago

  • Status changed from Triaged to Need More Info
  • Priority changed from Normal to High

#14 Updated by Xiaoxi Chen about 1 year ago

We also hit this today, and happen to have an OSD log with --debug_osd=20.

FWIW, the cluster has an inconsistent PG, and there was one host failure an hour prior to hitting this issue. We fixed the failed host and the cluster was recovering/backfilling.

It looks like https://github.com/ceph/ceph/commit/ac9e2f76093ef00238a1a8a0266a95c5ca2468b2 can be backported without too much effort; will it solve the issue?

#15 Updated by Dexter John Genterone about 1 year ago

Hi Sage,

We've experienced this again on a new environment we set up. I took a snippet of the logs; hope it's enough: https://gist.github.com/darkcrux/86908d35e843f01d908a2397c6da7605. We will gather more logs if we can reproduce the other issues we've faced relating to this.

The OSDs seem to fail due to wrong past intervals failing the assert. We've noticed that this issue sometimes occurs when there are clock skews. Even after syncing the time, the OSDs still won't come up until we remove the past intervals. Removing the past intervals works in some cases; in others it doesn't, and one time an OSD failed to boot on another failed assert (log tail?). We had to mark those PGs as complete for it to boot, but even that didn't always work. Unfortunately we didn't catch the logs for it.

We also notice this happens when nodes go down for a long time. When the nodes come back up, some of their OSDs fail, and even OSDs on other nodes start to fail the assertion.

#16 Updated by Dexter John Genterone about 1 year ago

Uploaded a few more logs (debug 20) here: https://storage.googleapis.com/ceph-logs/ceph-osd-logs.tar.gz

After running the workaround provided and restarting the OSDs, the past intervals of other PGs on the failing OSDs became invalid too.

#17 Updated by Sage Weil about 1 year ago

Dexter John Genterone wrote:

Uploaded a few more logs (debug 20) here: https://storage.googleapis.com/ceph-logs/ceph-osd-logs.tar.gz

After running the workaround provided and restarting the OSDs, then past intervals of other PGs on failing OSDs became invalid too.

Any chance you can gdb one of the core files from a crashing OSD to identify which PG it is asserting on? And perhaps 'p *this' on the pi_compact_rep?
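For anyone with a core file on hand, the inspection requested here can be scripted roughly as below. The core path and the frame number are hypothetical; pick the frame sitting inside the function that fired the assert before printing `*this`.

```shell
# Write a small gdb command file: backtrace, select a frame near the
# assert (frame 6 is a placeholder -- choose yours from the bt output),
# then dump the object the assert fired on.
cat > /tmp/pg-assert.gdb <<'EOF'
bt
frame 6
p *this
EOF
# Would then be run as (not executed here, no core file in this sketch):
# gdb /usr/bin/ceph-osd /path/to/core.ceph-osd -batch -x /tmp/pg-assert.gdb
echo "wrote /tmp/pg-assert.gdb"
```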

#18 Updated by Sage Weil about 1 year ago

 -7297> 2018-06-22 23:10:14.762512 7ff3a9b38e00 10 osd.17 320 load_pgs loaded pg[6.4( empty local-lis/les=273/274 n=0 ec=253/253 lis/c 273/180 les/c/f 274/181/0 307/307/304) [18,10,2] r=-1 lpr=319 pi=[253,307)/4 crt=0'0 unknown NOTIFY] log((0'0,0'0], crt=0'0)

last_epoch_clean=181 while epoch_created and epoch_pool_created are 253. This is what makes build_past_intervals_parallel() think it needs to do something.
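For anyone scanning their own logs for the same mismatch, the `ec` and the `/c` field of `les/c/f` can be pulled out of a loaded-pg line. A shell sketch (the parsing is illustrative; the sample line is the one from this comment, and a PG is suspect when last_epoch_clean is older than epoch_created):

```shell
# Extract epoch_created (ec=...) and last_epoch_clean (middle field of
# les/c/f) from a pg[...] log line.
line='pg[6.4( empty local-lis/les=273/274 n=0 ec=253/253 lis/c 273/180 les/c/f 274/181/0 307/307/304)'
ec=$(echo "$line"  | sed -n 's/.* ec=\([0-9]*\)\/.*/\1/p')
lec=$(echo "$line" | sed -n 's/.* les\/c\/f [0-9]*\/\([0-9]*\)\/.*/\1/p')
echo "ec=$ec last_epoch_clean=$lec"
```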

#19 Updated by Sage Weil about 1 year ago

ceph-osd-sda6-sda5-4f6mw.log:2018-06-22 23:10:15.536716 7fd97ec33e00 10 osd.1 pg_epoch: 344 pg[6.4( v 166'96 (0'0,166'96] local-lis/les=214/215 n=2 ec=23/23 lis/c 214/180 les/c/f 215/181/0 344/344/344) [1] r=0 lpr=0 pi=[180,344)/5 crt=166'96 lcod 0'0 mlcod 0'0 unknown] handle_loaded

23 appears to be the real value for epoch_created.

#20 Updated by Sage Weil about 1 year ago

Dexter, anyone: was there a PG split (pg_num increase) on the cluster before this happened? Or maybe a split combined with an OSD upgrade?

#21 Updated by Xiaoxi Chen about 1 year ago

Sage Weil wrote:

Dexter, anyone: was there a PG split (pg_num increase) on the cluster before this happened? Or maybe a split combined with an OSD upgrade?

In my case we did have an OSD failure for another reason before this happened. Shrinking the cluster size will increase the number of PGs on every OSD; is that related?

#22 Updated by Dexter John Genterone about 1 year ago

Any chance you can gdb one of the core files from a crashing OSD to identify which PG it is asserting on? And perhaps 'p *this' on the pi_compact_rep?

Sorry, I no longer have access to the cluster. It has been reprovisioned for another purpose. When I get a chance to reproduce it on another environment, I'll see what I can get.

Dexter, anyone: was there a PG split (pg_num increase) on the cluster before this happened? Or maybe a split combined with an OSD upgrade?

There wasn't any PG split or OSD upgrade. The first instance of this was triggered by an update to the crushmap to balance out the cluster. OSDs were failing during rebalancing, and that triggered this issue.

The second instance was when a node was rebooted; after boot, this got triggered.

I think this can happen when there's movement of PGs? Would the network be a factor in this too (e.g. packets arriving late)?

#23 Updated by Sage Weil about 1 year ago

Well, I can work around the issue: build_past_intervals_parallel() is removed entirely in mimic, and I can do the same in luminous. But I would like to understand why the PG metadata is wrong first (epoch_created in this case).

If anyone can reproduce this with more logging, or share other details about what was happening to the cluster at the time, that would be helpful!

#24 Updated by Sage Weil about 1 year ago

https://github.com/ceph/ceph/pull/22744 disables build_past_intervals_parallel in luminous (by default; it can be turned back on if a buried cluster still needs it)

#25 Updated by Dexter John Genterone about 1 year ago

Thanks Sage. I'll try to get my hands on another environment and see if I can reproduce and get more details. Will update this ticket then.

#26 Updated by Sage Weil about 1 year ago

  • Related to Bug #24588: osd: may get empty info at recovery added

#27 Updated by Dexter John Genterone 12 months ago

We've encountered this again when we were adding a new OSD. Couldn't get a gdb backtrace as gdb wasn't installed and the environment is airtight. :(

We did try the patch and we got past the build_past_interval issue only to encounter issue https://tracker.ceph.com/issues/24866.

#28 Updated by Dexter John Genterone 12 months ago

Attaching a new OSD log.

#29 Updated by Dexter John Genterone 12 months ago

ubuntu@mastercontroller01:~$ ceph -s
  cluster:
    id:     dc00b525-7dca-435a-bfa6-c0b9b216e1f2
    health: HEALTH_ERR
            1 filesystem is degraded
            1 backfillfull osd(s)
            1 chassis (4 osds) down
            21 osds down
            16 hosts (19 osds) down
            14 pool(s) backfillfull
            1 pools have many more objects per pg than average
            269/3384300 objects misplaced (0.008%)
            Reduced data availability: 840 pgs inactive, 42 pgs down, 500 pgs peering, 29 pgs stale
            Degraded data redundancy: 459531/3384300 objects degraded (13.578%), 210 pgs degraded, 472 pgs undersized
            Degraded data redundancy (low space): 1 pg backfill_toofull
            61 slow requests are blocked > 32 sec. Implicated osds 27,28,31

  services:
    mon: 3 daemons, quorum mastercontroller01,mastercontroller02,supercompute01
    mgr: mastercontroller02(active), standbys: supercompute01, mastercontroller01
    mds: cephfs-1/1/1 up {0=mds-mastercontroller02=up:replay}, 2 up:standby
    osd: 46 osds: 13 up, 34 in; 80 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   14 pools, 1312 pgs
    objects: 1101k objects, 5035 GB
    usage:   8454 GB used, 5185 GB / 13640 GB avail
    pgs:     22.713% pgs unknown
             41.311% pgs not active
             459531/3384300 objects degraded (13.578%)
             269/3384300 objects misplaced (0.008%)
             462 peering
             298 unknown
             242 active+undersized
             198 active+undersized+degraded
             34  down
             32  remapped+peering
             13  stale+active+undersized
             11  stale+active+undersized+degraded
             7   creating+down
             7   active+undersized+remapped
             2   stale+remapped+peering
             2   stale+peering
             2   creating+peering
             1   stale+down
             1   active+undersized+degraded+remapped+backfill_toofull

#30 Updated by Dexter John Genterone 11 months ago

Some steps tried to reproduce the bug:

1. Create a luminous cluster running in Kubernetes using hostNetwork, with the public and cluster networks pointing to the same network.
2. Write data to the cluster bringing it to NEARFULL
3. There should be high client IO
4. Trigger rebalance, eg. add new OSD, change crush_rule

This can sometimes cause the bug to trigger. Haven't figured out a way to reliably reproduce it. :(
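Step 4 above can be sketched as below. This is a dry run: `CEPH` prefixes the command with `echo` so nothing hits a cluster, and the OSD id and weights are hypothetical placeholders; drop the `echo` to trigger a real rebalance.

```shell
# Dry-run sketch of triggering a rebalance via crush reweight.
CEPH="echo ceph"                               # remove "echo " on a real cluster
out1=$($CEPH osd crush reweight osd.12 0.5)    # shrink one OSD's crush weight
out2=$($CEPH osd crush reweight osd.12 1.0)    # restore it, moving PGs back
printf '%s\n%s\n' "$out1" "$out2"
```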

#31 Updated by Xiaoxi Chen 10 months ago

FWIW, we hit this issue several times; it seems related to our operational practice of changing `mon_osd_force_trim_to` to latestver - 2000 to address the "mon using too much space" warning.

Another observation is that the problematic PGs usually belong to inactive pools, i.e. PGs on which IO rarely happens.

#32 Updated by Josh Durgin 5 months ago

  • Status changed from Need More Info to Resolved
  • Pull request ID set to 22704

#33 Updated by Nathan Cutler 5 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous

#34 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #38256: luminous: OSD crashes when loading pgs with "FAILED assert(interval.last > last)" added

Also available in: Atom PDF