Bug #8270

closed

0.80~rc1: OSD crash during replication after repair

Added by Dmitry Smirnov almost 10 years ago. Updated almost 10 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -5> 2014-05-02 17:36:59.305440 7f46b148c700  5 -- op tracker -- , seq: 124083, time: 2014-05-02 17:36:59.305310, event: throttled, request: pg_trim(0.2b to 28876'3931 e28876) v1
    -4> 2014-05-02 17:36:59.305442 7f46b148c700  5 -- op tracker -- , seq: 124083, time: 2014-05-02 17:36:59.305382, event: all_read, request: pg_trim(0.2b to 28876'3931 e28876) v1
    -3> 2014-05-02 17:36:59.305445 7f46b148c700  5 -- op tracker -- , seq: 124083, time: 2014-05-02 17:36:59.305427, event: dispatched, request: pg_trim(0.2b to 28876'3931 e28876) v1
    -2> 2014-05-02 17:36:59.305450 7f46b148c700  5 -- op tracker -- , seq: 124083, time: 2014-05-02 17:36:59.305449, event: waiting_for_osdmap, request: pg_trim(0.2b to 28876'3931 e28876) v1
    -1> 2014-05-02 17:36:59.305460 7f46b148c700  5 -- op tracker -- , seq: 124083, time: 2014-05-02 17:36:59.305458, event: started, request: pg_trim(0.2b to 28876'3931 e28876) v1
     0> 2014-05-02 17:36:59.330794 7f46aac7f700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_primary(int, ThreadPool::TPHandle&)' thread 7f46aac7f700 time 2014-05-02 17:36:59.215010
osd/ReplicatedPG.cc: 9682: FAILED assert(latest->is_update())

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (ReplicatedPG::recover_primary(int, ThreadPool::TPHandle&)+0x441) [0x7f46ca066a71]
 2: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x54b) [0x7f46ca095feb]
 3: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x2ae) [0x7f46c9eb488e]
 4: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x7f46c9f11437]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f46ca2c9bb1]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f46ca2caaa0]
 7: (()+0x8062) [0x7f46c93a9062]
 8: (clone()+0x6d) [0x7f46c78d1a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.6.log
--- end dump of recent events ---
2014-05-02 17:36:59.399677 7f46aac7f700 -1 *** Caught signal (Aborted) **
 in thread 7f46aac7f700

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (()+0x5a5daf) [0x7f46ca1f7daf]
 2: (()+0xf880) [0x7f46c93b0880]
 3: (gsignal()+0x39) [0x7f46c78213a9]
 4: (abort()+0x148) [0x7f46c78244c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f46c810e5e5]
 6: (()+0x5e746) [0x7f46c810c746]
 7: (()+0x5e773) [0x7f46c810c773]
 8: (()+0x5e9b2) [0x7f46c810c9b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f46ca2d8cf2]
 10: (ReplicatedPG::recover_primary(int, ThreadPool::TPHandle&)+0x441) [0x7f46ca066a71]
 11: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x54b) [0x7f46ca095feb]
 12: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x2ae) [0x7f46c9eb488e]
 13: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x7f46c9f11437]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f46ca2c9bb1]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x7f46ca2caaa0]
 16: (()+0x8062) [0x7f46c93a9062]
 17: (clone()+0x6d) [0x7f46c78d1a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -26> 2014-05-02 17:36:59.351778 7f46ab480700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.351777, event: started, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -25> 2014-05-02 17:36:59.351796 7f46ab480700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.351795, event: started, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -24> 2014-05-02 17:36:59.351832 7f46ab480700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.351831, event: waiting for subops from 1,7,9, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -23> 2014-05-02 17:36:59.351850 7f46ab480700  1 -- 192.168.0.204:6811/1007885 --> osd.1 192.168.0.250:6809/19720 -- osd_sub_op(mds.0.164:467954 0.19 2575c599/100002d177c.00000000/head//0 [] v 28876'322141 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46d3db6b00
   -22> 2014-05-02 17:36:59.351867 7f46ab480700  1 -- 192.168.0.204:6811/1007885 --> osd.7 192.168.0.2:6814/17668 -- osd_sub_op(mds.0.164:467954 0.19 2575c599/100002d177c.00000000/head//0 [] v 28876'322141 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46e703b700
   -21> 2014-05-02 17:36:59.351881 7f46ab480700  1 -- 192.168.0.204:6811/1007885 --> osd.9 192.168.0.7:6801/1155 -- osd_sub_op(mds.0.164:467954 0.19 2575c599/100002d177c.00000000/head//0 [] v 28876'322141 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46eabb1700
   -20> 2014-05-02 17:36:59.351935 7f46ab480700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.351934, event: commit_queued_for_journal_write, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -19> 2014-05-02 17:36:59.351969 7f46c08de700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.351968, event: write_thread_in_journal_buffer, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -18> 2014-05-02 17:36:59.352159 7f46c08de700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.352159, event: journaled_completion_queued, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -17> 2014-05-02 17:36:59.352177 7f46bf8dc700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.352176, event: op_commit, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -16> 2014-05-02 17:36:59.352729 7f46be0d9700  5 -- op tracker -- , seq: 124082, time: 2014-05-02 17:36:59.352729, event: op_applied, request: osd_op(mds.0.164:467954 100002d177c.00000000 [delete] 0.2575c599 snapc 1=[] ondisk+write e28876) v4
   -15> 2014-05-02 17:36:59.355718 7f46aec87700  1 -- 192.168.0.204:6812/1007885 <== osd.0 192.168.0.250:0/19386 978 ==== osd_ping(ping e28876 stamp 2014-05-02 17:36:59.352828) v2 ==== 47+0+0 (1505424456 0 0) 0x7f46e85b0300 con 0x7f46e6d56000
   -14> 2014-05-02 17:36:59.355737 7f46aec87700  1 -- 192.168.0.204:6812/1007885 --> 192.168.0.250:0/19386 -- osd_ping(ping_reply e28876 stamp 2014-05-02 17:36:59.352828) v2 -- ?+0 0x7f46e0e7c540 con 0x7f46e6d56000
   -13> 2014-05-02 17:36:59.355959 7f46afc89700  1 -- 192.168.0.204:6813/1007885 <== osd.0 192.168.0.250:0/19386 978 ==== osd_ping(ping e28876 stamp 2014-05-02 17:36:59.352828) v2 ==== 47+0+0 (1505424456 0 0) 0x7f46dd0504c0 con 0x7f46e6d566e0
   -12> 2014-05-02 17:36:59.355975 7f46afc89700  1 -- 192.168.0.204:6813/1007885 --> 192.168.0.250:0/19386 -- osd_ping(ping_reply e28876 stamp 2014-05-02 17:36:59.352828) v2 -- ?+0 0x7f46e76e72c0 con 0x7f46e6d566e0
   -11> 2014-05-02 17:36:59.375383 7f46abc81700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375383, event: started, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
   -10> 2014-05-02 17:36:59.375413 7f46abc81700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375412, event: started, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -9> 2014-05-02 17:36:59.375461 7f46abc81700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375460, event: waiting for subops from 3,4,9, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -8> 2014-05-02 17:36:59.375480 7f46abc81700  1 -- 192.168.0.204:6811/1007885 --> osd.3 192.168.0.2:6821/16947 -- osd_sub_op(mds.0.164:467953 0.71 e530bf1/100002d177b.00000000/head//0 [] v 28876'272097 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46ebaa6000
    -7> 2014-05-02 17:36:59.375500 7f46abc81700  1 -- 192.168.0.204:6811/1007885 --> osd.4 192.168.0.250:6819/20528 -- osd_sub_op(mds.0.164:467953 0.71 e530bf1/100002d177b.00000000/head//0 [] v 28876'272097 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46db7ee200
    -6> 2014-05-02 17:36:59.375515 7f46abc81700  1 -- 192.168.0.204:6811/1007885 --> osd.9 192.168.0.7:6801/1155 -- osd_sub_op(mds.0.164:467953 0.71 e530bf1/100002d177b.00000000/head//0 [] v 28876'272097 snapset=0=[]:[] snapc=0=[]) v10 -- ?+125 0x7f46e747a100
    -5> 2014-05-02 17:36:59.375573 7f46abc81700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375572, event: commit_queued_for_journal_write, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -4> 2014-05-02 17:36:59.375611 7f46c08de700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375610, event: write_thread_in_journal_buffer, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -3> 2014-05-02 17:36:59.375807 7f46c08de700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375806, event: journaled_completion_queued, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -2> 2014-05-02 17:36:59.375938 7f46bf8dc700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.375937, event: op_commit, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
    -1> 2014-05-02 17:36:59.376326 7f46be0d9700  5 -- op tracker -- , seq: 124081, time: 2014-05-02 17:36:59.376325, event: op_applied, request: osd_op(mds.0.164:467953 100002d177b.00000000 [delete] 0.e530bf1 snapc 1=[] ondisk+write e28876) v4
     0> 2014-05-02 17:36:59.399677 7f46aac7f700 -1 *** Caught signal (Aborted) **
 in thread 7f46aac7f700

 ceph version 0.80-rc1 (6769f4dc88425396921f94e1a37a1c90758aa3ea)
 1: (()+0x5a5daf) [0x7f46ca1f7daf]
 2: (()+0xf880) [0x7f46c93b0880]
 3: (gsignal()+0x39) [0x7f46c78213a9]
 4: (abort()+0x148) [0x7f46c78244c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f46c810e5e5]
 6: (()+0x5e746) [0x7f46c810c746]
 7: (()+0x5e773) [0x7f46c810c773]
 8: (()+0x5e9b2) [0x7f46c810c9b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0x7f46ca2d8cf2]
 10: (ReplicatedPG::recover_primary(int, ThreadPool::TPHandle&)+0x441) [0x7f46ca066a71]
 11: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x54b) [0x7f46ca095feb]
 12: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x2ae) [0x7f46c9eb488e]
 13: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x17) [0x7f46c9f11437]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f46ca2c9bb1]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x7f46ca2caaa0]
 16: (()+0x8062) [0x7f46c93a9062]
 17: (clone()+0x6d) [0x7f46c78d1a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.6.log
--- end dump of recent events ---
Actions #1

Updated by Dmitry Smirnov almost 10 years ago

ceph pg deep-scrub 0.2b

Over 10000 errors like the following are found:

osd.6 [ERR] 0.2b shard 1 missing b0abef2b/100004087af.00000000/head//0
osd.6 [ERR] 0.2b deep-scrub 3915 missing, 0 inconsistent objects
osd.6 [ERR] 0.2b deep-scrub 11745 errors

The attempt to repair using

ceph pg repair 0.2b
appears to succeed -- the errors disappear and some objects are shown in the 'degraded' state; recovery begins but shortly ends with a crash of osd.6.

I repeated the above steps three times and it doesn't seem to repair anything. osd.6 is Btrfs-based; I've checked for file system errors, but the file system looks healthy: no surface defects, no CRC or other errors logged by the kernel, etc.
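
For reference, a minimal sketch of the sequence described above, using the pg id 0.2b from this report; the extra inspection commands are standard Ceph CLI added here for illustration, not taken from the original log:

    # trigger a deep scrub of the suspect PG and watch for scrub errors
    ceph pg deep-scrub 0.2b
    ceph health detail            # lists inconsistent PGs and error counts

    # attempt the repair, then watch the PG state while recovery runs
    ceph pg repair 0.2b
    ceph pg 0.2b query            # shows state, e.g. active+degraded / recovering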

Actions #2

Updated by Dmitry Smirnov almost 10 years ago

It's getting interesting: every time fewer errors are found, so apparently each iteration fixes some.
Also, the OSD stopped crashing, and I wonder whether that is because the number of errors dropped below 10000 or because I no longer use deep-scrub. Just

ceph pg repair 0.2b

puts the PG into the "active+clean+scrubbing+deep+repair" state.

Then the number of errors goes down:

10622 errors

osd.6 [ERR] 0.2b repair 3541 missing, 0 inconsistent objects 
osd.6 [ERR] 0.2b repair 10619 errors, 10619 fixed

----

osd.6 [ERR] 0.2b repair 1247 missing, 0 inconsistent objects 
2014-05-02 19:47:54.019812 osd.6 [ERR] 0.2b repair 3736 errors, 3736 fixed 

Eventually all the errors are gone.
I recognise several problems from this experience:

  • Crash.
  • Incomplete repair.
  • Status "active+clean" after repair while some errors remain unfixed.

Perhaps repair should automatically be followed by a scrub.
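
A rough sketch of doing that by hand, under the assumption that the repair has to be re-verified explicitly (same pg id as above; the wait/verify steps are illustrative):

    # repair, wait for the PG to leave the scrubbing+repair state...
    ceph pg repair 0.2b
    ceph pg 0.2b query | grep '"state"'

    # ...then verify with a fresh deep scrub and re-check for errors
    ceph pg deep-scrub 0.2b
    ceph health detail | grep 0.2b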

Actions #3

Updated by Samuel Just almost 10 years ago

  • Priority changed from Normal to Urgent

Can you reproduce with debug osd = 20, debug filestore = 20, debug ms = 1?
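
For anyone hitting the same assert, one way to raise those debug levels on the running daemon (a sketch only; osd.6 is the daemon from the log above, and the levels should be reset afterwards):

    # bump debug levels on the live OSD without restarting it
    ceph tell osd.6 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # or set them persistently in ceph.conf under [osd] and restart the daemon:
    #   debug osd = 20
    #   debug filestore = 20
    #   debug ms = 1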

Actions #4

Updated by Dmitry Smirnov almost 10 years ago

No idea how to replicate this on a healthy cluster.
The problem seems to be gone for now...

Actions #5

Updated by Samuel Just almost 10 years ago

It should take something rather extreme to create inconsistent objects in the first place. Did anything interesting happen to any of the osds in the cluster?

Actions #6

Updated by Dmitry Smirnov almost 10 years ago

Well, there were a number of upgrades, 0.72.2 --> 0.78 --> 0.79 --> 0.80~rc1 --> 0.80,
to give you an impression of how much could have changed.
Then I had the misfortune of upgrading the Linux kernel to 3.14.1 on some OSD hosts.
That kernel version suffers from a serious Btrfs regression (apparently fixed in 3.14.2),
so I lost several OSDs to kernel crashes and unclean shutdowns. It was impossible
to repair the affected OSDs, so I was replacing them with freshly formatted ones, but before the cluster
had a chance to recover, more OSDs died. By the time I found the problem I had lost 3 replicas out of 4.
Then the crash of the last OSD, which couldn't replicate its only copy of an object for three days, produced 1 "unfound" object
that I had to recover using "mark_unfound_lost revert".
Finally, after a RAM upgrade one OSD host revealed some RAM errors, so it was taken out (but I fear it could have propagated some corrupted data during recovery).

By the time I reported this bug all the problems had been resolved, but (deep-)scrub was still finding some errors,
and in this instance it was unable to repair the errors effectively...
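
For context, the unfound-object recovery mentioned above usually looks something like the following; <pgid> is a placeholder, since the report does not say which PG was affected:

    # find PGs with unfound objects and inspect what is missing
    ceph health detail | grep unfound
    ceph pg <pgid> list_missing

    # give up on the lost copy and revert the object to its prior version
    ceph pg <pgid> mark_unfound_lost revert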

Actions #7

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Can't reproduce