Bug #7630

closed

Simple tiering test produces OSD assert

Added by Mark Nelson about 10 years ago. Updated about 10 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

An initial tiering test produced the assert below just as rados bench started. Both pools were on the same physical disks. I'm not sure the procedure used to set up tiering was correct, but either way the OSD should probably do something nicer than throw an assert in the log and fail outright.

Steps to reproduce:

ceph osd pool create cache 4096
ceph osd pool create base 4096
ceph osd pool set cache size 1
ceph osd pool set base size 3
ceph osd tier add base cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay base cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 60
ceph osd pool set cache target_max_objects 5000
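
For repeated runs, the steps above can be wrapped in a small script. This is only a sketch and not part of the original report: the `run` and `setup_tiering` helpers and the `DRY_RUN` switch are hypothetical names introduced here, and the `ceph` commands are copied unchanged from the list above. With `DRY_RUN=1` (the default in this sketch) nothing is executed; each command is only printed.

```shell
#!/bin/sh
# Sketch: replay the tiering setup from this report.
# DRY_RUN is a hypothetical switch: 1 prints each command, 0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise run it and report a failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" || { echo "failed: $*" >&2; return 1; }
    fi
}

# Same commands, same order as in the report; stops at the first failure.
setup_tiering() {
    run ceph osd pool create cache 4096 &&
    run ceph osd pool create base 4096 &&
    run ceph osd pool set cache size 1 &&
    run ceph osd pool set base size 3 &&
    run ceph osd tier add base cache &&
    run ceph osd tier cache-mode cache writeback &&
    run ceph osd tier set-overlay base cache &&
    run ceph osd pool set cache hit_set_type bloom &&
    run ceph osd pool set cache hit_set_count 8 &&
    run ceph osd pool set cache hit_set_period 60 &&
    run ceph osd pool set cache target_max_objects 5000
}

setup_tiering
```

Running it with `DRY_RUN=0` would execute the commands against whatever cluster the local `ceph` client is configured for, so the dry-run output is worth checking first.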

Client output:

2014-03-06 09:14:54.960043 7f3dd59aa700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6835/20215 pipe(0x7f3de400a400 sd=11 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400a660).fault
2014-03-06 09:14:54.966151 7f3dd58a9700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6940/4132 pipe(0x7f3de400ab60 sd=12 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400a8d0).fault
2014-03-06 09:14:54.971522 7f3dd65b6700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6800/17141 pipe(0x7f3de400ae60 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400b0c0).fault
2014-03-06 09:14:54.974074 7f3dd5aab700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6875/25320 pipe(0x7f3de400b320 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400b580).fault
2014-03-06 09:14:54.985299 7f3dd50a1700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6910/30727 pipe(0x7f3de400b7b0 sd=16 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400ba10).fault
2014-03-06 09:14:54.985383 7f3dd60b1700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6850/22081 pipe(0x7f3de400bc40 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400bea0).fault
2014-03-06 09:14:54.987171 7f3dd51a2700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6840/20706 pipe(0x7f3de400c430 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400c690).fault
2014-03-06 09:14:54.987204 7f3d6b5f7700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6890/27510 pipe(0x15c1920 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x15b1600).fault
2014-03-06 09:14:54.991210 7f3dd56a7700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6810/17957 pipe(0x7f3de400d320 sd=9 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400d580).fault
2014-03-06 09:14:54.995999 7f3dd68b9700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6865/23979 pipe(0x7f3de400e630 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400e890).fault
2014-03-06 09:14:54.996277 7f3dd4d9e700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6920/32527 pipe(0x7f3de400f540 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de400f7a0).fault
2014-03-06 09:14:54.997669 7f3dd4a9b700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6825/19273 pipe(0x7f3de4010490 sd=21 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de40106f0).fault
2014-03-06 09:14:55.007518 7f3dd5cad700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6870/24641 pipe(0x7f3de4010920 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de4010b80).fault
2014-03-06 09:14:55.028786 7f3d6b7f9700  0 -- 192.168.10.2:0/1008072 >> 192.168.10.1:6815/18383 pipe(0x7f3de40114b0 sd=31 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3de4011710).fault

Assert in OSD logs:

     0> 2014-03-06 09:20:14.811587 7fbf89839700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::process_copy_chunk(hobject_t, tid_t, int)' thread 7fbf89839700 time 2014-03-06 09:20:14.793633
osd/ReplicatedPG.cc: 5444: FAILED assert(cop->rval >= 0)

 ceph version 0.77-655-g195d53a (195d53a7fc695ed954c85022fef6d2a18f68fe20)
 1: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0xf9d) [0x885d8d]
 2: (C_Copyfrom::finish(int)+0x99) [0x8d6f79]
 3: (Context::complete(int)+0x9) [0x658fc9]
 4: (Finisher::finisher_thread_entry()+0x1b8) [0x980da8]
 5: (()+0x7f6e) [0x7fbfa4d4df6e]
 6: (clone()+0x6d) [0x7fbfa32f29cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.0.log
--- end dump of recent events ---
2014-03-06 09:20:14.830535 7fbf89839700 -1 *** Caught signal (Aborted) **
 in thread 7fbf89839700

 ceph version 0.77-655-g195d53a (195d53a7fc695ed954c85022fef6d2a18f68fe20)
 1: ceph-osd() [0x95ebbf]
 2: (()+0xfbb0) [0x7fbfa4d55bb0]
 3: (gsignal()+0x37) [0x7fbfa322ef77]
 4: (abort()+0x148) [0x7fbfa32325e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fbfa3b3a6e5]
 6: (()+0x5e856) [0x7fbfa3b38856]
 7: (()+0x5e883) [0x7fbfa3b38883]
 8: (()+0x5eaae) [0x7fbfa3b38aae]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa3e332]
 10: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0xf9d) [0x885d8d]
 11: (C_Copyfrom::finish(int)+0x99) [0x8d6f79]
 12: (Context::complete(int)+0x9) [0x658fc9]
 13: (Finisher::finisher_thread_entry()+0x1b8) [0x980da8]
 14: (()+0x7f6e) [0x7fbfa4d4df6e]
 15: (clone()+0x6d) [0x7fbfa32f29cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Actions #1

Updated by Mark Nelson about 10 years ago

rados bench command run:

/usr/bin/rados -c /tmp/cbt/ceph/ceph.conf -p base -b 4194304 bench 300 write --concurrent-ios 32 --no-cleanup
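
As a rough sizing check (an addition, not from the original report), the bench parameters can be compared against the `target_max_objects 5000` setting on the cache pool. The sketch below assumes each bench write creates one object of the `-b` size and that the shell has 64-bit arithmetic; the variable names are made up for illustration.

```shell
#!/bin/sh
# Values taken from the commands in this report.
obj_bytes=4194304      # rados bench -b (4 MiB per write)
concurrent_ios=32      # rados bench --concurrent-ios
max_objects=5000       # cache pool target_max_objects

# Data written by the time the object count reaches target_max_objects,
# assuming one object per write (an assumption, not stated in the report).
cache_bytes=$((obj_bytes * max_objects))

# Data in flight at any moment with all concurrent writes outstanding.
inflight_bytes=$((obj_bytes * concurrent_ios))

echo "bytes at target_max_objects: $cache_bytes ($((cache_bytes / 1024 / 1024)) MiB)"
echo "bytes in flight: $inflight_bytes ($((inflight_bytes / 1024 / 1024)) MiB)"
```

Under those assumptions the object-count target corresponds to 20000 MiB written, with 128 MiB in flight at a time. This says nothing about the cause of the assert; it only puts the bench load and the cache setting on the same scale.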

Actions #2

Updated by Sage Weil about 10 years ago

  • Status changed from New to Resolved
Actions #3

Updated by Greg Farnum about 10 years ago

  • Status changed from Resolved to Duplicate