Bug #8947 (Closed)
Writing rados objects with max objects set for cache pool crashed osd
Description
Setting the target_max_objects parameter and then writing rados objects onto the cache pool crashed an osd.
History of operations on the setup:
There are 2 other cache tiers on the setup, which are used to test these parameters:
1> cache_min_flush_age & cache_target_full_ratio
2> cache_min_flush_age
Steps:
1> Created erasure-coded profile "ecprofile"
2> Created erasure-coded pool "ecpoolobj" using "ecprofile"
3> Created "cachepoolobj" (replicated pool)
4> Created a cache tier using ecpoolobj & cachepoolobj
5> Set cachepoolobj to writeback mode
6> Set overlay on ecpoolobj to redirect I/Os to cachepoolobj
7> Set the target_max_objects parameter to 10
8> Wrote one object onto cachepoolobj using the rados utility
Then started writing 10 rados objects to cachepoolobj.
One of the osds crashed initially; after some time, 3 of the osds went down.
Attaching all osds' logs.
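The transcript below covers steps 2 through 7; steps 1 and 8 are not shown there, and would look roughly as follows (the erasure-code profile parameters and the object/file names here are assumptions, not taken from the report):

```shell
# Step 1: create the erasure-code profile used by ecpoolobj
# (k/m values are assumptions; the report does not state which were used)
sudo ceph osd erasure-code-profile set ecprofile k=2 m=1

# Step 8: write a single object through the cache tier
# (object name and source file are placeholders)
sudo rados -p cachepoolobj put obj0 ./somefile.bin
```

Note that in the transcript below, target_max_objects is actually set on pool 'cachepool' (pool 4) rather than on 'cachepoolobj'.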
ubuntu@ubuntuServer1:~$ sudo ceph osd pool create ecpoolobj 12 12 erasure ecprofile
pool 'ecpoolobj' created
ubuntu@ubuntuServer1:~$
ubuntu@ubuntuServer1:~$ sudo ceph osd pool create cachepoolobj 256 256
pool 'cachepoolobj' created
ubuntu@ubuntuServer1:~$ sudo ceph osd tier add ecpoolobj cachepoolobj
pool 'cachepoolobj' is now (or already was) a tier of 'ecpoolobj'
ubuntu@ubuntuServer1:~$ sudo ceph osd tier cache-mode cachepoolobj writeback
set cache-mode for pool 'cachepoolobj' to writeback
ubuntu@ubuntuServer1:~$ sudo ceph osd tier set-overlay ecpoolobj cachepoolobj
overlay for 'ecpoolobj' is now (or already was) 'cachepoolobj'
ubuntu@ubuntuServer1:~$ sudo ceph osd pool set cachepool target_max_objects 10
set pool 4 target_max_objects to 10
ubuntu@ubuntuServer1:~$
ubuntu@ubuntuServer1:~$ sudo rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
cachepool - 36 1 0 0 0 0 0 3 37
cachepoolage - 36 1 0 0 0 0 0 2 37
cachepoolobj - 0 0 0 0 0 0 0 0 0
data - 0 0 0 0 0 0 0 0 0
ecpool - 36 1 0 0 0 1 0 2 37
ecpoolage - 0 0 0 0 0 0 0 0 0
ecpoolobj - 0 0 0 0 0 0 0 0 0
metadata - 0 0 0 0 0 0 0 0 0
rbd - 0 0 0 0 0 0 0 0 0
total used 170460 3
total avail 757575652
total space 757746112
ubuntu@ubuntuServer1:~$
ubuntu@ubuntufunc1:~$ for i in {1..9}; do sudo rados put p cachepoolobj obj1 CEPH-Ubuntu14.04_x64-1.06.00/python-ceph_0.80-1_amd64.deb; done
10.66.26.5:0/1013789 >> 10.242.24.154:6802/28779 pipe(0x1f1bbe0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x1f1be50).fault
ubuntu@ubuntufunc1:~$ for i in {1..9}; do sudo rados put -p cachepoolobj obj$i CEPH-Ubuntu14.04_x64-1.06.00/python-ceph_0.80-1_amd64.deb; done
2014-07-25 18:26:03.041974 7f895d162700 0 -
2014-07-25 18:26:34.521877 7f895d061700 0 -- 10.66.26.5:0/1013789 >> 10.66.26.5:6800/12812 pipe(0x7f8954007190 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8954007400).fault
Crash logs:
-28> 2014-07-25 18:28:15.643983 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549656, event: header_read, op: pg_info(1 pgs e65:1.a)
-27> 2014-07-25 18:28:15.644024 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549657, event: throttled, op: pg_info(1 pgs e65:1.a)
-26> 2014-07-25 18:28:15.644032 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549748, event: all_read, op: pg_info(1 pgs e65:1.a)
-25> 2014-07-25 18:28:15.644039 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.643981, event: dispatched, op: pg_info(1 pgs e65:1.a)
-24> 2014-07-25 18:28:15.644046 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644045, event: waiting_for_osdmap, op: pg_info(1 pgs e65:1.a)
-23> 2014-07-25 18:28:15.644055 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644053, event: started, op: pg_info(1 pgs e65:1.a)
-22> 2014-07-25 18:28:15.644071 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644069, event: done, op: pg_info(1 pgs e65:1.a)
-21> 2014-07-25 18:28:15.644082 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.1 10.242.24.111:6801/31290 596 ==== pg_info(1 pgs e65:6.10) v4 ==== 714+0+0 (522635800 0 0) 0x5230680 con 0x487c420
-20> 2014-07-25 18:28:15.644093 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550890, event: header_read, op: pg_info(1 pgs e65:6.10)
-19> 2014-07-25 18:28:15.644101 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550891, event: throttled, op: pg_info(1 pgs e65:6.10)
-18> 2014-07-25 18:28:15.644108 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550907, event: all_read, op: pg_info(1 pgs e65:6.10)
-17> 2014-07-25 18:28:15.644117 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644091, event: dispatched, op: pg_info(1 pgs e65:6.10)
-16> 2014-07-25 18:28:15.644125 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644123, event: waiting_for_osdmap, op: pg_info(1 pgs e65:6.10)
-15> 2014-07-25 18:28:15.644133 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644132, event: started, op: pg_info(1 pgs e65:6.10)
-14> 2014-07-25 18:28:15.644150 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644148, event: done, op: pg_info(1 pgs e65:6.10)
-13> 2014-07-25 18:28:15.644164 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.0 10.66.26.5:6801/12812 627 ==== pg_info(1 pgs e65:8.9) v4 ==== 714+0+0 (527915233 0 0) 0x4dcec40 con 0x487c160
-12> 2014-07-25 18:28:15.644175 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549772, event: header_read, op: pg_info(1 pgs e65:8.9)
-11> 2014-07-25 18:28:15.644184 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549772, event: throttled, op: pg_info(1 pgs e65:8.9)
-10> 2014-07-25 18:28:15.644191 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549987, event: all_read, op: pg_info(1 pgs e65:8.9)
-9> 2014-07-25 18:28:15.644168 7f4f299b2700 -1 *** Caught signal (Aborted) **
in thread 7f4f299b2700
1: /usr/bin/ceph-osd() [0x974a1f]
2: (()+0x10340) [0x7f4f4c5ee340]
3: (gsignal()+0x39) [0x7f4f4a8c2f79]
4: (abort()+0x148) [0x7f4f4a8c6388]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4f4b1ce6b5]
6: (()+0x5e836) [0x7f4f4b1cc836]
7: (()+0x5e863) [0x7f4f4b1cc863]
8: (()+0x5eaa2) [0x7f4f4b1ccaa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa5a692]
10: (ReplicatedPG::agent_estimate_atime_temp(hobject_t const&, int*, int*)+0x12e) [0x7ad25e]
11: (ReplicatedPG::agent_maybe_evict(std::tr1::shared_ptr<ObjectContext>&)+0x7a) [0x7e1a2a]
12: (ReplicatedPG::agent_work(int)+0xf45) [0x7fb1d5]
13: (OSDService::agent_entry()+0x557) [0x6090c7]
14: (OSDService::AgentThread::entry()+0xd) [0x67201d]
15: (()+0x8182) [0x7f4f4c5e6182]
16: (clone()+0x6d) [0x7f4f4a98730d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-8> 2014-07-25 18:28:15.644199 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644172, event: dispatched, op: pg_info(1 pgs e65:8.9)
-7> 2014-07-25 18:28:15.644207 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644205, event: waiting_for_osdmap, op: pg_info(1 pgs e65:8.9)
-6> 2014-07-25 18:28:15.644233 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644230, event: started, op: pg_info(1 pgs e65:8.9)
-5> 2014-07-25 18:28:15.644249 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644247, event: done, op: pg_info(1 pgs e65:8.9)
-4> 2014-07-25 18:28:15.644258 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.1 10.242.24.111:6801/31290 597 ==== pg_info(1 pgs e65:8.26) v4 ==== 714+0+0 (367980315 0 0) 0x4dcf180 con 0x487c420
-3> 2014-07-25 18:28:15.644267 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.550916, event: header_read, op: pg_info(1 pgs e65:8.26)
-2> 2014-07-25 18:28:15.644273 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.550916, event: throttled, op: pg_info(1 pgs e65:8.26)
-1> 2014-07-25 18:28:15.644282 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.551111, event: all_read, op: pg_info(1 pgs e65:8.26)
0> 2014-07-25 18:28:15.644291 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.644265, event: dispatched, op: pg_info(1 pgs e65:8.26)
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.2.log
-- end dump of recent events ---
ubuntu@ubuntuServer1:~$
Files
Updated by Mallikarjun Biradar over 9 years ago
Test configuration:
No of osd nodes: 3
No of osds: 4
No of monitors: 2
Kernel version: 3.13.0-24-generic
No of gateway nodes: 1
Updated by Greg Farnum over 9 years ago
I don't remember the details, but we were previously crashing with a 10-object limit anyway due to hit sets and such. This might just be that.
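Greg's remark refers to the tiering agent's reliance on hit-set statistics when deciding what to flush or evict. A cache pool is typically given a hit-set configuration along the following lines (the values here are illustrative, not taken from this setup):

```shell
# Illustrative hit-set configuration for the cache pool; the tiering
# agent uses these access-history sets to estimate object temperature.
sudo ceph osd pool set cachepoolobj hit_set_type bloom
sudo ceph osd pool set cachepoolobj hit_set_count 1
sudo ceph osd pool set cachepoolobj hit_set_period 3600
```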
Updated by David Zafman over 9 years ago
Could not reproduce using vstart.sh on current master branch. I never saw a crash or bug report with that stack trace.
Updated by Mallikarjun Biradar over 9 years ago
- File _usr_bin_ceph-osd.0.crash _usr_bin_ceph-osd.0.crash added
- File _usr_bin_ceph-osd.0.zip _usr_bin_ceph-osd.0.zip added
Uploading crash dump
Updated by Sage Weil over 9 years ago
- Status changed from New to Need More Info
Can you attach the complete logs? All three osds claim to have hit an assert, but the assert message isn't in the log; it's usually further up.
Also, strangely, there is no assert() call in ReplicatedPG::agent_estimate_atime_temp()...
In any case, the complete osd log would be helpful!
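Higher-verbosity OSD logs of the kind requested here are commonly captured by raising the debug levels at runtime, for example (the chosen levels are typical triage values, not mandated by the report):

```shell
# Raise osd and messenger debug levels on all running OSDs at runtime,
# so the next crash log carries the assert message and surrounding context.
sudo ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'
```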
Updated by Sage Weil over 9 years ago
- Status changed from Need More Info to Duplicate
Oh, I see it now. This is a dup of #8982.