Bug #8947

Writing rados objects with max objects set for cache pool crashed osd

Added by Mallikarjun Biradar over 9 years ago. Updated over 9 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Setting the target_max_objects parameter and then writing rados objects onto the cache pool crashed an osd.

History of operations on the setup:
There are two other cache tiers on this cluster, used to test these parameters:
1> cache_min_flush_age & cache_target_full_ratio
2> cache_min_flush_age

Steps:
1> Created erasure-coded profile "ecprofile"
2> Created erasure-coded pool "ecpoolobj" using "ecprofile"
3> Created cachepoolobj (replicated pool)
4> Created a cache tier using ecpoolobj & cachepoolobj
5> Set cachepoolobj in writeback mode
6> Set overlay on ecpoolobj to redirect IOs to cachepoolobj
7> Set the target_max_objects parameter to 10
8> Wrote one object onto cachepoolobj using the rados utility
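
For reference, the steps above can be replayed as a single command sequence (a sketch assembled from the transcript below; the erasure-code-profile parameters are an assumption, since the profile-creation command is not shown in the transcript):

# Sketch of the reproduction; k/m values below are illustrative,
# the original ecprofile settings are not shown in this report.
sudo ceph osd erasure-code-profile set ecprofile k=2 m=1
sudo ceph osd pool create ecpoolobj 12 12 erasure ecprofile
sudo ceph osd pool create cachepoolobj 256 256
sudo ceph osd tier add ecpoolobj cachepoolobj
sudo ceph osd tier cache-mode cachepoolobj writeback
sudo ceph osd tier set-overlay ecpoolobj cachepoolobj
sudo ceph osd pool set cachepoolobj target_max_objects 10
# Write objects through the cache tier, as in the loop below
for i in {1..9}; do sudo rados put -p cachepoolobj obj$i CEPH-Ubuntu14.04_x64-1.06.00/python-ceph_0.80-1_amd64.deb; done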

Started writing 10 rados objects on cachepoolobj.

One of the osds crashed initially.

After some time, 3 of the osds went down.

Attaching all osd logs.

ubuntu@ubuntuServer1:~$ sudo ceph osd pool create ecpoolobj 12 12 erasure ecprofile
pool 'ecpoolobj' created
ubuntu@ubuntuServer1:~$

ubuntu@ubuntuServer1:~$ sudo ceph osd pool create cachepoolobj 256 256
pool 'cachepoolobj' created

ubuntu@ubuntuServer1:~$ sudo ceph osd tier add ecpoolobj cachepoolobj
pool 'cachepoolobj' is now (or already was) a tier of 'ecpoolobj'

ubuntu@ubuntuServer1:~$ sudo ceph osd tier cache-mode cachepoolobj writeback
set cache-mode for pool 'cachepoolobj' to writeback

ubuntu@ubuntuServer1:~$ sudo ceph osd tier set-overlay ecpoolobj cachepoolobj
overlay for 'ecpoolobj' is now (or already was) 'cachepoolobj'

ubuntu@ubuntuServer1:~$ sudo ceph osd pool set cachepool target_max_objects 10
set pool 4 target_max_objects to 10
ubuntu@ubuntuServer1:~$

ubuntu@ubuntuServer1:~$ sudo rados df
pool name      category   KB         objects   clones   degraded   unfound   rd   rd KB   wr   wr KB
cachepool      -          36         1         0        0          0         0    0       3    37
cachepoolage   -          36         1         0        0          0         0    0       2    37
cachepoolobj   -          0          0         0        0          0         0    0       0    0
data           -          0          0         0        0          0         0    0       0    0
ecpool         -          36         1         0        0          0         1    0       2    37
ecpoolage      -          0          0         0        0          0         0    0       0    0
ecpoolobj      -          0          0         0        0          0         0    0       0    0
metadata       -          0          0         0        0          0         0    0       0    0
rbd            -          0          0         0        0          0         0    0       0    0
total used     170460     3
total avail    757575652
total space    757746112
ubuntu@ubuntuServer1:~$

ubuntu@ubuntufunc1:~$ for i in {1..9}; do sudo rados put -p cachepoolobj obj$i CEPH-Ubuntu14.04_x64-1.06.00/python-ceph_0.80-1_amd64.deb; done
2014-07-25 18:26:03.041974 7f895d162700 0 -- 10.66.26.5:0/1013789 >> 10.242.24.154:6802/28779 pipe(0x1f1bbe0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x1f1be50).fault
2014-07-25 18:26:34.521877 7f895d061700 0 -- 10.66.26.5:0/1013789 >> 10.66.26.5:6800/12812 pipe(0x7f8954007190 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8954007400).fault
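
(As a side note, the value actually applied can be confirmed per pool with 'ceph osd pool get'; a minimal sketch. Note that the transcript above sets the parameter on "cachepool", while the steps describe the new tier "cachepoolobj".)

# Confirm which pool actually carries the 10-object limit
sudo ceph osd pool get cachepool target_max_objects
sudo ceph osd pool get cachepoolobj target_max_objects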

Crash logs:
-28> 2014-07-25 18:28:15.643983 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549656, event: header_read, op: pg_info(1 pgs e65:1.a)
-27> 2014-07-25 18:28:15.644024 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549657, event: throttled, op: pg_info(1 pgs e65:1.a)
-26> 2014-07-25 18:28:15.644032 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.549748, event: all_read, op: pg_info(1 pgs e65:1.a)
-25> 2014-07-25 18:28:15.644039 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.643981, event: dispatched, op: pg_info(1 pgs e65:1.a)
-24> 2014-07-25 18:28:15.644046 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644045, event: waiting_for_osdmap, op: pg_info(1 pgs e65:1.a)
-23> 2014-07-25 18:28:15.644055 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644053, event: started, op: pg_info(1 pgs e65:1.a)
-22> 2014-07-25 18:28:15.644071 7f4f3c1d7700 5 -- op tracker -- , seq: 1261, time: 2014-07-25 18:28:15.644069, event: done, op: pg_info(1 pgs e65:1.a)
-21> 2014-07-25 18:28:15.644082 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.1 10.242.24.111:6801/31290 596 ==== pg_info(1 pgs e65:6.10) v4 ==== 714+0+0 (522635800 0 0) 0x5230680 con 0x487c420
-20> 2014-07-25 18:28:15.644093 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550890, event: header_read, op: pg_info(1 pgs e65:6.10)
-19> 2014-07-25 18:28:15.644101 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550891, event: throttled, op: pg_info(1 pgs e65:6.10)
-18> 2014-07-25 18:28:15.644108 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.550907, event: all_read, op: pg_info(1 pgs e65:6.10)
-17> 2014-07-25 18:28:15.644117 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644091, event: dispatched, op: pg_info(1 pgs e65:6.10)
-16> 2014-07-25 18:28:15.644125 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644123, event: waiting_for_osdmap, op: pg_info(1 pgs e65:6.10)
-15> 2014-07-25 18:28:15.644133 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644132, event: started, op: pg_info(1 pgs e65:6.10)
-14> 2014-07-25 18:28:15.644150 7f4f3c1d7700 5 -- op tracker -- , seq: 1262, time: 2014-07-25 18:28:15.644148, event: done, op: pg_info(1 pgs e65:6.10)
-13> 2014-07-25 18:28:15.644164 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.0 10.66.26.5:6801/12812 627 ==== pg_info(1 pgs e65:8.9) v4 ==== 714+0+0 (527915233 0 0) 0x4dcec40 con 0x487c160
-12> 2014-07-25 18:28:15.644175 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549772, event: header_read, op: pg_info(1 pgs e65:8.9)
-11> 2014-07-25 18:28:15.644184 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549772, event: throttled, op: pg_info(1 pgs e65:8.9)
-10> 2014-07-25 18:28:15.644191 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.549987, event: all_read, op: pg_info(1 pgs e65:8.9)
-9> 2014-07-25 18:28:15.644168 7f4f299b2700 -1 *** Caught signal (Aborted) **
in thread 7f4f299b2700

1: /usr/bin/ceph-osd() [0x974a1f]
2: (()+0x10340) [0x7f4f4c5ee340]
3: (gsignal()+0x39) [0x7f4f4a8c2f79]
4: (abort()+0x148) [0x7f4f4a8c6388]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4f4b1ce6b5]
6: (()+0x5e836) [0x7f4f4b1cc836]
7: (()+0x5e863) [0x7f4f4b1cc863]
8: (()+0x5eaa2) [0x7f4f4b1ccaa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa5a692]
10: (ReplicatedPG::agent_estimate_atime_temp(hobject_t const&, int*, int*)+0x12e) [0x7ad25e]
11: (ReplicatedPG::agent_maybe_evict(std::tr1::shared_ptr<ObjectContext>&)+0x7a) [0x7e1a2a]
12: (ReplicatedPG::agent_work(int)+0xf45) [0x7fb1d5]
13: (OSDService::agent_entry()+0x557) [0x6090c7]
14: (OSDService::AgentThread::entry()+0xd) [0x67201d]
15: (()+0x8182) [0x7f4f4c5e6182]
16: (clone()+0x6d) [0x7f4f4a98730d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-8> 2014-07-25 18:28:15.644199 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644172, event: dispatched, op: pg_info(1 pgs e65:8.9)
-7> 2014-07-25 18:28:15.644207 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644205, event: waiting_for_osdmap, op: pg_info(1 pgs e65:8.9)
-6> 2014-07-25 18:28:15.644233 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644230, event: started, op: pg_info(1 pgs e65:8.9)
-5> 2014-07-25 18:28:15.644249 7f4f3c1d7700 5 -- op tracker -- , seq: 1263, time: 2014-07-25 18:28:15.644247, event: done, op: pg_info(1 pgs e65:8.9)
-4> 2014-07-25 18:28:15.644258 7f4f3c1d7700 1 -- 10.242.24.154:6803/28779 <== osd.1 10.242.24.111:6801/31290 597 ==== pg_info(1 pgs e65:8.26) v4 ==== 714+0+0 (367980315 0 0) 0x4dcf180 con 0x487c420
-3> 2014-07-25 18:28:15.644267 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.550916, event: header_read, op: pg_info(1 pgs e65:8.26)
-2> 2014-07-25 18:28:15.644273 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.550916, event: throttled, op: pg_info(1 pgs e65:8.26)
-1> 2014-07-25 18:28:15.644282 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.551111, event: all_read, op: pg_info(1 pgs e65:8.26)
0> 2014-07-25 18:28:15.644291 7f4f3c1d7700 5 -- op tracker -- , seq: 1264, time: 2014-07-25 18:28:15.644265, event: dispatched, op: pg_info(1 pgs e65:8.26)
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.2.log
--- end dump of recent events ---
ubuntu@ubuntuServer1:~$
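
(For readers without the attached core file, the raw frame addresses in the trace above can usually be resolved to source locations with addr2line, assuming the matching unstripped ceph-osd binary from the same 0.80 build:)

# Resolve the agent frames (10-12 in the trace) to function/source lines
addr2line -Cfe /usr/bin/ceph-osd 0x7ad25e 0x7e1a2a 0x7fb1d5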

Files

ceph-osd.0.log (10.4 KB) - log for crashed osd - Mallikarjun Biradar, 07/28/2014 08:15 AM
ceph-osd.2.log (9.14 KB) - log for crashed osd - Mallikarjun Biradar, 07/28/2014 08:15 AM
ceph-osd.3.log (78.9 KB) - log for crashed osd - Mallikarjun Biradar, 07/28/2014 08:15 AM
_usr_bin_ceph-osd.0.crash (11.8 MB) - Mallikarjun Biradar, 07/28/2014 10:46 PM
_usr_bin_ceph-osd.0.zip (8.18 MB) - Mallikarjun Biradar, 07/28/2014 10:46 PM

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #8982: cache pool osds crashing when data is evicting to underlying storage pool (Resolved, Sage Weil, 07/31/2014)

Actions #1

Updated by Mallikarjun Biradar over 9 years ago

Test configuration:
No. of OSD nodes: 3
No. of OSDs: 4
No. of monitors: 2
Kernel version: 3.13.0-24-generic
No. of gateway nodes: 1

Actions #2

Updated by Sage Weil over 9 years ago

  • Priority changed from High to Urgent
Actions #3

Updated by Greg Farnum over 9 years ago

I don't remember the details, but we were previously crashing with a 10-object limit anyway due to hit sets and such. This might just be that.
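
(Context for the hit-set remark above: the tiering agent consults per-pool hit sets when estimating object temperature, and they are configured on the cache pool. A minimal sketch with illustrative values, using the pool name from this report:)

# Illustrative hit-set configuration; the count/period values are
# examples, not the reporter's settings.
sudo ceph osd pool set cachepoolobj hit_set_type bloom
sudo ceph osd pool set cachepoolobj hit_set_count 8
sudo ceph osd pool set cachepoolobj hit_set_period 3600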

Actions #4

Updated by David Zafman over 9 years ago

Could not reproduce using vstart.sh on current master branch. I never saw a crash or bug report with that stack trace.

Actions #5

Updated by Yehuda Sadeh over 9 years ago

  • Project changed from rgw to Ceph
Actions #7

Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info

Can you attach the complete logs? All three OSDs claim to have hit an assert, but the assert message isn't in the log; it's usually further up.

Also, strangely, there is no assert() call in ReplicatedPG::agent_estimate_atime_temp()...

In any case, the complete OSD log would be helpful!
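
(To capture the assert message on a future reproduction, raising the OSD debug levels beforehand should make it land in the log; a sketch with illustrative levels:)

# Raise logging on all OSDs before re-running the workload;
# restart the OSDs or reset the levels afterwards.
sudo ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'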

Actions #8

Updated by Sage Weil over 9 years ago

  • Status changed from Need More Info to Duplicate

Oh, I see it now. This is a dup of #8982.
