Bug #23878
assert on pg upmap
% Done: 0%
Backport: luminous
Regression: No
Severity: 3 - minor
Description
I used the following script to test upmap:
./bin/init-ceph stop
killall ceph-mon ceph-osd
killall ceph-mon ceph-osd
OSD=9 MON=1 MGR=1 MDS=0 ../src/vstart.sh -X -n
./bin/ceph osd crush add-bucket test root
./bin/ceph osd crush add-bucket huangjun-1 host
./bin/ceph osd crush add-bucket huangjun-2 host
./bin/ceph osd crush add-bucket huangjun-3 host
./bin/ceph osd crush move huangjun-1 root=test
./bin/ceph osd crush move huangjun-2 root=test
./bin/ceph osd crush move huangjun-3 root=test
./bin/ceph osd crush add osd.0 1.0 host=huangjun-1
./bin/ceph osd crush add osd.1 1.0 host=huangjun-1
./bin/ceph osd crush add osd.2 1.0 host=huangjun-1
./bin/ceph osd crush add osd.3 1.0 host=huangjun-2
./bin/ceph osd crush add osd.4 1.0 host=huangjun-2
./bin/ceph osd crush add osd.5 1.0 host=huangjun-2
./bin/ceph osd crush add osd.7 1.0 host=huangjun-3
./bin/ceph osd crush add osd.6 1.0 host=huangjun-3
./bin/ceph osd erasure-code-profile set test k=4 m=2 crush-failure-domain=osd
./bin/ceph osd getcrushmap -o crush
./bin/crushtool -d crush -o crush.txt
echo "
rule test {
id 1
type erasure
min_size 1
max_size 10
step take huangjun-1
step chooseleaf indep 2 type osd
step emit
step take huangjun-2
step chooseleaf indep 2 type osd
step emit
step take huangjun-3
step chooseleaf indep 2 type osd
step emit
}
" >> crush.txt
./bin/crushtool -c crush.txt -o crush.new
./bin/ceph osd setcrushmap -i crush.new
./bin/ceph osd pool create test 256 256 erasure test test
max_deviation=0.01
max_pg=256
pool='test'
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh
rm -f result.sh
./bin/ceph osd crush unlink osd.2 huangjun-1
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh
The test crashed with:
*** Caught signal (Aborted) **
 in thread 7f9e999b0180 thread_name:osdmaptool

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (()+0x21321) [0x56420eafb321]
 2: (()+0xf5e0) [0x7f9e8f6755e0]
 3: (gsignal()+0x37) [0x7f9e8e0671f7]
 4: (abort()+0x148) [0x7f9e8e0688e8]
 5: (()+0x74f47) [0x7f9e8e0a6f47]
 6: (()+0x7c619) [0x7f9e8e0ae619]
 7: (std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > >, std::_Select1st<std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > >)+0x76) [0x7f9e90f7a4b6]
 8: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1041) [0x7f9e90f6a651]
 9: (main()+0x3925) [0x56420eaec385]
 10: (__libc_start_main()+0xf5) [0x7f9e8e053c05]
 11: (()+0x12fc0) [0x56420eaecfc0]
2018-04-26 12:32:04.556614 7f9e999b0180 -1 *** Caught signal (Aborted) ** in thread 7f9e999b0180 thread_name:osdmaptool
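Frame 7 of the backtrace is std::_Rb_tree::_M_erase_aux, reached from OSDMap::calc_pg_upmaps, i.e. the abort happens while erasing entries from the pg_upmap_items map. A common way to hit this kind of crash is erasing from an associative container through an iterator that the erase itself invalidates. The following Python sketch is only a hedged illustration of that general hazard (Python detects it at runtime where C++ aborts or corrupts memory); it is not Ceph's actual code, and the data is taken from the upmap items in this report:

```python
# Illustration only: the erase-while-iterating hazard suggested by the
# _M_erase_aux frame in the backtrace. NOT Ceph's actual implementation.
pg_upmap_items = {"1.1": [(4, 3)], "1.2": [(4, 5)], "1.10": [(0, 1)]}

def drop_stale_unsafe(items, gone_osd):
    """Unsafe: mutates the mapping while iterating over it."""
    try:
        for pg, mappings in items.items():
            if any(gone_osd in pair for pair in mappings):
                del items[pg]  # invalidates the live iterator
    except RuntimeError as e:
        return str(e)  # Python raises; C++ typically aborts instead
    return None

def drop_stale_safe(items, gone_osd):
    """Safe: collect the keys first, erase in a second pass."""
    stale = [pg for pg, mappings in items.items()
             if any(gone_osd in pair for pair in mappings)]
    for pg in stale:
        del items[pg]
```

With osd.4 "gone", the unsafe variant fails on the second iteration step, while the two-pass variant removes both stale entries cleanly.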
Related issues
History
#1 Updated by huang jun almost 6 years ago
After picking PR https://github.com/ceph/ceph/pull/21325, it works fine.
But I have some questions.
The upmap items are:
pg_upmap_items 1.1 [4,3]
pg_upmap_items 1.2 [4,5]
pg_upmap_items 1.10 [0,1]
pg_upmap_items 1.11 [4,3]
pg_upmap_items 1.14 [0,1]
pg_upmap_items 1.17 [0,1]
pg_upmap_items 1.1f [0,1]
pg_upmap_items 1.20 [0,1]
pg_upmap_items 1.22 [0,1]
pg_upmap_items 1.24 [0,1]
pg_upmap_items 1.29 [0,1]
pg_upmap_items 1.2c [0,1]
pg_upmap_items 1.31 [0,1]
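For readers unfamiliar with the dump format: each entry remaps one PG's placement, e.g. "pg_upmap_items 1.1 [4,3]" moves pg 1.1 from osd.4 to osd.3. A small, hedged Python parser for this dump (the function name and the assumed "from,to pair" format are mine, inferred from the output above):

```python
import re

def parse_upmap_items(dump):
    """Parse a 'pg_upmap_items <pgid> [from,to,...]' dump into a dict
    mapping pgid -> list of (from_osd, to_osd) remap pairs."""
    items = {}
    for m in re.finditer(r"pg_upmap_items (\S+) \[([\d,]+)\]", dump):
        nums = [int(x) for x in m.group(2).split(",")]
        items[m.group(1)] = list(zip(nums[0::2], nums[1::2]))
    return items

dump = "pg_upmap_items 1.1 [4,3] pg_upmap_items 1.2 [4,5] pg_upmap_items 1.10 [0,1]"
print(parse_upmap_items(dump))
# → {'1.1': [(4, 3)], '1.2': [(4, 5)], '1.10': [(0, 1)]}
```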
After I unlinked osd.3 from huangjun-2 with:
./bin/ceph osd crush unlink osd.3 huangjun-2
ceph osd df shows:
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 1   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256
 4   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 254
 5   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256
 6   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 7   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 0   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 1   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256
 2   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   0
 3   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   2
 4   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 254
 5   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256
 6   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 7   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256
 8   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   0
               TOTAL     449G   287G   162G  63.88
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
The osd tree is
[root@lab104 build]# ./bin/ceph osd tree
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-04-26 13:35:25.684329 7f74b5cbd700 -1 WARNING: all dangerous and experimental features are enabled.
2018-04-26 13:35:25.744078 7f74b5cbd700 -1 WARNING: all dangerous and experimental features are enabled.
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-5       6.00000 root test
-6       2.00000     host huangjun-1
 0   hdd 1.00000         osd.0            up  1.00000 1.00000
 1   hdd 1.00000         osd.1            up  1.00000 1.00000
-7       2.00000     host huangjun-2
 4   hdd 1.00000         osd.4            up  1.00000 1.00000
 5   hdd 1.00000         osd.5            up  1.00000 1.00000
-8       2.00000     host huangjun-3
 6   hdd 1.00000         osd.6            up  1.00000 1.00000
 7   hdd 1.00000         osd.7            up  1.00000 1.00000
-1       9.00000 root default
-2       9.00000     host lab104
 0   hdd 1.00000         osd.0            up  1.00000 1.00000
 1   hdd 1.00000         osd.1            up  1.00000 1.00000
 2   hdd 1.00000         osd.2            up  1.00000 1.00000
 3   hdd 1.00000         osd.3            up  1.00000 1.00000
 4   hdd 1.00000         osd.4            up  1.00000 1.00000
 5   hdd 1.00000         osd.5            up  1.00000 1.00000
 6   hdd 1.00000         osd.6            up  1.00000 1.00000
 7   hdd 1.00000         osd.7            up  1.00000 1.00000
 8   hdd 1.00000         osd.8            up  1.00000 1.00000
My question:
1. Why does osd.3 still have 2 PGs? Shouldn't we remove it from pg_upmap_items?
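The cleanup the question asks about can be sketched as follows: drop any pg_upmap_items entry that references an OSD no longer reachable from the pool's CRUSH rule. This is a hedged simplification (the set of valid OSDs is modeled as a plain Python set, and the function name mirrors but is not Ceph's clean_pg_upmaps()):

```python
def clean_pg_upmaps(pg_upmap_items, osds_in_rule):
    """Remove remap pairs that reference OSDs outside the rule's subtree;
    drop a PG's entry entirely if no valid pairs remain.
    Simplified sketch, not Ceph's actual clean_pg_upmaps()."""
    cleaned = {}
    for pg, mappings in pg_upmap_items.items():
        kept = [(f, t) for f, t in mappings
                if f in osds_in_rule and t in osds_in_rule]
        if kept:
            cleaned[pg] = kept
    return cleaned

# osd.3 was unlinked from host huangjun-2, so [4,3] entries should go away:
items = {"1.1": [(4, 3)], "1.2": [(4, 5)], "1.10": [(0, 1)]}
print(clean_pg_upmaps(items, {0, 1, 4, 5, 6, 7}))
# → {'1.2': [(4, 5)], '1.10': [(0, 1)]}
```

With stale entries like 1.1 → [4,3] removed, osd.3 would no longer hold any PGs from this pool.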
#2 Updated by huang jun almost 6 years ago
Then, if I run the pg-upmap operation again:
max_deviation=0.01
max_pg=256
pool='test'
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh
I get the same coredump as in http://tracker.ceph.com/issues/23877:
2018-04-26 14:41:44.738374 7f7283bfd180 10 clean_pg_upmaps
2018-04-26 14:41:44.739947 7f7283bfd180 20 osd.0 weight 0.333333 pgs 172
2018-04-26 14:41:44.739966 7f7283bfd180 20 osd.1 weight 0.333333 pgs 169
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7f7283bfd180 time 2018-04-26 14:41:44.740048
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 4078: FAILED assert(target > 0)
2018-04-26 14:41:44.739970 7f7283bfd180 20 osd.2 weight 0.333333 pgs 171
2018-04-26 14:41:44.739972 7f7283bfd180 20 osd.4 weight 0.5 pgs 254
2018-04-26 14:41:44.739974 7f7283bfd180 20 osd.5 weight 0.5 pgs 256
2018-04-26 14:41:44.739980 7f7283bfd180 20 osd.6 weight 0.5 pgs 256
2018-04-26 14:41:44.739982 7f7283bfd180 20 osd.7 weight 0.5 pgs 256
2018-04-26 14:41:44.739985 7f7283bfd180 10 osd_weight_total 3
2018-04-26 14:41:44.739988 7f7283bfd180 10 pgs_per_weight 512
2018-04-26 14:41:44.739998 7f7283bfd180 20 osd.0 pgs 172 target 170.667 deviation 1.33333
2018-04-26 14:41:44.740006 7f7283bfd180 20 osd.1 pgs 169 target 170.667 deviation -1.66667
2018-04-26 14:41:44.740013 7f7283bfd180 20 osd.2 pgs 171 target 170.667 deviation 0.333328
2018-04-26 14:41:44.740018 7f7283bfd180 20 osd.3 pgs 2 target 0 deviation 2
2018-04-26 14:41:44.740022 7f7283bfd180 20 osd.4 pgs 254 target 256 deviation -2
2018-04-26 14:41:44.740026 7f7283bfd180 20 osd.5 pgs 256 target 256 deviation 0
2018-04-26 14:41:44.740030 7f7283bfd180 20 osd.6 pgs 256 target 256 deviation 0
2018-04-26 14:41:44.740033 7f7283bfd180 20 osd.7 pgs 256 target 256 deviation 0
2018-04-26 14:41:44.740041 7f7283bfd180 10 total_deviation 7.33333 overfull 0,3 underfull [4,1]
 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f727afe9d50]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1c41) [0x7f727b1b8251]
 3: (main()+0x3925) [0x5594d2896385]
 4: (__libc_start_main()+0xf5) [0x7f72782a0c05]
 5: (()+0x12fc0) [0x5594d2896fc0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-04-26 14:41:44.740903 7f7283bfd180 -1 /root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7f7283bfd180 time 2018-04-26 14:41:44.740048
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 4078: FAILED assert(target > 0)
 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f727afe9d50]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1c41) [0x7f727b1b8251]
 3: (main()+0x3925) [0x5594d2896385]
 4: (__libc_start_main()+0xf5) [0x7f72782a0c05]
 5: (()+0x12fc0) [0x5594d2896fc0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
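The log makes the failure mode visible: each OSD's target PG count is its CRUSH weight times pgs_per_weight, and after the unlink osd.3's weight under root test is 0 while two PGs still map to it through stale upmap entries, so target == 0 and assert(target > 0) fires. As a hedged re-derivation of the numbers in the log (a simplification of calc_pg_upmaps' bookkeeping, not the actual C++ code):

```python
# Per-OSD CRUSH weights under root test (normalized as in the log) and
# PG counts, copied from the debug output above. osd.3 is listed with
# 2 PGs but has weight 0 after being unlinked.
weights = {0: 1/3, 1: 1/3, 2: 1/3, 3: 0.0, 4: 0.5, 5: 0.5, 6: 0.5, 7: 0.5}
pgs     = {0: 172, 1: 169, 2: 171, 3: 2,   4: 254, 5: 256, 6: 256, 7: 256}

osd_weight_total = sum(weights.values())               # 3, matches the log
pgs_per_weight = sum(pgs.values()) / osd_weight_total  # 512, matches the log

for osd, w in weights.items():
    target = w * pgs_per_weight          # osd.0 -> 170.667, osd.4 -> 256, ...
    deviation = pgs[osd] - target
    # osd.3: weight 0 gives target == 0, so "assert(target > 0)" aborts
    # even though 2 PGs are still pinned to it by stale pg_upmap_items.
```

This is why cleaning stale upmap entries (as the fix does) makes the assert unreachable for unlinked OSDs.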
#3 Updated by Kefu Chai almost 6 years ago
- Duplicated by Bug #23877: osd/OSDMap.cc: assert(target > 0) added
#5 Updated by Sage Weil almost 6 years ago
- Status changed from New to 12
- Priority changed from Normal to High
- Backport set to luminous
#6 Updated by Sage Weil almost 6 years ago
- Status changed from 12 to Fix Under Review
#7 Updated by huang jun almost 6 years ago
PR #21670 passes the tests that previously failed in my local cluster; it still needs qa.
#8 Updated by Sage Weil almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
#9 Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #23925: luminous: assert on pg upmap added
#10 Updated by Greg Farnum almost 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSDMap)
#11 Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved