Bug #23878

assert on pg upmap

Added by huang jun almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I used the following script to test upmap:

./bin/init-ceph stop
killall ceph-mon ceph-osd
killall ceph-mon ceph-osd
OSD=9 MON=1 MGR=1 MDS=0 ../src/vstart.sh -X -n

./bin/ceph osd crush add-bucket test root
./bin/ceph osd crush add-bucket huangjun-1 host 
./bin/ceph osd crush add-bucket huangjun-2 host 
./bin/ceph osd crush add-bucket huangjun-3 host

./bin/ceph osd crush move huangjun-1 root=test
./bin/ceph osd crush move huangjun-2 root=test
./bin/ceph osd crush move huangjun-3 root=test

./bin/ceph osd crush add osd.0 1.0 host=huangjun-1
./bin/ceph osd crush add osd.1 1.0 host=huangjun-1
./bin/ceph osd crush add osd.2 1.0 host=huangjun-1

./bin/ceph osd crush add osd.3 1.0 host=huangjun-2
./bin/ceph osd crush add osd.4 1.0 host=huangjun-2
./bin/ceph osd crush add osd.5 1.0 host=huangjun-2

./bin/ceph osd crush add osd.7 1.0 host=huangjun-3
./bin/ceph osd crush add osd.6 1.0 host=huangjun-3

./bin/ceph osd erasure-code-profile set test k=4 m=2 crush-failure-domain=osd

./bin/ceph osd getcrushmap -o crush
./bin/crushtool -d crush -o crush.txt

echo " 
rule test {
        id 1
        type erasure
        min_size 1
        max_size 10
        step take huangjun-1
        step chooseleaf indep 2 type osd
        step emit
        step take huangjun-2
        step chooseleaf indep 2 type osd
        step emit
        step take huangjun-3
        step chooseleaf indep 2 type osd
        step emit
}
" >> crush.txt
./bin/crushtool -c crush.txt -o crush.new
./bin/ceph osd setcrushmap -i crush.new
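
Note that the profile's k=4 m=2 needs 6 shards per PG, which matches the rule's three chooseleaf indep 2 steps (two OSDs per host). As a sanity check (not part of the original reproducer), the compiled rule can be dry-run offline with crushtool before injecting it:

# map each PG through rule 1 with 6 replicas and print the mappings
./bin/crushtool -i crush.new --test --rule 1 --num-rep 6 --show-mappings
# print only the mappings that came back short (fewer than 6 OSDs)
./bin/crushtool -i crush.new --test --rule 1 --num-rep 6 --show-bad-mappings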

./bin/ceph osd pool create test 256 256 erasure test test

max_deviation=0.01
max_pg=256
pool='test'
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh

rm -f result.sh
./bin/ceph osd crush unlink osd.2 huangjun-1
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh
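
For reference, the result.sh that osdmaptool writes is just a list of CLI commands. A hypothetical excerpt (the PG ids and OSD pairs here are for illustration only):

# each line remaps one shard of a PG from the first OSD to the second
ceph osd pg-upmap-items 1.1 4 3
ceph osd pg-upmap-items 1.2 4 5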

The test crashed with:

*** Caught signal (Aborted) **
 in thread 7f9e999b0180 thread_name:osdmaptool
 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (()+0x21321) [0x56420eafb321]
 2: (()+0xf5e0) [0x7f9e8f6755e0]
 3: (gsignal()+0x37) [0x7f9e8e0671f7]
 4: (abort()+0x148) [0x7f9e8e0688e8]
 5: (()+0x74f47) [0x7f9e8e0a6f47]
 6: (()+0x7c619) [0x7f9e8e0ae619]
 7: (std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > >, std::_Select1st<std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<pg_t const, std::vector<std::pair<int, int>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<int, int> > > > >)+0x76) [0x7f9e90f7a4b6]
 8: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1041) [0x7f9e90f6a651]
 9: (main()+0x3925) [0x56420eaec385]
 10: (__libc_start_main()+0xf5) [0x7f9e8e053c05]
 11: (()+0x12fc0) [0x56420eaecfc0]
2018-04-26 12:32:04.556614 7f9e999b0180 -1 *** Caught signal (Aborted) **
 in thread 7f9e999b0180 thread_name:osdmaptool


Related issues

Duplicated by Ceph - Bug #23877: osd/OSDMap.cc: assert(target > 0) Duplicate 04/26/2018
Copied to RADOS - Backport #23925: luminous: assert on pg upmap Resolved

History

#1 Updated by huang jun almost 6 years ago

After picking up PR https://github.com/ceph/ceph/pull/21325,
it works fine.
But I have a question. The upmap items are (each entry remaps a PG shard from the first OSD to the second):

pg_upmap_items 1.1 [4,3]
pg_upmap_items 1.2 [4,5]
pg_upmap_items 1.10 [0,1]
pg_upmap_items 1.11 [4,3]
pg_upmap_items 1.14 [0,1]
pg_upmap_items 1.17 [0,1]
pg_upmap_items 1.1f [0,1]
pg_upmap_items 1.20 [0,1]
pg_upmap_items 1.22 [0,1]
pg_upmap_items 1.24 [0,1]
pg_upmap_items 1.29 [0,1]
pg_upmap_items 1.2c [0,1]
pg_upmap_items 1.31 [0,1]

After I unlink osd.3 from huangjun-2 (unlink removes it from that host only; it stays under root default) with:
./bin/ceph osd crush unlink osd.3 huangjun-2

The output of ceph osd df shows:
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS 
 0   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 1   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256 
 4   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 254 
 5   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256 
 6   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 7   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 0   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 1   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256 
 2   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   0 
 3   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   2 
 4   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 254 
 5   hdd 1.00000  1.00000 51175M 32693M 18481M 63.88 1.00 256 
 6   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 7   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00 256 
 8   hdd 1.00000  1.00000 51175M 32692M 18482M 63.88 1.00   0 
                    TOTAL   449G   287G   162G 63.88          
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

The OSD tree is:

[root@lab104 build]# ./bin/ceph osd tree
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-04-26 13:35:25.684329 7f74b5cbd700 -1 WARNING: all dangerous and experimental features are enabled.
2018-04-26 13:35:25.744078 7f74b5cbd700 -1 WARNING: all dangerous and experimental features are enabled.
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF 
-5       6.00000 root test                                   
-6       2.00000     host huangjun-1                         
 0   hdd 1.00000         osd.0           up  1.00000 1.00000 
 1   hdd 1.00000         osd.1           up  1.00000 1.00000 
-7       2.00000     host huangjun-2                         
 4   hdd 1.00000         osd.4           up  1.00000 1.00000 
 5   hdd 1.00000         osd.5           up  1.00000 1.00000 
-8       2.00000     host huangjun-3                         
 6   hdd 1.00000         osd.6           up  1.00000 1.00000 
 7   hdd 1.00000         osd.7           up  1.00000 1.00000 
-1       9.00000 root default                                
-2       9.00000     host lab104                             
 0   hdd 1.00000         osd.0           up  1.00000 1.00000 
 1   hdd 1.00000         osd.1           up  1.00000 1.00000 
 2   hdd 1.00000         osd.2           up  1.00000 1.00000 
 3   hdd 1.00000         osd.3           up  1.00000 1.00000 
 4   hdd 1.00000         osd.4           up  1.00000 1.00000 
 5   hdd 1.00000         osd.5           up  1.00000 1.00000 
 6   hdd 1.00000         osd.6           up  1.00000 1.00000 
 7   hdd 1.00000         osd.7           up  1.00000 1.00000 
 8   hdd 1.00000         osd.8           up  1.00000 1.00000

My question:
1. Why does osd.3 still have 2 PGs? Shouldn't it be removed from pg_upmap_items?
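
For what it's worth, the two leftover PGs line up with the two entries above that remap onto osd.3 (1.1 and 1.11, both [4,3]). Until a fix lands, a minimal manual workaround sketch (assuming those are the only stale entries):

# list the upmap exceptions currently applied
./bin/ceph osd dump | grep pg_upmap_items
# drop the stale entries so osd.3 stops receiving shards
./bin/ceph osd rm-pg-upmap-items 1.1
./bin/ceph osd rm-pg-upmap-items 1.11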

#2 Updated by huang jun almost 6 years ago

And then, if I run the pg-upmap operation again:

max_deviation=0.01
max_pg=256
pool='test'
./bin/ceph osd getmap -o om
./bin/osdmaptool om --upmap-deviation $max_deviation --upmap-max $max_pg --upmap-pool $pool --upmap result.sh
sh result.sh

there is the same coredump as in http://tracker.ceph.com/issues/23877:
2018-04-26 14:41:44.738374 7f7283bfd180 10 clean_pg_upmaps
2018-04-26 14:41:44.739947 7f7283bfd180 20  osd.0 weight 0.333333 pgs 172
2018-04-26 14:41:44.739966 7f7283bfd180 20  osd.1 weight 0.333333 pgs 169
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7f7283bfd180 time 2018-04-26 14:41
:44.740048
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 4078: FAILED assert(target > 0)
2018-04-26 14:41:44.739970 7f7283bfd180 20  osd.2 weight 0.333333 pgs 171
2018-04-26 14:41:44.739972 7f7283bfd180 20  osd.4 weight 0.5 pgs 254
2018-04-26 14:41:44.739974 7f7283bfd180 20  osd.5 weight 0.5 pgs 256
2018-04-26 14:41:44.739980 7f7283bfd180 20  osd.6 weight 0.5 pgs 256
2018-04-26 14:41:44.739982 7f7283bfd180 20  osd.7 weight 0.5 pgs 256
2018-04-26 14:41:44.739985 7f7283bfd180 10  osd_weight_total 3
2018-04-26 14:41:44.739988 7f7283bfd180 10  pgs_per_weight 512
2018-04-26 14:41:44.739998 7f7283bfd180 20  osd.0       pgs 172 target 170.667  deviation 1.33333
2018-04-26 14:41:44.740006 7f7283bfd180 20  osd.1       pgs 169 target 170.667  deviation -1.66667
2018-04-26 14:41:44.740013 7f7283bfd180 20  osd.2       pgs 171 target 170.667  deviation 0.333328
2018-04-26 14:41:44.740018 7f7283bfd180 20  osd.3       pgs 2   target 0        deviation 2
2018-04-26 14:41:44.740022 7f7283bfd180 20  osd.4       pgs 254 target 256      deviation -2
2018-04-26 14:41:44.740026 7f7283bfd180 20  osd.5       pgs 256 target 256      deviation 0
2018-04-26 14:41:44.740030 7f7283bfd180 20  osd.6       pgs 256 target 256      deviation 0
2018-04-26 14:41:44.740033 7f7283bfd180 20  osd.7       pgs 256 target 256      deviation 0
2018-04-26 14:41:44.740041 7f7283bfd180 10  total_deviation 7.33333 overfull 0,3 underfull [4,1]
 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f727afe9d50]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1c41) [0x7f727b1b8251]
 3: (main()+0x3925) [0x5594d2896385]
 4: (__libc_start_main()+0xf5) [0x7f72782a0c05]
 5: (()+0x12fc0) [0x5594d2896fc0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-04-26 14:41:44.740903 7f7283bfd180 -1 /root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)
' thread 7f7283bfd180 time 2018-04-26 14:41:44.740048
/root/rpmbuild/BUILD/ceph-12.2.5/src/osd/OSDMap.cc: 4078: FAILED assert(target > 0)

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f727afe9d50]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x1c41) [0x7f727b1b8251]
 3: (main()+0x3925) [0x5594d2896385]
 4: (__libc_start_main()+0xf5) [0x7f72782a0c05]
 5: (()+0x12fc0) [0x5594d2896fc0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
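
The numbers in the log make the failure mechanical: after the unlink, osd.3 has CRUSH weight 0 but still holds 2 PGs through the stale upmap entries, so its computed target is 0 and assert(target > 0) fires. A back-of-the-envelope check of the logged values:

# pgs_per_weight = total PG shards / osd_weight_total
echo "scale=3; (256 * 6) / 3" | bc      # 512, matches the log
echo "scale=3; 0.333333 * 512" | bc     # osd.0 target = 170.666, ~170.667 in the log
echo "0 * 512" | bc                     # osd.3 target = 0 -> FAILED assert(target > 0)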

#3 Updated by Kefu Chai almost 6 years ago

  • Duplicated by Bug #23877: osd/OSDMap.cc: assert(target > 0) added

#4 Updated by xie xingguo almost 6 years ago

  • Assignee set to xie xingguo

I’ll prepare a patch soon

#5 Updated by Sage Weil almost 6 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High
  • Backport set to luminous

#6 Updated by Sage Weil almost 6 years ago

  • Status changed from 12 to Fix Under Review

#7 Updated by huang jun almost 6 years ago

This PR #21670 passed the tests that failed before on my local cluster; it needs QA.

#8 Updated by Sage Weil almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport

#9 Updated by Nathan Cutler almost 6 years ago

#10 Updated by Greg Farnum almost 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSDMap)

#11 Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved
