Bug #14073
closed
osd: hammer: fail to start due to stray pglog object after firefly->hammer upgrade
Description
We upgraded a large object storage cluster from firefly (0.80.10) to hammer (0.94.5). We found that there is some probability that an OSD fails to upgrade info_struct_v (https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L3041). This case can be observed in the attached log:
2015-12-14 10:08:37.867333 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.ce1_head/ce1//head//15 = 0 (size 0)
2015-12-14 10:08:37.915473 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.ce5_head/ce5//head//15 = 0 (size 0)
2015-12-14 10:08:37.961296 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.d22_head/d22//head//15 = 0 (size 0)
2015-12-14 10:08:38.005547 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.df3_head/df3//head//15 = 0 (size 0)
2015-12-14 10:08:38.049795 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.df7_head/df7//head//15 = 0 (size 0)
2015-12-14 10:08:38.093949 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.e6f_head/e6f//head//15 = 0 (size 0)
2015-12-14 10:08:38.139775 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat 15.ed8_head/ed8//head//15 = 0 (size 0)
2015-12-14 10:08:38.186538 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) stat meta/bd6a2c2d/pglog_15.ee3/0//-1 = -2
The last line shows that the OSD failed to get the v8 structure, so it tried to read the legacy meta oid, which does not actually exist (stat returned -2). Filtering the log for omap_get_values gives:
2015-12-14 10:08:38.005426 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.df3_head/df3//head//15
2015-12-14 10:08:38.049560 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.df7_head/df7//head//15
2015-12-14 10:08:38.049682 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.df7_head/df7//head//15
2015-12-14 10:08:38.093687 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.e6f_head/e6f//head//15
2015-12-14 10:08:38.093813 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.e6f_head/e6f//head//15
2015-12-14 10:08:38.139493 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.ed8_head/ed8//head//15
2015-12-14 10:08:38.139641 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.ed8_head/ed8//head//15
2015-12-14 10:08:38.186081 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.ee3_head/ee3//head//15
2015-12-14 10:08:38.186159 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values meta/16ef7597/infos/head//-1
2015-12-14 10:08:38.186384 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values 15.ee3_head/ee3//head//15
2015-12-14 10:08:38.186422 7f4381ddb800 15 filestore(/var/lib/ceph/osd/ceph-45) omap_get_values meta/16ef7597/infos/head//-1
The last line confirms this guess: for 15.ee3 the OSD goes on to read the legacy infos object in the meta collection.
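For anyone checking their own OSD logs for the same symptom, here is a minimal sketch that scans filestore debug output for stat calls with a negative return code. It assumes the line layout seen in the excerpts above; the regular expression and the function name failed_stats are illustrative and not part of Ceph.

```python
import re

# Matches filestore "stat" lines in the layout shown in the log excerpts:
#   <date> <time> <thread> <level> filestore(<path>) stat <object> = <rc> [(size N)]
STAT_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \S+ \d+ filestore\((?P<path>[^)]*)\) "
    r"stat (?P<obj>\S+) = (?P<rc>-?\d+)"
)


def failed_stats(lines):
    """Yield (timestamp, object, return_code) for stat calls that returned < 0."""
    for line in lines:
        m = STAT_RE.match(line.strip())
        if m and int(m.group("rc")) < 0:
            yield m.group("ts"), m.group("obj"), int(m.group("rc"))


# Two lines copied from the log in this report.
sample = [
    "2015-12-14 10:08:38.139775 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) "
    "stat 15.ed8_head/ed8//head//15 = 0 (size 0)",
    "2015-12-14 10:08:38.186538 7f4381ddb800 10 filestore(/var/lib/ceph/osd/ceph-45) "
    "stat meta/bd6a2c2d/pglog_15.ee3/0//-1 = -2",
]

for ts, obj, rc in failed_stats(sample):
    print(ts, obj, rc)
```

To scan a real log, pass an open file object instead of the sample list, e.g. failed_stats(open(path)).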
So is it possible that, when an OSD is upgraded from firefly to hammer, the legacy oid is deleted but 'info_struct_v' is not successfully updated? Or is something else going on? I haven't dived into this yet; I may try later.
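To make the suspected state concrete, here is a toy model of the question above. It is not Ceph code: plain dicts stand in for the object store, and VERSION_KEY, pick_layout and the store layout are hypothetical names chosen for illustration. Whether a real OSD can actually end up in the third state is exactly what is unknown here.

```python
import errno

# Hypothetical placeholder for the omap key that records the on-disk version.
VERSION_KEY = "struct_version"


def pick_layout(store, pgid):
    """Return which layout a reader would use for this PG in the toy model.

    store["pgmeta"] maps pgid -> dict of omap keys on the per-PG meta object.
    store["meta"] is a set of legacy object names in the meta collection.
    """
    omap = store["pgmeta"].get(pgid, {})
    if VERSION_KEY in omap:
        return "v8"  # new layout found, no fallback needed
    legacy_log = "pglog_" + pgid
    if legacy_log in store["meta"]:
        return "legacy"  # not upgraded yet, legacy object still present
    # Neither the version key nor the legacy object exists.
    raise FileNotFoundError(errno.ENOENT, "legacy pglog object missing", legacy_log)


upgraded = {"pgmeta": {"15.ee3": {VERSION_KEY: 8}}, "meta": set()}
not_upgraded = {"pgmeta": {"15.ee3": {}}, "meta": {"pglog_15.ee3"}}
suspected = {"pgmeta": {"15.ee3": {}}, "meta": set()}

print(pick_layout(upgraded, "15.ee3"))
print(pick_layout(not_upgraded, "15.ee3"))
try:
    pick_layout(suspected, "15.ee3")
except FileNotFoundError as e:
    print("lookup failed with errno", e.errno)
```

In this model the third store corresponds to the log above: the version lookup finds nothing, the fallback is taken, and the stat on the legacy pglog object fails.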
Updated by Samuel Just over 8 years ago
- Priority changed from Normal to Urgent
Those should be updated atomically. Please reproduce with filestore, osd, and ms debugging:
debug filestore = 20
debug ms = 1
debug osd = 20
Updated by Sage Weil about 8 years ago
- Status changed from New to Can't reproduce