Bug #20848
upgrading from jewel to luminous - mgr create throws EACCES: access denied
Status: Closed
Description
1) Have a 4-node cluster with 3 mons and 3 OSD nodes, preinstalled with the jewel release.
2) Then upgrade the nodes one by one to luminous using 'ceph-deploy install --release=luminous nodename'.
3) The cluster was initially in a healthy state; after the first upgrade, I could see the mon wasn't up even though the daemon was fine:
[ubuntu@vpm161 ~]$ sudo ceph -s
    cluster 76b054f1-989f-4dab-983b-6cbe87eb5c2f
     health HEALTH_WARN
            1 mons down, quorum 0,1 vpm005,vpm089
     monmap e1: 3 mons at {vpm005=172.21.2.5:6789/0,vpm089=172.21.2.89:6789/0,vpm161=172.21.2.161:6789/0}
            election epoch 994, quorum 0,1 vpm005,vpm089
      fsmap e5: 1/1/1 up {0=vpm089=up:active}
     osdmap e46: 9 osds: 9 up, 9 in
            flags sortbitwise,require_jewel_osds
      pgmap v129: 100 pgs, 3 pools, 2068 bytes data, 20 objects
            306 MB used, 1753 GB / 1754 GB avail
                 100 active+clean
After upgrading all the nodes, I went ahead to create a mgr on one of the mon nodes, but it fails with the error below. Is this something that we can handle when coming from a jewel release?
[ubuntu@vpm089 ceph-deploy]$ ./ceph-deploy mgr create vpm161
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.38): ./ceph-deploy mgr create vpm161
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] mgr : [('vpm161', 'vpm161')]
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x159efc8>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function mgr at 0x152d410>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.mgr][DEBUG ] Deploying mgr, cluster ceph hosts vpm161:vpm161
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connection detected need for sudo
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connected to host: vpm161
[vpm161][DEBUG ] detect platform information from remote host
[vpm161][DEBUG ] detect machine type
[ceph_deploy.mgr][INFO ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.mgr][DEBUG ] remote host will use systemd
[ceph_deploy.mgr][DEBUG ] deploying mgr bootstrap to vpm161
[vpm161][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[vpm161][WARNIN] mgr keyring does not exist yet, creating one
[vpm161][DEBUG ] create a keyring file
[vpm161][DEBUG ] create path if it doesn't exist
[vpm161][INFO ] Running command: sudo ceph --cluster ceph --name client.bootstrap-mgr --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring auth get-or-create mgr.vpm161 mon allow profile mgr osd allow * mds allow * -o /var/lib/ceph/mgr/ceph-vpm161/keyring
[vpm161][ERROR ] Error EACCES: access denied
[vpm161][ERROR ] exit code from command was: 13
[ceph_deploy.mgr][ERROR ] could not create mgr
[ceph_deploy][ERROR ] GenericError: Failed to create 1 MGRs
The current state is that the require_jewel_osds flag is still set:
[ubuntu@vpm161 ~]$ sudo ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.71263 root default
-2 0.57088     host vpm123
 0 0.19029         osd.0        up  1.00000          1.00000
 1 0.19029         osd.1        up  1.00000          1.00000
 2 0.19029         osd.2        up  1.00000          1.00000
-3 0.57088     host vpm089
 3 0.19029         osd.3      down        0          1.00000
 4 0.19029         osd.4      down        0          1.00000
 5 0.19029         osd.5      down        0          1.00000
-4 0.57088     host vpm005
 6 0.19029         osd.6      down        0          1.00000
 7 0.19029         osd.7      down        0          1.00000
 8 0.19029         osd.8      down        0          1.00000
[ubuntu@vpm161 ~]$ sudo ceph -s
    cluster 76b054f1-989f-4dab-983b-6cbe87eb5c2f
     health HEALTH_ERR
            66 pgs are stuck inactive for more than 300 seconds
            100 pgs degraded
            100 pgs stuck degraded
            66 pgs stuck inactive
            100 pgs stuck unclean
            100 pgs stuck undersized
            100 pgs undersized
            3 requests are blocked > 32 sec
            recovery 36/60 objects degraded (60.000%)
            recovery 4/60 objects misplaced (6.667%)
            mds cluster is degraded
            1 mons down, quorum 0,1 vpm005,vpm089
     monmap e1: 3 mons at {vpm005=172.21.2.5:6789/0,vpm089=172.21.2.89:6789/0,vpm161=172.21.2.161:6789/0}
            election epoch 1562, quorum 0,1 vpm005,vpm089
      fsmap e8: 1/1/1 up {0=vpm089=up:replay}
     osdmap e54: 9 osds: 3 up, 3 in; 100 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v152: 100 pgs, 3 pools, 2068 bytes data, 20 objects
            104 MB used, 584 GB / 584 GB avail
            36/60 objects degraded (60.000%)
            4/60 objects misplaced (6.667%)
                  66 undersized+degraded+peered
                  34 active+undersized+degraded
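As an aside, the down OSDs in a tree like the one above can be spotted mechanically. A minimal sketch, assuming the JSON form of the same command (`sudo ceph osd tree -f json`); the `down_osds` helper is hypothetical, and the sample mirrors the state reported above:

```python
import json

# Hypothetical helper (not from this thread): list the OSDs reported "down"
# in the JSON output of "ceph osd tree -f json". That output is a flat
# "nodes" array in which OSD entries carry "type": "osd" and a "status".
def down_osds(tree_json):
    tree = json.loads(tree_json)
    return [n["name"] for n in tree["nodes"]
            if n.get("type") == "osd" and n.get("status") == "down"]

# Sample mirroring the osd tree above (abbreviated to a few entries).
sample = json.dumps({"nodes": [
    {"id": -1, "name": "default", "type": "root"},
    {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
    {"id": 3, "name": "osd.3", "type": "osd", "status": "down"},
    {"id": 6, "name": "osd.6", "type": "osd", "status": "down"},
]})
print(down_osds(sample))  # ['osd.3', 'osd.6']
```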
Updated by Vasu Kulkarni almost 7 years ago
- Project changed from Ceph to mgr
- Category set to ceph-mgr
Updated by Vasu Kulkarni almost 7 years ago
- Assignee set to John Spray
Hi John,
I am trying to add an upgrade test using ceph-deploy and would like to know if this requires a fix in the ceph-mgr bootstrap or something else.
The bootstrap key on vpm161 seems to be good. A few people in the mail thread mentioned a busted mgr key when upgrading from jewel, but I don't think that is the issue here:
[ubuntu@vpm161 ~]$ sudo cat /var/lib/ceph/bootstrap-mgr/ceph.keyring
[client.bootstrap-mgr]
	key = AQATiHpZpZbpGBAAmYM6GstmzftcvrHxfcYMBA==
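The keyring file above is a small INI-style file, so its key is easy to pull out programmatically. A minimal sketch (the `keyring_key` helper is hypothetical; the sample text is copied from the output above):

```python
# Hypothetical helper: extract the "key = ..." value for one entity from a
# Ceph keyring file, which uses an INI-like "[entity]" section layout.
def keyring_key(text, entity):
    section = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == entity:
            name, sep, value = line.partition("=")
            if sep and name.strip() == "key":
                return value.strip()
    return None

sample = """[client.bootstrap-mgr]
\tkey = AQATiHpZpZbpGBAAmYM6GstmzftcvrHxfcYMBA==
"""
print(keyring_key(sample, "client.bootstrap-mgr"))
```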
Updated by John Spray almost 7 years ago
Could you check the exact version locally on all the monitors with "ceph daemon mon.<id> version"?
Assuming they are all indeed running the same latest luminous, then look in the mon logs to see if there is more detail about why the mon is rejecting the key creation with EACCES.
Updated by Vasu Kulkarni almost 7 years ago
Two of the mons were still on jewel, and that seems to be an issue with systemd even though the packages were at 12.1.1 (I will raise a separate issue for systemd):
[ubuntu@vpm089 ~]$ sudo ceph daemon mon.vpm089 version
{"version":"10.2.9"}
[ubuntu@vpm089 ~]$ rpm -qa | grep ceph
libcephfs2-12.1.1-0.el7.x86_64
ceph-mon-12.1.1-0.el7.x86_64
iozone-3.424-2_ceph.el7.centos.x86_64
ceph-selinux-12.1.1-0.el7.x86_64
ceph-test-12.1.1-0.el7.x86_64
ceph-release-1-1.el7.noarch
python-cephfs-12.1.1-0.el7.x86_64
ceph-mds-12.1.1-0.el7.x86_64
ceph-radosgw-12.1.1-0.el7.x86_64
ceph-common-12.1.1-0.el7.x86_64
ceph-osd-12.1.1-0.el7.x86_64
ceph-mgr-12.1.1-0.el7.x86_64
ceph-base-12.1.1-0.el7.x86_64
ceph-12.1.1-0.el7.x86_64
mod_fastcgi-2.4.7-1.ceph.el7.centos.x86_64
[ubuntu@vpm005 ~]$ sudo ceph daemon mon.vpm005 version
{"version":"10.2.9"}
[ubuntu@vpm161 ~]$ sudo ceph daemon mon.vpm161 version
{"version":"12.1.1","release":"luminous","release_type":"rc"}
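This kind of mixed-version quorum could be caught mechanically. A minimal sketch, assuming the JSON that `ceph daemon mon.<id> version` prints on each node; the `mixed_versions` helper and the collection step are hypothetical, and the sample outputs are copied from the report above:

```python
import json

# Hypothetical check: given the JSON printed by "ceph daemon mon.<id> version"
# on each monitor, return the per-mon versions when the quorum is mixed,
# or None when every mon reports the same version.
def mixed_versions(outputs):
    versions = {mon: json.loads(out)["version"] for mon, out in outputs.items()}
    return versions if len(set(versions.values())) > 1 else None

# Sample outputs copied from the report above.
sample = {
    "vpm005": '{"version":"10.2.9"}',
    "vpm089": '{"version":"10.2.9"}',
    "vpm161": '{"version":"12.1.1","release":"luminous","release_type":"rc"}',
}
print(mixed_versions(sample))  # {'vpm005': '10.2.9', 'vpm089': '10.2.9', 'vpm161': '12.1.1'}
```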
After restarting both mons and verifying that they were on 12.1.1 and all mons were up, I reissued the mgr create, but this time I got a crash and all 3 mons crashed:
[ubuntu@vpm005 ~]$ sudo ceph -s
  cluster:
    id:     76b054f1-989f-4dab-983b-6cbe87eb5c2f
    health: HEALTH_ERR
            66 pgs are stuck inactive for more than 60 seconds
            100 pgs degraded
            100 pgs stuck degraded
            66 pgs stuck inactive
            100 pgs stuck unclean
            100 pgs stuck undersized
            100 pgs undersized
            3 requests are blocked > 32 sec
            recovery 36/60 objects degraded (60.000%)
            recovery 4/60 objects misplaced (6.667%)
            mds cluster is degraded
  services:
    mon: 3 daemons, quorum vpm005,vpm089,vpm161
    mgr: no daemons active
    mds: 1/1/1 up {0=vpm089=up:replay}
    osd: 9 osds: 3 up, 3 in; 100 remapped pgs
  data:
    pools:   3 pools, 100 pgs
    objects: 20 objects, 2068 bytes
    usage:   104 MB used, 584 GB / 584 GB avail
    pgs:     66.000% pgs not active
             36/60 objects degraded (60.000%)
             4/60 objects misplaced (6.667%)
             66 undersized+degraded+peered
             34 active+undersized+degraded
[ubuntu@vpm089 ceph-deploy]$ ./ceph-deploy mgr create vpm161
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.38): ./ceph-deploy mgr create vpm161
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] mgr : [('vpm161', 'vpm161')]
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x16d9fc8>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function mgr at 0x1668410>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.mgr][DEBUG ] Deploying mgr, cluster ceph hosts vpm161:vpm161
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connection detected need for sudo
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connected to host: vpm161
[vpm161][DEBUG ] detect platform information from remote host
[vpm161][DEBUG ] detect machine type
[ceph_deploy.mgr][INFO ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.mgr][DEBUG ] remote host will use systemd
[ceph_deploy.mgr][DEBUG ] deploying mgr bootstrap to vpm161
[vpm161][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[vpm161][DEBUG ] create path if it doesn't exist
[vpm161][INFO ] Running command: sudo ceph --cluster ceph --name client.bootstrap-mgr --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring auth get-or-create mgr.vpm161 mon allow profile mgr osd allow * mds allow * -o /var/lib/ceph/mgr/ceph-vpm161/keyring
[vpm161][WARNIN] No data was received after 300 seconds, disconnecting...
[vpm161][INFO ] Running command: sudo systemctl enable ceph-mgr@vpm161
[vpm161][WARNIN] Created symlink from /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@vpm161.service to /usr/lib/systemd/system/ceph-mgr@.service.
[vpm161][INFO ] Running command: sudo systemctl start ceph-mgr@vpm161
[vpm161][INFO ] Running command: sudo systemctl enable ceph.target
The crash seen in the monitor looks like:
   -38> 2017-08-01 19:48:36.532727 7fe3d70a3700 1 -- 172.21.2.89:6789/0 _send_message--> mon.2 172.21.2.161:6789/0 -- paxos(lease lc 118352 fc 117720 pn 0 opn 0) v4 -- ?+0 0x7fe3e876f000
   -37> 2017-08-01 19:48:36.532737 7fe3d70a3700 1 -- 172.21.2.89:6789/0 --> 172.21.2.161:6789/0 -- paxos(lease lc 118352 fc 117720 pn 0 opn 0) v4 -- 0x7fe3e876f000 con 0
   -36> 2017-08-01 19:48:36.543756 7fe3d427e700 5 mon.vpm089@1(leader).paxos(paxos active c 117720..118352) queue_pending_finisher 0x7fe3e8592950
   -35> 2017-08-01 19:48:36.545835 7fe3d427e700 1 -- 172.21.2.89:6789/0 _send_message--> mon.2 172.21.2.161:6789/0 -- paxos(begin lc 118352 fc 0 pn 5866101 opn 0) v4 -- ?+0 0x7fe3e876f900
   -34> 2017-08-01 19:48:36.545851 7fe3d427e700 1 -- 172.21.2.89:6789/0 --> 172.21.2.161:6789/0 -- paxos(begin lc 118352 fc 0 pn 5866101 opn 0) v4 -- 0x7fe3e876f900 con 0
   -33> 2017-08-01 19:48:36.549450 7fe3d6691700 5 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0). rx mon.2 seq 17 0x7fe3e876f000 paxos(lease_ack lc 118352 fc 117720 pn 0 opn 0) v4
   -32> 2017-08-01 19:48:36.549514 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== mon.2 172.21.2.161:6789/0 17 ==== paxos(lease_ack lc 118352 fc 117720 pn 0 opn 0) v4 ==== 166+0+0 (2034569483 0 0) 0x7fe3e876f000 con 0x7fe3e89bd000
   -31> 2017-08-01 19:48:36.583151 7fe3d6691700 1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).read_bulk peer close file descriptor 29
   -30> 2017-08-01 19:48:36.583166 7fe3d6691700 1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).read_until read failed
   -29> 2017-08-01 19:48:36.583170 7fe3d6691700 1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).process read tag failed
   -28> 2017-08-01 19:48:36.583386 7fe3d5e90700 1 -- 172.21.2.89:6789/0 >> - conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=28 -
   -27> 2017-08-01 19:48:36.583517 7fe3d5e90700 2 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=93597 cs=1 l=1).handle_connect_msg accept write reply msg done
   -26> 2017-08-01 19:48:36.583740 7fe3d5e90700 2 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=93597 cs=1 l=1)._process_connection accept get newly_acked_seq 0
   -25> 2017-08-01 19:48:36.583778 7fe3d5e90700 5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 1 0x7fe3e89c9e00 auth(proto 0 31 bytes epoch 2) v1
   -24> 2017-08-01 19:48:36.583817 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 1 ==== auth(proto 0 31 bytes epoch 2) v1 ==== 61+0+0 (732538964 0 0) 0x7fe3e89c9e00 con 0x7fe3e8957000
   -23> 2017-08-01 19:48:36.583836 7fe3d1a79700 5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:36.583837 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
   -22> 2017-08-01 19:48:36.583877 7fe3d1a79700 2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9b80 auth_reply(proto 2 0 (0) Success) v1
   -21> 2017-08-01 19:48:36.583885 7fe3d1a79700 1 -- 172.21.2.89:6789/0 --> 172.21.2.89:6800/2251934460 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9b80 con 0
   -20> 2017-08-01 19:48:36.584153 7fe3d5e90700 5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 2 0x7fe3e89c9b80 auth(proto 2 128 bytes epoch 0) v1
   -19> 2017-08-01 19:48:36.584183 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 2 ==== auth(proto 2 128 bytes epoch 0) v1 ==== 158+0+0 (1065591626 0 0) 0x7fe3e89c9b80 con 0x7fe3e8957000
   -18> 2017-08-01 19:48:36.584197 7fe3d1a79700 5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:36.584198 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
   -17> 2017-08-01 19:48:36.584371 7fe3d1a79700 2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9e00 auth_reply(proto 2 0 (0) Success) v1
   -16> 2017-08-01 19:48:36.584380 7fe3d1a79700 1 -- 172.21.2.89:6789/0 --> 172.21.2.89:6800/2251934460 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9e00 con 0
   -15> 2017-08-01 19:48:36.584629 7fe3d5e90700 5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 3 0x7fe3e8a1ad80 mon_subscribe({mdsmap=9+,monmap=3+,osdmap=56}) v2
   -14> 2017-08-01 19:48:36.584670 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 3 ==== mon_subscribe({mdsmap=9+,monmap=3+,osdmap=56}) v2 ==== 61+0+0 (3973911011 0 0) 0x7fe3e8a1ad80 con 0x7fe3e8957000
   -13> 2017-08-01 19:48:37.686897 7fe3d568f700 1 -- 172.21.2.89:6789/0 >> 172.21.2.5:6789/0 conn(0x7fe3e89bb800 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._process_connection reconnect failed
   -12> 2017-08-01 19:48:37.686937 7fe3d568f700 2 -- 172.21.2.89:6789/0 >> 172.21.2.5:6789/0 conn(0x7fe3e89bb800 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._process_connection connection refused!
   -11> 2017-08-01 19:48:39.213027 7fe3d5e90700 1 -- 172.21.2.89:6789/0 >> - conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=29 -
   -10> 2017-08-01 19:48:39.213587 7fe3d5e90700 2 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=162 cs=1 l=1).handle_connect_msg accept write reply msg done
    -9> 2017-08-01 19:48:39.214190 7fe3d5e90700 2 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=162 cs=1 l=1)._process_connection accept get newly_acked_seq 0
    -8> 2017-08-01 19:48:39.214280 7fe3d5e90700 5 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=162 cs=1 l=1). rx client.? seq 1 0x7fe3e89c9e00 auth(proto 0 38 bytes epoch 2) v1
    -7> 2017-08-01 19:48:39.214323 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== client.? 172.21.2.161:0/2298754901 1 ==== auth(proto 0 38 bytes epoch 2) v1 ==== 68+0+0 (3411313417 0 0) 0x7fe3e89c9e00 con 0x7fe3e8abd000
    -6> 2017-08-01 19:48:39.214347 7fe3d1a79700 5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:39.214348 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
    -5> 2017-08-01 19:48:39.214394 7fe3d1a79700 2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9b80 auth_reply(proto 2 0 (0) Success) v1
    -4> 2017-08-01 19:48:39.214408 7fe3d1a79700 1 -- 172.21.2.89:6789/0 --> 172.21.2.161:0/2298754901 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9b80 con 0
    -3> 2017-08-01 19:48:39.215481 7fe3d5e90700 5 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=162 cs=1 l=1). rx client.? seq 2 0x7fe3e89c9b80 auth(proto 2 32 bytes epoch 0) v1
    -2> 2017-08-01 19:48:39.215514 7fe3d1a79700 1 -- 172.21.2.89:6789/0 <== client.? 172.21.2.161:0/2298754901 2 ==== auth(proto 2 32 bytes epoch 0) v1 ==== 62+0+0 (2204999250 0 0) 0x7fe3e89c9b80 con 0x7fe3e8abd000
    -1> 2017-08-01 19:48:39.215529 7fe3d1a79700 5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:39.215530 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
     0> 2017-08-01 19:48:39.219115 7fe3d1a79700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/auth/Crypto.h: In function 'int CryptoKey::encrypt(CephContext*, const bufferlist&, ceph::bufferlist&, std::string*) const' thread 7fe3d1a79700 time 2017-08-01 19:48:39.215549
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/auth/Crypto.h: 109: FAILED assert(ckh)

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fe3de2b7310]
 2: (()+0x2c50f8) [0x7fe3ddfb70f8]
 3: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 4: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 5: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 6: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 7: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 9: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 10: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 11: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 13: (()+0x7dc5) [0x7fe3dd081dc5]
 14: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-mon.vpm089.log
--- end dump of recent events ---
2017-08-01 19:48:39.223796 7fe3d1a79700 -1 *** Caught signal (Aborted) **
 in thread 7fe3d1a79700 thread_name:ms_dispatch

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x852ec1) [0x7fe3de544ec1]
 2: (()+0xf370) [0x7fe3dd089370]
 3: (gsignal()+0x37) [0x7fe3da34b1d7]
 4: (abort()+0x148) [0x7fe3da34c8c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7fe3de2b7484]
 6: (()+0x2c50f8) [0x7fe3ddfb70f8]
 7: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 8: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 9: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 10: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 11: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 13: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 15: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 17: (()+0x7dc5) [0x7fe3dd081dc5]
 18: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
     0> 2017-08-01 19:48:39.223796 7fe3d1a79700 -1 *** Caught signal (Aborted) **
 in thread 7fe3d1a79700 thread_name:ms_dispatch

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x852ec1) [0x7fe3de544ec1]
 2: (()+0xf370) [0x7fe3dd089370]
 3: (gsignal()+0x37) [0x7fe3da34b1d7]
 4: (abort()+0x148) [0x7fe3da34c8c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7fe3de2b7484]
 6: (()+0x2c50f8) [0x7fe3ddfb70f8]
 7: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 8: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 9: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 10: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 11: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 13: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 15: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 17: (()+0x7dc5) [0x7fe3dd081dc5]
 18: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-mon.vpm089.log
--- end dump of recent events ---
logs link:
vpm089 -> http://chunk.io/f/cfeaeb298bfc41d49055d4a74cfbc1ce (last 5000 lines)
vpm005 -> http://chunk.io/f/2749112a9b0c40cf91007d8b8860ae60
vpm161 -> http://chunk.io/f/043c290f17cf41149dde6d5b46b7199d
Also feel free to log in to those nodes, since some of the log files are over 300 MB and too large to upload.
Updated by John Spray almost 7 years ago
Thanks, I logged into the nodes and restarted the mons to experiment.
The key is indeed corrupted, as per the issue that was discussed on the mailing list; in "ceph auth list" I see:
client.bootstrap-mgr
	key: AAAAAAAAAAAAAAAA
	caps: [mon] allow profile bootstrap-mgr
I think it was confusing because the key in the .keyring file looks fine, but the corrupted one is nevertheless what is stored in the mons.
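Incidentally, the corrupted value is easy to recognize mechanically: base64-decoding `AAAAAAAAAAAAAAAA` yields nothing but zero bytes, and it is also shorter than a well-formed key blob. A minimal sketch; the `looks_corrupted` helper and the assumed 28-byte minimum length are mine, not anything from this tracker:

```python
import base64

# Hypothetical sanity check for a CephX secret as printed by "ceph auth list".
# Assumption: a well-formed key blob decodes to at least 28 bytes (type,
# creation time, length, and a 16-byte secret) and is never all zeros.
def looks_corrupted(b64_key):
    raw = base64.b64decode(b64_key)
    return len(raw) < 28 or all(b == 0 for b in raw)

print(looks_corrupted("AAAAAAAAAAAAAAAA"))                          # True
print(looks_corrupted("AQATiHpZpZbpGBAAmYM6GstmzftcvrHxfcYMBA=="))  # False
```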
The fix was https://github.com/ceph/ceph/pull/16395 (for http://tracker.ceph.com/issues/20666)
To work around this in 12.1.1, we can run "ceph auth del client.bootstrap-mgr ; ceph auth get-or-create client.bootstrap-mgr mon "allow profile bootstrap-mgr" > /var/lib/ceph/bootstrap-mgr" -- I have gone ahead and done that on all three nodes.
Hopefully creating a mgr with ceph-deploy works now.
Updated by Vasu Kulkarni almost 7 years ago
Thanks John, mgr create is working now. Hopefully I can verify this when the next RC candidate comes out, and I will close this then. Until then I will use the workaround you suggested above to recreate the bootstrap-mgr key.