Bug #20848

closed

upgrading from jewel to luminous - mgr create throws EACCES: access denied

Added by Vasu Kulkarni over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1) Start with a 4-node cluster with 3 mons and 3 OSD hosts, preinstalled with the jewel release.
2) Upgrade the nodes one by one to luminous using 'ceph-deploy install --release=luminous nodename' (see the sketch below).
3) The cluster was initially healthy; after the first upgrade, the mon on that node had dropped out of quorum even though the daemon itself was running fine:

   [ubuntu@vpm161 ~]$ sudo ceph -s
    cluster 76b054f1-989f-4dab-983b-6cbe87eb5c2f
     health HEALTH_WARN
            1 mons down, quorum 0,1 vpm005,vpm089
     monmap e1: 3 mons at {vpm005=172.21.2.5:6789/0,vpm089=172.21.2.89:6789/0,vpm161=172.21.2.161:6789/0}
            election epoch 994, quorum 0,1 vpm005,vpm089
      fsmap e5: 1/1/1 up {0=vpm089=up:active}
     osdmap e46: 9 osds: 9 up, 9 in
            flags sortbitwise,require_jewel_osds
      pgmap v129: 100 pgs, 3 pools, 2068 bytes data, 20 objects
            306 MB used, 1753 GB / 1754 GB avail
                 100 active+clean
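
For reference, the rolling upgrade in step 2 amounts to something like this (a sketch; the node list is assumed from this cluster, run from the admin node):

   # Upgrade one node at a time and let the cluster settle in between.
   for node in vpm161 vpm005 vpm089 vpm123; do
       ceph-deploy install --release=luminous "$node"
       ceph -s    # wait for quorum/health to recover before the next node
   done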
 

After upgrading all the nodes, I went ahead to see if I could create a mgr on one of the mon nodes, but it fails with the error below. Is this something that we can handle when coming from a jewel release?

[ubuntu@vpm089 ceph-deploy]$ ./ceph-deploy mgr create vpm161
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.38): ./ceph-deploy mgr create vpm161
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  mgr                           : [('vpm161', 'vpm161')]
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : create
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x159efc8>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function mgr at 0x152d410>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.mgr][DEBUG ] Deploying mgr, cluster ceph hosts vpm161:vpm161
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connection detected need for sudo
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connected to host: vpm161 
[vpm161][DEBUG ] detect platform information from remote host
[vpm161][DEBUG ] detect machine type
[ceph_deploy.mgr][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.mgr][DEBUG ] remote host will use systemd
[ceph_deploy.mgr][DEBUG ] deploying mgr bootstrap to vpm161
[vpm161][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[vpm161][WARNIN] mgr keyring does not exist yet, creating one
[vpm161][DEBUG ] create a keyring file
[vpm161][DEBUG ] create path if it doesn't exist
[vpm161][INFO  ] Running command: sudo ceph --cluster ceph --name client.bootstrap-mgr --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring auth get-or-create mgr.vpm161 mon allow profile mgr osd allow * mds allow * -o /var/lib/ceph/mgr/ceph-vpm161/keyring
[vpm161][ERROR ] Error EACCES: access denied
[vpm161][ERROR ] exit code from command was: 13
[ceph_deploy.mgr][ERROR ] could not create mgr
[ceph_deploy][ERROR ] GenericError: Failed to create 1 MGRs
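
The failing step reduces to this single auth call (reconstructed from the log line above, with shell quoting added around the caps):

   sudo ceph --cluster ceph --name client.bootstrap-mgr \
       --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring \
       auth get-or-create mgr.vpm161 \
       mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
       -o /var/lib/ceph/mgr/ceph-vpm161/keyring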

The current state: the require_jewel_osds flag is still set, and the cluster reports HEALTH_ERR:

[ubuntu@vpm161 ~]$ sudo ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.71263 root default                                      
-2 0.57088     host vpm123                                   
 0 0.19029         osd.0        up  1.00000          1.00000 
 1 0.19029         osd.1        up  1.00000          1.00000 
 2 0.19029         osd.2        up  1.00000          1.00000 
-3 0.57088     host vpm089                                   
 3 0.19029         osd.3      down        0          1.00000 
 4 0.19029         osd.4      down        0          1.00000 
 5 0.19029         osd.5      down        0          1.00000 
-4 0.57088     host vpm005                                   
 6 0.19029         osd.6      down        0          1.00000 
 7 0.19029         osd.7      down        0          1.00000 
 8 0.19029         osd.8      down        0          1.00000 
[ubuntu@vpm161 ~]$ sudo ceph -s
    cluster 76b054f1-989f-4dab-983b-6cbe87eb5c2f
     health HEALTH_ERR
            66 pgs are stuck inactive for more than 300 seconds
            100 pgs degraded
            100 pgs stuck degraded
            66 pgs stuck inactive
            100 pgs stuck unclean
            100 pgs stuck undersized
            100 pgs undersized
            3 requests are blocked > 32 sec
            recovery 36/60 objects degraded (60.000%)
            recovery 4/60 objects misplaced (6.667%)
            mds cluster is degraded
            1 mons down, quorum 0,1 vpm005,vpm089
     monmap e1: 3 mons at {vpm005=172.21.2.5:6789/0,vpm089=172.21.2.89:6789/0,vpm161=172.21.2.161:6789/0}
            election epoch 1562, quorum 0,1 vpm005,vpm089
      fsmap e8: 1/1/1 up {0=vpm089=up:replay}
     osdmap e54: 9 osds: 3 up, 3 in; 100 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v152: 100 pgs, 3 pools, 2068 bytes data, 20 objects
            104 MB used, 584 GB / 584 GB avail
            36/60 objects degraded (60.000%)
            4/60 objects misplaced (6.667%)
                  66 undersized+degraded+peered
                  34 active+undersized+degraded

Actions #1

Updated by Vasu Kulkarni over 6 years ago

  • Project changed from Ceph to mgr
  • Category set to ceph-mgr
Actions #2

Updated by Vasu Kulkarni over 6 years ago

  • Assignee set to John Spray

Hi John,

I am trying to add an upgrade test using ceph-deploy and would like to know whether this requires a fix in the ceph-mgr bootstrap or elsewhere.

The bootstrap key on vpm161 seems to be good. A few people in the mailing-list thread mentioned a busted mgr key when upgrading from jewel, but I don't think that is the issue here:

[ubuntu@vpm161 ~]$ sudo cat /var/lib/ceph/bootstrap-mgr/ceph.keyring 
[client.bootstrap-mgr]
    key = AQATiHpZpZbpGBAAmYM6GstmzftcvrHxfcYMBA==
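
One way to cross-check would be to compare the on-disk keyring with what the mons actually have stored (a sketch, run with the admin key):

   sudo cat /var/lib/ceph/bootstrap-mgr/ceph.keyring   # local copy
   sudo ceph auth get client.bootstrap-mgr             # what the mons hold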

Actions #3

Updated by John Spray over 6 years ago

Could you check the exact version locally on all the monitors with "ceph daemon mon.<id> version"?

Assuming they are all indeed running the same latest luminous, then look in the mon logs to see if there is more detail about why the mon is rejecting the key creation with EACCES.
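
For example, something like this (a sketch; mon IDs are assumed to match the short hostnames, as they do in the outputs below):

   for id in vpm005 vpm089 vpm161; do
       ssh "$id" sudo ceph daemon "mon.$id" version
   done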

Actions #4

Updated by Vasu Kulkarni over 6 years ago

Two of the mons were still on jewel; that seems to be an issue with systemd, since the packages were already at 12.1.1 (I will raise a separate issue for systemd):

[ubuntu@vpm089 ~]$ sudo ceph daemon mon.vpm089 version
{"version":"10.2.9"}
[ubuntu@vpm089 ~]$ rpm -qa | grep ceph
libcephfs2-12.1.1-0.el7.x86_64
ceph-mon-12.1.1-0.el7.x86_64
iozone-3.424-2_ceph.el7.centos.x86_64
ceph-selinux-12.1.1-0.el7.x86_64
ceph-test-12.1.1-0.el7.x86_64
ceph-release-1-1.el7.noarch
python-cephfs-12.1.1-0.el7.x86_64
ceph-mds-12.1.1-0.el7.x86_64
ceph-radosgw-12.1.1-0.el7.x86_64
ceph-common-12.1.1-0.el7.x86_64
ceph-osd-12.1.1-0.el7.x86_64
ceph-mgr-12.1.1-0.el7.x86_64
ceph-base-12.1.1-0.el7.x86_64
ceph-12.1.1-0.el7.x86_64
mod_fastcgi-2.4.7-1.ceph.el7.centos.x86_64

[ubuntu@vpm005 ~]$ sudo ceph daemon mon.vpm005 version                                                                                                                       
{"version":"10.2.9"}

[ubuntu@vpm161 ~]$ sudo ceph daemon mon.vpm161 version
{"version":"12.1.1","release":"luminous","release_type":"rc"}

After restarting both mons and verifying that they were on 12.1.1 and that all mons were up, I reissued the mgr create, but this time all 3 mons crashed:

[ubuntu@vpm005 ~]$ sudo ceph -s
  cluster:
    id:     76b054f1-989f-4dab-983b-6cbe87eb5c2f
    health: HEALTH_ERR
            66 pgs are stuck inactive for more than 60 seconds
            100 pgs degraded
            100 pgs stuck degraded
            66 pgs stuck inactive
            100 pgs stuck unclean
            100 pgs stuck undersized
            100 pgs undersized
            3 requests are blocked > 32 sec
            recovery 36/60 objects degraded (60.000%)
            recovery 4/60 objects misplaced (6.667%)
            mds cluster is degraded

  services:
    mon: 3 daemons, quorum vpm005,vpm089,vpm161
    mgr: no daemons active
    mds: 1/1/1 up {0=vpm089=up:replay}
    osd: 9 osds: 3 up, 3 in; 100 remapped pgs

  data:
    pools:   3 pools, 100 pgs
    objects: 20 objects, 2068 bytes
    usage:   104 MB used, 584 GB / 584 GB avail
    pgs:     66.000% pgs not active
             36/60 objects degraded (60.000%)
             4/60 objects misplaced (6.667%)
             66 undersized+degraded+peered
             34 active+undersized+degraded

[ubuntu@vpm089 ceph-deploy]$ ./ceph-deploy mgr create vpm161
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.38): ./ceph-deploy mgr create vpm161
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  mgr                           : [('vpm161', 'vpm161')]
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : create
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x16d9fc8>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function mgr at 0x1668410>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.mgr][DEBUG ] Deploying mgr, cluster ceph hosts vpm161:vpm161
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connection detected need for sudo
Warning: Permanently added 'vpm161,172.21.2.161' (ECDSA) to the list of known hosts.
[vpm161][DEBUG ] connected to host: vpm161 
[vpm161][DEBUG ] detect platform information from remote host
[vpm161][DEBUG ] detect machine type
[ceph_deploy.mgr][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.mgr][DEBUG ] remote host will use systemd
[ceph_deploy.mgr][DEBUG ] deploying mgr bootstrap to vpm161
[vpm161][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[vpm161][DEBUG ] create path if it doesn't exist
[vpm161][INFO  ] Running command: sudo ceph --cluster ceph --name client.bootstrap-mgr --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring auth get-or-create mgr.vpm161 mon allow profile mgr osd allow * mds allow * -o /var/lib/ceph/mgr/ceph-vpm161/keyring
[vpm161][WARNIN] No data was received after 300 seconds, disconnecting...
[vpm161][INFO  ] Running command: sudo systemctl enable ceph-mgr@vpm161
[vpm161][WARNIN] Created symlink from /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@vpm161.service to /usr/lib/systemd/system/ceph-mgr@.service.
[vpm161][INFO  ] Running command: sudo systemctl start ceph-mgr@vpm161
[vpm161][INFO  ] Running command: sudo systemctl enable ceph.target

The crash seen in the monitor log looks like:

   -38> 2017-08-01 19:48:36.532727 7fe3d70a3700  1 -- 172.21.2.89:6789/0 _send_message--> mon.2 172.21.2.161:6789/0 -- paxos(lease lc 118352 fc 117720 pn 0 opn 0) v4 -- ?+0 0x7fe3e876f000
   -37> 2017-08-01 19:48:36.532737 7fe3d70a3700  1 -- 172.21.2.89:6789/0 --> 172.21.2.161:6789/0 -- paxos(lease lc 118352 fc 117720 pn 0 opn 0) v4 -- 0x7fe3e876f000 con 0
   -36> 2017-08-01 19:48:36.543756 7fe3d427e700  5 mon.vpm089@1(leader).paxos(paxos active c 117720..118352) queue_pending_finisher 0x7fe3e8592950
   -35> 2017-08-01 19:48:36.545835 7fe3d427e700  1 -- 172.21.2.89:6789/0 _send_message--> mon.2 172.21.2.161:6789/0 -- paxos(begin lc 118352 fc 0 pn 5866101 opn 0) v4 -- ?+0 0x7fe3e876f900
   -34> 2017-08-01 19:48:36.545851 7fe3d427e700  1 -- 172.21.2.89:6789/0 --> 172.21.2.161:6789/0 -- paxos(begin lc 118352 fc 0 pn 5866101 opn 0) v4 -- 0x7fe3e876f900 con 0
   -33> 2017-08-01 19:48:36.549450 7fe3d6691700  5 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0). rx mon.2 seq 17 0x7fe3e876f000 paxos(lease_ack lc 118352 fc 117720 pn 0 opn 0) v4
   -32> 2017-08-01 19:48:36.549514 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== mon.2 172.21.2.161:6789/0 17 ==== paxos(lease_ack lc 118352 fc 117720 pn 0 opn 0) v4 ==== 166+0+0 (2034569483 0 0) 0x7fe3e876f000 con 0x7fe3e89bd000
   -31> 2017-08-01 19:48:36.583151 7fe3d6691700  1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).read_bulk peer close file descriptor 29
   -30> 2017-08-01 19:48:36.583166 7fe3d6691700  1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).read_until read failed
   -29> 2017-08-01 19:48:36.583170 7fe3d6691700  1 -- 172.21.2.89:6789/0 >> 172.21.2.161:6789/0 conn(0x7fe3e89bd000 :-1 s=STATE_OPEN pgs=3 cs=1 l=0).process read tag failed
   -28> 2017-08-01 19:48:36.583386 7fe3d5e90700  1 -- 172.21.2.89:6789/0 >> - conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=28 -
   -27> 2017-08-01 19:48:36.583517 7fe3d5e90700  2 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=93597 cs=1 l=1).handle_connect_msg accept write reply msg done
   -26> 2017-08-01 19:48:36.583740 7fe3d5e90700  2 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=93597 cs=1 l=1)._process_connection accept get newly_acked_seq 0
   -25> 2017-08-01 19:48:36.583778 7fe3d5e90700  5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 1 0x7fe3e89c9e00 auth(proto 0 31 bytes epoch 2) v1
   -24> 2017-08-01 19:48:36.583817 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 1 ==== auth(proto 0 31 bytes epoch 2) v1 ==== 61+0+0 (732538964 0 0) 0x7fe3e89c9e00 con 0x7fe3e8957000
   -23> 2017-08-01 19:48:36.583836 7fe3d1a79700  5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:36.583837 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
   -22> 2017-08-01 19:48:36.583877 7fe3d1a79700  2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9b80 auth_reply(proto 2 0 (0) Success) v1
   -21> 2017-08-01 19:48:36.583885 7fe3d1a79700  1 -- 172.21.2.89:6789/0 --> 172.21.2.89:6800/2251934460 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9b80 con 0
   -20> 2017-08-01 19:48:36.584153 7fe3d5e90700  5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 2 0x7fe3e89c9b80 auth(proto 2 128 bytes epoch 0) v1
   -19> 2017-08-01 19:48:36.584183 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 2 ==== auth(proto 2 128 bytes epoch 0) v1 ==== 158+0+0 (1065591626 0 0) 0x7fe3e89c9b80 con 0x7fe3e8957000
   -18> 2017-08-01 19:48:36.584197 7fe3d1a79700  5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:36.584198 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
   -17> 2017-08-01 19:48:36.584371 7fe3d1a79700  2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9e00 auth_reply(proto 2 0 (0) Success) v1
   -16> 2017-08-01 19:48:36.584380 7fe3d1a79700  1 -- 172.21.2.89:6789/0 --> 172.21.2.89:6800/2251934460 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9e00 con 0
   -15> 2017-08-01 19:48:36.584629 7fe3d5e90700  5 -- 172.21.2.89:6789/0 >> 172.21.2.89:6800/2251934460 conn(0x7fe3e8957000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=93597 cs=1 l=1). rx mds.0 seq 3 0x7fe3e8a1ad80 mon_subscribe({mdsmap=9+,monmap=3+,osdmap=56}) v2
   -14> 2017-08-01 19:48:36.584670 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== mds.0 172.21.2.89:6800/2251934460 3 ==== mon_subscribe({mdsmap=9+,monmap=3+,osdmap=56}) v2 ==== 61+0+0 (3973911011 0 0) 0x7fe3e8a1ad80 con 0x7fe3e8957000
   -13> 2017-08-01 19:48:37.686897 7fe3d568f700  1 -- 172.21.2.89:6789/0 >> 172.21.2.5:6789/0 conn(0x7fe3e89bb800 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._process_connection reconnect failed 
   -12> 2017-08-01 19:48:37.686937 7fe3d568f700  2 -- 172.21.2.89:6789/0 >> 172.21.2.5:6789/0 conn(0x7fe3e89bb800 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._process_connection connection refused!
   -11> 2017-08-01 19:48:39.213027 7fe3d5e90700  1 -- 172.21.2.89:6789/0 >> - conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=29 -
   -10> 2017-08-01 19:48:39.213587 7fe3d5e90700  2 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=162 cs=1 l=1).handle_connect_msg accept write reply msg done
    -9> 2017-08-01 19:48:39.214190 7fe3d5e90700  2 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=162 cs=1 l=1)._process_connection accept get newly_acked_seq 0
    -8> 2017-08-01 19:48:39.214280 7fe3d5e90700  5 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=162 cs=1 l=1). rx client.? seq 1 0x7fe3e89c9e00 auth(proto 0 38 bytes epoch 2) v1
    -7> 2017-08-01 19:48:39.214323 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== client.? 172.21.2.161:0/2298754901 1 ==== auth(proto 0 38 bytes epoch 2) v1 ==== 68+0+0 (3411313417 0 0) 0x7fe3e89c9e00 con 0x7fe3e8abd000
    -6> 2017-08-01 19:48:39.214347 7fe3d1a79700  5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:39.214348 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
    -5> 2017-08-01 19:48:39.214394 7fe3d1a79700  2 mon.vpm089@1(leader) e2 send_reply 0x7fe3e85d0640 0x7fe3e89c9b80 auth_reply(proto 2 0 (0) Success) v1
    -4> 2017-08-01 19:48:39.214408 7fe3d1a79700  1 -- 172.21.2.89:6789/0 --> 172.21.2.161:0/2298754901 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x7fe3e89c9b80 con 0
    -3> 2017-08-01 19:48:39.215481 7fe3d5e90700  5 -- 172.21.2.89:6789/0 >> 172.21.2.161:0/2298754901 conn(0x7fe3e8abd000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=162 cs=1 l=1). rx client.? seq 2 0x7fe3e89c9b80 auth(proto 2 32 bytes epoch 0) v1
    -2> 2017-08-01 19:48:39.215514 7fe3d1a79700  1 -- 172.21.2.89:6789/0 <== client.? 172.21.2.161:0/2298754901 2 ==== auth(proto 2 32 bytes epoch 0) v1 ==== 62+0+0 (2204999250 0 0) 0x7fe3e89c9b80 con 0x7fe3e8abd000
    -1> 2017-08-01 19:48:39.215529 7fe3d1a79700  5 mon.vpm089@1(leader).paxos(paxos updating c 117720..118352) is_readable = 1 - now=2017-08-01 19:48:39.215530 lease_expire=2017-08-01 19:48:41.532724 has v0 lc 118352
     0> 2017-08-01 19:48:39.219115 7fe3d1a79700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/auth/Crypto.h: In function 'int CryptoKey::encrypt(CephContext*, const bufferlist&, ceph::bufferlist&, std::string*) const' thread 7fe3d1a79700 time 2017-08-01 19:48:39.215549
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/auth/Crypto.h: 109: FAILED assert(ckh)

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fe3de2b7310]
 2: (()+0x2c50f8) [0x7fe3ddfb70f8]
 3: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 4: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 5: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 6: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 7: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 9: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 10: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 11: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 13: (()+0x7dc5) [0x7fe3dd081dc5]
 14: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.vpm089.log
--- end dump of recent events ---
2017-08-01 19:48:39.223796 7fe3d1a79700 -1 *** Caught signal (Aborted) **
 in thread 7fe3d1a79700 thread_name:ms_dispatch

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x852ec1) [0x7fe3de544ec1]
 2: (()+0xf370) [0x7fe3dd089370]
 3: (gsignal()+0x37) [0x7fe3da34b1d7]
 4: (abort()+0x148) [0x7fe3da34c8c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7fe3de2b7484]
 6: (()+0x2c50f8) [0x7fe3ddfb70f8]
 7: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 8: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 9: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 10: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 11: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 13: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 15: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 17: (()+0x7dc5) [0x7fe3dd081dc5]
 18: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-08-01 19:48:39.223796 7fe3d1a79700 -1 *** Caught signal (Aborted) **
 in thread 7fe3d1a79700 thread_name:ms_dispatch

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x852ec1) [0x7fe3de544ec1]
 2: (()+0xf370) [0x7fe3dd089370]
 3: (gsignal()+0x37) [0x7fe3da34b1d7]
 4: (abort()+0x148) [0x7fe3da34c8c8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x7fe3de2b7484]
 6: (()+0x2c50f8) [0x7fe3ddfb70f8]
 7: (cephx_calc_client_server_challenge(CephContext*, CryptoKey&, unsigned long, unsigned long, unsigned long*, std::string&)+0x2f5) [0x7fe3de46c055]
 8: (CephxServiceHandler::handle_request(ceph::buffer::list::iterator&, ceph::buffer::list&, unsigned long&, AuthCapsInfo&, unsigned long*)+0x259c) [0x7fe3de27bf7c]
 9: (AuthMonitor::prep_auth(boost::intrusive_ptr<MonOpRequest>, bool)+0xc24) [0x7fe3de0ea644]
 10: (AuthMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x322) [0x7fe3de0ed162]
 11: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x811) [0x7fe3de1b5091]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x151) [0x7fe3de09d661]
 13: (Monitor::_ms_dispatch(Message*)+0x7de) [0x7fe3de09f06e]
 14: (Monitor::ms_dispatch(Message*)+0x23) [0x7fe3de0c7303]
 15: (DispatchQueue::entry()+0x792) [0x7fe3de4ef812]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe3de35d3cd]
 17: (()+0x7dc5) [0x7fe3dd081dc5]
 18: (clone()+0x6d) [0x7fe3da40d76d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.vpm089.log
--- end dump of recent events ---

Log links:
vpm089 -> http://chunk.io/f/cfeaeb298bfc41d49055d4a74cfbc1ce (last 5000 lines)
vpm005 -> http://chunk.io/f/2749112a9b0c40cf91007d8b8860ae60
vpm161 -> http://chunk.io/f/043c290f17cf41149dde6d5b46b7199d

Also feel free to log in to those nodes, since some of the log files are over 300 MB and too large to upload.

Actions #5

Updated by John Spray over 6 years ago

Thanks, I logged into the nodes and restarted the mons to experiment.

The key is indeed corrupted, per the issue discussed on the mailing list; in "ceph auth list" I see:

client.bootstrap-mgr
    key: AAAAAAAAAAAAAAAA
    caps: [mon] allow profile bootstrap-mgr

I think it was confusing because the key in the .keyring file looks fine, but the corrupted one is nevertheless what is stored in the mons.
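
You can see how broken the stored key is by base64-decoding it (a sketch; the 28-byte layout of a healthy key -- 2-byte type, 8-byte creation time, 2-byte length, 16-byte secret -- is my reading of the CryptoKey encoding):

   $ echo AAAAAAAAAAAAAAAA | base64 -d | wc -c
   12
   $ echo AQATiHpZpZbpGBAAmYM6GstmzftcvrHxfcYMBA== | base64 -d | wc -c
   28

The corrupted blob decodes to 12 zero bytes: key type 0, zero timestamp, and no secret at all, which is consistent with the FAILED assert(ckh) in the crash above.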

The fix was https://github.com/ceph/ceph/pull/16395 (for http://tracker.ceph.com/issues/20666)

To work around this in 12.1.1, we can delete and re-create the key: "ceph auth del client.bootstrap-mgr ; ceph auth get-or-create client.bootstrap-mgr mon 'allow profile bootstrap-mgr' > /var/lib/ceph/bootstrap-mgr/ceph.keyring" -- I have gone ahead and done that on all three nodes.
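
Spelled out (a sketch; I'm using -o instead of a shell redirect so the write happens under sudo, and assuming the keyring path shown in comment #2):

   sudo ceph auth del client.bootstrap-mgr
   sudo ceph auth get-or-create client.bootstrap-mgr \
       mon 'allow profile bootstrap-mgr' \
       -o /var/lib/ceph/bootstrap-mgr/ceph.keyring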

Hopefully creating a mgr with ceph-deploy works now.

Actions #6

Updated by Vasu Kulkarni over 6 years ago

Thanks John, mgr create is working now. I'll verify this when the next RC comes out and close the issue then; until then I'll use the workaround you suggested above to recreate the bootstrap-mgr key.

Actions #7

Updated by Sage Weil over 6 years ago

  • Status changed from New to Resolved