Bug #19705 (closed)
Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster
Description
Hi help,
When upgrading my cluster from Jewel 10.2.3 (no such issue) to Jewel 10.2.6, I found that my client (Ubuntu 16.04 amd64) can not discover my Ceph cluster (Ubuntu 16.04 arm64).
My cluster configuration:
Node    Username   OS                 Machine
deploy  cephadmin  ubuntu16.04 amd64  x86 PC (test client)
node1   cephadmin  ubuntu16.04 arm64  ARM dev board
node2   cephadmin  ubuntu16.04 arm64  ARM dev board
node3   cephadmin  ubuntu16.04 arm64  ARM dev board
run 'ceph -v' on all nodes
cephadmin@node1:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node2:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node3:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@deploy:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
run 'ceph -s' on each cluster node
cephadmin@node3:~$ ceph status
cluster 61b4956e-692f-4272-8a12-997538381feb
health HEALTH_OK
monmap e1: 3 mons at {node1=172.12.55.209:6789/0,node2=172.12.55.213:6789/0,node3=172.12.55.210:6789/0}
election epoch 4, quorum 0,1,2 node1,node3,node2
osdmap e14: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v38: 64 pgs, 1 pools, 0 bytes data, 0 objects
101 MB used, 5571 GB / 5571 GB avail
64 active+clean
run 'ceph -s' on client node
cephadmin@deploy:~$ ceph status
2017-04-07 14:07:38.223860 7f2b28816700 0 monclient(hunting): authenticate timed out after 300
2017-04-07 14:07:38.223909 7f2b28816700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
cephadmin@deploy:~$
cephadmin@deploy:~$
cephadmin@deploy:~$ rbd create --size 1024 foo -m 172.12.55.209 --image-feature layering
2017-04-07 15:59:21.398973 7fa5f6511100 0 monclient(hunting): authenticate timed out after 300
2017-04-07 15:59:21.399030 7fa5f6511100 0 librados: client.admin authentication error (110) Connection timed out
rbd: couldn't connect to the cluster!
None of the nodes has a firewall. I have upgraded to Jewel 10.2.7, but the issue remains. I am not sure which version introduced this interoperability issue between the amd64 client and the arm64 cluster (surely somewhere between 10.2.3 and 10.2.6).
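As a first sanity check for the "no firewall" statement, plain TCP reachability of the monitors can be probed from the client. The sketch below is only an illustration: the helper names `parse_monmap` and `can_connect` are hypothetical, and the monitor list is copied from the `ceph status` output above. A successful TCP connection rules out basic network problems only; it says nothing about authentication or message checksums, which are checked after the connection is established.

```python
import re
import socket

# Monitor list copied from the 'ceph status' output in this report.
MONMAP = "{node1=172.12.55.209:6789/0,node2=172.12.55.213:6789/0,node3=172.12.55.210:6789/0}"


def parse_monmap(text):
    """Return {name: (host, port)} for each 'name=ip:port/nonce' entry."""
    pattern = re.compile(r"(\w+)=(\d+\.\d+\.\d+\.\d+):(\d+)/\d+")
    return {name: (host, int(port)) for name, host, port in pattern.findall(text)}


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def report(monmap_text):
    """Print one reachability line per monitor (hypothetical helper)."""
    for name, (host, port) in parse_monmap(monmap_text).items():
        state = "reachable" if can_connect(host, port) else "unreachable"
        print(f"{name} {host}:{port} {state}")
```

Calling `report(MONMAP)` on the client node would print one line per monitor.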
Updated by Nathan Cutler almost 7 years ago
- Tracker changed from Tasks to Bug
- Project changed from Stable releases to Ceph
- Regression set to No
- Severity set to 3 - minor
Updated by Greg Farnum almost 7 years ago
- Priority changed from High to Normal
If you reproduce this while the clients and servers have "debug ms = 10" set, the logs will contain a lot of developer debugging that we can use to see which part of the connection is failing.
I would check and make sure that your configs are compatible, though (e.g., that the client has a keyring which the cluster is willing to accept, and that they both agree on the required cephx security levels).
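A minimal ceph.conf sketch of the debug setting mentioned above. Placing it in the `[global]` section, so that it applies to both daemons and clients on the node, is an assumption and not something stated in this thread:

```ini
[global]
    # Verbose messenger logging, as requested above. Assumed placement: [global].
    debug ms = 10
```

The setting would need to be present on the client and on the monitor nodes, and is best removed again after reproducing the problem because the resulting logs are large.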
Updated by Jay Zhu almost 7 years ago
- File normal print when run _ceph -s_ on cluster node.txt added
- File connection time out when run _ceph -s_ on client node.txt added
Thanks for your response. Can the attached logs help you check this issue?
Updated by Greg Farnum almost 7 years ago
We'd also need the logs from the server output. From the client-side one you provided it sure looks like it's connecting successfully and then getting booted because the authentication failed somehow, but I can't be sure.
Updated by Jay Zhu almost 7 years ago
- File ceph-mon.cc1.log.zip added
+ ceph-mon.cc1.log
The client IP address is 172.12.55.166
Thanks
Updated by Greg Farnum almost 7 years ago
- Category set to msgr
The monitor node is printing out a lot of
reader got bad header crc 0 != 3802876777
log messages after it receives a KEEPALIVE2 packet. I'd look to see if your config changed in some way so that the client has disabled message CRCs.
Otherwise we do indeed have a bug across the architectures, but I don't see anything in the patches between those two versions that should have caused it.
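To see how widespread these messages are in a monitor log, the lines can be counted with a short script. This is a sketch only: the function name `summarize_bad_header_crcs` is hypothetical, and the pattern assumes the message text appears exactly as quoted above, possibly preceded by a timestamp and thread prefix.

```python
import re

# Matches the messenger message quoted above, for example:
#   reader got bad header crc 0 != 3802876777
BAD_CRC = re.compile(r"reader got bad header crc (\d+) != (\d+)")


def summarize_bad_header_crcs(lines):
    """Return (total, zero_first): the number of bad-header-CRC messages,
    and how many of them show 0 as the first of the two values."""
    total = 0
    zero_first = 0
    for line in lines:
        match = BAD_CRC.search(line)
        if match is None:
            continue
        total += 1
        if int(match.group(1)) == 0:
            zero_first += 1
    return total, zero_first
```

If every mismatch shows 0 on the same side, one endpoint is probably producing or expecting an all-zero checksum rather than seeing random corruption.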
Updated by Jay Zhu almost 7 years ago
I used the ceph-deploy tool to set up my Ceph cluster, and the ceph.conf has not been changed.
Updated by Sergey Ponomarev over 6 years ago
I reproduced this bug on Ceph 12.1.1 - 12.1.3.
Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64
It is a critical bug.
I am ready to help to collect all the necessary information for bug fixing.
What data do you need to collect to fix the bug?
Updated by Nathan Cutler over 6 years ago
@Greg Farnum: This PR went into 10.2.6 and mentions the CRC check -> https://github.com/ceph/ceph/pull/13131
Updated by Nathan Cutler over 6 years ago
Sergey, I pushed a branch that is v10.2.7 plus revert of PR#13131 - packages are now building.
Are you willing/able to test this branch? (If you need some other Jewel point version I can re-do it.)
Updated by Sergey Ponomarev over 6 years ago
I need a deb repository for Ubuntu Xenial.
I can check tomorrow and report the results.
If the problem is fixed, will you fix the problem in 12th branch?
Updated by Sergey Ponomarev over 6 years ago
Sergey Ponomarev wrote:
> I need a deb repository for Ubuntu Xenial.
> I can check tomorrow and report the results.
> If the problem is fixed, will you fix the problem in 12th branch?
I need a deb repository for Ubuntu Xenial ARM64 & X86_64 platform.
Updated by Nathan Cutler over 6 years ago
> If the problem is fixed, will you fix the problem in 12th branch?
Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?
Regarding the repos, the Xenial x86_64 will appear here - https://shaman.ceph.com/builds/ceph/wip-19705/ebbe7e008538300586821ba29c24732ea7e12521/default/69764/ - but ARM repos are not currently built (that I know of). I'll see what can be done about that.
Updated by Sergey Ponomarev over 6 years ago
Nathan Cutler wrote:
> Are you saying that the bug is also present in Luminous (v12.x.y)?
Yes, the 12.x branch has the same problem.
As I wrote above:
> I reproduced this bug on Ceph 12.1.1 - 12.1.3.
> Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64
The x86_64 Ceph client cannot discover the ARM64 cluster (connection times out).
If I disable the CRC check in the messenger (in ceph.conf), the ceph client (ceph -s) immediately crashes and produces a crash dump.
Updated by Sergey Ponomarev over 6 years ago
Nathan Cutler wrote:
>> If the problem is fixed, will you fix the problem in 12th branch?
> Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?
Does it make sense to create a bug report for Ceph version 12?
A fresh installation of version 12 has this problem between ARM64 and x86_64 on Ubuntu Xenial from the start.
Updated by Nathan Cutler over 6 years ago
- Priority changed from Normal to Urgent
- Release set to master
- Release set to luminous
- Affected Versions v12.2.0 added
Raising priority - x86_64 clients should be able to connect to ARM clusters
Updated by Sergey Ponomarev over 6 years ago
Nathan Cutler wrote:
> Raising priority - x86_64 clients should be able to connect to ARM clusters
What information do you need to gather from the cluster and the client to fix the error in Ceph 12.2.0?
Command output, debugging level? etc.
I have a cluster built on Ceph 12.2.0, ARM64 (Ubuntu 16.04.3) and clients x86_64 (Ubuntu 16.04.3)
Updated by Nathan Cutler over 6 years ago
Sergey Ponomarev wrote:
> If I disable the CRC check in the messenger (in ceph.conf), the ceph client (ceph -s) immediately crashes and produces a crash dump.
So the bug only reproduces with "ms nocrc" in ceph.conf?
Updated by Nathan Cutler over 6 years ago
It could be a regression caused by http://tracker.ceph.com/issues/17575 (which went into Jewel 10.2.4). The master fix went into Kraken (v11.1.0) and was backported to Jewel.
Updated by Kefu Chai over 6 years ago
https://github.com/ceph/ceph/pull/17420 might help.
But I would like to know whether "aarch64 crc extensions supported" is printed when building the package, specifically when dpkg-buildpackage is running "cmake".
Updated by Kefu Chai over 6 years ago
> If I disable the CRC check in the messenger (in ceph.conf), the ceph client (ceph -s) immediately crashes and produces a crash dump.
Could you install "librados-dbg" and post the backtrace of the coredump?
Updated by Josh Durgin over 6 years ago
- Status changed from New to Need More Info
Updated by Peter Woodman over 6 years ago
I can confirm that the patch offered in https://github.com/ceph/ceph/pull/17420/ fixes the problem. It looks like without that, all ceph daemons were setting all CRCs to zero on arm64, which worked as long as all parts of the system behaved that way.
Did my own build with that patch and now it's working fine cross-arch.
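The behaviour described above (an all-zero CRC goes unnoticed until a peer with a working CRC joins) can be illustrated with a small simulation. This is a sketch under stated assumptions, not Ceph's code: it uses a textbook bitwise CRC-32C, whose seed and finalization may differ from what the messenger uses, and `broken_crc` and `accepts` are hypothetical stand-ins for the faulty arm64 build and for the receiver's header check.

```python
def crc32c(data):
    """Textbook bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def broken_crc(data):
    """Stand-in for the faulty build described above: always yields 0."""
    return 0


def accepts(sender_crc, receiver_crc, header):
    """The receiver accepts a header only if its own CRC matches the sender's."""
    return sender_crc(header) == receiver_crc(header)


header = b"123456789"
# Same implementation on both ends: headers are accepted, even if both are broken.
print(accepts(broken_crc, broken_crc, header))
print(accepts(crc32c, crc32c, header))
# Mixed implementations: the CRCs disagree and the header is rejected.
print(accepts(crc32c, broken_crc, header))
```

This matches the observation in the thread that an all-arm64 deployment appeared healthy while an x86_64 client could not talk to it.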
Updated by Kefu Chai over 6 years ago
- Status changed from Need More Info to Resolved
- Assignee set to Kefu Chai
Updated by Kefu Chai over 6 years ago
- Status changed from Resolved to Pending Backport
- Backport changed from jewel to jewel,luminous
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21795: luminous: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21796: jewel: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added
Updated by Nathan Cutler about 6 years ago
- Status changed from Pending Backport to Resolved
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (msgr)