Bug #19705

Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster

Added by Jay Zhu 8 months ago. Updated 2 months ago.

Status: Pending Backport
Priority: Urgent
Assignee:
Category: msgr
Target version: -
Start date: 04/20/2017
Due date:
% Done: 0%
Source:
Tags: amd64 client can't discover the arm64 cluster
Backport: jewel,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release: jewel, luminous, master
Needs Doc: No

Description

Hi help,

After upgrading my cluster from Jewel 10.2.3 (which did not have this issue) to Jewel 10.2.6, I found that my client (Ubuntu 16.04 amd64) cannot discover my Ceph cluster (Ubuntu 16.04 arm64).

My cluster configuration:

Node    username   os                 machine
deploy  cephadmin  ubuntu16.04 amd64  x86 PC(test client)
node1   cephadmin  ubuntu16.04 arm64  ARM dev board
node2   cephadmin  ubuntu16.04 arm64  ARM dev board
node3   cephadmin  ubuntu16.04 arm64  ARM dev board

run 'ceph -v' on all nodes

cephadmin@node1:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node2:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node3:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@deploy:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)

run 'ceph -s' on each cluster node

cephadmin@node3:~$ ceph status
cluster 61b4956e-692f-4272-8a12-997538381feb
health HEALTH_OK
monmap e1: 3 mons at {node1=172.12.55.209:6789/0,node2=172.12.55.213:6789/0,node3=172.12.55.210:6789/0}
election epoch 4, quorum 0,1,2 node1,node3,node2
osdmap e14: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v38: 64 pgs, 1 pools, 0 bytes data, 0 objects
101 MB used, 5571 GB / 5571 GB avail
64 active+clean

run 'ceph -s' on client node

cephadmin@deploy:~$ ceph status
2017-04-07 14:07:38.223860 7f2b28816700 0 monclient(hunting): authenticate timed out after 300
2017-04-07 14:07:38.223909 7f2b28816700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
cephadmin@deploy:~$
cephadmin@deploy:~$
cephadmin@deploy:~$ rbd create --size 1024 foo -m 172.12.55.209 --image-feature layering
2017-04-07 15:59:21.398973 7fa5f6511100 0 monclient(hunting): authenticate timed out after 300
2017-04-07 15:59:21.399030 7fa5f6511100 0 librados: client.admin authentication error (110) Connection timed out
rbd: couldn't connect to the cluster!

None of the nodes has a firewall. I have also upgraded to Jewel 10.2.7, but the issue remains. I am not sure in which version this interoperability issue between the amd64 client and the arm64 cluster was introduced (surely somewhere between 10.2.3 and 10.2.6).

normal print when run _ceph -s_ on cluster node.txt View (22 KB) Jay Zhu, 05/17/2017 10:36 AM

connection time out when run _ceph -s_ on client node.txt View (588 KB) Jay Zhu, 05/17/2017 10:36 AM

ceph-mon.cc1.log.zip (93.9 KB) Jay Zhu, 05/18/2017 02:57 AM


Related issues

Copied to Ceph - Backport #21795: luminous: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster Resolved
Copied to Ceph - Backport #21796: jewel: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster In Progress

History

#1 Updated by Nathan Cutler 8 months ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
  • Regression set to No
  • Severity set to 3 - minor

#2 Updated by Greg Farnum 8 months ago

  • Priority changed from High to Normal

If you reproduce this while the clients and servers have "debug ms = 10" set, the logs will contain a lot of developer debugging that we can use to see which part of the connection is failing.

I would check and make sure that your configs are compatible, though (eg, that the client has a keyring which the cluster is willing to accept, and that they both agree on the required cephx security levels).
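The config comparison suggested above can be sketched as a small script. This is only an illustration, not a Ceph tool: the helper names (`global_options`, `auth_mismatches`) and the sample config values are made up for the example, and it assumes both ceph.conf files keep their cephx settings in the [global] section.

```python
import configparser

# Standard cephx-related ceph.conf options; Ceph accepts them written with
# spaces or underscores, so keys are normalised to underscores before comparing.
AUTH_KEYS = ("auth_cluster_required", "auth_service_required", "auth_client_required")


def global_options(conf_text: str) -> dict:
    """Return the [global] options of a ceph.conf with normalised key names."""
    parser = configparser.ConfigParser(
        delimiters=("=",), interpolation=None, strict=False
    )
    parser.read_string(conf_text)
    if not parser.has_section("global"):
        return {}
    return {
        "_".join(key.replace("_", " ").split()): value.strip()
        for key, value in parser.items("global")
    }


def auth_mismatches(client_conf: str, server_conf: str) -> dict:
    """Map each auth option that differs to its (client, server) values."""
    client = global_options(client_conf)
    server = global_options(server_conf)
    return {
        key: (client.get(key), server.get(key))
        for key in AUTH_KEYS
        if client.get(key) != server.get(key)
    }


# Hypothetical sample configs, invented for illustration only.
CLIENT_CONF = """
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = none
"""

SERVER_CONF = """
[global]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
"""
```

Calling `auth_mismatches(CLIENT_CONF, SERVER_CONF)` on the invented samples reports only the `auth_client_required` difference. An empty result means the two files agree on the three options, which would point away from a cephx configuration mismatch.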

#3 Updated by Jay Zhu 7 months ago

Thanks for your response. Can the following logs help you check this issue?

#4 Updated by Greg Farnum 7 months ago

We'd also need the logs from the server output. From the client-side one you provided it sure looks like it's connecting successfully and then getting booted because the authentication failed somehow, but I can't be sure.

#5 Updated by Jay Zhu 7 months ago

+ ceph-mon.cc1.log

The client IP address is 172.12.55.166

Thanks

#6 Updated by Greg Farnum 6 months ago

  • Category set to msgr

The monitor node is printing out a lot of

reader got bad header crc 0 != 3802876777

log messages after it receives a KEEPALIVE2 packet. I'd look to see if your config changed in some way so that the client has disabled message CRCs.

Otherwise we do indeed have a bug across the architectures, but I don't see anything in the patches between those two versions that should have caused it.
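To get a feel for how often the monitor logs this message, the log can be scanned for it. The sketch below matches only the message text quoted above; the function name `bad_header_crcs` is hypothetical, and the code makes no assumption about which of the two numbers is the received value and which is the locally computed one.

```python
import re
from collections import Counter

# Matches the message quoted above: "reader got bad header crc <left> != <right>".
BAD_CRC = re.compile(r"reader got bad header crc (\d+) != (\d+)")


def bad_header_crcs(lines):
    """Count each (left, right) CRC pair found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = BAD_CRC.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


def summarize(path):
    """Scan a log file and return the pair counts (path is supplied by the caller)."""
    with open(path, errors="replace") as handle:
        return bad_header_crcs(handle)
```

If one side of every pair is 0, as in the line quoted above, that suggests one end of the connection is not producing a real checksum at all, rather than the two ends disagreeing about a corrupted packet.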

#7 Updated by Jay Zhu 6 months ago

I use the ceph-deploy tool to setup my ceph cluster, and the ceph.conf has not been changed.

#8 Updated by Sergey Ponomarev 4 months ago

I reproduced this bug on Ceph 12.1.1 - 12.1.3.
Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64

It is a critical bug.

I am ready to help collect all the information needed to fix the bug.
What data do you need me to collect?

#9 Updated by Nathan Cutler 4 months ago

@Greg: This PR went into 10.2.6 and mentions the CRC check -> https://github.com/ceph/ceph/pull/13131

#10 Updated by Nathan Cutler 4 months ago

Sergey, I pushed a branch that is v10.2.7 plus revert of PR#13131 - packages are now building.

Are you willing/able to test this branch? (If you need some other Jewel point version I can re-do it.)

#11 Updated by Sergey Ponomarev 4 months ago

I need a deb repository for Ubuntu Xenial.
I can check tomorrow and report the results.
If the problem is fixed, will you fix the problem in the 12th branch?

#12 Updated by Sergey Ponomarev 4 months ago

Sergey Ponomarev wrote:

I need a deb repository for Ubuntu Xenial.
I can check tomorrow and report the results.
If the problem is fixed, will you fix the problem in the 12th branch?

I need a deb repository for Ubuntu Xenial ARM64 & X86_64 platform.

#13 Updated by Nathan Cutler 4 months ago

If the problem is fixed, will you fix the problem in the 12th branch?

Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?

Regarding the repos, the Xenial x86_64 will appear here - https://shaman.ceph.com/builds/ceph/wip-19705/ebbe7e008538300586821ba29c24732ea7e12521/default/69764/ - but ARM repos are not currently built (that I know of). I'll see what can be done about that.

#14 Updated by Sergey Ponomarev 4 months ago

Nathan Cutler wrote:

Are you saying that the bug is also present in Luminous (v12.x.y)?

Yes, the same problem exists in branch 12.
As I wrote above
I reproduced this bug on Ceph 12.1.1 - 12.1.3.
Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64

The x86_64 Ceph client cannot discover the ARM64 cluster (connection timed out).
If I disable the message CRC check (in ceph.conf), the Ceph client (ceph -s) immediately crashes with a crash dump.

#15 Updated by Sergey Ponomarev 4 months ago

Nathan Cutler wrote:

If the problem is fixed, will you fix the problem in the 12th branch?

Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?

Does it make sense to create a separate bug report for Ceph version 12?
A fresh installation of version 12 has this problem between ARM64 and x86_64 on Ubuntu Xenial from the start.

#16 Updated by Nathan Cutler 4 months ago

  • Priority changed from Normal to Urgent
  • Affected Versions v12.2.0 added
  • Release luminous, master added

Raising priority - x86_64 clients should be able to connect to ARM clusters

#17 Updated by Sergey Ponomarev 4 months ago

Nathan Cutler wrote:

Raising priority - x86_64 clients should be able to connect to ARM clusters

What information do you need to gather from the cluster and the client to fix the error in Ceph 12.2.0?
Command output, debugging level? etc.

I have a cluster built on Ceph 12.2.0, ARM64 (Ubuntu 16.04.3) and clients x86_64 (Ubuntu 16.04.3)

#18 Updated by Nathan Cutler 4 months ago

Sergey Ponomarev wrote:

If I disable the message CRC check (in ceph.conf), the Ceph client (ceph -s) immediately crashes with a crash dump.

So the bug only reproduces with "ms nocrc" in ceph.conf?

#19 Updated by Nathan Cutler 4 months ago

It could be a regression caused by http://tracker.ceph.com/issues/17575 (which went into Jewel 10.2.4). The master fix went into Kraken (v11.1.0) and was backported to Jewel.

#20 Updated by Kefu Chai 4 months ago

https://github.com/ceph/ceph/pull/17420 might help.

But I want to know whether "aarch64 crc extensions supported" is printed when building the package, specifically when dpkg-buildpackage runs "cmake".

#21 Updated by Kefu Chai 4 months ago

If I disable the message CRC check (in ceph.conf), the Ceph client (ceph -s) immediately crashes with a crash dump.

Could you install "librados-dbg" and post the backtrace of the core dump?

#22 Updated by Nathan Cutler 4 months ago

  • Backport set to jewel

#23 Updated by Josh Durgin 3 months ago

  • Status changed from New to Need More Info

#24 Updated by Peter Woodman 2 months ago

I can confirm that the patch offered in https://github.com/ceph/ceph/pull/17420/ fixes the problem. It looks like without that, all ceph daemons were setting all CRCs to zero on arm64, which worked as long as all parts of the system behaved that way.

Did my own build with that patch and now it's working fine cross-arch.
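The failure mode described here can be illustrated with a toy model. The sketch is not Ceph code: `crc32c` below is a plain bitwise CRC-32C (Castagnoli) with the usual initial and final inversion, which may differ from the seeding Ceph uses on the wire, and `broken_crc32c` and `header_accepted` are hypothetical names standing in for a build whose checksum routine always yields zero and for the receiver's header check.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def broken_crc32c(data: bytes) -> int:
    """Stand-in for the faulty build: every checksum comes out as zero."""
    return 0


def header_accepted(payload: bytes, sender_crc, receiver_crc) -> bool:
    """The receiver recomputes the checksum and compares it with the sent one."""
    sent_crc = sender_crc(payload)
    return receiver_crc(payload) == sent_crc


def compatibility_matrix(payload: bytes) -> dict:
    """Try every sender/receiver combination of working and broken builds."""
    builds = {"working": crc32c, "broken": broken_crc32c}
    return {
        (sender, receiver): header_accepted(payload, builds[sender], builds[receiver])
        for sender in builds
        for receiver in builds
    }
```

In this model two broken peers accept each other's headers because both sides compute zero, while any mix of a broken and a working peer is rejected. That matches the observation that an all-arm64 cluster looked healthy and only the x86_64 client failed.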

#25 Updated by Kefu Chai 2 months ago

  • Status changed from Need More Info to Resolved
  • Assignee set to Kefu Chai

#26 Updated by Kefu Chai 2 months ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from jewel to jewel,luminous

#27 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21795: luminous: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added

#28 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21796: jewel: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added
