Bug #19705

closed

Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster

Added by Jay Zhu almost 7 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags: amd64 client can't discover the arm64 cluster
Backport: jewel,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, I need help.

After upgrading my cluster from Jewel 10.2.3 (which did not have this issue) to Jewel 10.2.6, I found that my client (Ubuntu 16.04 amd64) can no longer discover my Ceph cluster (Ubuntu 16.04 arm64).

My cluster configuration:

Node    username   os                 machine
deploy  cephadmin  ubuntu16.04 amd64  x86 PC(test client)
node1   cephadmin  ubuntu16.04 arm64  ARM dev board
node2   cephadmin  ubuntu16.04 arm64  ARM dev board
node3   cephadmin  ubuntu16.04 arm64  ARM dev board

run 'ceph -v' on all nodes

cephadmin@node1:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node2:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@node3:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
cephadmin@deploy:~$ ceph -v
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)

run 'ceph -s' on each cluster node

cephadmin@node3:~$ ceph status
cluster 61b4956e-692f-4272-8a12-997538381feb
health HEALTH_OK
monmap e1: 3 mons at {node1=172.12.55.209:6789/0,node2=172.12.55.213:6789/0,node3=172.12.55.210:6789/0}
election epoch 4, quorum 0,1,2 node1,node3,node2
osdmap e14: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v38: 64 pgs, 1 pools, 0 bytes data, 0 objects
101 MB used, 5571 GB / 5571 GB avail
64 active+clean

run 'ceph -s' on client node

cephadmin@deploy:~$ ceph status
2017-04-07 14:07:38.223860 7f2b28816700 0 monclient(hunting): authenticate timed out after 300
2017-04-07 14:07:38.223909 7f2b28816700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
cephadmin@deploy:~$
cephadmin@deploy:~$
cephadmin@deploy:~$ rbd create --size 1024 foo -m 172.12.55.209 --image-feature layering
2017-04-07 15:59:21.398973 7fa5f6511100 0 monclient(hunting): authenticate timed out after 300
2017-04-07 15:59:21.399030 7fa5f6511100 0 librados: client.admin authentication error (110) Connection timed out
rbd: couldn't connect to the cluster!

None of the nodes runs a firewall. I have since upgraded to Jewel 10.2.7, but the issue remains. I am not sure which version introduced this interoperability problem between an amd64 client and an arm64 cluster, but it was surely somewhere between 10.2.3 and 10.2.6.



Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #21795: luminous: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster (Resolved, Nathan Cutler)
Copied to Ceph - Backport #21796: jewel: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster (Resolved, Nathan Cutler)
Actions #1

Updated by Nathan Cutler almost 7 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
  • Regression set to No
  • Severity set to 3 - minor
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Priority changed from High to Normal

If you reproduce this while the clients and servers have "debug ms = 10" set, the logs will contain a lot of developer debugging that we can use to see which part of the connection is failing.

I would check and make sure that your configs are compatible, though (eg, that the client has a keyring which the cluster is willing to accept, and that they both agree on the required cephx security levels).
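For reference, a minimal sketch of how the "debug ms = 10" setting mentioned above could look in ceph.conf. Placing it in the [global] section, so that daemons and clients on that host both pick it up, is an assumption here and not something stated in this thread.

```ini
[global]
# Verbose messenger debug logging, as requested above.
# Assumption: [global] is used so that it applies to mons, OSDs and clients alike.
debug ms = 10
```

Running daemons generally need to be restarted (or have the option injected at runtime) before the new level takes effect, and the setting is best removed again afterwards because the logs grow quickly.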

Actions #4

Updated by Greg Farnum almost 7 years ago

We'd also need the logs from the server output. From the client-side one you provided it sure looks like it's connecting successfully and then getting booted because the authentication failed somehow, but I can't be sure.

Actions #5

Updated by Jay Zhu almost 7 years ago

+ ceph-mon.cc1.log

The client IP address is 172.12.55.166

Thanks

Actions #6

Updated by Greg Farnum almost 7 years ago

  • Category set to msgr

The monitor node is printing out a lot of

reader got bad header crc 0 != 3802876777

log messages after it receives a KEEPALIVE2 packet. I'd look to see if your config changed in some way so that the client has disabled message CRCs.

Otherwise we do indeed have a bug across the architectures, but I don't see anything in the patches between those two versions that should have caused it.
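To get a feel for how often the message above occurs and which values are involved, something like the following Python sketch could be run over a monitor log. The function name summarize_bad_crcs and the sample lines are hypothetical; only the message text itself is taken from this thread, and the sketch deliberately does not assume which side of "!=" is the locally computed value.

```python
import re
from collections import Counter

# Matches the messenger log message quoted above, e.g.
# "reader got bad header crc 0 != 3802876777"
BAD_CRC = re.compile(r"reader got bad header crc (\d+) != (\d+)")


def summarize_bad_crcs(lines):
    """Count each (left, right) CRC pair reported in the given log lines."""
    pairs = Counter()
    for line in lines:
        match = BAD_CRC.search(line)
        if match:
            pairs[(int(match.group(1)), int(match.group(2)))] += 1
    return pairs


if __name__ == "__main__":
    # Hypothetical sample input: the message from this thread plus an unrelated line.
    sample = [
        "reader got bad header crc 0 != 3802876777",
        "some unrelated log line",
    ]
    for (left, right), count in summarize_bad_crcs(sample).items():
        note = " (one side is zero)" if 0 in (left, right) else ""
        print(f"{count} x crc {left} != {right}{note}")
```

A CRC that is consistently zero on one side, as in the line quoted above, hints that one peer is not computing the checksum at all rather than that data is being corrupted in transit.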

Actions #7

Updated by Jay Zhu almost 7 years ago

I used the ceph-deploy tool to set up my Ceph cluster, and ceph.conf has not been changed.

Actions #8

Updated by Sergey Ponomarev over 6 years ago

I reproduced this bug on Ceph 12.1.1 - 12.1.3.
Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64

This is a critical bug.

I am ready to help collect whatever information is needed to fix it.
What data do you need?

Actions #9

Updated by Nathan Cutler over 6 years ago

@Greg Farnum: This PR went into 10.2.6 and mentions the CRC check -> https://github.com/ceph/ceph/pull/13131

Actions #10

Updated by Nathan Cutler over 6 years ago

Sergey, I pushed a branch that is v10.2.7 plus revert of PR#13131 - packages are now building.

Are you willing/able to test this branch? (If you need some other Jewel point version I can re-do it.)

Actions #11

Updated by Sergey Ponomarev over 6 years ago

I need a deb repository for Ubuntu Xenial.
I can check tomorrow and report the results.
If the problem is fixed, will you fix the problem in 12th branch?

Actions #12

Updated by Sergey Ponomarev over 6 years ago

Sergey Ponomarev wrote:

I need a deb repository for Ubuntu Xenial.
I can check tomorrow and report the results.
If the problem is fixed, will you fix the problem in 12th branch?

I need a deb repository for Ubuntu Xenial ARM64 & X86_64 platform.

Actions #13

Updated by Nathan Cutler over 6 years ago

If the problem is fixed, will you fix the problem in 12th branch?

Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?

Regarding the repos, the Xenial x86_64 will appear here - https://shaman.ceph.com/builds/ceph/wip-19705/ebbe7e008538300586821ba29c24732ea7e12521/default/69764/ - but ARM repos are not currently built (that I know of). I'll see what can be done about that.

Actions #14

Updated by Sergey Ponomarev over 6 years ago

Nathan Cutler wrote:

Are you saying that the bug is also present in Luminous (v12.x.y)?

Yes, the 12.x branch has the same problem.
As I wrote above:
I reproduced this bug on Ceph 12.1.1 - 12.1.3.
Ubuntu 16.04.03, Kernel 4.4.8, Mon & OSD - ARM64, Client - x86_64

The x86_64 Ceph client cannot discover the ARM64 cluster (connection timed out).
If I disable the message CRC check (in ceph.conf), the ceph client (ceph -s) immediately crashes with a crash dump.

Actions #15

Updated by Sergey Ponomarev over 6 years ago

Nathan Cutler wrote:

If the problem is fixed, will you fix the problem in 12th branch?

Not sure what you mean? This bug is against Jewel v10.2.6/v10.2.7. The next jewel point release will be v10.2.10. Are you saying that the bug is also present in Luminous (v12.x.y)?

Would it make sense to open a separate bug report for Ceph version 12?
A fresh installation of version 12 shows this problem between ARM64 and x86_64 on Ubuntu Xenial from the start.

Actions #16

Updated by Nathan Cutler over 6 years ago

  • Priority changed from Normal to Urgent
  • Release set to master
  • Release set to luminous
  • Affected Versions v12.2.0 added

Raising priority - x86_64 clients should be able to connect to ARM clusters

Actions #17

Updated by Sergey Ponomarev over 6 years ago

Nathan Cutler wrote:

Raising priority - x86_64 clients should be able to connect to ARM clusters

What information do you need from the cluster and the client to fix the error in Ceph 12.2.0 (command output, debug levels, etc.)?

I have a cluster built on Ceph 12.2.0 on ARM64 (Ubuntu 16.04.3) with x86_64 clients (Ubuntu 16.04.3).

Actions #18

Updated by Nathan Cutler over 6 years ago

Sergey Ponomarev wrote:

If I disable the message CRC check (in ceph.conf), the ceph client (ceph -s) immediately crashes with a crash dump.

So the bug only reproduces with "ms nocrc" in ceph.conf?

Actions #19

Updated by Nathan Cutler over 6 years ago

It could be a regression caused by http://tracker.ceph.com/issues/17575 (which went into Jewel 10.2.4). The master fix went into Kraken (v11.1.0) and was backported to Jewel.

Actions #20

Updated by Kefu Chai over 6 years ago

https://github.com/ceph/ceph/pull/17420 might help.

But I would like to know whether "aarch64 crc extensions supported" is printed when building the package, specifically when dpkg-buildpackage runs "cmake".

Actions #21

Updated by Kefu Chai over 6 years ago

If I disable the message CRC check (in ceph.conf), the ceph client (ceph -s) immediately crashes with a crash dump.

Could you install "librados-dbg" and post the backtrace of the coredump?

Actions #22

Updated by Nathan Cutler over 6 years ago

  • Backport set to jewel
Actions #23

Updated by Josh Durgin over 6 years ago

  • Status changed from New to Need More Info
Actions #24

Updated by Peter Woodman over 6 years ago

I can confirm that the patch offered in https://github.com/ceph/ceph/pull/17420/ fixes the problem. It looks like without that, all ceph daemons were setting all CRCs to zero on arm64, which worked as long as all parts of the system behaved that way.

Did my own build with that patch and now it's working fine cross-arch.
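The behaviour described above (zero CRCs working as long as every peer behaves the same way) can be illustrated with a small Python sketch. It uses the standard CRC-32C (Castagnoli) as a stand-in and is not claimed to match Ceph's exact seeding or wire format; broken_crc32c, send and accept are hypothetical names that model a build whose CRC routine always returns zero.

```python
def crc32c(data: bytes) -> int:
    """Standard bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def broken_crc32c(data: bytes) -> int:
    """Hypothetical stand-in for a build whose CRC routine always yields zero."""
    return 0


def send(payload: bytes, crc_fn):
    """Sender attaches the checksum computed by its own CRC routine."""
    return payload, crc_fn(payload)


def accept(message, crc_fn) -> bool:
    """Receiver recomputes the checksum with its own routine and compares."""
    payload, crc = message
    return crc_fn(payload) == crc


if __name__ == "__main__":
    header = b"123456789"  # arbitrary example payload
    peers = {"working": crc32c, "zero-crc": broken_crc32c}
    for sender_name, sender_fn in peers.items():
        for receiver_name, receiver_fn in peers.items():
            ok = accept(send(header, sender_fn), receiver_fn)
            print(f"{sender_name} -> {receiver_name}: {'accepted' if ok else 'rejected'}")
```

Two peers that both produce zero accept each other's messages, so a homogeneous cluster of such builds looks healthy; the mismatch only surfaces once a peer with a working CRC joins the conversation.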

Actions #25

Updated by Kefu Chai over 6 years ago

  • Status changed from Need More Info to Resolved
  • Assignee set to Kefu Chai
Actions #26

Updated by Kefu Chai over 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from jewel to jewel,luminous
Actions #27

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21795: luminous: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added
Actions #28

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21796: jewel: Ubuntu amd64 client can not discover the ubuntu arm64 ceph cluster added
Actions #29

Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #30

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)