Bug #22952

closed

Monitor stopped responding after a while

Added by Frank Li about 6 years ago. Updated about 6 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
Peering
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a crash of ceph-mon in 12.2.2, and while using a private build provided by the Ceph developers, the ceph-mon daemons would come up and
form a quorum, but after a while the monitors would stop responding and no quorum was formed.

Please see the related ticket for details:
https://tracker.ceph.com/issues/22847


Files

dl1approd-mon.dl1-kaf101.log.prob.2.gz (902 KB): monitor log from when it was restarted until it went into a non-responding state. Frank Li, 02/07/2018 06:05 PM
dl1approd-mon.dl1-kaf101.log.prob.1.gz (960 KB): Frank Li, 02/07/2018 06:05 PM
Actions #1

Updated by Frank Li about 6 years ago

Here is where the first mon server is stuck; running mon_status hangs:
[root@dl1-kaf101 frli]# ceph --admin-daemon /var/run/ceph/dl1approd-mon.dl1-kaf101.asok mon_status
^CTraceback (most recent call last):
  File "/bin/ceph", line 1122, in <module>
    retval = main()
  File "/bin/ceph", line 822, in main
    done, ret = maybe_daemon_command(parsed_args, childargs)
  File "/bin/ceph", line 689, in maybe_daemon_command
    raw_write(admin_socket(sockpath, childargs, parsed_args.output_format))
  File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 83, in admin_socket
    ret = do_sockio(asok_path, json.dumps(valid_dict).encode('utf-8'))
  File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 46, in do_sockio
    len_str = sock.recv(4)
KeyboardInterrupt
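As a side note, when a monitor is wedged like this the admin-socket query itself blocks forever; wrapping it in a timeout avoids having to interrupt it by hand. A minimal sketch, assuming the coreutils timeout command and the same asok path as above:

# fail after 10 seconds instead of hanging indefinitely on a stuck mon
timeout 10 ceph --admin-daemon /var/run/ceph/dl1approd-mon.dl1-kaf101.asok mon_status \
  || echo "mon_status did not answer within 10s"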

The first three monitors, kaf101, kaf201, and kaf301, stopped responding to mon_status; only kaf302 and kaf401 are responding, and they are in electing mode:
[root@dl1-kaf401 frli]# ceph --admin-daemon /var/run/ceph/dl1approd-mon.*.asok mon_status
{
    "name": "dl1-kaf401",
    "rank": 4,
    "state": "electing",
    "election_epoch": 35751,
    "quorum": [],
    "features": {
        "required_con": "153140804152475648",
        "required_mon": [
            "kraken",
            "luminous"
        ],
        "quorum_con": "2305244844532236283",
        "quorum_mon": [
            "kraken",
            "luminous"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],

Actions #2

Updated by Greg Farnum about 6 years ago

  • Assignee set to Kefu Chai
Actions #3

Updated by Frank Li about 6 years ago

I reproduced the issue in a separate cluster. It seems that whichever ceph-mon becomes the leader gets stuck: I modified the monmap to have only one ceph-mon in it, and as soon as that mon came up, it stopped responding.
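For reference, the usual way to trim a monmap down to a single monitor is the standard ceph-mon/monmaptool workflow; a rough sketch (the mon ID and monitor names below are placeholders, not necessarily the exact steps taken here):

# stop the monitor before touching its store
systemctl stop ceph-mon@<id>

# extract the current monmap from the mon's data store
# (add --cluster <name> if the cluster name is not the default "ceph")
ceph-mon -i <id> --extract-monmap /tmp/monmap

# inspect it, then drop every monitor except the one to keep
monmaptool --print /tmp/monmap
monmaptool --rm <other-mon-name> /tmp/monmap

# inject the edited monmap back and restart the daemon
ceph-mon -i <id> --inject-monmap /tmp/monmap
systemctl start ceph-mon@<id>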

Actions #4

Updated by Frank Li about 6 years ago

Frank Li wrote:

I reproduced the issue in a separate cluster. It seems that whichever ceph-mon becomes the leader gets stuck: I modified the monmap to have only one ceph-mon in it, and as soon as that mon came up, it stopped responding.

The logs, even at debug mon=20, stop at this point, and the ceph-mon process is running at 100% CPU:

2018-02-09 20:38:21.220584 7fdfb8a24e00  1 mon.dc10-kaf03-beta1@-1(probing).paxosservice(auth 2751..2912) refresh upgraded, format 0 -> 2
2018-02-09 20:38:21.223903 7fdfb8a24e00  0 mon.dc10-kaf03-beta1@-1(probing) e5  my rank is now 0 (was -1)
2018-02-09 20:38:21.223929 7fdfb8a24e00  1 mon.dc10-kaf03-beta1@0(probing) e5 win_standalone_election
2018-02-09 20:38:21.223951 7fdfb8a24e00  1 mon.dc10-kaf03-beta1@0(probing).elector(217) init, last seen epoch 217, mid-election, bumping
2018-02-09 20:38:21.224982 7fdfb8a24e00  0 log_channel(cluster) log [INF] : mon.dc10-kaf03-beta1 is new leader, mons dc10-kaf03-beta1 in quorum (ranks 0)
2018-02-09 20:38:21.225101 7fdfb8a24e00  0 log_channel(cluster) log [DBG] : monmap e5: 1 mons at {dc10-kaf03-beta1=10.106.238.38:6789/0}
2018-02-09 20:38:21.225192 7fdfb8a24e00  0 log_channel(cluster) log [DBG] : fsmap
2018-02-09 20:38:21.225276 7fdfb8a24e00  0 log_channel(cluster) log [DBG] : osdmap e25992: 48 total, 48 up, 48 in
2018-02-09 20:38:21.228986 7fdfb8a24e00  0 log_channel(cluster) log [DBG] : mgrmap e502: no daemons active
2018-02-09 20:38:21.237630 7fdf81ffb700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2-673-gf04a4a3/rpm/el7/BUILD/ceph-12.2.2-673-gf04a4a3/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 3] Generated table #123763: 468 keys, 3471545 bytes
2018-02-09 20:38:21.237652 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701237639, "cf_name": "default", "job": 3, "event": "table_file_creation", "file_number": 123763, "file_size": 3471545, "table_properties": {"data_size": 3446805, "index_size": 13435, "filter_size": 10382, "raw_key_size": 10225, "raw_average_key_size": 21, "raw_value_size": 3430273, "raw_average_value_size": 7329, "num_data_blocks": 393, "num_entries": 468, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
2018-02-09 20:38:21.237672 7fdf81ffb700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2-673-gf04a4a3/rpm/el7/BUILD/ceph-12.2.2-673-gf04a4a3/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 3] Compacted 4@0 + 1@1 files to L1 => 3471545 bytes
2018-02-09 20:38:21.238195 7fdf81ffb700  4 rocksdb: (Original Log Time 2018/02/09-20:38:21.238104) [/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2-673-gf04a4a3/rpm/el7/BUILD/ceph-12.2.2-673-gf04a4a3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1 max bytes base 268435456 files[0 1 10 1 0 0 0] max score 0.01, MB/sec: 171.5 rd, 171.1 wr, level 1, files in(4, 1) out(1) MB in(0.1, 3.2) out(3.3), read-write-amplify(85.5) write-amplify(42.7) OK, records in:
2018-02-09 20:38:21.238201 7fdf81ffb700  4 rocksdb: (Original Log Time 2018/02/09-20:38:21.238122) EVENT_LOG_v1 {"time_micros": 1518208701238114, "job": 3, "event": "compaction_finished", "compaction_time_micros": 20290, "output_level": 1, "num_output_files": 1, "total_output_size": 3471545, "num_input_records": 494, "num_output_records": 468, "num_subcompactions": 1, "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 10, 1, 0, 0, 0]}
2018-02-09 20:38:21.238274 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701238270, "job": 3, "event": "table_file_deletion", "file_number": 123758}
2018-02-09 20:38:21.238315 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701238313, "job": 3, "event": "table_file_deletion", "file_number": 123755}
2018-02-09 20:38:21.238361 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701238359, "job": 3, "event": "table_file_deletion", "file_number": 123752}
2018-02-09 20:38:21.238395 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701238394, "job": 3, "event": "table_file_deletion", "file_number": 123746}
2018-02-09 20:38:21.239303 7fdf81ffb700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1518208701239301, "job": 3, "event": "table_file_deletion", "file_number": 123744}
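When a ceph-mon spins at 100% CPU with no further log output, a thread backtrace is usually the quickest way to see where it is looping. A rough sketch, assuming gdb and the matching ceph debuginfo packages are available on the node, and that only one ceph-mon runs per host:

# find the busiest thread of the spinning monitor
top -H -p $(pidof ceph-mon)

# dump backtraces of all threads without stopping the daemon for long
gdb -batch -ex 'thread apply all bt' -p $(pidof ceph-mon) > /tmp/ceph-mon-backtrace.txt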
Actions #5

Updated by Frank Li about 6 years ago

Frank Li wrote:

Frank Li wrote:

I reproduced the issue in a separate cluster. It seems that whichever ceph-mon becomes the leader gets stuck: I modified the monmap to have only one ceph-mon in it, and as soon as that mon came up, it stopped responding.

Also, it seems the cluster will stay up, but as soon as I run a command like ceph osd unset nodown, or just start up an OSD, it stops working. Is there any way I can get a stable version of ceph-mon with just the patch for the issue at
https://tracker.ceph.com/issues/22847 ?


Actions #6

Updated by Kefu Chai about 6 years ago

  • Related to Bug #22847: ceph osd force-create-pg cause all ceph-mon to crash and unable to come up again added
Actions #7

Updated by Kefu Chai about 6 years ago

I reproduced the issue in a separate cluster

Could you share the steps to reproduce this issue, so I can try it out in my testbed?

Is there any way I can get a stable version of ceph-mon with just the patch for the issue at:

What do you mean by "stable version"? I can help get you a build that is 12.2.2 + https://github.com/ceph/ceph/pull/20267 or 12.2.3 (luminous-rc) + https://github.com/ceph/ceph/pull/20267.

Actions #8

Updated by Frank Li about 6 years ago

Kefu Chai wrote:

I reproduced the issue in a separate cluster

Could you share the steps to reproduce this issue, so I can try it out in my testbed?

I had another Ceph cluster with almost identical hardware and OS versions that was also running 12.2.2, and it was in a healthy state. I then ran the same ceph osd force-create-pg <non-existing pg> command that killed the ceph-mon daemons, and then applied the private build supplied in the previous bug:
https://tracker.ceph.com/issues/22847

After that, the cluster stopped working: all the OSDs would report as up even though some were stopped, and the active ceph-mon would stay up until either I restarted an OSD or I ran a command like ceph osd unset noout, at which point it would become non-responsive.

after that, the hea
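Pieced together, the reproduction sequence is roughly the following (a sketch only; <pgid> stands for a placement group ID that does not exist in the cluster):

# on a healthy 12.2.2 cluster, force-create a non-existent PG;
# this crashes the ceph-mon daemons (see #22847)
ceph osd force-create-pg <pgid>

# install the patched private build, restart the monitors, then issue
# any map-changing command, e.g.
ceph osd unset noout    # the leader mon stops responding at this point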

Is there any way I can get a stable version of ceph-mon with just the patch for the issue at:

What do you mean by "stable version"? I can help get you a build that is 12.2.2 + https://github.com/ceph/ceph/pull/20267 or 12.2.3 (luminous-rc) + https://github.com/ceph/ceph/pull/20267.

What I meant is that maybe the private build had more than the intended changes, and that's why it is not working?
I was hoping the official 12.2.3 release would include this fix, but if not, then I think a 12.2.3 RC + this fix would be good.
Thanks.

Actions #9

Updated by Frank Li about 6 years ago

either 12.2.2 + the patch or 12.2.3 RC + the patch would be good, whichever is more convenient to build.

Actions #10

Updated by Frank Li about 6 years ago

Frank Li wrote:

either 12.2.2 + the patch or 12.2.3 RC + the patch would be good, whichever is more convenient to build.

Hi, please let me know if/when I might get this kind of build. The cluster has been down for a week now, and I would really like to avoid a full data loss. Thanks.

Actions #11

Updated by Kefu Chai about 6 years ago

Frank, sorry for the delay; I am just back from the holiday. I pushed 12.2.3 + https://github.com/ceph/ceph/pull/20267 to https://shaman.ceph.com/builds/ceph/wip-22942-luminous/; the build will be available in around 1.5 hours.

Actions #12

Updated by Frank Li about 6 years ago

Kefu Chai wrote:

Frank, sorry for the delay; I am just back from the holiday. I pushed 12.2.3 + https://github.com/ceph/ceph/pull/20267 to https://shaman.ceph.com/builds/ceph/wip-22942-luminous/; the build will be available in around 1.5 hours.

Great! Happy Chinese New Year to you!

Actions #13

Updated by Frank Li about 6 years ago

Thanks, with 12.2.3 + this patch, the cluster is now back to the HEALTH_OK state.
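For completeness, the standard way to confirm the monitors have settled (nothing specific to this build):

# overall cluster health and monitor quorum membership
ceph -s
ceph quorum_status --format json-pretty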

Actions #14

Updated by Kefu Chai about 6 years ago

  • Related to deleted (Bug #22847: ceph osd force-create-pg cause all ceph-mon to crash and unable to come up again)
Actions #15

Updated by Kefu Chai about 6 years ago

  • Status changed from New to Duplicate

Great! I am marking this ticket as a duplicate. Please reopen it if you think otherwise.

Happy Chinese New Year to you as well!
