
Bug #43587

mon shutdown timeout (race with async compaction)

Added by Sage Weil about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-01-12T23:37:46.434 INFO:teuthology.misc:Shutting down mon daemons...
2020-01-12T23:37:46.434 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2020-01-12T23:37:46.435 INFO:teuthology.orchestra.run:waiting for 300
2020-01-12T23:37:46.522 INFO:tasks.ceph.mon.a.smithi023.stderr:2020-01-12T23:37:46.499+0000 1633b700 -1 received  signal: Terminated from /usr/bin/python3 /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes 
--xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 28458) UID: 0
2020-01-12T23:37:46.534 INFO:tasks.ceph.mon.a.smithi023.stderr:2020-01-12T23:37:46.500+0000 1633b700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2020-01-12T23:37:50.475 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.6 has been restored
...long delay here...
2020-01-12T23:42:45.518 INFO:tasks.ceph:Checking cluster log for badness...

The mon log shows an async compaction already in flight when the signal arrived:
2020-01-12T23:37:46.135+0000 1af44700 20 mon.a@0(leader).paxosservice(mgrstat 266..281)  compacting prefix mgrstat
...
2020-01-12T23:37:46.200+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
...
2020-01-12T23:37:46.499+0000 1633b700 -1 received  signal: Terminated from /usr/bin/python3 /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 28458) UID: 0
...
2020-01-12T23:37:46.707+0000 4044f00  1 -- 172.21.15.23:0/28493 shutdown_connections 
2020-01-12T23:37:46.715+0000 4044f00  1 -- 172.21.15.23:0/28493 wait complete.
2020-01-12T23:37:46.835+0000 16b3c700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 60] Generated table #95: 1126 keys, 11338818 bytes
2020-01-12T23:37:46.835+0000 16b3c700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1578872266835807, "cf_name": "default", "job": 60, "event": "table_file_creation", "file_number": 95, "file_size": 11338818, "table_properties": {"data_size": 11312992, "index_size": 22120, "filter_size": 2885, "raw_key_size": 20527, "raw_average_key_size": 18, "raw_value_size": 11281722, "raw_average_value_size": 10019, "num_data_blocks": 698, "num_entries": 1126, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2020-01-12T23:37:46.837+0000 16b3c700  4 rocksdb: [db/compaction_job.cc:1395] [default] [JOB 60] Compacted 1@0 + 1@6 files to L6 => 11338818 bytes
2020-01-12T23:37:46.837+0000 16b3c700  3 rocksdb: [db/version_set.cc:2581] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
2020-01-12T23:37:46.838+0000 16b3c700  4 rocksdb: (Original Log Time 2020/01/12-23:37:46.838673) [db/compaction_job.cc:768] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 17.9 rd, 17.9 wr, level 6, files in(1, 1) out(1) MB in(0.1, 10.8) out(10.8), read-write-amplify(267.8) write-amplify(133.7) OK, records in: 1159, records dropped: 33 output_compression: NoCompression

2020-01-12T23:37:46.838+0000 16b3c700  4 rocksdb: (Original Log Time 2020/01/12-23:37:46.838978) EVENT_LOG_v1 {"time_micros": 1578872266838807, "job": 60, "event": "compaction_finished", "compaction_time_micros": 633653, "compaction_time_cpu_micros": 460034, "output_level": 6, "num_output_files": 1, "total_output_size": 11338818, "num_input_records": 1159, "num_output_records": 1126, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2020-01-12T23:37:46.839+0000 16b3c700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1578872266839776, "job": 60, "event": "table_file_deletion", "file_number": 94}
2020-01-12T23:37:46.840+0000 16b3c700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1578872266840859, "job": 60, "event": "table_file_deletion", "file_number": 92}
2020-01-12T23:37:46.840+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.840+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.840+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.840+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.840+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2020-01-12T23:37:46.841+0000 23753700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
...
hang

/a/sage-2020-01-12_21:37:03-rados-wip-sage-testing-2020-01-12-0621-distro-basic-smithi/4660785
description: rados/verify/{centos_latest.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml}
d-thrash/none.yaml msgr-failures/few.yaml msgr/async.yaml objectstore/bluestore-stupid.yaml
rados.yaml tasks/rados_cls_all.yaml validater/valgrind.yaml}
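The log above shows the shape of the race: the mon queues async "compact prefix" requests that a background thread services, SIGTERM-driven shutdown tears down the messenger, but queued manual compactions keep starting and the join stalls until teuthology's 300s timeout. Below is a minimal sketch of that pattern and the usual fix; this is illustrative code, not the actual Ceph/MonitorDBStore implementation or the PR's diff. The `CompactionQueue` class and its method names are invented for the example.

```python
# Illustrative model (NOT Ceph code) of a shutdown racing an async
# compaction queue: a worker thread drains queued "compact prefix"
# requests; shutdown() must set a stop flag and wake the worker, or
# the join can stall behind a burst of queued compactions / a worker
# parked waiting for more work -- the hang seen in the mon log.
import threading
from collections import deque

class CompactionQueue:
    def __init__(self):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.queue = deque()
        self.stopping = False
        self.compacted = []          # record of completed compactions
        self.worker = threading.Thread(target=self._entry, daemon=True)
        self.worker.start()

    def async_compact(self, prefix):
        with self.lock:
            if self.stopping:        # refuse new work once shutdown began
                return False
            self.queue.append(prefix)
            self.cond.notify()
            return True

    def _entry(self):
        self.lock.acquire()
        while True:
            while self.queue:
                prefix = self.queue.popleft()
                self.lock.release()  # run the compaction outside the lock,
                self.compacted.append(prefix)  # stand-in for CompactRange
                self.lock.acquire()
            if self.stopping:        # only exit once the queue is drained
                break
            self.cond.wait()
        self.lock.release()

    def shutdown(self):
        with self.lock:
            self.stopping = True     # without this flag, join() can hang
            self.cond.notify()
        self.worker.join()
```

Under this model, shutdown drains any already-queued compactions deterministically and rejects new ones, rather than letting requests pile up behind the terminating daemon.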

Related issues

Copied to RADOS - Backport #43620: nautilus: mon shutdown timeout (race with async compaction) Resolved

History

#1 Updated by Sage Weil about 4 years ago

  • Pull request ID set to 32619

#2 Updated by Sage Weil about 4 years ago

  • Status changed from New to Fix Under Review

#3 Updated by Sage Weil about 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus

#4 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43620: nautilus: mon shutdown timeout (race with async compaction) added

#5 Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
