Project

General

Profile

Bug #38027

osd/osd-backfill-space.sh fails

Added by Sage Weil 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
01/23/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2019-01-23T22:18:59.437 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:70: wait_for_state:  ((  9 >= 9  ))
2019-01-23T22:18:59.437 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:71: wait_for_state:  ceph pg dump pgs
2019-01-23T22:18:59.765 INFO:tasks.workunit.client.0.smithi065.stdout:PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES   LOG DISK_LOG STATE                                                STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP   
             LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:8.0         600                  0       82         0       0 2457600 100      100      active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:42.309805 127'600 155:1682  [3,15]          3    [3]              3        0'0 2019-01-23 22:
08:44.818149             0'0 2019-01-23 22:08:44.818149             0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:7.0         600                  0      600         0       0 2457600 100      100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:41.710519 127'600  155:667   [5,1]          5    [5]              5        0'0 2019-01-23 22:
08:41.931566             0'0 2019-01-23 22:08:41.931566             0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:2.0         600                  0      600         0       0 2457600 100      100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:34.343792 127'600  155:676  [17,5]         17   [17]             17        0'0 2019-01-23 22:
08:28.073752             0'0 2019-01-23 22:08:28.073752             0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:1.0         600                  0       53         0       0 2457600 100      100      active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:31.958326 127'600 155:1771   [8,0]          8    [8]              8        0'0 2019-01-23 22:
08:25.484862             0'0 2019-01-23 22:08:25.484862             0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:3.0         600                  0      600         0       0 2457600 100      100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:34.205933 127'600  155:678   [1,7]          1    [1]              1        0'0 2019-01-23 22:
08:30.734024             0'0 2019-01-23 22:08:30.734024             0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:4.0         600                  0       59         0       0 2457600 100      100      active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:34.931307 127'600 155:1747  [4,10]          4    [4]              4        0'0 2019-01-23 22:
08:33.316636             0'0 2019-01-23 22:08:33.316636             0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stdout:5.0         600                  0       66         0       0 2457600 100      100      active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:37.073688 127'600 155:1717 [12,19]         12   [12]             12        0'0 2019-01-23 22:
08:36.365619             0'0 2019-01-23 22:08:36.365619             0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stdout:6.0         600                  0      600         0       0 2457600 100      100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:38.238313 127'600  155:675   [7,4]          7    [7]              7        0'0 2019-01-23 22:
08:39.010010             0'0 2019-01-23 22:08:39.010010             0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stderr:dumped pgs
2019-01-23T22:18:59.775 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:72: wait_for_state:  return 1
2019-01-23T22:18:59.776 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:231: TEST_backfill_test_multi:  return 1
2019-01-23T22:18:59.776 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:37: run:  return 1

/a/sage-2019-01-23_18:09:58-rados-wip-sage2-testing-2019-01-23-0826-distro-basic-smithi/3497926

Related issues

Related to RADOS - Bug #38195: osd-backfill-space.sh exposes rocksdb hang Verified 02/05/2019

History

#1 Updated by David Zafman 7 months ago

  • Assignee set to David Zafman

This doesn't look like a big deal. The test expected backfilling to finish within 2 minutes. According to the log for an example pg still backfilling is that it sent PushOps for 2 minutes before the OSD was killed.

I will bump the timeout to 240 on line 231 of qa/standalone/osd/osd-backfill-space.sh then rerun this test multiple times. Theoretically, this could be a very large timeout. I have to wonder if we should use a consistent large timeout for anything that is expected to finish in any reasonable amount of time. Like 3600 (1 hour)?

#4 Updated by David Zafman 7 months ago

After increasing the time out, saw a different failure. As expected all 4 PG backfills completed and 4 PGs are in backfill_toofull. The kill_daemons() function failed and doesn't log any messages, so it isn't clear exactly what happened. Based on 20 runs this happens about 10% of time.

2019-02-02T02:28:54.743 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:260: TEST_backfill_test_multi:  kill_daemons td/osd-backfill-space
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  shopt -q -o xtrace
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  echo true
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  local trace=true
2019-02-02T02:28:54.745 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons:  true
2019-02-02T02:28:54.745 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons:  shopt -u -o xtrace
2019-02-02T02:34:58.085 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:362: kill_daemons:  return 1
2019-02-02T02:34:58.086 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:260: TEST_backfill_test_multi:  return 1
2019-02-02T02:34:58.086 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:37: run:  return 1

#5 Updated by David Zafman 7 months ago

dzafman-2019-02-04_11:24:54-rados-wip-zafman-testing2-distro-basic-smithi/3549933

Before mon shutdowns it appears to hang in rocksdb before SIGKILL sent by teardown. This run exposes the kill_daemons() execution.

teuthology.log

2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.c35707: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:255: kill_daemon:  cat td/osd-backfill-space/mon.a.pid
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:255: kill_daemon:  local pid=126389
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:256: kill_daemon:  local send_signal=TERM
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:257: kill_daemon:  local 'delays=0.1 0.2 1 1 1 2 3 5 5 5 10 10 20 60 60 60 120'
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:258: kill_daemon:  local exit_code=1
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:259: kill_daemon:  for try in '$delays'
....
2019-02-04T20:30:25.590 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:259: kill_daemon:  for try in '$delays'
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:260: kill_daemon:  kill -0 126389
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:261: kill_daemon:  exit_code=1
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:266: kill_daemon:  send_signal=0
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:267: kill_daemon:  sleep 120
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:269: kill_daemon:  return 1
2019-02-04T20:30:25.594 INFO:tasks.workunit.client.0.smithi194.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1866: run_in_background:  return 1

mon.a.log

...
2019-02-04 20:24:25.296 7f1b39312700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.257518) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/memtable_list.cc:377] [default] Level-0 commit table #19 started
2019-02-04 20:24:25.296 7f1b39312700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299890) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/memtable_list.cc:424] [default] Level-0 commit table #19: memtable #1 done
2019-02-04 20:24:25.296 7f1b39312700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299929) EVENT_LOG_v1 {"time_micros": 1549311865299904, "job": 9, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [1, 0, 0, 0, 0, 0, 1], "immutable_memtables": 0}
2019-02-04 20:24:25.296 7f1b39312700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299950) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:177] [default] Level summary: base level 6 max bytes base 268435456 files[1 0 0 0 0 0 1] max score 0.25

2019-02-04 20:24:25.296 7f1b39312700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_files.cc:347] [JOB 9] Try to delete WAL files size 9979119, prev total WAL file size 10025925, number of live WAL files 2.

2019-02-04 20:24:25.296 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.296 7f1b39b13700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.300155) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1847] [default] Manual compaction from level-0 to level-6 from 'paxos .. 'paxos; will stop at (end)

2019-02-04 20:24:25.296 7f1b39b13700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1636] [default] [JOB 10] Compacting 1@0 + 1@6 files to L6, score -1.00
2019-02-04 20:24:25.296 7f1b39b13700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1640] [default] Compaction start summary: Base version 9 Base level 0, inputs: [19(9580KB)], [17(18MB)]

2019-02-04 20:24:25.296 7f1b39b13700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865300218, "job": 10, "event": "compaction_started", "compaction_reason": "ManualCompaction", "files_L0": [19], "files_L6": [17], "score": -1, "input_data_size": 29158732}
2019-02-04 20:24:25.524 7f1b39b13700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1324] [default] [JOB 10] Generated table #20: 2415 keys, 25628909 bytes
2019-02-04 20:24:25.524 7f1b39b13700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865525468, "cf_name": "default", "job": 10, "event": "table_file_creation", "file_number": 20, "file_size": 25628909, "table_properties": {"data_size": 25510081, "index_size": 52487, "filter_size": 65628, "raw_key_size": 44917, "raw_average_key_size": 18, "raw_value_size": 25439814, "raw_average_value_size": 10534, "num_data_blocks": 1603, "num_entries": 2415, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
2019-02-04 20:24:25.548 7f1b39b13700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1390] [default] [JOB 10] Compacted 1@0 + 1@6 files to L6 => 25628909 bytes
2019-02-04 20:24:25.548 7f1b39b13700  3 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/version_set.cc:2578] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
2019-02-04 20:24:25.616 7f1b39b13700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.617853) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:752] [default] compacted to: base level 6 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 129.4 rd, 113.8 wr, level 6, files in(1, 1) out(1) MB in(9.4, 18.5) out(24.4), read-write-amplify(5.6) write-amplify(2.6) OK, records in: 2929, records dropped: 514 output_compression: NoCompression

2019-02-04 20:24:25.616 7f1b39b13700  4 rocksdb: (Original Log Time 2019/02/04-20:24:25.617896) EVENT_LOG_v1 {"time_micros": 1549311865617881, "job": 10, "event": "compaction_finished", "compaction_time_micros": 225308, "output_level": 6, "num_output_files": 1, "total_output_size": 25628909, "num_input_records": 2929, "num_output_records": 2415, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2019-02-04 20:24:25.616 7f1b39b13700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865619390, "job": 10, "event": "table_file_deletion", "file_number": 19}
2019-02-04 20:24:25.620 7f1b39b13700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865621737, "job": 10, "event": "table_file_deletion", "file_number": 17}
2019-02-04 20:24:25.620 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700  4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
<END OF LOG>

#6 Updated by David Zafman 7 months ago

  • Related to Bug #38195: osd-backfill-space.sh exposes rocksdb hang added

#7 Updated by David Zafman 7 months ago

  • Pull request ID set to 26290

#8 Updated by David Zafman 7 months ago

  • Status changed from Verified to Resolved

Also available in: Atom PDF