Bug #38027
osd/osd-backfill-space.sh fails
Description
2019-01-23T22:18:59.437 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:70: wait_for_state: (( 9 >= 9 ))
2019-01-23T22:18:59.437 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:71: wait_for_state: ceph pg dump pgs
2019-01-23T22:18:59.765 INFO:tasks.workunit.client.0.smithi065.stdout:PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:8.0 600 0 82 0 0 2457600 100 100 active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:42.309805 127'600 155:1682 [3,15] 3 [3] 3 0'0 2019-01-23 22:08:44.818149 0'0 2019-01-23 22:08:44.818149 0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:7.0 600 0 600 0 0 2457600 100 100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:41.710519 127'600 155:667 [5,1] 5 [5] 5 0'0 2019-01-23 22:08:41.931566 0'0 2019-01-23 22:08:41.931566 0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:2.0 600 0 600 0 0 2457600 100 100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:34.343792 127'600 155:676 [17,5] 17 [17] 17 0'0 2019-01-23 22:08:28.073752 0'0 2019-01-23 22:08:28.073752 0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:1.0 600 0 53 0 0 2457600 100 100 active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:31.958326 127'600 155:1771 [8,0] 8 [8] 8 0'0 2019-01-23 22:08:25.484862 0'0 2019-01-23 22:08:25.484862 0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:3.0 600 0 600 0 0 2457600 100 100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:34.205933 127'600 155:678 [1,7] 1 [1] 1 0'0 2019-01-23 22:08:30.734024 0'0 2019-01-23 22:08:30.734024 0
2019-01-23T22:18:59.766 INFO:tasks.workunit.client.0.smithi065.stdout:4.0 600 0 59 0 0 2457600 100 100 active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:34.931307 127'600 155:1747 [4,10] 4 [4] 4 0'0 2019-01-23 22:08:33.316636 0'0 2019-01-23 22:08:33.316636 0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stdout:5.0 600 0 66 0 0 2457600 100 100 active+undersized+degraded+remapped+backfilling 2019-01-23 22:16:37.073688 127'600 155:1717 [12,19] 12 [12] 12 0'0 2019-01-23 22:08:36.365619 0'0 2019-01-23 22:08:36.365619 0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stdout:6.0 600 0 600 0 0 2457600 100 100 active+undersized+degraded+remapped+backfill_toofull 2019-01-23 22:18:38.238313 127'600 155:675 [7,4] 7 [7] 7 0'0 2019-01-23 22:08:39.010010 0'0 2019-01-23 22:08:39.010010 0
2019-01-23T22:18:59.767 INFO:tasks.workunit.client.0.smithi065.stderr:dumped pgs
2019-01-23T22:18:59.775 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:72: wait_for_state: return 1
2019-01-23T22:18:59.776 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:231: TEST_backfill_test_multi: return 1
2019-01-23T22:18:59.776 INFO:tasks.workunit.client.0.smithi065.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:37: run: return 1
/a/sage-2019-01-23_18:09:58-rados-wip-sage2-testing-2019-01-23-0826-distro-basic-smithi/3497926
Related issues
History
#1 Updated by David Zafman about 5 years ago
- Assignee set to David Zafman
This doesn't look like a big deal. The test expected backfilling to finish within 2 minutes. According to the log, an example PG that was still backfilling had been sending PushOps for the full 2 minutes before the OSD was killed.
I will bump the timeout to 240 on line 231 of qa/standalone/osd/osd-backfill-space.sh and then rerun this test multiple times. Theoretically, this could be a very large timeout. I have to wonder if we should use a consistent large timeout, such as 3600 seconds (1 hour), for anything that is expected to finish in any reasonable amount of time.
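For context, wait_for_state-style helpers are just poll-until-timeout loops, so the timeout is a free parameter; a minimal sketch of the pattern (hypothetical names and condition argument, not the actual qa/standalone/ceph-helpers.sh code):

```shell
#!/bin/sh
# Minimal sketch of a poll-until-timeout helper in the spirit of
# wait_for_state (hypothetical; not the real qa/standalone helper).
# Polls an arbitrary shell condition once per second until it succeeds
# or the timeout expires.
wait_for_condition() {
    timeout=$1          # seconds, e.g. 240 instead of the old 120
    condition=$2        # shell command evaluated each iteration
    elapsed=0
    until eval "$condition"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "wait_for_condition: timed out after ${elapsed}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Whatever value is chosen, the loop exits as soon as the condition holds, so a generous timeout only costs time on genuinely stuck runs.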
#3 Updated by David Zafman about 5 years ago
#4 Updated by David Zafman about 5 years ago
After increasing the timeout, I saw a different failure. As expected, all 4 PG backfills completed and 4 PGs are in backfill_toofull. The kill_daemons() function failed but doesn't log any messages, so it isn't clear exactly what happened. Based on 20 runs, this happens about 10% of the time.
2019-02-02T02:28:54.743 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:260: TEST_backfill_test_multi: kill_daemons td/osd-backfill-space
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: shopt -q -o xtrace
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: echo true
2019-02-02T02:28:54.744 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: local trace=true
2019-02-02T02:28:54.745 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons: true
2019-02-02T02:28:54.745 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons: shopt -u -o xtrace
2019-02-02T02:34:58.085 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:362: kill_daemons: return 1
2019-02-02T02:34:58.086 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:260: TEST_backfill_test_multi: return 1
2019-02-02T02:34:58.086 INFO:tasks.workunit.client.0.smithi198.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-space.sh:37: run: return 1
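The 20-run failure-rate estimate above is the sort of thing a small driver loop can automate; a hypothetical harness (not part of the ceph tree — substitute the real osd-backfill-space.sh invocation for the placeholder TEST_CMD) might look like:

```shell
#!/bin/sh
# Hypothetical flakiness harness: run a test command N times and report
# how many runs failed. TEST_CMD is a placeholder that always passes;
# replace it with the real standalone test invocation.
TEST_CMD=${TEST_CMD:-'exit 0'}
runs=20
fails=0
i=1
while [ "$i" -le "$runs" ]; do
    sh -c "$TEST_CMD" || fails=$((fails + 1))
    i=$((i + 1))
done
echo "$fails/$runs runs failed"
```

With the placeholder, this prints "0/20 runs failed"; a ~10% flake as described above would show up as roughly "2/20 runs failed".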
#5 Updated by David Zafman about 5 years ago
dzafman-2019-02-04_11:24:54-rados-wip-zafman-testing2-distro-basic-smithi/3549933
During mon shutdown the daemon appears to hang in rocksdb until SIGKILL is sent by teardown. This run exposes the kill_daemons() execution.
teuthology.log
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.c35707: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:255: kill_daemon: cat td/osd-backfill-space/mon.a.pid
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:255: kill_daemon: local pid=126389
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:256: kill_daemon: local send_signal=TERM
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:257: kill_daemon: local 'delays=0.1 0.2 1 1 1 2 3 5 5 5 10 10 20 60 60 60 120'
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:258: kill_daemon: local exit_code=1
2019-02-04T20:24:27.598 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:259: kill_daemon: for try in '$delays'
....
2019-02-04T20:30:25.590 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:259: kill_daemon: for try in '$delays'
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:260: kill_daemon: kill -0 126389
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:261: kill_daemon: exit_code=1
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:266: kill_daemon: send_signal=0
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:267: kill_daemon: sleep 120
2019-02-04T20:30:25.591 INFO:tasks.workunit.client.0.smithi194.stderr:35707: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:269: kill_daemon: return 1
2019-02-04T20:30:25.594 INFO:tasks.workunit.client.0.smithi194.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1866: run_in_background: return 1
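From the trace, kill_daemon appears to send one TERM, then re-probe the pid with `kill -0` across a growing list of delays (downgrading send_signal to 0), and return 1 if the process never goes away; the SIGKILL comes later from teardown. A rough sketch of that shape (a simplification, not the actual qa/standalone/ceph-helpers.sh code; the KILL_DAEMON_DELAYS override is this sketch's addition):

```shell
#!/bin/sh
# Rough sketch of the kill_daemon behavior visible in the trace above
# (a simplification, NOT the real ceph-helpers.sh): deliver the
# requested signal once, then keep probing the pid with "kill -0"
# across a growing list of delays, and return 1 if the process never
# exits -- the outcome produced by a daemon wedged in rocksdb.
kill_daemon_sketch() {
    pid=$1
    send_signal=${2:-TERM}
    delays=${KILL_DAEMON_DELAYS:-'0.1 0.2 1 1 1 2 3 5 5 5 10 10 20 60 60 60 120'}
    exit_code=1
    for try in $delays; do
        if kill -0 "$pid" 2>/dev/null; then
            # Still alive: send the signal (just a probe once it is 0).
            kill "-$send_signal" "$pid" 2>/dev/null || true
            send_signal=0
        else
            exit_code=0
            break
        fi
        sleep "$try"
    done
    return "$exit_code"
}
```

Because the mon here never exits, such a loop runs out of delays and returns 1, which matches the kill_daemons failure in comment #4.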
mon.a.log
...
2019-02-04 20:24:25.296 7f1b39312700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.257518) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/memtable_list.cc:377] [default] Level-0 commit table #19 started
2019-02-04 20:24:25.296 7f1b39312700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299890) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/memtable_list.cc:424] [default] Level-0 commit table #19: memtable #1 done
2019-02-04 20:24:25.296 7f1b39312700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299929) EVENT_LOG_v1 {"time_micros": 1549311865299904, "job": 9, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [1, 0, 0, 0, 0, 0, 1], "immutable_memtables": 0}
2019-02-04 20:24:25.296 7f1b39312700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.299950) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:177] [default] Level summary: base level 6 max bytes base 268435456 files[1 0 0 0 0 0 1] max score 0.25
2019-02-04 20:24:25.296 7f1b39312700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_files.cc:347] [JOB 9] Try to delete WAL files size 9979119, prev total WAL file size 10025925, number of live WAL files 2.
2019-02-04 20:24:25.296 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.296 7f1b39b13700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.300155) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1847] [default] Manual compaction from level-0 to level-6 from 'paxos .. 'paxos; will stop at (end)
2019-02-04 20:24:25.296 7f1b39b13700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1636] [default] [JOB 10] Compacting 1@0 + 1@6 files to L6, score -1.00
2019-02-04 20:24:25.296 7f1b39b13700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1640] [default] Compaction start summary: Base version 9 Base level 0, inputs: [19(9580KB)], [17(18MB)]
2019-02-04 20:24:25.296 7f1b39b13700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865300218, "job": 10, "event": "compaction_started", "compaction_reason": "ManualCompaction", "files_L0": [19], "files_L6": [17], "score": -1, "input_data_size": 29158732}
2019-02-04 20:24:25.524 7f1b39b13700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1324] [default] [JOB 10] Generated table #20: 2415 keys, 25628909 bytes
2019-02-04 20:24:25.524 7f1b39b13700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865525468, "cf_name": "default", "job": 10, "event": "table_file_creation", "file_number": 20, "file_size": 25628909, "table_properties": {"data_size": 25510081, "index_size": 52487, "filter_size": 65628, "raw_key_size": 44917, "raw_average_key_size": 18, "raw_value_size": 25439814, "raw_average_value_size": 10534, "num_data_blocks": 1603, "num_entries": 2415, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
2019-02-04 20:24:25.548 7f1b39b13700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:1390] [default] [JOB 10] Compacted 1@0 + 1@6 files to L6 => 25628909 bytes
2019-02-04 20:24:25.548 7f1b39b13700 3 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/version_set.cc:2578] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
2019-02-04 20:24:25.616 7f1b39b13700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.617853) [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/compaction_job.cc:752] [default] compacted to: base level 6 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 129.4 rd, 113.8 wr, level 6, files in(1, 1) out(1) MB in(9.4, 18.5) out(24.4), read-write-amplify(5.6) write-amplify(2.6) OK, records in: 2929, records dropped: 514 output_compression: NoCompression
2019-02-04 20:24:25.616 7f1b39b13700 4 rocksdb: (Original Log Time 2019/02/04-20:24:25.617896) EVENT_LOG_v1 {"time_micros": 1549311865617881, "job": 10, "event": "compaction_finished", "compaction_time_micros": 225308, "output_level": 6, "num_output_files": 1, "total_output_size": 25628909, "num_input_records": 2929, "num_output_records": 2415, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2019-02-04 20:24:25.616 7f1b39b13700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865619390, "job": 10, "event": "table_file_deletion", "file_number": 19}
2019-02-04 20:24:25.620 7f1b39b13700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1549311865621737, "job": 10, "event": "table_file_deletion", "file_number": 17}
2019-02-04 20:24:25.620 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting
2019-02-04 20:24:25.620 7f1b30300700 4 rocksdb: [/build/ceph-14.0.1-3131-ge01bb11/src/rocksdb/db/db_impl_compaction_flush.cc:1060] [default] Manual compaction starting <END OF LOG>
#6 Updated by David Zafman about 5 years ago
- Related to Bug #38195: osd-backfill-space.sh exposes rocksdb hang added
#7 Updated by David Zafman about 5 years ago
- Pull request ID set to 26290
#8 Updated by David Zafman about 5 years ago
- Status changed from 12 to Resolved