Project

General

Profile

Actions

Bug #37656

open

FileStore::_do_transaction() crashed with error 17 (merge collection vs osd restart)

Added by Neha Ojha over 5 years ago. Updated almost 4 years ago.

Status:
Triaged
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-12-13T16:44:55.693 INFO:teuthology.orchestra.run.smithi091:Running:
2018-12-13T16:44:55.693 INFO:teuthology.orchestra.run.smithi091:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_historic_ops
2018-12-13T16:44:55.698 INFO:tasks.ceph.osd.6.smithi091.stderr:2018-12-13 16:44:55.693 7f62e5052700 -1 filestore(/var/lib/ceph/osd/ceph-6)  error (17) File exists not handled on operation 0xec8e000 (8205.0.0, or op 0, counting from 0)
2018-12-13T16:44:55.702 INFO:tasks.ceph.osd.6.smithi091.stderr:/build/ceph-14.0.1-1674-g822b02c/src/os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*, const char*)' thread 7f62e5052700 time 2018-12-13 16:44:55.697661
2018-12-13T16:44:55.702 INFO:tasks.ceph.osd.6.smithi091.stderr:/build/ceph-14.0.1-1674-g822b02c/src/os/filestore/FileStore.cc: 3293: abort()
2018-12-13T16:44:55.702 INFO:tasks.ceph.osd.6.smithi091.stderr: ceph version 14.0.1-1674-g822b02c (822b02ce1b5d8d2126974c0b6aec117223b299b2) nautilus (dev)
2018-12-13T16:44:55.702 INFO:tasks.ceph.osd.6.smithi091.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x82cd64]
2018-12-13T16:44:55.702 INFO:tasks.ceph.osd.6.smithi091.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*, char const*)+0xa58) [0xd01e18]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*, char const*)+0x48) [0xd08d08]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a4) [0xd08ee4]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xbee) [0xfc12de]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0xfca540]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 7: (()+0x76ba) [0x7f62f6ebd6ba]
2018-12-13T16:44:55.703 INFO:tasks.ceph.osd.6.smithi091.stderr: 8: (clone()+0x6d) [0x7f62f64c441d]

/a/nojha-2018-12-13_15:13:17-rados-master-distro-basic-smithi/3358402/

Actions #1

Updated by Josh Durgin over 5 years ago

/a/dzafman-2018-12-14_11:02:20-rados-wip-zafman-testing-distro-basic-smithi/3362534

Actions #2

Updated by Neha Ojha over 5 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Josh Durgin over 4 years ago

  • Status changed from New to Can't reproduce
Actions #4

Updated by Neha Ojha about 4 years ago

  • Status changed from Can't reproduce to New

/a/teuthology-2020-02-11_02:30:03-upgrade:mimic-x-nautilus-distro-basic-smithi/4753470/

upgrade:mimic-x/stress-split/{0-cluster/{openstack.yaml start.yaml} 1-ceph-install/mimic.yaml 1.1-pg-log-overrides/normal_pg_log.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{radosbench.yaml rbd-cls.yaml rbd-import-export.yaml rbd_api.yaml readwrite.yaml rgw_ragweed_prepare.yaml snaps-few-objects.yaml} 5-finish-upgrade.yaml 6-msgr2.yaml 6-nautilus.yaml 7-final-workload/{rbd-python.yaml rgw-swift-ragweed_check.yaml snaps-many-objects.yaml} objectstore/filestore-xfs.yaml supported-all-distro/ubuntu_16.04.yaml thrashosds-health.yaml}

020-02-11 09:28:01.145 7fdb10310700 10 filestore(/var/lib/ceph/osd/ceph-2) _do_transaction(2773): on 0xf358520
2020-02-11 09:28:01.145 7fdb10310700 15 filestore(/var/lib/ceph/osd/ceph-2) _create_collection(5448): /var/lib/ceph/osd/ceph-2/current/1.165_head
2020-02-11 09:28:01.145 7fdb10310700 10 filestore(/var/lib/ceph/osd/ceph-2) _create_collection(5454): /var/lib/ceph/osd/ceph-2/current/1.165_head = -17
2020-02-11 09:28:01.145 7fdb10310700 -1 filestore(/var/lib/ceph/osd/ceph-2)  error (17) File exists not handled on operation 0x10553000 (866015.0.0, or op 0, counting from 0)
2020-02-11 09:28:01.433 7fdb10310700 -1 /build/ceph-14.2.7-508-g6036662/src/os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*, const char*)' thread 7fdb10310700 time 2020-02-11 09:28:01.148817
/build/ceph-14.2.7-508-g6036662/src/os/filestore/FileStore.cc: 3320: ceph_abort_msg("unexpected error")

 ceph version 14.2.7-508-g6036662 (6036662bc04e4d9a0f8ab284b1a18ca54100c1d7) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x83c5f0]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*, char const*)+0x6124) [0xc7db84]
 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*, char const*)+0x48) [0xc7f168]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a4) [0xc7f344]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x9d6) [0xee3346]
 6: (ThreadPool::WorkThread::entry()+0x11) [0xee41d1]
 7: (()+0x76ba) [0x7fdb21a5f6ba]
 8: (clone()+0x6d) [0x7fdb2106641d]
Actions #5

Updated by Sage Weil about 4 years ago

  -737> 2020-03-04T09:28:58.109+0000 7effc1c02700 10 filestore(/var/lib/ceph/osd/ceph-6) _create_collection(5447): /var/lib/ceph/osd/ceph-6/current/2.10_head = -17
  -732> 2020-03-04T09:28:58.109+0000 7effc1c02700 -1 filestore(/var/lib/ceph/osd/ceph-6)  error (17) File exists not handled on operation 0x55fcf753f000 (11684.0.0, or op 0, counting from 0)
  -731> 2020-03-04T09:28:58.109+0000 7effc1c02700  0 filestore(/var/lib/ceph/osd/ceph-6) unexpected error code
  -672> 2020-03-04T09:28:58.109+0000 7effc1c02700  0 filestore(/var/lib/ceph/osd/ceph-6)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "mkcoll",
            "collection": "2.10_head" 
        },
        {
            "op_num": 1,
            "op_name": "coll_hint",
            "collection": "2.10_head",
            "type": 1,
            "pg_num": 18,
            "expected_num_objects": 0
        },
        {
            "op_num": 2,
            "op_name": "touch",
            "collection": "2.10_head",
            "oid": "#2:08000000::::head#" 
        },
        {
            "op_num": 3,
            "op_name": "omap_setkeys",
            "collection": "2.10_head",
            "oid": "#2:08000000::::head#",
            "attr_lens": {
                "_infover": 1
            }
        },
        {
            "op_num": 4,
            "op_name": "touch",
            "collection": "2.10_head",
            "oid": "#2:08000000::::head#" 
        },
        {
            "op_num": 5,
            "op_name": "omap_setkeys",
            "collection": "2.10_head",
            "oid": "#2:08000000::::head#",
            "attr_lens": {
                "_biginfo": 95,
                "_epoch": 4,
                "_info": 948,
                "may_include_deletes_in_missing": 0
            }
        },
        {
            "op_num": 6,
            "op_name": "omap_rmkeys",
            "collection": "2.10_head",
            "oid": "#2:08000000::::head#",
            "attrs": [
                "_fastinfo" 
            ]
        }
    ]
}
    -7> 2020-03-04T09:28:58.110+0000 7effc1c02700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1606-gecd8670/rpm/el8/BUILD/ceph-15.1.0-1606-gecd8670/src/os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*, const char*)' thread 7effc1c02700 time 2020-03-04T09:28:58.109601+0000

/a/sage-2020-03-03_19:20:19-rados-wip-sage2-testing-2020-03-03-0959-distro-basic-smithi/4821257
Actions #6

Updated by Sage Weil about 4 years ago

the merge happens right before we shut down:

2020-03-04T09:21:35.006+0000 7f44cc7a0700 10 filestore(/var/lib/ceph/osd/ceph-6) _destroy_collection(5505): /var/lib/ceph/osd/ceph-6/current/2.10_head = 0
2020-03-04T09:21:35.006+0000 7f44cc7a0700 10 filestore(/var/lib/ceph/osd/ceph-6) _close_replay_guard(2653): 3445.0.1
2020-03-04T09:21:35.006+0000 7f44cc7a0700 20 filestore dbobjectmap: seq is 27
2020-03-04T09:21:35.006+0000 7f44cc7a0700 10 filestore(/var/lib/ceph/osd/ceph-6) _close_replay_guard(2683): 3445.0.1 done
2020-03-04T09:21:35.006+0000 7f44cc7a0700 10 filestore(/var/lib/ceph/osd/ceph-6) _close_replay_guard(2653): 3445.0.1
2020-03-04T09:21:35.006+0000 7f44cc7a0700 20 filestore dbobjectmap: seq is 27
2020-03-04T09:21:35.006+0000 7f44cc7a0700 10 filestore(/var/lib/ceph/osd/ceph-6) _close_replay_guard(2683): 3445.0.1 done
2020-03-04T09:21:35.006+0000 7f44cc7a0700 15 filestore(/var/lib/ceph/osd/ceph-6) _merge_collection(5908): 2.10_head 2.0_head bits 4 = 0
...
2020-03-04T09:21:36.095+0000 7f44de034700 10 --1- v1:172.21.15.130:6805/15572 >> v1:172.21.15.130:0/15835 conn(0x55e15d8ef800 0x55e15e9bc800 :6805 s=OPENED pgs=18 cs=1 l=1).write_message sending 0x55e15df05080 done.
2020-03-04T09:21:54.508+0000 7effd522aec0  0 ceph version 15.1.0-1606-gecd8670 (ecd867096a473d425947a07b70a4dfca72c6fb88) octopus (rc), process ceph-osd, pid 31676
...
2020-03-04T09:21:54.519+0000 7effd522aec0 20 filestore(/var/lib/ceph/osd/ceph-6)  ls [meta,2.4_head,2.4_TEMP,2.c_head,2.c_TEMP,1.1_head,1.1_TEMP,1.a_head,1.a_TEMP,3.3_head,3.3_TEMP,2.0_head,2.0_TEMP,2.10_head,2.10_TEMP,1.6_head,1.6_TEMP]

the thrasher is doing this:
2020-03-04T09:21:36.234 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd.6
2020-03-04T09:21:36.235 INFO:teuthology.orchestra.run.smithi130:> type ceph-objectstore-tool
2020-03-04T09:21:36.289 INFO:teuthology.orchestra.run.smithi130.stdout:ceph-objectstore-tool is /usr/bin/ceph-objectstore-tool
2020-03-04T09:21:36.290 INFO:teuthology.orchestra.run.smithi130:> sudo adjust-ulimits ceph-objectstore-tool --no-mon-config '--log-file=/var/log/ceph/objectstore_tool.$pid.log' --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op list-pgs
2020-03-04T09:21:36.447 INFO:teuthology.orchestra.run.smithi130:> sudo adjust-ulimits ceph-objectstore-tool --no-mon-config '--log-file=/var/log/ceph/objectstore_tool.$pid.log' --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op export --pgid 1.6 --file /var/log/ceph/exp.1.6.6
2020-03-04T09:21:36.501 INFO:teuthology.orchestra.run.smithi130:> sudo adjust-ulimits ceph-objectstore-tool --no-mon-config '--log-file=/var/log/ceph/objectstore_tool.$pid.log' --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --force --op remove --pgid 1.6
2020-03-04T09:21:36.560 INFO:teuthology.orchestra.run.smithi130:> sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal '--log-file=/var/log/ceph/objectstore_tool.$pid.log' --op import --file /var/log/ceph/exp.1.6.6
2020-03-04T09:21:36.923 INFO:teuthology.orchestra.run.smithi130:> CEPH_ARGS='--filestore-merge-threshold 1 --filestore-split-multiple 1' sudo -E ceph-objectstore-tool --no-mon-config --log-file=/var/log/ceph/objectstore_tool.$pid.log --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op apply-layout-settings --pool device_health_metrics
2020-03-04T09:21:36.977 INFO:teuthology.orchestra.run.smithi130:> CEPH_ARGS='--filestore-merge-threshold 1 --filestore-split-multiple 1' sudo -E ceph-objectstore-tool --no-mon-config --log-file=/var/log/ceph/objectstore_tool.$pid.log --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op apply-layout-settings --pool rbd
2020-03-04T09:21:37.073 INFO:teuthology.orchestra.run.smithi130:> CEPH_ARGS='--filestore-merge-threshold 1 --filestore-split-multiple 1' sudo -E ceph-objectstore-tool --no-mon-config --log-file=/var/log/ceph/objectstore_tool.$pid.log --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op apply-layout-settings --pool base
2020-03-04T09:21:37.179 INFO:teuthology.orchestra.run.smithi130:> CEPH_ARGS='--filestore-merge-threshold 1 --filestore-split-multiple 1' sudo -E ceph-objectstore-tool --no-mon-config --log-file=/var/log/ceph/objectstore_tool.$pid.log --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op apply-layout-settings --pool cache

the first cot incantation objectstore_tool.30268.log:

2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _do_transaction(2771): on 0x5651ddb8fe60
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _check_replay_guard(2716): 2.10_head dne
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) _create_collection(5441): /var/lib/ceph/osd/ceph-6/current/2.10_head
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _create_collection(5447): /var/lib/ceph/osd/ceph-6/current/2.10_head = 0
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 20 read_settings split_rand_factor = 6
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _collection_set_bits(5078): /var/lib/ceph/osd/ceph-6/current/2.10_head 5
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _collection_set_bits(5091): /var/lib/ceph/osd/ceph-6/current/2.10_head 5 = 4
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) _create_collection(5441): /var/lib/ceph/osd/ceph-6/current/2.10_TEMP
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _create_collection(5447): /var/lib/ceph/osd/ceph-6/current/2.10_TEMP = 0
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 20 read_settings split_rand_factor = 17
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _collection_set_bits(5078): /var/lib/ceph/osd/ceph-6/current/2.10_TEMP 0
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _collection_set_bits(5091): /var/lib/ceph/osd/ceph-6/current/2.10_TEMP 0 = 4
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _set_replay_guard(2589): 3416.0.0
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 20 filestore dbobjectmap: seq is 27
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _set_replay_guard(2629): 3416.0.0 done
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _set_replay_guard(2589): 3416.0.0
2020-03-04T09:21:36.391+0000 7fd7cbc35cc0 20 filestore dbobjectmap: seq is 27
2020-03-04T09:21:36.392+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _set_replay_guard(2629): 3416.0.0 done
2020-03-04T09:21:36.392+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _check_replay_guard(2760): object has 3416.0.0 < current pos 3416.0.1, in past, will replay
2020-03-04T09:21:36.392+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) _collection_hint_expected_num_objs(5407): collection: 2.10_head pg number: 17 expected number of objects: 0
2020-03-04T09:21:36.392+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) collection_empty(5048): 2.10_head
...
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) _remove(3642): 2.10_head/#2:08000000::::head# = 0
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) _merge_collection(5801): 2.10_head 2.0_head bits 4
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) collection_stat(5026): /var/lib/ceph/osd/ceph-6/current/2.10_head
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) collection_stat(5030): /var/lib/ceph/osd/ceph-6/current/2.10_head = 0
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 15 filestore(/var/lib/ceph/osd/ceph-6) collection_stat(5026): /var/lib/ceph/osd/ceph-6/current/2.0_head
2020-03-04T09:21:36.438+0000 7fd7cbc35cc0 10 filestore(/var/lib/ceph/osd/ceph-6) collection_stat(5030): /var/lib/ceph/osd/ceph-6/current/2.0_head = 0

so.. looks like a problem with filestore replay vs merge_collection

Actions #7

Updated by Sage Weil about 4 years ago

  • Subject changed from FileStore::_do_transaction() crashed with error 17 to FileStore::_do_transaction() crashed with error 17 (merge collection vs osd restart)
  • Status changed from New to Triaged
Actions #8

Updated by Neha Ojha almost 4 years ago

  • Priority changed from High to Normal
Actions

Also available in: Atom PDF