Bug #12012: IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2) - Ceph - Ceph

Actions

Copy link

Bug #12012

closed

IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2)

Added by Mehdi Abaakouk almost 9 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Kefu Chai

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

hammer

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

v0.94.2

ceph-qa-suite:

rbd

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

issue-12012-ceph-report-20150614.gz can be found in sftp cephdrop@ceph.com

I have recently upgraded a giant (0.87.1) cluster to hammer 0.94.2. We got no issue with the upgrade itself.
The main usage of the cluster kvm virtual machine backed on rbd image (format 2).

We have some 'replicated' pools and some 'erasure coding' pools.

On all Virtual Machines using 'erasure coding' pools tier, we got many 'IO Error' on their filesystems (ext4) since the upgrade.

Running a 'fsck' on the FS, doesn't solve this problem (fsck repairs the FS, but it continues to be re-corrupted when we reuse it)

I can't reproduce the issue on freshly created image on the 'erasure coding' pool.

Ceph doesn't report any issue in logs.

Output on 'dmesg' inside on of the VM:

[8536286.192503] end_request: I/O error, dev vdf, sector 4637171712
[8536286.530987] end_request: I/O error, dev vdf, sector 4637179904
[8536287.225984] end_request: I/O error, dev vdf, sector 4637188096
[8536287.556361] end_request: I/O error, dev vdf, sector 4637196288
[8536288.758833] end_request: I/O error, dev vdf, sector 4637204480
[8536289.305179] end_request: I/O error, dev vdf, sector 4637212672
[8536289.745542] end_request: I/O error, dev vdf, sector 4637220864
[8536290.192655] end_request: I/O error, dev vdf, sector 4637229056
[8536290.924421] end_request: I/O error, dev vdf, sector 4637237248
[8536291.635483] end_request: I/O error, dev vdf, sector 6468124152
[8557542.208535] end_request: I/O error, dev vdc, sector 7616889296
[8557542.944758] end_request: I/O error, dev vdc, sector 41232
[8557542.945423] Buffer I/O error on device vdc, logical block 5154
[8557542.946000] Buffer I/O error on device vdc, logical block 5155
[8557542.946572] Buffer I/O error on device vdc, logical block 5156
[8557542.947156] Buffer I/O error on device vdc, logical block 5157
[8557542.947728] Buffer I/O error on device vdc, logical block 5158
[8557542.948305] Buffer I/O error on device vdc, logical block 5159
[8557542.948741] Buffer I/O error on device vdc, logical block 5160
[8557542.948741] Buffer I/O error on device vdc, logical block 5161
[8557542.948741] Buffer I/O error on device vdc, logical block 5162
[8557543.804862] EXT4-fs error (device vdc): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/4/lost+found: directory contains a hole at offset 0
[8557543.806331] Aborting journal on device vdc-8.
[8557543.806787] EXT4-fs (vdc): Remounting filesystem read-only
[8557543.813079] end_request: I/O error, dev vdb, sector 7138738712
[8557544.515328] end_request: I/O error, dev vdb, sector 41232
[8557545.979173] EXT4-fs error (device vdb): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/6/lost+found: directory contains a hole at offset 0
[8557546.340171] end_request: I/O error, dev vdf, sector 41232
[8557546.913031] EXT4-fs error (device vdf): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/8/lost+found: directory contains a hole at offset 0
[8557546.914464] Aborting journal on device vdf-8.
[8557548.249942] EXT4-fs (vdf): Remounting filesystem read-only
[8557566.732395] end_request: I/O error, dev vdd, sector 41224
[8557844.829111] end_request: I/O error, dev vdd, sector 4915757312
[8583680.992045] EXT4-fs (vdb): error count since last fsck: 2
[8583680.992057] EXT4-fs (vdb): initial error at time 1434120904: ext4_find_entry:932: inode 111542328
[8583680.992060] EXT4-fs (vdb): last error at time 1434181428: ext4_readdir:172: inode 11
[8586076.784775] UDP: bad checksum. From 46.183.220.250:3099 to 89.234.156.236:5060 ulen 237
[8644236.256068] EXT4-fs (vdc): error count since last fsck: 1
[8644236.256070] EXT4-fs (vdc): initial error at time 1434181426: ext4_readdir:172: inode 11
[8644236.256072] EXT4-fs (vdc): last error at time 1434181426: ext4_readdir:172: inode 11
[8644236.256075] EXT4-fs (vdf): error count since last fsck: 1
[8644236.256076] EXT4-fs (vdf): initial error at time 1434181429: ext4_readdir:172: inode 11
[8644236.256078] EXT4-fs (vdf): last error at time 1434181429: ext4_readdir:172: inode 11
[8652845.649692] end_request: I/O error, dev vdd, sector 8328
[8652845.651157] EXT4-fs error (device vdd): ext4_read_inode_bitmap:161: comm touch: Cannot read inode bitmap - block_group = 0, inode_bitmap = 1041
[8652846.052461] Aborting journal on device vdd-8.
[8652846.408963] EXT4-fs (vdd): Remounting filesystem read-only
[8652846.409564] EXT4-fs error (device vdd) in ext4_new_inode:945: IO failure
[8652846.446874] EXT4-fs error (device vdd) in ext4_create:1776: IO failure

Files

Download all files

osd_errors.txt (5.96 KB) osd_errors.txt		Jason Dillaman, 06/22/2015 01:57 PM
output_os_map.txt (24.3 KB) output_os_map.txt		Laurent GUERBY, 06/22/2015 09:49 PM
output_os_map_json.txt (31.9 KB) output_os_map_json.txt		Laurent GUERBY, 06/22/2015 09:54 PM

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

Could you also show the tier configuration you have ?

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

issue-12012-ceph-report-20150614.gz can be found in sftp cephdrop@ceph.com

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

            {
                "pool": 80,
                "pool_name": "ec4p1",
                "flags": 1,
                "flags_names": "hashpspool",
                "type": 3,
                "size": 5,
                "min_size": 2,
                "crush_ruleset": 2,
                "object_hash": 2,
                "pg_num": 4096,
                "pg_placement_num": 4096,
                "crash_replay_interval": 0,
                "last_change": "27770",
                "last_force_op_resend": "27732",
                "auid": 0,
                "snap_mode": "selfmanaged",
                "snap_seq": 0,
                "snap_epoch": 27730,
                "pool_snaps": [],
                "removed_snaps": "[]",
                "quota_max_bytes": 0,
                "quota_max_objects": 0,
                "tiers": [
                    81
                ],
                "tier_of": -1,
                "read_tier": 81,
                "write_tier": 81,
                "cache_mode": "none",
                "target_max_bytes": 0,
                "target_max_objects": 0,
                "cache_target_dirty_ratio_micro": 400000,
                "cache_target_full_ratio_micro": 800000,
                "cache_min_flush_age": 0,
                "cache_min_evict_age": 0,
                "erasure_code_profile": "ec4p1profile",
                "hit_set_params": {
                    "type": "none" 
                },
                "hit_set_period": 0,
                "hit_set_count": 0,
                "min_read_recency_for_promote": 0,
                "stripe_width": 4096,
                "expected_num_objects": 0
            },
            {
                "pool": 81,
                "pool_name": "ec4p1c",
                "flags": 9,
                "flags_names": "hashpspool,incomplete_clones",
                "type": 1,
                "size": 3,
                "min_size": 2,
                "crush_ruleset": 0,
                "object_hash": 2,
                "pg_num": 4096,
                "pg_placement_num": 4096,
                "crash_replay_interval": 0,
                "last_change": "27771",
                "last_force_op_resend": "0",
                "auid": 0,
                "snap_mode": "selfmanaged",
                "snap_seq": 0,
                "snap_epoch": 27730,
                "pool_snaps": [],
                "removed_snaps": "[]",
                "quota_max_bytes": 0,
                "quota_max_objects": 0,
                "tiers": [],
                "tier_of": 80,
                "read_tier": -1,
                "write_tier": -1,
                "cache_mode": "writeback",
                "target_max_bytes": 100000000000,
                "target_max_objects": 100000000000,
                "cache_target_dirty_ratio_micro": 400000,
                "cache_target_full_ratio_micro": 800000,
                "cache_min_flush_age": 0,
                "cache_min_evict_age": 0,
                "erasure_code_profile": "",
                "hit_set_params": {
                    "type": "bloom",
                    "false_positive_probability": 0.050000,
                    "target_size": 0,
                    "seed": 0
                },
                "hit_set_period": 3600,
                "hit_set_count": 1,
                "min_read_recency_for_promote": 0,
                "stripe_width": 0,
                "expected_num_objects": 0
            }

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

All PG are active+clean in the 80 and 81 pools

$ jq '.pgmap.pg_stats[] | select(.pgid | startswith("80.") or startswith("81.") ) | select(.state != "active+clean")' < ~/Downloads/issue-12012-ceph-report-20150614 | wc -l
0

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

When was the last time this pool had a deep-scrub ?

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

The most recent deep-scrub is six month old

$ jq '.pgmap.pg_stats[] | select(.pgid | startswith("80.") or startswith("81.") ) | .last_deep_scrub_stamp' < ~/Downloads/issue-12012-ceph-report-20150614 | sort -u | tail -1
"2014-12-26 10:43:22.313538"

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

Could you list the operations you did to upgrade ? Was it just a matter of apt-get install ? Or did you do other maintenance operations as well ?

Actions

Copy link

Updated by Loïc Dachary almost 9 years ago

joshd or jdillaman may have a suggestion on how to narrow the error down. Maybe "vdf, sector 4637237248" allows to find which rbd object is faulty and focus on it

Actions

Copy link

Updated by Mehdi Abaakouk almost 9 years ago

We have not scrub since we upgrade to giant due to http://tracker.ceph.com/issues/8011, we upgrade to hammer to be able to do scrub again, but not yet re-renable it.

Upgrade process:

ceph osd set noout
apt-get install ... everywhere
restart one by one all mon nodes first.
restart one by one all OSDs and waiting for recovery to finish between each.
live-migrate all the vms to got last version of librbd in all running kvm processes
ceph osd unset noout

No special maintenance stuffs.

Actions

Copy link

#10

Updated by Loïc Dachary almost 9 years ago

Subject changed from ext4 IO error on kvm/rbd on an erasure coded pool to IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2)

Actions

Copy link

#11

Updated by Loïc Dachary almost 9 years ago

Status changed from New to 12

Actions

Copy link

#12

Updated by Loïc Dachary almost 9 years ago

Description updated (diff)

Actions

Copy link

#13

Updated by Loïc Dachary almost 9 years ago

Description updated (diff)

Actions

Copy link

#14

Updated by Loïc Dachary almost 9 years ago

Description updated (diff)

Actions

Copy link

#15

Updated by Mehdi Abaakouk almost 9 years ago

kvm/qemu is 2.2 and have not changed before and after the upgrade. librbd is 0.94.2-1-ga11cca9

Actions

Copy link

#16

Updated by Mehdi Abaakouk almost 9 years ago

I have also stopped one impacted VM, do a fsck with 'virt-rescue utility', and restart the VM, but I continue to get the FS correupted.

Actions

Copy link

#17

Updated by Loïc Dachary almost 9 years ago

jdillaman suggests: might be good to get librbd debug logs (“debug rbd = 20”) from qemu. Can you get that please Mehdi ?

Actions

Copy link

#18

Updated by Loïc Dachary almost 9 years ago

For the record http://tracker.ceph.com/issues/12029

Actions

Copy link

#19

Updated by Mehdi Abaakouk almost 9 years ago

I have made the following additionnal test:

Inside the VM, I have read the EC volume and got many 'IO Error'
Then I have detached this EC volume from VM
Mount it with the kernel client on another server with rbd map... (kernel 3.16)
Do the same command to read the EC volume, and doesn't get any errors
To confirm, I have reattached the EC volume to the kvm process,
Done the test again inside the VM and I got the 'IO Error' again.

The test command I have used is 'find /mnt -type f -exec cat {} \; > /dev/null' to read the whole device with the FS layer.

I will try to do it again with rbd debug enabled

Actions

Copy link

#20

Updated by Mehdi Abaakouk almost 9 years ago

I have generated 'IO error' with same command as previous test.

And uploaded the resulting log on sftp cephdrop@ceph.com:

issue-12012-dmesg.gz (from within the VM on which /dev/vdb is attached)
issue-12012-rbd-debug-qemu-guest.21599.log.gz

143 "IO errors" have occurred during the test.

Actions

Copy link

#21

Updated by Mehdi Abaakouk almost 9 years ago

Just in case the qemu command line launched (generated by libvirt and nova):

qemu-system-x86_64 -enable-kvm -name fe6bd6b7-152c-4240-a119-3cd7a760fc0a -S -machine pc-i440fx-2.2,accel=kvm,usb=off -cpu Nehalem -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid fe6bd6b7-152c-4240-a119-3cd7a760fc0a -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=2015.1.0,serial=7350747a-f98d-489c-954f-18263c6fd088,uuid=fe6bd6b7-152c-4240-a119-3cd7a760fc0a -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fe6bd6b7-152c-4240-a119-3cd7a760fc0a.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:disks/fe6bd6b7-152c-4240-a119-3cd7a760fc0a_disk.config:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-ide0-1-1,readonly=on,format=raw,cache=none,aio=native -device ide-cd,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -drive file=rbd:disks/volume-0776d998-b84e-4a60-9304-69834af247c0:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-virtio-disk0,format=raw,serial=0776d998-b84e-4a60-9304-69834af247c0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=rbd:ec4p1/volume-ecd57a41-51a7-40a0-a4ea-315537acbfb8:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-virtio-disk1,format=raw,serial=ecd57a41-51a7-40a0-a4ea-315537acbfb8,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:37:57:d0,bus=pci.0,addr=0x3 -chardev socket,id=charserial0,host=192.168.3.106,port=10001,server,nowait -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -spice port=5900,addr=0.0.0.0,disable-ticketing,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

Actions

Copy link

#22

Updated by Mehdi Abaakouk almost 9 years ago

I have made some other tests by changing the io policy of qemu:

cache=none, io=native -> IO Errors
cache=none -> IO Errors # I guess io=threads by default
none of both -> no errors # I guess io=threads, cache=writethrough by default.

So, the issue seems to be triggered only when the rbd cache is disabled.

Actions

Copy link

#23

Updated by Josh Durgin almost 9 years ago

Project changed from Ceph to rbd

Actions

Copy link

#24

Updated by Jason Dillaman almost 9 years ago

I wasn't able to reproduce an issue using a clean v0.94.2 build with the librbd cache disabled. Looking at your RBD client logs, the client is receiving -95 (EOPNOTSUPP) errors from the OSDs when reading from the image. Can you search your OSD logs in the cache tier for this error code related to objects prefixed with "rbd_data."?

Actions

Copy link

#25

Updated by Mehdi Abaakouk almost 9 years ago

I don't find this error code (either by code or by name) on my OSDs logs, do I need to increase a particular log level this see them ?

Actions

Copy link

#26

Updated by Mehdi Abaakouk almost 9 years ago

I have retried why 'debug osd 20', and didn't find this errors in OSDs logs too.

Actions

Copy link

#27

Updated by Jason Dillaman almost 9 years ago

File osd_errors.txt osd_errors.txt added

I've attached all the objects that encountered errors in your log. I extract the errors via 'zgrep -E "should_complete.*r = -" issue-12012-rbd-debug-qemu-guest.21599.log.gz | grep -Eo 'rbd_data[^ ]*' | sort | uniq'.

You should be able to map these objects back to PGs via 'for x in $(cat osd_errors.txt); do ceph osd map ec4p1 $x; done'.

You might need to add 'debug ms = 1' to your OSD's ceph.conf to have the OSD op replies logged.

Actions

Copy link

#28

Updated by Laurent GUERBY almost 9 years ago

File output_os_map.txt output_os_map.txt added

Here is the output of the for loop. Let me know if more is needed.

Actions

Copy link

#29

Updated by Laurent GUERBY almost 9 years ago

File output_os_map_json.txt output_os_map_json.txt added

With -f json per Loic request

Actions

Copy link

#30

Updated by Josh Durgin almost 9 years ago

Status changed from 12 to Need More Info

We haven't been able to reproduce this - can you give more details on the erasure codeded pool setup?

Actions

Copy link

#31

Updated by Mehdi Abaakouk over 8 years ago

I guess all our pool setup is described into the previously send ceph-report. Lets me known is you need something else and how to gather it.

Actions

Copy link

#32

Updated by Jason Dillaman over 8 years ago

Mehdi Abaakouk wrote:

I guess all our pool setup is described into the previously send ceph-report. Lets me known is you need something else and how to gather it.

We really will need the OSD reply outputs after applying "debug ms = 1" to your OSD config. I would apply the setting first to the cache tier, reproduce the issue, and scan the OSD logs for "-95" errors against "rbd_data.*" objects.

Actions

Copy link

#33

Updated by Jason Dillaman over 8 years ago

Actually, this might be as simple as the fact that when not using the librbd cache, the read ops are issued as a sparse read request and when the cache is enabled a standard read is used. If the cache tier is full, (sparse) read requests will be redirected to the EC pool, but the EC pool does not support sparse reads so it will return -95 (EOPNOTSUPP). Since this redirection only occurs when the cache tier is full, that might explain why we haven't seen it during our testing.

Actions

Copy link

#34

Updated by Jason Dillaman over 8 years ago

Project changed from rbd to Ceph

Actions

Copy link

#35

Updated by Kefu Chai over 8 years ago

to fix this issue,

we need sparse reads in ReplicatedPG::do_op to be treated as normal reads on ec pools
it would be good for ceph_test_rados to send sparse reads some times, maybe have a ratio option send the read as a bunch of sparse reads in the read op. the option defaults to 0.5.
also, the api tests should do sparse reads
- on both replicated
- on a replicated cache on ec base