Project

General

Profile

Bug #12012

IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2)

Added by Mehdi Abaakouk over 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
hammer
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

issue-12012-ceph-report-20150614.gz can be found in sftp

I have recently upgraded a giant (0.87.1) cluster to hammer 0.94.2. We got no issue with the upgrade itself.
The main usage of the cluster kvm virtual machine backed on rbd image (format 2).

We have some 'replicated' pools and some 'erasure coding' pools.

On all Virtual Machines using 'erasure coding' pools tier, we got many 'IO Error' on their filesystems (ext4) since the upgrade.

Running a 'fsck' on the FS, doesn't solve this problem (fsck repairs the FS, but it continues to be re-corrupted when we reuse it)

I can't reproduce the issue on freshly created image on the 'erasure coding' pool.

Ceph doesn't report any issue in logs.

Output on 'dmesg' inside on of the VM:

[8536286.192503] end_request: I/O error, dev vdf, sector 4637171712
[8536286.530987] end_request: I/O error, dev vdf, sector 4637179904
[8536287.225984] end_request: I/O error, dev vdf, sector 4637188096
[8536287.556361] end_request: I/O error, dev vdf, sector 4637196288
[8536288.758833] end_request: I/O error, dev vdf, sector 4637204480
[8536289.305179] end_request: I/O error, dev vdf, sector 4637212672
[8536289.745542] end_request: I/O error, dev vdf, sector 4637220864
[8536290.192655] end_request: I/O error, dev vdf, sector 4637229056
[8536290.924421] end_request: I/O error, dev vdf, sector 4637237248
[8536291.635483] end_request: I/O error, dev vdf, sector 6468124152
[8557542.208535] end_request: I/O error, dev vdc, sector 7616889296
[8557542.944758] end_request: I/O error, dev vdc, sector 41232
[8557542.945423] Buffer I/O error on device vdc, logical block 5154
[8557542.946000] Buffer I/O error on device vdc, logical block 5155
[8557542.946572] Buffer I/O error on device vdc, logical block 5156
[8557542.947156] Buffer I/O error on device vdc, logical block 5157
[8557542.947728] Buffer I/O error on device vdc, logical block 5158
[8557542.948305] Buffer I/O error on device vdc, logical block 5159
[8557542.948741] Buffer I/O error on device vdc, logical block 5160
[8557542.948741] Buffer I/O error on device vdc, logical block 5161
[8557542.948741] Buffer I/O error on device vdc, logical block 5162
[8557543.804862] EXT4-fs error (device vdc): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/4/lost+found: directory contains a hole at offset 0
[8557543.806331] Aborting journal on device vdc-8.
[8557543.806787] EXT4-fs (vdc): Remounting filesystem read-only
[8557543.813079] end_request: I/O error, dev vdb, sector 7138738712
[8557544.515328] end_request: I/O error, dev vdb, sector 41232
[8557545.979173] EXT4-fs error (device vdb): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/6/lost+found: directory contains a hole at offset 0
[8557546.340171] end_request: I/O error, dev vdf, sector 41232
[8557546.913031] EXT4-fs error (device vdf): ext4_readdir:172: inode #11: comm find: path /data/tdcpb/archive/8/lost+found: directory contains a hole at offset 0
[8557546.914464] Aborting journal on device vdf-8.
[8557548.249942] EXT4-fs (vdf): Remounting filesystem read-only
[8557566.732395] end_request: I/O error, dev vdd, sector 41224
[8557844.829111] end_request: I/O error, dev vdd, sector 4915757312
[8583680.992045] EXT4-fs (vdb): error count since last fsck: 2
[8583680.992057] EXT4-fs (vdb): initial error at time 1434120904: ext4_find_entry:932: inode 111542328
[8583680.992060] EXT4-fs (vdb): last error at time 1434181428: ext4_readdir:172: inode 11
[8586076.784775] UDP: bad checksum. From 46.183.220.250:3099 to 89.234.156.236:5060 ulen 237
[8644236.256068] EXT4-fs (vdc): error count since last fsck: 1
[8644236.256070] EXT4-fs (vdc): initial error at time 1434181426: ext4_readdir:172: inode 11
[8644236.256072] EXT4-fs (vdc): last error at time 1434181426: ext4_readdir:172: inode 11
[8644236.256075] EXT4-fs (vdf): error count since last fsck: 1
[8644236.256076] EXT4-fs (vdf): initial error at time 1434181429: ext4_readdir:172: inode 11
[8644236.256078] EXT4-fs (vdf): last error at time 1434181429: ext4_readdir:172: inode 11
[8652845.649692] end_request: I/O error, dev vdd, sector 8328
[8652845.651157] EXT4-fs error (device vdd): ext4_read_inode_bitmap:161: comm touch: Cannot read inode bitmap - block_group = 0, inode_bitmap = 1041
[8652846.052461] Aborting journal on device vdd-8.
[8652846.408963] EXT4-fs (vdd): Remounting filesystem read-only
[8652846.409564] EXT4-fs error (device vdd) in ext4_new_inode:945: IO failure
[8652846.446874] EXT4-fs error (device vdd) in ext4_create:1776: IO failure

osd_errors.txt View (5.96 KB) Jason Dillaman, 06/22/2015 01:57 PM

output_os_map.txt View (24.3 KB) Laurent GUERBY, 06/22/2015 09:49 PM

output_os_map_json.txt View (31.9 KB) Laurent GUERBY, 06/22/2015 09:54 PM


Related issues

Copied to Ceph - Backport #12940: IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2) Resolved

Associated revisions

Revision 700d42ef (diff)
Added by Kefu Chai over 7 years ago

osd: translate sparse_read to read for ecpool

Fixes: #12012
Signed-off-by: Kefu Chai <>

Revision 31804991 (diff)
Added by Kefu Chai about 7 years ago

osd: translate sparse_read to read for ecpool

Fixes: #12012
Signed-off-by: Kefu Chai <>
(cherry picked from commit 700d42ef1c82f5602249b96690ae881c1d259d54)

History

#1 Updated by Loïc Dachary over 7 years ago

Could you also show the tier configuration you have ?

#2 Updated by Loïc Dachary over 7 years ago

issue-12012-ceph-report-20150614.gz can be found in sftp

#3 Updated by Loïc Dachary over 7 years ago

            {
                "pool": 80,
                "pool_name": "ec4p1",
                "flags": 1,
                "flags_names": "hashpspool",
                "type": 3,
                "size": 5,
                "min_size": 2,
                "crush_ruleset": 2,
                "object_hash": 2,
                "pg_num": 4096,
                "pg_placement_num": 4096,
                "crash_replay_interval": 0,
                "last_change": "27770",
                "last_force_op_resend": "27732",
                "auid": 0,
                "snap_mode": "selfmanaged",
                "snap_seq": 0,
                "snap_epoch": 27730,
                "pool_snaps": [],
                "removed_snaps": "[]",
                "quota_max_bytes": 0,
                "quota_max_objects": 0,
                "tiers": [
                    81
                ],
                "tier_of": -1,
                "read_tier": 81,
                "write_tier": 81,
                "cache_mode": "none",
                "target_max_bytes": 0,
                "target_max_objects": 0,
                "cache_target_dirty_ratio_micro": 400000,
                "cache_target_full_ratio_micro": 800000,
                "cache_min_flush_age": 0,
                "cache_min_evict_age": 0,
                "erasure_code_profile": "ec4p1profile",
                "hit_set_params": {
                    "type": "none" 
                },
                "hit_set_period": 0,
                "hit_set_count": 0,
                "min_read_recency_for_promote": 0,
                "stripe_width": 4096,
                "expected_num_objects": 0
            },
            {
                "pool": 81,
                "pool_name": "ec4p1c",
                "flags": 9,
                "flags_names": "hashpspool,incomplete_clones",
                "type": 1,
                "size": 3,
                "min_size": 2,
                "crush_ruleset": 0,
                "object_hash": 2,
                "pg_num": 4096,
                "pg_placement_num": 4096,
                "crash_replay_interval": 0,
                "last_change": "27771",
                "last_force_op_resend": "0",
                "auid": 0,
                "snap_mode": "selfmanaged",
                "snap_seq": 0,
                "snap_epoch": 27730,
                "pool_snaps": [],
                "removed_snaps": "[]",
                "quota_max_bytes": 0,
                "quota_max_objects": 0,
                "tiers": [],
                "tier_of": 80,
                "read_tier": -1,
                "write_tier": -1,
                "cache_mode": "writeback",
                "target_max_bytes": 100000000000,
                "target_max_objects": 100000000000,
                "cache_target_dirty_ratio_micro": 400000,
                "cache_target_full_ratio_micro": 800000,
                "cache_min_flush_age": 0,
                "cache_min_evict_age": 0,
                "erasure_code_profile": "",
                "hit_set_params": {
                    "type": "bloom",
                    "false_positive_probability": 0.050000,
                    "target_size": 0,
                    "seed": 0
                },
                "hit_set_period": 3600,
                "hit_set_count": 1,
                "min_read_recency_for_promote": 0,
                "stripe_width": 0,
                "expected_num_objects": 0
            }

#4 Updated by Loïc Dachary over 7 years ago

All PG are active+clean in the 80 and 81 pools

$ jq '.pgmap.pg_stats[] | select(.pgid | startswith("80.") or startswith("81.") ) | select(.state != "active+clean")' < ~/Downloads/issue-12012-ceph-report-20150614 | wc -l
0

#5 Updated by Loïc Dachary over 7 years ago

When was the last time this pool had a deep-scrub ?

#6 Updated by Loïc Dachary over 7 years ago

The most recent deep-scrub is six month old

$ jq '.pgmap.pg_stats[] | select(.pgid | startswith("80.") or startswith("81.") ) | .last_deep_scrub_stamp' < ~/Downloads/issue-12012-ceph-report-20150614 | sort -u | tail -1
"2014-12-26 10:43:22.313538" 

#7 Updated by Loïc Dachary over 7 years ago

Could you list the operations you did to upgrade ? Was it just a matter of apt-get install ? Or did you do other maintenance operations as well ?

#8 Updated by Loïc Dachary over 7 years ago

joshd or jdillaman may have a suggestion on how to narrow the error down. Maybe "vdf, sector 4637237248" allows to find which rbd object is faulty and focus on it

#9 Updated by Mehdi Abaakouk over 7 years ago

We have not scrub since we upgrade to giant due to http://tracker.ceph.com/issues/8011, we upgrade to hammer to be able to do scrub again, but not yet re-renable it.

Upgrade process:

  • ceph osd set noout
  • apt-get install ... everywhere
  • restart one by one all mon nodes first.
  • restart one by one all OSDs and waiting for recovery to finish between each.
  • live-migrate all the vms to got last version of librbd in all running kvm processes
  • ceph osd unset noout

No special maintenance stuffs.

#10 Updated by Loïc Dachary over 7 years ago

  • Subject changed from ext4 IO error on kvm/rbd on an erasure coded pool to IO error on kvm/rbd with an erasure coded pool tier (after an upgrade from 0.87.1 to 0.94.2)

#11 Updated by Loïc Dachary over 7 years ago

  • Status changed from New to 12

#12 Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)

#13 Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)

#14 Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)

#15 Updated by Mehdi Abaakouk over 7 years ago

kvm/qemu is 2.2 and have not changed before and after the upgrade. librbd is 0.94.2-1-ga11cca9

#16 Updated by Mehdi Abaakouk over 7 years ago

I have also stopped one impacted VM, do a fsck with 'virt-rescue utility', and restart the VM, but I continue to get the FS correupted.

#17 Updated by Loïc Dachary over 7 years ago

jdillaman suggests: might be good to get librbd debug logs (“debug rbd = 20”) from qemu. Can you get that please Mehdi ?

#19 Updated by Mehdi Abaakouk over 7 years ago

I have made the following additionnal test:

  • Inside the VM, I have read the EC volume and got many 'IO Error'
  • Then I have detached this EC volume from VM
  • Mount it with the kernel client on another server with rbd map... (kernel 3.16)
  • Do the same command to read the EC volume, and doesn't get any errors
  • To confirm, I have reattached the EC volume to the kvm process,
  • Done the test again inside the VM and I got the 'IO Error' again.

The test command I have used is 'find /mnt -type f -exec cat {} \; > /dev/null' to read the whole device with the FS layer.

I will try to do it again with rbd debug enabled

#20 Updated by Mehdi Abaakouk over 7 years ago

I have generated 'IO error' with same command as previous test.

And uploaded the resulting log on sftp :

  • issue-12012-dmesg.gz (from within the VM on which /dev/vdb is attached)
  • issue-12012-rbd-debug-qemu-guest.21599.log.gz

143 "IO errors" have occurred during the test.

#21 Updated by Mehdi Abaakouk over 7 years ago

Just in case the qemu command line launched (generated by libvirt and nova):

qemu-system-x86_64 -enable-kvm -name fe6bd6b7-152c-4240-a119-3cd7a760fc0a -S -machine pc-i440fx-2.2,accel=kvm,usb=off -cpu Nehalem -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid fe6bd6b7-152c-4240-a119-3cd7a760fc0a -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=2015.1.0,serial=7350747a-f98d-489c-954f-18263c6fd088,uuid=fe6bd6b7-152c-4240-a119-3cd7a760fc0a -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fe6bd6b7-152c-4240-a119-3cd7a760fc0a.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:disks/fe6bd6b7-152c-4240-a119-3cd7a760fc0a_disk.config:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-ide0-1-1,readonly=on,format=raw,cache=none,aio=native -device ide-cd,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -drive file=rbd:disks/volume-0776d998-b84e-4a60-9304-69834af247c0:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-virtio-disk0,format=raw,serial=0776d998-b84e-4a60-9304-69834af247c0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=rbd:ec4p1/volume-ecd57a41-51a7-40a0-a4ea-315537acbfb8:id=openstack-service:key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=192.168.99.251\:6789\;192.168.99.252\:6789\;192.168.99.253\:6789,if=none,id=drive-virtio-disk1,format=raw,serial=ecd57a41-51a7-40a0-a4ea-315537acbfb8,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:37:57:d0,bus=pci.0,addr=0x3 -chardev socket,id=charserial0,host=192.168.3.106,port=10001,server,nowait -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -spice port=5900,addr=0.0.0.0,disable-ticketing,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

#22 Updated by Mehdi Abaakouk over 7 years ago

I have made some other tests by changing the io policy of qemu:

  • cache=none, io=native -> IO Errors
  • cache=none -> IO Errors # I guess io=threads by default
  • none of both -> no errors # I guess io=threads, cache=writethrough by default.

So, the issue seems to be triggered only when the rbd cache is disabled.

#23 Updated by Josh Durgin over 7 years ago

  • Project changed from Ceph to rbd

#24 Updated by Jason Dillaman over 7 years ago

I wasn't able to reproduce an issue using a clean v0.94.2 build with the librbd cache disabled. Looking at your RBD client logs, the client is receiving -95 (EOPNOTSUPP) errors from the OSDs when reading from the image. Can you search your OSD logs in the cache tier for this error code related to objects prefixed with "rbd_data."?

#25 Updated by Mehdi Abaakouk over 7 years ago

I don't find this error code (either by code or by name) on my OSDs logs, do I need to increase a particular log level this see them ?

#26 Updated by Mehdi Abaakouk over 7 years ago

I have retried why 'debug osd 20', and didn't find this errors in OSDs logs too.

#27 Updated by Jason Dillaman over 7 years ago

I've attached all the objects that encountered errors in your log. I extract the errors via 'zgrep -E "should_complete.*r = -" issue-12012-rbd-debug-qemu-guest.21599.log.gz | grep -Eo 'rbd_data[^ ]*' | sort | uniq'.

You should be able to map these objects back to PGs via 'for x in $(cat osd_errors.txt); do ceph osd map ec4p1 $x; done'.

You might need to add 'debug ms = 1' to your OSD's ceph.conf to have the OSD op replies logged.

#28 Updated by Laurent GUERBY over 7 years ago

Here is the output of the for loop. Let me know if more is needed.

#29 Updated by Laurent GUERBY over 7 years ago

With -f json per Loic request

#30 Updated by Josh Durgin over 7 years ago

  • Status changed from 12 to Need More Info

We haven't been able to reproduce this - can you give more details on the erasure codeded pool setup?

#31 Updated by Mehdi Abaakouk over 7 years ago

I guess all our pool setup is described into the previously send ceph-report. Lets me known is you need something else and how to gather it.

#32 Updated by Jason Dillaman over 7 years ago

Mehdi Abaakouk wrote:

I guess all our pool setup is described into the previously send ceph-report. Lets me known is you need something else and how to gather it.

We really will need the OSD reply outputs after applying "debug ms = 1" to your OSD config. I would apply the setting first to the cache tier, reproduce the issue, and scan the OSD logs for "-95" errors against "rbd_data.*" objects.

#33 Updated by Jason Dillaman over 7 years ago

Actually, this might be as simple as the fact that when not using the librbd cache, the read ops are issued as a sparse read request and when the cache is enabled a standard read is used. If the cache tier is full, (sparse) read requests will be redirected to the EC pool, but the EC pool does not support sparse reads so it will return -95 (EOPNOTSUPP). Since this redirection only occurs when the cache tier is full, that might explain why we haven't seen it during our testing.

#34 Updated by Jason Dillaman over 7 years ago

  • Project changed from rbd to Ceph

#35 Updated by Kefu Chai over 7 years ago

to fix this issue,

  • we need sparse reads in ReplicatedPG::do_op to be treated as normal reads on ec pools
  • it would be good for ceph_test_rados to send sparse reads some times, maybe have a ratio option send the read as a bunch of sparse reads in the read op. the option defaults to 0.5.
  • also, the api tests should do sparse reads
    • on both replicated
    • on a replicated cache on ec base

#36 Updated by Kefu Chai over 7 years ago

  • Status changed from Need More Info to 12

#37 Updated by Kefu Chai over 7 years ago

  • Assignee set to Kefu Chai

#38 Updated by Kefu Chai over 7 years ago

  • Status changed from 12 to In Progress

#39 Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to hammer

#40 Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to Pending Backport

#41 Updated by Loïc Dachary almost 7 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF