Bug #13219

closed

get_device_by_uuid->blkid_find_dev_with_tag() may hang for 3 min

Added by Sage Weil over 8 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Date: Wed, 23 Sep 2015 22:48:07 +0000
From: Somnath Roy <>
To: "Samuel Just ()" <>,
"Sage Weil ()" <>
Cc: ceph-devel <>
Subject: RE: Very slow recovery/peering with latest master

Sam/Sage,
I debugged this and found that the
get_device_by_uuid->blkid_find_dev_with_tag() call within
FileStore::collect_metadata() hangs for ~3 minutes before returning EINVAL.
I see this portion was newly added after hammer.
Commenting it out resolves the issue. BTW, I see this value is stored as
metadata but not used anywhere; am I missing anything?
Here are my Linux details:

root@emsnode5:~/wip-write-path-optimization/src# uname -a
Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty

Actions #1

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info
Actions #2

Updated by Sage Weil over 8 years ago

  • Priority changed from Urgent to High
Actions #3

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #4

Updated by Somnath Roy over 8 years ago

Sage,
Here is a snippet of my last mail to the community regarding this.

Xiaoxi,
Thanks for giving me some pointers.
Now, with the help of strace, I have been able to figure out why the blkid* calls take so long to complete in my setup.
In my case, the partitions show up properly even though the drives are connected to a JBOD controller.

root@emsnode10:~/wip-write-path-optimization/src/os# strace -t -o /root/strace_blkid.txt blkid
/dev/sda1: UUID="d2060642-1af4-424f-9957-6a8dc77ff301" TYPE="ext4"
/dev/sda5: UUID="2a987cc0-e3cd-43d4-99cd-b8d8e58617e7" TYPE="swap"
/dev/sdy2: UUID="0ebd1631-52e7-4dc2-8bff-07102b877bfc" TYPE="xfs"
/dev/sdw2: UUID="29f1203b-6f44-45e3-8f6a-8ad1d392a208" TYPE="xfs"
/dev/sdt2: UUID="94f6bb55-ac61-499c-8552-600581e13dfa" TYPE="xfs"
/dev/sdr2: UUID="b629710e-915d-4c56-b6a5-4782e6d6215d" TYPE="xfs"
/dev/sdv2: UUID="69623b7f-9036-4a35-8298-dc7f5cecdb21" TYPE="xfs"
/dev/sds2: UUID="75d941c5-a85c-4c37-b409-02de34483314" TYPE="xfs"
/dev/sdx: UUID="cc84bc66-208b-4387-8470-071ec71532f2" TYPE="xfs"
/dev/sdu2: UUID="c9817831-8362-48a9-9a6c-920e0f04d029" TYPE="xfs"

But it takes a long time on the drives that are not reserved for this host. Basically, I am using 2 heads in front of a JBOF, and I am using sg_persist to reserve the drives between the 2 hosts.
Here is the strace output of blkid:

http://pastebin.com/qz2Z7Phj

You can see a lot of input/output errors when accessing the drives that are not reserved for this host.

This looks like an inefficiency in the blkid* calls (?), since tools like fdisk/lsscsi do not take this long.
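For anyone repeating this analysis, the following is a minimal sketch of how the `strace -t` output could be scanned for the calls that eat the time. It assumes the plain `HH:MM:SS syscall(...) = result` line format that `strace -t -o <file>` writes for a single process; the function names are made up for illustration, and the gap between two consecutive timestamps is attributed to the earlier call, which is only an approximation at one-second resolution.

```python
import re
import sys

# Matches lines such as: 10:00:00 open("/dev/sda", O_RDONLY) = 3
TS_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.*)$")


def parse_line(line):
    """Return (seconds_since_midnight, rest_of_line), or None if no timestamp."""
    m = TS_LINE.match(line)
    if m is None:
        return None
    hours, minutes, seconds, rest = m.groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds), rest


def slow_calls(lines, min_gap=1):
    """List (gap_seconds, call_text) for calls followed by a gap >= min_gap.

    The gap is charged to the earlier of two consecutive lines. Results are
    sorted with the largest gap first.
    """
    results = []
    prev = None
    for line in lines:
        parsed = parse_line(line.rstrip("\n"))
        if parsed is None:
            continue  # skip continuation lines or anything without a timestamp
        if prev is not None:
            gap = (parsed[0] - prev[0]) % 86400  # tolerate a wrap past midnight
            if gap >= min_gap:
                results.append((gap, prev[1]))
        prev = parsed
    results.sort(key=lambda item: item[0], reverse=True)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python slow_calls.py /root/strace_blkid.txt
    with open(sys.argv[1]) as trace:
        for gap, call in slow_calls(trace)[:20]:
            print(f"{gap:5d}s  {call}")
```

Running it against the trace file should surface the calls that precede the long pauses; whether those line up with the input/output errors on the unreserved drives still has to be checked against the trace itself.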

Regards
Somnath

It would be very helpful if we could put this call behind a config option. I will send out a pull request.
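To illustrate the idea of guarding the lookup, here is a rough sketch in Python rather than Ceph's C++. It is not what the pull request implements. The `enabled` flag stands in for a hypothetical config option, `timeout_s` is an extra assumption, and the lookup shells out to `blkid -U <uuid>` instead of calling libblkid directly.

```python
import subprocess


def device_by_uuid(uuid, enabled=False, timeout_s=5.0, run=subprocess.run):
    """Return the device path for a filesystem UUID, or None.

    enabled   -- hypothetical switch; when False the lookup is skipped entirely
    timeout_s -- assumed upper bound on how long to wait for blkid
    run       -- injectable runner (defaults to subprocess.run) for testing
    """
    if not enabled:
        # Lookup disabled: the caller simply records no device metadata.
        return None
    try:
        proc = run(
            ["blkid", "-U", uuid],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # blkid took too long or is not installed; treat as "unknown".
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip() or None
```

With the flag left at its default the function returns immediately, which matches the effect of commenting the call out. One caveat on the timeout: a child process stuck in uninterruptible I/O may not exit promptly when killed, so the bound is best-effort.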

Actions #5

Updated by Somnath Roy over 8 years ago

This is part of the following pull request:

https://github.com/ceph/ceph/pull/6670
