Bug #46124

Potential race condition regression around new OSD flock()s

Added by Niklas Hambuechen 4 months ago. Updated 21 days ago.

Status: Pending Backport
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: octopus, nautilus
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

In #38150 and PR https://github.com/ceph/ceph/pull/26245, a new `flock()` approach was introduced.

When I use `ceph-osd --mkfs` on v15.2.3, I get racy behaviour: the process sometimes tries to take a lock it already holds, which fails with `EAGAIN (Resource temporarily unavailable)` and makes OSD creation fail.

Here are 3 runs with `strace`:

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close*dev/xvd|^\+\+\+ exited'
[pid 19933] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 19933] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 19933] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 19933] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 19933] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
2020-06-21T02:59:26.498+0000 7fce7d0abdc0 -1 bdev(0x55a5f810e700 /var/lib/ceph/osd/ceph-0/block.db) _lock flock failed on /var/lib/ceph/osd/ceph-0/block.db
2020-06-21T02:59:26.498+0000 7fce7d0abdc0 -1 bdev(0x55a5f810e700 /var/lib/ceph/osd/ceph-0/block.db) open failed to lock /var/lib/ceph/osd/ceph-0/block.db: (11) Resource temporarily unavailable
2020-06-21T02:59:26.499+0000 7fce7d0abdc0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minimal_open_bluefs add block device(/var/lib/ceph/osd/ceph-0/block.db) returned: (11) Resource temporarily unavailable
2020-06-21T02:59:26.837+0000 7fce7d0abdc0 -1 OSD::mkfs: couldn't mount ObjectStore: error (11) Resource temporarily unavailable
2020-06-21T02:59:26.837+0000 7fce7d0abdc0 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (11) Resource temporarily unavailable
+++ exited with 250 +++

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close*dev/xvd|^\+\+\+ exited'
[pid 21477] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 21477] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 21477] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
+++ exited with 250 +++

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close*dev/xvd|^\+\+\+ exited'
[pid 20320] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 20320] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 20320] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 20320] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 20320] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
+++ exited with 0 +++
  • The first run fails with errors.
  • The second run, oddly, also hits the `EAGAIN` error and exits with code 250, but prints nothing to the user (a silent failure).
  • The third run succeeds.

ceph-issue-46124-strace-flock-close-clone.txt View (46.3 KB) Niklas Hambuechen, 06/21/2020 05:20 PM


Related issues

Copied to bluestore - Backport #47707: nautilus: Potential race condition regression around new OSD flock()s New
Copied to bluestore - Backport #47708: octopus: Potential race condition regression around new OSD flock()s New

History

#1 Updated by Niklas Hambuechen 4 months ago

I did not experience that in Mimic.

#2 Updated by Niklas Hambuechen 4 months ago

In case it helps, here are `strace` invocations, each showing slightly different behaviour and error messages, that include the `close()` syscalls, so you can see which pattern leads to the suspected double-lock:

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close.*dev/xvd|^\+\+\+ exited'
[pid 16133] close(10</dev/xvdg> <unfinished ...>
[pid 16133] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 16133] close(24</dev/xvdg>)        = 0
[pid 16133] close(24</dev/xvdg>)        = 0
[pid 16133] close(24</dev/xvdg>)        = 0
[pid 16133] close(24</dev/xvdg>)        = 0
[pid 16133] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 16133] close(36</dev/xvdh>)        = 0
[pid 16133] close(48</dev/xvdg>)        = 0
[pid 16133] close(24</dev/xvdh>)        = 0
[pid 16133] close(25</dev/xvdh>)        = 0
[pid 16133] close(26</dev/xvdh>)        = 0
[pid 16133] close(27</dev/xvdh>)        = 0
[pid 16133] close(28</dev/xvdh>)        = 0
[pid 16133] close(29</dev/xvdh>)        = 0
[pid 16133] close(30</dev/xvdh>)        = 0
[pid 16133] close(31</dev/xvdh>)        = 0
[pid 16133] close(32</dev/xvdh>)        = 0
[pid 16133] close(33</dev/xvdh>)        = 0
[pid 16133] close(34</dev/xvdh>)        = 0
[pid 16133] close(35</dev/xvdh>)        = 0
[pid 16133] close(36</dev/xvdg>)        = 0
[pid 16133] close(37</dev/xvdg>)        = 0
[pid 16133] close(38</dev/xvdg>)        = 0
[pid 16133] close(39</dev/xvdg>)        = 0
[pid 16133] close(40</dev/xvdg>)        = 0
[pid 16133] close(41</dev/xvdg>)        = 0
[pid 16133] close(42</dev/xvdg>)        = 0
[pid 16133] close(43</dev/xvdg>)        = 0
[pid 16133] close(44</dev/xvdg>)        = 0
[pid 16133] close(45</dev/xvdg>)        = 0
[pid 16133] close(46</dev/xvdg>)        = 0
[pid 16133] close(47</dev/xvdg>)        = 0
[pid 16133] close(12</dev/xvdg>)        = 0
[pid 16133] close(13</dev/xvdg>)        = 0
[pid 16133] close(14</dev/xvdg>)        = 0
[pid 16133] close(15</dev/xvdg>)        = 0
[pid 16133] close(16</dev/xvdg>)        = 0
[pid 16133] close(17</dev/xvdg>)        = 0
[pid 16133] close(18</dev/xvdg>)        = 0
[pid 16133] close(19</dev/xvdg>)        = 0
[pid 16133] close(20</dev/xvdg>)        = 0
[pid 16133] close(21</dev/xvdg>)        = 0
[pid 16133] close(22</dev/xvdg>)        = 0
[pid 16133] close(23</dev/xvdg>)        = 0
[pid 16133] close(10</dev/xvdg>)        = 0
[pid 16133] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
2020-06-21T03:11:10.417+0000 7fa45796edc0 -1 bdev(0x55babde3c000 /var/lib/ceph/osd/ceph-0/block) _lock flock failed on /var/lib/ceph/osd/ceph-0/block[pid 16133] close(12</dev/xvdg>)        = 0
[pid 16133] close(13</dev/xvdg>)        = 0
2020-06-21T03:11:10.417+0000 7fa45796edc0 -1 bdev(0x55babde3c000 /var/lib/ceph/osd/ceph-0/block) open failed to lock /var/lib/ceph/osd/ceph-0/block: (11) Resource temporarily unavailable[pid 16133] close(14</dev/xvdg>
[pid 16133] close(15</dev/xvdg>)        = 0
[pid 16133] close(16</dev/xvdg>)        = 0
[pid 16133] close(17</dev/xvdg>)        = 0
[pid 16133] close(18</dev/xvdg>)        = 0
[pid 16133] close(19</dev/xvdg>)        = 0
[pid 16133] close(20</dev/xvdg>)        = 0
[pid 16133] close(21</dev/xvdg>)        = 0
[pid 16133] close(22</dev/xvdg>)        = 0
[pid 16133] close(23</dev/xvdg>)        = 0
2020-06-21T03:11:10.418+0000 7fa45796edc0 -1 OSD::mkfs: couldn't mount ObjectStore: error (11) Resource temporarily unavailable
2020-06-21T03:11:10.419+0000 7fa45796edc0 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (11) Resource temporarily unavailable
+++ exited with 250 +++

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close.*dev/xvd|^\+\+\+ exited'
[pid 19261] close(10</dev/xvdg>)        = 0
[pid 19261] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 19261] close(24</dev/xvdg>)        = 0
[pid 19261] close(24</dev/xvdg>)        = 0
[pid 19261] close(24</dev/xvdg>)        = 0
[pid 19261] close(24</dev/xvdg>)        = 0
[pid 19261] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 19261] close(36</dev/xvdh>)        = 0
[pid 19261] close(48</dev/xvdg>)        = 0
[pid 19261] close(24</dev/xvdh>)        = 0
[pid 19261] close(25</dev/xvdh>)        = 0
[pid 19261] close(26</dev/xvdh>)        = 0
[pid 19261] close(27</dev/xvdh>)        = 0
[pid 19261] close(28</dev/xvdh>)        = 0
[pid 19261] close(29</dev/xvdh>)        = 0
[pid 19261] close(30</dev/xvdh>)        = 0
[pid 19261] close(31</dev/xvdh>)        = 0
[pid 19261] close(32</dev/xvdh>)        = 0
[pid 19261] close(33</dev/xvdh>)        = 0
[pid 19261] close(34</dev/xvdh>)        = 0
[pid 19261] close(35</dev/xvdh>)        = 0
[pid 19261] close(36</dev/xvdg>)        = 0
[pid 19261] close(37</dev/xvdg>)        = 0
[pid 19261] close(38</dev/xvdg>)        = 0
[pid 19261] close(39</dev/xvdg>)        = 0
[pid 19261] close(40</dev/xvdg>)        = 0
[pid 19261] close(41</dev/xvdg>)        = 0
[pid 19261] close(42</dev/xvdg>)        = 0
[pid 19261] close(43</dev/xvdg>)        = 0
[pid 19261] close(44</dev/xvdg>)        = 0
[pid 19261] close(45</dev/xvdg>)        = 0
[pid 19261] close(46</dev/xvdg>)        = 0
[pid 19261] close(47</dev/xvdg>)        = 0
[pid 19261] close(12</dev/xvdg>)        = 0
[pid 19261] close(13</dev/xvdg>)        = 0
[pid 19261] close(14</dev/xvdg>)        = 0
[pid 19261] close(15</dev/xvdg>)        = 0
[pid 19261] close(16</dev/xvdg>)        = 0
[pid 19261] close(17</dev/xvdg>)        = 0
[pid 19261] close(18</dev/xvdg>)        = 0
[pid 19261] close(19</dev/xvdg>)        = 0
[pid 19261] close(20</dev/xvdg>)        = 0
[pid 19261] close(21</dev/xvdg>)        = 0
[pid 19261] close(22</dev/xvdg>)        = 0
[pid 19261] close(23</dev/xvdg>)        = 0
[pid 19261] close(10</dev/xvdg>)        = 0
[pid 19261] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19261] close(12</dev/xvdg>2020-06-21T03:12:01.110+0000 7fa9791b7dc0 -1 bdev(0x5586e29f4000 /var/lib/ceph/osd/ceph-0/block) _lock flock failed on /var/lib/ceph/osd/ceph-0/block
2020-06-21T03:12:01.110+0000 7fa9791b7dc0 -1 bdev(0x5586e29f4000 /var/lib/ceph/osd/ceph-0/block) open failed to lock /var/lib/ceph/osd/ceph-0/block: (11) Resource temporarily unavailable
[pid 19261] close(13</dev/xvdg>)        = 0
[pid 19261] close(14</dev/xvdg>)        = 0
[pid 19261] close(15</dev/xvdg>)        = 0
[pid 19261] close(16</dev/xvdg>)        = 0
[pid 19261] close(17</dev/xvdg>)        = 0
[pid 19261] close(18</dev/xvdg>)        = 0
[pid 19261] close(19</dev/xvdg>)        = 0
[pid 19261] close(20</dev/xvdg>)        = 0
[pid 19261] close(21</dev/xvdg>)        = 0
[pid 19261] close(22</dev/xvdg>)        = 0
[pid 19261] close(23</dev/xvdg>)        = 0
+++ exited with 250 +++

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close.*dev/xvd|^\+\+\+ exited'
[pid 24470] close(10</dev/xvdg>)        = 0
[pid 24470] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 24470] close(36</dev/xvdh>)        = 0
[pid 24470] close(48</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdh>)        = 0
[pid 24470] close(25</dev/xvdh>)        = 0
[pid 24470] close(26</dev/xvdh>)        = 0
[pid 24470] close(27</dev/xvdh>)        = 0
[pid 24470] close(28</dev/xvdh>)        = 0
[pid 24470] close(29</dev/xvdh>)        = 0
[pid 24470] close(30</dev/xvdh>)        = 0
[pid 24470] close(31</dev/xvdh>)        = 0
[pid 24470] close(32</dev/xvdh>)        = 0
[pid 24470] close(33</dev/xvdh>)        = 0
[pid 24470] close(34</dev/xvdh>)        = 0
[pid 24470] close(35</dev/xvdh>)        = 0
[pid 24470] close(36</dev/xvdg>)        = 0
[pid 24470] close(37</dev/xvdg>)        = 0
[pid 24470] close(38</dev/xvdg>)        = 0
[pid 24470] close(39</dev/xvdg>)        = 0
[pid 24470] close(40</dev/xvdg>)        = 0
[pid 24470] close(41</dev/xvdg>)        = 0
[pid 24470] close(42</dev/xvdg>)        = 0
[pid 24470] close(43</dev/xvdg>)        = 0
[pid 24470] close(44</dev/xvdg>)        = 0
[pid 24470] close(45</dev/xvdg>)        = 0
[pid 24470] close(46</dev/xvdg>)        = 0
[pid 24470] close(47</dev/xvdg>)        = 0
[pid 24470] close(12</dev/xvdg>)        = 0
[pid 24470] close(13</dev/xvdg>)        = 0
[pid 24470] close(14</dev/xvdg>)        = 0
[pid 24470] close(15</dev/xvdg>)        = 0
[pid 24470] close(16</dev/xvdg>)        = 0
[pid 24470] close(17</dev/xvdg>)        = 0
[pid 24470] close(18</dev/xvdg>)        = 0
[pid 24470] close(19</dev/xvdg>)        = 0
[pid 24470] close(20</dev/xvdg>)        = 0
[pid 24470] close(21</dev/xvdg>)        = 0
[pid 24470] close(22</dev/xvdg>)        = 0
[pid 24470] close(23</dev/xvdg>)        = 0
[pid 24470] close(10</dev/xvdg>)        = 0
[pid 24470] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 24470] close(36</dev/xvdh>)        = 0
[pid 24470] close(48</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdh>)        = 0
[pid 24470] close(25</dev/xvdh>)        = 0
[pid 24470] close(26</dev/xvdh>)        = 0
[pid 24470] close(27</dev/xvdh>)        = 0
[pid 24470] close(28</dev/xvdh>)        = 0
[pid 24470] close(29</dev/xvdh>)        = 0
[pid 24470] close(30</dev/xvdh>)        = 0
[pid 24470] close(31</dev/xvdh>)        = 0
[pid 24470] close(32</dev/xvdh>)        = 0
[pid 24470] close(33</dev/xvdh>)        = 0
[pid 24470] close(34</dev/xvdh>)        = 0
[pid 24470] close(35</dev/xvdh>)        = 0
[pid 24470] close(36</dev/xvdg>)        = 0
[pid 24470] close(37</dev/xvdg>)        = 0
[pid 24470] close(38</dev/xvdg>)        = 0
[pid 24470] close(39</dev/xvdg>)        = 0
[pid 24470] close(40</dev/xvdg>)        = 0
[pid 24470] close(41</dev/xvdg>)        = 0
[pid 24470] close(42</dev/xvdg>)        = 0
[pid 24470] close(43</dev/xvdg>)        = 0
[pid 24470] close(44</dev/xvdg>)        = 0
[pid 24470] close(45</dev/xvdg>)        = 0
[pid 24470] close(46</dev/xvdg>)        = 0
[pid 24470] close(47</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] close(24</dev/xvdg>)        = 0
[pid 24470] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
2020-06-21T03:13:24.640+0000 7f555224bdc0 -1 bdev(0x55873b676700 /var/lib/ceph/osd/ceph-0/block.db) _lock flock failed on /var/lib/ceph/osd/ceph-0/block.db
2020-06-21T03:13:24.640+0000 7f555224bdc0 -1 bdev(0x55873b676700 /var/lib/ceph/osd/ceph-0/block.db) open failed to lock /var/lib/ceph/osd/ceph-0/block.db: (11) Resource temporarily unavailable
[pid 24470] close(24</dev/xvdh>)        = 0
[pid 24470] close(25</dev/xvdh>)        = 0
[pid 24470] close(26</dev/xvdh>)        = 0
[pid 24470] close(27</dev/xvdh>)        = 0
[pid 24470] close(28</dev/xvdh>)        = 0
[pid 24470] close(29</dev/xvdh>)        = 0
[pid 24470] close(30</dev/xvdh>)        = 0
[pid 24470] close(31</dev/xvdh>)        = 0
[pid 24470] close(32</dev/xvdh>)        = 0
[pid 24470] close(33</dev/xvdh>)        = 0
[pid 24470] close(34</dev/xvdh>)        = 0
[pid 24470] close(35</dev/xvdh>)        = 0
2020-06-21T03:13:24.640+0000 7f555224bdc0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _minimal_open_bluefs add block device(/var/lib/ceph/osd/ceph-0/block.db) returned: (11) Resource temporarily unavailable
[pid 24470] close(12</dev/xvdg>)        = 0
[pid 24470] close(13</dev/xvdg>)        = 0
[pid 24470] close(14</dev/xvdg>)        = 0
[pid 24470] close(15</dev/xvdg>)        = 0
[pid 24470] close(16</dev/xvdg>)        = 0
[pid 24470] close(17</dev/xvdg>)        = 0
[pid 24470] close(18</dev/xvdg>)        = 0
[pid 24470] close(19</dev/xvdg>)        = 0
[pid 24470] close(20</dev/xvdg>)        = 0
[pid 24470] close(21</dev/xvdg>)        = 0
[pid 24470] close(22</dev/xvdg>)        = 0
[pid 24470] close(23</dev/xvdg>)        = 0
2020-06-21T03:13:25.006+0000 7f555224bdc0 -1 OSD::mkfs: couldn't mount ObjectStore: error (11) Resource temporarily unavailable
2020-06-21T03:13:25.108+0000 7f555224bdc0 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (11) Resource temporarily unavailable
+++ exited with 250 +++

# strace -fye flock,close ceph-osd -i 0 --mkfs --osd-uuid f1889a99-b4dc-4044-a244-eabca4539ac9 --setuser ceph --setgroup ceph --osd-objectstore bluestore 2>&1 | grep -E 'flock|Resource|close.*dev/xvd|^\+\+\+ exited'
[pid 16441] close(10</dev/xvdg>)        = 0
[pid 16441] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 16441] close(36</dev/xvdh>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdh>)        = 0
[pid 16441] close(25</dev/xvdh>)        = 0
[pid 16441] close(26</dev/xvdh>)        = 0
[pid 16441] close(27</dev/xvdh>)        = 0
[pid 16441] close(28</dev/xvdh>)        = 0
[pid 16441] close(29</dev/xvdh>)        = 0
[pid 16441] close(30</dev/xvdh>)        = 0
[pid 16441] close(31</dev/xvdh>)        = 0
[pid 16441] close(32</dev/xvdh>)        = 0
[pid 16441] close(33</dev/xvdh>)        = 0
[pid 16441] close(34</dev/xvdh>)        = 0
[pid 16441] close(35</dev/xvdh>)        = 0
[pid 16441] close(36</dev/xvdg>)        = 0
[pid 16441] close(37</dev/xvdg>)        = 0
[pid 16441] close(38</dev/xvdg>)        = 0
[pid 16441] close(39</dev/xvdg>)        = 0
[pid 16441] close(40</dev/xvdg>)        = 0
[pid 16441] close(41</dev/xvdg>)        = 0
[pid 16441] close(42</dev/xvdg>)        = 0
[pid 16441] close(43</dev/xvdg>)        = 0
[pid 16441] close(44</dev/xvdg>)        = 0
[pid 16441] close(45</dev/xvdg>)        = 0
[pid 16441] close(46</dev/xvdg>)        = 0
[pid 16441] close(47</dev/xvdg>)        = 0
[pid 16441] close(12</dev/xvdg>)        = 0
[pid 16441] close(13</dev/xvdg>)        = 0
[pid 16441] close(14</dev/xvdg>)        = 0
[pid 16441] close(15</dev/xvdg>)        = 0
[pid 16441] close(16</dev/xvdg>)        = 0
[pid 16441] close(17</dev/xvdg>)        = 0
[pid 16441] close(18</dev/xvdg>)        = 0
[pid 16441] close(19</dev/xvdg>)        = 0
[pid 16441] close(20</dev/xvdg>)        = 0
[pid 16441] close(21</dev/xvdg>)        = 0
[pid 16441] close(22</dev/xvdg>)        = 0
[pid 16441] close(23</dev/xvdg>)        = 0
[pid 16441] close(10</dev/xvdg>)        = 0
[pid 16441] flock(12</dev/xvdg>, LOCK_EX|LOCK_NB) = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 16441] close(36</dev/xvdh>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdh>)        = 0
[pid 16441] close(25</dev/xvdh>)        = 0
[pid 16441] close(26</dev/xvdh>)        = 0
[pid 16441] close(27</dev/xvdh>)        = 0
[pid 16441] close(28</dev/xvdh>)        = 0
[pid 16441] close(29</dev/xvdh>)        = 0
[pid 16441] close(30</dev/xvdh>)        = 0
[pid 16441] close(31</dev/xvdh>)        = 0
[pid 16441] close(32</dev/xvdh>)        = 0
[pid 16441] close(33</dev/xvdh>)        = 0
[pid 16441] close(34</dev/xvdh>)        = 0
[pid 16441] close(35</dev/xvdh>)        = 0
[pid 16441] close(36</dev/xvdg>)        = 0
[pid 16441] close(37</dev/xvdg>)        = 0
[pid 16441] close(38</dev/xvdg>)        = 0
[pid 16441] close(39</dev/xvdg>)        = 0
[pid 16441] close(40</dev/xvdg>)        = 0
[pid 16441] close(41</dev/xvdg>)        = 0
[pid 16441] close(42</dev/xvdg>)        = 0
[pid 16441] close(43</dev/xvdg>)        = 0
[pid 16441] close(44</dev/xvdg>)        = 0
[pid 16441] close(45</dev/xvdg>)        = 0
[pid 16441] close(46</dev/xvdg>)        = 0
[pid 16441] close(47</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdg>)        = 0
[pid 16441] flock(24</dev/xvdh>, LOCK_EX|LOCK_NB) = 0
[pid 16441] close(36</dev/xvdh>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(48</dev/xvdg>)        = 0
[pid 16441] close(24</dev/xvdh>)        = 0
[pid 16441] close(25</dev/xvdh>)        = 0
[pid 16441] close(26</dev/xvdh>)        = 0
[pid 16441] close(27</dev/xvdh>)        = 0
[pid 16441] close(28</dev/xvdh>)        = 0
[pid 16441] close(29</dev/xvdh>)        = 0
[pid 16441] close(30</dev/xvdh>)        = 0
[pid 16441] close(31</dev/xvdh>)        = 0
[pid 16441] close(32</dev/xvdh>)        = 0
[pid 16441] close(33</dev/xvdh>)        = 0
[pid 16441] close(34</dev/xvdh>)        = 0
[pid 16441] close(35</dev/xvdh>)        = 0
[pid 16441] close(36</dev/xvdg>)        = 0
[pid 16441] close(37</dev/xvdg>)        = 0
[pid 16441] close(38</dev/xvdg>)        = 0
[pid 16441] close(39</dev/xvdg>)        = 0
[pid 16441] close(40</dev/xvdg>)        = 0
[pid 16441] close(41</dev/xvdg>)        = 0
[pid 16441] close(42</dev/xvdg>)        = 0
[pid 16441] close(43</dev/xvdg>)        = 0
[pid 16441] close(44</dev/xvdg>)        = 0
[pid 16441] close(45</dev/xvdg>)        = 0
[pid 16441] close(46</dev/xvdg>)        = 0
[pid 16441] close(47</dev/xvdg>)        = 0
[pid 16441] close(12</dev/xvdg>)        = 0
[pid 16441] close(13</dev/xvdg>)        = 0
[pid 16441] close(14</dev/xvdg>)        = 0
[pid 16441] close(15</dev/xvdg>)        = 0
[pid 16441] close(16</dev/xvdg>)        = 0
[pid 16441] close(17</dev/xvdg>)        = 0
[pid 16441] close(18</dev/xvdg>)        = 0
[pid 16441] close(19</dev/xvdg>)        = 0
[pid 16441] close(20</dev/xvdg>)        = 0
[pid 16441] close(21</dev/xvdg>)        = 0
[pid 16441] close(22</dev/xvdg>)        = 0
[pid 16441] close(23</dev/xvdg>)        = 0
+++ exited with 0 +++

#3 Updated by Niklas Hambuechen 4 months ago

Another question:

Would it not be better to use OFD locks (Open File Description locks), that is via F_OFD_SETLK? See

They are the newest of the three types of file locking (classic `fcntl(F_SETLK)`, `flock()`, and `fcntl(F_OFD_SETLK)`), and they also solve the problem the PR aimed to address:

classic posix lock gotcha: if you close any fd to the same inode from the process, the lock(s) go away.
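A minimal sketch of what taking such an OFD lock could look like (illustrative only, not Ceph code; `F_OFD_SETLK` requires Linux >= 3.15 and `_GNU_SOURCE` on glibc):

```c
#define _GNU_SOURCE     /* for F_OFD_SETLK */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take an exclusive, non-blocking Open File Description lock on fd.
 * Unlike classic F_SETLK, an OFD lock is owned by the open file
 * description, so closing *other* fds to the same inode does not
 * release it.  Returns 0 on success, -1 (errno == EAGAIN) if a
 * conflicting lock is held via another open file description. */
static int lock_fd_ofd(int fd)
{
    struct flock l;
    memset(&l, 0, sizeof(l));
    l.l_type   = F_WRLCK;    /* exclusive */
    l.l_whence = SEEK_SET;
    l.l_start  = 0;
    l.l_len    = 0;          /* 0 length == lock the whole file */
    l.l_pid    = 0;          /* must be 0 for OFD locks */
    return fcntl(fd, F_OFD_SETLK, &l);
}
```

Note that, unlike per-process `F_SETLK`, two independent `open()`s of the same file conflict even within one process, which is the safety property the original PR wanted from `flock()`.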

#4 Updated by Niklas Hambuechen 4 months ago

From the strace above, we can see that there's always a `close()` after a matching `flock()` within the same PID, so the single thread seems to behave correctly.

I suspect that Ceph starts other threads (using clone() on Linux) while the lock is held, which would keep the lock held even after the thread that originally took it has close()d the file descriptor.

The attached strace output (2 runs) shows a failing and a working case, with added `clone()` syscalls and process birth/death.

From `man 2 flock`:

Locks created by flock() are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lock,

This would also explain why it worked before: `fcntl(F_SETLK)` locks are process-local, so this would not raise that error.
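The open-file-description semantics quoted from the man page can be demonstrated in a few lines (file path and helper name are illustrative): a `dup()`ed descriptor keeps the `flock()` lock alive after the original fd is closed, so a fresh `open()` + `flock()` on the same file fails with `EAGAIN`, even from the same process:

```c
#include <errno.h>
#include <fcntl.h>      /* open() */
#include <sys/file.h>   /* flock() */
#include <unistd.h>     /* dup(), close() */

/* Try to take an exclusive, non-blocking flock() on fd.
 * Returns 0 on success, or -1 with errno == EWOULDBLOCK if the lock
 * is held -- possibly by another fd in this very process, because
 * flock() locks belong to the open file description, not the process. */
static int try_flock(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB);
}
```

This matches the traces above: the lock only matters per open file description, so any surviving duplicate of the locked descriptor makes the next `flock()` on a freshly opened descriptor fail.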

The main thing I do not understand is how this made it into two Ceph releases (v14 and v15) without being noticed, as for me it quite reliably breaks both OSD creation and OSD process startup.

#5 Updated by Niklas Hambuechen 4 months ago

I suspect that Ceph starts other threads (using clone() on Linux) while the lock is held

Sorry, this should be threads/processes. And threads should not be relevant, because they share file descriptor state.

From the attached strace, the only `clone()` that creates a real process, not a thread, is at the beginning, even before the first `flock()`, so my suspicion above is probably not well-founded.

#6 Updated by Neha Ojha 4 months ago

  • Status changed from New to Closed

Please feel free to reopen if you find a real bug somewhere.

#7 Updated by Niklas Hambuechen 4 months ago

@Neha Ojha:

I think the bug remains as real as it gets; I did not retract that this is a bug. With my last comment I only stated that my suspicion about the underlying reason of the bug is probably wrong.

How can I reopen the ticket?

I'm not sure I have sufficient Redmine permissions for that.

#8 Updated by Neha Ojha 4 months ago

  • Status changed from Closed to New

#9 Updated by Igor Fedotov 4 months ago

I'm able to reproduce the issue with src/objectstore/store_test and pretty powerful HW:

-10> 2020-07-08T09:28:28.175+0000 7f1ac982e0c0 -1 bdev(0x5596ad2eea80 bluestore.test_temp_dir/block.db) _lock flock failed on bluestore.test_temp_dir/block.db 11
-9> 2020-07-08T09:28:28.175+0000 7f1ac982e0c0 -1 bdev(0x5596ad2eea80 bluestore.test_temp_dir/block.db) open failed to lock bluestore.test_temp_dir/block.db: (11) Resource temporarily unavailable

#10 Updated by Igor Fedotov 4 months ago

  • Pull request ID set to 34566

#11 Updated by Neha Ojha 3 months ago

  • Status changed from New to Fix Under Review

#12 Updated by Igor Fedotov about 2 months ago

  • Status changed from Fix Under Review to Resolved

#13 Updated by Igor Fedotov about 1 month ago

  • Backport set to octopus, nautilus

#14 Updated by Igor Fedotov about 1 month ago

  • Status changed from Resolved to Pending Backport

#15 Updated by Niklas Hambuechen about 1 month ago

I believe this bug was not fixed properly; the added commit "just retries 3 times".

This makes success more likely, but does not address the underlying race condition, so I don't think this bug should be auto-closed by that commit. (Retrying indefinitely seems like the more correct fix to me.)

See:

#16 Updated by Kefu Chai about 1 month ago

  • Status changed from Pending Backport to New

per the discussion in https://github.com/ceph/ceph/pull/34566, i am reopening this ticket.

#17 Updated by Kefu Chai about 1 month ago

  • Assignee set to Kefu Chai

#18 Updated by Kefu Chai about 1 month ago

per https://repo.or.cz/glibc.git/blob/HEAD:/sysdeps/posix/flock.c,

int
__flock (int fd, int operation)
{
  struct flock lbuf;

  switch (operation & ~LOCK_NB)
    {
    case LOCK_SH:
      lbuf.l_type = F_RDLCK;
      break;
    case LOCK_EX:
      lbuf.l_type = F_WRLCK;
      break;
    case LOCK_UN:
      lbuf.l_type = F_UNLCK;
      break;
    default:
      __set_errno (EINVAL);
      return -1;
    }

  lbuf.l_whence = SEEK_SET;
  lbuf.l_start = lbuf.l_len = 0L; /* Lock the whole file.  */

  return __fcntl (fd, (operation & LOCK_NB) ? F_SETLK : F_SETLKW, &lbuf);
}

the change of https://github.com/ceph/ceph/pull/26245/files#diff-af76abecbe3293ccd298b6deded4ca49L58-L63 in https://github.com/ceph/ceph/pull/26245 does not change anything.
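For contrast, classic `fcntl(F_SETLK)` locks are owned by the process, so a second lock request from the same process on a second fd simply succeeds (the kernel treats it as the owner re-locking), which is presumably why the pre-`flock()` code never hit `EAGAIN` within a single process. A sketch (not Ceph code):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take a classic (process-owned) POSIX write lock on the whole file.
 * Two fds in the same process never conflict with each other:
 * the second request is treated as the owner re-locking. */
static int lock_posix(int fd)
{
    struct flock l;
    memset(&l, 0, sizeof(l));   /* l_start = l_len = 0: whole file */
    l.l_type   = F_WRLCK;
    l.l_whence = SEEK_SET;
    return fcntl(fd, F_SETLK, &l);
}
```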

#19 Updated by Kefu Chai about 1 month ago

i think we also need to include https://github.com/ceph/ceph/pull/37153 so that the right errno is returned.

#20 Updated by Kefu Chai about 1 month ago

  • Status changed from New to Fix Under Review

#21 Updated by Igor Fedotov 21 days ago

  • Status changed from Fix Under Review to Pending Backport

#22 Updated by Nathan Cutler 21 days ago

  • Copied to Backport #47707: nautilus: Potential race condition regression around new OSD flock()s added

#23 Updated by Nathan Cutler 21 days ago

  • Copied to Backport #47708: octopus: Potential race condition regression around new OSD flock()s added
