Bug #21401

rgw: Missing error handling when gen_rand_alphanumeric is failing

Added by Jens Harbott almost 2 years ago. Updated over 1 year ago.

Status: Pending Backport
Priority: High
Assignee:
Target version:
Start date: 09/15/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

The function gen_rand_alphanumeric() reads random bytes from /dev/urandom and converts them into a string. The read operation may fail (e.g. with "Too many open files"), in which case a negative error code is returned.

int gen_rand_alphanumeric(CephContext *cct, char *dest, int size) /* size should be the required string size + 1 */
{
  int ret = get_random_bytes(dest, size);
  if (ret < 0) {
    lderr(cct) << "cannot get random bytes: " << cpp_strerror(-ret) << dendl;
    return ret;                  
  }
...
}
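To illustrate how a "Too many open files" failure can surface as a negative error code, here is a minimal sketch. It assumes a Linux/POSIX system and assumes the failure occurs when the device is opened; the helper name open_with_exhausted_fds() is hypothetical and is not part of Ceph. It temporarily sets the soft file descriptor limit to 0 so that any open() fails with EMFILE.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

// Hypothetical helper (not Ceph code): tries to open `path` while the soft
// RLIMIT_NOFILE limit is temporarily set to 0.
// Returns 0 if open() unexpectedly succeeded, otherwise a negative errno value.
int open_with_exhausted_fds(const char *path)
{
  struct rlimit saved;
  if (getrlimit(RLIMIT_NOFILE, &saved) < 0)
    return -errno;

  struct rlimit none = saved;
  none.rlim_cur = 0;  // no new descriptors may be allocated
  if (setrlimit(RLIMIT_NOFILE, &none) < 0)
    return -errno;

  int fd = open(path, O_RDONLY);
  int ret = (fd < 0) ? -errno : 0;
  if (fd >= 0)
    close(fd);

  setrlimit(RLIMIT_NOFILE, &saved);  // restore the original soft limit
  return ret;
}
```

On Linux, calling open_with_exhausted_fds("/dev/urandom") is expected to return -EMFILE, which is the kind of negative value a caller has to check for.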

The consuming function append_rand_alpha(), however, does not check the return code. It uses the uninitialized char buf and appends it to the result string.

static inline void append_rand_alpha(CephContext *cct, const string& src, string& dest, int len)
{
  dest = src;
  char buf[len + 1];
  gen_rand_alphanumeric(cct, buf, len);
  dest.append("_");
  dest.append(buf);
}
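One possible direction, shown here only as a standalone sketch and not as the actual Ceph fix, is to let the helper return the error and leave dest untouched on failure. All names below (read_random_bytes, gen_rand_alnum, append_rand_alpha_checked) and the path parameter are hypothetical stand-ins so that the example compiles without CephContext. Unlike the quoted call, the sketch passes len + 1 as the size, following the comment on gen_rand_alphanumeric().

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for get_random_bytes(): fills buf from `path`.
// Returns 0 on success or a negative errno value on failure.
int read_random_bytes(const char *path, char *buf, size_t len)
{
  FILE *f = std::fopen(path, "rb");
  if (!f)
    return errno ? -errno : -EIO;
  size_t got = std::fread(buf, 1, len, f);
  std::fclose(f);
  return got == len ? 0 : -EIO;
}

// Hypothetical stand-in for gen_rand_alphanumeric().
// `size` is the required string size + 1 (room for the terminating NUL).
int gen_rand_alnum(const char *path, char *dest, int size)
{
  static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
  if (size <= 0)
    return -EINVAL;
  int ret = read_random_bytes(path, dest, size);
  if (ret < 0)
    return ret;
  for (int i = 0; i < size - 1; ++i)
    dest[i] = table[(unsigned char)dest[i] % (sizeof(table) - 1)];
  dest[size - 1] = '\0';
  return 0;
}

// Checked variant: propagates the error and modifies dest only on success.
int append_rand_alpha_checked(const char *path, const std::string& src,
                              std::string& dest, int len)
{
  std::vector<char> buf(len + 1);
  int ret = gen_rand_alnum(path, buf.data(), len + 1);
  if (ret < 0)
    return ret;
  dest = src + "_" + buf.data();
  return 0;
}
```

With this shape, a caller would have to check the return value and abort the operation instead of continuing with a garbage or empty suffix.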

As a result, when this happens while an object is being copied, its tag and prefix fields contain garbage instead of the expected 24-character string. In particular, the prefix field now seems to always contain ".P_", which leads to collisions between tail object names and, in the long run, to data loss, because a second object's tail objects overwrite those of the first object.
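The collision can be pictured with a toy model. This is only an illustration: the naming scheme "<prefix><index>" and the functions toy_tail_name() and toy_put_tail() are hypothetical and do not reflect rgw's real tail object layout. The point is that once the random part of the prefix is missing, two different objects map their tails to the same key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy naming scheme (hypothetical): tail data is stored under "<prefix><index>".
std::string toy_tail_name(const std::string& prefix, int index)
{
  return prefix + std::to_string(index);
}

// Writes one tail chunk into the toy store.
// Returns true if an existing entry was overwritten.
bool toy_put_tail(std::map<std::string, std::string>& store,
                  const std::string& prefix, int index,
                  const std::string& data)
{
  const std::string key = toy_tail_name(prefix, index);
  const bool overwritten = store.count(key) > 0;
  store[key] = data;
  return overwritten;
}
```

In this model, writing two objects that both use the degenerate prefix ".P_" makes the second toy_put_tail() call return true, while distinct random prefixes keep both tails.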

Originally found in v0.94.10, but the code still looks the same in master.


Related issues

Related to rgw - Bug #22006: RGWCrashError: RGW will crash when generating random bucket name and object name during loadgen process Resolved 11/02/2017
Related to rgw - Bug #22225: rgw:socket leak in s3 multi part upload In Progress 11/22/2017
Copied to rgw - Backport #21851: luminous: rgw: Missing error handling when gen_rand_alphanumeric is failing Need More Info

History

#1 Updated by Jens Harbott almost 2 years ago

To reproduce, make /dev/urandom inaccessible for some time, e.g. by running the following on the rgw node:

mv /dev/urandom /dev/blah;sleep 60; mv /dev/blah /dev/urandom

During these 60 seconds, run the following on a client node:

$ dd if=/dev/urandom of=test01 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.044453 s, 12.0 MB/s
$ dd if=/dev/urandom of=test02 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.0425381 s, 12.5 MB/s
$ s3cmd put test01  s3://test1/atest01
test01 -> s3://test1/atest01  [1 of 1]
 532480 of 532480   100% in    0s     3.57 MB/s  done
$ s3cmd put test02  s3://test1/atest02
test02 -> s3://test1/atest02  [1 of 1]
 532480 of 532480   100% in    0s     3.91 MB/s  done

Now the first object is corrupted, as its tail object has been overwritten by the second PUT.

$ s3cmd get s3://test1/atest01 atest01
s3://test1/atest01 -> atest01  [1 of 1]
 532480 of 532480   100% in    0s    24.54 MB/s  done
WARNING: MD5 signatures do not match: computed=949048077122cc83f883a72b690db3cb, received="c9212fd208c64173a7b4b1e057f5b752" 

Objects will also end up truncated when one of these objects is removed. See http://tracker.ceph.com/issues/20107 and http://tracker.ceph.com/issues/20166 for related issues.

#2 Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Verified
  • Assignee set to Casey Bodley

#3 Updated by Casey Bodley almost 2 years ago

  • Status changed from Verified to Need Review

#4 Updated by Yehuda Sadeh almost 2 years ago

  • Priority changed from Normal to High

#5 Updated by Abhishek Lekshmanan almost 2 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to Luminous

Jewel backports may need a separate fix.

#6 Updated by Abhishek Lekshmanan almost 2 years ago

  • Copied to Backport #21851: luminous: rgw: Missing error handling when gen_rand_alphanumeric is failing added

#7 Updated by Nathan Cutler almost 2 years ago

  • Backport changed from Luminous to luminous

#8 Updated by Nathan Cutler almost 2 years ago

  • Related to Bug #22006: RGWCrashError: RGW will crash when generating random bucket name and object name during loadgen process added

#9 Updated by Nathan Cutler over 1 year ago

Note that the luminous backport is non-trivial because random number generation has been reworked in master.

#10 Updated by Casey Bodley over 1 year ago

  • Related to Bug #22225: rgw:socket leak in s3 multi part upload added
