Bug #21401

closed

rgw: Missing error handling when gen_rand_alphanumeric is failing

Added by Jens Harbott over 6 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Target version:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The function gen_rand_alphanumeric() reads random bytes from /dev/urandom and converts them into a string. The read may fail (e.g. with "Too many open files"), in which case a negative error code is returned.

int gen_rand_alphanumeric(CephContext *cct, char *dest, int size) /* size should be the required string size + 1 */
{
  int ret = get_random_bytes(dest, size);
  if (ret < 0) {
    lderr(cct) << "cannot get random bytes: " << cpp_strerror(-ret) << dendl;
    return ret;                  
  }
...
}
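For illustration, the failure mode can be sketched with a small POSIX reader that surfaces errors as negative errno values. This is a minimal sketch, not Ceph's get_random_bytes(): the name read_random_bytes and the path parameter are assumptions made so the failure case can be exercised.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not Ceph's get_random_bytes(): reads len bytes
// from path into buf. Returns 0 on success, a negative errno on failure.
int read_random_bytes(const char *path, char *buf, size_t len)
{
  assert(buf != nullptr || len == 0);
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return -errno;  // e.g. -EMFILE ("Too many open files") or -ENOENT
  }
  size_t done = 0;
  while (done < len) {
    ssize_t n = read(fd, buf + done, len - done);
    if (n < 0) {
      if (errno == EINTR) {
        continue;  // interrupted by a signal, retry
      }
      int err = errno;
      close(fd);
      return -err;
    }
    if (n == 0) {  // unexpected end of file
      close(fd);
      return -EIO;
    }
    done += static_cast<size_t>(n);
  }
  close(fd);
  return 0;
}
```

On failure the buffer is left untouched, so a caller that ignores the return value ends up working with whatever bytes were already in it.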

The consuming function append_rand_alpha(), however, does not check the return code. It uses the uninitialized char buf and appends it to the result string.

static inline void append_rand_alpha(CephContext *cct, const string& src, string& dest, int len)
{
  dest = src;
  char buf[len + 1];
  gen_rand_alphanumeric(cct, buf, len);
  dest.append("_");
  dest.append(buf);
}
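One possible shape for a checked variant is sketched below. It is not the fix that was merged. The names gen_rand_alphanumeric_stub and append_rand_alpha_checked, the fail flag, and the int return type are all assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical stand-in for gen_rand_alphanumeric(): writes size - 1
// characters plus a terminating NUL into dest, or returns a negative
// errno without touching dest when fail is set.
int gen_rand_alphanumeric_stub(char *dest, int size, bool fail)
{
  if (fail) {
    return -EMFILE;  // simulate "Too many open files"
  }
  for (int i = 0; i < size - 1; ++i) {
    dest[i] = static_cast<char>('a' + (i % 26));  // placeholder, not random
  }
  dest[size - 1] = '\0';
  return 0;
}

// Checked variant: dest is modified only if generation succeeded.
int append_rand_alpha_checked(const std::string& src, std::string& dest,
                              int len, bool fail)
{
  assert(len >= 0);
  std::vector<char> buf(len + 1, '\0');  // zero-initialized buffer
  int ret = gen_rand_alphanumeric_stub(buf.data(), len + 1, fail);
  if (ret < 0) {
    return ret;  // propagate the error instead of appending garbage
  }
  dest = src;
  dest.append("_");
  dest.append(buf.data());
  return 0;
}
```

Returning the error lets the caller fail the request rather than store an object whose tag or prefix was built from an unfilled buffer.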

As a result, when this happens while an object is being copied, its tag and prefix fields contain garbage instead of the expected 24-character string. In particular, the prefix field seems to always contain ".P_" in that case, leading to collisions between tail object names and, in the long run, to data loss, as a second object's tail objects overwrite those of the first object.
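The collision can be pictured with a toy key-value store. This is only a sketch: ToyStore and its "<prefix><index>" key layout are hypothetical and do not reflect RGW's real tail object naming.

```cpp
#include <map>
#include <string>

// Hypothetical toy store: tail objects keyed by "<prefix><index>".
// The key layout is an assumption for illustration only.
struct ToyStore {
  std::map<std::string, std::string> tails;

  // Stores a tail; an existing entry under the same key is silently replaced.
  void put_tail(const std::string& prefix, int index, const std::string& data) {
    tails[prefix + std::to_string(index)] = data;
  }

  // Returns the stored tail, or an empty string if there is none.
  std::string get_tail(const std::string& prefix, int index) const {
    auto it = tails.find(prefix + std::to_string(index));
    return it == tails.end() ? std::string() : it->second;
  }
};
```

If two uploads both end up with the prefix ".P_", their tails map to the same key, and reading the first object returns the second object's data.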

Originally found in v0.94.10, but the code still looks the same in master.


Related issues 3 (1 open, 2 closed)

Related to rgw - Bug #22006: RGWCrashError: RGW will crash when generating random bucket name and object name during loadgen process (Resolved, 11/02/2017)
Related to rgw - Bug #22225: rgw:socket leak in s3 multi part upload (In Progress, 11/22/2017)
Copied to rgw - Backport #21851: luminous: rgw: Missing error handling when gen_rand_alphanumeric is failing (Rejected, Casey Bodley)

Actions #1

Updated by Jens Harbott over 6 years ago

In order to reproduce, just make /dev/urandom inaccessible for some time, e.g. run (on the rgw node):

mv /dev/urandom /dev/blah;sleep 60; mv /dev/blah /dev/urandom

During these 60 seconds, run the following on a client node:

$ dd if=/dev/urandom of=test01 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.044453 s, 12.0 MB/s
$ dd if=/dev/urandom of=test02 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.0425381 s, 12.5 MB/s
$ s3cmd put test01  s3://test1/atest01
test01 -> s3://test1/atest01  [1 of 1]
 532480 of 532480   100% in    0s     3.57 MB/s  done
$ s3cmd put test02  s3://test1/atest02
test02 -> s3://test1/atest02  [1 of 1]
 532480 of 532480   100% in    0s     3.91 MB/s  done

Now the first object is corrupted, as its tail object has been overwritten by the second PUT.

$ s3cmd get s3://test1/atest01 atest01
s3://test1/atest01 -> atest01  [1 of 1]
 532480 of 532480   100% in    0s    24.54 MB/s  done
WARNING: MD5 signatures do not match: computed=949048077122cc83f883a72b690db3cb, received="c9212fd208c64173a7b4b1e057f5b752" 

Truncated objects also result when one of these objects is removed; see http://tracker.ceph.com/issues/20107 and http://tracker.ceph.com/issues/20166 for related issues.

Actions #2

Updated by Casey Bodley over 6 years ago

  • Status changed from New to 12
  • Assignee set to Casey Bodley
Actions #3

Updated by Casey Bodley over 6 years ago

  • Status changed from 12 to Fix Under Review
Actions #4

Updated by Yehuda Sadeh over 6 years ago

  • Priority changed from Normal to High
Actions #5

Updated by Abhishek Lekshmanan over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to Luminous

Jewel backports may need a separate fix.

Actions #6

Updated by Abhishek Lekshmanan over 6 years ago

  • Copied to Backport #21851: luminous: rgw: Missing error handling when gen_rand_alphanumeric is failing added
Actions #7

Updated by Nathan Cutler over 6 years ago

  • Backport changed from Luminous to luminous
Actions #8

Updated by Nathan Cutler over 6 years ago

  • Related to Bug #22006: RGWCrashError: RGW will crash when generating random bucket name and object name during loadgen process added
Actions #9

Updated by Nathan Cutler over 6 years ago

Note that the luminous backport is non-trivial because random number generation has been reworked in master.

Actions #10

Updated by Casey Bodley almost 6 years ago

  • Related to Bug #22225: rgw:socket leak in s3 multi part upload added
Actions #11

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
