Bug #49244

WARNING: blocking librados call with http status=404

Added by Ist Gab about 3 years ago. Updated about 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I'm seeing a huge number of HTTP status 404 responses in my multisite cluster, along with "1 op->ERRORHANDLER: err_no=-2 new_err_no=-2" messages. I've attached a debug log of one such request.
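
For reference, err_no=-2 is ENOENT, so entries like this usually just record a GET/HEAD for a key that does not exist in the local zone. A minimal way to trigger one such 404 against a gateway, assuming an awscli profile and placeholder endpoint/bucket/key names:

  # HEAD a key that does not exist; the gateway returns 404,
  # which shows up in the RGW debug log as err_no=-2 (ENOENT)
  aws --endpoint-url http://<rgw-endpoint>:8080 \
      s3api head-object --bucket <bucket> --key <missing-key>
  # -> An error occurred (404) when calling the HeadObject operation: Not Found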


Files

debug.txt (9.64 KB) - Ist Gab, 02/11/2021 08:41 AM
#1

Updated by Casey Bodley about 3 years ago

  • Status changed from New to Need More Info

All I see in the log is a request for an object that doesn't exist. The request isn't coming from multisite. Do you have reason to believe that object should exist?

The client has this user-agent:
aws-sdk-dotnet-45/3.5.3.2 aws-sdk-dotnet-core/3.5.1.24 .NET_Runtime/4.0 .NET_Framework/4.0 OS/Microsoft_Windows_NT_6.3.9600.0 ClientSync

#2

Updated by Ist Gab about 3 years ago

Casey Bodley wrote:

All I see in the log is a request for an object that doesn't exist. The request isn't coming from multisite. Do you have reason to believe that object should exist?

The client has this user-agent:
aws-sdk-dotnet-45/3.5.3.2 aws-sdk-dotnet-core/3.5.1.24 .NET_Runtime/4.0 .NET_Framework/4.0 OS/Microsoft_Windows_NT_6.3.9600.0 ClientSync

Hello Casey,

Thank you for your answer.
Yes, we are using a bucket sync policy, so the object should be there, but even after a couple of days it hasn't synchronized.

Could you please have a look at our configuration? Maybe you'll see something in this thread: https://tracker.ceph.com/issues/49075

I've followed your suggestion from this mailing list thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033629.html

I initiated a data sync from all sites and restarted the gateways 4 days ago. Now I have crashed RGWs and large omap objects, but the sync is still incredibly slow. I put a file in one DC and it hasn't even arrived at another site in 4 days :(
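
For reference, the sync state on each zone can be checked with the standard radosgw-admin commands; a short sketch, with zone and bucket names as placeholders:

  # overall multisite sync status for the local zone
  radosgw-admin sync status

  # per-shard data sync progress from one source zone
  radosgw-admin data sync status --source-zone=<source-zone>

  # sync state of a single bucket covered by the bucket sync policy
  radosgw-admin bucket sync status --bucket=<bucket>

  # recent sync errors recorded by the gateways
  radosgw-admin sync error list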

This is just one example crash report:

{
    "backtrace": [
        "(()+0x12dd0) [0x7f7070f82dd0]",
        "(RGWCoroutine::set_sleeping(bool)+0x10) [0x7f707beba4d0]",
        "(RGWOmapAppend::flush_pending()+0x4a) [0x7f707bec03da]",
        "(RGWOmapAppend::finish()+0x14) [0x7f707bec04d4]",
        "(RGWDataSyncShardCR::stop_spawned_services()+0x2f) [0x7f707bdff9df]",
        "(RGWDataSyncShardCR::incremental_sync()+0x771) [0x7f707be175d1]",
        "(RGWDataSyncShardCR::operate()+0x9d) [0x7f707be1acdd]",
        "(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x67) [0x7f707bebaac7]",
        "(RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x271) [0x7f707bebb8f1]",
        "(RGWCoroutinesManager::run(RGWCoroutine*)+0x8b) [0x7f707bebcb5b]",
        "(RGWRemoteDataLog::run_sync(int)+0x1ad) [0x7f707bdf35bd]",
        "(RGWDataSyncProcessorThread::process()+0x46) [0x7f707bf85226]",
        "(RGWRadosThread::Worker::entry()+0x176) [0x7f707bf4db86]",
        "(()+0x82de) [0x7f7070f782de]",
        "(clone()+0x43) [0x7f706f68ee83]" 
    ],
    "ceph_version": "15.2.7",
    "crash_id": "2021-02-15T00:18:16.271987Z_df86bab1-4eb3-4466-bd42-b15cdb1c1261",
    "entity_name": "client.rgw.hk-cephmon-2s02.rgw0",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8 (Core)",
    "os_version_id": "8",
    "process_name": "radosgw",
    "stack_sig": "8f62d50897d7b1b190387523f6d687e60dbef4e6746b430310d721c5a558f3b5",
    "timestamp": "2021-02-15T00:18:16.271987Z",
    "utsname_hostname": "hk-cephmon-2s02",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-193.28.1.el8_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Oct 22 00:20:22 UTC 2020" 
}
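
For reference, these crash reports come from the cluster's crash module; they can be listed and dumped like this (the id below is the one from the report above):

  # list all crashes known to the cluster
  ceph crash ls

  # full metadata for a single crash, same JSON as above
  ceph crash info 2021-02-15T00:18:16.271987Z_df86bab1-4eb3-4466-bd42-b15cdb1c1261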

The large omap objects are the datalogs, but I don't want to trim them because I guess they are holding the pending changes.
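
A quick way to confirm that the large omap objects really are the datalog shards, assuming the default log pool layout (<zone>.rgw.log, objects named data_log.<shard>); pool, zone and shard numbers below are placeholders:

  # datalog shard status on the local zone
  radosgw-admin datalog status

  # list the datalog objects and count the omap keys on one shard
  rados -p <zone>.rgw.log ls | grep ^data_log
  rados -p <zone>.rgw.log listomapkeys data_log.<shard> | wc -l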

Here are also some perf dumps:

In HKG:
"data-sync-from-ash": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 425395,
  "fetch_errors": 1,
  "poll_latency": {
    "avgcount": 890,
    "sum": 47481.671537512,
    "avgtime": 53.350192738
  },
  "poll_errors": 0
},
"data-sync-from-sin": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 484757,
  "fetch_errors": 0,
  "poll_latency": {
    "avgcount": 21686,
    "sum": 135649.750753768,
    "avgtime": 6.255176185
  },
  "poll_errors": 3

In ASH:
"data-sync-from-hkg": {
  "fetch_bytes": {
    "avgcount": 7904,
    "sum": 497898243
  },
  "fetch_not_modified": 7383973,
  "fetch_errors": 654,
  "poll_latency": {
    "avgcount": 6586,
    "sum": 2568055.690045521,
    "avgtime": 389.926463717
  },
  "poll_errors": 3
},
"data-sync-from-sin": {
  "fetch_bytes": {
    "avgcount": 13362,
    "sum": 800114616
  },
  "fetch_not_modified": 7326406,
  "fetch_errors": 558,
  "poll_latency": {
    "avgcount": 10137,
    "sum": 3145053.032619919,
    "avgtime": 310.254812333
  },
  "poll_errors": 5
},

In SGP:
"data-sync-from-ash": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 2057839,
  "fetch_errors": 1,
  "poll_latency": {
    "avgcount": 8874,
    "sum": 682176.718044618,
    "avgtime": 76.873644133
  },
  "poll_errors": 0
},
"data-sync-from-hkg": {
  "fetch_bytes": {
    "avgcount": 114,
    "sum": 1097512
  },
  "fetch_not_modified": 1939198,
  "fetch_errors": 823,
  "poll_latency": {
    "avgcount": 2123,
    "sum": 60947.760976996,
    "avgtime": 28.708318877
  },
  "poll_errors": 1

It's also interesting that in the SGP data sync status I see a lot of pending bucket syncs for a bucket which shouldn't be synchronized. That's weird: it doesn't actually synchronize the data, so I guess it works as designed, but it is still occupying the sync shard.

I'd really appreciate your help.
