Bug #49244

WARNING: blocking librados call with http status=404

Added by Ist Gab about 3 years ago. Updated about 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I'm seeing a huge number of HTTP status 404 responses in my multisite cluster, along with "1 op->ERRORHANDLER: err_no=-2 new_err_no=-2" messages. I've attached a debug log of one such request.
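
For reference, err_no=-2 is ENOENT, so entries like this usually just record a GET/HEAD for a key that does not exist in the local zone. A minimal way to trigger one such 404 against a gateway, assuming an awscli profile and placeholder endpoint/bucket/key names:

  # HEAD a key that does not exist; the gateway returns 404,
  # which shows up in the RGW debug log as err_no=-2 (ENOENT)
  aws --endpoint-url http://<rgw-endpoint>:8080 \
      s3api head-object --bucket <bucket> --key <missing-key>
  # -> An error occurred (404) when calling the HeadObject operation: Not Found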


Files

debug.txt (9.64 KB) - Ist Gab, 02/11/2021 08:41 AM
#1

Updated by Casey Bodley about 3 years ago

  • Status changed from New to Need More Info

All I see in the log is a request for an object that doesn't exist. The request isn't coming from multisite. Do you have reason to believe that object should exist?

The client has this user-agent:
aws-sdk-dotnet-45/3.5.3.2 aws-sdk-dotnet-core/3.5.1.24 .NET_Runtime/4.0 .NET_Framework/4.0 OS/Microsoft_Windows_NT_6.3.9600.0 ClientSync

#2

Updated by Ist Gab about 3 years ago

Casey Bodley wrote:

All I see in the log is a request for an object that doesn't exist. The request isn't coming from multisite. Do you have reason to believe that object should exist?

The client has this user-agent:
aws-sdk-dotnet-45/3.5.3.2 aws-sdk-dotnet-core/3.5.1.24 .NET_Runtime/4.0 .NET_Framework/4.0 OS/Microsoft_Windows_NT_6.3.9600.0 ClientSync

Hello Casey,

Thank you for your answer.
Yes, we are using a bucket sync policy, so the object should be there, but even after a couple of days it hasn't synchronized.

Could you please have a look at our configuration? Maybe you'll see something in this thread: https://tracker.ceph.com/issues/49075

I've followed your suggestion from this mailing list thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033629.html

I initiated a data sync from all sites and restarted the gateways 4 days ago. Now I have crashed RGWs and large omap objects, but the sync is still incredibly slow. I put a file in one DC and it hasn't even arrived at another site in 4 days :(
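
For reference, the sync state on each zone can be checked with the standard radosgw-admin commands; a short sketch, with zone and bucket names as placeholders:

  # overall multisite sync status for the local zone
  radosgw-admin sync status

  # per-shard data sync progress from one source zone
  radosgw-admin data sync status --source-zone=<source-zone>

  # sync state of a single bucket covered by the bucket sync policy
  radosgw-admin bucket sync status --bucket=<bucket>

  # recent sync errors recorded by the gateways
  radosgw-admin sync error list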

This is just one example crash report:

{
    "backtrace": [
        "(()+0x12dd0) [0x7f7070f82dd0]",
        "(RGWCoroutine::set_sleeping(bool)+0x10) [0x7f707beba4d0]",
        "(RGWOmapAppend::flush_pending()+0x4a) [0x7f707bec03da]",
        "(RGWOmapAppend::finish()+0x14) [0x7f707bec04d4]",
        "(RGWDataSyncShardCR::stop_spawned_services()+0x2f) [0x7f707bdff9df]",
        "(RGWDataSyncShardCR::incremental_sync()+0x771) [0x7f707be175d1]",
        "(RGWDataSyncShardCR::operate()+0x9d) [0x7f707be1acdd]",
        "(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x67) [0x7f707bebaac7]",
        "(RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x271) [0x7f707bebb8f1]",
        "(RGWCoroutinesManager::run(RGWCoroutine*)+0x8b) [0x7f707bebcb5b]",
        "(RGWRemoteDataLog::run_sync(int)+0x1ad) [0x7f707bdf35bd]",
        "(RGWDataSyncProcessorThread::process()+0x46) [0x7f707bf85226]",
        "(RGWRadosThread::Worker::entry()+0x176) [0x7f707bf4db86]",
        "(()+0x82de) [0x7f7070f782de]",
        "(clone()+0x43) [0x7f706f68ee83]" 
    ],
    "ceph_version": "15.2.7",
    "crash_id": "2021-02-15T00:18:16.271987Z_df86bab1-4eb3-4466-bd42-b15cdb1c1261",
    "entity_name": "client.rgw.hk-cephmon-2s02.rgw0",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8 (Core)",
    "os_version_id": "8",
    "process_name": "radosgw",
    "stack_sig": "8f62d50897d7b1b190387523f6d687e60dbef4e6746b430310d721c5a558f3b5",
    "timestamp": "2021-02-15T00:18:16.271987Z",
    "utsname_hostname": "hk-cephmon-2s02",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-193.28.1.el8_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Oct 22 00:20:22 UTC 2020" 
}
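
For reference, these crash reports come from the cluster's crash module; they can be listed and dumped like this (the id below is the one from the report above):

  # list all crashes known to the cluster
  ceph crash ls

  # full metadata for a single crash, same JSON as above
  ceph crash info 2021-02-15T00:18:16.271987Z_df86bab1-4eb3-4466-bd42-b15cdb1c1261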

The large omap objects are the datalogs, but I don't want to trim them because I guess they are holding the pending changes.
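
A quick way to confirm that the large omap objects really are the datalog shards, assuming the default log pool layout (<zone>.rgw.log, objects named data_log.<shard>); pool, zone and shard numbers below are placeholders:

  # datalog shard status on the local zone
  radosgw-admin datalog status

  # list the datalog objects and count the omap keys on one shard
  rados -p <zone>.rgw.log ls | grep ^data_log
  rados -p <zone>.rgw.log listomapkeys data_log.<shard> | wc -l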

Here are also some perf dumps:

In HKG:
"data-sync-from-ash": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 425395,
  "fetch_errors": 1,
  "poll_latency": {
    "avgcount": 890,
    "sum": 47481.671537512,
    "avgtime": 53.350192738
  },
  "poll_errors": 0
},
"data-sync-from-sin": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 484757,
  "fetch_errors": 0,
  "poll_latency": {
    "avgcount": 21686,
    "sum": 135649.750753768,
    "avgtime": 6.255176185
  },
  "poll_errors": 3

In ASH:
"data-sync-from-hkg": {
  "fetch_bytes": {
    "avgcount": 7904,
    "sum": 497898243
  },
  "fetch_not_modified": 7383973,
  "fetch_errors": 654,
  "poll_latency": {
    "avgcount": 6586,
    "sum": 2568055.690045521,
    "avgtime": 389.926463717
  },
  "poll_errors": 3
},
"data-sync-from-sin": {
  "fetch_bytes": {
    "avgcount": 13362,
    "sum": 800114616
  },
  "fetch_not_modified": 7326406,
  "fetch_errors": 558,
  "poll_latency": {
    "avgcount": 10137,
    "sum": 3145053.032619919,
    "avgtime": 310.254812333
  },
  "poll_errors": 5
},

In SGP:
"data-sync-from-ash": {
  "fetch_bytes": {
    "avgcount": 0,
    "sum": 0
  },
  "fetch_not_modified": 2057839,
  "fetch_errors": 1,
  "poll_latency": {
    "avgcount": 8874,
    "sum": 682176.718044618,
    "avgtime": 76.873644133
  },
  "poll_errors": 0
},
"data-sync-from-hkg": {
  "fetch_bytes": {
    "avgcount": 114,
    "sum": 1097512
  },
  "fetch_not_modified": 1939198,
  "fetch_errors": 823,
  "poll_latency": {
    "avgcount": 2123,
    "sum": 60947.760976996,
    "avgtime": 28.708318877
  },
  "poll_errors": 1

It's also interesting that in the SGP data sync status I see a lot of pending bucket syncs for a bucket which shouldn't be synchronized. That's weird: it doesn't actually synchronize the data, so I guess it works as designed, but it is still occupying the sync shard.

I'd really appreciate your help.
