Project

General

Profile

Actions

Bug #64852

open

MDS hangs on "joining batch getattr" when client does statx

Added by Niklas Hambuechen about 2 months ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Every couple days, our CephFS hangs on one specific directory:

# ls -1U /var/mycluster/files/logs/web-server
mycluster-node-6
mycluster-node-5

In the above case, it hands listing the directory `mycluster-node-4`.

The directory in question contains 3 log files. To each one, a single process on the corresponding machine does append-only log writing.
A backup process on node-6 reads all of them from time to time, front to current-end.

This seems to cause some MDS hangs with ops "joining batch getattr", see further down.

This is very annoying, because it hangs the web-servers' operations, thus causing downtimes.

  • Kernel mounts, Linux 6.1.51 x86_64
  • roughly 3-machine setup (named "mycluster-node-4/5/6")

I suspect this started happening after theu pgrade from Ceph 16.2.7 to 18.2.1.

`ceph status`:

  cluster:
    id:     d9000ec0-93c2-479f-bd5d-94ae9673e347
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 clients failing to advance oldest client/flush tid
            1 MDSs report slow requests

`ceph health detail`:

HEALTH_WARN 1 clients failing to respond to capability release; 1 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.mycluster-node-4(mds.0): Client mycluster-node-6 failing to respond to capability release client_id: 2834268
[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid
    mds.mycluster-node-4(mds.0): Client mycluster-node-4 failing to advance its oldest client/flush tid.  client_id: 2824133
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.mycluster-node-4(mds.0): 20 slow requests are blocked > 30 secs

`ceph daemon mds.mycluster-node-4 dump_ops_in_flight`:

{
    "ops": [
        {
            "description": "client_request(client.2824133:9919861 getattr AsLsFs #0x100000005eb 2024-03-12T03:44:31.681505+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:44:31.682342+0000",
            "age": 771.75823718300001,
            "duration": 771.75826522499995,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9919861
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:44:31.682342+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:44:31.682342+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:44:31.682342+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:31.682346+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:31.682350+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:44:31.682364+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2834268:1816805 getattr AsLsFs #0x100000005eb 2024-03-12T03:44:46.716862+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:44:46.718273+0000",
            "age": 756.72230640800001,
            "duration": 756.722368695,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2834268
                    },
                    "tid": 1816805
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:44:46.718273+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:44:46.718273+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:44:46.718273+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:46.718458+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:46.718469+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:44:46.718487+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920012 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:51:51.085328+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:51:51.086754+0000",
            "age": 332.35382480800001,
            "duration": 332.35391520799999,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920012
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:51:51.086754+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:51:51.086755+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:51:51.086754+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:51:51.086763+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:51:51.086855+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:51:51.086882+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920007 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:45:49.074642+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:45:49.075714+0000",
            "age": 694.36486495400004,
            "duration": 694.36498380700004,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920007
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:45:49.075714+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:45:49.075714+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:45:49.075714+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:45:49.075720+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:45:49.075862+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:45:49.075900+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2834268:1816883 getattr AsLsFs #0x100000005eb 2024-03-12T03:52:40.459299+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:52:40.459690+0000",
            "age": 282.98088915099999,
            "duration": 282.98103522500003,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2834268
                    },
                    "tid": 1816883
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:52:40.459690+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:52:40.459691+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:52:40.459690+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:52:40.459694+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:52:40.459704+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:52:40.459722+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2834268:1816884 getattr AsLsFs #0x100000005eb 2024-03-12T03:52:59.763439+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:52:59.763937+0000",
            "age": 263.67664239300001,
            "duration": 263.67681670000002,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2834268
                    },
                    "tid": 1816884
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:52:59.763937+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:52:59.763937+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:52:59.763937+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:52:59.763945+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:52:59.763969+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:52:59.764013+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920008 getattr AsLsFs #0x100000005eb 2024-03-12T03:47:14.654799+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:47:14.655225+0000",
            "age": 608.78535396999996,
            "duration": 608.785554757,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920008
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:47:14.655225+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:47:14.655226+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:47:14.655225+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:47:14.655235+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:47:14.655251+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:47:14.655295+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920009 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:47:19.602808+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:47:19.603301+0000",
            "age": 603.83727808900005,
            "duration": 603.83750533499995,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920009
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:47:19.603301+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:47:19.603301+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:47:19.603301+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:47:19.603305+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:47:19.603370+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:47:19.603390+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920013 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:53:21.592507+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:53:21.593509+0000",
            "age": 241.84706984600001,
            "duration": 241.84732366200001,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920013
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:53:21.593509+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:53:21.593510+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:53:21.593509+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:53:21.593516+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:53:21.593580+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:53:21.593612+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920014 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:54:52.078687+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:54:52.079429+0000",
            "age": 151.36114999500001,
            "duration": 151.36143049099999,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920014
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:54:52.079429+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:54:52.079429+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:54:52.079429+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:54:52.079435+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:54:52.079508+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:54:52.079537+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920015 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:56:22.600873+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:56:22.601181+0000",
            "age": 60.839397910999999,
            "duration": 60.839725416,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920015
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:56:22.601181+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:56:22.601181+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:56:22.601181+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:56:22.601189+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:56:22.601264+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:56:22.601294+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:8720628 getattr Fs #0x100000005eb 2024-03-12T02:47:12.232891+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T02:47:12.234710+0000",
            "age": 4211.2058691359998,
            "duration": 4211.2062251939997,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 8720628
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T02:47:12.234710+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T02:47:12.234711+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T02:47:12.234710+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T02:47:12.234721+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T02:47:12.234744+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T02:47:12.234800+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920011 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:50:20.601151+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:50:20.602598+0000",
            "age": 422.837981232,
            "duration": 422.83836455099998,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920011
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:50:20.602598+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:50:20.602598+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:50:20.602598+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:50:20.602607+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:50:20.602684+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:50:20.602716+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2784147:14245173 lookup #0x1000000027c/mycluster-node-4 2024-03-12T03:44:54.212496+0000 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2024-03-12T03:44:54.213086+0000",
            "age": 749.22749317299997,
            "duration": 749.22790294200001,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2784147
                    },
                    "tid": 14245173
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:44:54.213086+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213086+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213086+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213090+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213100+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213112+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213113+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2024-03-12T03:44:54.213115+0000",
                        "event": "failed to rdlock, waiting" 
                    }
                ],
                "locks": null
            }
        },
        {
            "description": "client_request(client.2824133:9920010 getattr AsLsXsFs #0x100000005eb 2024-03-12T03:48:50.114980+0000 caller_uid=1000, caller_gid=65534{96,65534,})",
            "initiated_at": "2024-03-12T03:48:50.116677+0000",
            "age": 513.32390260299996,
            "duration": 513.32434409200005,
            "type_data": {
                "flag_point": "joining batch getattr",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 2824133
                    },
                    "tid": 9920010
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-03-12T03:48:50.116677+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-03-12T03:48:50.116677+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-03-12T03:48:50.116677+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-03-12T03:48:50.116682+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-03-12T03:48:50.116748+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-03-12T03:48:50.116772+0000",
                        "event": "joining batch getattr" 
                    }
                ],
                "locks": null
            }
        }
    ],
    "num_ops": 15
}

The two classes of hanging ops here are:

  • `failed to rdlock, waiting`
  • `joining batch getattr` * at the time of writing, this returns 0 results in Google * only possibly related issue is: https://tracker.ceph.com/issues/62847 `"mds: blogbench requests stuck (5mds+scrub+snaps-flush)"`

Interestingly, on `node-4` (the node matching the directory which is hanging during listing), the following works:

root@mycluster-node-4 # ls -1U /var/mycluster/files/logs/web-server
mycluster-node-6
mycluster-node-5
mycluster-node-4

But it's hanging on the other machines. It also hangs on `node-4` if I don't pass `-1U` to `ls`.

Equally hanging is `stat /var/mycluster/files/logs/web-server/mycluster-node-4`.

The syscall that's hanging in the `stat` is:

statx(AT_FDCWD</root>, "/var/mycluster/files/logs/web-server/mycluster-node-4", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_ALL, 

In such situations `systemctl stop mycluster-server` (e.g. on `node-4`) fails to work, the process is stuck in state `D` in `htop`.
If terminated, it can't come back up, supposedly also because it cannot start writing to the log file.

I observed

# ceph osd blacklist ls
10.0.0.6:6801/4168360951 2024-03-12T22:27:44.910773+0000
10.0.0.6:6800/4168360951 2024-03-12T22:27:44.910773+0000
listed 2 entries

but removing those blocks with `ceph osd blocklist rm` did not resolve the issue.

On node-6, after `systemctl stop mycluster-server.service nginx.service`, `umount /var/mycluster/files` hangs for ~1 minute. After that, it seems to succeed, and listing works again:

# ls -lh /var/mycluster/files/logs/web-server
total 374G
-rw-r--r-- 1 cam nogroup 137G Mar 12 04:17 mycluster-node-4
-rw-r--r-- 1 cam nogroup 118G Mar 12 04:16 mycluster-node-5
-rw-r--r-- 1 cam nogroup 119G Mar 12 04:14 mycluster-node-6

And `dump_ops_in_flight` is now empty.

Ideas:

  • Maybe the backup process reading large log file on node-6 while the log-writing process is writing to it from node-4.

Workarounds:

  • Generally rebooting the affected Ceph client machine listed in `ceph health detail` fixes the issue immediately.
  • `systemctl stop mycluster-server.service nginx.service && umount /var/mycluster/files` and ~1 minute waiting on the problematic machine seems to fix the issue.

Related issues 1 (1 open0 closed)

Related to CephFS - Bug #62847: mds: blogbench requests stuck (5mds+scrub+snaps-flush)TriagedVenky Shankar

Actions
Actions #1

Updated by Niklas Hambuechen about 2 months ago

The log this is triggering is this:

https://github.com/ceph/ceph/commit/5cf60960b642f999ce08d404a6b6e14c1eb434ca

The log was added in Ceph 18.2.1.

Actions #2

Updated by Niklas Hambuechen about 2 months ago

Some logs from `/var/log/ceph/ceph-mds.mycluster-node-4.log` to show that the problematic op hung for multiple hours:

2024-03-11T22:28:35.732+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276678 from mon.0
2024-03-11T22:28:35.732+0000 7fb2205fe6c0  1 mds.0.276675 handle_mds_map i am now mds.0.276675
2024-03-11T22:28:35.732+0000 7fb2205fe6c0  1 mds.0.276675 handle_mds_map state change up:rejoin --> up:active
2024-03-11T22:28:35.732+0000 7fb2205fe6c0  1 mds.0.276675 recovery_done -- successful recovery!
2024-03-11T22:28:35.748+0000 7fb2205fe6c0  1 mds.0.276675 active_start
2024-03-11T22:28:35.759+0000 7fb2205fe6c0  1 mds.0.276675 cluster recovered.
2024-03-11T22:28:40.011+0000 7fb21d3ff6c0 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-03-11T22:35:35.109+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-11T22:49:06.669+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-11T22:49:10.669+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-11T23:49:26.733+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-12T01:36:14.785+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-12T01:36:18.786+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-12T01:56:22.804+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-12T02:24:06.620+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
2024-03-12T02:47:43.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.871760 secs
2024-03-12T02:47:43.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 30.871756 seconds old, received at 2024-03-12T02:47:12.234710+0000: client_request(client.2824133:8
720628 getattr Fs #0x100000005eb 2024-03-12T02:47:12.232891+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently failed to rdlock, waiting
2024-03-12T02:47:46.957+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276680 from mon.0
2024-03-12T02:47:48.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 35.871824 secs
2024-03-12T02:47:53.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 40.871891 secs
2024-03-12T02:47:58.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 45.871958 secs
2024-03-12T02:48:03.104+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 50.872031 secs
2024-03-12T02:48:08.105+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 55.872097 secs
2024-03-12T02:48:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 60.872171 secs
2024-03-12T02:48:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 60.872170 seconds old, received at 2024-03-12T02:47:12.234710+0000: client_request(client.2824133:8
720628 getattr Fs #0x100000005eb 2024-03-12T02:47:12.232891+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently failed to rdlock, waiting
2024-03-12T02:48:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : client.2834268 isn't responding to mclientcaps(revoke), ino 0x100000005eb pending pLs issued pAsLsXsFr, sent 61.
892039 seconds ago
2024-03-12T02:48:15.227+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276681 from mon.0
2024-03-12T02:48:18.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 65.872242 secs
2024-03-12T02:48:23.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 70.872310 secs
2024-03-12T02:48:28.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 75.872388 secs
2024-03-12T02:48:33.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 80.872459 secs
2024-03-12T02:48:38.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 85.872533 secs
2024-03-12T02:48:43.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 90.872602 secs
2024-03-12T02:48:48.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 95.872676 secs
2024-03-12T02:48:53.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 100.872743 secs
2024-03-12T02:48:58.107+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 105.872822 secs
2024-03-12T02:49:03.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 110.872891 secs
2024-03-12T02:49:08.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 115.872961 secs
2024-03-12T02:49:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 120.873030 secs
2024-03-12T02:49:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 120.873030 seconds old, received at 2024-03-12T02:47:12.234710+0000: client_request(client.2824133:
8720628 getattr Fs #0x100000005eb 2024-03-12T02:47:12.232891+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently failed to rdlock, waiting
2024-03-12T02:49:13.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : client.2834268 isn't responding to mclientcaps(revoke), ino 0x100000005eb pending pLs issued pAsLsXsFr, sent 121.892859 seconds ago
2024-03-12T02:49:18.106+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 125.873104 secs
2024-03-12T02:49:23.107+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 130.873180 secs
...
2024-03-12T03:18:48.131+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 1895.898001 secs
...
2024-03-12T03:36:47.408+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276705 from mon.0
2024-03-12T03:36:49.847+0000 7fb2205fe6c0  0 log_channel(cluster) log [WRN] : client.2824133 does not advance its oldest_client_tid (8720628), 100038 completed requests recorded in session
2024-03-12T03:36:51.419+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276706 from mon.0
...
2024-03-12T03:41:37.967+0000 7fb2205fe6c0  0 log_channel(cluster) log [WRN] : client.2824133 does not advance its oldest_client_tid (8720628), 800951 completed requests recorded in session
...
2024-03-12T04:14:48.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 30 slow requests, 0 included below; oldest blocked for > 5256.255658 secs
2024-03-12T04:14:53.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 30 slow requests, 0 included below; oldest blocked for > 5261.255727 secs
2024-03-12T04:14:58.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 30 slow requests, 3 included below; oldest blocked for > 5266.255805 secs
2024-03-12T04:14:58.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 483.781597 seconds old, received at 2024-03-12T04:06:54.708917+0000: client_request(client.2834268:1817050 getattr AsLsFs #0x100000005eb 2024-03-12T04:06:54.708065+0000 caller_uid=0, caller_gid=0{0,}) currently joining batch getattr
2024-03-12T04:14:58.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 481.391703 seconds old, received at 2024-03-12T04:06:57.098811+0000: client_request(client.2824133:9920042 getattr AsLsXsFs #0x100000005eb 2024-03-12T04:06:57.097201+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently joining batch getattr
2024-03-12T04:14:58.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 241.347309 seconds old, received at 2024-03-12T04:10:57.143205+0000: client_request(client.2824133:10012034 getattr AsLsFs #0x100000005eb 2024-03-12T04:10:57.142711+0000 caller_uid=0, caller_gid=0{0,}) currently joining batch getattr
2024-03-12T04:15:03.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 48 slow requests, 3 included below; oldest blocked for > 5271.255879 secs
2024-03-12T04:15:03.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 123.898023 seconds old, received at 2024-03-12T04:12:59.592565+0000: client_request(client.2824133:10051904 getattr AsLsXsFs #0x100000005eb 2024-03-12T04:12:59.590983+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently joining batch getattr
2024-03-12T04:15:03.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 33.398968 seconds old, received at 2024-03-12T04:14:30.091620+0000: client_request(client.2824133:10051905 getattr AsLsXsFs #0x100000005eb 2024-03-12T04:14:30.090203+0000 caller_uid=1000, caller_gid=65534{96,65534,}) currently joining batch getattr
2024-03-12T04:15:03.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 514.678319 seconds old, received at 2024-03-12T04:06:28.812269+0000: client_request(client.2834268:1817047 getattr AsLsFs #0x100000005eb 2024-03-12T04:06:28.810869+0000 caller_uid=0, caller_gid=0{0,}) currently joining batch getattr
2024-03-12T04:15:06.681+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276826 from mon.0
2024-03-12T04:15:08.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 48 slow requests, 1 included below; oldest blocked for > 5276.255957 secs
2024-03-12T04:15:08.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : slow request 480.173570 seconds old, received at 2024-03-12T04:07:08.317096+0000: client_request(client.2784147:14245464 lookup #0x1000000027c/mycluster-node-4 2024-03-12T04:07:08.315784+0000 caller_uid=0, caller_gid=0{0,}) currently joining batch lookup
2024-03-12T04:15:13.489+0000 7fb21edfb6c0  0 log_channel(cluster) log [WRN] : 48 slow requests, 0 included below; oldest blocked for > 5281.256021 secs
2024-03-12T04:15:18.947+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276827 from mon.0
2024-03-12T04:15:55.099+0000 7fb2205fe6c0  1 mds.mycluster-node-4 Updating MDS map to version 276828 from mon.0
2024-03-12T04:17:35.197+0000 7fb222fff6c0  1 mds.mycluster-node-4 asok_command: dump_ops_in_flight {prefix=dump_ops_in_flight} (starting...)
2024-03-12T04:26:38.595+0000 7fb21e5fa6c0  0 mds.beacon.mycluster-node-4 missed beacon ack from the monitors
...
Actions #3

Updated by Niklas Hambuechen about 2 months ago

Are these possibly related?

I found these via https://github.com/ceph/ceph-client/commit/6df89bf220fdac9f40b0d35cd132eef54cf99d4b

Actions #4

Updated by Xiubo Li about 2 months ago

Niklas Hambuechen wrote:

Are these possibly related?

I found these via https://github.com/ceph/ceph-client/commit/6df89bf220fdac9f40b0d35cd132eef54cf99d4b

This is more like the same issue with:

https://tracker.ceph.com/issues/50223
https://tracker.ceph.com/issues/57244

Actions #5

Updated by Niklas Hambuechen about 2 months ago

@Xiubo Li Could you shortly summarise them for me? For the second one, it's marked as Resolved but there seems to be a PR and its revert in it. Which of that is that issue's fix, and can we expect that one of the commits, or recent kernel versions, also fixes my issue here, or does it just seem related but not definitely the same? Thanks!

Actions #6

Updated by Milind Changire about 2 months ago

  • Assignee set to Xiubo Li
Actions #7

Updated by Xiubo Li about 1 month ago

Niklas Hambuechen wrote:

@Xiubo Li Could you shortly summarise them for me? For the second one, it's marked as Resolved but there seems to be a PR and its revert in it. Which of that is that issue's fix, and can we expect that one of the commits, or recent kernel versions, also fixes my issue here, or does it just seem related but not definitely the same? Thanks!

For the ceph PR we have reverted it and then updated it again with a following fix:

Old: https://github.com/ceph/ceph/pull/47752
New: https://github.com/ceph/ceph/pull/51500

IMO it should be the same issue, because the cap revoking request stuck and then the following client requests.

In upstream we have fixed it since v6.8-rc5.

Actions #8

Updated by Venky Shankar about 1 month ago

  • Related to Bug #62847: mds: blogbench requests stuck (5mds+scrub+snaps-flush) added
Actions #9

Updated by Niklas Hambuechen about 1 month ago

In upstream we have fixed it since v6.8-rc5.

All clear. I will best be able to test that in ~2-3 months, when my distro has that available in its next stable release.

Actions

Also available in: Atom PDF