Bug #58052

Empty Pool (zero objects) shows usage.

Added by Brian Woods 2 months ago. Updated 27 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a pool that was/is being used in a CephFS. I have migrated all of the files off of the pool and was preparing to remove it. But even after a day of the pool being empty (zero objects), it still shows as consuming space. It also shows statistics for objects under compression:

POOL_NAME                          USED  OBJECTS  CLONES    COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
...
CephFS-Erasure-ByOSD-D4F2-Data  6.3 GiB        0       0         0                   0        0         0   5971837  5.9 TiB   2495150   39 GiB     365 MiB      730 MiB
...
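For scripting checks like this, the OBJECTS column can be pulled out of a `rados df` line with a quick awk sketch. The sample line below is hardcoded from this report for illustration; in practice you would pipe live `rados df` output in:

```shell
# Sample "rados df" line hardcoded from this report; normally you would
# pipe live output, e.g.:  rados df | awk '/^CephFS-Erasure/ {print $4}'
line='CephFS-Erasure-ByOSD-D4F2-Data  6.3 GiB        0       0         0                   0        0         0   5971837  5.9 TiB   2495150   39 GiB     365 MiB      730 MiB'

# USED spans two whitespace-separated fields (value + unit), so OBJECTS
# is field 4 in this layout.
objects=$(echo "$line" | awk '{print $4}')
echo "objects=$objects"
```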

I recently added about a dozen OSDs and it is going to take another day or so for that re-balancing to finish. Is it possible that this is because things are still re-balancing?
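One way to check whether rebalancing is still in flight for this pool, sketched here against the pool name from this report (requires a live cluster):

```shell
# Cluster-wide status: look for PGs in "backfilling"/"recovering" states.
ceph -s

# Per-pool I/O and recovery rates; nonzero recovery activity here means
# data is still moving for this pool.
ceph osd pool stats CephFS-Erasure-ByOSD-D4F2-Data
```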

I tried a few things to see if I could figure out what was going on:

rbd -p CephFS-Erasure-ByOSD-D4F2-Data du
#No results.
rados -p CephFS-Erasure-ByOSD-D4F2-Data ls
#No results.
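One caveat worth ruling out (an assumption on my part, not something confirmed in this report): plain `rados ls` only lists objects in the default namespace, so objects in other namespaces would not show up. A sketch:

```shell
# List objects across ALL namespaces in the pool, not just the default
# one, to rule out leftovers hiding in a non-default namespace.
rados -p CephFS-Erasure-ByOSD-D4F2-Data ls --all

# Per-pool accounting detail, including stored vs. raw usage.
ceph df detail
```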

I also made sure that there was nothing leftover from any benchmarks:

rados -p CephFS-Erasure-ByOSD-D4F2-Data cleanup
#No change.

This pool was only used for CephFS and was not used for any block devices, nor have I used any snapshots.

The cluster started out as 17.2.3 but is currently running 17.2.4.

I would like to remove this pool, but I am not sure if it is safe to do so. I am not sure if I have run into some sort of an orphaning issue, or if this is just the statistics not calculating correctly.

Thanks!

Logs.zip (783 KB) Brian Woods, 12/02/2022 05:39 PM

server1.zip (349 KB) Brian Woods, 12/02/2022 06:09 PM

server2.zip (363 KB) Brian Woods, 12/02/2022 06:09 PM

History

#1 Updated by Radoslaw Zarzynski 2 months ago

  • Status changed from New to Need More Info

Could you please provide a log from an active mgr with debug_ms=1 and debug_mgr=20? We would like to see which OSDs participate in these numbers.

Also, could you please retake the stats for the CephFS-Erasure-ByOSD-D4F2-Data pool?

#2 Updated by Brian Woods 2 months ago

Radoslaw Zarzynski wrote:

Could you please provide a log from an active mgr with debug_ms=1 and debug_mgr=20?

I am not 100% sure I know how to do that, but this is what I have so far. I found this:
https://access.redhat.com/solutions/2085183

ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_ms 1

So I:

# docker exec -it ceph-### bash
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.rgw-default.###  config set debug_ms 1
{
    "success": "" 
}
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.rgw-default.###  config set debug_mgr 20
{
    "success": "" 
}

That seemed to work, but I am not sure what logs you want me to grab. Just cephadm.log? I will have to scrub it a bit, but can get that soon.

We would like to see which OSDs participate in these numbers.

Is there a specific command I can issue to figure out what OSDs are participating in the space usage?
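For reference, the PG-to-OSD mapping for a pool can be listed with `ceph pg ls-by-pool`; the acting sets shown are the OSDs contributing to the pool's usage numbers (a sketch, using the pool name from this report, against a live cluster):

```shell
# Show each PG in the pool together with its state and acting OSD set.
ceph pg ls-by-pool CephFS-Erasure-ByOSD-D4F2-Data
```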

Also, could you please retake the stats for the CephFS-Erasure-ByOSD-D4F2-Data pool?

This? It moved a bit this morning when I deleted some files from another pool, but not since then (about 24 hours).

POOL_NAME                          USED  OBJECTS  CLONES    COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR

CephFS-Erasure-ByOSD-D4F2-Data  6.2 GiB        0       0         0                   0        0         0   5971837  5.9 TiB   2495150   39 GiB     360 MiB      720 MiB

#3 Updated by Radoslaw Zarzynski 2 months ago

Well, I think the command you mentioned took effect on the RGW, not the MGR. I'm providing the commands to increase log verbosity on all mgrs below.

ceph config set mgr debug_ms 1
ceph config set mgr debug_mgr 20

Then, please collect the logs and revert the log settings to defaults.
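For reference, the usual built-in defaults here are debug_ms 0/0 and debug_mgr 1/5; rather than setting those values back by hand, the overrides can simply be removed so the daemons fall back to their defaults. A sketch:

```shell
# After collecting the logs, drop the overrides; the mgr daemons fall
# back to their compiled-in defaults (typically debug_ms 0/0, debug_mgr 1/5).
ceph config rm mgr debug_ms
ceph config rm mgr debug_mgr
```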

#4 Updated by Brian Woods 2 months ago

Radoslaw Zarzynski wrote:

Well, I think the command you mentioned took effect on the RGW, not the MGR. I'm providing the commands to increase log verbosity on all mgrs below.

[...]

Then, please collect the logs and revert the log settings to defaults.

Sorry for the delay. Logs attached.

What are the defaults?

Also, an update: I emptied another pool and have a similar case with it. An interesting side note: it took sending "rados -p CephFS-Erasure-ByOSD-D5F1-Data cleanup --prefix benchmark_data" three times before it actually deleted the blocks. On the third attempt it gave a "warning doing linear search" message, or something along those lines.

So we now look like this:

POOL_NAME                          USED   OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
.mgr                             70 MiB        10       0         20                   0        0         0     16218   18 MiB     28266  592 MiB         0 B          0 B
.rgw.root                        48 KiB         6       0         12                   0        0         0       144  144 KiB         6    6 KiB         0 B          0 B
CephFS-Erasure-ByOSD-D4F2-Data  5.9 GiB         0       0          0                   0        0         0   5971837  5.9 TiB   2495150   39 GiB     353 MiB      706 MiB
CephFS-Erasure-ByOSD-D5F1-Data   74 GiB         0       0          0                   0        0         0  33833233   47 TiB  13492697  2.0 TiB     213 MiB      425 MiB
CephFS-Erasure-ByOSD-D5F3-Data   67 TiB  12757283       0  102058264                   0        0         0   3470975  2.5 TiB  19723113   43 TiB     145 GiB      291 GiB
CephFS-Meta                     5.3 GiB    762332       0    1524664                   0        0         0  35039654   71 GiB  41024292  195 GiB         0 B          0 B
CephFS-Root                      72 KiB   3218516       0    9655548                   0        0         0    258607  253 MiB  11827192  289 MiB         0 B          0 B
SSDStorage                      135 GiB    595698       0    1191396                   0        0         0   6160114  3.7 TiB  31158766  734 GiB      37 GiB       92 GiB
default.rgw.control                 0 B         8       0         16                   0        0         0         0      0 B         0      0 B         0 B          0 B
default.rgw.log                 272 KiB       209       0        418                   0        0         0   5750159  5.5 GiB   3821043   34 KiB         0 B          0 B
default.rgw.meta                 19 KiB         3       0          6                   0        0         0        90   69 KiB         6    2 KiB         0 B          0 B

#5 Updated by Brian Woods 2 months ago

I am realizing those logs are from a single host (server4).
server3 got removed today.
Attaching server1 to this message.

#6 Updated by Brian Woods 2 months ago

Attaching server2 to this message.

#7 Updated by Radoslaw Zarzynski about 2 months ago

Hello. Thanks for the response and the files.

Archive:  server2.zip
  inflating: ceph-client.admin.log   
  inflating: ceph-volume.log         
  inflating: cephadm.log             
  inflating: cephadm.log.1           
  inflating: cephadm.log.2           
  inflating: cephadm.log.3           
  inflating: cephadm.log.4           
  inflating: cephadm.log.5           
  inflating: cephadm.log.6           
  inflating: cephadm.log.7 

Unfortunately, they lack the ceph-mgr's logs. Could you please provide the `ceph-mgr.*` log files?

#8 Updated by Brian Woods about 2 months ago

That is every log file from every node. There are no ceph-mgr* logs. :/

Even from inside the docker on the adm node:

# docker exec -u root -it ceph-cdffec50-412e-11ed-a95e-8be7dd44d8f3-mgr-server1-domain-local-miydsy bash
[root@55772 /]# cd /var/log/ceph
[root@55772 log]# cd ceph/
[root@server1 ceph]# ls -l
total 55772
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.0.log
-rw-r--r-- 1 ceph ceph    24839 Oct  1 02:30 ceph-osd.0.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.1.log
-rw-r--r-- 1 ceph ceph    50855 Oct  2 19:44 ceph-osd.1.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct 23 00:00 ceph-osd.11.log
-rw-r--r-- 1 ceph ceph    26394 Oct 21 19:57 ceph-osd.11.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.2.log
-rw-r--r-- 1 ceph ceph    50852 Oct  2 19:45 ceph-osd.2.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.3.log
-rw-r--r-- 1 ceph ceph    50824 Oct  2 19:45 ceph-osd.3.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.4.log
-rw-r--r-- 1 ceph ceph    51142 Oct  2 19:46 ceph-osd.4.log.1.gz
-rw-r--r-- 1 ceph ceph        0 Oct  3 00:00 ceph-osd.5.log
-rw-r--r-- 1 ceph ceph    51450 Oct  2 19:47 ceph-osd.5.log.1.gz
-rw-r--r-- 1 root root        0 Dec  5 00:00 ceph-volume.log
-rw-r--r-- 1 root root  4175959 Dec  4 15:33 ceph-volume.log.1.gz
-rw-r--r-- 1 root root  6519855 Dec  3 23:40 ceph-volume.log.2.gz
-rw-r--r-- 1 root root  8749846 Dec  2 23:49 ceph-volume.log.3.gz
-rw-r--r-- 1 root root 10504801 Dec  1 23:59 ceph-volume.log.4.gz
-rw-r--r-- 1 root root  9538837 Nov 30 23:51 ceph-volume.log.5.gz
-rw-r--r-- 1 root root  8649327 Nov 29 23:42 ceph-volume.log.6.gz
-rw-r--r-- 1 root root  8631026 Nov 28 23:41 ceph-volume.log.7.gz
-rw-r--r-- 1 root root        0 Oct 27 00:00 cephadm.log
-rw-r--r-- 1 root root      188 Oct 25 14:02 cephadm.log.1.gz
[root@server1 ceph]# 

Ideas? I don't recall seeing mgr-specific logs on any of my deployments (I have a few).

#9 Updated by Brian Woods about 2 months ago

Alright, I found that the logs can be accessed from Docker itself. I am in the process of pulling them, but I am already at 5 GB of logs and growing. I will compress them and post a Google Drive link as soon as they are complete.

#10 Updated by Brian Woods about 2 months ago

Alright, please let me know when you have the file so I can remove it from my drive:
https://drive.google.com/file/d/1i--yt4tcOn5augmpmRbQmS8pg4yyAPFU/view?usp=sharing

#11 Updated by Radoslaw Zarzynski about 1 month ago

Glad you've found it! Would you mind uploading via ceph-post-file (https://docs.ceph.com/en/quincy/man/8/ceph-post-file/)?

#12 Updated by Brian Woods about 1 month ago

Radoslaw Zarzynski wrote:

Glad you've found it! Would you mind uploading via ceph-post-file (https://docs.ceph.com/en/quincy/man/8/ceph-post-file/)?

The first attempt failed with an unknown RSA key error. Odd, as I have never used the command before, but meh...

$ ceph-post-file  --description "All MGR Logs" -u "BRIANWOODS" mgr.zip 
args: --description All MGR Logs -u BRIANWOODS -- mgr.zip
args: -u BRIANWOODS -- mgr.zip
args: -- mgr.zip
/usr/bin/ceph-post-file: upload tag f252e192-a62d-4ccd-846a-61906260aae3
/usr/bin/ceph-post-file: user: BRIANWOODS
/usr/bin/ceph-post-file: description: All MGR Logs
/usr/bin/ceph-post-file: will upload file mgr.zip
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:cgG0rSk0W8I4sC+LZ8un9R0D5FKOwxwL5XVBt7PeW8Y.
Please contact your system administrator.
Add correct host key in /home/bpwoods/.ssh/known_hosts to get rid of this message.
Offending RSA key in /usr/share/ceph/known_hosts_drop.ceph.com:1
  remove with:
  ssh-keygen -f "/usr/share/ceph/known_hosts_drop.ceph.com" -R "drop.ceph.com" 
RSA host key for drop.ceph.com has changed and you have requested strict checking.
Host key verification failed.
Connection closed.  
Connection closed

I cleared the key and attempted the upload again and got a permission denied error:

$ ceph-post-file  --description "All MGR Logs" -u "BRIANWOODS" mgr.zip 
args: --description All MGR Logs -u BRIANWOODS -- mgr.zip
args: -u BRIANWOODS -- mgr.zip
args: -- mgr.zip
/usr/bin/ceph-post-file: upload tag cf4d360c-5007-4119-a244-1855a1519ebb
/usr/bin/ceph-post-file: user: BRIANWOODS
/usr/bin/ceph-post-file: description: All MGR Logs
/usr/bin/ceph-post-file: will upload file mgr.zip
The authenticity of host 'drop.ceph.com (8.43.84.129)' can't be established.
ECDSA key fingerprint is SHA256:NQf3i0LwF3NK4drP9lDudf1u0UaA9G0uqdWxDefhhrU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'drop.ceph.com' (ECDSA) to the list of known hosts.
postfile@drop.ceph.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Connection closed.  
Connection closed

#13 Updated by Brian Woods about 1 month ago

Any thoughts on this?

#14 Updated by Radoslaw Zarzynski 27 days ago

Downloading manually. Neha is testing ceph-post-file.
