Bug #13543

closed

One host down leads to all virtual machine filesystems in the cluster becoming unwritable

Added by zcc icy over 8 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2015-10-20 06:56:56.518939 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000233: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1733 kB/s rd, 113 MB/s wr, 1338 op/s
2015-10-20 06:56:57.531459 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000234: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1692 kB/s rd, 97300 kB/s wr, 1636 op/s
2015-10-20 06:56:58.555693 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000235: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1446 kB/s rd, 72906 kB/s wr, 1182 op/s
2015-10-20 06:56:59.566033 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000236: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1322 kB/s rd, 66304 kB/s wr, 711 op/s
2015-10-20 06:57:00.579269 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000237: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1294 kB/s rd, 92126 kB/s wr, 1011 op/s
2015-10-20 06:57:01.612804 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000238: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1667 kB/s rd, 91459 kB/s wr, 1422 op/s
2015-10-20 06:57:02.642488 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000239: 2064 pgs: 2064 active+clean; 4833 GB data, 14701 GB used, 19669 GB / 35013 GB avail; 1982 kB/s rd, 106 MB/s wr, 1632 op/s
2015-10-20 06:57:03.652631 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000240: 2064 pgs: 2064 active+clean; 4833 GB data, 14519 GB used, 5894 TB / 34577 GB avail; 1751 kB/s rd, 89680 kB/s wr, 1285 op/s
2015-10-20 06:57:04.680728 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000241: 2064 pgs: 2064 active+clean; 4833 GB data, 14519 GB used, 5894 TB / 34577 GB avail; 2042 kB/s rd, 36731 kB/s wr, 1250 op/s
2015-10-20 06:57:05.690898 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000242: 2064 pgs: 2064 active+clean; 4833 GB data, 14519 GB used, 5894 TB / 34577 GB avail; 1951 kB/s rd, 13390 kB/s wr, 1352 op/s
2015-10-20 06:57:06.721109 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000243: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1604 kB/s rd, 3942 kB/s wr, 1429 op/s
2015-10-20 06:57:07.344647 7f0493534700 1 mon.shd-blade17@0(leader).osd e6497 New setting for CEPH_OSDMAP_FULL -- doing propose
2015-10-20 06:57:07.352737 7f049599c700 1 mon.shd-blade17@0(leader).osd e6498 e6498: 82 osds: 80 up, 80 in full
2015-10-20 06:57:07.365791 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6498: 82 osds: 80 up, 80 in full
2015-10-20 06:57:07.375651 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000244: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1581 kB/s rd, 4168 kB/s wr, 1704 op/s
2015-10-20 06:57:08.404488 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000245: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1139 kB/s rd, 4151 kB/s wr, 1278 op/s
2015-10-20 06:57:09.424324 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000246: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1220 kB/s rd, 3176 kB/s wr, 836 op/s
2015-10-20 06:57:10.489965 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000247: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1280 kB/s rd, 1781 kB/s wr, 643 op/s
2015-10-20 06:57:11.521710 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000248: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1444 kB/s rd, 1092 kB/s wr, 531 op/s
2015-10-20 06:57:12.546321 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000249: 2064 pgs: 2064 active+clean; 4833 GB data, 14334 GB used, 11410 TB / 34140 GB avail; 1089 kB/s rd, 467 kB/s wr, 255 op/s

As the log shows, at 06:57 mon.shd-blade17@0(leader).osd e6497 logged "New setting for CEPH_OSDMAP_FULL -- doing propose", and from that point every osdmap carries the full marker ("82 osds: 80 up, 80 in full"), so the whole storage cluster stopped accepting writes.
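
For reference, a minimal sketch of how one might confirm from an admin node that the cluster-wide full flag is set. It assumes the ceph CLI is on PATH and that "ceph osd dump --format json" exposes the OSDMap flags as a comma-separated "flags" string; the exact JSON layout differs between releases.

#!/usr/bin/env python
# Sketch: report whether the OSDMap "full" flag is set (assumed JSON layout).
import json
import subprocess

def osdmap_flags():
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    # "flags" is assumed to be a comma-separated string, e.g. "full,pauserd".
    return json.loads(out.decode()).get("flags", "").split(",")

if __name__ == "__main__":
    flags = osdmap_flags()
    if "full" in flags:
        print("OSDMap full flag is set -- all client writes are blocked")
    else:
        print("OSDMap flags: %s" % ",".join(flags))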

Then the cluster detects the unreachable OSDs and marks them down (the settings that govern this are shown in the sketch after the log below):

2015-10-20 06:57:24.622728 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.61 172.18.129.86:6801/3925 from osd.20 172.18.129.47:6800/6819 is reporting failure:1
2015-10-20 06:57:24.622769 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.61 172.18.129.86:6801/3925 reported failed by osd.20 172.18.129.47:6800/6819
2015-10-20 06:57:24.733943 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.61 172.18.129.86:6801/3925 from osd.27 172.18.129.48:6800/6856 is reporting failure:1
2015-10-20 06:57:24.733977 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.61 172.18.129.86:6801/3925 reported failed by osd.27 172.18.129.48:6800/6856
2015-10-20 06:57:24.734123 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.76 172.18.129.86:6807/4146 from osd.27 172.18.129.48:6800/6856 is reporting failure:1
2015-10-20 06:57:24.734147 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.76 172.18.129.86:6807/4146 reported failed by osd.27 172.18.129.48:6800/6856
2015-10-20 06:57:24.748197 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.61 172.18.129.86:6801/3925 from osd.67 172.18.129.85:6809/41201 is reporting failure:1
2015-10-20 06:57:24.748225 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.61 172.18.129.86:6801/3925 reported failed by osd.67 172.18.129.85:6809/41201
2015-10-20 06:57:24.748359 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.76 172.18.129.86:6807/4146 from osd.67 172.18.129.85:6809/41201 is reporting failure:1
2015-10-20 06:57:24.748382 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.76 172.18.129.86:6807/4146 reported failed by osd.67 172.18.129.85:6809/41201
2015-10-20 06:57:24.750212 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.61 172.18.129.86:6801/3925 from osd.75 172.18.129.85:6803/40784 is reporting failure:1
2015-10-20 06:57:24.750238 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.61 172.18.129.86:6801/3925 reported failed by osd.75 172.18.129.85:6803/40784
2015-10-20 06:57:24.750373 7f0492d33700 1 mon.shd-@0(leader).osd e6498 prepare_failure osd.76 172.18.129.86:6807/4146 from osd.75 172.18.129.85:6803/40784 is reporting failure:1
2015-10-20 06:57:24.750404 7f0492d33700 0 log_channel(cluster) log [DBG] : osd.76 172.18.129.86:6807/4146 reported failed by osd.75 172.18.129.85:6803/40784
2015-10-20 06:57:24.750463 7f0492d33700 1 mon.shd-@0(leader).osd e6498 we have enough reports/reporters to mark osd.76 down
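
The monitor marks an OSD down only after enough distinct OSDs have reported it, and marks it out only after it has stayed down for mon_osd_down_out_interval seconds. A minimal sketch of reading these settings back through the monitor admin socket; the daemon name mon.shd-blade17 is taken from the log above, and the option names can differ between releases, so treat both as assumptions.

#!/usr/bin/env python
# Sketch: query the mon settings governing down/out handling via "ceph daemon".
import json
import subprocess

def mon_config_get(daemon, option):
    out = subprocess.check_output(
        ["ceph", "daemon", daemon, "config", "get", option])
    # "config get" returns JSON of the form {"<option>": "<value>"}.
    return json.loads(out.decode()).get(option)

if __name__ == "__main__":
    for opt in ("mon_osd_min_down_reporters", "mon_osd_down_out_interval"):
        print("%s = %s" % (opt, mon_config_get("mon.shd-blade17", opt)))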

Then, after the default 300-second mon_osd_down_out_interval ("down for 301.752468"), the monitor marks the two unreachable OSDs out:

2015-10-20 07:02:27.385072 7f0493534700 0 log_channel(cluster) log [INF] : osd.76 out (down for 301.752468)
2015-10-20 07:02:27.390610 7f049599c700 1 mon.shd-blade17@0(leader).osd e6503 e6503: 82 osds: 78 up, 79 in full
2015-10-20 07:02:27.400414 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6503: 82 osds: 78 up, 79 in full
2015-10-20 07:02:27.404465 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000409: 2064 pgs: 1901 active+clean, 163 active+undersized+degraded; 4833 GB data, 14323 GB used, 5535 TB / 34129 GB avail; 98278/3726630 objects degraded (2.637%)
2015-10-20 07:02:28.394939 7f049599c700 1 mon.shd-blade17@0(leader).osd e6504 e6504: 82 osds: 78 up, 79 in full
2015-10-20 07:02:28.424993 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6504: 82 osds: 78 up, 79 in full
2015-10-20 07:02:28.448891 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000410: 2064 pgs: 1901 active+clean, 163 active+undersized+degraded; 4833 GB data, 14323 GB used, 5535 TB / 34129 GB avail; 98278/3726630 objects degraded (2.637%)
2015-10-20 07:02:29.403887 7f049599c700 1 mon.shd-blade17@0(leader).osd e6505 e6505: 82 osds: 78 up, 79 in full
2015-10-20 07:02:29.425632 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6505: 82 osds: 78 up, 79 in full
2015-10-20 07:02:29.449836 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000411: 2064 pgs: 1901 active+clean, 163 active+undersized+degraded; 4833 GB data, 14323 GB used, 5535 TB / 34129 GB avail; 98278/3726630 objects degraded (2.637%)
2015-10-20 07:02:31.517106 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000412: 2064 pgs: 1901 active+clean, 163 active+undersized+degraded; 4833 GB data, 14323 GB used, 5535 TB / 34129 GB avail; 98278/3726630 objects degraded (2.637%)
2015-10-20 07:02:32.386758 7f0493534700 0 log_channel(cluster) log [INF] : osd.61 out (down for 304.711040)
2015-10-20 07:02:32.394420 7f049599c700 1 mon.shd-blade17@0(leader).osd e6506 e6506: 82 osds: 78 up, 78 in full
2015-10-20 07:02:32.406160 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6506: 82 osds: 78 up, 78 in full
2015-10-20 07:02:32.426390 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000413: 2064 pgs: 1901 active+clean, 163 active+undersized+degraded; 4833 GB data, 14314 GB used, 19162 GB / 34120 GB avail; 98278/3726630 objects degraded (2.637%)
2015-10-20 07:02:33.404935 7f049599c700 1 mon.shd-blade17@0(leader).osd e6507 e6507: 82 osds: 78 up, 78 in full
2015-10-20 07:02:33.435666 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6507: 82 osds: 78 up, 78 in full
2015-10-20 07:02:33.452746 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000414: 2064 pgs: 1901 active+clean, 1 peering, 2 active+undersized+degraded+remapped+wait_backfill, 160 active+undersized+degraded; 4833 GB data, 14314 GB used, 19161 GB / 34120 GB avail; 97663/3727213 objects degraded (2.620%); 2339/3727213 objects misplaced (0.063%)
2015-10-20 07:02:34.434785 7f049599c700 1 mon.shd-blade17@0(leader).osd e6508 e6508: 82 osds: 78 up, 78 in full
2015-10-20 07:02:34.459206 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6508: 82 osds: 78 up, 78 in full
2015-10-20 07:02:34.501125 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000415: 2064 pgs: 1901 active+clean, 1 peering, 6 active+undersized+degraded+remapped+backfilling, 12 active+undersized+degraded+remapped+wait_backfill, 144 active+undersized+degraded; 4833 GB data, 14314 GB used, 19161 GB / 34120 GB avail; 97663/3732738 objects degraded (2.616%); 22691/3732738 objects misplaced (0.608%); 1003 MB/s, 251 objects/s recovering
2015-10-20 07:02:36.567323 7f049599c700 0 log_channel(cluster) log [INF] : pgmap v13000416: 2064 pgs: 1901 active+clean, 1 peering, 6 active+undersized+degraded+remapped+backfilling, 12 active+undersized+degraded+remapped+wait_backfill, 144 active+undersized+degraded; 4833 GB data, 14314 GB used, 19161 GB / 34120 GB avail; 97663/3732738 objects degraded (2.616%); 22691/3732738 objects misplaced (0.608%); 672 MB/s, 168 objects/s recovering
2015-10-20 07:02:37.390430 7f0493534700 1 mon.shd-blade17@0(leader).osd e6508 New setting for CEPH_OSDMAP_FULL -- doing propose
2015-10-20 07:02:37.395838 7f049599c700 1 mon.shd-blade17@0(leader).osd e6509 e6509: 82 osds: 78 up, 78 in
2015-10-20 07:02:37.424323 7f049599c700 0 log_channel(cluster) log [INF] : osdmap e6509: 82 osds: 78 up, 78 in

Then, at 07:02:37, the full flag is cleared ("New setting for CEPH_OSDMAP_FULL -- doing propose", followed by osdmap e6509 without the full marker) and the storage cluster becomes writable again.

The question is why the cluster became unwritable, and how to avoid it happening again.
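
On this vintage of Ceph the monitor raises the cluster-wide full flag as soon as any single OSD crosses the full ratio (mon osd full ratio, 0.95 by default), regardless of how much aggregate space is free, so one way to avoid a repeat is to watch per-OSD utilization and rebalance before any OSD nears the nearfull ratio. Below is a minimal monitoring sketch; it assumes "ceph osd df --format json" is available and returns a "nodes" list with "id", "kb" and "kb_used" per OSD (field names vary by release), and the 0.85 threshold mirrors the default nearfull ratio.

#!/usr/bin/env python
# Sketch: flag OSDs whose utilization approaches the (assumed) nearfull ratio.
import json
import subprocess

NEARFULL = 0.85  # default mon osd nearfull ratio; adjust to the cluster's value

def osd_utilization():
    out = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
    for node in json.loads(out.decode()).get("nodes", []):
        if node.get("kb"):  # skip entries without capacity data
            yield node["id"], float(node["kb_used"]) / node["kb"]

if __name__ == "__main__":
    for osd_id, util in sorted(osd_utilization(), key=lambda x: -x[1]):
        warn = "  <-- above nearfull" if util >= NEARFULL else ""
        print("osd.%-3s %5.1f%%%s" % (osd_id, util * 100, warn))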

#1

Updated by Samuel Just over 8 years ago

I don't quite follow what happened, but it looks like your cluster is full and is not accepting writes.

#2

Updated by zcc icy over 8 years ago

Samuel Just wrote:

I don't quite follow what happened, but it looks like your cluster is full and is not accepting writes.

Ceph has about 60% free space left, but CEPH_OSDMAP_FULL was set on the cluster, so the cluster stopped accepting writes.

#3

Updated by Sage Weil over 8 years ago

zcc icy wrote:

Samuel Just wrote:

I don't quite follow what happened, but it looks like your cluster is full and is not accepting writes.

Ceph has about 60% free space left, but CEPH_OSDMAP_FULL was set on the cluster, so the cluster stopped accepting writes.

There is one OSD that is full. See 'ceph health detail' to find out which one.

#4

Updated by Samuel Just over 8 years ago

  • Status changed from New to Closed