Bug #8532 (closed)

0.80.1: OSD crash (domino effect), same as Bug #8229

Added by Markus Blank-Burian almost 10 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: Urgent
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Severity: 2 - major

Description

I triggered the same issue as described in bug http://tracker.ceph.com/issues/8229 on our HPC storage cluster. For more information, I attached more detailed logfiles. The crash can be reproduced in the current state, and I cannot bring the whole cluster up again (only about 20 of the 70 OSDs come up).


Files

kaa-101.log.gz (21.6 MB) Log of crashing OSD, Markus Blank-Burian, 06/03/2014 03:42 PM
osd.46.log.gz (19.6 MB) Markus Blank-Burian, 06/04/2014 11:39 AM
osd.6.log.gz (20.8 MB) Markus Blank-Burian, 06/04/2014 11:39 AM
osd.46-2.log.gz (16.6 MB) Markus Blank-Burian, 06/04/2014 01:22 PM
osd.6-2.log.gz (18.4 MB) Markus Blank-Burian, 06/04/2014 01:22 PM
osd.64.log.gz (4.08 MB) Markus Blank-Burian, 06/04/2014 01:36 PM
osd.12.log.gz (9.34 MB) Markus Blank-Burian, 06/04/2014 02:08 PM
osd.17.log.gz (17.8 MB) Markus Blank-Burian, 06/04/2014 02:08 PM
pgdump.txt (937 KB) Markus Blank-Burian, 06/04/2014 04:42 PM
error.log.gz (644 KB) Markus Blank-Burian, 06/06/2014 11:51 AM
kaa-14.err.log.gz (44.7 KB) Markus Blank-Burian, 06/06/2014 11:51 AM
kaa-14-smart.log.gz (1.9 KB) Markus Blank-Burian, 06/06/2014 11:51 AM
pg07f1-errors.txt.gz (130 KB) Markus Blank-Burian, 06/23/2014 02:11 PM

Related issues 1 (0 open, 1 closed)

Copied from Ceph - Bug #8229: 0.80~rc1: OSD crash (domino effect) (Closed, Samuel Just, 04/27/2014)

Actions #1

Updated by Samuel Just almost 10 years ago

Can you restart osd.6 and osd.46 with

debug osd = 20
debug filestore = 20
debug ms = 1

and attach both logs?
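
(A sketch of how those settings are commonly applied; the ceph.conf placement and the injectargs variant are the usual routes, but treat the exact invocations as assumptions for this cluster. For logs that cover startup, the ceph.conf route plus a restart is the one that matters:)

# in ceph.conf on the affected hosts, then restart the daemons
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

# or, to raise the levels on already-running daemons without a restart
ceph tell osd.6 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
ceph tell osd.46 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'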

From what I see, it looks like the on-disk state of osd.46 is pretty wonky. Was there a power cycle of any kind?
-Sam

Actions #2

Updated by Markus Blank-Burian almost 10 years ago

I think I had to power-cycle osd.12 and osd.17, but the other nodes were rebooted normally (I rechecked this for osd.6 and osd.46 in the logs). I restarted a majority of the nodes at the same time without setting noout, so I suppose the cluster started rebalancing while some nodes were still rebooting, and there was also still some activity from CephFS during that time.
BUT: I just reread the mon log; cluster health was OK and all PGs were active for approx. 2.5 hours before bad things happened:

2014-06-03 20:04:25.170876 7f9eeea56700 0 log [INF] : pgmap v18320477: 4416 pgs: 4414 active+clean, 1 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 11106 GB data, 33693 GB used, 33269 GB / 66974 GB avail; 1897 kB/s wr, 7 op/s
2014-06-03 20:04:26.184077 7f9eeea56700 0 log [INF] : pgmap v18320478: 4416 pgs: 4414 active+clean, 1 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 11106 GB data, 33693 GB used, 33269 GB / 66974 GB avail; 3284 kB/s wr, 7 op/s
2014-06-03 20:04:27.062163 7f9eee255700 1 mon.a@0(leader).osd e186719 prepare_failure osd.32 192.168.1.44:6800/57335 from osd.9 192.168.1.104:6800/6254 is reporting failure:1
2014-06-03 20:04:27.062193 7f9eee255700 0 log [DBG] : osd.32 192.168.1.44:6800/57335 reported failed by osd.9 192.168.1.104:6800/6254
2014-06-03 20:04:27.187728 7f9eeea56700 0 log [INF] : pgmap v18320479: 4416 pgs: 4414 active+clean, 1 active+clean+scrubbing, 1 active+clean+scrubbing+deep; 11106 GB data, 33693 GB used, 33269 GB / 66974 GB avail; 8596 kB/s wr, 10 op/s
2014-06-03 20:04:27.355050 7f9eee255700 1 mon.a@0(leader).osd e186719 prepare_failure osd.32 192.168.1.44:6800/57335 from osd.60 192.168.1.83:6800/14066 is reporting failure:1

I attached the logs from osd.6 and osd.46 on startup with only the mon/mds and osd.50 running. In this case, neither OSD crashes, so I suppose it needs one or more of the others.

Actions #3

Updated by Samuel Just almost 10 years ago

Yeah, if it didn't crash you'll probably also have to start the other osds. I asked for osd.6 and osd.46 because osd.46 was the primary for the pg on which osd.6 crashed, and was responsible for sending a problematic message. I need logs on the crasher and the primary from startup until the crash for at least one problem pg.

Actions #4

Updated by Markus Blank-Burian almost 10 years ago

OK, I now started other OSDs as well, reproduced the crash on osd.6, and attached the corresponding logs. I have saved the other logs from this run, just in case you need them.

Actions #5

Updated by Samuel Just almost 10 years ago

This time it crashed on a message from osd.64; did you happen to have logging on for that one as well? If not, can you restart 4, 46, and 64 with logging, reproduce, and attach the logs?

Actions #6

Updated by Markus Blank-Burian almost 10 years ago

Attached the log from osd.64 for the same run.

Actions #7

Updated by Samuel Just almost 10 years ago

Can you post a recursive ls on osd.64's current/0.504_head directory?
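
(A sketch of the requested listing, assuming the default OSD data path; the actual mount point on this cluster may differ:)

ls -lR /var/lib/ceph/osd/ceph-64/current/0.504_head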

Actions #8

Updated by Samuel Just almost 10 years ago

Also, can you restart osd.12 and osd.17 and post the logs from that (same log levels).

Actions #9

Updated by Markus Blank-Burian almost 10 years ago

On osd.64, the directory current/0.504_head is empty.

Actions #10

Updated by Markus Blank-Burian almost 10 years ago

Both osd.12 and osd.17 crashed; logs are attached.

Actions #11

Updated by Samuel Just almost 10 years ago

Is there a non-empty 0.504_head on any OSD (perhaps 12 or 17)?

Actions #12

Updated by Markus Blank-Burian almost 10 years ago

I made a list of the directory contents on the OSDs:

hostname: osd
---------------
kaa-63: osd.40, contains directory-tree (DIR_4/...) with data-files
kaa-64: osd.41, empty
kaa-55: osd.28, empty
kaa-59: osd.23, contains only data-files (no tree)
kaa-82: osd.59, empty
kaa-98: osd.3, contains directory-tree (DIR_4/...) with data-files
kaa-101: osd.6, contains only some data-files (no tree)
kaa-13: osd.14, empty
kaa-19: osd.64, empty
kaa-17: osd.18, contains directory-tree (DIR_4/...) with data-files
kaa-49: osd.38, empty

Actions #13

Updated by Samuel Just almost 10 years ago

Ok, it looks like these pgs had an osd (or osds) power cycled in a way that caused the pg metadata to go significantly back in time. That broken metadata was then propagated to other osds as part of recovery. I suspect that your controllers/drives/fs are not passing barriers correctly, and that is messing up the transaction system in the osd and leveldb. I don't see a simple way to recover the cluster.

Actions #14

Updated by Samuel Just almost 10 years ago

You can identify the broken pgs by looking at the logs on the crashing osds and tracing the thread which asserts back to a line with merge_log. That line should identify the broken pg. You can then look for that pg across the cluster and remove the empty ones. That might or might not improve the situation.
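
(A sketch of that tracing against one of the attached logs; the thread id here is only an example value, and the log file name is an assumption:)

grep -n 'FAILED assert' osd.6.log                 # note the thread id in the asserting line, e.g. 7f7c73ff7700
grep 7f7c73ff7700 osd.6.log | grep merge_log      # the merge_log line in that thread names the broken pg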

Actions #15

Updated by Markus Blank-Burian almost 10 years ago

Thanks for your help! After deleting some empty directories and removing a half-full one, all OSDs now came up. I am proceeding with a complete deep scrub and will repair all PGs as necessary.
Regarding barriers: I disabled the write cache on all disks and used the default XFS mount options, so barriers should be active. Possibly I missed some controller settings; I will recheck this.
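
(For reference, a sketch of how those settings might be double-checked; /dev/sda is a hypothetical device name:)

hdparm -W /dev/sda            # reports whether the drive write cache is enabled
mount | grep -E 'xfs|btrfs'   # 'nobarrier' in the mount options would mean barriers are off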

Actions #16

Updated by Samuel Just almost 10 years ago

Were you able to localize all of the affected pgs to one or two osds at some point in the past? Also, what controllers/firmware/fs/kernel version are you using? (You aren't the first to see mysterious pg metadata misbehavior; I'm wondering whether we can extract some form of pattern from the noise.)

Actions #17

Updated by Samuel Just almost 10 years ago

The broken fs/controller/disk theory is the best I've got at the moment, but it's pretty much impossible to falsify, which makes it a pretty lousy theory as theories go. Let me know if you think of anything else that might be interesting. In particular, it would be interesting if all of the affected pgs had an osd or two in common. Can you also post a ceph pg dump?

Actions #18

Updated by Markus Blank-Burian almost 10 years ago

Kernel is 3.14.4 (the reboot was an update from 3.14.3).
osd.0-62: controller: AMD SATA onboard (a few in IDE-compat mode, but most AHCI), filesystem XFS
osd.63-69: controller: Intel SATA onboard (C600/X79), filesystem btrfs (new, for testing; no reboot since installation; snapshots disabled)

I am not sure about your theory, since the inconsistent PGs seem to be distributed fairly evenly across all OSDs, as you can see from the current PGs needing repair:
pg 0.4e4 is active+clean+inconsistent, acting [17,27,43]
pg 0.362 is active+clean+inconsistent, acting [11,60,49]
pg 0.2e4 is active+clean+inconsistent, acting [7,27,12]
pg 0.2d1 is active+clean+inconsistent, acting [34,52,24]
pg 0.158 is active+clean+inconsistent, acting [1,12,48]

A current pg dump is attached.

Actions #19

Updated by Samuel Just almost 10 years ago

  • Assignee set to Samuel Just

attempting to reproduce

Actions #20

Updated by Markus Blank-Burian almost 10 years ago

I ran a grep through all the OSD logfiles searching for failed asserts and attached the output. The oldest logfiles available are from 2014-05-28; perhaps there is a clue in there about what happened. The main problems appeared on 2014-06-03 at about 8 pm, but the asserts in the logs started well before that. I sorted them by number of occurrences (a sketch of this kind of grep follows the list):

    932 osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
        first appearance: 2014-05-31T22:26:56

    121 msg/Pipe.cc: 1070: FAILED assert(m)
        first appearance: 2014-05-30T19:02:37

     54 common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
        first appearance: 2014-06-01T23:02:08

      7 osd/ReplicatedPG.cc: 7865: FAILED assert(peer_missing.count(fromshard))
        first appearance: 2014-06-03T20:52:59

      6 osd/ReplicatedPG.cc: 222: FAILED assert(is_primary())
        first appearance: 2014-06-03T22:02:38

      6 osd/PGLog.cc: 303: FAILED assert(i->prior_version == last)
        first appearance: 2014-06-03T22:01:51

      4 os/FileStore.cc: 2540: FAILED assert(0 == "unexpected error")
        first appearance: 2014-06-03T22:01:36

      2 osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
        first appearance: 2014-06-03T23:43:23

      2 common/Mutex.cc: 93: FAILED assert(r == 0)
        first appearance: 2014-02-06T14:02:37

      1 osd/PG.cc: 6137: FAILED assert(pg->want_acting.size())
        first appearance: 2014-06-03T21:53:31
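
(A sketch of the kind of pipeline that produces such counts; the log location and naming are assumptions:)

zgrep -ho 'FAILED assert(.*' /var/log/ceph/ceph-osd.*.log*.gz | sort | uniq -c | sort -rn
# first occurrence of one particular assert, relying on the timestamp at the start of each log line
zgrep -h 'assert(log.head >= olog.tail' /var/log/ceph/ceph-osd.*.log*.gz | sort | head -n 1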

So the most frequent assert is the one in merge_log, and there was no other assert (besides assert(m), probably http://tracker.ceph.com/issues/8232) near it. The assert first occurred on osd.15 at 2014-05-31T22:26:56. I added an error log of the corresponding node (kaa-14). It shows regular L3 cache errors with "Error Status: Corrected error, no action required", but these have never produced any error for running programs before (no unexpected segfaults in the logs for over a year). I hoped for some SMART errors, but the values of the disk seem to be OK (see the attached smartctl log). Nevertheless, I will move the OSD to another host. Since the log level was too low, I cannot see which other OSDs were involved in this first assert. The only other thing I can see from the node logs is that during this time there was a lot of network activity, visible as NFS timeouts and Ceph "heartbeat_check: no reply from ..." messages.

Actions #21

Updated by Markus Blank-Burian almost 10 years ago

I am still having problems with the cluster, but this time there is a growing number of inconsistencies, which started with 3 PGs, all located on osd.1. The only crash involved was the assert(m). The list of inconsistencies has kept growing over the last days and is now spread over more and more OSDs. I keep seeing messages like this for all PGs that have inconsistencies:

2014-06-16 08:37:29.388617 osd.60 192.168.1.83:6800/13195 231 : [ERR] 0.32e 52 tried to pull f7e4532e/100002162ae.00000000/head//0 but got (2) No such file or directory
2014-06-16 08:38:44.923609 osd.32 192.168.1.44:6800/10930 102 : [ERR] 0.b09 34 tried to pull ede50b09/100002162b1.0000000e/head//0 but got (2) No such file or directory

I need to mention that OSDs are still crashing with the above-mentioned asserts. Today I am going to apply the fix from http://tracker.ceph.com/issues/8232 and see whether this reduces the crashes and therefore the appearing inconsistencies.

I am still not sure how to work around the "hit suicide timeout". Before the timeout, I see many messages like this one, affecting connections to all the other OSDs:

submit_message osd_ping(ping e197257 stamp 2014-06-16 09:52:48.651395) v2 remote, 192.168.1.24:6803/3030718, failed lossy con, dropping message 0x7fdfec0028a0

But as far as I can see from the logfiles and Ganglia, there was no notable increase in disk or network activity that could explain this.

Actions #22

Updated by Markus Blank-Burian almost 10 years ago

There were more and more inconsistencies with missing files, and I am unsure how to repair them. For example, the primary OSD for pg 0.7f1 reported files missing from two shards:

96 ceph-osd: 2014-06-16 10:45:20.300934 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 1 missing 3a5ff7f1/100002254e8.0000212d/head//0
96 ceph-osd: 2014-06-16 10:45:20.300942 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 66 missing 3a5ff7f1/100002254e8.0000212d/head//0
96 ceph-osd: 2014-06-16 10:45:20.300945 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 1 missing deaff7f1/10000213cae.00001924/head//0
96 ceph-osd: 2014-06-16 10:45:20.300949 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 66 missing deaff7f1/10000213cae.00001924/head//0
96 ceph-osd: 2014-06-16 10:45:20.300952 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 1 missing 9fdff7f1/100001fa418.000014ed/head//0
96 ceph-osd: 2014-06-16 10:45:20.300956 7f7c73ff7700  0 log [ERR] : 0.7f1 shard 66 missing 9fdff7f1/100001fa418.000014ed/head//0
96 ceph-osd: 2014-06-16 10:45:20.302293 7f7c73ff7700  0 log [ERR] : 0.7f1 deep-scrub 809 missing, 0 inconsistent objects
96 ceph-osd: 2014-06-16 10:45:20.302301 7f7c73ff7700  0 log [ERR] : 0.7f1 deep-scrub 1618 errors

I found another copy with the missing files, so I replaced the bad copy with that one. This resolved the issue for shard 1, but it still reports files from shard 66 as missing. So, for example, the file "./DIR_1/DIR_F/DIR_7/DIR_F/100001fa418.000014ed__head_9FDFF7F1__0" exists, but it still says "0.7f1 shard 66 missing 9fdff7f1/100001fa418.000014ed/head//0". Is this the correct directory?

Over the weekend I will run some more stress tests. I may have resolved the problem with the suicide timeouts (which possibly resulted from temporary hangs of our NFS root). Let's see whether this stops new inconsistencies.

Actions #23

Updated by Samuel Just almost 10 years ago

Shard 66 here is osd.66. You mean that ./DIR_1/DIR_F/DIR_7/DIR_F/100001fa418.000014ed__head_9FDFF7F1__0 exists on osd.66? Does ./DIR_1/DIR_F/DIR_7/DIR_F/DIR_F exist? What filesystem are you using?
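
(Background on the directory layout, in case it helps: FileStore nests objects in DIR_<hexdigit> subdirectories taken from the object hash read right to left, so hash 9FDFF7F1 in pg 0.7f1 lands under DIR_1/DIR_F/DIR_7/DIR_F, or one level deeper if that directory has been split again. A sketch for locating the object, with a hypothetical data path:)

find /var/lib/ceph/osd/ceph-66/current/0.7f1_head -name '*9FDFF7F1*'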

Actions #24

Updated by Samuel Just almost 10 years ago

Can you dump the xattrs on that file?
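
(A sketch of one way to dump them, using the path from the previous comments; the OSD data path is an assumption and the attr tools need to be installed:)

cd /var/lib/ceph/osd/ceph-1/current/0.7f1_head
getfattr -d DIR_1/DIR_F/DIR_7/DIR_F/100001fa418.000014ed__head_9FDFF7F1__0
# -d dumps the user.* xattrs, which is where user.ceph._ and user.ceph.snapset live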

Actions #25

Updated by Markus Blank-Burian almost 10 years ago

osd.66 also has a bad copy; only osd.28 had a copy containing the corresponding file. osd.66 is btrfs, while osd.28 and osd.1 are XFS. I copied the pg from osd.28 to osd.1 with rsync, preserving xattrs. The xattr dump is as follows:

# file: DIR_1/DIR_F/DIR_7/DIR_F/100001fa418.000014ed__head_9FDFF7F1__0
user.ceph._=0sDQjhAAAABAM1AAAAAAAAABQAAAAxMDAwMDFmYTQxOC4wMDAwMTRlZP7/////////8fffnwAAAAAAAAAAAAAAAAAFAxQAAAAAAAAAAAAAAP////8AAAAAAAAAAAAAAACY7gMAAAAAAIyQAgCX7gMAAAAAAIyQAgACAhUAAAAIgCsrAAAAAAAcJWYAAAAAAAEAAAAAAEAAAAAAACwgblOUQt8dAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAmO4DAAAAAAAAAAAAAAAAAAAEAAAA
user.ceph.snapset=0sAgIZAAAAAQAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==

Actions #26

Updated by Markus Blank-Burian almost 10 years ago

osd.66 and osd.1 (before my rsync from osd.28) had no subdirectory structure; the few files left on these hosts were directly in the 0.7f1_head directory.

Actions #27

Updated by Samuel Just almost 10 years ago

Why was there an rsync? Can you explain the sequence of events?

Actions #28

Updated by Markus Blank-Burian almost 10 years ago

- pg 0.7f1 incomplete on [1,66,28]
- determined from the logs that there were files missing from shard 1 and 66 on osd.1
- the data size on osd.28 is 3.2 GB compared to 453 MB on osd.1 and osd.66 (why are both copies so small?)
- looked at the directory structure on the OSDs and found that the files only seemed to exist on osd.28
- set ceph osd noout and stopped ceph-osd on osd.1, osd.66 and osd.28
- made a backup copy of 0.7f1_head on osd.1 and osd.28
- deleted the directory 0.7f1_head from osd.1 and osd.28
- restarted the OSDs
- directory size on osd.1 and osd.66 after recovery still 453 MB
- stopped the OSDs
- rsync -aX of 0.7f1_head from osd.28 to osd.1 (see the sketch after this list)
- started the OSDs
- saw in the logs that objects from shard 66 are still missing, but the files are on disk
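
(The rsync step from the list, spelled out as a sketch; the paths are hypothetical, and the command would run over ssh if the two OSDs are on different hosts:)

rsync -aX /var/lib/ceph/osd/ceph-28/current/0.7f1_head/ /var/lib/ceph/osd/ceph-1/current/0.7f1_head/
# -a preserves ownership/permissions/times, -X preserves the extended attributes ceph relies on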

Actions #29

Updated by Samuel Just almost 10 years ago

When did those things happen?

Actions #30

Updated by Markus Blank-Burian almost 10 years ago

After the initial bug report, I repaired all PGs and did a deep scrub on all OSDs, after which all PGs were active+clean; that was on June 5th, I think. There were no disk errors or reboots until this inconsistent pg appeared, only some asserts, which I mentioned earlier. I am gathering the exact time and some information on the asserts right now from our SQL logs, but it might take an hour or so.

Another thing: just now I noticed again some partial network congestion on our hosts, which again led to failed asserts (the one in merge_log, one assert(r == 0); the rest I did not check). But this time the OSDs restarted without error. I am waiting for the deep scrub to finish to see whether this resulted in new inconsistencies.
We have been running Ceph on this system for over a year now, and with 0.61 or 0.72 I never had inconsistent PGs (except, of course, with bad sectors).

Actions #31

Updated by Markus Blank-Burian almost 10 years ago

As you can see from the attached logfile (pg07f1-errors.txt.gz), there was only one assert on the OSDs related to this particular pg:

2014-06-05: all pgs were clean and i started a deep scrub on all osds
2014-06-06: osd.1 FAILED assert(m)
2014-06-13: osd.1 pg 0.7f1 shard 1 and 66 missing

The deep scrub should have finished before the failed assert on 06/06, so possibly the pg broke then and was only checked again a week later.

Actions #32

Updated by Samuel Just almost 10 years ago

Have you been running the xfs/btrfs mix the whole time?

Actions #33

Updated by Markus Blank-Burian almost 10 years ago

For testing purposes, I have had btrfs on osd.63 to osd.69 since 2014-05-28 and on osd.57 since 2014-05-13. Snapshots were disabled with "filestore btrfs snap = false" due to kernel errors with snapshots.

The network congestion yesterday seems to have caused another chain effect: this morning nearly all OSDs were down. I am looking into it. If you think btrfs is a possible cause, I will disable these OSDs for now.

Actions #34

Updated by Samuel Just almost 10 years ago

I think your pg info and logs at some point became badly corrupted for some pgs. I think that is causing the asserts. I wonder whether it might also be causing the missing or inconsistent objects. Would it be possible to attempt to reproduce from a new cluster?

Actions #35

Updated by Markus Blank-Burian almost 10 years ago

Can the pg info and logs be corrupt even if all PGs are active+clean and a deep scrub on all OSDs runs through without inconsistencies? If yes, then I may have no choice but to try to back up all our data. I am not sure yet how to find enough free space for that.

After the downtime this morning, 16 new inconsistent PGs were found after deep scrubbing. Reproducing this is quite simple here, since we have an NFS root and lowering the server threads causes the clients to hang waiting for the NFS root to respond. This causes lags in the OSDs, so some are automatically marked out or hit the "suicide timeout" assert. Following this, some OSDs crash due to other asserts, and PGs are moved "wildly" and copied partly, since an ever-changing number of OSDs is up (some possibly unreachable) until the issues are resolved. What happens if an OSD is scheduled to receive a pg, but the original hosts become reachable again? What happens if one of the target hosts hits an assert during this temporary movement? What happens if data is written during this time? (I have size = 3 and min_size = 2.) What if some partly copied PGs are then replicated to more and more hosts until the main pg comes up again? I know that with correct logging the whole cluster should replay the whole process back to the main PGs once they come up, clean up the temporary copies and resume. But what if during this replay the pg becomes unreachable again (or some assert occurs for whatever reason)?

Still, I do not know why Ceph should hang at all in this case, since I cannot see any periodic file access besides the OSDs' local storage or /proc. Possibly the OS performs a cache refresh for the "/" dentry during some file open, and this might hang; perhaps a random long sleep at file open could simulate the behavior.

Reproducing these network congestions via NFS threads on purpose is unwise, since some of our users' programs abort when this happens.

Actions #36

Updated by Samuel Just almost 10 years ago

Possibly? I've never seen a ceph cluster behave anything like this. Are your osd journals on the nfs root?

Actions #37

Updated by Samuel Just almost 10 years ago

How did you deploy the cluster originally?

Actions #38

Updated by Samuel Just almost 10 years ago

Is it true that your inconsistencies are correlated with network congestion which would also cause the nfs root to hang?

Actions #39

Updated by Markus Blank-Burian almost 10 years ago

Most journals are on local block devices (/dev/sda -> /dev/vg/ceph-journal); a few are still on files, but also on local storage (/dev/sda -> /dev/vg/ceph (xfs or btrfs) -> /local/ceph/journal).
For the journals I have journal aio and filestore fiemap enabled. Can that be a problem? (From the logs: "detect_features: FIEMAP ioctl is supported and appears to work", so I thought that usage might be safe.)

The inconsistencies are strongly correlated with the hangs of the NFS root, and so are the asserts. I am currently investigating how to work around these hangs.

What do you mean by "how did you deploy the cluster"?

Actions #40

Updated by Samuel Just almost 10 years ago

Was it ceph-deploy?

Actions #41

Updated by Samuel Just almost 10 years ago

Are you using journal_force_aio?

Actions #42

Updated by Samuel Just almost 10 years ago

I wonder whether the nfs hangs are interfering somehow with the journal aio stuff; you might try disabling aio.

Actions #43

Updated by Markus Blank-Burian almost 10 years ago

No, I am not using ceph-deploy, since there was no version for Gentoo. I used the following commands to init new OSDs:

UUID=$(ceph osd create)
ceph-osd -i $UUID --mkfs --mkkey
ceph auth add osd.$UUID osd 'allow *' mon 'allow rwx' -i /local/ceph/keyring
ceph osd in $UUID

Actions #44

Updated by Samuel Just almost 10 years ago

http://tracker.ceph.com/issues/2535 seems to be the reason why fiemap defaults to disabled. You may want to disable that as well.

Actions #45

Updated by Samuel Just almost 10 years ago

There probably is a ceph bug in here somewhere, but I think most of your trouble is related to your environment somehow breaking the OSD filesystem assumptions either through a kernel bug or otherwise.

Actions #46

Updated by Markus Blank-Burian almost 10 years ago

Okay, so I am disabling aio and fiemap. Regarding journal files: I checked again, and I am only using block devices as journals; I seem to have updated all hosts some time ago.
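
(For reference, a sketch of the corresponding ceph.conf fragment; placing the options in the [osd] section and restarting the daemons afterwards is assumed:)

[osd]
    journal aio = false
    filestore fiemap = false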

Actions #47

Updated by Samuel Just almost 10 years ago

Each journal is on its own block device? Or on a partition?

Actions #48

Updated by Markus Blank-Burian almost 10 years ago

Journal and storage are on the same hard drive; I used LVM to create one LV for the journal and another LV for the normal Ceph storage (and a third for /etc, /var and /tmp).
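
(A sketch of that LVM layout; the device, volume group name and sizes are assumptions:)

pvcreate /dev/sda2                      # partition set aside for LVM
vgcreate vg /dev/sda2
lvcreate -L 10G -n ceph-journal vg      # journal LV
lvcreate -l 100%FREE -n ceph vg         # remaining space for the OSD data filesystem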

Actions #49

Updated by Samuel Just almost 10 years ago

Interestingly, 3.14.4 appears to have some changes to fs/aio.c.

Actions #50

Updated by Markus Blank-Burian almost 10 years ago

I am running 3.14.7 at the moment, since there were some bugs with the CephFS kernel client which forced me to upgrade.
The last change, "aio: fix potential leak in aio_run_iocb()", looks good IMHO; I cannot say anything about the older changes.

Actions #51

Updated by Samuel Just almost 10 years ago

Are you seeing different results now?

Actions #52

Updated by Markus Blank-Burian almost 10 years ago

The cluster is in really bad shape, which happened basically before I switched the config options. We are now trying to back up all data, and then I will do tests again with a clean system. I am afraid this might take a few days.

health HEALTH_ERR 27 pgs degraded; 2 pgs down; 63 pgs inconsistent; 5 pgs peering; 1 pgs recovering; 61 pgs stale; 5 pgs stuck inactive; 61 pgs stuck stale; 33 pgs stuck unclean; 1 requests are blocked > 32 sec; recovery 3/8522468 objects degraded (0.000%); 1/2835474 unfound (0.000%); 31170 scrub Errors

Actions #53

Updated by Markus Blank-Burian almost 10 years ago

Today we had further network problems, and the inconsistent pg count is still increasing:

health HEALTH_ERR 312 pgs degraded; 2 pgs down; 3 pgs incomplete; 64 pgs inconsistent; 5 pgs peering; 1 pgs recovering; 61 pgs stale; 8 pgs stuck inactive; 61 pgs stuck stale; 321 pgs stuck unclean; 3 requests are blocked > 32 sec; recovery 187536/8520135 objects degraded (2.201%); 1/2835164 unfound (0.000%); 32010 scrub errors; 2/70 in osds are down; noout flag(s) set

Actions #54

Updated by Samuel Just almost 10 years ago

Yeah, reproducing on a clean cluster would probably be a good next step.

Actions #55

Updated by Markus Blank-Burian almost 10 years ago

I have recreated everything from scratch and restored our backups. Initial crash stress tests ran fine ("killall -9 ceph-osd" on a random 20% of all OSDs every 10 minutes, or a restart if they are not running), so rebalancing seems to work correctly. I now have all the nodes included (XFS as well as btrfs with snapshots disabled), aio enabled and fiemap disabled. The cluster has also survived one or two network congestions without incomplete PGs.
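
(A sketch of what such a crash-stress loop might look like when run on each node; the 20% probability, the pgrep test and the init-script invocation are assumptions, not the exact script used:)

while true; do
    if pgrep -x ceph-osd >/dev/null; then
        # roughly 20% of the time, hard-kill the local OSD daemons
        [ $((RANDOM % 5)) -eq 0 ] && killall -9 ceph-osd
    else
        /etc/init.d/ceph start osd      # restart if not running
    fi
    sleep 600
done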

Actions #56

Updated by Samuel Just almost 10 years ago

Anything new? Would you consider reenabling fiemap to see whether that makes the issues come back?

Actions #57

Updated by Markus Blank-Burian almost 10 years ago

Everything was still running fine today, but I guess we have not had much load over the last few days. I am running my stress test again with fiemap enabled. I am not really sure it can reproduce the bug, but we will see.

Actions #58

Updated by Markus Blank-Burian almost 10 years ago

Still everything running OK, as indicated by a deep scrub of all PGs. Very strange.

I just read that setting the tunables to optimal could cause inconsistent PGs. I actually set the tunables to optimal some time after upgrading to 0.80, to get better data distribution. The upgrade took place two weeks before the problems started, and after setting optimal tunables the cluster definitely returned to HEALTH_OK. Under what exact circumstances did this problem occur? I updated all nodes at the same time, following the exact order in the release notes.
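
(For reference, the tunables change mentioned above is normally applied with a single command; whether it was issued exactly this way here is an assumption:)

ceph osd crush tunables optimal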

Actions #59

Updated by Samuel Just almost 10 years ago

  • Status changed from New to Need More Info
Actions #60

Updated by Markus Blank-Burian over 9 years ago

Still no inconsistencies; everything is running fine now. But there is now someone else who might have similar problems (see the original bug #8229).

Actions #61

Updated by Samuel Just over 9 years ago

  • Status changed from Need More Info to Can't reproduce

Let us know if anything interesting comes up.
