Bug #7914

osd: SEGV on ec write, ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096

Added by Sage Weil almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Loïc Dachary
Category:
OSD
Target version:
-
% Done:

100%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154379

 -1565> 2014-03-30 07:50:43.679454 7f4741ecb700 10 osd.0 pg_epoch: 9 pg[3.fs0( empty local-les=9 n=1 ec=8 les/c 9/9 8/8/8) [0,5,4] r=0 lpr=8 crt=0'0 mlcod 0'0 active+clean] submit_transaction: op Op(1ef9bbdf/benchmark_data_plana15_2897_object3/head//3 v=9'1 tt=0'0 tid=2 reqid=client.4123.0:4 client_op=osd_op(client
.4123.0:4 benchmark_data_plana15_2897_object3 [write 0~4194304] 3.1ef9bbdf ack+ondisk+write e8) pending_commit= pending_apply=) starting
 -1564> 2014-03-30 07:50:43.679492 7f4741ecb700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1563> 2014-03-30 07:50:43.679492 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1562> 2014-03-30 07:50:43.679530 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1561> 2014-03-30 07:50:43.679538 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1560> 2014-03-30 07:50:43.679544 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1559> 2014-03-30 07:50:43.679550 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1558> 2014-03-30 07:50:43.679556 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1557> 2014-03-30 07:50:43.679561 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1556> 2014-03-30 07:50:43.679566 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1555> 2014-03-30 07:50:43.679572 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1554> 2014-03-30 07:50:43.679577 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1553> 2014-03-30 07:50:43.679582 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1552> 2014-03-30 07:50:43.679588 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1551> 2014-03-30 07:50:43.679593 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
 -1550> 2014-03-30 07:50:43.679598 7f47416ca700 10 ErasureCodeJerasure: encode adjusted buffer length from 4096 to 4096
...
  -239> 2014-03-30 07:50:43.697849 7f4741ecb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f4741ecb700

 ceph version 0.78-459-ge672c52 (e672c52b4f8b945a516f2eec006e33665a08f045)
 1: ceph-osd() [0x9819fa]
 2: (()+0xfcb0) [0x7f4756779cb0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

local.txt - teuthology log file (successful) (1.21 MB) Loïc Dachary, 04/01/2014 06:26 AM

run2.txt - successful run (3.32 MB) Loïc Dachary, 04/01/2014 08:34 AM

ceph-osd.0.log.gz - Illegal Instruction (321 KB) Loïc Dachary, 04/06/2014 02:39 AM


Related issues

Related to Ceph - Feature #7928: erasure-code : no SSE3 specific code Rejected 04/01/2014
Related to Ceph - Feature #8071: erasure-code: simplify jerasure initialization Resolved 04/11/2014

Associated revisions

Revision 2ba00bd2 (diff)
Added by Loïc Dachary almost 10 years ago

erasure-code: thread-safe initialization of gf-complete

Instead of relying on an implicit initialization happening during
encoding/decoding with galois.c:galois_init_default_field, call
gf.c:gf_init_easy for each w value when the plugin is loaded.

Loading the plugin is protected against race conditions by a lock.

It does not cover all possible uses of gf-complete but it is enough for
the ceph jerasure plugin.

http://tracker.ceph.com/issues/7914 fixes #7914

Signed-off-by: Loic Dachary <>
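
As a rough illustration of the approach this commit message describes (a minimal sketch, not the actual Ceph code): eagerly initialize one gf_t per w value under a lock when the plugin is loaded, so that no lazy initialization can race later during encode/decode. It assumes gf-complete's gf_init_easy(gf_t *, int w), which returns 0 on failure; the lock, the table, and the entry point name are illustrative.

#include <pthread.h>
#include "gf_complete.h"

static pthread_mutex_t gf_init_lock = PTHREAD_MUTEX_INITIALIZER;
static gf_t gf_fields[33];          /* indexed by w; only a few entries used */
static int gf_ready = 0;

/* Hypothetical plugin entry point: call once at load time, before
 * any encoding or decoding happens. */
int plugin_load_init(void)
{
    static const int ws[] = { 4, 8, 16, 32 };
    unsigned i;

    pthread_mutex_lock(&gf_init_lock);
    if (!gf_ready) {
        gf_ready = 1;
        for (i = 0; i < sizeof(ws) / sizeof(ws[0]); i++) {
            if (!gf_init_easy(&gf_fields[ws[i]], ws[i])) {
                gf_ready = 0;       /* initialization failed for this w */
                break;
            }
        }
    }
    pthread_mutex_unlock(&gf_init_lock);
    return gf_ready ? 0 : -1;
}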

History

#1 Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154491

#2 Updated by Loïc Dachary almost 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Loïc Dachary

#3 Updated by Loïc Dachary almost 10 years ago

archive_path: /var/lib/teuthworker/archive/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154491
description: rados/thrash/{clusters/fixed-2.yaml fs/ext4.yaml msgr-failures/few.yaml
  thrashers/pggrow.yaml workloads/ec-radosbench.yaml}
email: null
job_id: '154491'
kernel: &id001
  kdb: true
  sha1: adc8b2d4c270d7d1c04695ae856448dd11362618
last_in_suite: false
machine_type: plana
name: teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: e672c52b4f8b945a516f2eec006e33665a08f045
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: e672c52b4f8b945a516f2eec006e33665a08f045
  s3tests:
    branch: master
  workunit:
    sha1: e672c52b4f8b945a516f2eec006e33665a08f045
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    ec_pool: true
    time: 1800
    unique_pool: true
teuthology_branch: master
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.16939

#4 Updated by Loïc Dachary almost 10 years ago

The latest jerasure / gf-complete submodule changes include: bestrow = 0 for jerasure, and gf-complete changes one and two, which are in gf_w128.c, which is not used by Ceph (gf_w32.c is).

#5 Updated by Loïc Dachary almost 10 years ago

-msse3 was not used although it should have been; this was fixed on March 30th, i.e. after the failed radosbench. SSE3 intrinsics were used (selected via #if at compile time) but -msse3 was not passed to the compiler.
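
To illustrate the failure mode with a toy example (a sketch, not the actual gf-complete source): intrinsics guarded by a compile-time #if are silently skipped when the corresponding -m flag is missing, because the compiler defines the feature test macro only when the flag is passed.

#include <stdio.h>

#ifdef __SSE3__                /* defined by gcc/clang only with -msse3 */
#define IMPL "SSE3 intrinsics path"
#else
#define IMPL "plain C fallback"
#endif

int main(void)
{
    /* gcc -O2 demo.c         -> "plain C fallback"
     * gcc -O2 -msse3 demo.c  -> "SSE3 intrinsics path" */
    printf("compiled with: %s\n", IMPL);
    return 0;
}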

#6 Updated by Loïc Dachary almost 10 years ago

cpuinfo on plana17 (the machine on which the core dump occurred) shows:

flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dtherm tpr_shadow vnmi flexpriority ept vpid

#7 Updated by Loïc Dachary almost 10 years ago

Trying to reproduce the problem from the source tree on the current master with

rm -fr /tmp/dev /tmp/out ; mkdir -p /tmp/dev ; CEPH_DIR=/tmp LC_ALL=C MON=1 OSD=6 bash -x ./vstart.sh -d -n -X -l mon osd
./ceph osd erasure-code-profile set myprofile \
    ruleset-failure-domain=osd || return 1
./ceph osd erasure-code-profile get myprofile
./ceph osd pool create ecpool 12 12 erasure myprofile \
    || return 1
rados --name client.0 -p ecpool bench 360 write

#8 Updated by Loïc Dachary almost 10 years ago

The /a/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154491 radosbench completed successfully:

2014-03-30T09:59:12.768 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: 2014-03-30 09:59:12.767398min lat: 0.066051 max lat: 24.1214 avg lat: 1.35237
2014-03-30T09:59:12.768 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
2014-03-30T09:59:12.768 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:   1800      16     21280     21264   47.2462         0         -   1.35237
2014-03-30T09:59:13.768 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:   1801      16     21280     21264     47.22         0         -   1.35237
2014-03-30T09:59:14.768 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:   1802      16     21280     21264   47.1938         0         -   1.35237
2014-03-30T09:59:15.769 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:  Total time run:         1803.084996
2014-03-30T09:59:15.769 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Total writes made:      21281
2014-03-30T09:59:15.769 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Write size:             4194304
2014-03-30T09:59:15.769 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Bandwidth (MB/sec):     47.210
2014-03-30T09:59:15.769 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]:
2014-03-30T09:59:15.770 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Stddev Bandwidth:       45.9924
2014-03-30T09:59:15.770 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Max bandwidth (MB/sec): 148
2014-03-30T09:59:15.770 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Min bandwidth (MB/sec): 0
2014-03-30T09:59:15.770 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Average Latency:        1.3556
2014-03-30T09:59:15.771 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Stddev Latency:         1.82972
2014-03-30T09:59:15.771 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Max latency:            24.1214
2014-03-30T09:59:15.771 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.131.23]: Min latency:            0.066051

#9 Updated by Loïc Dachary almost 10 years ago

Both errors show up with "thrashers/pggrow.yaml" (one with btrfs, the other with ext4).

#10 Updated by Loïc Dachary almost 10 years ago

/a/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154491

2014-03-30T09:29:44.587 INFO:teuthology.task.thrashosds.thrasher:Growing pool unique_pool_0

#11 Updated by Loïc Dachary almost 10 years ago

Trying to trigger the problem again with:

machine_type: plana
os_type: ubuntu
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: e672c52b4f8b945a516f2eec006e33665a08f045
  ceph-deploy:
    branch:
      dev: master
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: e672c52b4f8b945a516f2eec006e33665a08f045
  s3tests:
    branch: master
  workunit:
    sha1: e672c52b4f8b945a516f2eec006e33665a08f045
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    ec_pool: true
    time: 1800
    unique_pool: true
teuthology_branch: master
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.plana.16939

#12 Updated by Loïc Dachary almost 10 years ago

 ceph version 0.78-459-ge672c52 (e672c52b4f8b945a516f2eec006e33665a08f045)
 1: ceph-osd() [0x9819fa]
 2: (()+0xfcb0) [0x7f4756779cb0]

Maybe this is a problem in the static destructors? Since radosbench completed successfully, it may have happened during the teardown phase. On March 27th the ability to load multiple plugins was added, and there are now two erasure code plugins loaded, because the primary plugin selects another at run time and loads it.

#13 Updated by Loïc Dachary almost 10 years ago

  • File local.txt View added
  • Status changed from In Progress to Can't reproduce
  • % Done changed from 0 to 10

The above teuthology workload ran on two plana machines and did not create a core dump.

#14 Updated by Loïc Dachary almost 10 years ago

I don't see how not locking the erasure-code plugin registry could be a problem in practice, but it is a problem in theory; better to fix it.

#15 Updated by Loïc Dachary almost 10 years ago

Another successful run; complete logs in run2.txt.

#16 Updated by Loïc Dachary almost 10 years ago

  • Status changed from Can't reproduce to In Progress

I will run the workload many times, hoping to reproduce the crash.

#17 Updated by Sage Weil almost 10 years ago

Ran through 20 iterations of the above yaml and did not reproduce.

#18 Updated by Loïc Dachary almost 10 years ago

Contrary to what I thought (I was reading the wrong number), the plana machines do have SSE3, and the jerasure plugin including the SSE4 optimizations is being run, as shown in the logs:

/var/log/ceph/ceph-osd.3.log:2014-04-01 14:45:27.582145 7fbe5b687700 10 ErasureCodePluginSelectJerasure: SSE4 plugin

#19 Updated by Loïc Dachary almost 10 years ago

Running 30 more, sequentially.

#20 Updated by Loïc Dachary almost 10 years ago

The only difference is that ext4 is used in teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154491 and btrfs is used in teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154379

#21 Updated by Sage Weil almost 10 years ago

Also happened on ubuntu@teuthology:/a/teuthology-2014-03-30_02:30:11-rados-master-testing-basic-plana/154379 on commit e672c52b4f8b945a516f2eec006e33665a08f045.

No osd crashes from teuthology-2014-03-28_02:30:01-rados-master-testing-basic-plana (which was complete, afaics), commit 7f4be9e9d040af443ca4cbe2f3fd2a44d01f71bf.

2 osd crashes on the 27th, commit d9a2dea755a62e4f9fe0795410f37b68a15ae054, but neither is this one.

#22 Updated by Loïc Dachary almost 10 years ago

/a/teuthology-2014-03-31_02:30:03-rados-master-testing-basic-plana/155875/teuthology.log is the only ec_pool test that failed and has no core

2014-03-31T17:51:23.798 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]:  Total time run:         1803.051784
2014-03-31T17:51:23.798 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Total writes made:      19399
2014-03-31T17:51:23.798 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Write size:             4194304
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Bandwidth (MB/sec):     43.036
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]:
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Stddev Bandwidth:       31.3604
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Max bandwidth (MB/sec): 156
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Min bandwidth (MB/sec): 0
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Average Latency:        1.48695
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Stddev Latency:         1.78114
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Max latency:            58.8693
2014-03-31T17:51:23.799 INFO:teuthology.task.radosbench.radosbench.0.out:[10.214.133.33]: Min latency:            0.062374

#24 Updated by Loïc Dachary almost 10 years ago

#0  0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x0000000000000000 in ?? ()
#5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, 
    size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
#6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048)
    at erasure-code/jerasure/jerasure/src/jerasure.c:310
#7  0x00007f47385ce884 in ErasureCodeJerasureReedSolomonVandermonde::jerasure_encode (this=<optimized out>, data=<optimized out>, coding=<optimized out>, 
    blocksize=<optimized out>) at erasure-code/jerasure/ErasureCodeJerasure.cc:230
#8  0x00007f47385d3806 in ErasureCodeJerasure::encode (this=0x34f4140, want_to_encode=..., in=..., encoded=0x7f4741ec7e50) at erasure-code/jerasure/ErasureCodeJerasure.cc:129
#9  0x000000000093d0f3 in ECUtil::encode (sinfo=..., ec_impl=..., in=..., want=..., out=0x7f4741ec80f0) at osd/ECUtil.cc:120
#10 0x000000000093b0c1 in TransGenerator::operator() (this=0x7f4741ec81c0, op=...) at osd/ECTransaction.cc:138
#11 0x0000000000939f6f in internal_visit<ECTransaction::AppendOp const> (operand=..., this=<synthetic pointer>) at /usr/include/boost/variant/variant.hpp:832
#12 visitation_impl_invoke_impl<boost::detail::variant::invoke_visitor<TransGenerator>, void const*, ECTransaction::AppendOp> (storage=<optimized out>, 
    visitor=<synthetic pointer>) at /usr/include/boost/variant/detail/visitation_impl.hpp:130
#13 visitation_impl_invoke<boost::detail::variant::invoke_visitor<TransGenerator>, void const*, ECTransaction::AppendOp, boost::variant<ECTransaction::AppendOp, ECTransaction::CloneOp, ECTransaction::RenameOp, ECTransaction::StashOp, ECTransaction::TouchOp, ECTransaction::RemoveOp, ECTransaction::SetAttrsOp, ECTransaction::RmAttrOp, ECTransaction::AllocHintOp, ECTransaction::NoOp>::has_fallback_type_> (storage=<optimized out>, visitor=<synthetic pointer>, internal_which=<optimized out>, t=<optimized out>)
    at /usr/include/boost/variant/detail/visitation_impl.hpp:173
#14 visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<10l>, ECTransaction::AppendOp, boost::mpl::l_item<mpl_::long_<9l>, ECTransaction::CloneOp, boost::mpl::l_item<mpl_::long_<8l>, ECTransaction::RenameOp, boost::mpl::l_item<mpl_::long_<7l>, ECTransaction::StashOp, boost::mpl::l_item<mpl_::long_<6l>, ECTransaction::TouchOp, boost::mpl::l_item<mpl_::long_<5l>, ECTransaction::RemoveOp, boost::mpl::l_item<mpl_::long_<4l>, ECTransaction::SetAttrsOp, boost::mpl::l_item<mpl_::long_<3l>, ECTransaction::RmAttrOp, boost::mpl::l_item<mpl_::long_<2l>, ECTransaction::AllocHintOp, boost::mpl::l_item<mpl_::long_<1l>, ECTransaction::NoOp, boost::mpl::l_end> > > > > > > > > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<TransGenerator>, void const*, boost::variant<ECTransaction::AppendOp, ECTransaction::CloneOp, ECTransaction::RenameOp, ECTransaction::StashOp, ECTransaction::TouchOp, ECTransaction::RemoveOp, ECTransaction::SetAttrsOp, ECTransaction::RmAttrOp, ECTransaction::AllocHintOp, ECTransaction::NoOp>::has_fallback_type_> (no_backup_flag=..., storage=<optimized out>, visitor=<synthetic pointer>, 
    logical_which=<optimized out>, internal_which=<optimized out>) at /usr/include/boost/variant/detail/visitation_impl.hpp:260
#15 internal_apply_visitor_impl<boost::detail::variant::invoke_visitor<TransGenerator>, void const*> (storage=<optimized out>, logical_which=<optimized out>, 
    visitor=<synthetic pointer>, internal_which=<optimized out>) at /usr/include/boost/variant/variant.hpp:1776
#16 internal_apply_visitor<boost::detail::variant::invoke_visitor<TransGenerator> > (visitor=<synthetic pointer>, this=0x30cc830)
    at /usr/include/boost/variant/variant.hpp:1798
#17 apply_visitor<TransGenerator> (this=0x30cc830, visitor=...) at /usr/include/boost/variant/variant.hpp:1820
#18 apply_visitor<TransGenerator, boost::variant<ECTransaction::AppendOp, ECTransaction::CloneOp, ECTransaction::RenameOp, ECTransaction::StashOp, ECTransaction::TouchOp, ECTransaction::RemoveOp, ECTransaction::SetAttrsOp, ECTransaction::RmAttrOp, ECTransaction::AllocHintOp, ECTransaction::NoOp> const> (visitable=..., visitor=...)
    at /usr/include/boost/variant/detail/apply_visitor_unary.hpp:60
#19 visit<TransGenerator> (vis=..., this=0x34198c0) at osd/ECTransaction.h:183
#20 ECTransaction::generate_transactions (this=0x34198c0, hash_infos=..., ecimpl=..., pgid=..., sinfo=..., transactions=0x7f4741ec8c10, temp_added=0x30c62e8, 
    temp_removed=0x30c6318, out=0x0) at osd/ECTransaction.cc:282
#21 0x000000000091588f in ECBackend::start_write (this=0x3422b40, op=0x30c6228) at osd/ECBackend.cc:1494
#22 0x00000000009238cb in ECBackend::submit_transaction (this=0x3422b40, hoid=..., at_version=..., _t=<optimized out>, trim_to=..., log_entries=..., 
    on_local_applied_sync=0x33b4300, on_all_applied=0x33cb860, on_all_commit=0x3132ee0, tid=2, reqid=..., client_op=...) at osd/ECBackend.cc:1260
#23 0x00000000007d48e7 in ReplicatedPG::issue_repop (this=0x35e0000, repop=0x3480300, now=...) at osd/ReplicatedPG.cc:6458
#24 0x000000000080f903 in ReplicatedPG::execute_ctx (this=0x35e0000, ctx=0x32bd000) at osd/ReplicatedPG.cc:1807
#25 0x000000000081adb2 in ReplicatedPG::do_op (this=0x35e0000, op=...) at osd/ReplicatedPG.cc:1464
#26 0x00000000007bbe22 in ReplicatedPG::do_request (this=0x35e0000, op=..., handle=...) at osd/ReplicatedPG.cc:1097
#27 0x0000000000619dfa in OSD::dequeue_op (this=0x3051000, pg=..., op=..., handle=...) at osd/OSD.cc:7633
#28 0x0000000000634e58 in OSD::OpWQ::_process (this=0x3051e28, pg=..., handle=...) at osd/OSD.cc:7603
#29 0x000000000067a48c in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (
    this=0x3051e28, handle=...) at ./common/WorkQueue.h:190
#30 0x0000000000a53476 in ThreadPool::worker (this=0x3051470, wt=0x305bac0) at common/WorkQueue.cc:125
#31 0x0000000000a55280 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:317
#32 0x00007f4756771e9a in start_thread (arg=0x7f4741ecb700) at pthread_create.c:308
#33 0x00007f4754d323fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#34 0x0000000000000000 in ?? ()

#25 Updated by Sage Weil almost 10 years ago

mira087 has the first 2 failures, mira089 has the earlier one. They all seem to crash in the same place:

#4  0x0000000000000000 in ?? ()
#5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048)
    at erasure-code/jerasure/jerasure/src/jerasure.c:607
#6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:310

#26 Updated by Loïc Dachary almost 10 years ago

(gdb) frame 5
(gdb) p *(gf_internal_t*)gfp_array[32].scratch
$2 = {
  mult_type = 0x0, 
  region_type = 0x0, 
  divide_type = 0x0, 
  w = 0x20, 
  prim_poly = 0x400007, 
  free_me = 0x1, 
  arg1 = 0x0, 
  arg2 = 0x0, 
  base_gf = 0x0, 
  private = 0x33b92b8
}

#27 Updated by Loïc Dachary almost 10 years ago

Compiled TestErasureCodeJerasure.cc with m=1 (https://github.com/ceph/ceph/blob/master/src/test/erasure-code/TestErasureCodeJerasure.cc#L48) and all optimizations, then set a breakpoint where the code path is supposed to lead, to verify that it does. This is the same configuration as in the core. The ${INTEL_FLAGS} have to be added to https://github.com/ceph/ceph/blob/master/src/test/erasure-code/Makefile.am#L89 and https://github.com/ceph/ceph/blob/master/src/test/erasure-code/Makefile.am#L92 to activate the optimizations.

(gdb) bt
#0  gf_multby_one (src=0x1ce50a0, dest=0x1ce5140, bytes=160, xor=1) at erasure-code/jerasure/gf-complete/src/gf.c:921
#1  0x0000000000877c12 in gf_w32_split_4_32_lazy_sse_multiply_region (gf=0x1cbc370, src=0x1ce50a0, dest=0x1ce5140, val=1, bytes=160, xor=1) at erasure-code/jerasure/gf-complete/src/gf_w32.c:1895
#2  0x000000000084fb38 in galois_w32_region_xor (src=0x1ce50a0, dest=0x1ce5140, nbytes=160) at erasure-code/jerasure/jerasure/src/galois.c:332
#3  0x000000000084fb69 in galois_region_xor (src=0x1ce50a0 "klmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789", dest=0x1ce5140 "*..**66::>>::66*acagacaoac", ' ' <repeats 26 times>, "QSQWQSQ_QS*.CDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghij", nbytes=160) at erasure-code/jerasure/jerasure/src/galois.c:338
#4  0x00000000008518cc in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x1caa148, src_ids=0x0, dest_id=2, data_ptrs=0x7fffffffdb40, coding_ptrs=0x7fffffffdb50, size=160) at erasure-code/jerasure/jerasure/src/jerasure.c:607
#5  0x0000000000850767 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=0x1caa148, data_ptrs=0x7fffffffdb40, coding_ptrs=0x7fffffffdb50, size=160) at erasure-code/jerasure/jerasure/src/jerasure.c:310
#6  0x00000000008b793b in ErasureCodeJerasureReedSolomonVandermonde::jerasure_encode (this=0x7fffffffe080, data=0x7fffffffdb40, coding=0x7fffffffdb50, blocksize=160) at erasure-code/jerasure/ErasureCodeJerasure.cc:230
#7  0x00000000008b6aad in ErasureCodeJerasure::encode (this=0x7fffffffe080, want_to_encode=..., in=..., encoded=0x7fffffffdfc0) at erasure-code/jerasure/ErasureCodeJerasure.cc:129
#8  0x000000000084bf1a in ErasureCodeTest_encode_decode_Test<ErasureCodeJerasureReedSolomonVandermonde>::TestBody (this=0x1cae0d0) at test/erasure-code/TestErasureCodeJerasure.cc:62
#9  0x00000000008bfec6 in testing::Test::Run (this=0x1cae0d0) at ./src/gtest.cc:2095
#10 0x00000000008c0432 in testing::internal::TestInfoImpl::Run (this=0x1cc6000) at ./src/gtest.cc:2314
#11 0x00000000008c0945 in testing::TestCase::Run (this=0x1cce000) at ./src/gtest.cc:2420
#12 0x00000000008c4ee6 in testing::internal::UnitTestImpl::RunAllTests (this=0x1cca000) at ./src/gtest.cc:4024
#13 0x00000000008c3e32 in testing::UnitTest::Run (this=0xd02c20 <testing::UnitTest::GetInstance()::instance>) at ./src/gtest.cc:3687
#14 0x000000000083fd90 in main (argc=1, argv=0x7fffffffe3a8) at test/erasure-code/TestErasureCodeJerasure.cc:344

#28 Updated by Loïc Dachary almost 10 years ago

To find the failed ones:

grep -l ec_pool /a/teuthology-2014-04-02_*rados-master*/*/config.yaml | xargs grep -l radosbench | sed -e 's/config.yaml/teuthology.log/' > /tmp/files
tail -n1 $(cat /tmp/files)

/a/teuthology-2014-04-02_02:30:02-rados-master-testing-basic-plana has no core

#29 Updated by Sage Weil almost 10 years ago

This didn't trigger at all in yesterday's run. :/

#30 Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/gregf-2014-04-04_22:05:49-rados-wip-7994-testing-basic-plana/170763

#31 Updated by Loïc Dachary almost 10 years ago

/var/lib/teuthworker/archive/gregf-2014-04-04_22:05:/log/ceph-osd.0.log.gz shows:

 *** Caught signal (Illegal instruction) **
 in thread 7fc8de7a1700

 ceph version 0.78-568-g232ac1a (232ac1a52a322d163d8d8dbc4a7da4b6a9acb709)
 1: ceph-osd() [0x982c2a]
 2: (()+0x364a0) [0x7fc8f1d4b4a0]
 3: [0x1901a20]

#32 Updated by Loïc Dachary almost 10 years ago

The SSE4 plugin was selected and ECX = 029ee3ff (this is consistent):

ubuntu@plana08:~$ cpuid
 eax in    eax      ebx      ecx      edx
00000000 0000000b 756e6547 6c65746e 49656e69
00000001 000206c2 21200800 029ee3ff bfebfbff
00000002 55035a01 00f0b2ff 00000000 00ca0000
00000003 00000000 00000000 00000000 00000000
00000004 3c004122 00c0003f 0000007f 00000000
00000005 00000040 00000040 00000003 00001120
00000006 00000007 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07300403 00000004 00000000 00000603
0000000b 00000005 00000008 00000201 00000021
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 2c100800
80000002 65746e49 2952286c 6f655820 2952286e
80000003 55504320 20202020 20202020 45202020
80000004 30323635 20402020 30342e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 01006040 00000000
80000007 00000000 00000000 00000000 00000100
80000008 00003028 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 - 
Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz" 
CLFLUSH instruction cache line size: 8
Initial APIC ID: 33
Hyper threading siblings: 32

Feature flags bfebfbff:
FPU    Floating Point Unit
VME    Virtual 8086 Mode Enhancements
DE     Debugging Extensions
PSE    Page Size Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MTRR   Memory Type Range Registers
PGE    PTE Global Flag
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH  CFLUSH instruction
DS     Debug store
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading
TM     Thermal monitor
31     reserved

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2-0000-0000-0000-0000
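
As a cross-check of the dump above: in CPUID leaf 1, ECX = 029ee3ff has bit 0 (SSE3), bit 9 (SSSE3), bit 19 (SSE4.1) and bit 20 (SSE4.2) set, so selecting the SSE4 plugin is consistent with this CPU. A small sketch using GCC's cpuid.h to perform the same check:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    printf("ECX    = %08x\n", ecx);
    printf("SSE3   = %u\n", (ecx >> 0) & 1);
    printf("SSSE3  = %u\n", (ecx >> 9) & 1);
    printf("SSE4.1 = %u\n", (ecx >> 19) & 1);
    printf("SSE4.2 = %u\n", (ecx >> 20) & 1);
    return 0;
}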

#33 Updated by Loïc Dachary almost 10 years ago

The plana08 machine has been recycled and there is no core archived. See the ceph-osd.0.log.gz attachment for the full logs. The commit ( 232ac1a52a322d163d8d8dbc4a7da4b6a9acb709 ) on which it runs is roughly up to date with firefly (modulo 24h).

#34 Updated by Loïc Dachary almost 10 years ago

The teuthology config.yaml

description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/default.yaml workloads/ec-radosbench.yaml}
last_in_suite: false
nuke-on-error: true
os_type: ubuntu
overrides:
  admin_socket:
    branch: wip-7994
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 232ac1a52a322d163d8d8dbc4a7da4b6a9acb709
  ceph-deploy:
    branch:
      dev: wip-7994
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 232ac1a52a322d163d8d8dbc4a7da4b6a9acb709
  s3tests:
    branch: master
  workunit:
    sha1: 232ac1a52a322d163d8d8dbc4a7da4b6a9acb709
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    conf:
      osd:
        osd debug reject backfill probability: 0.3
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    ec_pool: true
    time: 1800
    unique_pool: true
teuthology_branch: master
verbose: true

#35 Updated by Loïc Dachary almost 10 years ago

The workload is slightly different from the previous one:

--- c.yaml    2014-04-06 12:02:28.606673232 +0200
+++ config.yaml    2014-04-06 12:02:29.082677334 +0200
@@ -17,13 +17,15 @@
         debug journal: 20
         debug ms: 1
         debug osd: 20
-    fs: ext4
+        osd op thread timeout: 60
+        osd sloppy crc: true
+    fs: btrfs
     log-whitelist:
     - slow request
@@ -57,11 +59,14 @@
 - clock.check: null
 - install: null
 - ceph:
+    conf:
+      osd:
+        osd debug reject backfill probability: 0.3
     log-whitelist:
     - wrongly marked me down
     - objects unfound and apparently lost
 - thrashosds:
-    chance_pgnum_grow: 2
+    chance_pgnum_grow: 1
     chance_pgpnum_fix: 1
     timeout: 1200
 - radosbench:

Running 30 of them, hoping this has a higher probability of triggering the problem.

#36 Updated by Loïc Dachary almost 10 years ago

Ran 27 times without triggering the problem. Will keep running it 100 times more.

#37 Updated by Sage Weil almost 10 years ago

     0> 2014-04-07 01:31:19.678469 7ff4d4cc5700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ff4d4cc5700

 ceph version 0.78-600-g19f50b9 (19f50b9d7bbbb2cce3b599f3ed8a9fa32c3d4e53)
 1: ceph-osd() [0x984aea]
 2: (()+0xfcb0) [0x7ff4e9a7bcb0]
 3: (galois_w32_region_xor()+0x4a) [0x7ff4db4fc2ea]
 4: (jerasure_matrix_dotprod()+0x1d1) [0x7ff4db4fd6b1]
 5: (jerasure_matrix_decode()+0x237) [0x7ff4db4fda57]
 6: (ErasureCodeJerasureReedSolomonVandermonde::jerasure_decode(int*, char**, char**, int)+0x36) [0x7ff4db51d826]
 7: (ErasureCodeJerasure::decode(std::set<int, std::less<int>, std::allocator<int> > const&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > > const&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list>
 > >*)+0x1369) [0x7ff4db521d69]
 8: (ECUtil::decode(ECUtil::stripe_info_t const&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >&, std::map<int, ceph::buffer::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::li
st*> > >&)+0x534) [0x940de4]
 9: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::nu
ll_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x78e) [0x91fb1e]
 10: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x86) [0x933e86]
 11: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x926ff9]
 12: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x5b) [0x916a2b]
 13: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0xcd5) [0x920f15]
 14: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x506) [0x924c16]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ee) [0x7bc5ae]
 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x34a) [0x61a1aa]
 17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x628) [0x635288]
 18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x67a88c]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa56db6]
 20: (ThreadPool::WorkThread::entry()+0x10) [0xa58bc0]
 21: (()+0x7e9a) [0x7ff4e9a73e9a]
 22: (clone()+0x6d) [0x7ff4e80343fd]

ubuntu@teuthology:/var/lib/teuthworker/archive/samuelj-2014-04-06_19:34:24-rados:thrash-wip-sam-testing-testing-basic-plana/175182

I didn't see the 'encode adjusted buffer length' this time, btw.

#38 Updated by Loïc Dachary almost 10 years ago

Interesting: it is in decode this time, not encode.

#39 Updated by Loïc Dachary almost 10 years ago

gfp_array is a global variable. If galois_w16_region_xor is called from two different threads, there is a race condition.
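
To make the race concrete, here is a simplified sketch of the lazy initialization pattern involved (illustrative only, not the actual galois.c code). Two threads can both find the entry uninitialized; in the real code the initialization writes many fields, so a concurrent reader can observe a partially built field and call through a bogus function pointer, which matches the jump to address 0 in the backtrace above.

#include <stddef.h>

typedef void (*xor_fn)(char *src, char *dst, int bytes);

static void xor_plain(char *src, char *dst, int bytes)
{
    int i;
    for (i = 0; i < bytes; i++)
        dst[i] ^= src[i];
}

static xor_fn region_xor_table[33];  /* global, shared, unlocked */

static void region_xor(int w, char *src, char *dst, int bytes)
{
    if (region_xor_table[w] == NULL) {
        /* RACE: two threads may both enter this branch at once and
         * initialize concurrently; nothing orders those writes
         * against the call below. */
        region_xor_table[w] = xor_plain;
    }
    region_xor_table[w](src, dst, bytes);
}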

#40 Updated by Loïc Dachary almost 10 years ago

Kevin Greenan writes:

Using the doc and a few special functions in jerasure, you can ensure that the underlying GF implementations are thread safe. This can be done in jerasure using galois_init_field() and galois_change_technique(). If these are explicitly used before starting any encoding/decoding (i.e. during init), there should never be a race condition.

http://web.eecs.utk.edu/~plank/plank/papers/GF-Complete-Manual-1.02.pdf

#41 Updated by Loïc Dachary almost 10 years ago

  • % Done changed from 10 to 50

#42 Updated by Loïc Dachary almost 10 years ago

  • Status changed from In Progress to Fix Under Review

#43 Updated by Loïc Dachary almost 10 years ago

Running tests against the proposed change, using the previous config.yaml. Not really hoping that it will fail. Just hoping that it won't fail ;-)

description: rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
  thrashers/default.yaml workloads/ec-radosbench.yaml}
last_in_suite: false
nuke-on-error: true
os_type: ubuntu
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 2ba00bd2e8d9919cefb069cfd9dd7bdd3117e8b1
  ceph-deploy:
    branch:
      dev: wip-7914
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 2ba00bd2e8d9919cefb069cfd9dd7bdd3117e8b1
  s3tests:
    branch: master
  workunit:
    sha1: 2ba00bd2e8d9919cefb069cfd9dd7bdd3117e8b1
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.1
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    conf:
      osd:
        osd debug reject backfill probability: 0.3
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    ec_pool: true
    time: 1800
    unique_pool: true
teuthology_branch: master
verbose: true

#44 Updated by Loïc Dachary almost 10 years ago

The above config.yaml ran 6 times without failing.

#45 Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to 7

#46 Updated by Loïc Dachary almost 10 years ago

Kevin Greenan proposes to expose a static function to reduce the amount of code required from the plugin: https://bitbucket.org/jimplank/jerasure/pull-request/17/make-galois_init_default_field-int-w
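
A sketch of what the plugin-side initialization could reduce to with that pull request, assuming galois_init_default_field(int w) returns 0 on success and an errno-style code on failure; the wrapper name and the set of w values are illustrative:

#include "galois.h"

/* Hypothetical wrapper: call once at plugin load time. */
int jerasure_plugin_preinit(void)
{
    static const int ws[] = { 4, 8, 16, 32 };
    unsigned i;

    for (i = 0; i < sizeof(ws) / sizeof(ws[0]); i++)
        if (galois_init_default_field(ws[i]) != 0)
            return -1;   /* unsupported w or out of memory */
    return 0;            /* safe to encode/decode from any thread */
}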

#47 Updated by Loïc Dachary almost 10 years ago

  • % Done changed from 50 to 90

The above workload ran 26 times without crashing. The rados suite has not yet run on the latest master.

#49 Updated by Loïc Dachary almost 10 years ago

The above workload ran 51 times without crashing.

#50 Updated by Loïc Dachary almost 10 years ago

The above workload ran 82 times without crashing. The loop stopped because a DNS resolution failure made it impossible to retrieve the Ceph packages. The loop was restarted.

#51 Updated by Loïc Dachary almost 10 years ago

All changes have been accepted upstream (jerasure / gf-complete), backported, and copied into http://github.com/ceph/{jerasure,gf-complete}.

#52 Updated by Sage Weil almost 10 years ago

We haven't seen this in several days; I'm ready to call it resolved!

#53 Updated by Loïc Dachary almost 10 years ago

  • Status changed from 7 to Resolved
  • % Done changed from 90 to 100

The workload ran 30 more times without errors. The loop was interrupted and the plana machines released.
