Feature #5000

closed

Get Teuthology to run on ARMs

Added by Anonymous almost 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%


Description

This one is a tough one to estimate.

Actions #1

Updated by Anonymous almost 11 years ago

  • Tracker changed from Tasks to Feature
Actions #2

Updated by Sage Weil almost 11 years ago

  • Target version set to v0.64
Actions #3

Updated by Ian Colle almost 11 years ago

  • Story points changed from 20.00 to 21.00
Actions #4

Updated by Tamilarasi muthamizhan almost 11 years ago

  • Status changed from New to In Progress
  • Assignee changed from Anonymous to Tamilarasi muthamizhan

Just started on this.

Actions #5

Updated by Anonymous almost 11 years ago

  • Target version changed from v0.64 to v0.65
Actions #6

Updated by Tamilarasi muthamizhan almost 11 years ago

  • Assignee changed from Tamilarasi muthamizhan to Anonymous
Actions #7

Updated by Anonymous almost 11 years ago

I am currently using http://gitbuilder.ceph.com/kernel-deb-quantal-armv7l-basic/ref/master/linux-image-3.9.0-ceph-b5b09be3-highbank_3.9.0-ceph-b5b09be3-highbank-1_armhf.deb as my kernel. The most recent rgw suite tests failed because teuthology could not find /boot/grub/grub.cfg.

Actions #8

Updated by Sage Weil almost 11 years ago

  • Target version changed from v0.65 to v0.66
Actions #9

Updated by Anonymous almost 11 years ago

Note: The current version attempts the same grub operations on ARM that the x86_64 code performs. These operations need to be skipped on ARM machines.
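
A minimal sketch of the kind of guard this would need, assuming the target's architecture string comes from `uname -m`. The function name `needs_grub_update` and the prefix list are hypothetical and are not taken from the teuthology code:

```python
def needs_grub_update(machine_arch):
    """Return True only for architectures assumed to boot through grub.

    machine_arch is the string reported by `uname -m` on the target,
    for example 'x86_64' or 'armv7l'.
    """
    # Assumption: ARM targets have no /boot/grub/grub.cfg, so every
    # grub-related step is skipped for them.
    arm_prefixes = ("arm", "aarch64")
    return not machine_arch.lower().startswith(arm_prefixes)


for arch in ("x86_64", "armv7l"):
    print(arch, needs_grub_update(arch))
```

The caller would run the grub steps only when this returns True; how ARM machines actually select their boot kernel is outside the scope of this sketch.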

Actions #10

Updated by Anonymous almost 11 years ago

I am now getting the following on an rbd kernel_untar_build test:

failure_reason: '"2013-06-19 14:41:28.266297 osd.2 10.214.143.3:6800/28925 14 :
[WRN] 1 slow requests, 1 included below; oldest blocked for > 30.744780 secs"
in cluster log', flavor: basic, mon.a-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f,
mon.b-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f, owner: wusui@aardvark,
success: false}

Actions #11

Updated by Anonymous almost 11 years ago

The yaml file is:

machine_type: tala
kernel:
  branch: testing
roles:
- [mon.a, mds.a, osd.0, osd.1]
- [mon.b, mon.c, osd.2, osd.3]
- [client.0]
tasks:
- install:
    branch: cuttlefish
- ceph:
- rbd:
    all:
- workunit:
    clients:
      all:
        - kernel_untar_build.sh

Actions #12

Updated by Anonymous almost 11 years ago

I also got the following:

failure_reason: '"2013-06-20 06:43:21.988558 osd.2 10.214.143.3:6800/456 124039
    : [WRN] 2 slow requests, 2 included below; oldest blocked for > 45483.204801 secs" 
    in cluster log', flavor: basic, mon.a-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f,
  mon.b-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f, owner: wusui@aardvark,
  success: false}

while running this yaml:
machine_type: tala
kernel:
  branch: testing
roles:
- [mon.a, mds.a, osd.0, osd.1,]
- [mon.b, mon.c, osd.2, osd.3,]
- [client.0]
tasks:
- install:
    branch: cuttlefish
- ceph:
- rbd:
    all:
- workunit:
    clients:
      all: [misc/trivial_sync.sh]

This test ran in less than 15 minutes (I have no idea how something could be blocked for 45483 seconds).
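
For scale, 45483 seconds is roughly 12.6 hours. A small sanity check along these lines could flag such warnings automatically. This is only a sketch: the helper names are hypothetical, and the 15 minute bound is the run time quoted above, not a value read from the logs:

```python
import re

# Matches the "oldest blocked for > N secs" part of an OSD slow-request warning.
BLOCKED_RE = re.compile(r"oldest blocked for > ([0-9]+(?:\.[0-9]+)?) secs")


def blocked_seconds(log_text):
    """Return the reported blocked time in seconds, or None if absent."""
    match = BLOCKED_RE.search(log_text)
    return float(match.group(1)) if match else None


def is_implausible(blocked, run_duration):
    """True when a request claims to be blocked longer than the run lasted."""
    return blocked is not None and blocked > run_duration


warning = "[WRN] 2 slow requests, 2 included below; oldest blocked for > 45483.204801 secs"
blocked = blocked_seconds(warning)
# 15 minutes is the upper bound on the run time quoted above.
print(blocked / 3600.0, is_implausible(blocked, 15 * 60))
```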

Actions #13

Updated by Anonymous almost 11 years ago

Another run:

INFO:teuthology.run:Summary data:
{client.0-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f, duration: 193.49849891662598,
  flavor: basic, mon.a-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f, mon.b-kernel-sha1: 19bb6a83cb93383b363cc5956e304213f0f1b79f,
  owner: wusui@aardvark, success: true}

INFO:teuthology.run:pass

Actions #14

Updated by Anonymous almost 11 years ago

I can get this to run by artificially adjusting the timeout values in the need_to_install() code. However, that code makes assumptions about the output of uname -r that are incorrect for both ARM and virtual machines. I think that this needs to be figured out.

Is there a better Linux command for getting the kernel version than uname -r? The information it returns does not seem correct for the teuthology kernel task.

Actions #15

Updated by Anonymous almost 11 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Anonymous to Sage Weil

Changes have been made in wip-teutharm-wusui.

Actions #16

Updated by Anonymous almost 11 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Sage Weil to Anonymous

uname -r returns 3.4.0-34-highbank or 3.4.0-1000-highbank. After I used ipmitool to power cycle one of the machines, uname -r returned 3.9.0-ceph-19bb6a83-highbank.

Actions #17

Updated by Anonymous almost 11 years ago

A reboot also fixes /proc/version and uname -r, so I think that I need to reboot and get rid of the bogus uname workaround code.
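
Once the machine reports the right kernel, the check can compare the short sha1 embedded in the uname -r string with the requested kernel sha1. The following is a sketch under that assumption. The function name `running_kernel_matches` is hypothetical, and the `-ceph-<sha1>` pattern is inferred from the release strings quoted in this thread, not from the teuthology source:

```python
import re

# Assumed format of gitbuilder kernel releases, e.g.
# '3.9.0-ceph-19bb6a83-highbank': a short sha1 follows '-ceph-'.
CEPH_RELEASE_RE = re.compile(r"-ceph-([0-9a-f]+)")


def running_kernel_matches(uname_release, wanted_sha1):
    """Return True if the running kernel was built from wanted_sha1."""
    match = CEPH_RELEASE_RE.search(uname_release)
    if match is None:
        # Stock distro kernels such as '3.4.0-34-highbank' carry no sha1.
        return False
    return wanted_sha1.startswith(match.group(1))


wanted = "19bb6a83cb93383b363cc5956e304213f0f1b79f"
print(running_kernel_matches("3.9.0-ceph-19bb6a83-highbank", wanted))
print(running_kernel_matches("3.4.0-34-highbank", wanted))
```

A False result after installing the package would then mean "reboot and check again" rather than "install again".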

Actions #18

Updated by Anonymous almost 11 years ago

The kernel tests have generated the following crash:

[  957.905812] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/include/linux/ceph/decode.h:164!
[  957.914849] Internal error: Oops - BUG: 0 [#1] SMP ARM
[  957.919978] Modules linked in: rbd libceph libcrc32c ipmi_devintf ipmi_si ipmi_msghandler nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc
[  957.932547] CPU: 1    Tainted: G        W     (3.9.0-ceph-19bb6a83-highbank #1)
[  957.939881] PC is at ceph_osdc_build_request+0x8c/0x4f8 [libceph]
[  957.945967] LR is at 0xec520904
[  957.949103] pc : [<bf13e76c>]    lr : [<ec520904>]    psr: 20000153
[  957.949103] sp : ec753df8  ip : 00000001  fp : ec53e100
[  957.960571] r10: ebef25c0  r9 : ec5fa400  r8 : ecbcc000
[  957.965788] r7 : 00000000  r6 : 00000000  r5 : ffffffff  r4 : 00000020
[  957.972307] r3 : 51cc8143  r2 : ec520900  r1 : ec753e58  r0 : ec520908
[  957.978827] Flags: nzCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment user
[  957.986039] Control: 10c5387d  Table: 2c59c04a  DAC: 00000015
[  957.991777] Process rbd (pid: 2138, stack limit = 0xec752238)
[  957.997514] Stack: (0xec753df8 to 0xec754000)
[  958.001864] 3de0:                                                       00000001 00000001
[  958.010032] 3e00: 00000001 bf139744 ecbcc000 ec55a0a0 00000024 00000000 ebef25c0 fffffffe
[  958.018204] 3e20: ffffffff 00000000 00000000 00000001 ec5fa400 ebef25c0 ec53e100 bf166b68
[  958.026377] 3e40: 00000000 0000220f fffffffe ffffffff ec753e58 bf13ff24 51cc8143 05b25ed2
[  958.034548] 3e60: 00000001 00000000 00000000 bf1688d4 00000001 00000000 00000000 00000000
[  958.042720] 3e80: 00000001 00000060 ec5fa400 ed53d200 ed439600 ed439300 00000001 00000060
[  958.050888] 3ea0: ec5fa400 ed53d200 00000000 bf16a320 00000000 ec53e100 00000040 ec753eb8
[  958.059059] 3ec0: ec51df00 ed53d7c0 ed53d200 ed53d7c0 00000000 ed53d7c0 ec5fa400 bf16ed70
[  958.067230] 3ee0: 00000000 00000060 00000002 ed53d200 00000000 bf16acf4 ed53d7c0 ec752000
[  958.075402] 3f00: ed980e50 e954f5d8 00000000 00000060 ed53d240 ed53d258 ec753f80 c04f44a8
[  958.083574] 3f20: edb7910c ec664700 01ade920 c02e4c44 00000060 c016b3dc ec51de40 01adfb84
[  958.091745] 3f40: 00000060 ec752000 ec753f80 ec752000 00000060 c0108444 00000007 ec51de48
[  958.099914] 3f60: ed0eb8c0 00000000 00000000 ec51de40 01adfb84 00000001 00000060 c0108858
[  958.108085] 3f80: 00000000 00000000 51cc8143 00000060 01adfb84 00000007 00000004 c000dd68
[  958.116257] 3fa0: 00000000 c000dbc0 00000060 01adfb84 00000007 01adfb84 00000060 01adfb80
[  958.124429] 3fc0: 00000060 01adfb84 00000007 00000004 beded1a8 00000000 01adf2f0 01ade920
[  958.132599] 3fe0: 00000000 beded180 b6811324 b6811334 800f0010 00000007 2e7f5821 2e7f5c21
[  958.140815] [<bf13e76c>] (ceph_osdc_build_request+0x8c/0x4f8 [libceph]) from [<bf166b68>] (rbd_osd_req_format_write+0x50/0x7c [rbd])
[  958.152739] [<bf166b68>] (rbd_osd_req_format_write+0x50/0x7c [rbd]) from [<bf1688d4>] (rbd_dev_header_watch_sync+0xe0/0x204 [rbd])
[  958.164486] [<bf1688d4>] (rbd_dev_header_watch_sync+0xe0/0x204 [rbd]) from [<bf16a320>] (rbd_dev_image_probe+0x23c/0x850 [rbd])
[  958.175967] [<bf16a320>] (rbd_dev_image_probe+0x23c/0x850 [rbd]) from [<bf16acf4>] (rbd_add+0x3c0/0x918 [rbd])
[  958.185975] [<bf16acf4>] (rbd_add+0x3c0/0x918 [rbd]) from [<c02e4c44>] (bus_attr_store+0x20/0x2c)
[  958.194850] [<c02e4c44>] (bus_attr_store+0x20/0x2c) from [<c016b3dc>] (sysfs_write_file+0x168/0x198)
[  958.203984] [<c016b3dc>] (sysfs_write_file+0x168/0x198) from [<c0108444>] (vfs_write+0x9c/0x170)
[  958.212768] [<c0108444>] (vfs_write+0x9c/0x170) from [<c0108858>] (sys_write+0x3c/0x70)
[  958.220768] [<c0108858>] (sys_write+0x3c/0x70) from [<c000dbc0>] (ret_fast_syscall+0x0/0x30)
[  958.229199] Code: e59d1058 e5913000 e3530000 ba000114 (e7f001f2) 
[  958.235300] ---[ end trace da227214a82491ba ]---
Actions #19

Updated by Anonymous almost 11 years ago

This problem can be reproduced by running teuthology with the following yaml file.

machine_type: tala
kernel:
  branch: testing
roles:
- [mon.a, mon.b, mon.c, mds.a, osd.0, osd.1, osd.2, osd.3,]
- [client.0]
tasks:
- install:
    branch: cuttlefish
- ceph:
- rbd:
    all:
- workunit:
    clients:
      all: [misc/trivial_sync.sh]
targets:
  ubuntu@tala002.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD5mFt7raxufuhfx3dxQDJg5mzJ4N+n94rHC/pEqCFvXSp5Fly9cZZxdmn6N5vNUerXIt7/ui2AlVii/bSNjBJrXGYwi+IK+tRPpHb1e5OaS1FdNeHHIeIofeTmUVC7wzsit7sWCcN0I+FjlVqWjXs4qsjI56MbAMC+YVAepbhOUT/j8tFFLXgMN4xFKx10G4TqGWJqsMA1+WD4DLHWI8GrqccGTdokzaotSFHH3uMJIzXfTpCLts1n6yX2iogmK2ayFyD7TmMPRI9ZQ2E5yvkMsYrAOyyPp7h3RVGRRYWR47mmdrENfjuVKQcK30tBSO3tl13BXxWNl1+rfMOk9Cqz
  ubuntu@tala004.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4XQUem3ze9TfBfsJ3pL8kPm+Y98TEJDQ76rOcdjMR4Rs8mte1Q1B93hT0CI8uRjFcv9uiKaOlweiqKXSx6N/20dsPQ2LN54FtXLB346vsxDmZH0RRzg7KfHja/AilEW3pN3nlLlYkCN/9yWuId3g1sN1L6Shylyc96OL2b++O5fZhZnzbbaHSvyngU73GY/sfRWWA6bB6suXRe/QMbHA/ge/+EvcjJ74nZynenujAchjcVmY6xzpXsXYtSSpYcdgkVh+7P1H0KkfWJwH8aRvsni7TE/6Zp8AtaROelCW1v5vMaLAUjjFtz2nVy2KSViktX3jIpwHDXoFd3eJumXxT

Actions #20

Updated by Anonymous almost 11 years ago

I also sometimes get these messages.


huh, entered softirq 3 NET_RX c03e7670 preempt_count 00000100, exited with 00000000?
[52975.023874] huh, entered softirq 4 BLOCK c0270bf8 preempt_count 00000100, exited with 00000000?
[53795.858433] huh, entered softirq 4 BLOCK c0270bf8 preempt_count 00000100, exited with 00000000?
[53819.409115] huh, entered softirq 4 BLOCK c0270bf8 preempt_count 00000100, exited with 00000000?
Actions #21

Updated by Anonymous almost 11 years ago

Another test generates the following message on the console:

BUG: scheduling while atomic: swaper/0/0/0xffff0000

I've gotten a few million of these messages.

Actions #22

Updated by Josh Durgin almost 11 years ago

Waiting for a fix to the first issue to build. What's the yaml that triggers the "BUG: scheduling while atomic: swaper/0/0/0xffff0000" message, Warren?

Actions #23

Updated by Anonymous almost 11 years ago

I think that the following causes the atomic swapper message...

machine_type: tala
kernel:
  branch: testing
roles:
- [mon.a, mon.b, mon.c, mds.a, osd.0, osd.1, osd.2, osd.3,]
- [client.0]
tasks:
- install:
    branch: cuttlefish
- ceph:
- rbd:
    all:
- workunit:
    clients:
      all:
        - kernel_untar_build.sh
targets:
  ubuntu@tala002.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD5mFt7raxufuhfx3dxQDJg5mzJ4N+n94rHC/pEqCFvXSp5Fly9cZZxdmn6N5vNUerXIt7/ui2AlVii/bSNjBJrXGYwi+IK+tRPpHb1e5OaS1FdNeHHIeIofeTmUVC7wzsit7sWCcN0I+FjlVqWjXs4qsjI56MbAMC+YVAepbhOUT/j8tFFLXgMN4xFKx10G4TqGWJqsMA1+WD4DLHWI8GrqccGTdokzaotSFHH3uMJIzXfTpCLts1n6yX2iogmK2ayFyD7TmMPRI9ZQ2E5yvkMsYrAOyyPp7h3RVGRRYWR47mmdrENfjuVKQcK30tBSO3tl13BXxWNl1+rfMOk9Cqz
  ubuntu@tala004.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4XQUem3ze9TfBfsJ3pL8kPm+Y98TEJDQ76rOcdjMR4Rs8mte1Q1B93hT0CI8uRjFcv9uiKaOlweiqKXSx6N/20dsPQ2LN54FtXLB346vsxDmZH0RRzg7KfHja/AilEW3pN3nlLlYkCN/9yWuId3g1sN1L6Shylyc96OL2b++O5fZhZnzbbaHSvyngU73GY/sfRWWA6bB6suXRe/QMbHA/ge/+EvcjJ74nZynenujAchjcVmY6xzpXsXYtSSpYcdgkVh+7P1H0KkfWJwH8aRvsni7TE/6Zp8AtaROelCW1v5vMaLAUjjFtz2nVy2KSViktX3jIpwHDXoFd3eJumXxT

Actions #24

Updated by Josh Durgin almost 11 years ago

The "scheduling while atomic" stuff that sometimes halts boot seems like it's not ceph-related. There's probably a driver problem or something wrong with our newer kernel's config since it doesn't boot reliably. Once it did boot, using wip-arm, I was able to map, do I/O, mkfs, and unmap an rbd image. We should figure out the kernel issue before trying to test more though, so we don't get false positives from it.

Actions #25

Updated by Anonymous almost 11 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Anonymous to Sage Weil

Using Josh's fixed kernel, I was, after several reboot attempts, able to run an rbd kernel test to completion. I used this kernel to repeat the test several times, and tried other rbd runs.

At this point, I believe that fixes for the ARM ceph issues (Josh's changes in wip-arm) and the teuthology issues (my changes in wip-teutharm-wusui) have been implemented in those branches. I will submit this for review and open another item for the kernel problems, with a more specific description of those bugs.

Actions #26

Updated by Sage Weil almost 11 years ago

  • Target version changed from v0.66 to v0.67rc
Actions #27

Updated by Sage Weil almost 11 years ago

  • Target version changed from v0.67rc to v0.67rc - continued
Actions #28

Updated by Anonymous almost 11 years ago

I have rebased this with the latest master version.

Actions #29

Updated by Sage Weil over 10 years ago

  • Target version changed from v0.67rc - continued to v0.68 - continued
Actions #30

Updated by Anonymous over 10 years ago

  • Status changed from Fix Under Review to Resolved