Project

General

Profile

Actions

Bug #17415

closed

RHEV Network or Storage issues

Added by David Galloway over 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Category:
Infrastructure Service
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

At about 00:27 UTC on 27SEP2016, the RHEV cluster either began having backend storage issues or overall network issues. So far all clues point to store01.front.sepia.ceph.com which is the main gluster backing node serving the RHEV VMs.

Symptoms

  • teuthology and pcp VMs were paused
  • RHEV logs show an "unknown storage error"
  • RHEV Manager VM was restarted (This VM's disk is served over NFS; not gluster)
  • Logs from different involved machines below
  • CPU Soft lockups in teuthology VM's kernel log indicating either network issues accessing LRC or its own disk (hosted on gluster)

NOTE: Times around 00:20 are UTC. Times around 20:20 are EDT.

===== hv01:/var/log/messages =====
Sep 26 20:27:00 hv01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: ERROR:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Connection closed: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Sleeping 60 seconds
Sep 26 20:27:07 hv01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Sep 26 20:27:07 hv01 journal: vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012  File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo#012    stats = instance.get_all_stats()#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012    self._configure_broker_conn(broker)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012    dom_type=dom_type)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain#012    .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out

===== store01:/var/log/messages =====
Sep 27 00:27:01 old-gw pdns[2958]: Backend reported permanent error which prevented lookup (GSQLBackend lookup query:Failed to execute mysql_query, perhaps connection died? Err=1: Lost connection to MySQL server during query), aborting

===== store01:/var/log/glusterfs/bricks/srv-gluster-shardbrick1.log =====
[2016-09-26 03:27:08.778199] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2016-09-27 00:28:08.678750] W [socket.c:696:__socket_rwv] 0-tcp.shardvol1-server: writev on 172.21.0.5:49148 failed (Broken pipe)
[2016-09-27 00:28:15.872485] I [socket.c:2468:socket_event_handler] 0-transport: disconnecting now
[2016-09-27 00:28:17.803641] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-shardvol1-server: disconnecting connection from hv01.front.sepia.ceph.com-15267-2016/08/04-00:38:10:236127-shardvol1-client-0-0-0
[2016-09-27 00:28:30.945155] I [socket.c:3498:socket_submit_reply] 0-tcp.shardvol1-server: not connected (priv->connected = -1)
... Repeated ...
[2016-09-27 00:28:44.611277] E [rpcsvc.c:1310:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x255417e4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.shardvol1-server)
[2016-09-27 00:28:44.611286] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed
[2016-09-27 00:28:44.611331] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed

===== mgr01:/var/log/ovirt-engine/engine.log =====
2016-09-26 20:35:58,182 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4:iso_storage' in problem. vds: 'hv02'
2016-09-26 20:35:58,184 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700:shardvol1' in problem. vds: 'hv02'

2016-09-26 20:36:03,468 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-42) [4a281328] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: d6b1d38b-6f1f-4438-b04f-53619241a1ef Type: StoragePool
2016-09-26 20:36:03,480 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-43) [22db002d] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: d3ec2fae-de85-41d2-8cf4-b09c3fd77ca3 Type: StoragePool
2016-09-26 20:36:03,481 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [75431892] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 71a72320-daa6-4c1c-a9ee-c809a8f0ad9f Type: StoragePool
2016-09-26 20:36:03,510 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-42) [4a281328] Storage Pool 'd6b1d38b-6f1f-4438-b04f-53619241a1ef' - Updating Storage Domain '14b9a2fc-838c-46d4-8ba6-52f8d815f19e' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,511 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-43) [22db002d] Storage Pool 'd3ec2fae-de85-41d2-8cf4-b09c3fd77ca3' - Updating Storage Domain '1ecdcdf2-4301-4bba-a17c-270a85f11ade' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,523 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-44) [75431892] Storage Pool '71a72320-daa6-4c1c-a9ee-c809a8f0ad9f' - Updating Storage Domain 'b43d2eef-5cfa-4290-9e7b-f6ebc4e9c891' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,535 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-41) [7734731c] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool
2016-09-26 20:36:03,583 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,596 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,598 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,832 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-43) [22db002d] Correlation ID: 22db002d, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi08-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,853 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [75431892] Correlation ID: 75431892, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi02-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,871 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) [4a281328] Correlation ID: 4a281328, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi07-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,874 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [7734731c] Correlation ID: 7734731c, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting status to Non Responsive.

2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain.
2016-09-26 20:36:08,735 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02'
2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error.
2016-09-26 20:36:08,746 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused'
2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException
2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-09-26 20:36:08,825 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused.
2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error.

2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with 
error. Exit message: Failed to find the libvirt domain.
2016-09-26 20:36:08,735 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02'
2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error.
2016-09-26 20:36:08,746 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused'
2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException
2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-09-26 20:36:08,825 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused.
2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error.
Actions

Also available in: Atom PDF