Node stays UNHEALTHY, CPU usage 95%

Asked by ITec

After starting ctdbd, the node stays unhealthy and CPU usage periodically rises to 95%.

Versions:
Ubuntu 10.04
ctdb 1.0.108-3ubuntu3

/etc/default/ctdb:
CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"
CTDB_DEBUGLEVEL=ERR

/etc/ctdb/nodes:
192.168.224.221

/etc/ctdb/public_addresses:
192.168.6.245/32 eth1
192.168.1.245/32 eth2

ps faxl:
5 0 11068 1 -2 - 82804 1064 ep_pol Ss ? 2:50 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11069 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11070 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11071 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 11072 11068 20 0 82804 712 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

/var/log/log.ctdb:
2010/03/26 15:53:59.134494 [11068]: Starting CTDBD as pid : 11068
2010/03/26 15:54:00.144763 [11068]: Freeze priority 1
2010/03/26 15:54:00.144898 [11068]: Freeze priority 2
2010/03/26 15:54:00.144992 [11068]: Freeze priority 3
2010/03/26 15:54:04.163466 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:04.163668 [11072]: Take the recovery lock
2010/03/26 15:54:04.164430 [11072]: Recovery lock taken successfully
2010/03/26 15:54:04.164560 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:04.164699 [11068]: Freeze priority 1
2010/03/26 15:54:04.164807 [11068]: Freeze priority 2
2010/03/26 15:54:04.164900 [11068]: Freeze priority 3
2010/03/26 15:54:24.553645 [11072]: client/ctdb_client.c:771 control timed out. reqid:46 opcode:70 dstnode:0
2010/03/26 15:54:24.553889 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:24.553908 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:24.553939 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:24.553977 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:24.553992 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:24.554019 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:54:34.166958 [11068]: Event script timed out : startrecovery count : 0 pid : 11083
2010/03/26 15:54:34.167166 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11083
2010/03/26 15:54:34.167217 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:54:34.170867 [11083]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11083
2010/03/26 15:54:34.207295 [11083]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155434.11083
2010/03/26 15:54:34.208841 [11072]: Dropped orphaned reply control with reqid:46
2010/03/26 15:54:34.209332 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:34.209353 [11072]: Take the recovery lock
2010/03/26 15:54:34.210009 [11072]: Recovery lock taken successfully
2010/03/26 15:54:34.210113 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:34.210285 [11068]: Freeze priority 1
2010/03/26 15:54:34.210386 [11068]: Freeze priority 2
2010/03/26 15:54:34.210478 [11068]: Freeze priority 3
2010/03/26 15:54:54.554312 [11072]: client/ctdb_client.c:771 control timed out. reqid:67 opcode:70 dstnode:0
2010/03/26 15:54:54.554459 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:54.554477 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:54.554489 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:54.554535 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:54.554550 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:54.554562 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:04.211688 [11068]: Event script timed out : startrecovery count : 0 pid : 11099
2010/03/26 15:55:04.211833 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11099
2010/03/26 15:55:04.211861 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:04.215978 [11099]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11099
2010/03/26 15:55:04.251221 [11099]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155504.11099
2010/03/26 15:55:04.254282 [11072]: Dropped orphaned reply control with reqid:67
2010/03/26 15:55:04.254851 [11068]: Banning this node for 300 seconds
2010/03/26 15:55:04.254929 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:55:04.254983 [11072]: Take the recovery lock
2010/03/26 15:55:04.255687 [11072]: Recovery lock taken successfully
2010/03/26 15:55:04.255787 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:55:04.255921 [11068]: Freeze priority 1
2010/03/26 15:55:04.256019 [11068]: Freeze priority 2
2010/03/26 15:55:04.256111 [11068]: Freeze priority 3
2010/03/26 15:55:24.554764 [11072]: client/ctdb_client.c:771 control timed out. reqid:89 opcode:70 dstnode:0
2010/03/26 15:55:24.554898 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:55:24.554913 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:55:24.554930 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:55:24.555003 [11072]: Async wait failed - fail_count=1
2010/03/26 15:55:24.555019 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:55:24.555030 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:34.257258 [11068]: Event script timed out : startrecovery count : 0 pid : 11118
2010/03/26 15:55:34.257403 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11118
2010/03/26 15:55:34.257469 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:34.261315 [11118]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11118
2010/03/26 15:55:34.296027 [11118]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155534.11118
2010/03/26 15:55:34.297504 [11072]: Dropped orphaned reply control with reqid:89
2010/03/26 15:56:04.262626 [11068]: server/ctdb_recover.c:631 Been in recovery mode for too long. Dropping all IPS

After that, CPU usage drops back to 0%.
After a while, it climbs to 95% again.
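
To see at a glance which event script keeps timing out in a long log.ctdb, the "Timed out running script" lines can be tallied. The following is only a sketch: the function name `count_script_timeouts` is made up for illustration, and the regular expression is written against the line format shown in the log above, which may differ in other ctdb versions.

```python
import re
from collections import Counter

# Written against lines such as:
# ... [11083]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11083
TIMEOUT_RE = re.compile(
    r"Timed out running script '(?P<script>\S+) (?P<event>\S+)\s*' "
    r"after (?P<seconds>[\d.]+) seconds"
)


def count_script_timeouts(lines):
    """Count timed-out event scripts, keyed by (script path, event name)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("script"), match.group("event"))] += 1
    return counts
```

Passing an open file object for /var/log/log.ctdb as `lines` works, since the function only iterates over lines.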

When CPU is at 95%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0

When CPU is at 0%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0
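
For watching the node flip between the two states over time, the `ctdb status` text can be turned into a small structure. This is a rough sketch that assumes the plain-text layout shown above; `parse_ctdb_status` is a hypothetical helper, not part of ctdb.

```python
def parse_ctdb_status(text):
    """Parse plain-text `ctdb status` output (layout as shown above) into a dict."""
    status = {"nodes": [], "generation": None, "recovery_mode": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("pnn:"):
            # e.g. "pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)"
            fields = line.split()
            status["nodes"].append({
                "pnn": int(fields[0].split(":", 1)[1]),
                "address": fields[1],
                "flags": fields[2].split("|"),
                "this_node": line.endswith("(THIS NODE)"),
            })
        elif line.startswith("Generation:"):
            status["generation"] = line.split(":", 1)[1]
        elif line.startswith("Recovery mode:"):
            # e.g. "Recovery mode:RECOVERY (1)" -> "RECOVERY"
            status["recovery_mode"] = line.split(":", 1)[1].split()[0]
    return status
```

Comparing the `flags` list between samples would show when the node moves from UNHEALTHY to BANNED|UNHEALTHY and back.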

/GFS5 is a GFS2 filesystem
RHCS is running and quorate

/GFS5/ping_pong/ping_pong /GFS5/ping_pong/test.dta 3
22850 locks/sec
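
ping_pong only exercises byte-range locks on its own test file. A possible complementary check, offered as an assumption rather than anything taken from the ctdb documentation, is whether a plain `stat` of the recovery lock file returns promptly. The helper name `stat_completes` and the 5-second default are illustrative only.

```python
import subprocess


def stat_completes(path, timeout_s=5.0):
    """Return True if `stat path` exits successfully within timeout_s seconds."""
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort: a child stuck in uninterruptible I/O may linger after kill().
        proc.kill()
        return False
```

For the setup in this question it would be called as `stat_completes("/GFS5/.ctdb_recovery_lock")`. A False result would point at the filesystem rather than at ctdb itself.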

What's wrong?

Question information

Language: English
Status: Expired
For: Ubuntu ctdb
Assignee: No assignee
actionparsnip (andrew-woodhead666) said :
#1

Can you run:

top

and post a snapshot of its output for us to study?

ITec (itec) said :
#2

Hi!

Sorry for not posting directly on Launchpad; the website has a problem and will not let me enter additional information.

So here is my answer:

I have now added a second node to see whether the problem is specific to the single-node configuration. It is not.
Here are the changes:

/etc/ctdb/nodes:
192.168.224.221
192.168.224.222

/GFS5 is a GFS2 filesystem mounted on both nodes
RHCS is running and quorate

onnode all /etc/init.d/ctdb start

>> NODE: 192.168.224.221 <<
root@192.168.224.221's password:
 * Starting Clustered TDB ctdb
[: 364: yes: unexpected operator
   ...done.

>> NODE: 192.168.224.222 <<
root@192.168.224.222's password:
 * Starting Clustered TDB ctdb
[: 364: yes: unexpected operator
   ...done.

top on 192.168.224.221:
top - 12:52:42 up 18 min, 3 users, load average: 4.44, 1.64, 1.18
Tasks: 150 total, 6 running, 144 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.9%us, 12.0%sy, 0.0%ni, 76.7%id, 0.8%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 505492k total, 362484k used, 143008k free, 648k buffers
Swap: 498004k total, 0k used, 498004k free, 131364k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 4068 root -2 0 82804 896 672 R 96.5 0.2 0:14.60 ctdbd
 4205 root 20 0 19212 1304 928 R 2.0 0.3 0:00.03 top
    1 root 20 0 61868 2944 1908 S 0.0 0.6 0:00.87 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0.0 0.0 0:00.15 events/0
    7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.02 netns
   10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
   12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 sync_supers
   13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 bdi-default
   14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0
   15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
   16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpid
   17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
   18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_hotplug
   19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata/0
   20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata_aux
   21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksuspend_usbd
   22 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khubd
   23 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kseriod
   24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmmcd
   26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
   28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
   29 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
   30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/0
   31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ecryptfs-kthrea
   32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/0
   35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
   36 root 20 0 0 0 0 S 0.0 0.0 0:00.02 scsi_eh_1
   37 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kstriped
   40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
   41 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
   42 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksnapd
   43 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kondemand/0
   44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kconservative/0
  163 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt_poll_0
  164 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt/0
  165 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
  194 root 20 0 0 0 0 S 0.0 0.0 0:00.08 flush-8:0
  196 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfs_mru_cache
  197 root 20 0 0 0 0 S 0.0 0.0 0:00.03 xfslogd/0
  198 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsdatad/0
  199 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsconvertd/0
  201 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xfsbufd
  202 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsaild
  203 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfssyncd
  246 root 20 0 17028 1000 608 S 0.0 0.2 0:00.18 upstart-udev-br
  263 root 16 -4 17068 836 308 S 0.0 0.2 0:00.19 udevd

top on 192.168.224.222:
top - 12:53:57 up 13 min, 1 user, load average: 2.27, 2.60, 2.05
Tasks: 147 total, 7 running, 140 sleeping, 0 stopped, 0 zombie
Cpu(s): 19.7%us, 22.6%sy, 0.0%ni, 56.8%id, 0.5%wa, 0.2%hi, 0.3%si, 0.0%st
Mem: 505492k total, 355156k used, 150336k free, 640k buffers
Swap: 498004k total, 0k used, 498004k free, 128644k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3718 root -2 0 82804 848 672 R 96.5 0.2 0:14.42 ctdbd
 3824 root 20 0 19212 1288 928 R 1.0 0.3 0:00.02 top
    1 root 20 0 61864 2944 1908 S 0.0 0.6 0:01.04 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0.0 0.0 0:00.01 events/0
    7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
   10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
   12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 sync_supers
   13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 bdi-default
   14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0
   15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
   16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpid
   17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
   18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_hotplug
   19 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ata/0
   20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata_aux
   21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksuspend_usbd
   22 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khubd
   23 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kseriod
   24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmmcd
   26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
   28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
   29 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
   30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/0
   31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ecryptfs-kthrea
   32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/0
   35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
   36 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
   37 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kstriped
   40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
   41 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
   42 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksnapd
   43 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kondemand/0
   44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kconservative/0
  167 root 20 0 0 0 0 S 0.0 0.0 0:00.01 mpt_poll_0
  168 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt/0
  169 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
  198 root 20 0 0 0 0 S 0.0 0.0 0:00.00 flush-8:0
  203 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfs_mru_cache
  205 root 20 0 0 0 0 S 0.0 0.0 0:00.02 xfslogd/0
  206 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsdatad/0
  207 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsconvertd/0
  209 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xfsbufd
  210 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsaild
  211 root 20 0 0 0 0 D 0.0 0.0 0:00.01 xfssyncd
  253 root 20 0 17028 964 600 S 0.0 0.2 0:00.15 upstart-udev-br
  274 root 16 -4 17096 876 308 S 0.0 0.2 0:00.35 udevd

ps faxl on 192.168.224.221:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 -100 - 0 0 migrat S ? 0:00 \_ [migration/0]
1 0 4 2 20 0 0 0 ksofti S ? 0:00 \_ [ksoftirqd/0]
5 0 5 2 -100 - 0 0 watchd S ? 0:00 \_ [watchdog/0]
1 0 6 2 20 0 0 0 worker S ? 0:00 \_ [events/0]
1 0 7 2 20 0 0 0 worker S ? 0:00 \_ [cpuset]
1 0 8 2 20 0 0 0 worker S ? 0:00 \_ [khelper]
1 0 9 2 20 0 0 0 worker S ? 0:00 \_ [netns]
1 0 10 2 20 0 0 0 async_ S ? 0:00 \_ [async/mgr]
1 0 11 2 20 0 0 0 worker S ? 0:00 \_ [pm]
1 0 12 2 20 0 0 0 bdi_sy S ? 0:00 \_ [sync_supers]
1 0 13 2 20 0 0 0 bdi_fo S ? 0:00 \_ [bdi-default]
1 0 14 2 20 0 0 0 worker S ? 0:00 \_ [kintegrityd/0]
1 0 15 2 20 0 0 0 worker S ? 0:00 \_ [kblockd/0]
1 0 16 2 20 0 0 0 worker S ? 0:00 \_ [kacpid]
1 0 17 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_notify]
1 0 18 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_hotplug]
1 0 19 2 20 0 0 0 worker S ? 0:00 \_ [ata/0]
1 0 20 2 20 0 0 0 worker S ? 0:00 \_ [ata_aux]
1 0 21 2 20 0 0 0 worker S ? 0:00 \_ [ksuspend_usbd]
1 0 22 2 20 0 0 0 hub_th S ? 0:00 \_ [khubd]
5 0 23 2 20 0 0 0 serio_ S ? 0:00 \_ [kseriod]
1 0 24 2 20 0 0 0 worker S ? 0:00 \_ [kmmcd]
1 0 26 2 20 0 0 0 watchd S ? 0:00 \_ [khungtaskd]
1 0 28 2 20 0 0 0 kswapd S ? 0:00 \_ [kswapd0]
1 0 29 2 25 5 0 0 ksm_sc SN ? 0:00 \_ [ksmd]
1 0 30 2 20 0 0 0 worker S ? 0:00 \_ [aio/0]
1 0 31 2 20 0 0 0 ecrypt S ? 0:00 \_ [ecryptfs-kthrea]
1 0 32 2 20 0 0 0 worker S ? 0:00 \_ [crypto/0]
1 0 35 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_0]
1 0 36 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_1]
1 0 37 2 20 0 0 0 worker S ? 0:00 \_ [kstriped]
1 0 40 2 20 0 0 0 worker S ? 0:00 \_ [kmpathd/0]
1 0 41 2 20 0 0 0 worker S ? 0:00 \_ [kmpath_handlerd]
1 0 42 2 20 0 0 0 worker S ? 0:00 \_ [ksnapd]
1 0 43 2 20 0 0 0 worker S ? 0:00 \_ [kondemand/0]
1 0 44 2 20 0 0 0 worker S ? 0:00 \_ [kconservative/0]
1 0 163 2 20 0 0 0 worker S ? 0:00 \_ [mpt_poll_0]
1 0 164 2 20 0 0 0 worker S ? 0:00 \_ [mpt/0]
1 0 165 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_2]
1 0 194 2 20 0 0 0 bdi_wr S ? 0:00 \_ [flush-8:0]
1 0 196 2 20 0 0 0 worker S ? 0:00 \_ [xfs_mru_cache]
1 0 197 2 20 0 0 0 worker S ? 0:00 \_ [xfslogd/0]
1 0 198 2 20 0 0 0 worker S ? 0:00 \_ [xfsdatad/0]
1 0 199 2 20 0 0 0 worker S ? 0:00 \_ [xfsconvertd/0]
1 0 201 2 20 0 0 0 xfsbuf S ? 0:00 \_ [xfsbufd]
1 0 202 2 20 0 0 0 xfsail S ? 0:00 \_ [xfsaild]
1 0 203 2 20 0 0 0 xfssyn S ? 0:00 \_ [xfssyncd]
1 0 331 2 20 0 0 0 worker S ? 0:00 \_ [kpsmoused]
1 0 793 2 20 0 0 0 kjourn S ? 0:00 \_ [jbd2/sda1-8]
1 0 794 2 20 0 0 0 worker S ? 0:00 \_ [ext4-dio-unwrit]
1 0 816 2 20 0 0 0 worker S ? 0:00 \_ [iscsi_eh]
1 0 908 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_3]
1 0 910 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_3]
1 0 911 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_3]
1 0 981 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1000 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1002 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1004 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1011 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1014 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
5 0 1657 2 20 0 0 0 worker S ? 0:00 \_ [rpciod/0]
1 0 1666 2 20 0 0 0 svc_re S ? 0:00 \_ [lockd]
1 0 1667 2 20 0 0 0 worker S ? 0:00 \_ [nfsd4]
1 0 1668 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1669 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1670 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1671 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1672 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1673 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1674 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1675 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1918 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_6]
1 0 1919 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_6]
1 0 1920 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_6]
1 0 2606 2 20 0 0 0 dlm_as S ? 0:00 \_ [dlm_astd]
1 0 2607 2 20 0 0 0 dlm_sc S ? 0:00 \_ [dlm_scand]
1 0 2608 2 20 0 0 0 worker S ? 0:00 \_ [dlm_recv/0]
1 0 2609 2 20 0 0 0 worker S ? 0:00 \_ [dlm_send]
1 0 2610 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2649 2 20 0 0 0 worker S ? 0:00 \_ [glock_workqueue]
1 0 2650 2 20 0 0 0 worker S ? 0:00 \_ [delete_workqueu]
1 0 2652 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd000]
1 0 2812 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2818 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2824 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd002]
1 0 2826 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2827 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2828 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2829 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2837 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2845 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2852 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2853 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2854 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2855 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2870 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
4 0 1 0 20 0 61868 2944 poll_s Ss ? 0:00 /sbin/init
1 0 246 1 20 0 17028 1000 poll_s S ? 0:00 upstart-udev-bridge --daemon
5 0 263 1 16 -4 17068 836 poll_s S<s ? 0:00 udevd --daemon
5 0 2000 263 18 -2 17064 876 poll_s S< ? 0:00 \_ udevd --daemon
5 0 2651 263 18 -2 17064 744 poll_s S< ? 0:00 \_ udevd --daemon
1 0 840 1 20 0 4200 524 hrtime Ss ? 0:00 /sbin/iscsid
5 0 841 1 10 -10 14028 3820 poll_s S<Ls ? 0:00 /sbin/iscsid
4 0 1234 1 20 0 96576 4948 poll_s Ss ? 0:00 smbd -F
1 0 1583 1234 20 0 96576 1544 poll_s S ? 0:00 \_ smbd -F
5 1 1253 1 20 0 8248 560 poll_s Ss ? 0:00 portmap
5 101 1262 1 20 0 60808 1836 poll_s Sl ? 0:00 rsyslogd -c4
5 110 1321 1 20 0 10384 816 poll_s Ss ? 0:00 rpc.statd -L
5 0 1338 1 20 0 60664 1856 poll_s Ss ? 0:00 nmbd -D
0 0 1409 1 20 0 6072 636 n_tty_ Ss+ tty4 0:00 /sbin/getty -8 38400 tty4
0 0 1410 1 20 0 6072 636 n_tty_ Ss+ tty5 0:00 /sbin/getty -8 38400 tty5
4 0 1418 1 20 0 118064 4224 wait Ss tty2 0:00 /bin/login --
4 0 2286 1418 20 0 19484 2216 n_tty_ S+ tty2 0:00 \_ -bash
0 0 1420 1 20 0 6072 632 n_tty_ Ss+ tty3 0:00 /sbin/getty -8 38400 tty3
0 0 1422 1 20 0 6072 632 n_tty_ Ss+ tty6 0:00 /sbin/getty -8 38400 tty6
1 0 1437 1 20 0 21068 884 hrtime Ss ? 0:00 cron
5 106 1443 1 20 0 59652 1472 poll_s Ss ? 0:00 dbus-daemon --system --fork
5 0 1510 1 20 0 151728 4472 futex_ Sl ? 0:00 /usr/sbin/libvirtd -d
5 65534 1609 1 20 0 21420 916 poll_s S ? 0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid
1 111 1610 1 20 0 113520 6308 futex_ SLsl ? 0:00 /usr/sbin/slapd -h ldap://127.0.0.1/ ldaps://127.0.0.1/
5 0 1647 1 -100 - 19856 3492 futex_ SLl ? 0:00 /sbin/multipathd
1 0 1681 1 20 0 14776 448 poll_s Ss ? 0:00 /usr/sbin/rpc.mountd --manage-gids
4 0 1759 1 20 0 37192 2224 ep_pol Ss ? 0:00 /usr/lib/postfix/master
4 104 2245 1759 20 0 68784 3048 ep_pol S ? 0:00 \_ pickup -l -t fifo -u -c
4 104 2263 1759 20 0 68940 3092 ep_pol S ? 0:00 \_ qmgr -l -t fifo -u
5 1 1762 1 20 0 14656 728 poll_s Ss ? 0:00 /usr/sbin/slpd
4 0 1821 1 -100 - 160772 87060 - RLsl ? 0:10 corosync -f
4 0 1915 1 20 0 114444 5316 poll_s Ss ? 0:00 sshd: root@pts/0
4 0 2352 1915 20 0 19460 2248 wait Ss pts/0 0:00 \_ -bash
4 0 4128 2352 20 0 6952 1032 - R+ pts/0 0:00 \_ ps faxl
5 0 2008 1 20 0 49252 1092 poll_s Ss ? 0:00 /usr/sbin/sshd
4 0 2687 2008 20 0 114444 5308 poll_s Ss ? 0:00 \_ sshd: root@pts/1
4 0 2754 2687 20 0 19460 2244 n_tty_ Ss+ pts/1 0:00 \_ -bash
4 0 2063 1 20 0 54956 3572 poll_s Sl ? 0:00 /usr/sbin/console-kit-daemon --no-daemon
5 103 2222 1 20 0 21584 1384 poll_s Ss ? 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 103:108
5 0 2439 1 20 0 61328 1776 poll_s Ssl ? 0:00 fenced
5 0 2464 1 -100 - 102476 2032 poll_s Ssl ? 0:00 dlm_controld
5 0 2510 1 -100 - 79972 2000 poll_s Ssl ? 0:00 gfs_controld
5 0 2605 1 20 0 39744 1196 poll_s Ssl ? 0:00 /sbin/clvmd -T20
5 0 2866 1 19 -1 32440 5848 wait S<Ls ? 0:00 rgmanager
5 0 2867 2866 19 -1 66312 2620 poll_s S<l ? 0:00 \_ rgmanager
0 0 2893 1 20 0 6072 636 n_tty_ Ss+ tty1 0:00 /sbin/getty -8 38400 tty1
5 0 4068 1 -2 - 82804 900 - Rs ? 1:22 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4069 4068 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4070 4068 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4071 4068 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 4072 4068 20 0 82804 636 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4122 4068 -2 - 82804 432 wait S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4123 4122 -2 - 82804 220 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

ps faxl on 192.168.224.222:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 -100 - 0 0 migrat S ? 0:00 \_ [migration/0]
1 0 4 2 20 0 0 0 ksofti S ? 0:00 \_ [ksoftirqd/0]
5 0 5 2 -100 - 0 0 watchd S ? 0:00 \_ [watchdog/0]
1 0 6 2 20 0 0 0 worker S ? 0:00 \_ [events/0]
1 0 7 2 20 0 0 0 worker S ? 0:00 \_ [cpuset]
1 0 8 2 20 0 0 0 worker S ? 0:00 \_ [khelper]
1 0 9 2 20 0 0 0 worker S ? 0:00 \_ [netns]
1 0 10 2 20 0 0 0 async_ S ? 0:00 \_ [async/mgr]
1 0 11 2 20 0 0 0 worker S ? 0:00 \_ [pm]
1 0 12 2 20 0 0 0 bdi_sy S ? 0:00 \_ [sync_supers]
1 0 13 2 20 0 0 0 bdi_fo S ? 0:00 \_ [bdi-default]
1 0 14 2 20 0 0 0 worker S ? 0:00 \_ [kintegrityd/0]
1 0 15 2 20 0 0 0 worker S ? 0:00 \_ [kblockd/0]
1 0 16 2 20 0 0 0 worker S ? 0:00 \_ [kacpid]
1 0 17 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_notify]
1 0 18 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_hotplug]
1 0 19 2 20 0 0 0 worker S ? 0:00 \_ [ata/0]
1 0 20 2 20 0 0 0 worker S ? 0:00 \_ [ata_aux]
1 0 21 2 20 0 0 0 worker S ? 0:00 \_ [ksuspend_usbd]
1 0 22 2 20 0 0 0 hub_th S ? 0:00 \_ [khubd]
5 0 23 2 20 0 0 0 serio_ S ? 0:00 \_ [kseriod]
1 0 24 2 20 0 0 0 worker S ? 0:00 \_ [kmmcd]
1 0 26 2 20 0 0 0 watchd S ? 0:00 \_ [khungtaskd]
1 0 28 2 20 0 0 0 kswapd S ? 0:00 \_ [kswapd0]
1 0 29 2 25 5 0 0 ksm_sc SN ? 0:00 \_ [ksmd]
1 0 30 2 20 0 0 0 worker S ? 0:00 \_ [aio/0]
1 0 31 2 20 0 0 0 ecrypt S ? 0:00 \_ [ecryptfs-kthrea]
1 0 32 2 20 0 0 0 worker S ? 0:00 \_ [crypto/0]
1 0 35 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_0]
1 0 36 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_1]
1 0 37 2 20 0 0 0 worker S ? 0:00 \_ [kstriped]
1 0 40 2 20 0 0 0 worker S ? 0:00 \_ [kmpathd/0]
1 0 41 2 20 0 0 0 worker S ? 0:00 \_ [kmpath_handlerd]
1 0 42 2 20 0 0 0 worker S ? 0:00 \_ [ksnapd]
1 0 43 2 20 0 0 0 worker S ? 0:00 \_ [kondemand/0]
1 0 44 2 20 0 0 0 worker S ? 0:00 \_ [kconservative/0]
1 0 167 2 20 0 0 0 worker S ? 0:00 \_ [mpt_poll_0]
1 0 168 2 20 0 0 0 worker S ? 0:00 \_ [mpt/0]
1 0 169 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_2]
1 0 198 2 20 0 0 0 bdi_wr S ? 0:00 \_ [flush-8:0]
1 0 203 2 20 0 0 0 worker S ? 0:00 \_ [xfs_mru_cache]
1 0 205 2 20 0 0 0 worker S ? 0:00 \_ [xfslogd/0]
1 0 206 2 20 0 0 0 worker S ? 0:00 \_ [xfsdatad/0]
1 0 207 2 20 0 0 0 worker S ? 0:00 \_ [xfsconvertd/0]
1 0 209 2 20 0 0 0 xfsbuf S ? 0:00 \_ [xfsbufd]
1 0 210 2 20 0 0 0 xfsail S ? 0:00 \_ [xfsaild]
1 0 211 2 20 0 0 0 xfssyn S ? 0:00 \_ [xfssyncd]
1 0 327 2 20 0 0 0 worker S ? 0:00 \_ [kpsmoused]
1 0 781 2 20 0 0 0 kjourn S ? 0:00 \_ [jbd2/sda1-8]
1 0 784 2 20 0 0 0 worker S ? 0:00 \_ [ext4-dio-unwrit]
1 0 828 2 20 0 0 0 worker S ? 0:00 \_ [iscsi_eh]
1 0 909 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_3]
1 0 910 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_3]
1 0 911 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_3]
1 0 992 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1020 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1022 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1024 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1026 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1030 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
5 0 1609 2 20 0 0 0 worker S ? 0:00 \_ [rpciod/0]
1 0 1618 2 20 0 0 0 svc_re S ? 0:00 \_ [lockd]
1 0 1619 2 20 0 0 0 worker S ? 0:00 \_ [nfsd4]
1 0 1620 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1621 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1622 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1623 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1624 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1625 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1626 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1627 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1853 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_6]
1 0 1854 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_6]
1 0 1855 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_6]
1 0 2496 2 20 0 0 0 dlm_as S ? 0:00 \_ [dlm_astd]
1 0 2497 2 20 0 0 0 dlm_sc S ? 0:00 \_ [dlm_scand]
1 0 2498 2 20 0 0 0 worker S ? 0:00 \_ [dlm_recv/0]
1 0 2499 2 20 0 0 0 worker S ? 0:00 \_ [dlm_send]
1 0 2500 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2524 2 20 0 0 0 worker S ? 0:00 \_ [glock_workqueue]
1 0 2525 2 20 0 0 0 worker S ? 0:00 \_ [delete_workqueu]
1 0 2527 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd000]
1 0 2528 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd001]
1 0 2529 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2535 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2536 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2541 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2553 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2559 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2560 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2561 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2562 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2567 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2579 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2580 2 20 0 0 0 gfs2_q D ? 0:00 \_ [gfs2_quotad]
1 0 2592 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
4 0 1 0 20 0 61864 2944 poll_s Ss ? 0:01 /sbin/init
1 0 253 1 20 0 17028 964 poll_s S ? 0:00 upstart-udev-bridge --daemon
5 0 274 1 16 -4 17096 876 poll_s S<s ? 0:00 udevd --daemon
5 0 1900 274 18 -2 17092 912 poll_s S< ? 0:00 \_ udevd --daemon
5 0 2526 274 18 -2 17092 784 poll_s S< ? 0:00 \_ udevd --daemon
1 0 848 1 20 0 4200 520 hrtime Ss ? 0:00 /sbin/iscsid
5 0 849 1 10 -10 14028 3820 poll_s S<Ls ? 0:00 /sbin/iscsid
4 0 1251 1 20 0 96596 4952 poll_s Ss ? 0:00 smbd -F
1 0 1540 1251 20 0 96596 1548 poll_s S ? 0:00 \_ smbd -F
5 1 1252 1 20 0 8248 556 poll_s Ss ? 0:00 portmap
5 101 1259 1 20 0 60676 1824 poll_s Sl ? 0:00 rsyslogd -c4
5 110 1329 1 20 0 10384 816 poll_s Ss ? 0:00 rpc.statd -L
5 0 1345 1 20 0 60660 1816 poll_s Ss ? 0:00 nmbd -D
0 0 1400 1 20 0 6072 632 n_tty_ Ss+ tty4 0:00 /sbin/getty -8 38400 tty4
0 0 1406 1 20 0 6072 632 n_tty_ Ss+ tty5 0:00 /sbin/getty -8 38400 tty5
0 0 1410 1 20 0 6072 632 n_tty_ Ss+ tty2 0:00 /sbin/getty -8 38400 tty2
0 0 1412 1 20 0 6072 636 n_tty_ Ss+ tty3 0:00 /sbin/getty -8 38400 tty3
0 0 1418 1 20 0 6072 632 n_tty_ Ss+ tty6 0:00 /sbin/getty -8 38400 tty6
5 106 1422 1 20 0 59652 1460 poll_s Ss ? 0:00 dbus-daemon --system --fork
1 0 1431 1 20 0 21068 884 hrtime Ss ? 0:00 cron
5 0 1480 1 20 0 151728 4472 futex_ Sl ? 0:00 /usr/sbin/libvirtd -d
5 65534 1561 1 20 0 21420 912 poll_s S ? 0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid
1 111 1562 1 20 0 113520 6304 futex_ SLsl ? 0:00 /usr/sbin/slapd -h ldap://127.0.0.1/ ldaps://127.0.0.1/
5 0 1599 1 -100 - 19856 3492 futex_ SLl ? 0:00 /sbin/multipathd
1 0 1633 1 20 0 14776 448 poll_s Ss ? 0:00 /usr/sbin/rpc.mountd --manage-gids
4 0 1711 1 20 0 37192 2224 ep_pol Ss ? 0:00 /usr/lib/postfix/master
4 104 2175 1711 20 0 68784 3048 ep_pol S ? 0:00 \_ pickup -l -t fifo -u -c
4 104 2176 1711 20 0 68940 3096 ep_pol S ? 0:00 \_ qmgr -l -t fifo -u
5 1 1716 1 20 0 14656 732 poll_s Ss ? 0:00 /usr/sbin/slpd
4 0 1769 1 -100 - 160964 87256 poll_s SLsl ? 0:05 corosync -f
4 0 1785 1 20 0 114464 5320 poll_s Ss ? 0:00 sshd: root@pts/0
4 0 2274 1785 20 0 19460 2248 wait Ss pts/0 0:00 \_ -bash
4 0 3755 2274 20 0 6952 1036 - R+ pts/0 0:00 \_ ps faxl
4 0 1904 1 20 0 54956 3552 poll_s Sl ? 0:00 /usr/sbin/console-kit-daemon --no-daemon
5 0 2150 1 20 0 49252 1096 poll_s Ss ? 0:00 /usr/sbin/sshd
5 103 2297 1 20 0 21584 1384 poll_s Ss ? 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 103:108
5 0 2356 1 20 0 61324 1708 poll_s Ssl ? 0:00 fenced
5 0 2381 1 -100 - 102476 2016 poll_s Ssl ? 0:00 dlm_controld
5 0 2424 1 -100 - 79972 1988 poll_s Ssl ? 0:00 gfs_controld
5 0 2495 1 20 0 39744 1192 poll_s Ssl ? 0:00 /sbin/clvmd -T20
5 0 2589 1 19 -1 32440 5848 wait S<Ls ? 0:00 rgmanager
5 0 2590 2589 19 -1 66312 2612 poll_s S<l ? 0:00 \_ rgmanager
0 0 2647 1 20 0 6072 632 n_tty_ Ss+ tty1 0:00 /sbin/getty -8 38400 tty1
5 0 3718 1 -2 - 82804 872 - Rs ? 0:45 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3719 3718 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3720 3718 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3721 3718 -2 - 82804 236 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 3722 3718 20 0 82804 524 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3733 3718 -2 - 82804 548 wait S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
0 0 3734 3733 -2 - 0 0 exit Z ? 0:00 | \_ [sh] <defunct>
0 0 3745 3733 -2 - 4088 592 - R ? 0:00 | \_ sh -c { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/t
0 0 3750 3745 -2 - 0 0 exit Z ? 0:00 | \_ [ls] <defunct>
1 0 3749 3718 -2 - 82804 416 wait S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3751 3749 -2 - 82804 200 - R ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

ctdb status on 192.168.224.221:
Number of nodes:2
pnn:0 192.168.224.221 UNHEALTHY (THIS NODE)
pnn:1 192.168.224.222 UNHEALTHY
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:0

ctdb status on 192.168.224.222:
Number of nodes:2
pnn:0 192.168.224.221 UNHEALTHY
pnn:1 192.168.224.222 UNHEALTHY (THIS NODE)
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:0

After a while:

top on 192.168.224.221:
top - 13:14:04 up 39 min, 3 users, load average: 1.09, 2.73, 2.48
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 505492k total, 363124k used, 142368k free, 648k buffers
Swap: 498004k total, 0k used, 498004k free, 131604k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 4068 root -2 0 82804 1128 676 S 0.3 0.2 4:44.79 ctdbd
 4264 root 20 0 19216 1408 1032 R 0.3 0.3 0:00.04 top
    1 root 20 0 61868 2944 1908 S 0.0 0.6 0:00.87 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0.0 0.0 0:00.17 events/0
    7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.02 netns
   10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
   12 root 20 0 0 0 0 S 0.0 0.0 0:00.01 sync_supers
   13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 bdi-default
   14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0
   15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
   16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpid
   17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
   18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_hotplug
   19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata/0
   20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata_aux
   21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksuspend_usbd
   22 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khubd
   23 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kseriod
   24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmmcd
   26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
   28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
   29 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
   30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/0
   31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ecryptfs-kthrea
   32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/0
   35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
   36 root 20 0 0 0 0 S 0.0 0.0 0:00.02 scsi_eh_1
   37 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kstriped
   40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
   41 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
   42 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksnapd
   43 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kondemand/0
   44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kconservative/0
  163 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt_poll_0
  164 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt/0
  165 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
  194 root 20 0 0 0 0 S 0.0 0.0 0:00.10 flush-8:0
  196 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfs_mru_cache
  197 root 20 0 0 0 0 S 0.0 0.0 0:00.04 xfslogd/0
  198 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsdatad/0
  199 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsconvertd/0
  201 root 20 0 0 0 0 S 0.0 0.0 0:00.02 xfsbufd
  202 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsaild
  203 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xfssyncd
  246 root 20 0 17028 1000 608 S 0.0 0.2 0:00.18 upstart-udev-br
  263 root 16 -4 17068 836 308 S 0.0 0.2 0:00.19 udevd

top on 192.168.224.222:
top - 13:20:09 up 39 min, 1 user, load average: 3.75, 4.04, 3.28
Tasks: 145 total, 1 running, 144 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 505492k total, 356640k used, 148852k free, 640k buffers
Swap: 498004k total, 0k used, 498004k free, 128924k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    1 root 20 0 61864 2944 1908 S 0.0 0.6 0:01.04 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0.0 0.0 0:00.02 events/0
    7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
   10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
   12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 sync_supers
   13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 bdi-default
   14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0
   15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
   16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpid
   17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
   18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kacpi_hotplug
   19 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ata/0
   20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ata_aux
   21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksuspend_usbd
   22 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khubd
   23 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kseriod
   24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmmcd
   26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
   28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
   29 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
   30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 aio/0
   31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ecryptfs-kthrea
   32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 crypto/0
   35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
   36 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
   37 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kstriped
   40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
   41 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
   42 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksnapd
   43 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kondemand/0
   44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kconservative/0
  167 root 20 0 0 0 0 S 0.0 0.0 0:00.01 mpt_poll_0
  168 root 20 0 0 0 0 S 0.0 0.0 0:00.00 mpt/0
  169 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
  198 root 20 0 0 0 0 S 0.0 0.0 0:00.02 flush-8:0
  203 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfs_mru_cache
  205 root 20 0 0 0 0 S 0.0 0.0 0:00.02 xfslogd/0
  206 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsdatad/0
  207 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsconvertd/0
  209 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xfsbufd
  210 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xfsaild
  211 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xfssyncd
  253 root 20 0 17028 964 600 S 0.0 0.2 0:00.15 upstart-udev-br
  274 root 16 -4 17096 876 308 S 0.0 0.2 0:00.35 udevd
  327 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
  781 root 20 0 0 0 0 S 0.0 0.0 0:00.00 jbd2/sda1-8

ps faxl on 192.168.224.221:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 -100 - 0 0 migrat S ? 0:00 \_ [migration/0]
1 0 4 2 20 0 0 0 ksofti S ? 0:00 \_ [ksoftirqd/0]
5 0 5 2 -100 - 0 0 watchd S ? 0:00 \_ [watchdog/0]
1 0 6 2 20 0 0 0 worker S ? 0:00 \_ [events/0]
1 0 7 2 20 0 0 0 worker S ? 0:00 \_ [cpuset]
1 0 8 2 20 0 0 0 worker S ? 0:00 \_ [khelper]
1 0 9 2 20 0 0 0 worker S ? 0:00 \_ [netns]
1 0 10 2 20 0 0 0 async_ S ? 0:00 \_ [async/mgr]
1 0 11 2 20 0 0 0 worker S ? 0:00 \_ [pm]
1 0 12 2 20 0 0 0 bdi_sy S ? 0:00 \_ [sync_supers]
1 0 13 2 20 0 0 0 bdi_fo S ? 0:00 \_ [bdi-default]
1 0 14 2 20 0 0 0 worker S ? 0:00 \_ [kintegrityd/0]
1 0 15 2 20 0 0 0 worker S ? 0:00 \_ [kblockd/0]
1 0 16 2 20 0 0 0 worker S ? 0:00 \_ [kacpid]
1 0 17 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_notify]
1 0 18 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_hotplug]
1 0 19 2 20 0 0 0 worker S ? 0:00 \_ [ata/0]
1 0 20 2 20 0 0 0 worker S ? 0:00 \_ [ata_aux]
1 0 21 2 20 0 0 0 worker S ? 0:00 \_ [ksuspend_usbd]
1 0 22 2 20 0 0 0 hub_th S ? 0:00 \_ [khubd]
5 0 23 2 20 0 0 0 serio_ S ? 0:00 \_ [kseriod]
1 0 24 2 20 0 0 0 worker S ? 0:00 \_ [kmmcd]
1 0 26 2 20 0 0 0 watchd S ? 0:00 \_ [khungtaskd]
1 0 28 2 20 0 0 0 kswapd S ? 0:00 \_ [kswapd0]
1 0 29 2 25 5 0 0 ksm_sc SN ? 0:00 \_ [ksmd]
1 0 30 2 20 0 0 0 worker S ? 0:00 \_ [aio/0]
1 0 31 2 20 0 0 0 ecrypt S ? 0:00 \_ [ecryptfs-kthrea]
1 0 32 2 20 0 0 0 worker S ? 0:00 \_ [crypto/0]
1 0 35 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_0]
1 0 36 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_1]
1 0 37 2 20 0 0 0 worker S ? 0:00 \_ [kstriped]
1 0 40 2 20 0 0 0 worker S ? 0:00 \_ [kmpathd/0]
1 0 41 2 20 0 0 0 worker S ? 0:00 \_ [kmpath_handlerd]
1 0 42 2 20 0 0 0 worker S ? 0:00 \_ [ksnapd]
1 0 43 2 20 0 0 0 worker S ? 0:00 \_ [kondemand/0]
1 0 44 2 20 0 0 0 worker S ? 0:00 \_ [kconservative/0]
1 0 163 2 20 0 0 0 worker S ? 0:00 \_ [mpt_poll_0]
1 0 164 2 20 0 0 0 worker S ? 0:00 \_ [mpt/0]
1 0 165 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_2]
1 0 194 2 20 0 0 0 bdi_wr S ? 0:00 \_ [flush-8:0]
1 0 196 2 20 0 0 0 worker S ? 0:00 \_ [xfs_mru_cache]
1 0 197 2 20 0 0 0 worker S ? 0:00 \_ [xfslogd/0]
1 0 198 2 20 0 0 0 worker S ? 0:00 \_ [xfsdatad/0]
1 0 199 2 20 0 0 0 worker S ? 0:00 \_ [xfsconvertd/0]
1 0 201 2 20 0 0 0 xfsbuf S ? 0:00 \_ [xfsbufd]
1 0 202 2 20 0 0 0 xfsail S ? 0:00 \_ [xfsaild]
1 0 203 2 20 0 0 0 xfssyn S ? 0:00 \_ [xfssyncd]
1 0 331 2 20 0 0 0 worker S ? 0:00 \_ [kpsmoused]
1 0 793 2 20 0 0 0 kjourn S ? 0:00 \_ [jbd2/sda1-8]
1 0 794 2 20 0 0 0 worker S ? 0:00 \_ [ext4-dio-unwrit]
1 0 816 2 20 0 0 0 worker S ? 0:00 \_ [iscsi_eh]
1 0 908 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_3]
1 0 910 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_3]
1 0 911 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_3]
1 0 981 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1000 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1002 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1004 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1011 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1014 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
5 0 1657 2 20 0 0 0 worker S ? 0:00 \_ [rpciod/0]
1 0 1666 2 20 0 0 0 svc_re S ? 0:00 \_ [lockd]
1 0 1667 2 20 0 0 0 worker S ? 0:00 \_ [nfsd4]
1 0 1668 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1669 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1670 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1671 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1672 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1673 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1674 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1675 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1918 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_6]
1 0 1919 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_6]
1 0 1920 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_6]
1 0 2606 2 20 0 0 0 dlm_as S ? 0:00 \_ [dlm_astd]
1 0 2607 2 20 0 0 0 dlm_sc S ? 0:00 \_ [dlm_scand]
1 0 2608 2 20 0 0 0 worker S ? 0:00 \_ [dlm_recv/0]
1 0 2609 2 20 0 0 0 worker S ? 0:00 \_ [dlm_send]
1 0 2610 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2649 2 20 0 0 0 worker S ? 0:00 \_ [glock_workqueue]
1 0 2650 2 20 0 0 0 worker S ? 0:00 \_ [delete_workqueu]
1 0 2652 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd000]
1 0 2812 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2818 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2824 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd002]
1 0 2826 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2827 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2828 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2829 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2837 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2845 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2852 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2853 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2854 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2855 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2870 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
4 0 1 0 20 0 61868 2944 poll_s Ss ? 0:00 /sbin/init
1 0 246 1 20 0 17028 1000 poll_s S ? 0:00 upstart-udev-bridge --daemon
5 0 263 1 16 -4 17068 836 poll_s S<s ? 0:00 udevd --daemon
5 0 2000 263 18 -2 17064 876 poll_s S< ? 0:00 \_ udevd --daemon
5 0 2651 263 18 -2 17064 744 poll_s S< ? 0:00 \_ udevd --daemon
1 0 840 1 20 0 4200 524 hrtime Ss ? 0:00 /sbin/iscsid
5 0 841 1 10 -10 14028 3820 poll_s S<Ls ? 0:00 /sbin/iscsid
4 0 1234 1 20 0 96576 4948 poll_s Ss ? 0:00 smbd -F
1 0 1583 1234 20 0 96576 1544 poll_s S ? 0:00 \_ smbd -F
5 1 1253 1 20 0 8248 560 poll_s Ss ? 0:00 portmap
5 101 1262 1 20 0 60808 1840 poll_s Sl ? 0:00 rsyslogd -c4
5 110 1321 1 20 0 10384 816 poll_s Ss ? 0:00 rpc.statd -L
5 0 1338 1 20 0 60664 1856 poll_s Ss ? 0:00 nmbd -D
0 0 1409 1 20 0 6072 636 n_tty_ Ss+ tty4 0:00 /sbin/getty -8 38400 tty4
0 0 1410 1 20 0 6072 636 n_tty_ Ss+ tty5 0:00 /sbin/getty -8 38400 tty5
4 0 1418 1 20 0 118064 4224 wait Ss tty2 0:00 /bin/login --
4 0 2286 1418 20 0 19484 2216 n_tty_ S+ tty2 0:00 \_ -bash
0 0 1420 1 20 0 6072 632 n_tty_ Ss+ tty3 0:00 /sbin/getty -8 38400 tty3
0 0 1422 1 20 0 6072 632 n_tty_ Ss+ tty6 0:00 /sbin/getty -8 38400 tty6
1 0 1437 1 20 0 21068 888 hrtime Ss ? 0:00 cron
5 106 1443 1 20 0 59652 1472 poll_s Ss ? 0:00 dbus-daemon --system --fork
5 0 1510 1 20 0 151728 4472 futex_ Sl ? 0:00 /usr/sbin/libvirtd -d
5 65534 1609 1 20 0 21420 916 poll_s S ? 0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid
1 111 1610 1 20 0 113520 6308 futex_ SLsl ? 0:00 /usr/sbin/slapd -h ldap://127.0.0.1/ ldaps://127.0.0.1/
5 0 1647 1 -100 - 19856 3492 futex_ SLl ? 0:00 /sbin/multipathd
1 0 1681 1 20 0 14776 448 poll_s Ss ? 0:00 /usr/sbin/rpc.mountd --manage-gids
4 0 1759 1 20 0 37192 2224 ep_pol Ss ? 0:00 /usr/lib/postfix/master
4 104 2245 1759 20 0 68784 3048 ep_pol S ? 0:00 \_ pickup -l -t fifo -u -c
4 104 2263 1759 20 0 68940 3092 ep_pol S ? 0:00 \_ qmgr -l -t fifo -u
5 1 1762 1 20 0 14656 728 poll_s Ss ? 0:00 /usr/sbin/slpd
4 0 1821 1 -100 - 160772 87060 poll_s SLsl ? 0:16 corosync -f
4 0 1915 1 20 0 114444 5316 poll_s Ss ? 0:00 sshd: root@pts/0
4 0 2352 1915 20 0 19460 2248 wait Ss pts/0 0:00 \_ -bash
4 0 4377 2352 20 0 6952 1032 - R+ pts/0 0:00 \_ ps faxl
5 0 2008 1 20 0 49252 1092 poll_s Ss ? 0:00 /usr/sbin/sshd
4 0 2687 2008 20 0 114444 5308 poll_s Ss ? 0:00 \_ sshd: root@pts/1
4 0 2754 2687 20 0 19460 2244 n_tty_ Ss+ pts/1 0:00 \_ -bash
4 0 2063 1 20 0 54956 3572 poll_s Sl ? 0:00 /usr/sbin/console-kit-daemon --no-daemon
5 103 2222 1 20 0 21584 1384 poll_s Ss ? 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 103:108
5 0 2439 1 20 0 61328 1776 poll_s Ssl ? 0:00 fenced
5 0 2464 1 -100 - 102476 2040 poll_s Ssl ? 0:00 dlm_controld
5 0 2510 1 -100 - 79972 2000 poll_s Ssl ? 0:00 gfs_controld
5 0 2605 1 20 0 39744 1196 poll_s Ssl ? 0:00 /sbin/clvmd -T20
5 0 2866 1 19 -1 32440 5848 wait S<Ls ? 0:00 rgmanager
5 0 2867 2866 19 -1 66312 2620 poll_s S<l ? 0:00 \_ rgmanager
0 0 2893 1 20 0 6072 636 n_tty_ Ss+ tty1 0:00 /sbin/getty -8 38400 tty1
5 0 4068 1 -2 - 82804 1344 ep_pol Ss ? 7:06 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4069 4068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4070 4068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 4071 4068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 4072 4068 20 0 82804 780 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

ps faxl on 192.168.224.222:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 -100 - 0 0 migrat S ? 0:00 \_ [migration/0]
1 0 4 2 20 0 0 0 ksofti S ? 0:00 \_ [ksoftirqd/0]
5 0 5 2 -100 - 0 0 watchd S ? 0:00 \_ [watchdog/0]
1 0 6 2 20 0 0 0 worker S ? 0:00 \_ [events/0]
1 0 7 2 20 0 0 0 worker S ? 0:00 \_ [cpuset]
1 0 8 2 20 0 0 0 worker S ? 0:00 \_ [khelper]
1 0 9 2 20 0 0 0 worker S ? 0:00 \_ [netns]
1 0 10 2 20 0 0 0 async_ S ? 0:00 \_ [async/mgr]
1 0 11 2 20 0 0 0 worker S ? 0:00 \_ [pm]
1 0 12 2 20 0 0 0 bdi_sy S ? 0:00 \_ [sync_supers]
1 0 13 2 20 0 0 0 bdi_fo S ? 0:00 \_ [bdi-default]
1 0 14 2 20 0 0 0 worker S ? 0:00 \_ [kintegrityd/0]
1 0 15 2 20 0 0 0 worker S ? 0:00 \_ [kblockd/0]
1 0 16 2 20 0 0 0 worker S ? 0:00 \_ [kacpid]
1 0 17 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_notify]
1 0 18 2 20 0 0 0 worker S ? 0:00 \_ [kacpi_hotplug]
1 0 19 2 20 0 0 0 worker S ? 0:00 \_ [ata/0]
1 0 20 2 20 0 0 0 worker S ? 0:00 \_ [ata_aux]
1 0 21 2 20 0 0 0 worker S ? 0:00 \_ [ksuspend_usbd]
1 0 22 2 20 0 0 0 hub_th S ? 0:00 \_ [khubd]
5 0 23 2 20 0 0 0 serio_ S ? 0:00 \_ [kseriod]
1 0 24 2 20 0 0 0 worker S ? 0:00 \_ [kmmcd]
1 0 26 2 20 0 0 0 watchd S ? 0:00 \_ [khungtaskd]
1 0 28 2 20 0 0 0 kswapd S ? 0:00 \_ [kswapd0]
1 0 29 2 25 5 0 0 ksm_sc SN ? 0:00 \_ [ksmd]
1 0 30 2 20 0 0 0 worker S ? 0:00 \_ [aio/0]
1 0 31 2 20 0 0 0 ecrypt S ? 0:00 \_ [ecryptfs-kthrea]
1 0 32 2 20 0 0 0 worker S ? 0:00 \_ [crypto/0]
1 0 35 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_0]
1 0 36 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_1]
1 0 37 2 20 0 0 0 worker S ? 0:00 \_ [kstriped]
1 0 40 2 20 0 0 0 worker S ? 0:00 \_ [kmpathd/0]
1 0 41 2 20 0 0 0 worker S ? 0:00 \_ [kmpath_handlerd]
1 0 42 2 20 0 0 0 worker S ? 0:00 \_ [ksnapd]
1 0 43 2 20 0 0 0 worker S ? 0:00 \_ [kondemand/0]
1 0 44 2 20 0 0 0 worker S ? 0:00 \_ [kconservative/0]
1 0 167 2 20 0 0 0 worker S ? 0:00 \_ [mpt_poll_0]
1 0 168 2 20 0 0 0 worker S ? 0:00 \_ [mpt/0]
1 0 169 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_2]
1 0 198 2 20 0 0 0 bdi_wr S ? 0:00 \_ [flush-8:0]
1 0 203 2 20 0 0 0 worker S ? 0:00 \_ [xfs_mru_cache]
1 0 205 2 20 0 0 0 worker S ? 0:00 \_ [xfslogd/0]
1 0 206 2 20 0 0 0 worker S ? 0:00 \_ [xfsdatad/0]
1 0 207 2 20 0 0 0 worker S ? 0:00 \_ [xfsconvertd/0]
1 0 209 2 20 0 0 0 xfsbuf S ? 0:00 \_ [xfsbufd]
1 0 210 2 20 0 0 0 xfsail S ? 0:00 \_ [xfsaild]
1 0 211 2 20 0 0 0 xfssyn S ? 0:00 \_ [xfssyncd]
1 0 327 2 20 0 0 0 worker S ? 0:00 \_ [kpsmoused]
1 0 781 2 20 0 0 0 kjourn S ? 0:00 \_ [jbd2/sda1-8]
1 0 784 2 20 0 0 0 worker S ? 0:00 \_ [ext4-dio-unwrit]
1 0 828 2 20 0 0 0 worker S ? 0:00 \_ [iscsi_eh]
1 0 909 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_3]
1 0 910 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_3]
1 0 911 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_3]
1 0 992 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1020 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1022 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1024 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1026 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
1 0 1030 2 20 0 0 0 worker S ? 0:00 \_ [kdmflush]
5 0 1609 2 20 0 0 0 worker S ? 0:00 \_ [rpciod/0]
1 0 1618 2 20 0 0 0 svc_re S ? 0:00 \_ [lockd]
1 0 1619 2 20 0 0 0 worker S ? 0:00 \_ [nfsd4]
1 0 1620 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1621 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1622 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1623 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1624 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1625 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1626 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1627 2 20 0 0 0 svc_re S ? 0:00 \_ [nfsd]
1 0 1853 2 20 0 0 0 scsi_e S ? 0:00 \_ [scsi_eh_6]
1 0 1854 2 0 -20 0 0 worker S< ? 0:00 \_ [iscsi_q_6]
1 0 1855 2 20 0 0 0 worker S ? 0:00 \_ [scsi_wq_6]
1 0 2496 2 20 0 0 0 dlm_as S ? 0:00 \_ [dlm_astd]
1 0 2497 2 20 0 0 0 dlm_sc S ? 0:00 \_ [dlm_scand]
1 0 2498 2 20 0 0 0 worker S ? 0:00 \_ [dlm_recv/0]
1 0 2499 2 20 0 0 0 worker S ? 0:00 \_ [dlm_send]
1 0 2500 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2524 2 20 0 0 0 worker S ? 0:00 \_ [glock_workqueue]
1 0 2525 2 20 0 0 0 worker S ? 0:00 \_ [delete_workqueu]
1 0 2527 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd000]
1 0 2528 2 15 -5 0 0 slow_w S< ? 0:00 \_ [kslowd001]
1 0 2529 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2535 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2536 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2541 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2553 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2559 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2560 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2561 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2562 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2567 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
1 0 2579 2 20 0 0 0 gfs2_l S ? 0:00 \_ [gfs2_logd]
1 0 2580 2 20 0 0 0 gfs2_q S ? 0:00 \_ [gfs2_quotad]
1 0 2592 2 20 0 0 0 dlm_re S ? 0:00 \_ [dlm_recoverd]
4 0 1 0 20 0 61864 2944 poll_s Ss ? 0:01 /sbin/init
1 0 253 1 20 0 17028 964 poll_s S ? 0:00 upstart-udev-bridge --daemon
5 0 274 1 16 -4 17096 876 poll_s S<s ? 0:00 udevd --daemon
5 0 1900 274 18 -2 17092 912 poll_s S< ? 0:00 \_ udevd --daemon
5 0 2526 274 18 -2 17092 784 poll_s S< ? 0:00 \_ udevd --daemon
1 0 848 1 20 0 4200 520 hrtime Ss ? 0:00 /sbin/iscsid
5 0 849 1 10 -10 14028 3820 poll_s S<Ls ? 0:00 /sbin/iscsid
4 0 1251 1 20 0 96596 4952 poll_s Ss ? 0:00 smbd -F
1 0 1540 1251 20 0 96596 1548 poll_s S ? 0:00 \_ smbd -F
5 1 1252 1 20 0 8248 556 poll_s Ss ? 0:00 portmap
5 101 1259 1 20 0 60676 1824 poll_s Sl ? 0:00 rsyslogd -c4
5 110 1329 1 20 0 10384 816 poll_s Ss ? 0:00 rpc.statd -L
5 0 1345 1 20 0 60660 1852 poll_s Ss ? 0:00 nmbd -D
0 0 1400 1 20 0 6072 632 n_tty_ Ss+ tty4 0:00 /sbin/getty -8 38400 tty4
0 0 1406 1 20 0 6072 632 n_tty_ Ss+ tty5 0:00 /sbin/getty -8 38400 tty5
0 0 1410 1 20 0 6072 632 n_tty_ Ss+ tty2 0:00 /sbin/getty -8 38400 tty2
0 0 1412 1 20 0 6072 636 n_tty_ Ss+ tty3 0:00 /sbin/getty -8 38400 tty3
0 0 1418 1 20 0 6072 632 n_tty_ Ss+ tty6 0:00 /sbin/getty -8 38400 tty6
5 106 1422 1 20 0 59652 1460 poll_s Ss ? 0:00 dbus-daemon --system --fork
1 0 1431 1 20 0 21068 888 hrtime Ss ? 0:00 cron
5 0 1480 1 20 0 151728 4472 futex_ Sl ? 0:00 /usr/sbin/libvirtd -d
5 65534 1561 1 20 0 21420 912 poll_s S ? 0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid
1 111 1562 1 20 0 113520 6304 futex_ SLsl ? 0:00 /usr/sbin/slapd -h ldap://127.0.0.1/ ldaps://127.0.0.1/
5 0 1599 1 -100 - 19856 3492 futex_ SLl ? 0:00 /sbin/multipathd
1 0 1633 1 20 0 14776 448 poll_s Ss ? 0:00 /usr/sbin/rpc.mountd --manage-gids
4 0 1711 1 20 0 37192 2224 ep_pol Ss ? 0:00 /usr/lib/postfix/master
4 104 2175 1711 20 0 68784 3048 ep_pol S ? 0:00 \_ pickup -l -t fifo -u -c
4 104 2176 1711 20 0 68940 3096 ep_pol S ? 0:00 \_ qmgr -l -t fifo -u
5 1 1716 1 20 0 14656 732 poll_s Ss ? 0:00 /usr/sbin/slpd
4 0 1769 1 -100 - 160964 87256 poll_s SLsl ? 0:06 corosync -f
4 0 1785 1 20 0 114464 5320 poll_s Ss ? 0:00 sshd: root@pts/0
4 0 2274 1785 20 0 19460 2248 wait Ss pts/0 0:00 \_ -bash
4 0 4017 2274 20 0 6952 1028 - R+ pts/0 0:00 \_ ps faxl
4 0 1904 1 20 0 54956 3552 poll_s Sl ? 0:00 /usr/sbin/console-kit-daemon --no-daemon
5 0 2150 1 20 0 49252 1096 poll_s Ss ? 0:00 /usr/sbin/sshd
5 103 2297 1 20 0 21584 1384 poll_s Ss ? 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 103:108
5 0 2356 1 20 0 61324 1708 poll_s Ssl ? 0:00 fenced
5 0 2381 1 -100 - 102476 2024 poll_s Ssl ? 0:00 dlm_controld
5 0 2424 1 -100 - 79972 1988 poll_s Ssl ? 0:00 gfs_controld
5 0 2495 1 20 0 39744 1192 poll_s Ssl ? 0:00 /sbin/clvmd -T20
5 0 2589 1 19 -1 32440 5848 wait S<Ls ? 0:00 rgmanager
5 0 2590 2589 19 -1 66312 2612 poll_s S<l ? 0:00 \_ rgmanager
0 0 2647 1 20 0 6072 632 n_tty_ Ss+ tty1 0:00 /sbin/getty -8 38400 tty1
5 0 3718 1 -2 - 82804 1352 ep_pol Ss ? 7:00 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3719 3718 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3720 3718 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 3721 3718 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 3722 3718 20 0 82804 624 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

ctdb status on 192.168.224.221:
Number of nodes:2
pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)
pnn:1 192.168.224.222 UNHEALTHY
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:0

ctdb status on 192.168.224.222:
Number of nodes:2
pnn:0 192.168.224.221 UNHEALTHY
pnn:1 192.168.224.222 BANNED|UNHEALTHY (THIS NODE)
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:1
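
To compare the two status dumps above side by side, a small Python sketch may help. It is not part of ctdb: `parse_status` and the sample strings are illustrative names, and the parser assumes the pasted text has exactly the layout shown in this post (`pnn:N ADDRESS FLAGS`, then `key:value` lines).

```python
import re

# Matches node lines such as "pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)".
NODE_RE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_status(text):
    """Parse pasted `ctdb status` text into (nodes, fields)."""
    nodes = {}
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = NODE_RE.match(line)
        if match:
            pnn, addr, flags, this_node = match.groups()
            nodes[int(pnn)] = {
                "addr": addr,
                "flags": set(flags.split("|")),
                "this_node": this_node is not None,
            }
        elif ":" in line and not line.startswith("hash:"):
            # Remaining "key:value" lines, e.g. "Recovery master:0".
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return nodes, fields


# Sample input copied from the two status dumps above.
STATUS_221 = """Number of nodes:2
pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)
pnn:1 192.168.224.222 UNHEALTHY
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:0
"""

STATUS_222 = """Number of nodes:2
pnn:0 192.168.224.221 UNHEALTHY
pnn:1 192.168.224.222 BANNED|UNHEALTHY (THIS NODE)
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:1
"""

nodes_a, fields_a = parse_status(STATUS_221)
nodes_b, fields_b = parse_status(STATUS_222)

# Report every field on which the two nodes disagree.
for key in sorted(fields_a):
    if fields_a[key] != fields_b.get(key):
        print(f"{key}: .221 says {fields_a[key]!r}, .222 says {fields_b.get(key)!r}")

# Report per-node flags as seen from each side.
for pnn in sorted(nodes_a):
    print(pnn, sorted(nodes_a[pnn]["flags"]), sorted(nodes_b[pnn]["flags"]))
```

Run on the two dumps above, the sketch reports the `Recovery master` field as the one on which the nodes disagree, and shows that each node lists only itself as BANNED.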

cat log.ctdb on 192.168.224.221:
2010/03/29 13:02:38.191668 [ 4068]: Starting CTDBD as pid : 4068
2010/03/29 13:02:39.204880 [ 4068]: Freeze priority 1
2010/03/29 13:02:39.205016 [ 4068]: Freeze priority 2
2010/03/29 13:02:39.205110 [ 4068]: Freeze priority 3
2010/03/29 13:02:43.241245 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:02:43.241428 [ 4072]: Take the recovery lock
2010/03/29 13:02:43.243224 [ 4072]: Recovery lock taken successfully
2010/03/29 13:02:43.243591 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:02:43.245257 [ 4068]: Freeze priority 1
2010/03/29 13:02:43.246451 [ 4068]: Freeze priority 2
2010/03/29 13:02:43.247685 [ 4068]: Freeze priority 3
2010/03/29 13:03:03.405623 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:56 opcode:70 dstnode:0
2010/03/29 13:03:03.405707 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:03:03.405722 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:03:03.405734 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:03:03.405751 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:65593 opcode:70 dstnode:1
2010/03/29 13:03:03.405763 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:03:03.405773 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:03:03.405783 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:03:03.405796 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:03:03.405807 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:03:03.405817 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:03:13.251469 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4089
2010/03/29 13:03:13.251592 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4089
2010/03/29 13:03:13.251615 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:03:13.255105 [ 4089]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4089
2010/03/29 13:03:13.286627 [ 4089]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130313.4089
2010/03/29 13:03:13.288128 [ 4072]: Dropped orphaned reply control with reqid:56
2010/03/29 13:03:13.290556 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:03:13.290660 [ 4072]: Take the recovery lock
2010/03/29 13:03:13.293747 [ 4072]: Recovery lock taken successfully
2010/03/29 13:03:13.294370 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:03:13.296576 [ 4068]: Freeze priority 1
2010/03/29 13:03:13.297871 [ 4068]: Freeze priority 2
2010/03/29 13:03:13.298974 [ 4068]: Freeze priority 3
2010/03/29 13:03:15.405748 [ 4072]: Dropped orphaned reply control with reqid:65593
2010/03/29 13:03:33.412597 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:87 opcode:70 dstnode:0
2010/03/29 13:03:33.412757 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:03:33.412772 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:03:33.412787 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:03:33.412807 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:65624 opcode:70 dstnode:1
2010/03/29 13:03:33.412819 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:03:33.412830 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:03:33.412839 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:03:33.412852 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:03:33.412862 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:03:33.412873 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:03:43.301538 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4106
2010/03/29 13:03:43.301682 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4106
2010/03/29 13:03:43.301722 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:03:43.305339 [ 4106]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4106
2010/03/29 13:03:43.337079 [ 4106]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130343.4106
2010/03/29 13:03:43.339670 [ 4072]: Dropped orphaned reply control with reqid:87
2010/03/29 13:03:43.342006 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:03:43.342042 [ 4072]: Take the recovery lock
2010/03/29 13:03:43.344683 [ 4072]: Recovery lock taken successfully
2010/03/29 13:03:43.345302 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:03:43.347109 [ 4068]: Freeze priority 1
2010/03/29 13:03:43.348782 [ 4068]: Freeze priority 2
2010/03/29 13:03:43.351026 [ 4068]: Freeze priority 3
2010/03/29 13:03:44.406965 [ 4072]: Dropped orphaned reply control with reqid:65624
2010/03/29 13:04:03.395103 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:118 opcode:70 dstnode:0
2010/03/29 13:04:03.395264 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:04:03.395280 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:04:03.395299 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:04:03.395323 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:65655 opcode:70 dstnode:1
2010/03/29 13:04:03.395335 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:04:03.395345 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:04:03.395355 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:04:03.395368 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:04:03.395383 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:04:03.395396 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:04:13.355993 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4122
2010/03/29 13:04:13.356136 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4122
2010/03/29 13:04:13.356160 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:04:13.359687 [ 4122]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4122
2010/03/29 13:04:13.393713 [ 4122]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130413.4122
2010/03/29 13:04:13.395814 [ 4072]: Dropped orphaned reply control with reqid:118
2010/03/29 13:04:13.398145 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:04:13.398186 [ 4072]: Take the recovery lock
2010/03/29 13:04:13.457013 [ 4072]: Recovery lock taken successfully
2010/03/29 13:04:13.457673 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:04:13.459520 [ 4072]: client/ctdb_client.c:718 reqid 65655 not found
2010/03/29 13:04:13.459786 [ 4068]: Freeze priority 1
2010/03/29 13:04:13.468732 [ 4068]: Freeze priority 2
2010/03/29 13:04:13.477207 [ 4068]: Freeze priority 3
2010/03/29 13:04:34.407459 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:149 opcode:70 dstnode:0
2010/03/29 13:04:34.407578 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:04:34.407594 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:04:34.407605 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:04:34.407653 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:65686 opcode:70 dstnode:1
2010/03/29 13:04:34.407667 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:04:34.407677 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:04:34.407687 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:04:34.407699 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:04:34.407709 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:04:34.407720 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:04:43.499419 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4137
2010/03/29 13:04:43.499565 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4137
2010/03/29 13:04:43.499593 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:04:43.503510 [ 4137]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4137
2010/03/29 13:04:43.536404 [ 4137]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130443.4137
2010/03/29 13:04:43.537587 [ 4072]: Dropped orphaned reply control with reqid:149
2010/03/29 13:04:43.539184 [ 4068]: Banning this node for 300 seconds
2010/03/29 13:04:43.539768 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:04:43.539786 [ 4072]: Take the recovery lock
2010/03/29 13:04:43.542203 [ 4072]: Recovery lock taken successfully
2010/03/29 13:04:43.542787 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:04:43.544891 [ 4068]: Freeze priority 1
2010/03/29 13:04:43.545748 [ 4068]: Freeze priority 2
2010/03/29 13:04:43.546805 [ 4068]: Freeze priority 3
2010/03/29 13:04:45.403611 [ 4072]: Dropped orphaned reply control with reqid:65686
2010/03/29 13:04:51.177378 [ 4068]: Freeze priority 1
2010/03/29 13:04:52.177340 [ 4068]: Freeze priority 2
2010/03/29 13:04:53.177349 [ 4068]: Freeze priority 3
2010/03/29 13:05:04.404000 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:182 opcode:70 dstnode:0
2010/03/29 13:05:04.404071 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:05:04.404084 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:05:04.404095 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:05:04.404112 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:65719 opcode:70 dstnode:1
2010/03/29 13:05:04.404123 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:05:04.404133 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:05:04.404142 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:05:04.404155 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:05:04.404165 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:05:04.404176 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:05:10.178025 [ 4068]: Freeze priority 1
2010/03/29 13:05:11.177985 [ 4068]: Freeze priority 2
2010/03/29 13:05:12.178079 [ 4068]: Freeze priority 3
2010/03/29 13:05:13.549876 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4150
2010/03/29 13:05:13.549928 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4150
2010/03/29 13:05:13.549947 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:05:13.553916 [ 4150]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4150
2010/03/29 13:05:13.586760 [ 4150]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130513.4150
2010/03/29 13:05:13.587930 [ 4072]: Dropped orphaned reply control with reqid:182
2010/03/29 13:05:14.459191 [ 4072]: client/ctdb_client.c:718 reqid 65719 not found
2010/03/29 13:05:21.531947 [ 4068]: Freeze priority 1
2010/03/29 13:05:21.535199 [ 4068]: Freeze priority 2
2010/03/29 13:05:21.535907 [ 4068]: Freeze priority 3
... repeated a trillion times ...
2010/03/29 13:09:43.541414 [ 4068]: Banning timedout
2010/03/29 13:09:45.767566 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:09:45.767691 [ 4072]: Take the recovery lock
2010/03/29 13:09:45.769760 [ 4072]: Recovery lock taken successfully
2010/03/29 13:09:45.769929 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:09:45.777248 [ 4068]: Freeze priority 1
2010/03/29 13:09:45.778275 [ 4068]: Freeze priority 2
2010/03/29 13:09:45.779334 [ 4068]: Freeze priority 3
2010/03/29 13:10:06.406493 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:1298 opcode:70 dstnode:0
2010/03/29 13:10:06.406576 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:10:06.406593 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:10:06.406604 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:10:06.406622 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:66835 opcode:70 dstnode:1
2010/03/29 13:10:06.406634 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:10:06.406643 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:10:06.406686 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:10:06.406701 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:10:06.406712 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:10:06.406722 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:10:15.784993 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4190
2010/03/29 13:10:15.785166 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4190
2010/03/29 13:10:15.785194 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:10:15.789274 [ 4190]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4190
2010/03/29 13:10:15.827736 [ 4190]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131015.4190
2010/03/29 13:10:15.829889 [ 4072]: Dropped orphaned reply control with reqid:1298
2010/03/29 13:10:15.831340 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:10:15.831359 [ 4072]: Take the recovery lock
2010/03/29 13:10:15.834004 [ 4072]: Recovery lock taken successfully
2010/03/29 13:10:15.834596 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:10:15.836090 [ 4068]: Freeze priority 1
2010/03/29 13:10:15.837552 [ 4068]: Freeze priority 2
2010/03/29 13:10:15.838849 [ 4068]: Freeze priority 3
2010/03/29 13:10:18.406113 [ 4072]: Dropped orphaned reply control with reqid:66835
2010/03/29 13:10:36.406438 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:1329 opcode:70 dstnode:0
2010/03/29 13:10:36.406540 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:10:36.406555 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:10:36.406572 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:10:36.406593 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:66866 opcode:70 dstnode:1
2010/03/29 13:10:36.406605 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:10:36.406614 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:10:36.406624 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:10:36.406665 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:10:36.406681 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:10:36.406723 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:10:45.841599 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4203
2010/03/29 13:10:45.841755 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4203
2010/03/29 13:10:45.841778 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:10:45.845600 [ 4203]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4203
2010/03/29 13:10:45.898163 [ 4203]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131045.4203
2010/03/29 13:10:45.899889 [ 4072]: Dropped orphaned reply control with reqid:1329
2010/03/29 13:10:45.901577 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:10:45.901602 [ 4072]: Take the recovery lock
2010/03/29 13:10:45.904040 [ 4072]: Recovery lock taken successfully
2010/03/29 13:10:45.904622 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:10:45.917889 [ 4068]: Freeze priority 1
2010/03/29 13:10:45.921511 [ 4068]: Freeze priority 2
2010/03/29 13:10:45.922413 [ 4068]: Freeze priority 3
2010/03/29 13:10:47.406762 [ 4072]: Dropped orphaned reply control with reqid:66866
2010/03/29 13:11:06.407194 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:1360 opcode:70 dstnode:0
2010/03/29 13:11:06.407327 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:11:06.407344 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:11:06.407358 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:11:06.407375 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:66897 opcode:70 dstnode:1
2010/03/29 13:11:06.407386 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:11:06.407397 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:11:06.407407 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:11:06.407419 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:11:06.407430 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:11:06.407440 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:11:15.926629 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4217
2010/03/29 13:11:15.926843 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4217
2010/03/29 13:11:15.926875 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:11:15.930802 [ 4217]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4217
2010/03/29 13:11:15.964261 [ 4217]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131115.4217
2010/03/29 13:11:15.966026 [ 4072]: Dropped orphaned reply control with reqid:1360
2010/03/29 13:11:15.967511 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:11:15.967561 [ 4072]: Take the recovery lock
2010/03/29 13:11:15.971621 [ 4072]: Recovery lock taken successfully
2010/03/29 13:11:15.972213 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:11:15.974382 [ 4068]: Freeze priority 1
2010/03/29 13:11:15.975956 [ 4068]: Freeze priority 2
2010/03/29 13:11:15.977374 [ 4068]: Freeze priority 3
2010/03/29 13:11:17.407199 [ 4072]: Dropped orphaned reply control with reqid:66897
2010/03/29 13:11:36.395888 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:1391 opcode:70 dstnode:0
2010/03/29 13:11:36.396046 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:11:36.396063 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:11:36.396075 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:11:36.396091 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:66928 opcode:70 dstnode:1
2010/03/29 13:11:36.396102 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:11:36.396112 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:11:36.396121 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:11:36.396134 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:11:36.396144 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:11:36.396154 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:11:45.980287 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4230
2010/03/29 13:11:45.980444 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4230
2010/03/29 13:11:45.980473 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:11:45.984634 [ 4230]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4230
2010/03/29 13:11:46.017530 [ 4230]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131145.4230
2010/03/29 13:11:46.019194 [ 4072]: Dropped orphaned reply control with reqid:1391
2010/03/29 13:11:46.020813 [ 4068]: Banning this node for 300 seconds
2010/03/29 13:11:46.021347 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:11:46.021366 [ 4072]: Take the recovery lock
2010/03/29 13:11:46.023910 [ 4072]: Recovery lock taken successfully
2010/03/29 13:11:46.024553 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:11:46.026092 [ 4068]: Freeze priority 1
2010/03/29 13:11:46.027286 [ 4068]: Freeze priority 2
2010/03/29 13:11:46.028310 [ 4068]: Freeze priority 3
2010/03/29 13:11:47.403852 [ 4072]: Dropped orphaned reply control with reqid:66928
2010/03/29 13:11:55.180817 [ 4068]: Freeze priority 1
2010/03/29 13:11:56.181889 [ 4068]: Freeze priority 2
2010/03/29 13:11:57.181754 [ 4068]: Freeze priority 3
2010/03/29 13:12:06.404353 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:1424 opcode:70 dstnode:0
2010/03/29 13:12:06.404575 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:12:06.404589 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:12:06.404601 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:12:06.404615 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:66961 opcode:70 dstnode:1
2010/03/29 13:12:06.404626 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:12:06.404636 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:12:06.404646 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:12:07.404284 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:12:07.404336 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:12:07.404351 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:12:14.179726 [ 4068]: Freeze priority 1
2010/03/29 13:12:15.182130 [ 4068]: Freeze priority 2
2010/03/29 13:12:16.031421 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4243
2010/03/29 13:12:16.031569 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4243
2010/03/29 13:12:16.031615 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:12:16.035529 [ 4243]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4243
2010/03/29 13:12:16.068546 [ 4243]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131216.4243
2010/03/29 13:12:16.070264 [ 4072]: Dropped orphaned reply control with reqid:1424
2010/03/29 13:12:16.457302 [ 4072]: client/ctdb_client.c:718 reqid 66961 not found
2010/03/29 13:12:16.496875 [ 4068]: Freeze priority 3
2010/03/29 13:12:20.503948 [ 4068]: Freeze priority 1
2010/03/29 13:12:20.504355 [ 4068]: Freeze priority 2
... repeated a trillion times ...
2010/03/29 13:16:46.031944 [ 4068]: Banning timedout
2010/03/29 13:16:49.227965 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:16:49.228086 [ 4072]: Take the recovery lock
2010/03/29 13:16:49.230146 [ 4072]: Recovery lock taken successfully
2010/03/29 13:16:49.230453 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:16:49.232319 [ 4068]: Freeze priority 1
2010/03/29 13:16:49.233720 [ 4068]: Freeze priority 2
2010/03/29 13:16:49.234703 [ 4068]: Freeze priority 3
2010/03/29 13:17:09.405855 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:2346 opcode:70 dstnode:0
2010/03/29 13:17:09.405994 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:17:09.406011 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:17:09.406022 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:17:09.406039 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:67883 opcode:70 dstnode:1
2010/03/29 13:17:09.406051 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:17:09.406060 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:17:09.406070 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:17:09.406124 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:17:09.406137 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:17:09.406147 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:17:19.238485 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4285
2010/03/29 13:17:19.238622 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4285
2010/03/29 13:17:19.238648 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:17:19.242481 [ 4285]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4285
2010/03/29 13:17:19.276291 [ 4285]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131719.4285
2010/03/29 13:17:19.277725 [ 4072]: Dropped orphaned reply control with reqid:2346
2010/03/29 13:17:19.279326 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:17:19.279351 [ 4072]: Take the recovery lock
2010/03/29 13:17:19.281940 [ 4072]: Recovery lock taken successfully
2010/03/29 13:17:19.282370 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:17:19.294234 [ 4068]: Freeze priority 1
2010/03/29 13:17:19.295657 [ 4068]: Freeze priority 2
2010/03/29 13:17:19.296822 [ 4068]: Freeze priority 3
2010/03/29 13:17:21.406882 [ 4072]: Dropped orphaned reply control with reqid:67883
2010/03/29 13:17:39.407638 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:2377 opcode:70 dstnode:0
2010/03/29 13:17:39.407762 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:17:39.407778 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:17:39.407794 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:17:39.407816 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:67914 opcode:70 dstnode:1
2010/03/29 13:17:39.407828 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:17:39.407838 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:17:39.407848 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:17:39.407860 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:17:39.407871 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:17:39.407882 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:17:49.300931 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4302
2010/03/29 13:17:49.301306 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4302
2010/03/29 13:17:49.302280 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:17:49.307562 [ 4302]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4302
2010/03/29 13:17:49.345069 [ 4302]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131749.4302
2010/03/29 13:17:49.346456 [ 4072]: Dropped orphaned reply control with reqid:2377
2010/03/29 13:17:49.347964 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:17:49.347985 [ 4072]: Take the recovery lock
2010/03/29 13:17:49.350187 [ 4072]: Recovery lock taken successfully
2010/03/29 13:17:49.350654 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:17:49.352459 [ 4068]: Freeze priority 1
2010/03/29 13:17:49.353906 [ 4068]: Freeze priority 2
2010/03/29 13:17:49.354907 [ 4068]: Freeze priority 3
2010/03/29 13:17:50.404053 [ 4072]: Dropped orphaned reply control with reqid:67914
2010/03/29 13:18:09.406597 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:2408 opcode:70 dstnode:0
2010/03/29 13:18:09.406706 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:18:09.406748 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:18:09.406761 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:18:09.406778 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:67945 opcode:70 dstnode:1
2010/03/29 13:18:09.406790 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:18:09.406800 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:18:09.406809 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:18:09.406823 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:18:09.406833 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:18:09.406844 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:18:19.358080 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4315
2010/03/29 13:18:19.358304 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4315
2010/03/29 13:18:19.358344 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:18:19.362608 [ 4315]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4315
2010/03/29 13:18:19.395404 [ 4072]: Dropped orphaned reply control with reqid:2408
2010/03/29 13:18:19.458323 [ 4315]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131819.4315
2010/03/29 13:18:19.459796 [ 4072]: client/ctdb_client.c:718 reqid 67945 not found
2010/03/29 13:18:19.500019 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:18:19.500071 [ 4072]: Take the recovery lock
2010/03/29 13:18:19.502355 [ 4072]: Recovery lock taken successfully
2010/03/29 13:18:19.502502 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:18:19.516028 [ 4068]: Freeze priority 1
2010/03/29 13:18:19.523465 [ 4068]: Freeze priority 2
2010/03/29 13:18:19.529760 [ 4068]: Freeze priority 3
2010/03/29 13:18:40.407183 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:2439 opcode:70 dstnode:0
2010/03/29 13:18:40.407388 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:18:40.407405 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:18:40.407417 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:18:40.407437 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:67976 opcode:70 dstnode:1
2010/03/29 13:18:40.407449 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:18:40.407459 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:18:40.407469 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:18:40.407482 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:18:40.407493 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:18:40.407503 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:18:49.536144 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4328
2010/03/29 13:18:49.536310 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4328
2010/03/29 13:18:49.536353 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:18:49.540471 [ 4328]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4328
2010/03/29 13:18:49.574999 [ 4328]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131849.4328
2010/03/29 13:18:49.576372 [ 4072]: Dropped orphaned reply control with reqid:2439
2010/03/29 13:18:49.577864 [ 4068]: Banning this node for 300 seconds
2010/03/29 13:18:49.578420 [ 4072]: Taking out recovery lock from recovery daemon
2010/03/29 13:18:49.578461 [ 4072]: Take the recovery lock
2010/03/29 13:18:49.580985 [ 4072]: Recovery lock taken successfully
2010/03/29 13:18:49.581569 [ 4072]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:18:49.583159 [ 4068]: Freeze priority 1
2010/03/29 13:18:49.583923 [ 4068]: Freeze priority 2
2010/03/29 13:18:49.584964 [ 4068]: Freeze priority 3
2010/03/29 13:18:52.403474 [ 4072]: Dropped orphaned reply control with reqid:67976
2010/03/29 13:18:58.181464 [ 4068]: Freeze priority 1
2010/03/29 13:18:59.181320 [ 4068]: Freeze priority 2
2010/03/29 13:19:00.181373 [ 4068]: Freeze priority 3
2010/03/29 13:19:10.396117 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:2472 opcode:70 dstnode:0
2010/03/29 13:19:10.396183 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:19:10.396197 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:19:10.396208 [ 4072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:19:10.396223 [ 4072]: client/ctdb_client.c:771 control timed out. reqid:68009 opcode:70 dstnode:1
2010/03/29 13:19:10.396234 [ 4072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:19:10.396244 [ 4072]: Async operation failed with state 3, opcode:70
2010/03/29 13:19:10.396253 [ 4072]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:19:10.396266 [ 4072]: Async wait failed - fail_count=2
2010/03/29 13:19:10.396276 [ 4072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:19:10.396287 [ 4072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:19:17.181550 [ 4068]: Freeze priority 1
2010/03/29 13:19:18.181889 [ 4068]: Freeze priority 2
2010/03/29 13:19:19.181927 [ 4068]: Freeze priority 3
2010/03/29 13:19:19.588486 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4341
2010/03/29 13:19:19.588650 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4341
2010/03/29 13:19:19.588696 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:19:19.592916 [ 4341]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4341
2010/03/29 13:19:19.601901 [ 4072]: Dropped orphaned reply control with reqid:2472
2010/03/29 13:19:19.648327 [ 4341]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131919.4341
2010/03/29 13:19:20.459066 [ 4072]: client/ctdb_client.c:718 reqid 68009 not found
2010/03/29 13:19:27.525801 [ 4068]: Freeze priority 1
2010/03/29 13:19:27.528888 [ 4068]: Freeze priority 2
2010/03/29 13:19:27.529510 [ 4068]: Freeze priority 3
... repeated a trillion times ...
2010/03/29 13:23:49.582304 [ 4068]: Banning timedout
2010/03/29 13:23:52.992313 [ 4068]: Freeze priority 1
2010/03/29 13:23:52.992993 [ 4068]: Freeze priority 2
2010/03/29 13:23:52.993591 [ 4068]: Freeze priority 3
2010/03/29 13:24:23.017062 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4384
2010/03/29 13:24:23.017247 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4384
2010/03/29 13:24:23.017277 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:24:23.021698 [ 4384]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4384
2010/03/29 13:24:23.074976 [ 4384]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132423.4384
2010/03/29 13:24:23.075789 [ 4068]: Freeze priority 1
2010/03/29 13:24:23.076161 [ 4068]: Freeze priority 2
2010/03/29 13:24:23.076521 [ 4068]: Freeze priority 3
2010/03/29 13:24:53.078187 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4396
2010/03/29 13:24:53.078332 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4396
2010/03/29 13:24:53.078357 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:24:53.083262 [ 4396]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4396
2010/03/29 13:24:53.117516 [ 4396]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132453.4396
2010/03/29 13:24:53.131690 [ 4068]: Freeze priority 1
2010/03/29 13:24:53.132917 [ 4068]: Freeze priority 2
2010/03/29 13:24:53.133964 [ 4068]: Freeze priority 3
2010/03/29 13:25:23.144022 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4409
2010/03/29 13:25:23.144178 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4409
2010/03/29 13:25:23.144202 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:25:23.148637 [ 4409]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4409
2010/03/29 13:25:23.188240 [ 4409]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132523.4409
2010/03/29 13:25:23.233172 [ 4068]: Freeze priority 1
2010/03/29 13:25:23.234187 [ 4068]: Freeze priority 2
2010/03/29 13:25:23.235142 [ 4068]: Freeze priority 3
2010/03/29 13:25:53.238287 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4422
2010/03/29 13:25:53.238507 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4422
2010/03/29 13:25:53.238583 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:25:53.245123 [ 4422]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4422
2010/03/29 13:25:53.284057 [ 4422]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132553.4422
2010/03/29 13:25:53.321100 [ 4068]: Banning this node for 300 seconds
2010/03/29 13:25:53.330818 [ 4068]: Freeze priority 1
2010/03/29 13:25:53.331684 [ 4068]: Freeze priority 2
2010/03/29 13:25:53.332903 [ 4068]: Freeze priority 3
2010/03/29 13:26:23.335360 [ 4068]: Event script timed out : startrecovery count : 0 pid : 4435
2010/03/29 13:26:23.335544 [ 4068]: server/eventscript.c:508 Sending SIGTERM to child pid:4435
2010/03/29 13:26:23.335609 [ 4068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:26:23.340020 [ 4435]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4435
2010/03/29 13:26:23.378342 [ 4068]: Freeze priority 1
2010/03/29 13:26:23.379117 [ 4435]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132623.4435
2010/03/29 13:26:23.379856 [ 4068]: Freeze priority 2
2010/03/29 13:26:23.380306 [ 4068]: Freeze priority 3
2010/03/29 13:26:27.397589 [ 4068]: Freeze priority 1
... repeated a trillion times ...
2010/03/29 13:27:26.458919 [ 4533]: Timed out running script '/etc/ctdb/events.d/01.reclock shutdown ' after 1.2 seconds pid :4533
2010/03/29 13:27:26.461119 [ 4533]: Timed out running script '/etc/ctdb/events.d/01.reclock shutdown ' after 1.2 seconds pid :4533
2010/03/29 13:27:26.538304 [ 4533]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132726.4533

cat log.ctdb on 192.168.224.222:
2010/03/29 13:02:42.867632 [ 3718]: Starting CTDBD as pid : 3718
2010/03/29 13:02:43.252555 [ 3718]: Freeze priority 1
2010/03/29 13:02:43.253644 [ 3718]: Freeze priority 2
2010/03/29 13:02:43.254795 [ 3718]: Freeze priority 3
2010/03/29 13:03:13.303992 [ 3718]: Freeze priority 1
2010/03/29 13:03:13.305381 [ 3718]: Freeze priority 2
2010/03/29 13:03:13.306351 [ 3718]: Freeze priority 3
2010/03/29 13:03:14.468369 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3733
2010/03/29 13:03:14.468472 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3733
2010/03/29 13:03:14.468540 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:03:14.473278 [ 3733]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3733
2010/03/29 13:03:43.354880 [ 3718]: Freeze priority 1
2010/03/29 13:03:43.357643 [ 3718]: Freeze priority 2
2010/03/29 13:03:43.358702 [ 3718]: Freeze priority 3
2010/03/29 13:03:43.465965 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3749
2010/03/29 13:03:43.466060 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3749
2010/03/29 13:03:43.466133 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:03:43.472256 [ 3749]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 29.0 seconds pid :3749
2010/03/29 13:03:43.477552 [ 3733]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130314.3733
2010/03/29 13:04:13.465880 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3764
2010/03/29 13:04:13.466015 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3764
2010/03/29 13:04:13.466044 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:04:13.474976 [ 3764]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3764
2010/03/29 13:04:13.475853 [ 3718]: Freeze priority 1
2010/03/29 13:04:13.480937 [ 3749]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130343.3749
2010/03/29 13:04:13.484329 [ 3718]: Freeze priority 2
2010/03/29 13:04:13.498799 [ 3718]: Freeze priority 3
2010/03/29 13:04:13.510519 [ 3764]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130413.3764
2010/03/29 13:04:43.547187 [ 3718]: Banning this node for 300 seconds
2010/03/29 13:04:43.553129 [ 3718]: Freeze priority 1
2010/03/29 13:04:43.554051 [ 3718]: Freeze priority 2
2010/03/29 13:04:43.555016 [ 3718]: Freeze priority 3
2010/03/29 13:04:44.466442 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3778
2010/03/29 13:04:44.466621 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3778
2010/03/29 13:04:44.466794 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:04:44.472729 [ 3778]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3778
2010/03/29 13:05:13.596194 [ 3718]: Freeze priority 1
2010/03/29 13:05:13.596693 [ 3718]: Freeze priority 2
2010/03/29 13:05:13.597521 [ 3718]: Freeze priority 3
2010/03/29 13:05:14.466552 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3790
2010/03/29 13:05:14.466604 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3790
2010/03/29 13:05:14.466627 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:05:14.471644 [ 3790]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3790
2010/03/29 13:05:14.477317 [ 3778]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130444.3778
2010/03/29 13:05:14.508099 [ 3790]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329130514.3790
2010/03/29 13:05:18.515791 [ 3718]: Freeze priority 1
2010/03/29 13:05:18.516322 [ 3718]: Freeze priority 2
2010/03/29 13:05:18.516761 [ 3718]: Freeze priority 3
... repeated a trillion times ...
2010/03/29 13:09:43.548576 [ 3718]: Banning timedout
2010/03/29 13:09:45.782484 [ 3718]: Freeze priority 1
2010/03/29 13:09:45.783514 [ 3718]: Freeze priority 2
2010/03/29 13:09:45.784539 [ 3718]: Freeze priority 3
2010/03/29 13:10:15.841153 [ 3718]: Freeze priority 1
2010/03/29 13:10:15.842405 [ 3718]: Freeze priority 2
2010/03/29 13:10:15.843701 [ 3718]: Freeze priority 3
2010/03/29 13:10:17.463098 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3831
2010/03/29 13:10:17.463194 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3831
2010/03/29 13:10:17.463310 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:10:17.468496 [ 3831]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3831
2010/03/29 13:10:45.923063 [ 3718]: Freeze priority 1
2010/03/29 13:10:45.926392 [ 3718]: Freeze priority 2
2010/03/29 13:10:45.927442 [ 3718]: Freeze priority 3
2010/03/29 13:10:46.460682 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3844
2010/03/29 13:10:46.460836 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3844
2010/03/29 13:10:46.460935 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:10:46.466722 [ 3844]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 29.0 seconds pid :3844
2010/03/29 13:10:46.472477 [ 3831]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131017.3831
2010/03/29 13:11:15.979213 [ 3718]: Freeze priority 1
2010/03/29 13:11:15.981135 [ 3718]: Freeze priority 2
2010/03/29 13:11:15.981984 [ 3718]: Freeze priority 3
2010/03/29 13:11:16.460710 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3857
2010/03/29 13:11:16.460854 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3857
2010/03/29 13:11:16.461015 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:11:16.467334 [ 3857]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3857
2010/03/29 13:11:16.473104 [ 3844]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131046.3844
2010/03/29 13:11:46.025220 [ 3718]: Banning this node for 300 seconds
2010/03/29 13:11:46.030948 [ 3718]: Freeze priority 1
2010/03/29 13:11:46.031958 [ 3718]: Freeze priority 2
2010/03/29 13:11:46.033029 [ 3718]: Freeze priority 3
2010/03/29 13:11:46.461049 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3870
2010/03/29 13:11:46.461157 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3870
2010/03/29 13:11:46.461253 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:11:46.468274 [ 3870]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3870
2010/03/29 13:11:46.475603 [ 3857]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131116.3857
2010/03/29 13:12:16.075108 [ 3718]: Freeze priority 1
2010/03/29 13:12:16.075578 [ 3718]: Freeze priority 2
2010/03/29 13:12:16.075964 [ 3718]: Freeze priority 3
2010/03/29 13:12:16.460923 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3883
2010/03/29 13:12:16.461094 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3883
2010/03/29 13:12:16.461136 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:12:16.466258 [ 3883]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3883
2010/03/29 13:12:16.471621 [ 3870]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131146.3870
2010/03/29 13:12:16.499255 [ 3883]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131216.3883
2010/03/29 13:12:23.550917 [ 3718]: Freeze priority 1
2010/03/29 13:12:23.551444 [ 3718]: Freeze priority 2
2010/03/29 13:12:23.551885 [ 3718]: Freeze priority 3
... repeated a trillion times ...
2010/03/29 13:16:46.027045 [ 3718]: Banning timedout
2010/03/29 13:16:49.238066 [ 3718]: Freeze priority 1
2010/03/29 13:16:49.239297 [ 3718]: Freeze priority 2
2010/03/29 13:16:49.240433 [ 3718]: Freeze priority 3
2010/03/29 13:17:19.299505 [ 3718]: Freeze priority 1
2010/03/29 13:17:19.300733 [ 3718]: Freeze priority 2
2010/03/29 13:17:19.301805 [ 3718]: Freeze priority 3
2010/03/29 13:17:20.469640 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3926
2010/03/29 13:17:20.469739 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3926
2010/03/29 13:17:20.469760 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:17:20.474647 [ 3926]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3926
2010/03/29 13:17:49.356863 [ 3718]: Freeze priority 1
2010/03/29 13:17:49.358422 [ 3718]: Freeze priority 2
2010/03/29 13:17:49.359390 [ 3718]: Freeze priority 3
2010/03/29 13:17:49.463192 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3942
2010/03/29 13:17:49.463304 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3942
2010/03/29 13:17:49.463492 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:17:49.469847 [ 3942]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 29.0 seconds pid :3942
2010/03/29 13:17:49.475588 [ 3926]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131720.3926
2010/03/29 13:18:19.461981 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3953
2010/03/29 13:18:19.462143 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3953
2010/03/29 13:18:19.462183 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:18:19.467383 [ 3953]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3953
2010/03/29 13:18:19.526714 [ 3718]: Freeze priority 1
2010/03/29 13:18:19.533024 [ 3718]: Freeze priority 2
2010/03/29 13:18:19.533922 [ 3942]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131749.3942
2010/03/29 13:18:19.534909 [ 3718]: Freeze priority 3
2010/03/29 13:18:19.535322 [ 3953]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131819.3953
2010/03/29 13:18:49.581196 [ 3718]: Banning this node for 300 seconds
2010/03/29 13:18:49.586456 [ 3718]: Freeze priority 1
2010/03/29 13:18:49.587541 [ 3718]: Freeze priority 2
2010/03/29 13:18:49.588503 [ 3718]: Freeze priority 3
2010/03/29 13:18:51.458746 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3969
2010/03/29 13:18:51.458869 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3969
2010/03/29 13:18:51.458896 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:18:51.464921 [ 3969]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :3969
2010/03/29 13:19:19.604956 [ 3718]: Freeze priority 1
2010/03/29 13:19:19.605460 [ 3718]: Freeze priority 2
2010/03/29 13:19:19.605752 [ 3718]: Freeze priority 3
2010/03/29 13:19:20.459208 [ 3718]: Event script timed out : startrecovery count : 0 pid : 3981
2010/03/29 13:19:20.459265 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:3981
2010/03/29 13:19:20.459291 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:19:20.464984 [ 3981]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 29.0 seconds pid :3981
2010/03/29 13:19:20.470774 [ 3969]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131851.3969
2010/03/29 13:19:20.503980 [ 3981]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329131920.3981
2010/03/29 13:19:24.509575 [ 3718]: Freeze priority 1
2010/03/29 13:19:24.509918 [ 3718]: Freeze priority 2
2010/03/29 13:19:24.510218 [ 3718]: Freeze priority 3
... repeated a trillion times ...
2010/03/29 13:23:49.583719 [ 3718]: Banning timedout
2010/03/29 13:23:52.977761 [ 3722]: Taking out recovery lock from recovery daemon
2010/03/29 13:23:52.977996 [ 3722]: Take the recovery lock
2010/03/29 13:23:52.983145 [ 3722]: Recovery lock taken successfully
2010/03/29 13:23:52.983347 [ 3722]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:23:52.993309 [ 3718]: Freeze priority 1
2010/03/29 13:23:52.994341 [ 3718]: Freeze priority 2
2010/03/29 13:23:52.994927 [ 3718]: Freeze priority 3
2010/03/29 13:24:13.136669 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:3339 opcode:70 dstnode:0
2010/03/29 13:24:13.136804 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:24:13.136820 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:24:13.136832 [ 3722]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:24:13.136850 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:68876 opcode:70 dstnode:1
2010/03/29 13:24:13.136861 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:24:13.136871 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:24:13.136880 [ 3722]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:24:13.136918 [ 3722]: Async wait failed - fail_count=2
2010/03/29 13:24:13.136930 [ 3722]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:24:13.136940 [ 3722]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:24:22.997458 [ 3718]: Event script timed out : startrecovery count : 0 pid : 4024
2010/03/29 13:24:22.997621 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:4024
2010/03/29 13:24:22.997656 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:24:23.001886 [ 4024]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4024
2010/03/29 13:24:23.038842 [ 4024]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132423.4024
2010/03/29 13:24:23.040399 [ 3722]: client/ctdb_client.c:718 reqid 68876 not found
2010/03/29 13:24:23.040429 [ 3722]: Dropped orphaned reply control with reqid:3339
2010/03/29 13:24:23.059332 [ 3722]: Taking out recovery lock from recovery daemon
2010/03/29 13:24:23.059379 [ 3722]: Take the recovery lock
2010/03/29 13:24:23.063956 [ 3722]: Recovery lock taken successfully
2010/03/29 13:24:23.064460 [ 3722]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:24:23.076450 [ 3718]: Freeze priority 1
2010/03/29 13:24:23.077724 [ 3718]: Freeze priority 2
2010/03/29 13:24:23.078095 [ 3718]: Freeze priority 3
2010/03/29 13:24:43.133967 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:3370 opcode:70 dstnode:0
2010/03/29 13:24:43.134110 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:24:43.134126 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:24:43.134137 [ 3722]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:24:43.134154 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:68907 opcode:70 dstnode:1
2010/03/29 13:24:43.134165 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:24:43.134175 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:24:43.134184 [ 3722]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:24:43.134196 [ 3722]: Async wait failed - fail_count=2
2010/03/29 13:24:43.134221 [ 3722]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:24:43.134236 [ 3722]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:24:53.080108 [ 3718]: Event script timed out : startrecovery count : 0 pid : 4037
2010/03/29 13:24:53.080252 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:4037
2010/03/29 13:24:53.080277 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:24:53.086024 [ 4037]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4037
2010/03/29 13:24:53.119816 [ 4037]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132453.4037
2010/03/29 13:24:53.123049 [ 3722]: client/ctdb_client.c:718 reqid 68907 not found
2010/03/29 13:24:53.123095 [ 3722]: Dropped orphaned reply control with reqid:3370
2010/03/29 13:24:53.126833 [ 3722]: Taking out recovery lock from recovery daemon
2010/03/29 13:24:53.126877 [ 3722]: Take the recovery lock
2010/03/29 13:24:53.130454 [ 3722]: Recovery lock taken successfully
2010/03/29 13:24:53.130640 [ 3722]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:24:53.132786 [ 3718]: Freeze priority 1
2010/03/29 13:24:53.133973 [ 3718]: Freeze priority 2
2010/03/29 13:24:53.134986 [ 3718]: Freeze priority 3
2010/03/29 13:25:13.143946 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:3401 opcode:70 dstnode:0
2010/03/29 13:25:13.153347 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:25:13.153728 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:25:13.153750 [ 3722]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:25:13.153778 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:68938 opcode:70 dstnode:1
2010/03/29 13:25:13.153794 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:25:13.153804 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:25:13.153814 [ 3722]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:25:13.153826 [ 3722]: Async wait failed - fail_count=2
2010/03/29 13:25:13.153836 [ 3722]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:25:13.153862 [ 3722]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:25:23.183651 [ 3718]: Event script timed out : startrecovery count : 0 pid : 4050
2010/03/29 13:25:23.183783 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:4050
2010/03/29 13:25:23.183832 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:25:23.188079 [ 4050]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4050
2010/03/29 13:25:23.221547 [ 4050]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132523.4050
2010/03/29 13:25:23.223187 [ 3722]: client/ctdb_client.c:718 reqid 68938 not found
2010/03/29 13:25:23.223221 [ 3722]: Dropped orphaned reply control with reqid:3401
2010/03/29 13:25:23.229499 [ 3722]: Taking out recovery lock from recovery daemon
2010/03/29 13:25:23.229530 [ 3722]: Take the recovery lock
2010/03/29 13:25:23.232011 [ 3722]: Recovery lock taken successfully
2010/03/29 13:25:23.232303 [ 3722]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:25:23.234616 [ 3718]: Freeze priority 1
2010/03/29 13:25:23.235832 [ 3718]: Freeze priority 2
2010/03/29 13:25:23.236783 [ 3718]: Freeze priority 3
2010/03/29 13:25:44.135016 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:3432 opcode:70 dstnode:0
2010/03/29 13:25:44.138116 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:25:44.138167 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:25:44.138193 [ 3722]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:25:44.138219 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:68969 opcode:70 dstnode:1
2010/03/29 13:25:44.138238 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:25:44.138249 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:25:44.138258 [ 3722]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:25:44.138270 [ 3722]: Async wait failed - fail_count=2
2010/03/29 13:25:44.138284 [ 3722]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:25:44.138297 [ 3722]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:25:53.239529 [ 3718]: Event script timed out : startrecovery count : 0 pid : 4063
2010/03/29 13:25:53.240526 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:4063
2010/03/29 13:25:53.240590 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:25:53.262679 [ 4063]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4063
2010/03/29 13:25:53.317619 [ 4063]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132553.4063
2010/03/29 13:25:53.320639 [ 3722]: client/ctdb_client.c:718 reqid 68969 not found
2010/03/29 13:25:53.320671 [ 3722]: Dropped orphaned reply control with reqid:3432
2010/03/29 13:25:53.322731 [ 3718]: Banning this node for 300 seconds
2010/03/29 13:25:53.322808 [ 3722]: Taking out recovery lock from recovery daemon
2010/03/29 13:25:53.322831 [ 3722]: Take the recovery lock
2010/03/29 13:25:53.328911 [ 3722]: Recovery lock taken successfully
2010/03/29 13:25:53.329681 [ 3722]: Recovery lock taken successfully by recovery daemon
2010/03/29 13:25:53.332234 [ 3718]: Freeze priority 1
2010/03/29 13:25:53.333128 [ 3718]: Freeze priority 2
2010/03/29 13:25:53.334341 [ 3718]: Freeze priority 3
2010/03/29 13:26:03.455809 [ 3718]: Freeze priority 1
2010/03/29 13:26:04.455793 [ 3718]: Freeze priority 2
2010/03/29 13:26:05.455865 [ 3718]: Freeze priority 3
2010/03/29 13:26:14.145095 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:3465 opcode:70 dstnode:0
2010/03/29 13:26:14.145295 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:26:14.145311 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:26:14.145321 [ 3722]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:26:14.145336 [ 3722]: client/ctdb_client.c:771 control timed out. reqid:69002 opcode:70 dstnode:1
2010/03/29 13:26:14.145346 [ 3722]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/29 13:26:14.145355 [ 3722]: Async operation failed with state 3, opcode:70
2010/03/29 13:26:14.145364 [ 3722]: server/ctdb_recoverd.c:178 Node 1 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/29 13:26:16.135244 [ 3722]: Async wait failed - fail_count=2
2010/03/29 13:26:16.135309 [ 3722]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/29 13:26:16.135322 [ 3722]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/29 13:26:22.456041 [ 3718]: Freeze priority 1
2010/03/29 13:26:23.337748 [ 3718]: Event script timed out : startrecovery count : 0 pid : 4076
2010/03/29 13:26:23.337810 [ 3718]: server/eventscript.c:508 Sending SIGTERM to child pid:4076
2010/03/29 13:26:23.337863 [ 3718]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/29 13:26:23.341870 [ 4076]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4076
2010/03/29 13:26:23.375245 [ 4076]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132623.4076
2010/03/29 13:26:23.377374 [ 3722]: Dropped orphaned reply control with reqid:3465
2010/03/29 13:26:23.377408 [ 3722]: client/ctdb_client.c:718 reqid 69002 not found
2010/03/29 13:26:23.383810 [ 3718]: Freeze priority 2
2010/03/29 13:26:23.384169 [ 3718]: Freeze priority 3
2010/03/29 13:26:30.417436 [ 3718]: Freeze priority 1
... repeated a trillion times ...
2010/03/29 13:27:32.188713 [ 4160]: Timed out running script '/etc/ctdb/events.d/01.reclock shutdown ' after 1.5 seconds pid :4160
2010/03/29 13:27:32.190825 [ 4160]: Timed out running script '/etc/ctdb/events.d/01.reclock shutdown ' after 1.5 seconds pid :4160
2010/03/29 13:27:32.248602 [ 4160]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100329132732.4160
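
To condense logs like the two above, here is a minimal Python sketch that counts how often each event script timed out and how often the node banned itself. It assumes the log lines look exactly like the excerpts in this post; the names `summarize_timeouts` and `count_bans` are made up for illustration and are not part of ctdb.

```python
import re
from collections import Counter

# Assumed line shape (taken from the excerpts in this post):
#   ... Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4384
TIMEOUT_RE = re.compile(
    r"Timed out running script '(?P<script>\S+) (?P<event>\S+) ?' "
    r"after (?P<secs>[\d.]+) seconds"
)


def summarize_timeouts(lines):
    """Count timed-out event scripts, keyed by (script path, event name)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("script"), match.group("event"))] += 1
    return counts


def count_bans(lines):
    """Count how many times the node logged that it banned itself."""
    return sum(1 for line in lines if "Banning this node" in line)


# Small inline sample copied from the log above, so the sketch runs as-is.
SAMPLE = [
    "2010/03/29 13:24:23.021698 [ 4384]: Timed out running script "
    "'/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :4384",
    "2010/03/29 13:25:53.321100 [ 4068]: Banning this node for 300 seconds",
    "2010/03/29 13:27:26.458919 [ 4533]: Timed out running script "
    "'/etc/ctdb/events.d/01.reclock shutdown ' after 1.2 seconds pid :4533",
]

for (script, event), n in sorted(summarize_timeouts(SAMPLE).items()):
    print(f"{script} {event}: {n} timeout(s)")
print(f"bans: {count_bans(SAMPLE)}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize_timeouts(open("/var/log/log.ctdb"))`.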

/tmp/ctdb.event.20100329132732.4160 on 192.168.224.222:
init(1)-+-clvmd(2495)-+-{clvmd}(2506)
        |             `-{clvmd}(2507)
        |-console-kit-dae(1904)-+-{console-kit-da}(1978)
        |                       |-{console-kit-da}(1980)
        |                       |-{console-kit-da}(1981)
        |                       |-{console-kit-da}(1982)
        |                       |-{console-kit-da}(1983)
        |                       |-{console-kit-da}(1984)
        |                       |-{console-kit-da}(1985)
        |                       |-{console-kit-da}(1986)
        |                       |-{console-kit-da}(1987)
        |                       |-{console-kit-da}(1988)
        |                       |-{console-kit-da}(1989)
        |                       |-{console-kit-da}(1990)
        |                       |-{console-kit-da}(1991)
        |                       |-{console-kit-da}(1992)
        |                       |-{console-kit-da}(1993)
        |                       |-{console-kit-da}(1994)
        |                       |-{console-kit-da}(1995)
        |                       |-{console-kit-da}(1996)
        |                       |-{console-kit-da}(1997)
        |                       |-{console-kit-da}(1998)
        |                       |-{console-kit-da}(1999)
        |                       |-{console-kit-da}(2000)
        |                       |-{console-kit-da}(2001)
        |                       |-{console-kit-da}(2002)
        |                       |-{console-kit-da}(2003)
        |                       |-{console-kit-da}(2004)
        |                       |-{console-kit-da}(2005)
        |                       |-{console-kit-da}(2006)
        |                       |-{console-kit-da}(2007)
        |                       |-{console-kit-da}(2008)
        |                       |-{console-kit-da}(2009)
        |                       |-{console-kit-da}(2010)
        |                       |-{console-kit-da}(2011)
        |                       |-{console-kit-da}(2012)
        |                       |-{console-kit-da}(2013)
        |                       |-{console-kit-da}(2014)
        |                       |-{console-kit-da}(2015)
        |                       |-{console-kit-da}(2016)
        |                       |-{console-kit-da}(2017)
        |                       |-{console-kit-da}(2018)
        |                       |-{console-kit-da}(2019)
        |                       |-{console-kit-da}(2020)
        |                       |-{console-kit-da}(2021)
        |                       |-{console-kit-da}(2022)
        |                       |-{console-kit-da}(2023)
        |                       |-{console-kit-da}(2024)
        |                       |-{console-kit-da}(2025)
        |                       |-{console-kit-da}(2026)
        |                       |-{console-kit-da}(2027)
        |                       |-{console-kit-da}(2028)
        |                       |-{console-kit-da}(2029)
        |                       |-{console-kit-da}(2030)
        |                       |-{console-kit-da}(2031)
        |                       |-{console-kit-da}(2032)
        |                       |-{console-kit-da}(2033)
        |                       |-{console-kit-da}(2034)
        |                       |-{console-kit-da}(2035)
        |                       |-{console-kit-da}(2036)
        |                       |-{console-kit-da}(2037)
        |                       |-{console-kit-da}(2038)
        |                       |-{console-kit-da}(2039)
        |                       |-{console-kit-da}(2040)
        |                       `-{console-kit-da}(2058)
        |-corosync(1769)-+-{corosync}(1778)
        |                |-{corosync}(1779)
        |                |-{corosync}(2380)
        |                |-{corosync}(2420)
        |                |-{corosync}(2421)
        |                |-{corosync}(2461)
        |                |-{corosync}(2481)
        |                |-{corosync}(2501)
        |                |-{corosync}(2520)
        |                |-{corosync}(2530)
        |                |-{corosync}(2539)
        |                |-{corosync}(2542)
        |                |-{corosync}(2551)
        |                |-{corosync}(2554)
        |                |-{corosync}(2565)
        |                |-{corosync}(2568)
        |                `-{corosync}(2593)
        |-cron(1431)
        |-ctdbd(3718)
        |-ctdbd(3719)
        |-ctdbd(3720)
        |-ctdbd(3721)
        |-ctdbd(3722)
        |-ctdbd(4160)-+-ctdbd(4161)---sh(4165)---pstree(4167)
        |             `-sh(4166)---pstree(4168)
        |-dbus-daemon(1422)
        |-dlm_controld(2381)-+-{dlm_controld}(2384)
        |                    `-{dlm_controld}(2385)
        |-dnsmasq(1561)
        |-fenced(2356)-+-{fenced}(2358)
        |              `-{fenced}(2359)
        |-getty(1400)
        |-getty(1406)
        |-getty(1410)
        |-getty(1412)
        |-getty(1418)
        |-getty(2647)
        |-gfs_controld(2424)-+-{gfs_controld}(2427)
        |                    `-{gfs_controld}(2431)
        |-iscsid(848)
        |-iscsid(849)
        |-libvirtd(1480)-+-{libvirtd}(1486)
        |                |-{libvirtd}(1487)
        |                |-{libvirtd}(1488)
        |                |-{libvirtd}(1489)
        |                |-{libvirtd}(1490)
        |                `-{libvirtd}(1491)
        |-master(1711)-+-pickup(2175)
        |              `-qmgr(2176)
        |-multipathd(1599)-+-{multipathd}(1600)
        |                  |-{multipathd}(1603)
        |                  |-{multipathd}(1604)
        |                  |-{multipathd}(1605)
        |                  |-{multipathd}(1606)
        |                  `-{multipathd}(1607)
        |-nmbd(1345)
        |-ntpd(2297)
        |-portmap(1252)
        |-rgmanager(2589)---rgmanager(2590)-+-{rgmanager}(2591)
        |                                   |-{rgmanager}(3051)
        |                                   `-{rgmanager}(3052)
        |-rpc.mountd(1633)
        |-rpc.statd(1329)
        |-rsyslogd(1259)-+-{rsyslogd}(1287)
        |                |-{rsyslogd}(1288)
        |                `-{rsyslogd}(4094)
        |-slapd(1562)-+-{slapd}(1576)
        |             `-{slapd}(1577)
        |-slpd(1716)
        |-smbd(1251)---smbd(1540)
        |-sshd(1785)---bash(2274)
        |-sshd(2150)---sshd(4092)---ctdb(4152)---sleep(4164)
        |-udevd(274)-+-udevd(1900)
        |            `-udevd(2526)
        `-upstart-udev-br(
1: POSIX ADVISORY WRITE 2424 00:11:8612 0 EOF
2: POSIX ADVISORY WRITE 2381 00:11:8360 0 EOF
3: POSIX ADVISORY WRITE 2356 00:11:8234 0 EOF
4: FLOCK ADVISORY WRITE 1711 08:03:25362730 0 EOF
5: FLOCK ADVISORY WRITE 1711 08:03:269114 0 EOF
6: POSIX ADVISORY WRITE 1599 00:11:5787 0 EOF
7: POSIX ADVISORY WRITE 1562 08:03:8770362 1024 2047
8: POSIX ADVISORY READ 1540 00:11:4870 4 4
9: FLOCK ADVISORY WRITE 1407 00:11:5208 0 EOF
10: POSIX ADVISORY READ 1345 00:11:4945 4 4
11: POSIX ADVISORY READ 1251 00:11:4940 4 4
12: POSIX ADVISORY READ 1251 00:11:4939 4 4
13: POSIX ADVISORY READ 1251 00:11:4913 4 4
14: POSIX ADVISORY READ 1251 00:11:4937 4 4
15: POSIX ADVISORY READ 1251 00:11:4870 4 4
16: POSIX ADVISORY WRITE 1251 00:11:4932 0 0
17: POSIX ADVISORY READ 1345 00:11:4913 4 4
18: POSIX ADVISORY READ 1345 00:11:4870 4 4
19: POSIX ADVISORY WRITE 1345 00:11:4911 0 0
20: POSIX ADVISORY WRITE 1252 00:11:4738 0 EOF
21: POSIX ADVISORY WRITE 849 00:11:3730 0 EOF

ITec (itec) said :
#3

Test comment for the launchpad problem!

ITec (itec) said :
#4

Hi all!

The problem is still there.
Can anyone help?

Best regards
Christian

Launchpad Janitor (janitor) said :
#5

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

ITec (itec) said :
#6

Hi there!

Since there is no reasonable explanation, I consider this a bug.
See: https://bugs.launchpad.net/ubuntu/+source/ctdb/+bug/558391

Best regards
Christian

Launchpad Janitor (janitor) said :
#7

This question was expired because it remained in the 'Open' state without activity for the last 15 days.