deployment stuck at 100% all nodes ready

Asked by Baboune

hello,

We are running two parallel environments using Mirantis 5.1 plus some patches.

3 nodes were added to the second environment (id 9), and the deployment is now stuck at 100%. It has been like this for 5 days.

All nodes report being ready in the nailgun DB:
nailgun=# SELECT id, status,error_type from nodes;
 id | status | error_type
----+--------+------------
 50 | ready  |
 54 | ready  |
 68 | ready  |
 69 | ready  |
 70 | ready  |
 52 | ready  |
 73 | ready  |
 51 | ready  |
 71 | ready  |
 72 | ready  |
 48 | ready  |

No error in the logs.

$ fuel env
id | status | name | mode | release_id | changes | pending_release_id
---|-------------|------------------------|-----------|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------
6 | operational | kds-cmc-openstack-2410 | multinode | 3 | [] | None
9 | deployment | kds-cmc-openstack-2511 | multinode | 3 | [{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, {u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, {u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, {u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}] | None
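
The `changes` column for environment 9 is a Python-style list, so it can be read mechanically. The following is only a minimal sketch, not part of Fuel: the helper name `pending_changes_by_node` is made up for illustration, and the two inputs are copied by hand from the `fuel env` output and the nailgun query in this question.

```python
import ast
from collections import defaultdict

# The `changes` value for environment 9, copied from the `fuel env` output.
CHANGES = (
    "[{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, "
    "{u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, "
    "{u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, "
    "{u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}]"
)

# Node ids returned by the `SELECT id, status, error_type from nodes` query.
DB_NODE_IDS = {50, 54, 68, 69, 70, 52, 73, 51, 71, 72, 48}


def pending_changes_by_node(changes_repr):
    """Group the names of pending changes by node id (hypothetical helper)."""
    grouped = defaultdict(list)
    for change in ast.literal_eval(changes_repr):
        grouped[change["node_id"]].append(change["name"])
    return dict(grouped)


pending = pending_changes_by_node(CHANGES)
# Nodes that have pending changes but no row in the query result.
unknown = sorted(set(pending) - DB_NODE_IDS)

print(pending)
print("pending changes for nodes missing from the query result:", unknown)
```

With the data as posted, nodes 71 to 74 each have pending `interfaces` and `disks` changes, and node 74 does not appear in the query result.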

Tried:
 dockerctl shell postgres
 su - postgres
 psql nailgun
 update nodes set status = 'ready', error_type = NULL where id = <NODE_ID>

Tried:
 dockerctl stop nginx
 dockerctl start nginx

Cannot upload a snapshot: the operation times out and the tarball is never created.

We use Icehouse, CentOS, Neutron VLAN, Cinder on Ceph, and default Glance. The OpenStack cluster is operational, i.e. the nodes report as compute nodes and all services appear to run.

What can I do to move the state of the environment to "Operational"?

Question information

Language: English
Status: Answered
For: Fuel for OpenStack
Assignee: No assignee
Fabrizio Soppelsa (fsoppelsa) said :
#1

Greetings Baboune,

you can try to force Fuel to stop the activity through `fuel task`.
Get the tasks with `fuel task list` and try to stop the stuck one with `fuel task delete -f -t <TASK_ID>`

This could be due to anything, but it is potentially related to a wrong network configuration. Make sure that everything is set appropriately in the Fuel UI under "Configure Interfaces", and that Verify Network exits with success.

Best,
Fabrizio /kaliya

Baboune (seyvet) said :
#2

Would deleting a task also roll back all the changes in the environment?

"Make sure that everything is set appropriately in the Fuel UI under "Configure Interfaces", and that Verify Network exits with success."
I will verify this again.

Baboune (seyvet) said :
#3

Following tasks were stuck:

 fuel task list
id | status | name | cluster | progress | uuid
----|---------|------------|---------|----------|-------------------------------------
166 | error | dump | None | 100 | 6f5383ae-78b7-4e9a-bbf4-1790617966d4
178 | ready | provision | 9 | 100 | 22a56349-0fb6-4df8-84e2-ac6e55b396fe
179 | running | deployment | 9 | 100 | 32c12350-7336-4a67-b526-7ec2cfe92b21
175 | running | deploy | 9 | 100 | e6c5eeb8-ce0c-4684-bed5-46cf720f5a8a

Did:
# fuel task delete -f --task 166
Tasks with id's 166 deleted.
# fuel task delete -f --task 179
Tasks with id's 179 deleted.
# fuel task delete -f --task 175
Tasks with id's 175 deleted.

Then 170 disappeared. Environment is now at "deployment" status.
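
The stuck tasks in the first listing of this comment are the ones that still report `running` although their progress is 100. As a rough sketch (plain text parsing, not a Fuel API; `parse_task_list` and `stuck_tasks` are hypothetical helper names), they can be picked out of the `fuel task list` output like this:

```python
# Output of `fuel task list`, copied from this comment.
TASK_LIST = """\
id | status | name | cluster | progress | uuid
----|---------|------------|---------|----------|-------------------------------------
166 | error | dump | None | 100 | 6f5383ae-78b7-4e9a-bbf4-1790617966d4
178 | ready | provision | 9 | 100 | 22a56349-0fb6-4df8-84e2-ac6e55b396fe
179 | running | deployment | 9 | 100 | 32c12350-7336-4a67-b526-7ec2cfe92b21
175 | running | deploy | 9 | 100 | e6c5eeb8-ce0c-4684-bed5-46cf720f5a8a
"""


def parse_task_list(text):
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [col.strip() for col in lines[0].split("|")]
    tasks = []
    for line in lines[1:]:
        if set(line.strip()) <= set("-|"):
            continue  # skip the separator row made of dashes and pipes
        cells = [cell.strip() for cell in line.split("|")]
        tasks.append(dict(zip(header, cells)))
    return tasks


def stuck_tasks(tasks):
    """Tasks that claim to be running even though progress is already 100."""
    return [t for t in tasks if t["status"] == "running" and t["progress"] == "100"]


stuck = stuck_tasks(parse_task_list(TASK_LIST))
for task in stuck:
    print(task["id"], task["name"])
```

For the listing above this selects tasks 179 (`deployment`) and 175 (`deploy`), which are two of the three tasks that were deleted by hand.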

Launched "verify networks".

# fuel task list
id | status | name | cluster | progress | uuid
----|---------|-----------------|---------|----------|-------------------------------------
181 | running | check_dhcp | 9 | 0 | a702dc57-0b2f-4187-89d5-b86809e2c839
180 | running | verify_networks | 9 | 0 | 06458edb-503b-47f1-a92b-5098d2a1b90e

Stuck again.

Baboune (seyvet) said :
#4

Checked container status. It seems astute is down.

Cannot restart it:

[root@kds-cmc-fuel-02 node-74.rnd.ki.sw.ericsson.se]# dockerctl start astute
fuel-core-5.1-astute is already running.
checking container astute
checking with command "shell_container astute ps aux | grep -q 'astuted'"
try number 1
return code is 1
try number 2
return code is 1
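
Judging only from the output above, the check looks for an `astuted` process inside the container and retries when it is not found. The sketch below illustrates that kind of retry loop; it is not dockerctl's implementation, the names `process_listed` and `check_with_retries` are invented here, and the `ps` lines are made-up sample data.

```python
import time


def process_listed(ps_output, name):
    """Return True if any line of `ps aux`-style output mentions `name`."""
    return any(name in line for line in ps_output.splitlines())


def check_with_retries(probe, attempts=3, delay=0.0):
    """Call `probe` up to `attempts` times.

    Returns the number of the first successful try, or None if every try failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        time.sleep(delay)
    return None


# Made-up sample data: the process only shows up on the third try.
samples = iter([
    "root 1 bash",
    "root 1 bash",
    "root 1 bash\nroot 7 astuted",
])
first_success = check_with_retries(lambda: process_listed(next(samples), "astuted"))
print("first successful try:", first_success)

# A probe that never succeeds exhausts all tries, as in the output above.
never = check_with_retries(lambda: process_listed("root 1 bash", "astuted"), attempts=2)
print("result when the process never appears:", never)
```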

Baboune (seyvet) said :
#5

Additional info: There are no logs coming from astute.

Baboune (seyvet) said :
#6

Would it be safe to "revert" the astute container?

Baboune (seyvet) said :
#7

OK. I rolled back the fix for https://bugs.launchpad.net/fuel/+bug/1387699. The astute container restarted.

Baboune (seyvet) said :
#8

Status in UI: Deployment
Astute is now running, I can see logs.

But:
- Verify Networks is not completing.
- Environment is stuck on "deployment"
- Fix for https://bugs.launchpad.net/fuel/+bug/1387699 stopped astute.

Baboune (seyvet) said :
#9

And if I try to launch "Deploy Changes", I get:
2014-12-02 16:17:32 ERROR [7faa720e4740] (logger) Response code '500 Internal Server Error' for PUT /api/v1/clusters/9/changes from 172.17.42.1:41435
2014-12-02 16:17:32 ERROR [7faa720e4740] (logger) Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/usr/lib/python2.6/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "<string>", line 2, in PUT
  File "/usr/lib/python2.6/site-packages/nailgun/api/v1/handlers/base.py", line 93, in content_json
    data = func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nailgun/api/v1/handlers/base.py", line 364, in PUT
    task = task_manager.execute()
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 132, in execute
    self._remove_obsolete_tasks()
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 104, in _remove_obsolete_tasks
    raise errors.DeploymentAlreadyStarted()
DeploymentAlreadyStarted: Deployment already started
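
The traceback shows the request failing inside `_remove_obsolete_tasks` with `DeploymentAlreadyStarted`. The following is a deliberately simplified model of that kind of guard, written only to illustrate the error; it is not nailgun's code. The rule that a `deploy` task in status `running` blocks a new deployment, and that finished `deploy` tasks are dropped, is an assumption made for this sketch.

```python
class DeploymentAlreadyStarted(Exception):
    """Stand-in for the nailgun error named in the traceback."""


def remove_obsolete_tasks(tasks):
    """Simplified model (assumption): refuse to start while a deploy task runs.

    Finished deploy tasks are dropped; all other tasks are kept as they are.
    """
    remaining = []
    for task in tasks:
        if task["name"] != "deploy":
            remaining.append(task)
        elif task["status"] == "running":
            raise DeploymentAlreadyStarted("Deployment already started")
        # deploy tasks in any other status are treated as obsolete and dropped
    return remaining


# Task states taken from comment #3 (before the manual deletion).
tasks = [
    {"id": 178, "name": "provision", "status": "ready"},
    {"id": 175, "name": "deploy", "status": "running"},
]

try:
    remove_obsolete_tasks(tasks)
    blocked = False
except DeploymentAlreadyStarted as exc:
    blocked = True
    print("blocked:", exc)
```

In this model, any `deploy` task that never leaves the `running` state keeps blocking every later "Deploy Changes" request.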

Baboune (seyvet) said :
#10

# fuel task list
id | status | name | cluster | progress | uuid
----|--------|-----------------|---------|----------|-------------------------------------
186 | error | verify_networks | 9 | 100 | 2fdc9dc7-b1c5-407a-a151-fb53554b19ff
187 | error | check_dhcp | 9 | 100 | 04d11ad6-d970-4d74-a93c-724d92ea7d06

Fabrizio Soppelsa (fsoppelsa) said :
#11

Are all your containers working fine?

dockerctl list -l

Maybe something went wrong with them, and they're corrupted. Please see http://docs.mirantis.com/openstack/fuel/master/operations.html#docker-disk-full-top-tshoot

Fabrizio.

Baboune (seyvet) said :
#12

After rolling back https://bugs.launchpad.net/fuel/+bug/1387699, all the containers appear to be working.

# dockerctl list -l
Name Image Status Full container name
nginx fuel/nginx_5.1 Running fuel-core-5.1-nginx
rabbitmq fuel/rabbitmq_5.1 Running fuel-core-5.1-rabbitmq
astute fuel/astute_5.1 Running fuel-core-5.1-astute
rsync fuel/rsync_5.1 Running fuel-core-5.1-rsync
keystone fuel/keystone_5.1 Running fuel-core-5.1-keystone
postgres fuel/postgres_5.1 Running fuel-core-5.1-postgres
rsyslog fuel/rsyslog_5.1 Running fuel-core-5.1-rsyslog
nailgun fuel/nailgun_5.1 Running fuel-core-5.1-nailgun
cobbler fuel/cobbler_5.1 Running fuel-core-5.1-cobbler
ostf fuel/ostf_5.1 Running fuel-core-5.1-ostf
mcollective fuel/mcollective_5.1 Running fuel-core-5.1-mcollective

But no task seems to complete. And the environment is stuck in "Deployment".

Fabrizio Soppelsa (fsoppelsa) said :
#13

So it seems that the containers are working now, at least.

Do you still have tasks in progress?
Do you mean that your environment is stuck in "Deployment" in the UI?
Did you try to reset the environment, re-add the nodes and then re-deploy? Please elaborate on your procedure.

Sorry for having been intermittent in IRC lately.

Fabrizio /kaliya

Baboune (seyvet) said :
#14

IRC: I appreciate the help.

"Do you still have tasks in progress?"
Yes. I can delete them, but any operation (like "Verify Network") launched afterwards reaches 100% and then its tasks remain. In other words, it seems a task at 100% is never removed from the task list.

"Do you mean that your environment is stuck in "Deployment" in the UI?"
Yes. Also in the UI, the "Deploy" button is available. If I click it, it generates a 500 error (see comment #9).

Additionally in the CLI:
# fuel env
id | status | name | mode | release_id | changes | pending_release_id
---|-------------|------------------------|-----------|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------
6 | operational | kds-cmc-openstack-2410 | multinode | 3 | [] | None
9 | deployment | kds-cmc-openstack-2511 | multinode | 3 | [{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, {u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, {u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, {u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}] | None

"Did you try to reset the environment, re-add the nodes and then re-deploy? Please elaborate on your procedure."
No. It does seem to be my next move.
I have different VMs and services running on the cluster. And resetting everything is not an option.
OpenStack is behaving fine. All OpenStack health tests seem to pass, and we are not experiencing any problems within the openstack machines.

One additional comment: There are 2 environments managed by a single Fuel machine. Only one of the two is stuck in "Deployment". Both experience the tasks stuck problem.

Hope this helps.

Fabrizio Soppelsa (fsoppelsa) said :
#15

Greetings Baboune.

Sorry for having kept you on hold.
If you have not done a fresh reinstall yet, please share a diagnostic snapshot including the env that gets stuck.

Best regards,
Fabrizio

Baboune (seyvet) said :
#16

Hi

As said in the original question, "Cannot upload a snapshot: the snapshot operation times out and the tarball is never created."

My environment is reset. Upgraded to 5.1.1.

I am pondering whether to invest more time in Fuel by reading more of the code, ask for a license, and/or move to an in-house solution.
