Fuel for OpenStack

deployment stuck at 100% all nodes ready

Asked by Baboune on 2014-12-03

Hello,

We are running two parallel environments using Mirantis 5.1 plus some patches.

Three nodes were added to the second environment (id 9) and now the deployment is stuck at 100%. It has been like this for 5 days.

All nodes report being ready in the nailgun DB:
nailgun=# SELECT id, status,error_type from nodes;
 id | status | error_type
----+--------+------------
 50 | ready  |
 54 | ready  |
 68 | ready  |
 69 | ready  |
 70 | ready  |
 52 | ready  |
 73 | ready  |
 51 | ready  |
 71 | ready  |
 72 | ready  |
 48 | ready  |

There are no errors in the logs.

$ fuel env
id | status | name | mode | release_id | changes | pending_release_id
---|-------------|------------------------|-----------|------------|---------|-------------------
6 | operational | kds-cmc-openstack-2410 | multinode | 3 | [] | None
9 | deployment | kds-cmc-openstack-2511 | multinode | 3 | [{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, {u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, {u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, {u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}] | None

Tried:
dockerctl shell postgres
su - postgres
psql nailgun
update nodes set status = 'ready', error_type = NULL where id = <NODE_ID>;

Tried:
dockerctl stop nginx
dockerctl start nginx

I cannot upload a diagnostic snapshot: the operation times out and the tarball is never created.

We use Icehouse, CentOS, Neutron VLAN, Cinder on Ceph, and the default Glance backend. The OpenStack cluster is operational, i.e. the nodes report as compute nodes, and all services appear to run.

What can I do to move the state of the environment to "Operational"?
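
The two outputs quoted in the question can be cross-checked against each other. The following is a minimal Python 3 sketch, not part of Fuel, that groups the pending `changes` of environment 9 by node and compares the node ids with the rows returned by the `nodes` query. The data is copied from the outputs above; the function name `pending_changes_by_node` is made up for illustration. Whether any mismatch it reports is related to the stuck deployment is not established in this thread.

```python
import ast
from collections import defaultdict

# Node ids copied from the `SELECT id, status,error_type from nodes` output above.
NODE_IDS_IN_DB = {50, 54, 68, 69, 70, 52, 73, 51, 71, 72, 48}

# `changes` column for environment 9, copied from the `fuel env` output above.
CHANGES_ENV_9 = (
    "[{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, "
    "{u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, "
    "{u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, "
    "{u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}]"
)


def pending_changes_by_node(changes_repr):
    """Group pending change names by node id (hypothetical helper)."""
    grouped = defaultdict(list)
    # The column is a Python 2 repr; literal_eval in Python 3 accepts the u'' prefix.
    for change in ast.literal_eval(changes_repr):
        grouped[change["node_id"]].append(change["name"])
    return dict(grouped)


pending = pending_changes_by_node(CHANGES_ENV_9)
# Node ids that have pending changes but no row in the quoted nodes table.
unknown_nodes = sorted(set(pending) - NODE_IDS_IN_DB)

for node_id in sorted(pending):
    print(node_id, pending[node_id])
print("pending changes for nodes missing from the nodes table:", unknown_nodes)
```

On the quoted data this reports one node id that has pending changes but does not appear in the `nodes` output.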

Question information

Language: English
Status: Answered
For: Fuel for OpenStack
Assignee: No assignee
Last query: 2014-12-06
Last reply: 2014-12-13

Fabrizio Soppelsa (fsoppelsa) said on 2014-12-04:

Greetings Baboune,

You can try to force Fuel to stop the activity through `fuel task`.
Get the tasks with `fuel task list` and try to stop the stuck one with `fuel task delete -f -t <TASK_ID>`.

This could be due to anything, but it is potentially related to a wrong network configuration. Make sure that everything is set appropriately in the Fuel UI under "Configure Interfaces", and that Verify Network exits with success.

Best,
Fabrizio /kaliya

Baboune (seyvet) said on 2014-12-04:

Would deleting a task also roll back all the changes in the environment?

"Make sure that everything is set appropriately in the Fuel UI under "Configure Interfaces", and that Verify Network exits with success."
I will verify this again.

Baboune (seyvet) said on 2014-12-05:

The following tasks were stuck:

fuel task list
id | status | name | cluster | progress | uuid
----|---------|------------|---------|----------|-------------------------------------
166 | error | dump | None | 100 | 6f5383ae-78b7-4e9a-bbf4-1790617966d4
178 | ready | provision | 9 | 100 | 22a56349-0fb6-4df8-84e2-ac6e55b396fe
179 | running | deployment | 9 | 100 | 32c12350-7336-4a67-b526-7ec2cfe92b21
175 | running | deploy | 9 | 100 | e6c5eeb8-ce0c-4684-bed5-46cf720f5a8a

Did:
# fuel task delete -f --task 166
Tasks with id's 166 deleted.
# fuel task delete -f --task 179
Tasks with id's 179 deleted.
# fuel task delete -f --task 175
Tasks with id's 175 deleted.

Then 170 disappeared. The environment is now at "deployment" status.

Launched "verify networks".

# fuel task list
id | status | name | cluster | progress | uuid
----|---------|-----------------|---------|----------|-------------------------------------
181 | running | check_dhcp | 9 | 0 | a702dc57-0b2f-4187-89d5-b86809e2c839
180 | running | verify_networks | 9 | 0 | 06458edb-503b-47f1-a92b-5098d2a1b90e

Stuck again.
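
The stuck entries in a `fuel task list` listing like the first one in this comment can be picked out mechanically. Below is a minimal Python 3 sketch that parses the pipe-separated table and keeps tasks that still report `running` although their progress is 100. That definition of "stuck" is an assumption made for this example, and the helper names `parse_table` and `stuck_tasks` are hypothetical, not part of the Fuel client.

```python
# Listing copied from the first `fuel task list` output in this comment.
TASK_LIST = """\
id  | status  | name       | cluster | progress | uuid
----|---------|------------|---------|----------|-------------------------------------
166 | error   | dump       | None    | 100      | 6f5383ae-78b7-4e9a-bbf4-1790617966d4
178 | ready   | provision  | 9       | 100      | 22a56349-0fb6-4df8-84e2-ac6e55b396fe
179 | running | deployment | 9       | 100      | 32c12350-7336-4a67-b526-7ec2cfe92b21
175 | running | deploy     | 9       | 100      | e6c5eeb8-ce0c-4684-bed5-46cf720f5a8a
"""


def parse_table(text):
    """Turn a pipe-separated CLI table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # lines[1] is the ----|---- separator
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def stuck_tasks(rows):
    """Assumed definition: still 'running' although progress already reads 100."""
    return [r for r in rows if r["status"] == "running" and r["progress"] == "100"]


for task in stuck_tasks(parse_table(TASK_LIST)):
    print(task["id"], task["name"], "cluster", task["cluster"])
```

Tasks that are `running` at 0%, like the `verify_networks` and `check_dhcp` entries in the second listing, are deliberately not matched by this filter, since they may simply not have started yet.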

Baboune (seyvet) said on 2014-12-05:

I checked the container status. It seems astute is down.

I cannot restart it:

[root@kds-cmc-fuel-02 node-74.rnd.ki.sw.ericsson.se]# dockerctl start astute
fuel-core-5.1-astute is already running.
checking container astute
checking with command "shell_container astute ps aux | grep -q 'astuted'"
try number 1
return code is 1
try number 2
return code is 1
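
The output above shows the container reported as "already running" while the check `ps aux | grep -q 'astuted'` keeps returning 1, which is the exit code `grep -q` gives when nothing matches. So the container is up but the `astuted` process is not found in it. The retry logic can be sketched in Python 3 as follows. This is an illustration only, not the `dockerctl` implementation; the function names and the sample `ps` output are hypothetical.

```python
import time


def process_listed(ps_output, name):
    """True if any line of ps-style output mentions `name` (what `grep -q name` tests)."""
    return any(name in line for line in ps_output.splitlines())


def wait_for_process(probe, name, attempts=3, delay=0.0):
    """Call probe() up to `attempts` times.

    Returns the attempt number on which `name` was found, or None if it never was.
    """
    for attempt in range(1, attempts + 1):
        if process_listed(probe(), name):
            return attempt
        time.sleep(delay)
    return None


# Hypothetical ps output: the container has processes, but no astuted among them.
PS_WITHOUT_ASTUTE = """\
root  1  0.0  0.0  /bin/bash /usr/local/bin/start.sh
root 42  0.0  0.0  ps aux
"""

# Hypothetical ps output with the daemon present.
PS_WITH_ASTUTE = PS_WITHOUT_ASTUTE + "root 57  0.3  1.2  astuted\n"

print(wait_for_process(lambda: PS_WITHOUT_ASTUTE, "astuted"))
print(wait_for_process(lambda: PS_WITH_ASTUTE, "astuted"))
```

In a real check, `probe` would run the `ps` command inside the container; it is passed in as a callable here so the sketch runs without Docker.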

Baboune (seyvet) said on 2014-12-05:

Additional info: There are no logs coming from astute.

Baboune (seyvet) said on 2014-12-05:

Would it be safe to "revert" the astute container?

Baboune (seyvet) said on 2014-12-05:

OK. I rolled back the fix for https://bugs.launchpad.net/fuel/+bug/1387699. The astute container restarted.

Baboune (seyvet) said on 2014-12-05:

Status in UI: Deployment
Astute is now running, and I can see logs.

But:
- Verify Networks is not completing.
- Environment is stuck on "deployment"
- Fix for https://bugs.launchpad.net/fuel/+bug/1387699 stopped astute.

Baboune (seyvet) said on 2014-12-05:

And if I try to launch "Deploy Changes", I get:
2014-12-02 16:17:32 ERROR
[7faa720e4740] (logger) Response code '500 Internal Server Error' for PUT /api/v1/clusters/9/changes from 172.17.42.1:41435
2014-12-02 16:17:32 ERROR
[7faa720e4740] (logger) Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/usr/lib/python2.6/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "<string>", line 2, in PUT
  File "/usr/lib/python2.6/site-packages/nailgun/api/v1/handlers/base.py", line 93, in content_json
    data = func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nailgun/api/v1/handlers/base.py", line 364, in PUT
    task = task_manager.execute()
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 132, in execute
    self._remove_obsolete_tasks()
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 104, in _remove_obsolete_tasks
    raise errors.DeploymentAlreadyStarted()
DeploymentAlreadyStarted: Deployment already started
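
The traceback shows the request failing inside `_remove_obsolete_tasks`, which raises `DeploymentAlreadyStarted`. One plausible reading is that the API refuses a new deployment while an earlier deploy task for the cluster is still marked as running. The Python 3 sketch below is a toy model of such a guard. It is an assumption for illustration and is not taken from the nailgun source; only the exception name and its message come from the traceback, and the task rows come from the `fuel task list` output earlier in the thread.

```python
class DeploymentAlreadyStarted(Exception):
    """Stand-in for the error class named in the traceback above."""


def remove_obsolete_tasks(tasks):
    """Toy guard (assumed behaviour, not nailgun's real code).

    Refuses to continue while a 'deploy' task is running; otherwise drops
    finished 'deploy' tasks and returns the tasks that remain.
    """
    remaining = []
    for task in tasks:
        if task["name"] != "deploy":
            remaining.append(task)  # other task types are left alone
        elif task["status"] == "running":
            raise DeploymentAlreadyStarted("Deployment already started")
        # A finished deploy task ("ready" or "error") is dropped here.
    return remaining


# Rows taken from the `fuel task list` output earlier in the thread.
TASKS = [
    {"id": 178, "name": "provision", "status": "ready"},
    {"id": 179, "name": "deployment", "status": "running"},
    {"id": 175, "name": "deploy", "status": "running"},
]

try:
    remove_obsolete_tasks(TASKS)
except DeploymentAlreadyStarted as exc:
    print("refused:", exc)

# With the running "deploy" task gone, this toy guard no longer raises.
without_deploy = [t for t in TASKS if t["id"] != 175]
print([t["id"] for t in remove_obsolete_tasks(without_deploy)])
```

Under this model, a deploy task that never leaves the running state would block every later "Deploy Changes" request with a 500 response.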

Baboune (seyvet) said on 2014-12-05:

#10

# fuel task list
id | status | name | cluster | progress | uuid
----|--------|-----------------|---------|----------|-------------------------------------
186 | error | verify_networks | 9 | 100 | 2fdc9dc7-b1c5-407a-a151-fb53554b19ff
187 | error | check_dhcp | 9 | 100 | 04d11ad6-d970-4d74-a93c-724d92ea7d06

Fabrizio Soppelsa (fsoppelsa) said on 2014-12-05:

#11

Are all your containers working fine?

dockerctl list -l

Maybe something went wrong with them, and they're corrupted. Please see http://docs.mirantis.com/openstack/fuel/master/operations.html#docker-disk-full-top-tshoot

Fabrizio.

Baboune (seyvet) said on 2014-12-05:

#12

After rolling back the fix for https://bugs.launchpad.net/fuel/+bug/1387699, all the containers appear to be working.

# dockerctl list -l
Name Image Status Full container name
nginx fuel/nginx_5.1 Running fuel-core-5.1-nginx
rabbitmq fuel/rabbitmq_5.1 Running fuel-core-5.1-rabbitmq
astute fuel/astute_5.1 Running fuel-core-5.1-astute
rsync fuel/rsync_5.1 Running fuel-core-5.1-rsync
keystone fuel/keystone_5.1 Running fuel-core-5.1-keystone
postgres fuel/postgres_5.1 Running fuel-core-5.1-postgres
rsyslog fuel/rsyslog_5.1 Running fuel-core-5.1-rsyslog
nailgun fuel/nailgun_5.1 Running fuel-core-5.1-nailgun
cobbler fuel/cobbler_5.1 Running fuel-core-5.1-cobbler
ostf fuel/ostf_5.1 Running fuel-core-5.1-ostf
mcollective fuel/mcollective_5.1 Running fuel-core-5.1-mcollective

But no task seems to complete. And the environment is stuck in "Deployment".

Fabrizio Soppelsa (fsoppelsa) said on 2014-12-05:

#13

So it seems that the containers are at least working now.

Do you still have tasks in progress?
Do you mean that your environment is stuck in "Deployment" in the UI?
Did you try to reset the environment, re-add the nodes, and then redeploy? Please elaborate on your procedure.

Sorry for having been intermittent in IRC lately.

Fabrizio /kaliya

Baboune (seyvet) said on 2014-12-06:

#14

IRC: I appreciate the help.

"Do you still have tasks in progress?"
Yes. I can delete them. Any operation (like "Verify Network") that is then launched reaches 100% and then its tasks remain. In other words, it seems a task at 100% is never removed from the task list.

"Do you mean that your environment is stuck in "Deployment" in the UI?"
Yes. Also, in the UI the "Deploy" button is available. If I click it, it generates a 500 error; see comment #9.

Additionally, in the CLI:
# fuel env
id | status | name | mode | release_id | changes | pending_release_id
---|-------------|------------------------|-----------|------------|---------|-------------------
6 | operational | kds-cmc-openstack-2410 | multinode | 3 | [] | None
9 | deployment | kds-cmc-openstack-2511 | multinode | 3 | [{u'node_id': 71, u'name': u'interfaces'}, {u'node_id': 71, u'name': u'disks'}, {u'node_id': 72, u'name': u'interfaces'}, {u'node_id': 72, u'name': u'disks'}, {u'node_id': 73, u'name': u'interfaces'}, {u'node_id': 73, u'name': u'disks'}, {u'node_id': 74, u'name': u'interfaces'}, {u'node_id': 74, u'name': u'disks'}] | None

"Did you try to reset the environment, re-add the nodes, and then redeploy? Please elaborate on your procedure."
No. It does seem to be my next move.
I have different VMs and services running on the cluster, and resetting everything is not an option.
OpenStack is behaving fine. All OpenStack health tests seem to pass, and we are not experiencing any problems within the OpenStack machines.

One additional comment: there are 2 environments managed by a single Fuel machine. Only one of the two is stuck in "Deployment". Both experience the stuck-tasks problem.

Hope this helps.

Fabrizio Soppelsa (fsoppelsa) said on 2014-12-13:

#15

Greetings Baboune.

Sorry for having kept you on hold.
If you have not done a fresh reinstall yet, please share a diagnostic snapshot including the environment that gets stuck.

Best regards,
Fabrizio

Baboune (seyvet) said on 2014-12-13:

#16

As said in #1, "Can not upload snapshot the snapshot operation times out and the tar ball is never created.".

My environment has been reset and upgraded to 5.1.1.

I am pondering whether to invest more time in Fuel by reading more of the code, to ask for a license, and/or to move to an in-house solution.
