Saturday, 22 March 2014

What if Ansible Run Hangs?

Running  Ansible against 1000s of nodes, not fully aware of some of the node status before the run, some were heavily loaded, and busy, some were down. such highly loaded of  OOM nodes or even some of the play-book tasks are prone to wait, and blockage, all of these conditions will cause Ansible to hang. below are some of the steps that I followed or were collected from Ansible mailing list* to help debug such a hang:

Is it the initial connection?

use -vvvv to trouble shoot the connection

What you call hung could be normal unless not intended:

from Ansible playbooks async :

By default tasks in play-books block, meaning the connections stay open until the task is done on each node. This may not always be desirable, or you may be running operations that take longer than the SSH timeout.

Is it the remote executed task ?

  •  Run ansible-playbook with ANSIBLE_KEEP_REMOTE_FILES=1
  • create a python tracefile

$ python -m trace --trace 
/home/jtanner/.ansible/tmp/ansible-1387469069.32-4132751518012/command 
2>&1 | head 
  --- modulename: command, funcname: <module> 
command(21): import sys 
command(22): import datetime 
command(23): import traceback 
command(24): import re 
command(25): import shlex 
  --- modulename: shlex, funcname: <module> 
shlex.py(2): """A lexical analyzer class for simple shell-like syntaxes.""" 
shlex.py(10): import os.path 
shlex.py(11): import sys 



Possible causes of hangs :

  • stale shared file system in the remote targeted node
  • if it is a yum related task, and another yum process is running already in targeted node
  • Module dependency such as requirement to add the host in advance to known_hosts or forwarding SSH credentials.
  • some issues with sudo, where the ssh user and the sudo user are the same but sudo_user is not specified.
  • some command module tasks are expecting input from stdin
  • setup module could hang due to hardware or os related issue, updated firmware, drivers could help
  • network, or firewall related, or change of network/firewall/load balancing caused by Ansible run
  • could it be a lookup issue (e.g DNS,  or user look up)


* Thanks to Michael Dehaan and Ansible developers for a an awesome code,  and thanks to James Tanner for his help and pointers in the Ansible users mailing list, and IRC.

* This was written at the time of Ansible 1.4.2 in RHEL/CENTOS based environment,  ssh connections could even be further improved by enabling ControlPersist nor pipelining mode