Saturday, 22 March 2014

What if Ansible Run Hangs?

I was running Ansible against thousands of nodes without being fully aware of their status before the run: some were heavily loaded and busy, and some were down. Heavily loaded or OOM nodes, as well as playbook tasks that are prone to waiting and blocking, can all cause Ansible to hang. Below are some of the steps I followed, or collected from the Ansible mailing list*, to help debug such a hang:

Is it the initial connection?

Use -vvvv to troubleshoot the connection.

What looks like a hang could simply be normal, intended blocking behavior:

From the Ansible playbook documentation on async:

By default, tasks in playbooks block, meaning the connections stay open until the task is done on each node. This may not always be desirable, or you may be running operations that take longer than the SSH timeout.

Is it the remotely executed task?

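  •  Run ansible-playbook with ANSIBLE_KEEP_REMOTE_FILES=1 so the module files copied to the remote node are left in place after the run
  • Create a Python trace file by running the kept module under Python's trace module on the remote node: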

$ python -m trace --trace 
2>&1 | head 
  --- modulename: command, funcname: <module> 
command(21): import sys 
command(22): import datetime 
command(23): import traceback 
command(24): import re 
command(25): import shlex 
  --- modulename: shlex, funcname: <module>
"""A lexical analyzer class for simple shell-like syntaxes."""
import os.path
import sys
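When a run seems stuck, the useful part of the trace is its tail: the last line printed before the hang. Below is a minimal Python 3 sketch (my own helper, not part of Ansible) that runs a command with a timeout, kills it if it overruns, and keeps only the last lines of output. The names `run_with_tail` and `trace_module` and the default timeout are hypothetical; the module path is whatever file ANSIBLE_KEEP_REMOTE_FILES=1 left on the node, which I leave unspecified here.

```python
import subprocess
import sys
from collections import deque


def run_with_tail(cmd, timeout=60, keep=20):
    """Run cmd; if it exceeds `timeout` seconds, kill it.
    Returns (timed_out, last `keep` lines of combined stdout/stderr)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    timed_out = False
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        proc.kill()                     # communicate() does not kill on timeout
        output, _ = proc.communicate()  # collect what was written before the kill
    return timed_out, list(deque(output.splitlines(), maxlen=keep))


def trace_module(module_path, timeout=60, keep=20):
    """Run a kept module file under `python -m trace --trace`.
    -u makes stdout unbuffered, so trace lines written before a kill
    are not lost in the child's buffer."""
    cmd = [sys.executable, "-u", "-m", "trace", "--trace", module_path]
    return run_with_tail(cmd, timeout=timeout, keep=keep)
```

The same `run_with_tail` can wrap the playbook itself, e.g. `run_with_tail(["ansible-playbook", "-vvvv", "site.yml"], timeout=600)` with a hypothetical `site.yml`, to capture the last verbose lines of a run that never finishes.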

Possible causes of hangs:

  • A stale shared file system on the targeted remote node
  • A yum-related task while another yum process is already running on the targeted node
  • Module dependencies, such as needing the host added to known_hosts in advance, or SSH credentials being forwarded
  • Issues with sudo, where the SSH user and the sudo user are the same but sudo_user is not specified
  • Command module tasks that expect input from stdin
  • The setup module hanging due to hardware- or OS-related issues; updated firmware or drivers could help
  • Network or firewall problems, including network/firewall/load-balancing changes caused by the Ansible run itself
  • Lookup issues (e.g. DNS or user lookup)
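Two of these causes, a stale shared mount and a slow DNS lookup, can be probed on a node before a run. The sketch below is a hypothetical pre-flight check in Python 3, not something Ansible provides: it runs `os.stat()` or `socket.gethostbyname()` in a daemon thread and reports whether it answered within a time limit. The function names, the 5-second default and any paths you pass in are my own assumptions.

```python
import os
import socket
import threading


def responds_within(func, timeout, *args):
    """Call func(*args) in a daemon thread. Return True if it finishes
    (successfully or by raising) within `timeout` seconds, else False."""
    done = threading.Event()

    def worker():
        try:
            func(*args)
        except Exception:
            pass  # an error is still an answer, not a hang
        finally:
            done.set()

    # Daemon thread: the check does not wait on a call that never returns.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout)


def preflight(paths=(), hostnames=(), timeout=5):
    """Return warnings for stat() calls and DNS lookups that did not answer in time."""
    warnings = []
    for path in paths:
        if not responds_within(os.stat, timeout, path):
            warnings.append(f"stat({path!r}) did not return within {timeout}s (stale mount?)")
    for name in hostnames:
        if not responds_within(socket.gethostbyname, timeout, name):
            warnings.append(f"lookup of {name!r} did not return within {timeout}s")
    return warnings
```

A missing path or an unknown host counts as "answered" here, since the point is to catch calls that block, not calls that fail.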

* Thanks to Michael DeHaan and the Ansible developers for awesome code, and thanks to James Tanner for his help and pointers on the Ansible users mailing list and IRC.

* This was written at the time of Ansible 1.4.2 in a RHEL/CentOS-based environment; SSH connections could be improved further by enabling ControlPersist or pipelining mode.

Thursday, 30 January 2014

DFIR Dec. 2013 Memory Forensics Challenge notes:

This is my first memory forensics investigation outside of SANS 508, using the SIFT workstation to investigate Timothy Dungan's workstation in the "Stark Research Labs Intrusion case by Hydra". Even though I believe I have answered the questions asked in the SANS DFIR blog, there is still a lot to learn and more skills to sharpen. With plenty of curiosity, Volatility, Redline, and the SIFT workstation, it is easy to run a memory investigation, especially if one is equipped with the SANS 508 course material and the Volatility IRC channel. Below are my scattered notes from three separate sessions; the overall time was over 7-8 hours, and it could have been done in one session with more focus and less distraction from the kids.

[Note to self: collect more reports and screenshots next time, and write the report as you go along.]

Using Mandiant Redline:

I used Redline's whitelisting to filter out a large amount of data that is unlikely to be interesting, i.e. data corresponding to unaltered, known-good software components. However, I was not able to find red flags ("rogue processes") straight away. There were three suspicious processes I was targeting, but I could not find the obvious anomaly that malware introduces to a system. So I started looking for other low-hanging fruit: signals that could give me a good pivot point, also using the least-frequency-of-occurrence technique, and relying on the DFIR challenge questions to keep me focused.
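Redline does its whitelisting internally; the underlying idea is plain set subtraction against a known-good hash list (for example a hash set such as NIST's NSRL). A minimal Python 3 sketch of that idea follows; the function names, the one-digest-per-line file format and the sample data are hypothetical and not Redline's actual format.

```python
def load_hash_set(lines):
    """Read one hex digest per line, ignoring blank lines and '#' comments."""
    return {
        line.strip().lower()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def filter_known_good(items, known_good_hashes):
    """items: iterable of (path, hash_hex) pairs.
    Return the pairs whose hash is NOT known-good, i.e. what is left to review."""
    known = {h.lower() for h in known_good_hashes}
    return [(path, h) for path, h in items if h.lower() not in known]
```

Whatever survives the filter is the much smaller pile that deserves a human look; anything unknown is not necessarily malicious, only unexplained.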

·  Suspicious untrusted handle "pork_bun" associated with the explorer.exe process (PID 1672)

 Possible global rootkit cloaking activity via System Service Descriptor Table (SSDT) hooks:
 The hooking module, irykmmww.sys, has a suspicious-looking name. It hooks ntoskrnl.exe functions including NtEnumerateKey, NtEnumerateValueKey and NtQueryDirectoryFile, which are used to hide things, as well as NtDeviceIoControlFile:
o   NtEnumerateValueKey: allows an application to enumerate and interact with registry values. Malware uses this hook to insert itself between any registry value request and the caller, filtering out the values it wants to hide.
o   NtEnumerateKey: allows an application to enumerate and interact with registry keys. Malware uses this hook to insert itself between any registry key request and the caller, filtering out the keys it wants to hide.
o   NtQueryDirectoryFile: gives the application the ability to perform a directory listing. By hooking this function, malware can hide directories or files from normal file managers as well as from anti-malware tools.
o   NtDeviceIoControlFile: the API Windows uses to send I/O control requests to drivers, including for network-related operations, and widely mentioned in malware behavior analysis papers. Malware can use it to replay network traffic; how cool is that?!
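The logic behind spotting these hooks is simple: on a clean 32-bit XP system every SSDT entry should point into the kernel image (ntoskrnl.exe or one of its PAE/multiprocessor variants) or, for the GUI table, into win32k.sys. A minimal Python 3 sketch of that check, assuming the entries have already been extracted by a tool such as Volatility's ssdt plugin; the `SsdtEntry` type and the sample indices are hypothetical.

```python
from typing import Iterable, List, NamedTuple

# Modules that legitimately own SSDT entries on 32-bit Windows XP.
# The kernel file name depends on PAE / multiprocessor configuration.
EXPECTED_OWNERS = {"ntoskrnl.exe", "ntkrnlpa.exe", "ntkrnlmp.exe",
                   "ntkrpamp.exe", "win32k.sys"}


class SsdtEntry(NamedTuple):
    table: int      # 0 = native kernel table, 1 = win32k (GUI) table
    index: int      # system service number
    function: str   # e.g. "NtEnumerateKey"
    owner: str      # module the entry currently points into


def find_ssdt_hooks(entries: Iterable[SsdtEntry],
                    expected=frozenset(EXPECTED_OWNERS)) -> List[SsdtEntry]:
    """Return entries that point into any module other than the expected owners."""
    allowed = {name.lower() for name in expected}
    return [e for e in entries if e.owner.lower() not in allowed]
```

For example, an entry like `SsdtEntry(0, 1, "NtEnumerateKey", "irykmmww.sys")` would be flagged, while entries owned by ntkrnlpa.exe or win32k.sys would not.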

Not to mention that my company campus ISP blocks me from doing more research ;-)

Not that it can't be bypassed with a VPN connection.

I tried to acquire the driver for further analysis, but Redline couldn't dump it. As you will see later, I was able to dump it with Volatility, which shows why you need to know more than one tool: no single tool fits all situations, and tools tend to fail you when you need them most.

Using Volatility to cross check and dig deeper:

Treating it as a real case, I preserved the original image as read-only and kept its hash value:

                 $ sudo chattr +i dfir-challenge/APT.img
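`chattr +i` makes the file immutable but does not record its hash. A minimal Python 3 sketch for hashing a large memory image in a single chunked pass, so the digests can be compared before and after the analysis; the function name and the choice of algorithms are my own.

```python
import hashlib


def hash_image(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Hash a (possibly multi-GB) image once, feeding each chunk to every hasher."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # Read 1 MiB at a time so the whole image never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling `hash_image("dfir-challenge/APT.img")` before and after the investigation and comparing the two dictionaries is enough to show the evidence was not altered.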

To start processing, we need to know the image's profile, so we run imageinfo:

sansforensics@SIFT-Workstation:/cases/dfir-challenge$ -f ./APT.img imageinfo
Volatile Systems Volatility Framework 2.1_alpha
Determining profile based on KDBG search...

Suggested Profile(s) : WinXPSP3x86, WinXPSP2x86
AS Layer1 : JKIA32PagedMemoryPae (Kernel AS)
AS Layer2 : FileAddressSpace (/cases/dfir-challenge/APT.img)
PAE type : PAE
DTB : 0x319000
KDBG : 0x80545b60L
KPCR : 0xffdff000L
Image date and time : 2009-05-05 19:28:57
Image local date and time : 2009-05-05 19:28:57
Number of Processors : 1
Image Type : Service Pack 3


First, the normal process listing, which walks the doubly linked list of process structures and therefore only shows processes that have not been hidden by unlinking themselves from that list.

Next, cross-examining the processes seen via the doubly linked list against the ones carved from memory structures by scanning.
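The cross-examination boils down to comparing two views of the same system. A minimal Python 3 sketch, assuming both views have already been reduced to `{pid: name}` dictionaries (keying on PID alone is a simplification, since PIDs can be reused); Volatility's psxview plugin automates a broader version of this comparison.

```python
def cross_view(listed, scanned):
    """listed:  {pid: name} from walking the process list (pslist-style).
    scanned: {pid: name} from scanning memory for process structures (psscan-style).
    Returns (only_scanned, only_listed)."""
    # Found by scanning but missing from the list: unlinked (hidden) or exited.
    only_scanned = {pid: name for pid, name in scanned.items() if pid not in listed}
    # Listed but not carved by the scan: unusual, worth a second look.
    only_listed = {pid: name for pid, name in listed.items() if pid not in scanned}
    return only_scanned, only_listed
```

Processes that only the scan finds are candidates, not verdicts: a scan also recovers processes that simply exited before the image was taken.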

Scanning for network artifacts: since this is assumed to be an APT (advanced persistent threat) case, one good lead is that if the box was infected at some point, the malware would have had to connect to covert command-and-control (C2) channels; or, if this was not the originally infected box, data exfiltration activity should leave some breadcrumbs for us to trace.

Interestingly enough, the connection scan shows port 443, usually a firewall-friendly port, in connections that appear to be either inactive or stealthy. However, they all come from the same process to the same IP, and the process is explorer.exe (PID 1672). Running whois on the IP shows that it belongs to our friends at a Chinese state-owned ISP in Beijing.
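Seeing "same process, same IP, same port" is easier when the scan output is grouped rather than read row by row. A minimal Python 3 sketch, assuming rows shaped like `offset local_ip:port remote_ip:port pid`; that column layout is an assumption, so adjust the parser to whatever your tool actually prints. The function names are hypothetical.

```python
from collections import Counter


def parse_conn_rows(lines):
    """Parse 'offset local_ip:port remote_ip:port pid' rows into
    (pid, remote_ip, remote_port) tuples, skipping headers and junk."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # header, separator, or unexpected format
        _offset, _local, remote, pid = parts
        rip, _, rport = remote.rpartition(":")
        if not rip or not rport.isdigit():
            continue
        rows.append((int(pid), rip, int(rport)))
    return rows


def group_by_pid_and_remote(rows):
    """Count connections per (pid, remote_ip, remote_port)."""
    return dict(Counter(rows))
```

Filtering the grouped keys for port 443 then gives the list of remote IPs to check with whois.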

Malware usually creates a mutant (mutex) so that it does not cause problems for the system or for itself by trying to install or configure itself again; it does this by checking whether a certain mutant already exists. One interesting mutant I saw in both Redline and Volatility was the pork_bun mutant.

Now I am quite confident that explorer.exe (PID 1672) is the rogue process. Finding which file holds the malware, in case it was injected or hollowed, is quite a tedious task; however, I double-checked the least frequent, strangely named, unsigned handles, starting with the executable DLLs, as well as the SSDT hooks.
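The least-frequency-of-occurrence technique ("stacking") is just counting: names that show up everywhere are probably normal, and the rare ones deserve attention. A minimal Python 3 sketch; the function name and the threshold are my own choices, and with a single memory image the counts come from one system, whereas stacking is stronger across many hosts.

```python
from collections import Counter


def least_frequent(names, max_count=1):
    """Return (name, count) for names seen at most `max_count` times,
    rarest first, ties sorted alphabetically. Names are compared
    case-insensitively, as Windows file names are."""
    counts = Counter(n.lower() for n in names)
    rare = [(n, c) for n, c in counts.items() if c <= max_count]
    return sorted(rare, key=lambda nc: (nc[1], nc[0]))
```

Fed with DLL or handle names, the short list at the top of the result is where to start looking.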

Both the DLL search and the SSDT hooks in Volatility led to the same conclusion as Redline, and this time I was able to dump the driver irykmmww.sys and confirm it is rogue using VirusTotal.
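A dumped sample can be checked by hash without uploading it. A minimal Python 3 sketch using the VirusTotal API v3 file-report endpoint, which is the current API and postdates this 2014 post; you need your own API key, and the function names are hypothetical.

```python
import json
import urllib.request

VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{}"


def vt_request(file_hash, api_key):
    """Build (but do not send) a VirusTotal v3 file-report request
    for an MD5, SHA-1 or SHA-256 digest."""
    return urllib.request.Request(VT_FILE_URL.format(file_hash.lower()),
                                  headers={"x-apikey": api_key})


def vt_lookup(file_hash, api_key, timeout=30):
    """Send the request and return the parsed JSON report."""
    with urllib.request.urlopen(vt_request(file_hash, api_key),
                                timeout=timeout) as resp:
        return json.load(resp)
```

A hash that VirusTotal has never seen raises an HTTP 404 error, which by itself says nothing about whether the file is clean.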

Most of the VirusTotal findings point to a generic trojan/backdoor rootkit, installed using an exploit rather than spreading like a virus, probably via social engineering such as phishing, as is the norm with APT; however, I cannot confirm this with the research done so far.

VirusTotal also indicated that a variant of the notorious Poison Ivy Trojan was used. Poison Ivy was famously used in the 2011 attack on RSA's SecurID infrastructure and, after eight years, is still going strong in targeted attacks.

Another finding: the malware logs its findings or activity to:


Running filescan and saving its output to a file for further analysis, I can see one or two other suspicious explorer-like files, for example:

'\\WINDOWS\\system32\\exploder.exe': why would that be running under system32?!
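Names like exploder.exe are meant to pass a quick glance as explorer.exe. A minimal Python 3 sketch that flags file names which closely resemble, but do not exactly match, a known-good binary name, using `difflib` similarity; the short `KNOWN_BINARIES` list and the 0.8 cutoff are hypothetical choices, not a vetted baseline.

```python
import difflib
import ntpath

# Hypothetical, deliberately short list of known-good Windows binary names.
KNOWN_BINARIES = ["explorer.exe", "svchost.exe", "lsass.exe", "csrss.exe",
                  "services.exe", "winlogon.exe", "smss.exe"]


def lookalikes(paths, known=KNOWN_BINARIES, cutoff=0.8):
    """Return (path, resembled_name) for file names that are similar to,
    but not the same as, a known-good binary name."""
    known_lower = [k.lower() for k in known]
    hits = []
    for path in paths:
        name = ntpath.basename(path).lower()  # Windows-style path handling
        if name in known_lower:
            continue  # exact match: not a look-alike
        match = difflib.get_close_matches(name, known_lower, n=1, cutoff=cutoff)
        if match:
            hits.append((path, match[0]))
    return hits
```

Run over the saved filescan paths, this would pair exploder.exe with explorer.exe; checking where each name is expected to live (explorer.exe normally sits in \WINDOWS, not system32) is a natural next filter.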

And with that I have almost answered the five DFIR questions: the process was explorer.exe (PID 1672), irykmmww.sys is what hides the malware artifacts from the system, and persistence was most likely achieved with DLL injection via irykmmww.dll.

There is more for me to follow up on and research, and more notes that I should have collected in real time and posted. Hopefully the next investigation will be more conclusive and complete, and by then I will be more familiar with Windows internals.

Final note: SANS strongly recommends that intrusion/incident reports do not state personal opinions and present facts only. However, for my learning process I have included some of my opinions, and I hope to validate them soon if SANS DFIR publishes its solution.