Saturday, 22 March 2014

What if an Ansible Run Hangs?

I was running Ansible against thousands of nodes without being fully aware of each node's status before the run: some were heavily loaded and busy, and some were down. Heavily loaded or out-of-memory (OOM) nodes, and playbook tasks that are prone to waiting or blocking, can all cause Ansible to hang. Below are some of the steps that I followed, or that were collected from the Ansible mailing list*, to help debug such a hang:

Is it the initial connection?

Use -vvvv to troubleshoot the connection.

What looks like a hang could be normal behaviour, even if it is not what you intended:

From the Ansible playbooks documentation on async:

By default tasks in playbooks block, meaning the connections stay open until the task is done on each node. This may not always be desirable, or you may be running operations that take longer than the SSH timeout.

Is it the remotely executed task?

  • Run ansible-playbook with ANSIBLE_KEEP_REMOTE_FILES=1
  • Create a Python trace file from the module file that was kept on the remote node

$ python -m trace --trace \
    /home/jtanner/.ansible/tmp/ansible-1387469069.32-4132751518012/command \
    2>&1 | head
  --- modulename: command, funcname: <module> 
command(21): import sys 
command(22): import datetime 
command(23): import traceback 
command(24): import re 
command(25): import shlex 
  --- modulename: shlex, funcname: <module> 
shlex.py(2): """A lexical analyzer class for simple shell-like syntaxes.""" 
shlex.py(10): import os.path 
shlex.py(11): import sys 



Possible causes of hangs:

  • A stale shared file system on the targeted remote node
  • A yum-related task while another yum process is already running on the targeted node
  • A module dependency, such as a requirement to add the host to known_hosts in advance or to forward SSH credentials
  • Issues with sudo, where the SSH user and the sudo user are the same but sudo_user is not specified
  • Command module tasks that expect input from stdin
  • The setup module hanging because of a hardware- or OS-related issue; updated firmware and drivers could help
  • Network or firewall problems, or a change of network, firewall, or load balancing caused by the Ansible run itself
  • A lookup issue (e.g. DNS or user lookup)
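
For the stdin case in the list above, one way to check a suspect command outside of Ansible is to run it with stdin closed and a hard timeout. This is only a minimal sketch in Python; the function name run_with_timeout and the status strings are made up for illustration and are not part of Ansible.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd with stdin closed and a hard timeout.

    Returns a (status, output) tuple where status is "ok", "failed"
    or "timeout". Hypothetical helper, not an Ansible API.
    """
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # a command that reads stdin gets EOF at once
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep stderr together with stdout
            timeout=timeout_s,          # the child is killed when this expires
            text=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    return ("ok" if done.returncode == 0 else "failed"), done.stdout


if __name__ == "__main__":
    # A child that waits for input fails fast instead of hanging.
    print(run_with_timeout([sys.executable, "-c", "print(input())"], 5)[0])
```

A command that returns "timeout" here is a good candidate for the blocking task; one that returns "failed" immediately was probably waiting for input.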


* Thanks to Michael DeHaan and the Ansible developers for awesome code, and thanks to James Tanner for his help and pointers on the Ansible users mailing list and IRC.

* This was written at the time of Ansible 1.4.2 in a RHEL/CentOS-based environment. SSH connections could be improved even further by enabling ControlPersist or pipelining mode.

Thursday, 30 January 2014

DFIR Dec. 2013 Memory Forensics Challenge notes :

This is my first memory forensics investigation outside of the SANS 508 SIFT workstation exercise on Timothy Dungan's workstation ("Stark Research Labs intrusion case by Hydra"). So even though I believe I have answered the questions asked in the SANS DFIR blog, there is still a lot to learn and more skills to sharpen. With plenty of curiosity, Volatility, Redline, and the SIFT workstation, it is easy to run a memory investigation, especially if one is equipped with the SANS 508 course material and the Volatility IRC channel. Below are my scattered notes from three separate sessions. The overall time was about 7-8 hours; it could have been done in one session with more focus and less distraction from the kids.

[Note to self: collect more reports and screenshots next time, and write the report as you go along.]

Using Mandiant Redline:

I used Redline whitelisting to filter out a large amount of data that is unlikely to be interesting: data that corresponds to unaltered, known-good software components. However, I was not successful at finding red flags ("rogue processes") straight away. There were three suspicious processes I was targeting, but I could not find the obvious anomalies that malware introduces to a system. So I started looking for other low-hanging fruit and signals that could give me a good pivot point, also using the low-frequency-of-occurrence technique, and kept to the DFIR challenge questions to stay focused.

  • Suspicious untrusted handle pork_bun associated with the explorer.exe process (PID 1672)




 Possible global rootkit cloaking activity via System Service Descriptor Table (SSDT) hooks:
 The hooking module name, irykmmww.sys, looks suspicious. It hooks NtEnumerateKey, NtEnumerateValueKey, and NtQueryDirectoryFile in ntoskrnl.exe, which are used to hide things:
o   NtEnumerateValueKey: allows an application to identify and interact with registry values. Malware uses this to insert itself between a registry value request and its result, and to filter out whatever values it wants to hide.
o   NtEnumerateKey: allows an application to identify and interact with registry keys. Malware uses this to insert itself between a registry key request and its result, and to filter out any registry keys it wants to hide.
o   NtQueryDirectoryFile: gives the application the ability to perform a directory listing. By hooking this function, malware can hide directories or files from normal file managers as well as from anti-malware tools.
o   NtDeviceIoControlFile: the API Windows uses for network-related work, widely mentioned in malware behavior analysis papers. Malware can use it to replay network traffic, how cool is that?!


Not to mention my company campus ISP blocks me from doing some more research ;-)




Not that it cannot be bypassed with a VPN connection.

I tried to acquire the driver for further analysis, but Redline could not dump it. As you will see later, I was able to dump it with Volatility, which shows why you need to know more than one tool: no single tool fits every situation, and tools tend to fail you when you need them most.

Using Volatility to cross check and dig deeper:

Treating it as a real case, I preserved the original image as read-only and recorded its hash value:

                 $ sudo chattr +i dfir-challenge/APT.img
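
The hash step is not shown above, so here is a minimal Python sketch of it. The choice of SHA-256 and the function name sha256_of_file are my assumptions; the notes only say that a hash value was kept. The file is read in chunks so that a large memory image does not have to fit in RAM.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at path.

    Reads chunk_size bytes at a time, so large images are fine.
    SHA-256 is an assumption; use whatever your case notes require.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys
    # Usage: python hash_image.py dfir-challenge/APT.img
    print(sha256_of_file(sys.argv[1]))
```

Recording the digest before the analysis and recomputing it afterwards shows that the evidence file was not modified along the way.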

To start processing, we need to know more about the image file's profile, so we run imageinfo:

sansforensics@SIFT-Workstation:/cases/dfir-challenge$ vol32.py -f ./APT.img imageinfo
Volatile Systems Volatility Framework 2.1_alpha
Determining profile based on KDBG search...

Suggested Profile(s) : WinXPSP3x86, WinXPSP2x86
AS Layer1 : JKIA32PagedMemoryPae (Kernel AS)
AS Layer2 : FileAddressSpace (/cases/dfir-challenge/APT.img)
PAE type : PAE
DTB : 0x319000
KDBG : 0x80545b60L
KPCR : 0xffdff000L
KUSER_SHARED_DATA : 0xffdf0000L
Image date and time : 2009-05-05 19:28:57
Image local date and time : 2009-05-05 19:28:57
Number of Processors : 1
Image Type : Service Pack 3

PROFILE : WinXPSP3x86

The normal process listing walks the doubly linked list of process structures, so it only shows processes that have not been hidden by unlinking them from that list.


Cross-examining the processes seen normally via the doubly linked list against the ones scraped from memory structures:
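
The cross-view comparison itself is just a set difference. Below is a minimal Python sketch that assumes the two views have already been parsed into (pid, name) pairs; the function name and the sample pairs in the usage part are invented for illustration and are not results from this image.

```python
def hidden_candidates(listed, scanned):
    """Return (pid, name) pairs found by scanning but absent from the list walk.

    listed:  pairs seen by walking the doubly linked process list
    scanned: pairs carved out of memory by scanning for process structures
    """
    listed_pids = {pid for pid, _name in listed}
    return sorted((pid, name) for pid, name in scanned if pid not in listed_pids)


if __name__ == "__main__":
    # Placeholder data only, NOT taken from the challenge image.
    listed = [(4, "System"), (100, "a.exe")]
    scanned = [(4, "System"), (100, "a.exe"), (200, "b.exe")]
    print(hidden_candidates(listed, scanned))
```

Anything returned is only a candidate: a scan also picks up processes that have simply exited, so each hit still needs to be checked by hand.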


Scanning for network artifacts: since this is assumed to be an APT ("advanced persistent threat") case, one good lead is that if the box was infected at some point, the malware will have to connect to covert command-and-control (C2) channels. If this was not the originally infected machine, data exfiltration activity should still leave some bread crumbs for us to trace.



Interestingly, in the connection scan above we see a connection on port 443, which is usually a firewall-friendly port, that appears to be either inactive or stealthy. It is, however, from the same process to the same IP, and the process is explorer.exe (1672). Looking up the IP 222.128.1.2 with whois, as seen below, we find that it belongs to our friends at a Chinese state-owned ISP in Beijing.

Malware will usually set a mutant (mutex) so that it does not cause problems for the system or for itself by trying to install or configure itself again; it does that by checking whether a certain mutant exists. One interesting mutant I saw in both Redline and Volatility was the pork_bun mutant.


Now I am quite confident that explorer.exe (PID 1672) is the rogue process. Finding which file carries the malware, in case it was injected or hollowed, is quite a tedious task. However, I double-checked the least frequent, strangely named, unsigned handles, starting with the executable DLLs, as well as the SSDT hooks:




Both the DLL search and the SSDT hooks in Volatility led to the same conclusion as Redline, and this time I was able to dump the driver irykmmww.sys and confirm that it is rogue using VirusTotal.




Most of the VirusTotal findings point to a generic trojan/backdoor rootkit that is installed using an exploit rather than spreading like a virus. It was probably delivered via social engineering, most likely phishing, as is the norm with APT; however, I am not able to tell from the research so far.




VirusTotal also confirmed that a variant of the notorious Poison Ivy trojan was used. Poison Ivy was famously used to attack RSA's SecurID infrastructure in 2011; it is still going strong after eight years and is being used in targeted attacks.

Another finding is that the malware logs its findings or activity to:

C:\DOCUME~1\demo\LOCALS~1\Temp\irykmmww.log

Running filescan and saving the output to a file for further analysis, I can see one or two other suspicious explorer-like files, for example:


'\\WINDOWS\\system32\\exploder.exe' does not make sense to be running under system32?!



And with that I have the five DFIR questions almost answered: the process was explorer.exe (PID 1672), irykmmww.sys is what hides the malware artifacts from the system, and persistence was most likely achieved with DLL injection via irykmmww.dll.

There is more for me to follow up and research, and more notes that I should have collected in real time and posted. Hopefully the next investigation will prove more conclusive and complete, and by then I will be more familiar with Windows internals.

Final note: SANS strongly recommends that intrusion/incident reports present facts only and do not state personal opinions. For the sake of my own learning, however, I have included some of my opinions, and I hope to validate them soon if SANS DFIR publish their solution.

Saturday, 7 December 2013

Dynamic Test/Evaluation Environment 

 Vagrant, Ganeti, and OpenStack are great tools for a dynamic, data-driven test environment. Couple them with a configuration management tool such as CFEngine 3, Chef, Puppet, Ansible, or SaltStack and you will start having more time on your hands and appreciating life around you. The possibilities are endless. If you are looking for a highly available backend infrastructure, Ganeti is your solution; it is already used by the "Open Source Labs", Google, Mozilla, and the Greek Research and Technology Network, among others, to manage clusters of virtual environments with resilience in mind. If you are looking for flexibility and want to provide your users with a private cloud solution, OpenStack will do. For testing new administration tools, policies, cookbooks, manifests, playbooks, and blueprints, Vagrant is the way to go. Combine the three and you have dynamic solutions that scale from a few virtual nodes on your own laptop or workstation to Amazon EC2 or your own company's private cluster environment.

Devops afternoon in Khobar, Saudi Arabia


Devops and web operations did not pick up in the Middle East as they did in the US, Europe, China, and India. We had a chance to present at the HPC Saudi 2013 user group conference, which was coordinated by our technology planning engineer Khalid Chatilla and Intel/IDC. We decided to check with CFEngine, PuppetLabs, Ansibleworks, and Opscode whether they could participate, and they showed interest even though it was already the end of the year and budgets were already consumed, not to mention the short notice and the logistics and planning needed to secure their trip to Saudi Arabia. In the end Ansibleworks and PuppetLabs managed to come and delivered an awesome afternoon. My colleague and friend Ahmed bu Khamessin, with his limited resources, was able to capture some of these moments with his video camera, and even though the sound quality is not great, he made the videos public to the world. You can see my intro slides and Ahmed's videos below.

Prezi Introduction to Saudi Devops Days  with use cases from CFEngine, and Chef.

Ansible presentation :



 Puppetlabs presentation on YouTube


Wednesday, 19 June 2013

Software packages and repositories

Software packages and repositories are my first stop in automating the OS life cycle. The OS image, including all software stacks (OS, middleware, management, and application), should represent a fixed state. That would be difficult to track if installs were done ad hoc outside of a packaging system. We mainly use RHEL-based distros, so you would think the answer is to use yum and RPMs! Well, there are Java applications as jars, there are Ruby gems, there are Python eggs, and there are git clones and tarballs. One answer is to use fpm to convert from these other formats to RPM.
  • So one challenge is the diversity of packaging types and how to standardise on one.
  • Second comes Internet isolation and state: at work we are not allowed to download directly from the net.
For this second problem I need a way to mirror publicly accessible or enterprise-provided repos to internal repos. The easiest choice is to mirror everything and copy it (or rsync it) over to work periodically.

For Ruby gems, here is the simplest way to do it:

http://stackoverflow.com/questions/8411045/how-to-build-a-rubygems-mirror-server

$ gem install rubygems-mirror

Edit the YAML configuration file ~/.gem/.mirrorrc:

---
- from: http://rubygems.org
  to: ~/.gem/mirror
The to: field above could instead point to USB storage. Wherever it points, create that directory:
$ mkdir ~/.gem/mirror
Start mirroring:
$ gem mirror
Once mirroring finishes edit ~/.gem/mirror/config.ru:
require "rubygems"
require "geminabox"

Geminabox.data = "./"
run Geminabox
Install Gem in a box:
$ gem install geminabox
Start gem server:
$ cd ~/.gem/mirror
$ rackup
Edit your application's Gemfile to use your gem server:
source "http://your.servers.ip:9292"

Tuesday, 21 May 2013

Virtualbox guest host NATed


After installing CentOS 6.4 as a guest OS on Windows 8.0 and configuring its single network interface in NAT mode, I could not at first SSH with PuTTY to the guest's DHCP-assigned IP address, 10.0.2.15.

I had to power off the guest and enable port forwarding first, as described in the "natforward" section under NAT networking mode in chapter 6 of the user manual.

https://www.virtualbox.org/manual/ch06.html#natforward

Below are the commands I used to configure and check port forwarding:

 .\VBoxManage.exe list vms
 .\VBoxManage.exe modifyvm "CentOS01" --natpf1 "guestssh,tcp,,2222,,22"
 .\VBoxManage.exe showvminfo CentOS01 | findstr "2222"
NIC 1 Rule(0):   name = guestssh, protocol = tcp, host ip = , host port = 2222, guest ip = , guest port = 22
In PuTTY, the host is localhost and the port in this case is 2222.
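
To confirm that the forward actually reaches sshd in the guest, and not just a listening port on the host, one option is to read the banner that an SSH server sends first. This is a minimal Python sketch; the function name ssh_banner is made up, and the defaults simply mirror the localhost:2222 rule above.

```python
import socket


def ssh_banner(host="localhost", port=2222, timeout_s=3.0):
    """Return the SSH identification string seen on host:port, or None.

    An SSH server sends a line starting with "SSH-" as soon as the
    client connects. None means nothing answered, or it was not SSH.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            data = conn.recv(256)       # the banner is a single short line
    except OSError:                      # refused, unreachable, or timed out
        return None
    banner = data.decode("ascii", errors="replace").strip()
    return banner if banner.startswith("SSH-") else None


if __name__ == "__main__":
    print(ssh_banner() or "no SSH banner on localhost:2222")
```

If this prints a banner, PuTTY should be able to connect with the same host and port.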

The above was done to test RDO, the OpenStack distribution for Red Hat based systems. I had several failures before I was able to install it successfully using the quickstart. First, the disk size: the disk should be over 22 GB so that Cinder can create its default 20 GB volume. Second, SELinux needs to be enabled. And every time the installation fails, you need to remove the Cinder packages and the logical volume manually, and clean up the bits and pieces of the old installation, before restarting.

A successful install should not take more than 20-30 minutes.


Monday, 28 January 2013

How many administrators do you need for your operations?


Several online resources discuss this issue. The answer usually depends on several factors, such as:

I- Factors that could reduce the number of admins needed:
  • Remote console/power and remote management tool availability
  • Virtualisation
  • Physical server and rack technology (e.g. blades or skinless vs. 2U servers)
  • Availability of management tools (rack management, APIs such as those of EC2 and other cloud providers)
  • Platform (e.g. Unix and Unix-like vs. Windows)
  • Configuration management and automation tools
  • Initial plan and vision of business/data centre expansion
  • Organisation requirements, maturity, stability, and adoption of the devops culture
II- Factors that could increase the number of admins needed:
  • Size and diversity of data managed
  • Number and diversity of servers and server configurations
  • Number of users
  • Number and diversity of applications used and supported*
  • Number of new technologies deployed on the ground or acquired within the data centre
  • Complexity of the solution and infrastructure
* Used by the administration team, and supported on behalf of others within or external to the organisation.
So what is the best-practice metric that should be used? It depends on what kind of operations the business is running and how messy or diverse its customer or application space is, as well as on management's approach towards operations and support from the start. Note also that what we call best practice is only the best attempt, the best deployment plan, at a given time: as soon as it has materialised it can be improved, so it becomes merely good practice, or even bad practice if it does not evolve.

CERN did not use virtualization to help deploy and run its HPC codes; however, it has chosen to adopt virtualization to reduce administration and management costs.

Facebook: 230 engineers supporting data for more than three million users, at around 130 servers per admin. [1]

Microsoft: automated data center operations at around 1,000-2,000 servers per admin, while its new container data center will be at around 10,000 servers per data center employee.

IDC reports that at large, dominant providers such as Google it could be 10,000 servers per admin, while in small to medium businesses it could be 30:1 for physical boxes and 80:1 for virtual machines. [2]

Gartner analyst Errol Rasit says, “We have observed that it can be, for example with a physical server, as low as 10 per admin, and for virtual servers as many as 500.”
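
To turn such ratios into a rough headcount, divide and round up. This is a minimal Python sketch: the 30:1 and 80:1 ratios are the IDC small-to-medium-business figures quoted above, while the server counts in the usage part are invented purely for illustration.

```python
import math


def admins_needed(servers, servers_per_admin):
    """Return the number of admins for a server count at a given ratio.

    Rounds up, because a fraction of an admin still means one more person.
    """
    if servers_per_admin <= 0:
        raise ValueError("servers_per_admin must be positive")
    return math.ceil(servers / servers_per_admin)


if __name__ == "__main__":
    # Hypothetical shop: 600 physical boxes and 800 virtual machines.
    physical = admins_needed(600, 30)   # IDC SMB ratio for physical boxes
    virtual = admins_needed(800, 80)    # IDC SMB ratio for virtual machines
    print(physical, virtual, physical + virtual)
```

This is only a starting estimate; the factors listed in sections I and II move the real number up or down.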


resources :

[1] Data Center Knowledge Article "how many servers can one admin-manage"

[2] Computer World Australia IDC reference