Failing VMs
This document describes the usual actions and tools used to drill down into failing VM issues and find a root cause.
For troubleshooting specific issues, see also these tips.
Troubleshooting a failed deployment
These are the usual steps for drilling down to the root cause of a VM instance failure.
- Identify any failing VM instance with `bosh -d <deployment-name> instances`, possibly narrowing the list to failing instances with `--failing`, or detailing failing jobs with `--ps`.
- `bosh ssh` to a VM having an issue.
- Become superuser with `sudo -i` for a full `root` login, which puts `monit` on the `$PATH`.
- Check failing Monit jobs with `monit summary`. Whenever the failure has happened at the `pre-start` stage, this list is empty because the Monit configuration is not yet assembled.
- Check for any full disk device with `df -h`.
- Check for excessive memory consumption or anything suspicious in the process tree (like duplicate or zombie processes) with `top` (press `V` for tree display, `c` for command-line arguments, `E` twice for GiB summary memory units, `e` for MiB process memory units, `W` to persist the current display, `L` to locate a process, `&` for the next search result, `k` to send a signal to the process displayed on the first line, `q` to quit).
- Check the logs of failing processes in `/var/vcap/sys/log/<job-name>/*.log` and browse them with `less` (press `>` to go to the end of the file, press `F` to follow the latest logs in live mode, press `^C` to stop following).
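The `df -h` check above can also be scripted when triaging many VMs. A minimal sketch, assuming a POSIX-compatible `df` and an arbitrary 90% threshold (not part of any BOSH tooling):

```shell
# Flag any filesystem above a usage threshold.
# THRESHOLD is an arbitrary choice for this sketch.
THRESHOLD=90
df -P | awk -v t="$THRESHOLD" '
  NR > 1 {                # skip the header line
    use = $5              # capacity column, e.g. "42%"
    sub(/%/, "", use)     # strip the percent sign
    if (use + 0 >= t) print $6 " is at " use "% capacity"
  }'
```

With `-P`, `df` guarantees one line per filesystem, so the column positions are stable for `awk`.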
Troubleshooting the BOSH Agent
Troubleshooting the BOSH Agent is rarely necessary, but here we show how you can display some JSON metadata present on the VM instance, using tools that are available by default on stemcells.
- Check the latest BOSH Agent logs with `less /var/vcap/bosh/log/current`.
- Check the BOSH Agent initial configuration with `python3 -m json.tool /var/vcap/bosh/agent.json`.
- Check the BOSH Agent dynamic settings with `python3 -m json.tool /var/vcap/bosh/settings.json | less`.
- Check the VM instance role (as from the BOSH deployment manifest: jobs, packages, networks, etc.) with `python3 -m json.tool /var/vcap/bosh/spec.json`.
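Since `python3 -m json.tool` only pretty-prints JSON and never modifies it, these commands are safe to run on a live VM. A minimal sketch of the same invocation on a throwaway file (the keys below are illustrative, not the real `agent.json` schema):

```shell
# Create a throwaway JSON file and pretty-print it the same way
# the commands above pretty-print the agent's own files.
# The field names here are illustrative only.
cat > /tmp/sample-agent.json <<'EOF'
{"Platform": {"Linux": {"SampleSetting": true}}}
EOF
python3 -m json.tool /tmp/sample-agent.json
```

The command exits non-zero on malformed JSON, so it also doubles as a quick validity check for any of the agent's configuration files.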