Incident Report Procedure
Linux Based Environment
Goal: identify the faulty server, find the root cause, gather evidence, and either resolve the issue or escalate it.
Assumptions
- An incident has been declared and the server is suspected to be involved
- The operations person has login credentials (SSH and password)
- The existence of a ticketing or issue-tracking system (Jira, ServiceNow)
Phase 1: Preparation and Initial Assessment#
Acknowledge the assigned ticket:
- Take up the assigned ticket and read through the details
- From this point on, document every action carried out, every command run, and its timestamp; this makes the postmortem much easier to write (see the session-recording sketch at the end of this phase)
Gather context:
- Understand the issues reported for or associated with the server in the ticket (e.g. 'slow server', 'API returning 500', 'users unable to log in to the domain')
- Understand the issue reported so as to know what to tackle:
- What errors are presented?
- What services are hosted on the server?
- Are any other services on the server experiencing the same issue?
- Check the monitoring dashboards before logging in (Datadog, Grafana, Nagios) and look for obvious anomalies:
  - High CPU/memory usage
  - Latency spikes
  - Error rate increases
  - Health check failures
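To make the documentation step above less error-prone, the whole terminal session can be recorded. A minimal sketch, assuming a bash shell and the util-linux `script` tool; the ticket ID and log path are placeholders:

```bash
# Record every command and its output for the postmortem.
TICKET="INC-12345"                                     # hypothetical ticket reference
LOGFILE="$HOME/${TICKET}-$(date +%Y%m%dT%H%M%S).log"

# 'script' captures the full terminal session (commands and output) to $LOGFILE.
script -a "$LOGFILE"

# Inside the recorded session, a timestamped prompt gives each command a time:
export PS1='[\D{%F %T}] \u@\h:\w\$ '

# Type 'exit' to stop recording, then attach the log file to the ticket.
```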
Phase 2: Basic Connectivity & System Status#
Connection test
From your workstation or a bastion host, attempt to ping the server
Observation: check for packet loss, high latency, or the server being unreachable
Login attempt:
Try logging in using SSH or RDP
Observation: Successful login? Slow login? Connection refused? Timeout? Authentication error?
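A minimal sketch of the connectivity and login checks, run from the workstation or bastion host; the hostname is a placeholder:

```bash
SERVER="server01.example.com"    # hypothetical hostname of the affected server

# Reachability, packet loss and round-trip latency (5 probes).
ping -c 5 "$SERVER"

# Verbose SSH attempt: -v shows where a slow or failing login gets stuck
# (DNS lookup, TCP connect, key exchange, authentication).
ssh -v "$SERVER"
```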
Initial system overview once logged in:
Check system uptime: `uptime`
Observation: How long has the system been up? Has it rebooted recently? What is the load average?
Check for logged-in users: `who` or `w`
Observation: Any unexpected users or sessions?
Recent login history: `last | head -n 20`
Observation: Check for any unusual logins around the time the issue started
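These first checks can be grouped into a single pass whose output goes straight into the ticket:

```bash
uptime               # uptime, reboot recency, 1/5/15-minute load averages
w                    # logged-in users and what they are currently running
last | head -n 20    # recent login history around the incident window
```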
Phase 3: Resource Utilisation Analysis#
CPU Usage:
Check overall and per-process CPU usage: `top`, or `htop` if installed
Observation: Is CPU usage close to 100%? Which processes are consuming the most CPU?
Run for a short period to see trends: `vmstat 1 5`
Observation: Look at `us`, `sy`, `id` and `wa`; a high `wa` (I/O wait time) means there is a storage bottleneck
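A minimal sketch for capturing the CPU snapshot non-interactively, so the output can be pasted into the ticket:

```bash
top -b -n 1 | head -n 20                                   # batch mode: header plus busiest processes
vmstat 1 5                                                 # us/sy/id/wa trend over five 1-second samples
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 10   # top CPU consumers
```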
Memory Usage:
Check memory usage: `free -h`
Observation: Is available memory very low? Is swap heavily used?
Use `top` or `htop` (sort by memory usage: press M in `top`, or F6 in `htop` then select PERCENT_MEM)
Observation: Which processes are consuming a lot of memory?
**Check OOM (out-of-memory) killer events**
`dmesg | grep -i oom-killer` or `journalctl | grep -i oom`
Observation: Has the kernel killed processes due to low memory?
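A minimal sketch of the memory and OOM checks; the 2-hour journal window is an arbitrary example:

```bash
free -h                                                   # total/used/available memory and swap
ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -n 10   # top memory consumers
dmesg -T | grep -iE "out of memory|oom-killer"            # kernel OOM events, human-readable timestamps
journalctl -k --since "2 hours ago" | grep -i oom         # same via the journal, time-bounded
```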
Disk I/O:
Check disk utilisation and wait times: `iostat -dxz 1 5` (reports extended disk statistics every second, five times; requires the `sysstat` package)
Observation: Look for high `%util` and high await times. Which devices are affected?
Check per-process I/O: `iotop` (if installed)
Observation: Which processes are doing the most disk reads and writes?
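A minimal sketch of the disk I/O checks; `iostat` comes from `sysstat`, and `iotop` needs root privileges:

```bash
iostat -dxz 1 5           # extended per-device stats, 1-second interval, 5 samples
sudo iotop -b -n 3 -o     # batch mode, 3 iterations, only processes actually doing I/O
```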
Disk space:
Check filesystem usage: `df -h`
Observation: Are any critical mounts (`/var`, `/tmp`, `/`) full or nearly full (above 90-95%)?
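If a filesystem is nearly full, a quick sketch for finding what is consuming it; the `/var` path is only an example:

```bash
df -h                                                  # usage per mount point
df -i                                                  # inode usage; a full inode table also breaks writes
sudo du -xh /var 2>/dev/null | sort -rh | head -n 15   # largest directories on that filesystem
```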
Network I/O and connections:
Check network interface statistics: `ip -s link show` or `netstat -i`
Observation: Look for high `errors` and `dropped` counters
Check network throughput per interface/process: `iftop` or `nethogs` (if installed)
Observation: Is a particular interface saturated? Is a process sending abnormally high traffic?
Listening ports and established connections: `ss -tulnp` or `netstat -tulpn`
Observation: Are the expected services listening? What states are the connections in (ESTABLISHED, TIME_WAIT)? Any suspicious listening ports?
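A minimal sketch of the network checks:

```bash
ip -s link show                       # per-interface RX/TX counters, errors, drops
ss -tulnp                             # listening TCP/UDP sockets and the owning processes
ss -s                                 # summary of socket states (established, time-wait, ...)
ss -tan state established | wc -l     # rough count of established TCP connections
```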
Phase 4: Application and Service Level Checks#
Service status
- Check the status of the primary apps/services running on the server
- systemd: `systemctl status [servicename]`
- SysVinit: `service [servicename] status` or `/etc/init.d/[servicename] status`
Observation: Is the service active/running? Did the service exit with an error? Check the recent log lines shown by the `status` command
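A minimal sketch on a systemd host; `myapp` is a placeholder for the real service name:

```bash
systemctl status myapp                      # state, recent exits, last few log lines
systemctl is-active myapp                   # prints just "active"/"failed"; handy in scripts
journalctl -u myapp --since "1 hour ago"    # the service's own logs around the incident
```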
Process check:
- Verify the application process is running: `ps aux | grep [process-name]` or `pgrep -lf [process-name]`
Observation: Is the process running? Are there any zombie processes (state Z)?
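A minimal sketch; `myapp` is again a placeholder process name:

```bash
pgrep -lf myapp                               # PIDs and command lines of matching processes
ps aux | grep "[m]yapp"                       # same idea; the [m] trick hides the grep itself
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'    # list any zombie (state Z) processes
```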
Application logs
- Locate application-specific logs (e.g. `/var/log/nginx/error.log`, `/var/log/app/app.log`)
- Tail the logs for live activity: `tail -f /path/to/applogs`
- Search for entries around the time the error occurred: `grep -iE "error|warn|fatal|exception|timeout" /path/to/log`
- You can also restrict by time and filter for a specific error: `journalctl --since "1 hour ago"` or `journalctl --since "1 hour ago" | grep -i failed`
Observation: Are there error messages correlating with the incident start time?
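A minimal sketch of the log search; the log path is a placeholder for the application's actual log file:

```bash
APPLOG=/var/log/myapp/app.log                            # hypothetical log path
tail -n 100 "$APPLOG"                                    # most recent entries
grep -iE "error|warn|fatal|exception|timeout" "$APPLOG" | tail -n 50
journalctl --since "1 hour ago" | grep -i failed         # time-bounded, system-wide search
```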
**Local functionality test**
- If it is a web server, try accessing it locally: `curl -v http://localhost:port/path`
- If it is a database, try connecting to it using a local client tool
Observation: Does it work locally? Are responses slow? Are there any errors?
- This helps differentiate between a network issue on the path from external clients and an issue internal to the server
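A minimal sketch for a local HTTP check; the port and path are placeholders for the real endpoint:

```bash
curl -v http://localhost:8080/health                         # verbose: headers, status code, errors
curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
     http://localhost:8080/health                            # just the status code and response time
```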
Phase 5: Deeper System and Configuration Checks#
Kernel/system log deep dive
- Go back to `/var/log/messages`, `/var/log/syslog`, or `journalctl`
- Look at the timeframe of the incident
- Search for specific errors related to hardware, storage, network, and kernel modules
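A minimal sketch for scoping the logs to the incident window; the timestamps and grep patterns are examples, not an exhaustive list:

```bash
journalctl --since "2024-01-15 09:00" --until "2024-01-15 10:00" -p warning   # hypothetical incident window
journalctl -k -p err --since "2 hours ago"                                    # kernel messages, error level and worse
grep -iE "i/o error|link is down|segfault|call trace" /var/log/syslog         # or /var/log/messages on RHEL-family systems
```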
Configuration changes
- Check if relevant configuration files under `/etc/` have been changed
- If you are using `git`, check the version control history
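A minimal sketch for spotting recent configuration changes; the 2-day window is arbitrary, and the `git` check assumes `/etc` is actually under version control (e.g. via etckeeper), which not every server has:

```bash
sudo find /etc -type f -mtime -2 -ls                 # config files modified in the last 2 days
sudo git -C /etc log --stat --since "2 days ago"     # only if /etc is version-controlled
```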
Hardware health
- Check disk health if suspected: `smartctl -a /dev/sdX` (requires `smartmontools`)
- Check for hardware errors reported by the system (`dmesg`, vendor-specific tools)
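A minimal sketch; `/dev/sda` is a placeholder device and `smartctl` comes from the `smartmontools` package:

```bash
sudo smartctl -H /dev/sda                               # quick overall SMART health verdict
sudo smartctl -a /dev/sda                               # full SMART attributes and error log
dmesg -T | grep -iE "error|fail|fault" | tail -n 30     # recent hardware-related kernel messages
```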
Phase 6: Escalation and Resolution#
Synthesize findings
- Review all the findings gathered
- Formulate a hypothesis about the root cause, e.g. "loss of network connectivity led to the database outage"
Attempt remediation
- If the main cause of the issue is clear and within your scope, fix it (e.g. reboot the server, clear temporary files)
- It is advisable to document any change before making it
- Verify that the fix resolved the issue, and keep monitoring closely
Escalate if necessary
- If the issue is beyond your scope, this is the time to escalate it to a manager or another team (e.g. developers, networking team)
- The fix could also require a reboot that would affect other services on the same server; this may require authorisation
- When escalating, make sure you have gathered enough data (evidence) to point out the exact issue with the server; do not escalate with a vague diagnosis like "server X is broken"
Reference: Security Incident Survey Cheat Sheet for Server Administrators