
When Things Go Wrong: A Mac Admin's Diagnostic Playbook

Systematic approach to diagnosing macOS issues — from initial observation through log analysis, root cause identification, and remediation

Published: Feb 14, 2026 | 20 min read

Introduction

Every Mac admin has experienced the moment: a user reports something is broken, or a monitoring alert fires, and the clock starts ticking. The difference between a 15-minute resolution and a 4-hour rabbit hole is usually not technical skill. It is methodology.

This playbook presents a structured six-step diagnostic process. It is not a checklist of fixes. It is a framework for systematically moving from “something is wrong” to “here is why, and here is the proof.” Follow these steps in order. Resist the urge to skip ahead to remediation before you understand the root cause.

Step 1: OBSERVE

The first five minutes determine the trajectory of your entire investigation. Capture system state before doing anything that changes it.

Capture the Report

Document what was reported and by whom. Get specifics:

What is the symptom? (application crash, slow performance, login failure, network timeout)
When did it start? (exact time if possible, or “after the last update”)
How often does it occur? (every time, intermittent, once)
What changed recently? (OS update, new software, MDM policy push, hardware swap)

Capture System State Immediately

Before touching anything, run these commands to snapshot the current state:

# System uptime (has it been rebooted recently?)
uptime

# Disk space (full disks cause cascading failures)
df -h /

# Memory pressure
vm_stat | head -5
memory_pressure

# CPU and process snapshot (macOS ps has no --sort option; -r sorts by CPU usage)
ps aux -r | head -20

# Recent crash reports
ls -lt /Library/Logs/DiagnosticReports/ | head -10
ls -lt ~/Library/Logs/DiagnosticReports/ | head -10

# Check for recent kernel panics
ls -lt /Library/Logs/DiagnosticReports/*panic* 2>/dev/null
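
If you want the snapshot saved rather than scrolled past in a terminal, a small wrapper can write each command's output to its own file. The following is a minimal sketch, not a finished tool: the function name, the file-name keys, and the 30-second timeout are illustrative assumptions, and the command list is only a subset of the checklist.

```python
import subprocess
from pathlib import Path

# A subset of the snapshot commands; the dictionary keys become file names (illustrative).
SNAPSHOT_COMMANDS = {
    "uptime": ["uptime"],
    "disk": ["df", "-h", "/"],
    "vm_stat": ["vm_stat"],
    "memory_pressure": ["memory_pressure"],
    "processes": ["ps", "aux", "-r"],  # -r sorts by CPU usage on macOS ps
}


def capture_snapshot(commands, outdir):
    """Run each command, save its output to <outdir>/<name>.txt, return a status map."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    status = {}
    for name, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            output = result.stdout + result.stderr
            status[name] = f"exit {result.returncode}"
        except FileNotFoundError:
            # The tool is not installed on this host; record that instead of failing.
            output = ""
            status[name] = "missing"
        except subprocess.TimeoutExpired:
            output = ""
            status[name] = "timeout"
        (outdir / f"{name}.txt").write_text(output)
    return status
```

Calling `capture_snapshot(SNAPSHOT_COMMANDS, "/tmp/snapshot")` before touching the machine leaves a set of text files you can attach to the incident record. Recording "missing" and "timeout" instead of raising keeps one unavailable tool from aborting the rest of the capture.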

Capture a Quick Log Snapshot

# Grab the last 10 minutes of errors and faults
log show --last 10m --predicate 'messageType == error OR messageType == fault' \
  --style compact > /tmp/recent_errors.txt

# Count errors by process to see what is noisiest
# (ndjson entries carry the process as a full path in processImagePath)
log show --last 10m --style ndjson --predicate 'messageType == error' | \
  python3 -c "
import sys, json, os
from collections import Counter
c = Counter()
for line in sys.stdin:
    try:
        entry = json.loads(line)
    except ValueError:
        continue  # skip non-JSON lines
    c[os.path.basename(entry.get('processImagePath', '')) or 'unknown'] += 1
for proc, count in c.most_common(20):
    print(f'{count:6d}  {proc}')
"

Rule of thumb: If df -h / shows less than 10% free space, treat low disk space as the prime suspect regardless of what was reported, and rule it out before digging deeper. Low disk space causes log storage failures, application crashes, swap exhaustion, and update failures.
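
The 10% rule of thumb is easy to turn into a scripted check. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the default threshold parameter are illustrative, and the figure it reports may differ slightly from what `df` shows.

```python
import shutil


def free_percent(path="/"):
    """Return the percentage of free space on the volume that contains path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.free / usage.total


def is_low_on_space(percent_free, threshold=10.0):
    """True when free space is below the threshold (10% mirrors the rule of thumb)."""
    return percent_free < threshold


pct = free_percent("/")
flag = "LOW" if is_low_on_space(pct) else "ok"
print(f"/ has {pct:.1f}% free [{flag}]")
```

Keeping the threshold as a parameter makes it easy to alert earlier (for example at 15%) on machines with small internal disks.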

Step 2: ISOLATE

Determine the scope of the problem before diving into logs.

Fleet-Wide or Isolated?

# If you have SSH access across the fleet, check if the issue is widespread:
# (Adjust for your environment -- Jamf Pro, Munki, osquery, etc.)

# With osquery on multiple hosts:
# echo "SELECT * FROM uptime;" | osqueryi

# With Jamf Pro:
# Check Smart Group membership for the affected criteria

Ask yourself:

Is this one machine, one user, one building, or everyone?
Did something roll out recently (MDM profile, software update, config change)?
Is this correlated with a network or infrastructure change?
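
One way to answer these scoping questions with data is to look for attributes that every affected machine shares and then count how common each value is across the fleet. The sketch below assumes you can export inventory as a list of dictionaries; the field names and host records are hypothetical stand-ins for whatever your MDM or osquery export provides.

```python
def shared_attributes(affected, fleet):
    """Map attribute -> (value, affected_count, fleet_count) for values all affected hosts share."""
    if not affected:
        return {}
    shared = {}
    for key in affected[0]:
        if key == "hostname":
            continue
        values = {host.get(key) for host in affected}
        if len(values) == 1:
            value = values.pop()
            fleet_count = sum(1 for host in fleet if host.get(key) == value)
            shared[key] = (value, len(affected), fleet_count)
    return shared


# Hypothetical inventory export
fleet = [
    {"hostname": "mbp-01", "os": "15.3", "building": "A", "agent": "4.2"},
    {"hostname": "mbp-02", "os": "15.3", "building": "B", "agent": "4.2"},
    {"hostname": "mbp-03", "os": "15.2", "building": "A", "agent": "4.1"},
    {"hostname": "mbp-04", "os": "15.3", "building": "A", "agent": "4.1"},
]
affected = fleet[:2]

for key, (value, hit, total) in shared_attributes(affected, fleet).items():
    print(f"{key}={value}: {hit} affected of {total} fleet-wide")
```

An attribute where the affected count equals the fleet-wide count (here the agent version) is a stronger lead than one shared with many healthy machines (here the OS version).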

Reproduce the Issue

If the problem is intermittent, set up monitoring before attempting reproduction:

# Start a live log stream filtered to the relevant area
# Run this in a separate terminal window
log stream --predicate 'subsystem == "com.apple.mdmclient"' --level debug \
  --style compact | tee /tmp/live_capture.txt

Then ask the user to reproduce the issue while the stream is running.

Targeted Process Monitoring

Use fs_usage and opensnoop (where available) to watch what a specific process is doing at the filesystem and system call level:

# Watch file system activity for a specific process
sudo fs_usage -w -f filesys Safari 2>&1 | head -200

# Watch file opens for a process by name (DTrace-based, so SIP restricts it on modern macOS)
sudo opensnoop -n Safari 2>&1 | head -50

# Watch network activity for a process
sudo nettop -p $(pgrep -x Safari) -J bytes_in,bytes_out -d

Step 3: COLLECT

Gather all relevant log data into a working set for analysis. Cast a wide net – you can always filter later, but you cannot recover logs that were not collected.

Unified Logs

# Collect a log archive covering the incident window
# This creates a portable bundle that can be analyzed offline
sudo log collect --last 2h --output ~/Desktop/incident_$(date +%Y%m%d_%H%M).logarchive

For targeted collection:

# MDM issues
log show --last 2h --predicate 'subsystem == "com.apple.mdmclient" OR subsystem == "com.apple.ManagedClient"' --style compact > /tmp/mdm_logs.txt

# Software update issues
log show --last 6h --predicate 'subsystem == "com.apple.SoftwareUpdate" OR process == "softwareupdated"' --style compact > /tmp/su_logs.txt

# Login / authentication issues
log show --last 4h --predicate 'subsystem == "com.apple.loginwindow" OR subsystem == "com.apple.Authorization" OR process == "authorizationhost"' --style compact > /tmp/auth_logs.txt

# Kernel and I/O issues
log show --last 4h --predicate 'process == "kernel" AND (messageType == error OR messageType == fault)' --style compact > /tmp/kernel_logs.txt
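
Predicates like these are tedious to retype under pressure, so it can help to keep them in one place and generate the command line from a short name. The following is a sketch under that assumption: the predicate strings are copied from the commands above, while the dictionary keys and function name are made up for illustration.

```python
import shlex

# Predicate strings copied from the targeted-collection commands; the keys are illustrative.
PREDICATES = {
    "mdm": 'subsystem == "com.apple.mdmclient" OR subsystem == "com.apple.ManagedClient"',
    "software_update": 'subsystem == "com.apple.SoftwareUpdate" OR process == "softwareupdated"',
    "auth": ('subsystem == "com.apple.loginwindow" OR subsystem == "com.apple.Authorization" '
             'OR process == "authorizationhost"'),
    "kernel": 'process == "kernel" AND (messageType == error OR messageType == fault)',
}


def log_show_command(name, last="2h"):
    """Return the argv list for a `log show` call using a named predicate."""
    return ["log", "show", "--last", last,
            "--predicate", PREDICATES[name], "--style", "compact"]


# Print a copy-and-paste ready command with shell quoting applied.
print(shlex.join(log_show_command("mdm")))
```

Returning an argv list rather than a string means the same function can feed `subprocess.run` directly, with `shlex.join` used only when a human needs to paste the command.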

BSM Audit Logs

If you have BSM auditing enabled (the audit subsystem is deprecated as of macOS 11 and disabled by default as of macOS 14):

# Extract security events from the audit trail
sudo auditreduce -a $(date -v-2H +%Y%m%d%H%M%S) /var/audit/current | praudit -s > /tmp/audit_recent.txt

Application-Specific Logs

Some applications still write their own log files:

# Common application log locations
ls -la ~/Library/Logs/
ls -la /Library/Logs/
ls -la /var/log/

# Jamf Pro agent logs
ls -la /var/log/jamf.log 2>/dev/null
tail -n 100 /var/log/jamf.log

# Install.log (all pkg installations)
tail -n 100 /var/log/install.log

Crash Reports

# Copy recent crash reports to your working directory
# (modern macOS writes .ips files; older releases wrote .crash)
mkdir -p /tmp/crash_reports
cp /Library/Logs/DiagnosticReports/*.crash /tmp/crash_reports/ 2>/dev/null
cp /Library/Logs/DiagnosticReports/*.ips /tmp/crash_reports/ 2>/dev/null
cp ~/Library/Logs/DiagnosticReports/*.crash /tmp/crash_reports/ 2>/dev/null
cp ~/Library/Logs/DiagnosticReports/*.ips /tmp/crash_reports/ 2>/dev/null
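
Copying every report can pull in months of unrelated crashes. As a rough sketch, the helper below lists only reports modified within a recent window; the function name, the suffix set, and the 24-hour default are assumptions to adjust for your incident.

```python
import time
from pathlib import Path

REPORT_SUFFIXES = {".crash", ".ips", ".panic"}


def recent_reports(directories, max_age_hours=24, now=None):
    """Return report files modified within the window, newest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    found = []
    for directory in directories:
        directory = Path(directory).expanduser()
        if not directory.is_dir():
            continue  # skip locations that do not exist on this host
        for path in directory.iterdir():
            if (path.suffix in REPORT_SUFFIXES and path.is_file()
                    and path.stat().st_mtime >= cutoff):
                found.append(path)
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)
```

A call such as `recent_reports(["/Library/Logs/DiagnosticReports", "~/Library/Logs/DiagnosticReports"], max_age_hours=48)` returns paths you can then copy into the working directory.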

Network Capture

When the issue is network-related, capture traffic:

# Capture 60 seconds of traffic on the primary interface
sudo tcpdump -i en0 -w /tmp/capture_$(date +%Y%m%d_%H%M).pcap -G 60 -W 1

# Capture only traffic to/from a specific host (e.g., MDM server)
sudo tcpdump -i en0 host mdm.example.com -w /tmp/mdm_traffic.pcap -G 120 -W 1

Full System Diagnostic

When in doubt, collect everything:

# sysdiagnose captures logs, system state, network config, and much more
# This takes 5-10 minutes and creates a large archive
sudo sysdiagnose -f ~/Desktop/

Step 4: ANALYZE

Now work through the collected data systematically. The goal is to build a timeline of events leading to the failure.
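
A timeline is easier to build when entries from different collections are merged and sorted by time. The sketch below assumes ndjson output from `log show --style ndjson` and assumes each entry has `timestamp`, `processImagePath`, and `eventMessage` fields with the timestamp shaped like `2026-02-14 10:23:41.123456-0800`; verify those names against your own output before relying on it. The sample lines are synthetic.

```python
import json
import os
from datetime import datetime

# Assumed timestamp shape, e.g. 2026-02-14 10:23:41.123456-0800
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"


def build_timeline(lines):
    """Parse ndjson log lines into (datetime, process, message) tuples sorted by time."""
    events = []
    for line in lines:
        try:
            entry = json.loads(line)
            when = datetime.strptime(entry["timestamp"], TIMESTAMP_FORMAT)
        except (ValueError, KeyError, TypeError):
            continue  # skip summary lines, blanks, and anything unparseable
        process = os.path.basename(entry.get("processImagePath", "")) or "unknown"
        events.append((when, process, entry.get("eventMessage", "")))
    return sorted(events, key=lambda event: event[0])


# Synthetic sample entries, deliberately out of order
sample = [
    '{"timestamp": "2026-02-14 10:24:10.000200-0800", '
    '"processImagePath": "/usr/libexec/mdmclient", "eventMessage": "second"}',
    "not json",
    '{"timestamp": "2026-02-14 10:23:59.500000-0800", '
    '"processImagePath": "/sbin/launchd", "eventMessage": "first"}',
]
for when, process, message in build_timeline(sample):
    print(when.isoformat(), process, message)
```

Because the function accepts any iterable of lines, you can chain several saved ndjson files together and get one ordered view across subsystems.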

Correlate by Timestamp

Start with the known symptom time and work backward:

# Show all log activity in a narrow window around the incident
log show --start "2026-02-14 10:23:00" --end "2026-02-14 10:25:00" \
  --style compact | less

# Filter for errors/faults in that same window
log show --start "2026-02-14 10:23:00" --end "2026-02-14 10:25:00" \
  --predicate 'messageType == error OR messageType == fault' --style compact
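
Working out the start and end strings by hand invites off-by-one-minute mistakes. Here is a small sketch that derives the window from a single incident time; the function name and the default margins are illustrative choices, not part of the `log` tool.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # the format used with --start and --end above


def log_window_args(incident, before_minutes=2, after_minutes=1):
    """Build `log show` arguments covering a window around the incident time."""
    start = incident - timedelta(minutes=before_minutes)
    end = incident + timedelta(minutes=after_minutes)
    return ["log", "show",
            "--start", start.strftime(LOG_TIME_FORMAT),
            "--end", end.strftime(LOG_TIME_FORMAT),
            "--style", "compact"]


incident = datetime(2026, 2, 14, 10, 24, 0)
print(log_window_args(incident, before_minutes=1, after_minutes=1))
```

With one minute on each side of 10:24:00, this reproduces the 10:23:00 to 10:25:00 window used in the commands above. Widening the "before" margin is usually more useful than widening the "after" margin, since causes precede symptoms.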

Common Error Patterns

When scanning logs, watch for these high-signal patterns:

Pattern	Subsystem / Process	Likely Cause
`AMFI: code signature invalid`	`kernel` / `amfid`	Unsigned or tampered binary; check Gatekeeper and notarization
`Sandbox: deny`	`kernel` (Sandbox) / `sandboxd`	App trying to access a resource outside its sandbox entitlements
`TCC deny`	`com.apple.TCC`	Missing privacy permission (Full Disk Access, Camera, etc.)
`I/O error` or `disk0s2: I/O`	`kernel`	Failing storage device; check SMART status with `diskutil info disk0`
`apsd: Failed to connect`	`com.apple.apsd`	Push notification failure; check firewall rules for port 5223
`softwareupdated: Failed`	`com.apple.SoftwareUpdate`	Update download or install failure; check network and disk space
`mdmclient: command failed`	`com.apple.mdmclient`	MDM command processing failure; check profile validity
`launchd: Service exited with abnormal code`	`com.apple.xpc.launchd`	A daemon or agent is crashing on launch; check crash reports
`kernel: memorystatus`	`kernel`	Memory pressure triggered a jetsam kill; check RAM usage
`com.apple.xpc: Connection interrupted`	`com.apple.xpc`	XPC service crashed; look for related crash reports
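
A table like this can double as a first-pass triage filter over saved log text. The sketch below uses shortened, case-insensitive substrings derived from a few of the rows above; real message text varies between macOS releases, so treat the substrings as starting points, and note that the sample lines are synthetic.

```python
from collections import Counter

# (substring, likely cause) pairs shortened from the table; matching is case-insensitive.
KNOWN_PATTERNS = [
    ("code signature invalid", "Unsigned or tampered binary"),
    ("tcc deny", "Missing privacy permission"),
    ("i/o error", "Failing storage device"),
    ("memorystatus", "Memory pressure (jetsam kill)"),
    ("connection interrupted", "XPC service crashed"),
    ("service exited with abnormal code", "Daemon or agent crashing on launch"),
]


def classify(line):
    """Return the likely causes whose pattern appears in the log line."""
    lowered = line.lower()
    return [cause for needle, cause in KNOWN_PATTERNS if needle in lowered]


def summarize(lines):
    """Count how often each likely cause appears across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts


# Synthetic sample lines for illustration only
SAMPLE = [
    "kernel: disk0s2: I/O error",
    "kernel: memorystatus: killing_idle_process pid 812",
    "tccd: TCC deny for kTCCServiceCamera",
    "launchd: Service exited with abnormal code: 1",
    "kernel: disk0s2: I/O error",
]
for cause, count in summarize(SAMPLE).most_common():
    print(f"{count:3d}  {cause}")
```

Feeding it the lines of a saved file such as /tmp/recent_errors.txt gives a quick ranking of which known failure modes are present before you start reading entries one by one.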

Scripted Log Analysis

# Count errors by subsystem in the last hour
log show --last 1h --style ndjson --predicate 'messageType == error' | \
  python3 -c "
import sys, json
from collections import Counter
c = Counter()
for line in sys.stdin:
    try:
        entry = json.loads(line)
        # Entries logged without a subsystem carry an empty string, not a missing key
        c[entry.get('subsystem') or 'unknown'] += 1
    except ValueError: pass
for sub, count in c.most_common(20):
    print(f'{count:6d}  {sub}')
"

# Extract unique error messages for a specific subsystem
log show --last 2h --style ndjson \
  --predicate 'subsystem == "com.apple.mdmclient" AND messageType == error' | \
  python3 -c "
import sys, json
seen = set()
for line in sys.stdin:
    try:
        msg = json.loads(line).get('eventMessage', '')
        if msg not in seen:
            seen.add(msg)
            print(msg)
    except: pass
"

Crash Report Analysis

When you have crash reports, focus on these fields:

# Quick crash report summary (text-format .crash files)
grep -E -A 5 "Exception Type|Termination Reason|Crashed Thread" /tmp/crash_reports/*.crash

Key fields:

Exception Type: EXC_CRASH (abnormal exit, typically an abort() call or uncaught exception), EXC_BAD_ACCESS (memory access violation), EXC_RESOURCE (resource limit exceeded)
Termination Reason: Often contains the most human-readable explanation
Crashed Thread backtrace: The stack trace of the thread that caused the crash
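
To pull just these fields out of many reports at once, a few regular expressions are enough for the text-format .crash layout. This is a sketch that assumes each field sits on its own line as `Name: value`; the sample report text is synthetic, and JSON-format .ips files would need a different parser.

```python
import re

FIELDS = ("Exception Type", "Termination Reason", "Crashed Thread")


def crash_summary(text):
    """Extract the key header fields from a text-format crash report."""
    summary = {}
    for field in FIELDS:
        # Match "Field:   value" at the start of a line and capture the value.
        match = re.search(rf"^{re.escape(field)}:[ \t]*(.+)$", text, flags=re.MULTILINE)
        if match:
            summary[field] = match.group(1).strip()
    return summary


# Synthetic excerpt for illustration only
SAMPLE = """\
Process:               ExampleApp [1234]
Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Termination Reason:    Namespace SIGNAL, Code 11
Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
"""
for field, value in crash_summary(SAMPLE).items():
    print(f"{field}: {value}")
```

Running this over every file in /tmp/crash_reports and grouping by Exception Type quickly shows whether you are looking at one repeated crash or several unrelated ones.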

Step 5: REMEDIATE

Only after you understand the root cause should you apply a fix. Common Mac admin remediations organized by category:

Disk Space

# Identify large files
sudo du -sh /var/folders/* 2>/dev/null | sort -rh | head -10
sudo du -sh /Library/Caches/* 2>/dev/null | sort -rh | head -10
du -sh ~/Library/Caches/* 2>/dev/null | sort -rh | head -10

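# Clear the icon services cache (it is rebuilt automatically)
sudo rm -rf /Library/Caches/com.apple.iconservices.store
# Do not delete /var/folders/* on a running system: it holds per-user temporary
# and cache directories that running processes depend on. If it has grown large,
# inspect it with du and restart the Mac rather than removing it by hand.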

# Purge old software update downloads
sudo rm -rf /Library/Updates/*

Permissions and TCC

# Reset TCC database for a specific app (requires FDA)
sudo tccutil reset All com.example.problematic.app

# Verify file system integrity (modern macOS has no "repair permissions" step)
diskutil verifyVolume /

# Check SIP status
csrutil status

Network Connectivity

# Flush DNS cache
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

# Reset network preferences (nuclear option)
# Remove preference files and reboot:
sudo rm /Library/Preferences/SystemConfiguration/NetworkInterfaces.plist
sudo rm /Library/Preferences/SystemConfiguration/preferences.plist

# Test APNS connectivity (critical for MDM)
nc -z -w 5 courier.push.apple.com 5223 && echo "APNS reachable" || echo "APNS blocked"
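
The same reachability test can be scripted when you need to check several hosts or run it from a management agent. This is a minimal sketch using Python's standard `socket` module; the function name and timeout default are illustrative, and a successful TCP connection only shows the port is open, not that the push service itself is healthy.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `port_reachable("courier.push.apple.com", 5223)` mirrors the `nc` check above and returns a boolean you can log or alert on.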

Kernel Extension and System Extension Issues

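# List loaded kernel extensions (kextstat is deprecated on macOS 11+; kmutil showloaded replaces it)
kextstat | grep -v com.apple

# List installed system extensions
systemextensionsctl list

# Rebuild kext cache (macOS 10.15 and earlier; macOS 11+ manages kext collections through kmutil)
sudo kextcache -invalidate /
sudo kextcache -i /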

Application Issues

# Remove application preferences (resets to defaults)
defaults delete com.example.problematic.app

# Clear application caches
rm -rf ~/Library/Caches/com.example.problematic.app

# Re-register Launch Services (fixes "application can't be opened" errors)
/System/Library/Frameworks/CoreServices.framework/Frameworks/LaunchServices.framework/Support/lsregister -kill -r -domain local -domain system -domain user

Step 6: DOCUMENT

Every significant diagnostic effort should produce a brief record. Future you (or your teammates) will thank you.

Incident Report Template

## Incident Report

**Date**: YYYY-MM-DD
**Reported By**: [user / monitoring system]
**Affected Systems**: [hostname(s), serial number(s), scope]
**Severity**: [Critical / High / Medium / Low]

### Symptom
[What was observed. Be specific.]

### Timeline
- HH:MM — [First report / alert fired]
- HH:MM — [Investigation began]
- HH:MM — [Root cause identified]
- HH:MM — [Remediation applied]
- HH:MM — [Confirmed resolved]

### Root Cause
[Technical explanation of what went wrong and why.]

### Evidence
[Key log entries, crash report excerpts, screenshots.
Include the exact log commands used so they can be re-run.]

### Remediation
[What was done to fix the issue.]

### Prevention
[What changes should be made to prevent recurrence.
Configuration changes, monitoring additions, policy updates.]

Save incident reports in a shared location (wiki, Git repository, or ticketing system) so the entire team benefits.
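
If reports are stored as text, generating the skeleton from structured data keeps them consistent across the team. The sketch below fills the template's headings from function arguments; the function name and argument layout are assumptions, and the hostnames and times in the example are hypothetical values loosely modeled on a kernel panic incident.

```python
from datetime import date


def render_incident_report(reported_by, systems, severity, symptom, timeline,
                           root_cause, evidence, remediation, prevention, day=None):
    """Fill the incident report template; timeline is a list of (HH:MM, text) pairs."""
    day = day or date.today()
    lines = [
        "## Incident Report",
        "",
        f"**Date**: {day.isoformat()}",
        f"**Reported By**: {reported_by}",
        f"**Affected Systems**: {', '.join(systems)}",
        f"**Severity**: {severity}",
        "",
        "### Symptom",
        symptom,
        "",
        "### Timeline",
    ]
    lines += [f"- {when} - {what}" for when, what in timeline]
    for heading, body in (("Root Cause", root_cause), ("Evidence", evidence),
                          ("Remediation", remediation), ("Prevention", prevention)):
        lines += ["", f"### {heading}", body]
    return "\n".join(lines) + "\n"


# Hypothetical example values
print(render_incident_report(
    reported_by="Engineering users",
    systems=["eng-mbp-01", "eng-mbp-02", "eng-mbp-03"],
    severity="High",
    symptom="Machines restarted without warning (kernel panic).",
    timeline=[("10:23", "First report"), ("11:05", "Root cause identified")],
    root_cause="Third-party kernel extension faulted after a vendor update.",
    evidence="Panic reports with matching backtraces on all three machines.",
    remediation="Rolled back to the previous stable version.",
    prevention="Restrict the affected version and alert if it reappears.",
    day=date(2026, 2, 14),
))
```

Passing the timeline as data rather than free text makes it easy to sort entries or compute time-to-resolution later.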

Advanced Diagnostic Tools

Beyond the standard log command, these tools are available for deeper investigation:

fs_usage

Real-time file system activity tracing:

# Watch all file system operations for a specific process
sudo fs_usage -w -f filesys PID

# Watch file system activity by process name
# (do not add -e: that flag excludes the named process instead of selecting it)
sudo fs_usage -w -f filesys Safari

DTrace (where available)

DTrace provides kernel-level tracing. On modern macOS with SIP enabled, its capabilities are limited, but basic probes still work:

# Count system calls by process (requires partial SIP disable)
sudo dtrace -n 'syscall:::entry { @[execname] = count(); }'

Note: Full DTrace functionality requires disabling System Integrity Protection, which is not recommended on production machines. Use it only on test systems.

Instruments.app

Part of Xcode, Instruments provides graphical profiling for CPU, memory, disk I/O, and network activity. It is particularly useful for diagnosing performance issues that are difficult to pinpoint with command-line tools. Launch it from Xcode > Open Developer Tool > Instruments.

spindump and sample

# Capture a process sample (useful for hung applications)
sudo sample Safari 10 -file /tmp/safari_sample.txt

# Trigger a spindump (captures all processes)
sudo spindump -reveal -o /tmp/spindump.txt

taskpolicy and powermetrics

# Check thermal and power state (useful for performance throttling issues)
sudo powermetrics --samplers cpu_power -i 5000 -n 3

Case Study: Diagnosing a Recurring Kernel Panic

Symptom: Three MacBook Pros in the engineering department experienced kernel panics over two days. Users reported the machines restarted without warning.

OBSERVE: Checked /Library/Logs/DiagnosticReports/ and found .panic files on all three machines. The panic reports all showed the same backtrace originating from a third-party kernel extension.

ls -lt /Library/Logs/DiagnosticReports/*panic*
# Found: Kernel_2026-02-12-143022_MacBook-Pro.panic

ISOLATE: All three machines had the same third-party endpoint security product installed at the same version. Other machines with an older version of the product were unaffected. The vendor had pushed an update two days prior.

COLLECT:

# Extract kernel extension details from panic log
grep -A 10 "Kernel Extensions in backtrace" /Library/Logs/DiagnosticReports/Kernel_2026-02-12-*.panic

# Check loaded kexts
kextstat | grep -v com.apple

ANALYZE: The panic backtrace pointed to the vendor’s kernel extension. The extension was attempting a memory access in an invalid address space during file scanning operations. The issue correlated exactly with the update timestamp.

REMEDIATE: Contacted the vendor and confirmed a known bug in the update. Rolled back to the previous version using the vendor’s provided uninstall script and then reinstalled the previous stable version.

# Unload the problematic kext
sudo kextunload -b com.vendor.endpoint.kext

# Install the previous stable version via MDM
# (pushed the older .pkg via Jamf Pro policy)

DOCUMENT: Filed incident report, added the problematic version to the MDM’s restricted software list, and configured monitoring to alert if the bad version reappeared.

Key Takeaways

Follow the process in order. Observe, Isolate, Collect, Analyze, Remediate, Document. Skipping steps (especially jumping straight to remediation) wastes more time than it saves.
Capture state before changing state. Every command you run, every service you restart, every preference you delete destroys potential evidence. Snapshot first.
Build your command library. Keep a shared document of your team’s most-used diagnostic commands. When an incident happens at 2 AM, you do not want to be constructing predicates from memory.
Time-bound your investigation. If you have not identified the root cause within a reasonable window, escalate or collect a sysdiagnose and continue analysis offline. Do not leave the user waiting indefinitely.
Always document. The incident report is not bureaucracy. It is the artifact that turns a one-off fix into institutional knowledge. The next time the same error pattern appears, your team will find it in the search results and save hours.