12.01 Monitoring SimpleRisk
Monitor SimpleRisk via the healthcheck endpoint, web server logs, the debug log, the database (size, performance, replication if applicable), and the cron job runs. Forward logs to a SIEM, configure uptime monitoring on the healthcheck, alert on database growth and cron-job failures. Most issues surface in the logs before users report them.
Why this matters
A monitored SimpleRisk install catches problems before users do. A SimpleRisk install nobody monitors fails silently in the night and produces "we're paying for this and it's broken" surprises in the morning. The infrastructure to monitor SimpleRisk is mostly off-the-shelf (uptime monitors, log aggregators, database monitoring tools), but it has to be configured. This article walks through what to monitor and how.
The honest scope to know up front: SimpleRisk doesn't have a comprehensive built-in monitoring dashboard. There's a healthcheck endpoint, the audit trail, and the debug log, but no "system health" page that summarizes everything. Operators stitch together the monitoring picture from external tools that consume SimpleRisk's signals.
Before you start
Have these in hand:
- Operational ownership of the SimpleRisk install — who's on call, who responds to alerts, what's the escalation path.
- A monitoring stack — at minimum: an uptime monitor (Pingdom, UptimeRobot, Datadog Synthetics, or self-hosted Uptime Kuma); ideally also a log aggregator (Splunk, ELK, Datadog Logs, Grafana Loki) and an APM/metrics platform (Datadog, New Relic, Prometheus + Grafana).
- Access to relevant infrastructure — the SimpleRisk server (OS-level metrics), the database server (database metrics), the web server (access and error logs).
What to monitor
1. Application reachability (uptime)
The healthcheck endpoint at /healthcheck.php returns a quick status response. Configure your uptime monitor to:
- Hit https://your-simplerisk.example.com/healthcheck.php every 1-5 minutes.
- Expect a 200 response with the expected body.
- Alert on any non-200 response or timeout.
If the healthcheck returns errors but the application looks fine to users, that's a configuration drift worth investigating; if the healthcheck succeeds but users report errors, the healthcheck might not cover the failing path.
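The probe above can be sketched as a small script. This is a minimal sketch, not a definitive implementation: the URL and the expected response body (`"OK"` here) are assumptions — match them to what your install's healthcheck actually returns.

```python
import urllib.request
import urllib.error

# Hypothetical values -- substitute your own install's URL and timeout.
HEALTHCHECK_URL = "https://your-simplerisk.example.com/healthcheck.php"
TIMEOUT_SECONDS = 10

def evaluate(status, body, expected_fragment="OK"):
    """Decide whether a healthcheck response warrants an alert.
    `expected_fragment` is an assumption: check what body your install
    actually returns before relying on it."""
    if status != 200:
        return f"ALERT: healthcheck returned HTTP {status}"
    if expected_fragment not in body:
        return "ALERT: healthcheck returned 200 but body did not match"
    return "OK"

def probe():
    """Hit the endpoint and classify the outcome; timeouts and
    connection errors alert just like non-200 responses."""
    try:
        with urllib.request.urlopen(HEALTHCHECK_URL, timeout=TIMEOUT_SECONDS) as resp:
            return evaluate(resp.status, resp.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, TimeoutError) as exc:
        return f"ALERT: healthcheck unreachable ({exc})"
```

In practice a hosted uptime monitor does this for you; a script like this is the fallback for air-gapped installs, run from cron on a separate host.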
2. Web server response time
Beyond uptime, response time matters. A page that loads in 30 seconds is "up" but unusable. Configure the uptime monitor (or a synthetic-monitoring tool) to:
- Track response time on the healthcheck and on a representative page (e.g., the login page).
- Alert on response time exceeding a threshold (e.g., 5 seconds for the healthcheck, 10 seconds for the login page).
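The alert condition can be expressed as a rolling-percentile check so a single slow sample doesn't page anyone. A sketch, with the 5-second threshold from the text as an example value:

```python
from collections import deque
from statistics import quantiles

class LatencyMonitor:
    """Rolling response-time check: alert when the p95 of the last
    `window` samples exceeds a threshold. Thresholds are install-specific;
    the values in the text (5 s healthcheck, 10 s login) are examples."""

    def __init__(self, threshold_seconds, window=20):
        self.threshold = threshold_seconds
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def p95(self):
        if len(self.samples) < 2:
            return 0.0
        return quantiles(self.samples, n=20)[-1]  # 95th percentile

    def breached(self):
        return self.p95() > self.threshold
```

Percentiles over a window are a deliberate choice here: alerting on every individual slow request produces noise, while a sustained p95 breach is the "up but unusable" condition the text describes.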
3. Web server logs
Forward Apache or nginx access logs and error logs to your aggregator:
- Access logs — useful for traffic patterns, slow requests, error rates by endpoint.
- Error logs — PHP errors, web server errors. Alert on error rate spikes.
Common log patterns to alert on:
- High 5xx response rate (server errors).
- Sudden drop in successful response volume (the application is silently failing).
- Repeated requests for non-existent paths from a single source (scanning / probing).
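The 5xx-rate pattern is usually built in the aggregator's query language, but the condition itself is simple. A sketch against combined-log-format lines (adjust the regex to your web server's actual log format):

```python
import re

# Extracts the HTTP status from a combined-log-format line:
# ... "GET / HTTP/1.1" 200 512 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def error_rate(lines):
    """Return (fraction of 5xx responses, total requests) for a batch of
    access-log lines. A real pipeline computes this in the aggregator;
    this shows the alert condition itself."""
    total = errors = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # unparseable line; skip rather than miscount
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return (errors / total if total else 0.0, total)
```

Note the second return value: a collapse in `total` is itself the "sudden drop in successful response volume" signal and deserves its own alert.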
4. The debug log
See The Debug Log. Monitor:
- `error` and `critical` entries — alert immediately. These indicate real problems.
- `warning` entries — review periodically; spikes may indicate emerging issues.
- `notice` entries for failed cron runs — alert on repeated failures of any cron job.
For installs writing to file or syslog, use the standard log-aggregation pipeline. For database destinations, periodic SQL queries can drive alerting.
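For the database destination, the periodic query can look like the sketch below. The column names (`timestamp`, `log_level`) are assumptions — verify them against your install's `debug_log` schema before scheduling this.

```python
# Column names here are ASSUMPTIONS -- check your debug_log schema.
RECENT_ERRORS_SQL = """
SELECT log_level, COUNT(*) AS n
FROM debug_log
WHERE `timestamp` > NOW() - INTERVAL 15 MINUTE
  AND log_level IN ('error', 'critical')
GROUP BY log_level
"""

def should_page(rows):
    """rows: [(level, count), ...] as returned by the query above.
    Per the text, any error or critical entry in the window pages
    immediately."""
    return any(count > 0 for _level, count in rows)
```

Run the query on a schedule (cron or your metrics agent's custom-query feature) and feed `should_page` into the paging integration.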
5. Cron job execution
The cron jobs are SimpleRisk's background workers. Their failure produces visible application degradation (notifications stop sending, AI jobs queue indefinitely, workflows don't fire). Monitor:
- Last successful run timestamp for each cron job. Query `cron_history` (the table that records cron executions). Alert if any job hasn't run in N expected intervals.
- Cron worker queue depth — for the queue worker that processes background jobs, monitor pending job count. Sudden increases indicate the worker is falling behind.
- Specific cron output files — some installs write per-job logs. Tail and parse for failure patterns.
For installs running cron via system cron (not via the application's internal scheduler), the system cron itself may produce errors visible via journalctl -u cron or mailx to the operator account.
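The "hasn't run in N expected intervals" check is a pure comparison once you have the last-run timestamps from `cron_history`. A sketch — the SQL that feeds it depends on your schema's actual column names, which are assumptions here:

```python
from datetime import datetime, timedelta

def stale_jobs(last_runs, expected_intervals, now=None, grace=2):
    """Return the jobs that have missed more than `grace` expected runs.

    last_runs: {job_name: datetime of last successful run}, e.g. from
      SELECT process, MAX(run_time) FROM cron_history GROUP BY process
      (column names `process`/`run_time` are assumptions -- check your schema).
    expected_intervals: {job_name: timedelta between expected runs}.
    """
    now = now or datetime.now()
    return [
        job for job, last in last_runs.items()
        if now - last > grace * expected_intervals[job]
    ]
```

The `grace` factor of 2 avoids paging on a single late run (a busy box or a long-running job) while still catching a stopped worker quickly.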
6. Database
Database health is application health. Monitor:
- Connection pool utilization — if SimpleRisk's connection pool is saturated, requests block.
- Query performance — slow queries log to MySQL's slow query log; ingest into the aggregator.
- Replication lag (if you have replication) — replicated reads from a lagged replica produce stale data.
- Disk space — running out of database disk is catastrophic. Alert at 80%, page at 90%.
- Table size growth — `audit_log` and `debug_log` grow continuously; monitor both.
Database vendors (MySQL Enterprise Monitor, Percona Monitoring and Management) and APM tools have built-in database monitoring; configure for your install.
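Table-size growth can be pulled from `information_schema` on any MySQL/MariaDB install, and the 80%/90% disk thresholds from the text reduce to a small function. The schema name `'simplerisk'` below is an assumption:

```python
# Standard MySQL/MariaDB query for per-table size; the schema name
# 'simplerisk' is an assumption -- use your actual database name.
TABLE_SIZE_SQL = """
SELECT table_name,
       (data_length + index_length) / 1024 / 1024 AS size_mb
FROM information_schema.tables
WHERE table_schema = 'simplerisk'
ORDER BY size_mb DESC
"""

def disk_alert_level(used_fraction):
    """Map disk utilization to a response, per the thresholds in the
    text: alert (ticket) at 80%, page at 90%."""
    if used_fraction >= 0.90:
        return "page"
    if used_fraction >= 0.80:
        return "alert"
    return "ok"
```

Record the `size_mb` values over time rather than alerting on a snapshot — the growth *rate* of `audit_log` and `debug_log` is the useful signal.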
7. Disk space (application server)
The application server has logs, file uploads, temporary files. Monitor:
- `/var/log/` and `simplerisk/logs/` — log files grow without rotation.
- `simplerisk/uploads/` (if file uploads are stored locally) — user-uploaded documents accumulate.
- `/tmp/` — temporary files, including activation backups. These should be cleaned up but sometimes aren't.
- System root and database volumes — generic disk-fill monitoring.
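Any infrastructure agent covers this, but a self-contained check is a few lines of stdlib. The `simplerisk/` paths depend on where your install lives, so the watch list here is an example:

```python
import shutil

# Example watch list -- adjust the SimpleRisk paths to your install's
# actual document root.
WATCHED_PATHS = ["/var/log", "/tmp", "/"]

def usage_report(paths):
    """Return {path: fraction of the backing filesystem used}, or None
    for paths that don't exist on this host."""
    report = {}
    for path in paths:
        try:
            total, used, _free = shutil.disk_usage(path)
            report[path] = used / total
        except FileNotFoundError:
            report[path] = None  # path absent on this host
    return report
```

Feed the fractions into the same 80%/90% thresholds used for the database volume.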
8. Memory and CPU
Standard server metrics:
- CPU utilization — sustained high CPU indicates load or runaway process.
- Memory utilization — Apache/nginx + PHP-FPM memory consumption; OOM-killer risk.
- Swap usage — sustained swap means insufficient memory.
These are operating system-level metrics; standard infrastructure monitoring covers them.
9. Application-specific metrics
Beyond infrastructure, track application-level metrics for the program:
- Active user count — how many users have authenticated in the last hour / day.
- Risk submission rate — risks created per day; sudden drops or spikes are worth investigating.
- Job queue depth — pending workflows, AI jobs, notification queue.
- Authentication failure rate — sustained high authentication failures may indicate brute-force.
These metrics typically come from SQL queries against the database run on a schedule and pushed to your metrics platform.
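A sketch of that pattern: a map of metric name to SQL, plus an anomaly check for the "sudden drops or spikes" condition. The table and column names below (`user`, `last_login`, `risks`, `submission_date`) are assumptions — verify them against your schema before scheduling these queries.

```python
# Table/column names are ASSUMPTIONS -- check your install's schema.
METRIC_QUERIES = {
    "active_users_24h":
        "SELECT COUNT(*) FROM `user` WHERE last_login > NOW() - INTERVAL 1 DAY",
    "risks_created_today":
        "SELECT COUNT(*) FROM risks WHERE submission_date > CURDATE()",
}

def is_anomalous(today, trailing, factor=3.0):
    """Flag a metric when today's value departs from the trailing average
    by more than `factor` in either direction (spike or sudden drop)."""
    if not trailing:
        return False  # no baseline yet
    avg = sum(trailing) / len(trailing)
    if avg == 0:
        return today > 0
    return today > factor * avg or today < avg / factor
```

The symmetric check matters: a drop in risk submissions (an ingest pipeline silently broken) is as actionable as a spike (a bulk import or misbehaving integration).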
10. Backup verification
Backups that aren't tested aren't backups. Monitor:
- Backup completion — alert on missed backup runs.
- Backup file size — sudden change indicates corruption or scope change.
- Periodic restore tests — schedule actual restore drills (monthly or quarterly); alert on failures.
See Database Backup and Restore.
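The completion and size checks can be automated against the backup file itself. A sketch, assuming a file-based backup and example thresholds (a 26-hour age limit for a daily backup, 50% size-change tolerance):

```python
import os
import time

def backup_status(path, max_age_hours=26, sizes=None, tolerance=0.5):
    """Check a backup file: present, recent, and not wildly different in
    size from the previous run. `sizes` is a list of recent backup sizes
    in bytes, however you record them; all thresholds are assumptions to
    tune for your backup schedule."""
    if not os.path.exists(path):
        return "ALERT: backup file missing"
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        return f"ALERT: backup is {age_hours:.0f}h old"
    if sizes and len(sizes) >= 2:
        prev, latest = sizes[-2], sizes[-1]
        if prev and abs(latest - prev) / prev > tolerance:
            return f"ALERT: backup size changed by more than {tolerance:.0%}"
    return "OK"
```

This only verifies that a backup *exists*; the restore drills the text calls for are the only check that a backup actually works.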
Alert thresholds and runbooks
For each alert type, define:
- Threshold — when does the alert fire?
- Severity — page (immediate response) vs ticket (next business day).
- Runbook — a documented procedure for diagnosing and resolving.
Without runbooks, alerts produce confused responders. Even a one-paragraph runbook ("when this fires, check X then Y, escalate to Z if not resolved in 30 minutes") materially improves incident response.
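The three fields above can live in a small structured registry so alert definitions and runbooks stay together and under version control. A sketch with hypothetical alert entries:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """One alert definition: the threshold, severity, and runbook the
    text asks for. Severity 'page' means immediate response; 'ticket'
    means next business day."""
    name: str
    threshold: str
    severity: str  # "page" or "ticket"
    runbook: str   # even one paragraph materially helps

# Example entries -- thresholds and runbook text are illustrative.
ALERTS = [
    Alert("healthcheck_down",
          "non-200 or timeout on 2 consecutive probes",
          "page",
          "Check web server and PHP-FPM status, then database "
          "connectivity; escalate if not resolved in 30 minutes."),
    Alert("db_disk_80pct",
          "database volume >= 80% used",
          "ticket",
          "Review audit_log/debug_log growth; plan storage expansion."),
]
```

Keeping the registry in the same repository as your monitoring configuration means the runbook is one click away when the page fires.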
Common operational signals
A handful of signal patterns recur:
- Healthcheck failing: SimpleRisk is down or degraded. Check web server, PHP-FPM, database connectivity.
- Cron jobs not running: workflows, notifications, AI all stop. Check cron daemon, system clock, application's cron worker process.
- Database disk fill: alert; truncate logs if appropriate; expand storage.
- Login failures spiking: brute-force attempt, credential leak, or systemic issue (LDAP outage, SSO problem).
- Slow page loads: database performance, web server tuning, opcache miss rate.
- Notifications not sending: SMTP connectivity, notification cron, queue depth.
Common pitfalls
A handful of patterns recur with monitoring.
- Configuring uptime monitoring without monitoring response time. Slow but technically up is broken from the user's perspective.
- Only monitoring what you know to monitor. New failure modes appear after upgrades or feature additions. Periodically review what you're monitoring and what you're missing.
- Alert fatigue from too-low thresholds. Alerts that fire constantly get ignored. Tune thresholds to actual operational signals.
- No runbooks for alerts. A page at 3 AM with no runbook is a confused responder. Write runbooks for every page-level alert.
- Not monitoring the monitoring system. A dead Datadog agent doesn't alert that it's dead. Cross-monitor.
- Treating the audit log as monitoring. It captures changes, not state. Use the debug log plus infrastructure metrics for monitoring.
- Not testing alerts. An alert that has never been verified to fire may stay silent exactly when production needs it. Test in non-production.
- Storing all log data forever. Storage cost compounds. Define retention; rotate old data to cheaper storage or delete it.
- Not monitoring backups. Backups that fail silently lose you data when you need it.
- Forgetting to monitor the database. Database issues underlie most application issues. Monitor it explicitly.
- Monitoring only via dashboards. Dashboards require active viewing; alerts push to responders. Both have a place.
Related
- Performance Tuning
- Scaling Considerations
- Troubleshooting Common Issues
- The Debug Log
- The Audit Trail
- The Cron Jobs
- Database Backup and Restore
- Log Rotation and Disk Management
- Securing the Web Server
Reference
- Healthcheck endpoint: `/healthcheck.php` — returns a quick status response.
- Cron history table: `cron_history` — record of cron job executions; query for last-run timestamps.
- Debug log: See The Debug Log. Destination is database, file, or syslog.
- Web server logs: Standard Apache or nginx access/error logs at the OS level.
- Implementing files: `simplerisk/healthcheck.php` (the healthcheck endpoint); `simplerisk/cron/*.php` (the cron jobs).
- External dependencies: An uptime monitor (Pingdom, UptimeRobot, etc.); a log aggregator (Splunk, ELK, Datadog Logs); a metrics platform (Datadog, New Relic, Prometheus + Grafana); database monitoring tools (MySQL Enterprise Monitor, Percona PMM); alerting / paging integration (PagerDuty, Opsgenie).