12.03 Scaling Considerations

Why this matters

A SimpleRisk install that worked fine for two years can suddenly feel slow as the program grows. Users add more risks; teams add more frameworks; integrations push more data; reports query larger result sets. At some point, the install needs more resources or a different architecture. Knowing the typical scaling path means you can anticipate and plan rather than respond to outages.

The honest scope to know up front: most SimpleRisk installs never need to scale beyond a single VM. SimpleRisk is a typical mid-tier PHP web application; a modest VM (4 vCPU, 16 GB RAM, SSD storage) handles hundreds of users and tens of thousands of records comfortably. The scaling discussion below is for the minority of installs that grow beyond that: large enterprises with thousands of users, programs running multiple business units in one install, and internet-facing installs serving partner ecosystems.

How SimpleRisk's architecture scales

SimpleRisk is a mostly-stateless PHP application:

Stateless web tier: PHP requests don't carry server-side state between requests beyond what's in the session.
Database-backed sessions: when configured for database session storage (the default), sessions are visible to any web tier instance.
Database-backed everything else: the database holds risks, controls, audits, configuration, audit trail, debug log.
Master encryption key on the file system (when the Encryption Extra is active): the per-install master key file must be present on every web tier instance.
Cron jobs run on a designated host (typically one of the web tier hosts or a dedicated cron host).

This shape scales reasonably. Add web tier instances; they all read the same database; the database becomes the bottleneck eventually.

The typical scaling path

Stage 1: Single VM (most installs)

Everything on one server: web server, PHP, database, cron jobs. The simplest deployment.

Profile: up to ~500 active users, ~50,000 risks, normal compliance and audit volume.

When to upgrade: when CPU, memory, or disk on the single VM remains the constraint even after tuning (see Performance Tuning).

Stage 2: Vertically scaled single VM

Same architecture, bigger VM. Often the right answer for installs outgrowing Stage 1.

Profile: up to ~2,000 active users, ~200,000 risks.

When to upgrade: when a single VM's largest available size is the constraint, or when separation-of-concerns (database isolation, security boundaries) becomes operationally important.

Stage 3: Separated database tier

Web tier on one VM (or container); database on a separate VM (or managed service like AWS RDS, Azure Database for MySQL, GCP Cloud SQL).

Benefits:

Independent scaling of web and database.
Database-specific tuning (managed services handle this).
Database high-availability (replication, automatic failover).
Database backups become a managed service feature.

Profile: up to ~5,000 active users; multi-team programs with sustained activity.

When to upgrade: when web tier needs scaling beyond a single VM (concurrent user count, request volume).

Stage 4: Multiple web tier instances

Multiple VMs (or containers) running SimpleRisk's web tier; load balancer in front; shared database tier; shared file storage for uploads.

Requirements:

Database-backed sessions (already the default).
Shared file storage for uploaded documents: NFS, an S3 bucket mounted via FUSE, or object storage with the application configured accordingly.
Master encryption key file on every web tier instance (if Encryption Extra is active).
Cron jobs running on exactly one host (or properly coordinated to avoid duplicate execution).
Load balancer health checks on /healthcheck.php.
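One of the requirements above, an identical master key file on every web tier instance, can be verified by comparing fingerprints instead of copying the key around for inspection. The following is a minimal sketch under stated assumptions: the key file path is passed in by the caller (this section does not name it), the host names are hypothetical, and how you collect each host's fingerprint (SSH, configuration management, and so on) is left out.

```python
import hashlib
from collections import Counter
from pathlib import Path


def key_fingerprint(key_path: str) -> str:
    """Return the SHA-256 hex digest of the key file's raw bytes."""
    return hashlib.sha256(Path(key_path).read_bytes()).hexdigest()


def mismatched_hosts(fingerprints: dict) -> list:
    """Given {host: fingerprint}, list hosts that differ from the majority value.

    The most common fingerprint is treated as the expected one; that is an
    assumption of this sketch, not a SimpleRisk rule.
    """
    if not fingerprints:
        return []
    expected, _count = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(host for host, fp in fingerprints.items() if fp != expected)
```

Run key_fingerprint on each instance, gather the results into a dictionary keyed by host, and pass it to mismatched_hosts. An empty list suggests every instance holds the same key file; any host that appears in the list is a candidate for the inconsistent-decryption behavior described under Common pitfalls.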

Profile: ~5,000+ active users, high request volume, public-internet exposure.

When to upgrade: when database tier becomes the bottleneck.

Stage 5: Database read replicas

Add read replicas for query scaling. The application reads from replicas (where consistency requirements allow) and writes to the primary.

Note: SimpleRisk's standard configuration doesn't natively split reads from writes. Implementing read-replica routing requires either application-side changes (custom database connection logic) or proxy-based solutions (ProxySQL, MaxScale).

Profile: very high read volume; reporting workloads competing with operational queries.

Stage 6: Sharding (rarely needed)

Splitting the database across multiple shards (e.g., per-business-unit shards) is theoretically possible but not natively supported. Programs at this scale usually run multiple SimpleRisk installs (one per major segment) rather than sharding a single install.
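The stage profiles above can be turned into a rough lookup for planning conversations. This is a sketch only: the thresholds are the approximate figures quoted in this section, the names StageProfile and suggest_stage are illustrative, and Stages 5 and 6 are left out because the text gives them no user or risk counts. It is a starting point, not a replacement for measuring the actual bottleneck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageProfile:
    name: str
    max_users: int
    max_risks: Optional[int]  # None where this section gives no risk figure


# Approximate ceilings taken from the stage profiles in this section.
PROFILES = [
    StageProfile("Stage 1: single VM", 500, 50_000),
    StageProfile("Stage 2: vertically scaled single VM", 2_000, 200_000),
    StageProfile("Stage 3: separated database tier", 5_000, None),
]
BEYOND = "Stage 4: multiple web tier instances"


def suggest_stage(active_users: int, risks: int) -> str:
    """Return the first stage whose profile covers both counts."""
    for profile in PROFILES:
        users_fit = active_users <= profile.max_users
        risks_fit = profile.max_risks is None or risks <= profile.max_risks
        if users_fit and risks_fit:
            return profile.name
    return BEYOND
```

For example, suggest_stage(300, 150_000) returns the Stage 2 name: the user count fits Stage 1 but the risk count does not.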

Specific scaling considerations

Database growth

Tables that grow continuously:

audit_log — every change appends a row. Tens of millions of rows in long-running active installs.
debug_log — every operational event (subject to log level). Can grow rapidly without rotation.
mgmt_reviews — reviews accumulate over time.

Mitigation:

Define and enforce retention for audit_log and debug_log.
Consider table partitioning by date for efficient bulk-archive.
Monitor table size; alert before it constrains backups or queries.
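A retention job along the lines of the mitigation above might delete old rows in small batches so that no single statement holds locks for long. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so it runs anywhere; a real install would connect to MySQL, where a single-table DELETE accepts LIMIT directly. The timestamp column name, the batch size, and the function names are assumptions for illustration, and any archiving your retention policy requires should happen before rows are deleted.

```python
import sqlite3
from datetime import datetime, timedelta


def cutoff_for(retention_days, now=None):
    """Return the oldest timestamp to keep, formatted as 'YYYY-MM-DD HH:MM:SS'."""
    now = now or datetime.now()
    return (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")


def purge_older_than(conn, table, ts_column, cutoff, batch_size=1000):
    """Delete rows older than cutoff in batches; return the total deleted.

    table and ts_column are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE {ts_column} < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch to keep each transaction short
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
```

A call such as purge_older_than(conn, "debug_log", "timestamp", cutoff_for(90)) would keep roughly the last 90 days, assuming the table stores timestamps in the same text format.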

Encryption performance

The Encryption Extra adds CPU cost on every encrypted-field read or write. At scale:

Vertical scale: more CPU. Modern CPUs with AES-NI accelerate AES-CBC; a CPU upgrade can materially help.
Read caching: if applicable, cache decrypted values at the application layer to avoid repeat decryption.
Selective encryption: ensure only the fields that need encryption are encrypted. Adding encryption to fields that don't need it adds cost without security benefit.

Cron job scaling

The cron jobs are single-threaded and sequential by default. At scale:

Notification cron: a backlog of pending emails can grow if the cron interval is too long. Run more frequently.
Workflow / AI queue worker: high-volume workflows or AI calls can backlog. Consider scaling the worker.
Long-running cron jobs: framework installation, AI processing — these can run for minutes. Ensure the cron host has sufficient resources.

For multi-server deployments: pick one host as the "cron host" and disable cron on the others to avoid duplicate execution. Or use a coordination mechanism (file locks, advisory locks in MySQL) so only one host runs each scheduled task.
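The file-lock option mentioned above can be sketched with an advisory lock taken in non-blocking mode: whichever process gets the lock runs the task, and any other process skips it. This is a minimal illustration that assumes a Unix-like host; the function name and lock path are hypothetical. A lock file only coordinates processes that see the same file, so across separate hosts a MySQL advisory lock (GET_LOCK) or a single designated cron host is usually the safer choice.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_run_lock(lock_path):
    """Yield True if this process obtained the lock, False if it is already held."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor also releases the lock if it was held.
        os.close(fd)
```

A wrapper script would use it as `with single_run_lock("/tmp/simplerisk-notify.lock") as held:` and invoke the scheduled task only when held is True.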

Session table

Database-backed sessions write a row per session and update on every request. At very high concurrency:

The sessions table can become a write hotspot.
Consider a dedicated session storage backend (Redis, Memcached) if the corresponding PHP session handler extension is available for your install. (SimpleRisk's standard config uses database sessions; alternatives may require code changes.)

File uploads

User-uploaded documents live in simplerisk/uploads/ (or a configurable path). For multi-server deployments:

The path must be shared (NFS, network mount, or object storage).
Local file uploads on a single server don't scale to multi-server.

Backups at scale

Larger databases mean longer backups and longer restores. At scale:

Use physical or snapshot-based backup tooling (Percona XtraBackup, MySQL Enterprise Backup, AWS RDS snapshots) rather than mysqldump for large databases.
Schedule backups during low-traffic windows.
Test restores periodically with realistic data volumes; restore time matters during incidents.

Network bandwidth

For installs with large file uploads, frequent dashboard refreshes, or high API volume, network bandwidth between users and SimpleRisk, and between SimpleRisk and its database, can be a constraint. Standard cloud networking handles this; on-prem may need attention.

Anti-patterns at scale

A few patterns to avoid:

Running with default tuning at scale. SimpleRisk's defaults are for the typical install. At 5,000+ users, every config setting deserves attention.
Ignoring audit_log growth. A 100-million-row audit log slows every audit-trail render and bloats every backup.
Running cron on multiple servers without coordination. Duplicate job execution produces duplicate notifications, conflicting workflow runs, double-charged AI calls.
Treating multi-server as a goal in itself. A well-tuned single VM beats a poorly-tuned multi-server cluster. Scale only when needed.
Not load-testing before scaling. Estimating capacity is harder than measuring it. Generate load against a non-production instance to find the actual ceiling.
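For the load-testing point above, even a small harness gives better numbers than an estimate. The sketch below fires a fixed number of calls at a given concurrency and reports latency percentiles. It is deliberately generic: request_fn is any zero-argument callable you supply (for example, one that fetches a page from a non-production instance), and the function name and default values are assumptions rather than anything SimpleRisk ships.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def measure_latency(request_fn, total_requests=200, concurrency=20):
    """Call request_fn total_requests times across a thread pool.

    Returns a dict of latencies in seconds. Needs at least two requests
    so that percentiles can be computed.
    """
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "count": len(latencies),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": latencies[-1],
    }
```

Repeating the run at increasing concurrency values and watching where p95 starts to climb gives a practical view of the ceiling. Dedicated load-testing tools offer far more, but the principle is the same.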

When not to scale

Some "performance issues" aren't capacity issues:

Slow specific operation: usually a query optimization issue (see Performance Tuning), not a capacity issue.
Slow at certain times only: investigate the contention (cron job impact, batch report jobs, backup window).
Slow under specific conditions: examine the conditions; the fix may be configuration or code, not capacity.

Throwing capacity at a non-capacity problem masks the actual issue and costs money.

Common pitfalls

A handful of patterns recur with scaling.

Scaling before tuning. Start with Performance Tuning. Tuning is cheaper than scaling.
Adding capacity without measuring. If you don't know what's slow, you don't know if more capacity helps. Measure first.
Scaling the web tier when the database is the bottleneck. More web tier doesn't help if every request is database-blocked. Scale the database tier.
Treating SimpleRisk as horizontally scalable like a stateless API. It mostly is, but file uploads, cron jobs, and the encryption key file all need attention.
Forgetting the encryption key file in multi-server deployments. Some servers can decrypt; others can't; users see inconsistent behavior.
Running cron jobs on every server in a multi-server deployment. Duplicate execution. Pick one cron host.
Not coordinating with the program team on the change. Scaling involves downtime windows, configuration changes, possibly URL changes (if you front with a CDN). Communicate.
Scaling without backup verification. A scaled install with non-functional backups is a bigger liability than the unscaled version.
Treating SimpleRisk as a substitute for the broader GRC tool stack at very large scale. At enterprise scale of 50,000+ users, multiple GRC tools may be appropriate; a single SimpleRisk install for everything is a single point of failure.
Underestimating database growth from the audit log. Plan for it.
Not testing the scaled architecture before going live. Capacity tests; failover tests; backup tests.