
What Is Cron Job Monitoring? (And Why You Need It)

April 27, 2026·5 min read

What is a cron job?

A cron job is a scheduled task that runs automatically at a defined interval — every minute, every hour, every night at 2 AM. The name comes from the Unix cron daemon, which reads a crontab file and fires off commands on schedule.
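To make the schedule format concrete, here is a minimal sketch that splits a crontab entry into its five time fields and the command. The names `CrontabEntry` and `split_crontab_entry` are hypothetical, and the sketch assumes the plain five-field form (no `@daily`-style shortcuts or environment lines).

```python
from typing import NamedTuple


class CrontabEntry(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    command: str


def split_crontab_entry(line: str) -> CrontabEntry:
    """Split a five-field crontab line into schedule fields and command."""
    # At most five splits: five schedule fields, then the rest is the command.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError(f"expected 5 schedule fields and a command: {line!r}")
    return CrontabEntry(*parts)


# "Every night at 2 AM": minute 0, hour 2, any day, any month, any weekday.
entry = split_crontab_entry("0 2 * * * /home/user/backup.sh")
print(entry.hour, entry.minute, entry.command)  # 2 0 /home/user/backup.sh
```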

Most production systems rely on cron jobs for tasks that can't happen interactively: database backups, invoice generation, report emails, data syncs, cache warming, log rotation. If those tasks stop running, something breaks — often quietly, often for a long time before anyone notices.

The silent failure problem

Cron has no built-in way to tell you whether a job succeeded. There's no Slack notification and no dashboard update. Cron dutifully starts the job at each interval and records the launch in syslog; any output is mailed to the crontab owner if a mail transfer agent is configured, and discarded otherwise. Unless someone is reading that mailbox or those logs, you'll never see a failure.

Common ways cron jobs fail silently:

An exception is thrown but not caught: the script exits non-zero, and nobody is watching for the exit code
A dependency (database, API, filesystem) is temporarily unavailable at runtime
A deploy changes environment variables or file paths, breaking a script that worked fine before
The crontab entry is accidentally overwritten during a server provisioning run
The server runs out of memory or disk, killing the process mid-run
A package update changes behavior in a way that causes silent data corruption

In most of these cases, cron keeps scheduling and the job keeps "running." In the rest, such as an overwritten crontab, the job silently stops being scheduled at all. Either way, nothing tells you it's broken.

What is cron job monitoring?

Cron job monitoring is the practice of tracking whether scheduled jobs actually run successfully — and alerting you immediately when they don't.

There are two broad approaches:

Log scraping — a monitoring system reads your cron logs and looks for error patterns. Fragile, delayed, and requires custom parsing per job.
Heartbeat monitoring — your job actively signals completion by hitting an HTTP endpoint. Simple, reliable, and language-agnostic.

Heartbeat monitoring is a widely used approach for production systems, and it's what Cronaman implements.

How heartbeat monitoring works

The concept is a dead man's switch:

You create a monitor in Cronaman and configure the expected interval (e.g., every 24 hours)
Your cron job sends a simple HTTP GET request to a unique ping URL at the end of each successful run
Cronaman resets its clock each time a ping arrives
If the next expected ping doesn't arrive within the interval (plus a grace period), Cronaman transitions the monitor to "late" and then "down"
You receive an alert — email, Slack, or webhook — within minutes of the missed run

The job signals its own health. No log parsing, no agents, no polling. The system only knows something is wrong when the expected signal stops arriving.
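The steps above can be sketched as a small in-memory dead man's switch. This is only an illustration of the idea, not Cronaman's implementation; the class name `DeadMansSwitch` and its methods are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class DeadMansSwitch:
    """Tracks the last ping and reports whether the next one is overdue."""

    def __init__(self, interval: timedelta, grace: timedelta) -> None:
        self.interval = interval
        self.grace = grace
        self.last_ping: Optional[datetime] = None

    def ping(self, now: datetime) -> None:
        # Each ping resets the clock.
        self.last_ping = now

    def deadline(self) -> Optional[datetime]:
        # No deadline exists until the first ping has arrived.
        if self.last_ping is None:
            return None
        return self.last_ping + self.interval + self.grace

    def is_overdue(self, now: datetime) -> bool:
        deadline = self.deadline()
        return deadline is not None and now > deadline


switch = DeadMansSwitch(interval=timedelta(hours=24), grace=timedelta(minutes=15))
switch.ping(datetime(2026, 4, 27, 2, 0))
print(switch.is_overdue(datetime(2026, 4, 28, 2, 10)))  # False: inside the grace period
print(switch.is_overdue(datetime(2026, 4, 28, 2, 20)))  # True: the deadline has passed
```

Note that the switch never inspects the job itself; silence past the deadline is the only failure signal it needs.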

What a ping looks like

At its simplest, a heartbeat ping is a single HTTP call appended to your script:

crontab

# Daily backup — ping Cronaman on success
0 2 * * * /home/user/backup.sh && curl -fsS --max-time 10 https://cronaman.dev/ping/my-backup

The && means curl only runs if the backup exits with code 0. A failed backup skips the ping; Cronaman alerts you when the deadline passes.
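If you launch jobs from Python rather than a shell one-liner, the same "ping only on exit code 0" rule might look like the sketch below. `run_and_ping` and `http_ping` are hypothetical helper names; the `ping` parameter exists only so the demo can run without touching the network.

```python
import subprocess
import sys
import urllib.request
from typing import Callable, Sequence


def http_ping(url: str) -> None:
    """Send the heartbeat as a plain HTTP GET."""
    urllib.request.urlopen(url, timeout=10)


def run_and_ping(command: Sequence[str], ping_url: str,
                 ping: Callable[[str], None] = http_ping) -> int:
    """Run command; call ping(ping_url) only if it exits with code 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        ping(ping_url)
    return result.returncode


# Demo with a stand-in ping (list.append) so nothing is actually sent.
sent = []
url = "https://cronaman.dev/ping/my-backup"
ok = run_and_ping([sys.executable, "-c", "pass"], url, ping=sent.append)
failed = run_and_ping([sys.executable, "-c", "raise SystemExit(1)"], url, ping=sent.append)
print(ok, failed, len(sent))  # 0 1 1
```

In real use you would drop the `ping=` argument and pass the actual job, for example `run_and_ping(["/home/user/backup.sh"], url)`.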

You can also send an explicit failure signal by calling the /fail endpoint — useful when you want immediate alerts rather than waiting for the deadline:

backup.py

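import urllib.request

PING_URL = "https://cronaman.dev/ping/my-backup"
FAIL_URL = "https://cronaman.dev/ping/my-backup/fail"

def run_backup():
    ...  # placeholder: your backup logic goes here

try:
    run_backup()
except Exception:
    try:
        urllib.request.urlopen(FAIL_URL, timeout=10)   # immediate failure alert
    except OSError:
        pass  # a failed ping must not mask the backup error
    raise

# Outside the try block, so a network error here is not reported as a backup failure
urllib.request.urlopen(PING_URL, timeout=10)   # success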
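When several jobs need the same success/failure reporting, one option is to wrap it in a context manager. The sketch below is an assumption about how you might structure it, not part of any Cronaman client library; `heartbeat` and `http_get` are hypothetical names, and the `/fail` suffix follows the URL pattern shown above.

```python
import urllib.request
from contextlib import contextmanager
from typing import Callable, Iterator, List


def http_get(url: str) -> None:
    urllib.request.urlopen(url, timeout=10)


@contextmanager
def heartbeat(ping_url: str, send: Callable[[str], None] = http_get) -> Iterator[None]:
    """Ping ping_url on success, or ping_url + '/fail' if the body raises."""
    try:
        yield
    except Exception:
        try:
            send(ping_url + "/fail")
        except OSError:
            pass  # never let a failed ping hide the original error
        raise
    # Reached only when the body finished without raising.
    send(ping_url)


# Demo with a stand-in sender so no request is made.
sent: List[str] = []
with heartbeat("https://cronaman.dev/ping/my-backup", send=sent.append):
    pass  # job body goes here

try:
    with heartbeat("https://cronaman.dev/ping/my-backup", send=sent.append):
        raise RuntimeError("backup failed")
except RuntimeError:
    pass

print(sent)
```

The success ping is sent outside the `try` block on purpose, so a network error while pinging is not reported as a job failure.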

What the monitor statuses mean

Cronaman tracks each monitor through a simple state machine:

New — monitor created, no pings received yet
Healthy — pings are arriving within the configured interval
Late — the expected ping hasn't arrived yet, but within the grace period
Down — the grace period has elapsed with no ping — alerts fire
Paused — temporarily silenced (e.g., during a maintenance window)

The grace period is important: it absorbs minor timing drift (load spikes, slow startup) without triggering false alarms. A 10–15 minute grace period is usually right for daily jobs.
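As a rough illustration of how these states relate to the interval and grace period, the sketch below derives a status from the last ping time. It is a simplified model under stated assumptions, not Cronaman's actual logic; the function name `monitor_status` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def monitor_status(now: datetime, last_ping: Optional[datetime],
                   interval: timedelta, grace: timedelta,
                   paused: bool = False) -> str:
    """Map the time since the last ping onto the five monitor states."""
    if paused:
        return "Paused"
    if last_ping is None:
        return "New"
    expected = last_ping + interval
    if now <= expected:
        return "Healthy"
    if now <= expected + grace:
        return "Late"  # overdue, but still inside the grace period
    return "Down"      # grace period elapsed with no ping


last = datetime(2026, 4, 27, 2, 0)
day, grace = timedelta(hours=24), timedelta(minutes=15)
print(monitor_status(datetime(2026, 4, 27, 12, 0), last, day, grace))  # Healthy
print(monitor_status(datetime(2026, 4, 28, 2, 10), last, day, grace))  # Late
print(monitor_status(datetime(2026, 4, 28, 2, 30), last, day, grace))  # Down
```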

Why not just check the logs?

Log-based monitoring seems straightforward but breaks down in practice:

Logs only tell you something happened — not that nothing happened. A job that never runs produces no log entries at all
Error patterns vary per job — you need custom rules for each one
Log pipelines introduce latency — by the time an alert fires, the job may have been broken for hours
Log aggregation adds cost and complexity that grows with the number of jobs

Heartbeat monitoring inverts the model: instead of looking for evidence of failure, you look for the absence of evidence of success. A job that never runs never sends a ping, and Cronaman catches that as soon as the deadline (interval plus grace period) passes.
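The difference is easy to see in a toy comparison. In the sketch below, both function names are hypothetical and the "ERROR" pattern is just an assumed log convention: a job that stopped running leaves an empty log, which an error scan reads as healthy, while a silence check flags it.

```python
from datetime import datetime, timedelta
from typing import List


def log_scan_finds_problem(log_lines: List[str]) -> bool:
    """Log-based check: looks for evidence of failure."""
    return any("ERROR" in line for line in log_lines)


def heartbeat_finds_problem(last_ping: datetime, now: datetime,
                            max_silence: timedelta) -> bool:
    """Heartbeat check: looks for the absence of evidence of success."""
    return now - last_ping > max_silence


# The job last ran two days ago and has produced no log lines since.
recent_logs: List[str] = []
last_ping = datetime(2026, 4, 27, 2, 0)
now = datetime(2026, 4, 29, 2, 30)
max_silence = timedelta(hours=24, minutes=15)  # interval plus grace period

print(log_scan_finds_problem(recent_logs))                   # False: nothing to match
print(heartbeat_finds_problem(last_ping, now, max_silence))  # True: the ping is overdue
```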