Homelab Monitoring with Prometheus and Grafana: The Stack I Actually Trust
Build a practical homelab monitoring stack with Prometheus, Grafana, node_exporter, and Alertmanager using real configs, alerts, and operator-tested defaults.
Author
Marcus Chen
FTC disclosure: This article contains affiliate links. If you purchase through these links, we may earn a commission at no additional cost to you.
Key Takeaways
- Uptime checks tell you that a service died. Metrics tell you that it has been dying slowly for the last three days.
- For most homelabs, Prometheus + Grafana + node_exporter + cAdvisor is enough to cover Linux hosts, Docker workloads, and alerting without building an observability theme park.
- node_exporter should usually run natively on Linux hosts, not buried inside a container where it can lie to you about the machine.
- Start with a short alert list - disk usage, filesystem saturation, memory pressure, failed scrapes, and backup host health - unless you enjoy alerts the way some people enjoy mosquitoes.
- Put the monitoring stack on a management network or behind a reverse proxy/VPN. Grafana does not need to become your newest public-facing hobby.
After years of running homelab services, I have one opinion that keeps getting stronger: if you do not monitor your infrastructure, you do not really know your infrastructure.
You know how it feels on good days. You know which container everybody blames first. You know the sound your NAS makes when it is about to ruin an evening. That is not the same thing as observability.
The stack I keep coming back to is Prometheus and Grafana. Not because it is trendy. Not because every DevOps diagram on the internet includes it. Because it is boring, flexible, well-documented, and brutally good at answering the question that matters when something acts weird: what changed, when did it change, and was it already on fire before I noticed?
If you are still sorting out the rest of your infrastructure, my guide to Proxmox storage architecture pairs well with this one. If most of your applications live in containers, read my comparison of Docker monitoring tools too - it covers where Prometheus + Grafana wins versus simpler tools.
What this stack is actually for
A lot of people start with uptime checks.
That is fine. Uptime Kuma is useful. A ping check is useful. "Can I hit port 443?" is useful.
But uptime checks only answer whether something is alive right now. They do not tell you that RAM has been climbing all week, disk I/O is getting ugly during backups, or that your Docker host has quietly been swapping like it is 2012.
Prometheus and Grafana fill that gap.
- Prometheus scrapes and stores metrics.
- Grafana turns those metrics into dashboards and alerts you can actually read.
- node_exporter exposes host-level Linux metrics.
- cAdvisor exposes container metrics.
- Alertmanager routes alerts somewhere useful.
That covers almost every small-to-medium homelab use case that matters.
The architecture I recommend
For a real homelab, keep the layout simple:
- Run Prometheus, Grafana, Alertmanager, and cAdvisor on one monitoring host.
- Run node_exporter on each Linux server or VM you care about.
- Scrape everything from Prometheus every 15 to 30 seconds.
- Store 30 to 90 days of data depending on disk space and how much trend history you care about.
If you already have a management VLAN, put the monitoring stack there. If not, at least avoid exposing Grafana directly to the internet. A reverse proxy plus SSO or VPN access is the civilized option.
If you are already segmenting your network, my homelab network segmentation guide will help you place this stack somewhere sensible. Monitoring belongs in the "important, boring, do not poke it" part of the lab.
Resource sizing - what a homelab actually needs
This is one of those topics that gets overcomplicated fast.
For most homelabs, a monitoring stack does not need enterprise hardware. It needs enough RAM, enough SSD space, and enough discipline to avoid scraping every nonsense metric at 5-second intervals.
Here is the sizing I would use:
| Homelab size | Prometheus + Grafana host recommendation | Retention |
|---|---|---|
| 1-3 servers, light Docker usage | 2 vCPU, 2 GB RAM, 20-30 GB SSD | 30 days |
| 4-10 servers, moderate containers | 2-4 vCPU, 4 GB RAM, 40-80 GB SSD | 60-90 days |
| 10+ servers, many exporters and dashboards | 4 vCPU, 8 GB RAM, 100+ GB SSD | 90 days |
For a small lab, even a tiny mini PC or VM is fine.
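If you want a number rather than a feeling, the Prometheus storage documentation gives a rough formula: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with 1 to 2 bytes per sample as the typical range. The sketch below turns that into a shell function. The function name and the 10,000-series figure are illustrative assumptions, not measurements from my lab.

```shell
# Rough Prometheus disk estimate:
#   bytes = retention_seconds * (active_series / scrape_interval) * bytes_per_sample
# Arguments: active series, scrape interval (s), retention (days), bytes per sample.
estimate_prom_disk_gb() {
  awk -v series="$1" -v interval="$2" -v days="$3" -v bps="${4:-2}" 'BEGIN {
    samples_per_sec = series / interval
    bytes = days * 86400 * samples_per_sec * bps
    printf "%.1f\n", bytes / 1000000000   # decimal gigabytes
  }'
}

# Example: 10,000 series (an assumed figure), 15 s scrapes, 60 days of retention.
estimate_prom_disk_gb 10000 15 60   # prints 6.9
```

Once Prometheus is running, the `prometheus_tsdb_head_series` metric reports your real active series count, which is a better input than a guess. The table above leaves headroom on top of a figure like this for the write-ahead log, Grafana's own data, and the operating system.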
Relevant gear I would actually consider
- Raspberry Pi 5 starter kit: https://www.amazon.com/s?k=raspberry+pi+5+8gb+starter+kit&tag=homelabaddiction-20
- Beelink S12 Pro mini PC: https://www.amazon.com/s?k=beelink+s12+pro+mini+pc&tag=homelabaddiction-20
- APC UPS for the monitoring node and switch: https://www.amazon.com/s?k=apc+bx1500m+ups&tag=homelabaddiction-20
No, you do not need a dedicated rackmount server just to learn whether /var is filling up. Let us stay emotionally grounded.
The Docker Compose stack
This is the starting point I recommend for the monitoring node.
Create a working directory:
mkdir -p /opt/monitoring
cd /opt/monitoring
Create docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=60d
      - --web.enable-lifecycle

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: change-this-now
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro

volumes:
  prometheus-data:
  grafana-data:
Then create prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["prometheus:9090"]
  - job_name: node
    static_configs:
      - targets:
          - 192.168.10.11:9100
          - 192.168.10.12:9100
          - 192.168.10.13:9100
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
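The Compose file above also mounts ./alertmanager.yml, which this article does not otherwise show. Here is a minimal sketch, assuming you send everything to a single webhook receiver. The receiver name `homelab-webhook` and the URL are placeholders I made up for illustration; swap in whatever notification target you actually use.

```shell
# Run from the monitoring working directory (e.g. /opt/monitoring).
# Writes a minimal Alertmanager config with one catch-all route.
cat > alertmanager.yml <<'EOF'
route:
  receiver: homelab-webhook        # hypothetical receiver name
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: homelab-webhook
    webhook_configs:
      # Placeholder URL: point this at your own notification bridge.
      - url: "http://192.168.10.20:8085/alerts"
        send_resolved: true
EOF
```

Grouping by alert name and instance keeps one flapping host from producing a separate notification per rule evaluation, and the repeat interval decides how often an unresolved alert nags you again.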
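Bring it up once alert.rules.yml (created later in this article) and alertmanager.yml both exist, since the Compose file bind-mounts them: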
docker compose up -d
Verify Prometheus targets:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl: .scrapeUrl, health: .health}'
If the targets are healthy, you are already ahead of most abandoned monitoring projects.
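When a target is not healthy, the same API response carries the reason. The sketch below wraps that in a small shell function, assuming `jq` is installed and Prometheus is reachable on localhost:9090 as in the Compose file. The function name `unhealthy_targets` is my own.

```shell
# Print job, scrape URL, and last error for every target that is not "up".
# Reads the /api/v1/targets JSON on stdin so it can be tried against saved output.
unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.labels.job, .scrapeUrl, .lastError]
         | @tsv'
}

# Usage against a live Prometheus (assumes the port mapping from docker-compose.yml):
#   curl -s http://localhost:9090/api/v1/targets | unhealthy_targets
```

Empty output means every target is up. Anything else is usually a wrong IP, a firewall rule, or an exporter that never started.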
Why I run node_exporter natively on Linux hosts
You can run node_exporter in a container.
You can also microwave leftover pizza in a cast-iron pan if you are creative enough. The fact that something is possible does not make it the better default.
For host metrics, native installation is cleaner and more trustworthy. It avoids weird path mapping mistakes, avoids container abstraction problems, and makes it obvious that you are monitoring the host rather than a carefully cropped version of the host.
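Use the official project docs for reference here:
- Prometheus overview
- node_exporter repository and collector docs
- Grafana documentation
Install node_exporter like this on each Linux system. The example uses the amd64 build; on ARM hosts such as a Raspberry Pi, download the matching linux-arm64 archive instead: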
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://localhost:9100/metrics | head
That last command matters.
If you do not verify that metrics are actually being exposed before editing Prometheus targets, you are creating a future troubleshooting session for yourself. Future-you is rarely grateful.
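A quick `head` proves the endpoint answers, but not that the metrics your alerts depend on are present. The following is a small sketch of a stricter check, assuming the three metric families used by the alert rules in this article. The function name `check_node_metrics` is hypothetical.

```shell
# Check that key node_exporter metric families appear in a /metrics dump.
# Reads Prometheus text exposition format on stdin, prints any missing names,
# and returns non-zero if at least one is absent.
check_node_metrics() {
  awk '
    BEGIN {
      want["node_cpu_seconds_total"] = 1
      want["node_memory_MemAvailable_bytes"] = 1
      want["node_filesystem_avail_bytes"] = 1
    }
    # Skip comment lines (# HELP / # TYPE).
    /^#/ { next }
    {
      # The metric name ends at the first "{" or space.
      name = $0
      sub(/[{ ].*/, "", name)
      if (name in want) seen[name] = 1
    }
    END {
      missing = 0
      for (m in want) if (!(m in seen)) { print "missing: " m; missing = 1 }
      exit missing
    }'
}

# Usage on the host you just configured:
#   curl -s http://localhost:9100/metrics | check_node_metrics && echo "core metrics present"
```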
What to monitor first
This is where many people overbuild.
You do not need 40 dashboards on day one. You need the handful of signals that catch real problems.
I start with these:
1. CPU saturation
Watch sustained CPU pressure, not brief spikes.
A backup job hitting 90% CPU for thirty seconds is normal. A host sitting at 90% for fifteen minutes while Docker containers time out is not.
2. Memory pressure and swap use
Swap is not always evil. Surprise swap is usually annoying.
If a box that should idle comfortably starts leaning on swap every day, something changed. That is the story metrics are good at telling.
3. Disk space and inode usage
Disk alerts are the most boring alerts in the world.
They are also the ones that save the most services.
4. Disk I/O latency and filesystem pressure
This matters a lot during backups, scrubs, media imports, and VM-heavy workloads.
If you already read my article on Proxmox backup strategies, you know backup windows are where storage lies get exposed very quickly.
5. Network throughput and dropped packets
You do not need to graph every packet forever.
You do need to know when a link is saturating or when traffic patterns suddenly change.
6. Container restarts and resource usage
cAdvisor helps you see which container is chewing RAM, thrashing CPU, or quietly restarting itself every few minutes while pretending everything is fine.
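To make the six areas above concrete, here is a sketch of starting-point PromQL expressions, wrapped in a tiny helper that calls the Prometheus HTTP query API. The helper name `promq`, the `PROM_URL` variable, and the `Q_*` variable names are my own. The metric names come from node_exporter and cAdvisor, and the container query covers resource usage only, not restarts. Treat the expressions as things to graph first, then turn into alerts once you know what normal looks like.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL defaults to the port mapping used in docker-compose.yml.
promq() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# 1. CPU busy percentage per host, averaged over 5 minutes.
Q_CPU='100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
# 2. Swap currently in use, in bytes.
Q_SWAP='node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes'
# 3. Fraction of inodes still free per filesystem.
Q_INODES='node_filesystem_files_free / node_filesystem_files'
# 4. Average seconds per completed disk read.
Q_READ_LATENCY='rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])'
# 5. Dropped inbound packets per second.
Q_NET_DROPS='rate(node_network_receive_drop_total[5m])'
# 6. Working-set memory per named container (from cAdvisor).
Q_CONTAINER_MEM='container_memory_working_set_bytes{name!=""}'

# Usage:
#   promq "$Q_INODES" | jq '.data.result'
```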
A small alert rules file that is actually useful
Create alert.rules.yml:
groups:
  - name: homelab-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target down: {{ $labels.instance }}"

      - alert: FilesystemAlmostFull
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem almost full on {{ $labels.instance }}"

      - alert: HighMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"

      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high CPU on {{ $labels.instance }}"
That is enough to start.
Please resist the temptation to import every alert pack you find online and then act surprised when your phone becomes a haunted object.
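Before trusting a rules file, it helps to validate it and reload Prometheus without restarting the container. The sketch below assumes the Compose setup from this article: `promtool` ships inside the prom/prometheus image, and the reload endpoint is available because the stack starts Prometheus with --web.enable-lifecycle. The function name is hypothetical.

```shell
# Validate alert.rules.yml with promtool, then ask the running Prometheus
# to reload its configuration. The reload is skipped if validation fails.
# Run from the directory that holds alert.rules.yml (e.g. /opt/monitoring).
check_and_reload_rules() {
  docker run --rm --entrypoint /bin/promtool \
    -v "$PWD/alert.rules.yml:/tmp/alert.rules.yml:ro" \
    prom/prometheus:latest check rules /tmp/alert.rules.yml \
  && curl -s -X POST http://localhost:9090/-/reload
}

# Usage:
#   check_and_reload_rules && echo "rules validated and reload requested"
```

After a reload, the Alerts page in the Prometheus web UI on port 9090 should list the four rules in an inactive state on a healthy lab.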
Grafana dashboards I would import first
Start with three dashboard classes:
- Node Exporter Full - host CPU, RAM, disk, network
- Docker / cAdvisor dashboard - per-container CPU and memory
- Prometheus self-monitoring - scrape health, target count, rule evaluation
The point is not beauty. The point is visibility.
Once the basics are stable, build one dashboard that reflects your lab:
- backup host health
- Proxmox storage nodes
- main Docker host
- NAS utilization
- internet edge/router stats if you export them
That is the dashboard you will actually look at.
Security and placement rules I actually follow
This part gets ignored too often.
Monitoring systems accumulate infrastructure details fast - hostnames, ports, disk layouts, service labels, and enough metadata to make an attacker's day slightly more organized.
My defaults:
- do not expose Grafana directly to the public internet
- put Prometheus and Grafana on a management VLAN or restricted subnet
- use a reverse proxy with auth if remote access is required
- prefer VPN access for admin usage
- rotate default Grafana credentials immediately
- back up Grafana data and Prometheus config like any other critical service
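The last item is easy to postpone, so here is a minimal backup sketch. It assumes the Compose project lives in /opt/monitoring, which makes Docker Compose name the Grafana volume `monitoring_grafana-data` (project name plus volume name). Check `docker volume ls` if your directory is named differently. The function name, archive names, and default destination are my own choices.

```shell
# Archive the Grafana named volume plus the config files in the working directory.
# Argument: destination directory (defaults to ./backups).
backup_monitoring() {
  dest="${1:-$PWD/backups}"
  stamp="$(date +%Y%m%d)"
  mkdir -p "$dest" || return 1
  # Mount the volume read-only; a throwaway Alpine container runs tar.
  docker run --rm \
    -v monitoring_grafana-data:/data:ro \
    -v "$dest":/backup \
    alpine tar czf "/backup/grafana-data-$stamp.tgz" -C /data . || return 1
  # Config files are plain text, so archive them straight from the host.
  tar czf "$dest/monitoring-config-$stamp.tgz" \
    docker-compose.yml prometheus.yml alert.rules.yml alertmanager.yml
}

# Usage, from /opt/monitoring:
#   backup_monitoring /mnt/backups/monitoring
```

Grafana keeps its state in a SQLite database by default, so stopping the grafana container for the few seconds the archive takes is the cautious way to get a consistent copy.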
If your Compose stack still needs cleanup, review my piece on Docker Compose best practices. Monitoring should not be the messiest application in the environment it is supposed to observe.
Common mistakes I keep seeing
Monitoring only containers, not hosts
Container graphs are useful.
They are not enough when the host kernel, disk layer, or memory pressure is the real problem.
Using too many exporters too early
You do not need SNMP, blackbox checks, process exporters, Postgres exporters, and custom scripts before you even know whether your main node is filling its root partition.
Start narrow. Expand on purpose.
Scraping too aggressively
A 5-second scrape interval looks energetic.
It also burns storage faster and rarely helps a small lab. Fifteen seconds is a sensible default. Thirty is fine for less critical targets.
Ignoring retention
Prometheus keeps 15 days of data by default. Raise that without checking disk space and it will happily store data until you discover your monitoring host needs monitoring.
Set retention intentionally.
Treating alerts like decorations
If an alert does not lead to action, either fix it or delete it.
A noisy monitoring stack teaches you to ignore the right warnings along with the bad ones.
My recommended rollout order
If you want the shortest path to a useful setup, do it in this order:
- Deploy Prometheus and Grafana on one monitoring node.
- Install node_exporter on one Linux host.
- Confirm host metrics are visible in Prometheus.
- Import one node dashboard in Grafana.
- Add cAdvisor for container metrics.
- Add two or three more hosts.
- Enable only a handful of high-value alerts.
- Back up the config and dashboard data.
That sequence gives you a real win quickly.
It also avoids the classic homelab trap of spending six hours perfecting a stack you never quite finish using.
FAQ
Do I need both Prometheus and Grafana?
Technically, no. Prometheus can show raw queries and basic graphs.
Practically, yes. Grafana is what makes the data easy to read, compare, and alert on without feeling like you are interrogating a database during a hostage situation.
Should node_exporter run in Docker or as a systemd service?
For host metrics, I prefer a native systemd service on Linux machines.
It is simpler to reason about and less likely to misreport the host because of container path mapping or permission quirks.
Is Alertmanager overkill for a small homelab?
Not if you keep the alert list short.
Alertmanager becomes overkill when people feed it every possible warning and then blame the tool for their own lack of restraint.
How much storage does Prometheus need?
For a small homelab, 20-40 GB is usually plenty for a 30- to 60-day window.
If you scrape many hosts, many exporters, or use aggressive intervals, budget more SSD space and review retention before Prometheus helpfully eats the disk it is supposed to monitor.
Can one stack monitor Proxmox, Linux VMs, and Docker together?
Yes. That is one of the best reasons to use this stack.
Prometheus does not care whether a metric came from a Docker host, a VM, a Proxmox exporter, or a physical Linux box. It just cares that the target exposes metrics cleanly.
Final recommendation
If your homelab has grown beyond "one box and a dream," build this stack.
Not because it is fashionable. Because it gives you trend data, real troubleshooting context, and a way to see problems before users, family members, or your own bad decisions discover them first.
Keep the first version small.
Monitor hosts, monitor containers, alert on the obvious failures, and lock the stack down properly. You can always grow into more exporters later. You do not get bonus points for building a monitoring cathedral before you have a usable dashboard.
That is the stack I trust. It has saved me from bad storage windows, noisy containers, and at least a few episodes of deeply undeserved confidence.
