The-Home-Node-build-004

Tags: [[The Home Node 🏡🖥️🗄️ (DevLog)]]

Build 004: Monitoring (Logging and Metrics)

March 24, 2025

Changelog

Architecture

In this build, I focus on implementing monitoring for my servers. The primary motivation was to have a single place to view all my system resources (CPU, RAM, storage, networking, etc.) and a way to log this information so I could analyze server performance later on. This would also be useful for diagnosing errors on my servers, as I would have historical data to work with.

Observability

I knew that I would want to explore Grafana, as it is a very popular monitoring dashboard tool. As I researched further, I came across the concept of observability. The term refers to collecting various kinds of data to understand the state of the system you are monitoring. This data could be application loading performance, security logs, error reports, and much more. The collected data is then used to inform actions like performance improvements and infrastructure changes.

It turns out Grafana has grown into a whole ecosystem supporting many kinds of monitored data, collectively called telemetry. It can handle system resource monitoring through metrics, system and application logging, application performance profiling, incident response management, and more. To get a tighter grasp of observability, we need to introduce its three pillars: Logs, Metrics, and Traces.

| Concept | Data                  | Use Case                                  | Application |
| ------- | --------------------- | ----------------------------------------- | ----------- |
| Logs    | Text records          | Debugging logs and errors                 | Loki        |
| Metrics | Time-series           | Performance monitoring and resource usage | Mimir       |
| Traces  | Services request flow | Finding bottlenecks, slow requests        | Tempo       |

Logs

Logs are text-based events happening within the system. These events typically contain the name of the event (what happened), the source (where it happened), and the timestamp (when it happened). Shown below is a typical system log covering a remote SSH session. Notice that these three elements are present in every line of the log.

2025-03-28T01:15:01.233228+08:00 hserver CRON[1221111]: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
2025-03-28T01:14:43.257165+08:00 hpserver systemd-logind[910]: Removed session 856.
2025-03-28T01:13:23.753646+08:00 hpserver systemd-logind[910]: Session 856 logged out. Waiting for processes to exit.
2025-03-28T01:13:23.747556+08:00 hpserver sshd[959753]: pam_unix(sshd:session): session closed for user hpserver

Logs can be generated by system processes or by applications. They can tell you practically anything about what's happening on your system, so some are less useful than others. Typically, you only check logs to look for errors when you are debugging, so it is convenient to have a way to search through thousands of log lines efficiently and to aggregate multiple log sources into one program. Programs like Grafana Loki and Elasticsearch can collect logs from various sources and store them in a database for quick access and querying.

Metrics

Metrics, on the other hand, are numerical time-series data from your system or applications. This data can be used for monitoring performance (network bandwidth, latency, uptime) and resource usage (CPU, RAM, storage).

sample metrics.png
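
For a concrete picture, a metric as Prometheus sees it is just a named value with a set of labels. The sample below is a sketch in Prometheus's text exposition format, using node_exporter-style metric names (illustrative, not taken from my actual scrape):

# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 4567.89

Each scrape records these values with a timestamp, which is what turns them into time series.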

Some popular applications that collect and store metrics are Prometheus, Grafana Mimir, InfluxDB, and Graphite. These target different use cases. For example, Prometheus runs on a single node and makes it easy to collect metrics and serve them to many applications, which is good for homelabs and small environments. Mimir is based on Prometheus but allows scalable, distributed deployments for cloud applications; it also requires object storage like Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), or MinIO. InfluxDB specializes in high-frequency data like IoT metrics.

Traces

Traces are the least known of the three, but they track the flow of requests as they move through a system. This is used for finding bottlenecks in a network architecture or an application stack. For example, suppose we want to know how much time a request spends on each step when we access a file from a remote Jellyfin server.

Remote Jellyfin server file access:

  1. Reverse proxy (Caddy) = 50 ms
  2. Jellyfin server = 100 ms
  3. Media storage (NAS, DAS, local storage) = 700 ms

In this case, we can see that the request has to go through multiple steps to access a file from the remote media server, and each step introduces a certain amount of delay. As we see from the example, media storage access is the bottleneck for this request, accounting for 700 ms of the roughly 850 ms total. Perhaps the media storage is sending data to multiple clients and the disks are reaching their maximum speed. Tracing lets us understand the flow of requests, so we can improve our systems accordingly.

LGTM Stack

A widely used stack in this field is the LGTM stack: Loki, Grafana, Tempo, and Mimir.

This stack covers the three pillars of observability plus Grafana for visualization. The LGTM stack is comprised entirely of Grafana applications, which benefits from seamless integration between the components. This is not a marketing ploy to vendor-lock the entire industry on Grafana applications: these applications are fully open source and work with other open-source monitoring projects. Instead, it is an initiative to develop a great open-source stack that empowers the community to build software with ease.

Monitoring Architecture

buiild 004 monitoring architecture.png

This is my monitoring architecture diagram. I am collecting metrics and logs for the system and Docker containers on my hpserver machine. As you can see, I am running Loki for logs and Prometheus for metrics to aggregate telemetry; these essentially serve as central databases for my logs and metrics. Grafana then queries these databases for visualization.

Loki and Prometheus do not collect the data themselves, so they need collectors to scrape it. I use Grafana Alloy and cAdvisor to collect the various telemetry from my machine, and they feed this data to the aggregators.

Grafana

As you might know by now, Grafana is a leading tool for visualizing all kinds of telemetry data. It lets you create a highly extensible, all-in-one visualization dashboard for just about every application out there. Grafana can connect to a wide range of data sources, including time-series databases, logs, relational databases, object stores, cloud services, and IoT platforms. This allows Grafana to be the only visualization dashboard you need for every use case you can think of.

grafana sample.png

Prometheus

Prometheus is the pioneer of modern open-source monitoring. It began at SoundCloud in 2012 to address the shortcomings of the technologies available at the time, and its success influenced later observability applications like Grafana Loki. At its core, Prometheus is a time-series database designed for efficient storage. Its main strength is its query language, PromQL, which enables powerful, multi-dimensional metric queries. Prometheus primarily uses a pull architecture, scraping data from sources/exporters over HTTP endpoints. Data from Prometheus can be easily visualized through a seamless Grafana integration.
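
To give a feel for PromQL, here are two example queries. They assume the standard node_exporter metric names, so treat them as a sketch rather than something specific to my setup:

# Per-instance CPU usage (%) averaged over the last 5 minutes
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Inbound network traffic per interface, in bytes per second
rate(node_network_receive_bytes_total[5m])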

Loki

Grafana Loki is Grafana's log aggregation system. It is horizontally scalable, highly available, and multi-tenant, which allows it to handle huge infrastructures with lots of logs. It integrates seamlessly with the Grafana dashboard, as expected. As the tagline says, it is "Like Prometheus, but for logs".

Unlike Prometheus, Loki uses a push architecture to collect logs from sources. The main strength of Loki is that it does not index the contents of the logs themselves, only the metadata as a set of labels. This allows for fast and efficient queries. The log data is compressed and stored in chunks on object storage like Amazon S3, Google Cloud Storage (GCS), and the like, so during queries only the labels are used to locate the data in the object store. Similar to Prometheus, Loki uses agents to gather data from applications.

Loki uses Grafana Alloy as an agent to collect logs, but it also supports OpenTelemetry and others. At the time of writing, Promtail, the previously widely used agent, is deprecated and has been replaced by Grafana Alloy.

Grafana Alloy

Grafana Alloy is a relatively new tool that aims to be the all-in-one telemetry collector: it can scrape metrics, logs, traces, and profiles from applications, and then send the data to Prometheus, Loki, Tempo, and many more. It is built on the OpenTelemetry Collector and also integrates popular collectors like Node Exporter and cAdvisor, providing seamless integration between the various observability tools. As it matures further and sees wider adoption, we can expect Grafana Alloy to become the only collector you need for all telemetry.

Note: I won't be deploying tracing as I don't have a good use for it yet. Tracing is usually relevant when you want to optimize a microservices architecture or when you are developing application stacks. So for my case, I am focusing on logs and metrics.

Configuration

Grafana

As usual, we deploy these applications through Docker containers using docker-compose.

services:
  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    restart: unless-stopped
    networks:
      - caddy-net
    volumes:
      - grafana-storage:/var/lib/grafana

volumes:
  grafana-storage: {}

networks:
  caddy-net:
    external: true
@grafana host grafana.interstellarmist.space
handle @grafana {
	reverse_proxy grafana:3000
}
grafana login.png
grafana homepage.png

Prometheus

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    networks:
      - caddy-net
      - monitoring
    command: "--config.file=/etc/prometheus/prometheus.yml"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

volumes:
  prometheus-data: {}

networks:
  caddy-net:
    external: true
  monitoring:
    external: true
@prometheus host prometheus.interstellarmist.space
handle @prometheus {
	reverse_proxy prometheus:9090
}

Note that I am attaching this container to two Docker networks: one is my reverse proxy network and the other is exclusive to monitoring data traffic. This monitoring Docker network will be used by the Prometheus data integrations we configure later on. In prometheus.yml, we can specify settings like scrape_interval, which here scrapes every 15 seconds, and the various sources to scrape from in the scrape_configs section. We scrape data from Prometheus itself, and we also add two more exporters to collect various metrics on our system (more on this later).

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # [OPTIONAL]
  # - job_name: 'node_exporter'
  #   static_configs:
  #     - targets: ['node_exporter:9100']

After deploying the Docker compose, head over to your Prometheus web UI and you should see something like this.

prometheus webui.png

Let's try a simple query on the number of Prometheus HTTP requests using the metric name promhttp_metric_handler_requests_total. We can see I have 84 requests with the metric label code="200".

prometheus test query.png
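
For reference, narrowing that count down to successful requests is just a label matcher in PromQL:

promhttp_metric_handler_requests_total{code="200"}

# requests per second over the last 5 minutes, instead of a running total
rate(promhttp_metric_handler_requests_total{code="200"}[5m])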

Adding Prometheus to Grafana

Prometheus collects metrics by scraping various sources like Node Exporter for machine metrics and cAdvisor for Docker metrics. We will use Grafana to visualize this data, so we need to connect Prometheus to Grafana by adding it as a data source. On the sidebar, go to Connections>Data Sources, click Add data source, and select Prometheus. Here you can also see the various data sources you can add to Grafana: metrics, logs, tracing, profiling, SQL, and others.

grafana add data source.png

Then provide your Prometheus server URL, scroll down, and click Save & test.

prometheus grafana data source.png

Note: The following section discusses Prometheus exporters that scrape metrics from the host machine. These are widely used and easy to configure, but I chose to use Grafana Alloy, which combines these exporters/collectors into one program. In my current setup, I was able to integrate Node Exporter into Alloy but decided to use the standalone cAdvisor exporter due to implementation challenges. cAdvisor integration within Alloy may become easier in the future, which would eliminate the need for these standalone exporters.

Prometheus Exporters

[Optional] Node Exporter

Prometheus does not collect data all by itself. It is more like a centralized store that pulls metrics from various sources called exporters. Node Exporter is an exporter that collects hardware and OS metrics like CPU, memory, storage, and networking from the machine it is installed on.

services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - "--path.rootfs=/host"
    networks:
      - monitoring
    pid: host
    restart: unless-stopped
    volumes:
      - "/:/host:ro,rslave"

networks:
  monitoring:
    external: true

Note that to allow Node Exporter to obtain host system metrics, we need to mount the root (/) directory of the system and use the host PID namespace. This might seem scary, but the mount is read-only, so the container cannot make any changes to the filesystem. Moreover, I've put the exporter on the monitoring Docker network, so access is restricted to Prometheus and any other services I explicitly include in this network. In the prometheus.yml earlier, we already specified Node Exporter as a source to scrape from; simply uncomment it as needed. The configuration is very simple: we just add the Node Exporter target, and since the containers share the monitoring Docker network, we can use the container name as the hostname.

scrape_configs:
  - job_name: "node_exporter"
    static_configs:
      - targets: ["node_exporter:9100"]

cAdvisor

Another useful exporter is cAdvisor (container Advisor), developed by Google. It collects metrics from containers, so you can observe resource usage and performance for each container. We again install it as a Docker container.

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    networks:
      - monitoring
    volumes:
      - /:/rootfs:ro
      - /run:/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    privileged: true
    restart: unless-stopped

networks:
  monitoring:
    external: true

And the matching scrape config in prometheus.yml:

scrape_configs:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

Note that we again have to mount various root-level directories as well as the Docker data directory so the program can access metrics about the Docker containers. But I've isolated network access to the monitoring Docker network, so only Prometheus can reach it. (Technically, the other exporters can communicate with each other in this network, but they shouldn't really do that.)
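
Once Prometheus is scraping cAdvisor, per-container queries become available in Grafana. A couple of sketches using cAdvisor's standard metric names (the name label carries the container name; "grafana" is just an example from my stack):

# CPU usage per container, averaged over the last 5 minutes
sum by (name) (rate(container_cpu_usage_seconds_total[5m]))

# Current memory usage of a single container
container_memory_usage_bytes{name="grafana"}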

Loki

I am choosing to deploy Grafana Loki in monolithic mode, an all-in-one deployment where all of Loki's components run pre-configured inside a single binary. This is good enough for small deployments like a homelab. The other, scalable modes are preferred for high-traffic systems and systems that require high availability.

[Image from Loki Documentation]

loki monolithic mode.png

services:
  loki:
    image: grafana/loki:latest
    container_name: loki
    command: "-config.file=/etc/loki/config.yaml"
    networks:
      - monitoring
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml:ro
      - loki-data:/loki:rw
    restart: unless-stopped

volumes:
  loki-data: {}

networks:
  monitoring:
    external: true

The Loki configuration file (loki-config.yaml):

auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

Loki does not have a web UI, so you have to check the container logs to confirm that Loki has started successfully. Optionally, you can check the endpoint http://<loki-domain>:3100/ready to see if Loki is ready; however, this is not possible in my case because access is limited to the monitoring Docker network and is not reachable from my machine.

We then add Loki as a Grafana data source, just like with Prometheus earlier. If the test connection succeeds, we are ready to move on to the next step.

loki grafana data source.png

Grafana Alloy

services:
  alloy:
    image: grafana/alloy:latest
    container_name: alloy
    networks:
      - monitoring
      # Temporary for debugging
      - caddy-net
    volumes:
      # Alloy volumes
      - alloy-data:/var/lib/alloy/data
      - ./config.alloy:/etc/alloy/config.alloy
      # Logs volumes
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/log:/var/log:ro
    command:
      - "run"
      - "--server.http.listen-addr=0.0.0.0:12345" # Debug UI
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    restart: unless-stopped

volumes:
  alloy-data: {}

networks:
  monitoring:
    external: true
  # Temporary for debugging
  caddy-net:
    external: true
@alloy host alloy.interstellarmist.space
handle @alloy {
	reverse_proxy alloy:12345
}

I've included Alloy in the reverse proxy network caddy-net temporarily to access the debugging UI, but normally access should be restricted to the monitoring network for security. The --server.http.listen-addr=0.0.0.0:12345 flag exposes the Alloy debugging UI; remove these lines once you've made sure everything works flawlessly.

System Logs

To allow Alloy to collect various telemetry and send it to aggregators like Prometheus and Loki, we need to create a configuration file, config.alloy. An Alloy configuration consists of components that connect to each other, creating a pipeline.

// SYSTEM LOGS
// 1. Match log files, check every 5s
local.file_match "local_files" {
  path_targets = [{"__path__" = "/var/log/*.log"}]
  sync_period = "5s"
}

// 2. Scrapes the files from "local_files" target and forwards to "add_labels" component.
//    tail_from_end, scrapes tail of logs instead of entire log
loki.source.file "log_scrape" {
  targets    = local.file_match.local_files.targets
  forward_to = [loki.process.add_labels.receiver]
  tail_from_end = true
}

// 3. Add label "system_logs" and hostname
loki.process "add_labels" {
  stage.static_labels {
    values = {
      service_name = "system_logs",
      host = "hpserver",
    }
  }
  forward_to = [loki.write.grafana_loki.receiver]
}

// LOKI WRITER

loki.write "grafana_loki" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

Let's break down this configuration. The first component is local.file_match with the custom label "local_files". It takes the argument path_targets, where we specify the path to our log files (/var/log/*.log), and sync_period, which specifies the update interval. Alloy then performs the file matching and produces a targets output. In the second component, loki.source.file, we set targets to the output of the first component and forward the result to the next component, loki.process.add_labels.receiver. Now we can see how a pipeline is created by chaining components. The next component simply takes the scraped logs from loki.source.file and sets some static labels. The result is then sent to loki.write, which writes to Loki using the Loki push API.

This is a simple pipeline that scrapes system logs and sends them to Loki. Adding more pipelines simply means adding more components to the Alloy configuration. Restart the container, or reload the configuration through http://<alloy-domain>:12345/-/reload, to apply the changes. If you check your debugging UI at http://<alloy-domain>:12345, you can see a graph representation of the configuration. You can also click each component to view its outputs.

allow config graph.png

Now check your Grafana instance: under Explore>Logs, filter service_name=system_logs to see the scraped system logs. You should see all your system logs like this! You can also filter by file with filename=/path/to/log. This shows the ease of using Alloy: simply add configuration pipelines for your telemetry sources, and they are picked up automatically after a configuration reload. There is no need to manage multiple exporters and collectors and configure each of them individually.

grafana loki logs.png
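
Under the hood, that Explore filter corresponds to a LogQL query: a label selector picks the stream, and an optional line filter searches the content. A sketch using the labels set in the Alloy config above:

{service_name="system_logs", host="hpserver"} |= "error"

Only the labels are indexed; the |= filter scans the content of the matching log chunks, which is why Loki stays fast as long as labels are kept small and meaningful.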

Docker Logs

For Docker logs, we simply use the following configuration. The pipeline is straightforward and similar to the previous one; we just change the components to obtain the logs with the help of docker.sock.

// DOCKER LOGS

// 1. Discover Docker containers from docker.sock
discovery.docker "linux" {
  host = "unix:///var/run/docker.sock"
}

// 2. Define relabeling rule to label service name from container name
discovery.relabel "docker_relabel" {
  targets = []

  rule {
      source_labels = ["__meta_docker_container_name"]
      regex = "/(.*)"
      target_label = "service_name"
  }
}

// 3. Reads Docker container log entries
loki.source.docker "default" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.linux.targets
  labels     = {"app" = "docker"}
  relabel_rules = discovery.relabel.docker_relabel.rules
  forward_to = [loki.write.grafana_loki.receiver]
}

// LOKI WRITER

loki.write "grafana_loki" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

To reload your configuration without restarting the container, simply go to http://<alloy-domain>:12345/-/reload. You should now see more components added to the configuration graph.

alloy docker logs graph.png

And now you can easily view all your Docker logs in one place without having to run docker logs for each container manually.

grafana docker logs.png
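
Since the relabel rule copies each container's name into service_name, pulling up a single container's logs is a one-line LogQL query (the container name here is just an example from my stack):

{service_name="grafana"}

Appending a line filter such as |= "error" narrows the results further.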

System Metrics

Before we scrape system metrics using Alloy, we need to enable the --web.enable-remote-write-receiver flag on Prometheus to allow write requests. This is because Prometheus uses a pull architecture by default, while Alloy pushes data by writing to Prometheus.

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    networks:
      - monitoring
    command: "--config.file=/etc/prometheus/prometheus.yml --web.enable-remote-write-receiver=true"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

volumes:
  prometheus-data: {}

networks:
  monitoring:
    external: true

We also need to mount the root filesystem (/) to allow Alloy to scrape system metrics. We set this to read-only to ensure it cannot modify system files, and rslave to allow host mounts to propagate recursively into the container.

services:
  alloy:
    image: grafana/alloy:latest
    container_name: alloy
    networks:
      - monitoring
      - caddy-net
    volumes:
      # Alloy volumes
      - alloy-data:/var/lib/alloy/data
      - ./config.alloy:/etc/alloy/config.alloy
      # Logs volumes
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/log:/var/log:ro
      # Metrics volumes
      - /:/rootfs:ro,rslave
    command:
      - "run"
      - "--server.http.listen-addr=0.0.0.0:12345" # Debug UI
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    restart: unless-stopped

volumes:
  alloy-data: {}

networks:
  monitoring:
    external: true
  caddy-net:
    external: true

The Alloy config uses prometheus.exporter.unix, which is a wrapper around node_exporter under the hood, to obtain the system metrics. We point it at the root filesystem mounted at /rootfs, scrape every 10 s, then write to Prometheus. Very simple. Make sure to replace the instance label with your desired hostname.

// SYSTEM METRICS

// Setup Node exporter collector and scrape metrics every 10 seconds
prometheus.exporter.unix "local_system" {
  include_exporter_metrics = true
  rootfs_path = "/rootfs"
}

prometheus.scrape "scrape_metrics" {
  targets         = prometheus.exporter.unix.local_system.targets
  forward_to      = [prometheus.relabel.rename_instance.receiver]
  scrape_interval = "10s"
}

// Change instance and job labels
prometheus.relabel "rename_instance" {
  forward_to = [prometheus.remote_write.grafana_prometheus.receiver]
  rule {
    action       = "replace"
    target_label = "instance"
    replacement  = "<hostname>"
  }
  rule {
    action       = "replace"
    target_label = "job"
    replacement  = "system-metrics"
  }
}

// PROMETHEUS WRITER

prometheus.remote_write "grafana_prometheus" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

After a reload, you should see this configuration graph. Note that I temporarily removed the logs pipeline to reduce clutter.

prometheus system logs graph.png

To check the system metrics data, go to Explore>Metrics in your Grafana instance and filter job=system-metrics.

alloy system metrics.png
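
The same data can also be queried directly in Explore or a dashboard panel using the job label set by the relabel rule above. Metric names are the usual node_exporter ones, so treat these as a sketch:

# 1-minute load average on the host
node_load1{job="system-metrics"}

# Available memory in bytes
node_memory_MemAvailable_bytes{job="system-metrics"}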

Docker Metrics

Grafana Alloy provides a prometheus.exporter.cadvisor component, which uses the popular cAdvisor tool from Google. This makes it possible to collect Docker metrics and write them to Prometheus. From my testing, I found getting cAdvisor to work inside Alloy pretty challenging, as there is not a lot of documentation for it. I did manage to get some important Docker metrics working with cAdvisor in Alloy, but I couldn't figure out how to emulate the same behavior as the standalone container that exports directly to Prometheus. My configuration is below if you are interested, but I've decided to go with standalone cAdvisor and Prometheus for now. Instructions for that are outlined in the Prometheus section.

services:
  alloy:
    image: grafana/alloy:latest
    container_name: alloy
    networks:
      - monitoring
      - caddy-net
    volumes:
      # Alloy volumes
      - alloy-data:/var/lib/alloy/data
      - ./config.alloy:/etc/alloy/config.alloy
      # Cadvisor volumes
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/:/var/lib/docker:ro
      - /sys:/sys:ro
      - /dev:/dev:ro
    cap_add:
      - NET_ADMIN
      - SYS_PTRACE
      - SYS_ADMIN
    pid: host
    command:
      - "run"
      - "--server.http.listen-addr=0.0.0.0:12345" # Debug UI
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    restart: unless-stopped

volumes:
  alloy-data: {}

networks:
  monitoring:
    external: true
  caddy-net:
    external: true

And the corresponding config.alloy components:

prometheus.exporter.cadvisor "localhost" {
  docker_host = "unix:///var/run/docker.sock"
  storage_duration = "5m"
}

prometheus.scrape "docker_metrics" {
  targets = prometheus.exporter.cadvisor.localhost.targets
  forward_to = [prometheus.remote_write.grafana_prometheus.receiver]
}

prometheus.remote_write "grafana_prometheus" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

Putting all configuration together

Simply put all the Alloy configuration in the same config.alloy file; the order doesn't really matter since Alloy figures out the pipeline from the component references. Just keep a single copy of shared components like loki.write and prometheus.remote_write, since component names must be unique. The full configuration files can be found on my [GitHub repo]. Go ahead and check your Alloy components graph; you should see something like this.

Alloy full graph.png

Alloy is now able to serve as an all-in-one collector for your logs and metrics, minus the Docker metrics for now. The advantage is that this is very extensible: if you decide to add tracing and profiling, or to run clusters of Alloy instances, you can do so just by extending your Alloy configuration. Maybe someday Docker metrics will be seamless too! 🚀

Adding Dashboards to Grafana

Customizing your own Grafana dashboard can be complicated at the start, because you need to be familiar with all the metrics from your various sources and query them using PromQL (the Prometheus query language). Thankfully, you can import other people's pre-made Grafana dashboards here. This collection contains many popular dashboards that can serve as a starting point for monitoring your data, which you can then edit according to your needs later on.

grafana premade dashboard.png
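
Before importing anything, it helps to see what a hand-written panel query looks like. For instance, a RAM-usage-percentage panel could use a PromQL expression like this (a sketch built from standard node_exporter metrics, not taken from any particular dashboard):

# Percentage of memory currently in use
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)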

We will be using the very popular Node Exporter Full dashboard. It simply uses the metrics from Node Exporter, which we already added earlier. To use it, we just need to take note of the dashboard ID, in this case ID=1860.

grafana node exporter full.png

Then, on your Grafana instance, go to Dashboards, open the dropdown on the New button, and select Import. Simply input the ID from earlier, as shown below, then click Load.

grafana import dashboard.png

Make sure to select your Prometheus data source below, then click Import.

grafana import dashboard 2.png

There you go! Now you have all your system metrics in one dashboard. If you scroll down, you'll see many more metrics to view.

grafana node exporter full dashboard.png

For cAdvisor, I found cAdvisor Docker Insights (ID: 19908) to be simple enough, and cadvisor dashboard (ID: 19792) if you want every metric, like Node Exporter Full.

grafana cAdvisor Docker Insights.png
grafana cadvisor dashboard.png
cAdvisor insights sample dashboard.png
cAdvisor dashboard sample.png

Grafana Stack Conclusion

Now we have every metric and log from our system and Docker containers in a single dashboard application. But of course, this is just a starting point. As you go through your metrics and logs, you will learn which ones are important to you and start to customize your own dashboards. At the start, though, you at least have to be familiar with what data is available and work from there.

Uptime Kuma (status monitoring)

I have decided to deploy another monitoring application that is way simpler than Grafana. Uptime Kuma is a simple but elegant dashboard for monitoring the uptime of your services or any website you want to watch. It lets me quickly check the status of my services, and I can head to Grafana if I need to investigate deeper. Although Grafana can also be configured to do this, I just like the minimal aesthetic of this application: Uptime Kuma efficiently conveys the status of your applications through design, while Grafana is better suited to heavier, more extensive monitoring.

Uptime Kuma demo.png

services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    volumes:
      - ./data:/app/data
    networks:
      - caddy-net
      - monitoring
    restart: unless-stopped

networks:
  caddy-net:
    external: true
  monitoring:
    external: true
@uptimekuma host uptime-kuma.interstellarmist.space
handle @uptimekuma {
	reverse_proxy uptime-kuma:3001
}

You should see the Uptime Kuma sign up page once you spin up the container.

uptime kuma signup.png

Adding monitors is quite straightforward. Most of the time, you will only need HTTP(s) monitors if your application has a web UI, as shown below. But if your service is something like a database, you can check a specific port instead.

uptime kuma example config.png

Future Plans

Monitoring was way deeper than I thought. I thought I could fit some automation with Ansible into this build entry, but I guess that's another can of worms, so I'll explore it in the next build. I would like to automate the administration of my servers, like simple system updates, and eventually create an Ansible playbook of the entire server configuration. That way I can easily reproduce this exact setup without having to install everything by hand. This is also convenient when a device suddenly breaks: you can replicate your system configuration with a single click.