Documenting Efforts to Introduce Active Monitoring

Photo by Luke Chesser / Unsplash

1. Background

The cloud VM is currently secured only in a reactive way: SSH has been hardened and fail2ban blocks active attackers. Beyond SSH brute-force attempts, however, there is no visibility into the server's overall health, performance, or security posture. We are "flying blind" on web traffic, application errors, system performance, and privileged access. Investigating an issue is manual and slow: it means logging into the server and checking multiple log files one by one.

2. Objective & Goal

This project aims to build a centralized, real-time observability platform. The goal is to move from a reactive security model to a proactive monitoring model. By aggregating logs from disparate sources into a single dashboard, we will gain immediate, actionable insights into the server's activity, enabling faster debugging, better security awareness, and performance tuning.

3. Scope

In Scope (V1):

  • Log Aggregation: Collect and centralize logs from the following core sources:
    • Web Server (Nginx/Apache access.log & error.log)
    • System Authentication (auth.log / secure)
    • Database (mysql-slow.log or equivalent)
  • Visualization: Develop a Grafana dashboard with key panels to visualize this data.
  • Alerting: Configure at least one critical alert to be sent to a chosen notification channel (e.g., email, Discord).

Out of Scope (V1):

  • Metrics collection (e.g., CPU, RAM, Disk I/O via Prometheus). This is a potential V2.
  • Distributed tracing for application code.
  • Automated remediation actions based on alerts.

4. Features & User Stories

As the Server Administrator, I want to...

Feature 1: Centralized Log Aggregation

  • Story: ...collect logs from multiple system components into a single, searchable interface (Loki) so that I don't have to SSH into the machine for diagnostics.
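To make the aggregation step concrete, here is a minimal sketch of the JSON body that Loki's push endpoint (/loki/api/v1/push) accepts. In practice a log-shipping agent would normally tail the files and send this for you; the function name build_loki_payload, the label names, and the sample lines are assumptions made up for illustration.

```python
import json
import time


def build_loki_payload(lines, labels, now_ns=None):
    """Build a request body for Loki's push endpoint.

    All lines share one label set and, to keep the sketch short, one
    timestamp. Loki expects timestamps as strings holding Unix epoch
    nanoseconds.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    values = [[str(now_ns), line.rstrip("\n")] for line in lines]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical labels and log lines; pick names that match your own setup.
payload = build_loki_payload(
    ["GET /index.html 200", "GET /missing 404"],
    {"job": "nginx", "filename": "/var/log/nginx/access.log"},
    now_ns=1700000000000000000,
)
body = json.dumps(payload)  # what would be POSTed with Content-Type: application/json
```

Grouping lines into streams by label set is what later lets the dashboard filter by source (web server, auth, database) without opening each file.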

Feature 2: Web Traffic & Health Dashboard

  • Story 1: ...see a real-time geographical map of website visitors to understand my user base and identify anomalous regional traffic.
  • Story 2: ...view a graph of HTTP 4xx vs 5xx error rates over time, so I can instantly detect when the application is failing for users.
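The 4xx vs 5xx panel in Story 2 boils down to counting error responses per time bucket. The sketch below shows that calculation offline in Python, assuming the access log uses the common/combined format; the function name count_errors_per_minute and the sample lines are hypothetical, and the real panel would be driven by a query in Grafana rather than by this script.

```python
import re
from collections import Counter

# Pulls the timestamp and status code out of a common/combined format line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) ')


def count_errors_per_minute(lines):
    """Return a Counter keyed by (minute, "4xx" or "5xx")."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an access-log line we understand
        status = int(m.group("status"))
        if 400 <= status <= 499:
            klass = "4xx"
        elif 500 <= status <= 599:
            klass = "5xx"
        else:
            continue  # successful or redirect responses are not errors
        # Timestamp looks like 10/Oct/2024:13:55:36 +0000; keep up to the minute.
        minute = m.group("ts")[:17]
        counts[(minute, klass)] += 1
    return counts


# Made-up sample lines using documentation-range IP addresses.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:40 +0000] "GET /nope HTTP/1.1" 404 153 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2024:13:55:59 +0000] "POST /api HTTP/1.1" 502 0 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2024:13:56:01 +0000] "POST /api HTTP/1.1" 500 0 "-" "curl/8.0"',
]
errors = count_errors_per_minute(sample)
```

Keeping 4xx and 5xx as separate series matters: a spike in 4xx usually points at clients or bad links, while a spike in 5xx points at the application itself.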

Feature 3: Database Performance Monitoring

  • Story: ...view a list of the slowest database queries, so I can identify and optimize performance bottlenecks that are slowing down the website.
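As an illustration of what the "slowest queries" list needs from the log, the following sketch ranks entries by their Query_time header. It assumes the MySQL slow query log layout (comment lines starting with #, then a SET timestamp line, then the statement); the function name slowest_queries and the sample entries are invented for the example, and other databases will need a different parser.

```python
import re

# Header line written before each statement in a MySQL slow query log.
QUERY_TIME_RE = re.compile(r"^# Query_time: (?P<secs>\d+(?:\.\d+)?)")


def slowest_queries(log_text, top_n=5):
    """Return up to top_n (seconds, sql) tuples, slowest first."""
    entries = []  # each item is [query_time_seconds, sql_text]
    for raw in log_text.splitlines():
        line = raw.strip()
        m = QUERY_TIME_RE.match(line)
        if m:
            # A new entry starts at every Query_time header.
            entries.append([float(m.group("secs")), ""])
        elif not line or line.startswith("#") or line.startswith("SET timestamp="):
            continue  # other headers and bookkeeping lines
        elif entries:
            # Statements may span several lines; join them with spaces.
            entries[-1][1] = (entries[-1][1] + " " + line).strip()
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    return [(secs, sql) for secs, sql in ranked[:top_n]]


# Made-up sample entries.
sample_log = """\
# Time: 2024-10-10T13:55:36.000000Z
# User@Host: app[app] @ localhost []  Id:     8
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 90000
SET timestamp=1728568536;
SELECT * FROM orders WHERE note LIKE '%late%';
# Time: 2024-10-10T13:56:02.000000Z
# User@Host: app[app] @ localhost []  Id:     9
# Query_time: 7.250000  Lock_time: 0.000000 Rows_sent: 10  Rows_examined: 400000
SET timestamp=1728568562;
SELECT customer_id, COUNT(*)
FROM orders GROUP BY customer_id;
"""
top = slowest_queries(sample_log, top_n=1)
```

The same extraction (query time plus statement text) is what a dashboard table would sort on, whichever tool ends up doing the parsing.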

Feature 4: Privileged Access Alerting

  • Story: ...receive an immediate alert whenever a sudo command is executed on the server, so I have full awareness of all administrative actions and potential security breaches.
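The matching logic behind such an alert can be sketched as a filter over auth.log lines. This assumes the usual sudo syslog layout (invoking user, then TTY, PWD, USER, and COMMAND fields separated by " ; "); the names find_sudo_events and format_alert, and the sample lines, are hypothetical. Actual delivery to email or Discord would be handled by the alerting channel, not by this snippet.

```python
import re

# Matches sudo command lines such as:
#   ... sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/id
SUDO_RE = re.compile(
    r"sudo(?:\[\d+\])?:\s+(?P<user>\S+) : .*?"
    r"USER=(?P<target>\S+) ; COMMAND=(?P<command>.+)$"
)


def find_sudo_events(lines):
    """Return one dict per sudo command found in the given log lines."""
    events = []
    for line in lines:
        m = SUDO_RE.search(line.rstrip("\n"))
        if m:
            events.append(m.groupdict())
    return events


def format_alert(event):
    """Render a short, human-readable alert message."""
    return f"sudo by {event['user']} as {event['target']}: {event['command']}"


# Made-up sample lines; only the second one is a sudo command.
sample_lines = [
    "Oct 10 13:56:58 web01 sshd[2211]: Accepted publickey for alice from 203.0.113.5 port 50022 ssh2",
    "Oct 10 13:57:01 web01 sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/systemctl restart nginx",
    "Oct 10 13:57:01 web01 sudo: pam_unix(sudo:session): session opened for user root by alice(uid=1000)",
]
events = find_sudo_events(sample_lines)
messages = [format_alert(e) for e in events]
```

Matching on the COMMAND= field rather than on the bare word "sudo" avoids double alerts from the PAM session lines that accompany each invocation.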

5. Success Metrics

The project will be considered a success when:

  • At least 3 different log sources are being ingested into Loki continuously.
  • A Grafana dashboard exists with visualizations for all features listed above.
  • A sudo usage alert can be triggered and is received within 2 minutes of the event.
  • Time-to-Insight: The time required to identify the cause of an HTTP 500 error drops from potentially hours (manual log review) to under 5 minutes (via the dashboard).