# Monitoring & Alerting

The UBU Finance backend implements comprehensive monitoring, logging, and alerting to provide real-time visibility into system health and security events.

## Overview
Monitoring and alerting are crucial for maintaining the security and reliability of the UBU Finance backend. The system collects metrics, logs events, and sends alerts when potential issues are detected.
## Components

### Prometheus

Prometheus is used for metrics collection and storage. It scrapes metrics from the application and other services at regular intervals and stores them in a time-series database.

### Grafana

Grafana is used for visualization and dashboarding. It provides a user-friendly interface for viewing metrics and creating alerts.

### Alertmanager

Alertmanager handles alerts from Prometheus and routes them to the appropriate receiver (email, Slack, etc.).

### Node Exporter

Node Exporter collects host-level metrics such as CPU, memory, and disk usage.

### cAdvisor

cAdvisor collects container metrics for monitoring Docker containers.
## Metrics

The UBU Finance backend collects the following metrics; an instrumentation sketch follows the lists below.

### HTTP Metrics

- Request count by endpoint and method
- Request duration by endpoint
- Error count by endpoint and status code

### Security Metrics

- Login attempts (success, failure)
- Rate limit hits
- Account lockouts
- Token blacklist hits
- IP whitelist blocks

### Business Metrics

- Transaction count by type
- Loan count by status
- Active users

### System Metrics

- CPU usage
- Memory usage
- Disk usage
- Network traffic
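As a rough illustration of how such metrics could be exposed, the sketch below uses the Python prometheus_client library. The metric names, labels, and port are assumptions for illustration, not the backend's actual instrumentation.

```python
# Illustrative sketch only: metric names, labels, and the port are assumptions.
import time

from prometheus_client import Counter, Histogram, start_http_server

# HTTP metrics
REQUESTS = Counter("http_requests_total", "Request count", ["endpoint", "method"])
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])
ERRORS = Counter("http_errors_total", "Error count", ["endpoint", "status_code"])

# Security metrics
LOGIN_ATTEMPTS = Counter("login_attempts_total", "Login attempts", ["result"])
RATE_LIMIT_HITS = Counter("rate_limit_hits_total", "Rate limit hits")


def record_request(endpoint: str, method: str, status_code: int, duration_seconds: float) -> None:
    """Record one handled HTTP request."""
    REQUESTS.labels(endpoint=endpoint, method=method).inc()
    DURATION.labels(endpoint=endpoint).observe(duration_seconds)
    if status_code >= 400:
        ERRORS.labels(endpoint=endpoint, status_code=str(status_code)).inc()


if __name__ == "__main__":
    # Expose the metrics on :8000/metrics so Prometheus can scrape them.
    start_http_server(8000)
    record_request("/api/v1/loans", "GET", 200, 0.042)
    LOGIN_ATTEMPTS.labels(result="success").inc()
    while True:  # keep the process alive so the endpoint can be scraped
        time.sleep(5)
```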
## Logging

The UBU Finance backend uses structured logging to record events in a format that is easy to parse and analyze. Logs are written to files and can be forwarded to a centralized logging system.

### Log Levels
- DEBUG: Detailed information for debugging
- INFO: General information about system operation
- WARNING: Potential issues that don't affect normal operation
- ERROR: Errors that affect a specific operation
- CRITICAL: Critical errors that affect the entire system
### Log Format

Logs are formatted as JSON for easy parsing:

```json
{
  "timestamp": "2023-05-23T12:34:56.789Z",
  "level": "INFO",
  "logger": "app.auth.authentication",
  "message": "User logged in successfully",
  "user_id": "123e4567-e89b-12d3-a456-426614174000",
  "ip": "192.168.1.1"
}
```
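A minimal sketch of how records in this shape could be produced with Python's standard logging module; the formatter below is an illustration, not necessarily the one the backend uses.

```python
# Illustrative sketch: the backend's actual JSON formatter may differ.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON objects."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through structured context passed via the `extra` argument.
        for key in ("user_id", "ip"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app.auth.authentication")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User logged in successfully",
    extra={"user_id": "123e4567-e89b-12d3-a456-426614174000", "ip": "192.168.1.1"},
)
```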
## Alerting

Alerts are triggered when certain conditions are met, such as high error rates or system resource exhaustion. Alerts can be sent via email, Slack, or webhooks. A sketch of corresponding Prometheus alert rules follows the list below.

### Alert Types
- High Error Rate: Triggered when the error rate exceeds a threshold
- High Response Time: Triggered when response times exceed a threshold
- High Rate Limit Hits: Triggered when rate limit hits exceed a threshold
- High Account Lockouts: Triggered when account lockouts exceed a threshold
- System Resource Exhaustion: Triggered when CPU, memory, or disk usage exceeds a threshold
- Service Down: Triggered when a service is unreachable
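The conditions above map onto Prometheus alerting rules, which belong in the rules.yml file referenced by the Prometheus configuration later in this document. The sketch below shows two such rules; the metric names and thresholds are illustrative assumptions, not the project's actual rules.

```yaml
# Illustrative rules only: metric names and thresholds are assumptions.
groups:
  - name: ubu-finance-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "More than 5% of requests failed over the last 5 minutes."
      - alert: ServiceDown
        expr: up{job="ubu-finance-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "UBU Finance backend is unreachable"
          description: "Prometheus has not been able to scrape the backend for 1 minute."
```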
## Setup

### Docker Compose

The monitoring infrastructure is set up using Docker Compose. The docker-compose.yml file includes all the necessary services:
```yaml
version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./monitoring/alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
### Starting the Monitoring Stack

To start the monitoring stack, run the following from the directory containing docker-compose.yml:
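This assumes the classic docker-compose CLI; with the Docker Compose v2 plugin, use `docker compose up -d` instead.

```bash
docker-compose up -d
```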
### Accessing the Dashboards
- Grafana: http://localhost:3000 (default credentials: admin/admin)
- Prometheus: http://localhost:9090
- Alertmanager: http://localhost:9093
## Client Implementation Examples

### Checking System Health

The examples below show equivalent health-check and metrics clients in Python, JavaScript, Bash, and C#.
```python
import requests


def check_system_health(base_url):
    """Call the backend's /health endpoint and summarize the result."""
    url = f"{base_url}/health"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return {
                "status": "healthy",
                "details": response.json()
            }
        else:
            return {
                "status": "unhealthy",
                "details": response.json() if response.text else {"error": "No response"}
            }
    except Exception as e:
        return {
            "status": "error",
            "details": {"error": str(e)}
        }


def get_metrics(base_url):
    """Fetch /metrics and parse the Prometheus exposition format into a dict."""
    url = f"{base_url}/metrics"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Parse Prometheus metrics format
            metrics = {}
            for line in response.text.split('\n'):
                if line and not line.startswith('#'):
                    parts = line.split(' ')
                    if len(parts) >= 2:
                        metrics[parts[0]] = float(parts[1])
            return metrics
        else:
            return {"error": "Failed to get metrics"}
    except Exception as e:
        return {"error": str(e)}
```
```javascript
async function checkSystemHealth(baseUrl) {
  const url = `${baseUrl}/health`;
  try {
    const response = await fetch(url);
    if (response.ok) {
      const data = await response.json();
      return {
        status: "healthy",
        details: data
      };
    } else {
      const data = await response.json().catch(() => ({ error: "No response" }));
      return {
        status: "unhealthy",
        details: data
      };
    }
  } catch (error) {
    return {
      status: "error",
      details: { error: error.message }
    };
  }
}

async function getMetrics(baseUrl) {
  const url = `${baseUrl}/metrics`;
  try {
    const response = await fetch(url);
    if (response.ok) {
      const text = await response.text();
      // Parse Prometheus metrics format
      const metrics = {};
      text.split('\n').forEach(line => {
        if (line && !line.startsWith('#')) {
          const parts = line.split(' ');
          if (parts.length >= 2) {
            metrics[parts[0]] = parseFloat(parts[1]);
          }
        }
      });
      return metrics;
    } else {
      return { error: "Failed to get metrics" };
    }
  } catch (error) {
    return { error: error.message };
  }
}
```
```bash
#!/bin/bash

check_system_health() {
  local base_url=$1
  response=$(curl -s -w "%{http_code}" "$base_url/health")
  http_code=${response: -3}
  content=${response:0:${#response}-3}

  if [ "$http_code" == "200" ]; then
    echo "System is healthy"
    echo "$content"
  else
    echo "System is unhealthy"
    echo "$content"
  fi
}

get_metrics() {
  local base_url=$1
  response=$(curl -s -w "%{http_code}" "$base_url/metrics")
  http_code=${response: -3}
  content=${response:0:${#response}-3}

  if [ "$http_code" == "200" ]; then
    echo "$content"
  else
    echo "Failed to get metrics"
  fi
}
```
```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class MonitoringClient
{
    private readonly HttpClient _client;
    private readonly string _baseUrl;

    public MonitoringClient(string baseUrl)
    {
        _client = new HttpClient();
        _baseUrl = baseUrl;
    }

    public class HealthStatus
    {
        public string status { get; set; }
        public object details { get; set; }
    }

    public async Task<HealthStatus> CheckSystemHealthAsync()
    {
        string url = $"{_baseUrl}/health";
        try
        {
            HttpResponseMessage response = await _client.GetAsync(url);
            string content = await response.Content.ReadAsStringAsync();

            if (response.IsSuccessStatusCode)
            {
                object details = JsonConvert.DeserializeObject(content);
                return new HealthStatus
                {
                    status = "healthy",
                    details = details
                };
            }
            else
            {
                object details;
                try
                {
                    details = JsonConvert.DeserializeObject(content);
                }
                catch
                {
                    details = new { error = "No response" };
                }

                return new HealthStatus
                {
                    status = "unhealthy",
                    details = details
                };
            }
        }
        catch (Exception ex)
        {
            return new HealthStatus
            {
                status = "error",
                details = new { error = ex.Message }
            };
        }
    }

    public async Task<Dictionary<string, double>> GetMetricsAsync()
    {
        string url = $"{_baseUrl}/metrics";
        try
        {
            HttpResponseMessage response = await _client.GetAsync(url);
            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();

                // Parse Prometheus metrics format
                var metrics = new Dictionary<string, double>();
                foreach (string line in content.Split('\n'))
                {
                    if (!string.IsNullOrEmpty(line) && !line.StartsWith("#"))
                    {
                        string[] parts = line.Split(' ');
                        if (parts.Length >= 2 && double.TryParse(parts[1], out double value))
                        {
                            metrics[parts[0]] = value;
                        }
                    }
                }
                return metrics;
            }
            else
            {
                return new Dictionary<string, double> { { "error", 1 } };
            }
        }
        catch (Exception)
        {
            return new Dictionary<string, double> { { "error", 1 } };
        }
    }
}
```
## Configuration

### Prometheus Configuration

The Prometheus configuration is defined in monitoring/prometheus/prometheus.yml:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'ubu-finance-backend'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8080']
```
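To verify that the scrape jobs defined above are healthy, you can query the Prometheus HTTP API. A small sketch in Python, assuming Prometheus is reachable on localhost:9090 as in the Compose file:

```python
# Report any scrape targets that Prometheus does not consider "up".
import requests


def list_unhealthy_targets(prometheus_url: str = "http://localhost:9090"):
    resp = requests.get(f"{prometheus_url}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]
    return [t["scrapeUrl"] for t in targets if t["health"] != "up"]


if __name__ == "__main__":
    down = list_unhealthy_targets()
    print("All targets up" if not down else f"Unhealthy targets: {down}")
```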
### Alertmanager Configuration

The Alertmanager configuration is defined in monitoring/alertmanager/alertmanager.yml:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@ubufinance.com'
  smtp_auth_username: 'alerts@ubufinance.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  receiver: 'email-notifications'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: 'all-notifications'
      continue: true
    - match:
        severity: warning
      receiver: 'email-notifications'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@ubufinance.com'
        send_resolved: true

  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
        text: >-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Time:* {{ .StartsAt }}
          {{ end }}

  - name: 'all-notifications'
    email_configs:
      - to: 'admin@ubufinance.com,oncall@ubufinance.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
        text: >-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Time:* {{ .StartsAt }}
          {{ end }}
```
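Routing and receivers can be exercised end to end by posting a synthetic alert to Alertmanager's v2 API. A sketch in Python, assuming Alertmanager is reachable on localhost:9093 as in the Compose file; the label values are illustrative:

```python
# Inject a short-lived test alert to verify Alertmanager routing and receivers.
from datetime import datetime, timedelta, timezone

import requests


def send_test_alert(alertmanager_url: str = "http://localhost:9093") -> None:
    now = datetime.now(timezone.utc)
    alerts = [{
        "labels": {
            "alertname": "TestAlert",          # illustrative label values
            "severity": "warning",
            "service": "ubu-finance-backend",
        },
        "annotations": {
            "summary": "Test alert",
            "description": "Manually injected to verify alert routing.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{alertmanager_url}/api/v2/alerts", json=alerts, timeout=5)
    resp.raise_for_status()


if __name__ == "__main__":
    send_test_alert()
```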
### Grafana Configuration

Grafana is configured with datasources and dashboards in the monitoring/grafana/provisioning directory (a sketch of the datasource file follows the list below):

- datasources/datasource.yml: Configures Prometheus as a datasource
- dashboards/dashboard.yml: Configures dashboard provisioning
- dashboards/json/ubu_finance_overview.json: Defines the UBU Finance overview dashboard
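As an illustration, a provisioned Prometheus datasource file typically looks like the sketch below; the exact contents of the project's datasource.yml may differ.

```yaml
# Illustrative sketch; the project's actual datasource.yml may differ.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```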
## Best Practices

- Monitor Critical Metrics: Focus on the metrics that are critical to your application's health and security.
- Set Appropriate Thresholds: Choose alert thresholds that balance sensitivity and noise.
- Use Structured Logging: Structured logs are easier to parse and analyze.
- Implement Log Rotation: Rotate log files so they don't grow without bound (see the sketch after this list).
- Secure Monitoring Endpoints: Restrict access to monitoring endpoints to prevent unauthorized access.
- Regularly Review Alerts: Periodically review alerts to ensure they are still relevant and effective.
- Document Alert Procedures: Document response procedures for each type of alert.
- Test Alert Channels: Regularly test alert channels to confirm they are working correctly.
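For the log rotation recommendation, a minimal sketch using Python's standard RotatingFileHandler; the file path and rotation limits are assumptions.

```python
# Illustrative sketch: the log path and rotation limits are assumptions.
import logging
from logging.handlers import RotatingFileHandler

# Rotate at ~10 MB and keep five old files alongside the active one.
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
# In practice this handler would be paired with the JSON formatter shown in the Logging section.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Log rotation configured")
```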