Monitoring Stack

This document provides a comprehensive guide to the monitoring implementation in PTIIKInsight using Prometheus and Grafana for system observability and performance tracking.

Overview

The PTIIKInsight monitoring stack consists of:

  • Prometheus: Metrics collection and time-series database

  • Grafana: Visualization dashboards and alerting

  • FastAPI Instrumentator: Automatic metrics collection from API

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │    │   Prometheus    │    │    Grafana      │
│   Port: 8000    │────┤   Port: 9090    │────┤   Port: 3000    │
│                 │    │                 │    │                 │
│ /metrics        │    │ Scrapes every   │    │ Queries         │
│ endpoint        │    │ 15 seconds      │    │ Prometheus      │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Implementation Details

1. FastAPI Metrics Integration

The FastAPI application automatically exposes metrics using the prometheus-fastapi-instrumentator library:

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Histogram, Gauge

app = FastAPI(title="PTIIK Insight API", description="ML-powered topic analysis")

# Initialize Prometheus metrics
instrumentator = Instrumentator()
instrumentator.instrument(app).expose(app)

# Custom metrics for ML operations
model_predictions_total = Counter('model_predictions_total', 'Total number of model predictions')
model_prediction_errors_total = Counter('model_prediction_errors_total', 'Total number of prediction errors')
model_prediction_duration = Histogram('model_prediction_duration_seconds', 'Time spent on predictions')
model_accuracy = Gauge('model_accuracy', 'Current model accuracy')
scraping_requests_total = Counter('scraping_requests_total', 'Total number of scraping requests')
scraping_errors_total = Counter('scraping_errors_total', 'Total number of scraping errors')

Available Metrics:

  • Standard HTTP metrics (request count, duration, status codes)

  • Custom ML metrics (predictions, errors, accuracy)

  • Application-specific metrics (scraping operations)
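
The instrumentator handles the standard HTTP metrics automatically, but the custom counters and the histogram only change when application code updates them. The sketch below shows one way an endpoint might do that; the /predict route and the run_topic_model() helper are illustrative assumptions, not part of the snippet above.

import time

from fastapi import HTTPException

# Illustrative only: reuses the metric objects defined above and assumes a
# hypothetical run_topic_model() prediction function.
@app.post("/predict")
async def predict(payload: dict):
    model_predictions_total.inc()                    # count every prediction attempt
    start = time.perf_counter()
    try:
        topics = run_topic_model(payload["text"])    # placeholder for the real model call
        return {"topics": topics}
    except Exception as exc:
        model_prediction_errors_total.inc()          # count failures separately
        raise HTTPException(status_code=500, detail=str(exc))
    finally:
        model_prediction_duration.observe(time.perf_counter() - start)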

2. Prometheus Configuration

The Prometheus configuration file (monitoring/prometheus/prometheus.yml) defines how metrics are collected:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'fastapi-app'
    static_configs:
      - targets: ['fastapi-app:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s

Key Configuration:

  • Scrape Interval: 15 seconds between metric collections

  • Target: FastAPI application on port 8000

  • Metrics Path: /metrics endpoint

  • Rule Files: Custom alerting rules from rules/ directory
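
Once the stack is running, the quickest way to confirm that this scrape configuration took effect is Prometheus's own HTTP API. A minimal sketch, assuming Prometheus is published on localhost:9090 as in the compose file below:

# List active scrape targets and their health via the Prometheus HTTP API.
import requests

resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
resp.raise_for_status()
for target in resp.json()["data"]["activeTargets"]:
    print(target["labels"].get("job"), target["scrapeUrl"], target["health"])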

3. Docker Compose Configuration

The monitoring services are defined in docker-compose.yml:

services:
  fastapi-app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: fastapi-scraper
    ports:
      - "8000:8000"
    volumes:
      - ./data:/app/data
      - ./model:/app/model
      - ./preprocessing:/app/preprocessing
      - ./api:/app/api
    restart: always

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/rules:/etc/prometheus/rules
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana   # named volume declared below, so dashboards and settings persist
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_USERS_DEFAULT_THEME=dark
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    depends_on:
      - prometheus
    restart: always

volumes:
  grafana-storage:
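
After docker-compose up, a quick sanity check is to hit each service's standard health or metrics endpoint on its mapped port. A rough sketch, assuming the default port mappings above:

# Post-startup check for the three services, assuming the port mappings above.
import requests

checks = {
    "FastAPI /metrics":      "http://localhost:8000/metrics",
    "Prometheus /-/healthy": "http://localhost:9090/-/healthy",
    "Grafana /api/health":   "http://localhost:3000/api/health",
}
for name, url in checks.items():
    try:
        print(name, "->", requests.get(url, timeout=5).status_code)
    except requests.RequestException as exc:
        print(name, "-> unreachable:", exc)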

Monitoring Features

Available Metrics

HTTP Metrics (Automatic):

  • http_requests_total: Total HTTP requests by method and status

  • http_request_duration_seconds: Request processing time

  • http_request_size_bytes: Request payload size

  • http_response_size_bytes: Response payload size

Custom ML Metrics:

  • model_predictions_total: Count of topic predictions made

  • model_prediction_errors_total: Count of prediction failures

  • model_prediction_duration_seconds: Time spent on predictions

  • model_accuracy: Current model accuracy score

  • scraping_requests_total: Count of scraping operations

  • scraping_errors_total: Count of scraping failures

Grafana Dashboard Configuration

Grafana is pre-configured with:

  • Data Source: Prometheus connection

  • Provisioning: Automatic dashboard and data source setup

  • Theme: Dark theme by default

  • Plugins: Pie chart panel for better visualizations

Dashboard Features:

  • API request volume and response times

  • Error rate monitoring

  • ML model performance metrics

  • System resource utilization

  • Custom alerts and notifications

Alerting Rules

Custom alerting rules can be defined in monitoring/prometheus/rules/alerts.yml:

groups:
  - name: ptiik-insight-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 10% for 5 minutes"

      - alert: ModelPredictionFailures
        expr: rate(model_prediction_errors_total[5m]) / rate(model_predictions_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Model prediction failures detected"
          description: "Model prediction error rate is above 5%"

      - alert: APIDown
        expr: up{job="fastapi-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"          description: "FastAPI service has been down for more than 1 minute"

Access and Usage

Service URLs

  • Prometheus: http://localhost:9090

  • Grafana: http://localhost:3000 (admin/admin)

  • API Metrics: http://localhost:8000/metrics

Starting the Monitoring Stack

# Start all services including monitoring
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f prometheus
docker-compose logs -f grafana

Grafana Setup

  1. Access Grafana: Navigate to http://localhost:3000

  2. Login: Use admin/admin (change password on first login)

  3. Data Source: Prometheus should be auto-configured at http://prometheus:9090 (a verification sketch follows this list)

  4. Dashboards: Import or create custom dashboards for your metrics
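
To verify that the provisioned data source from step 3 is actually registered, Grafana's HTTP API can be queried with the admin credentials. A minimal sketch, assuming the default admin/admin login:

# List the data sources Grafana knows about (default admin/admin credentials assumed).
import requests

resp = requests.get("http://localhost:3000/api/datasources", auth=("admin", "admin"), timeout=5)
resp.raise_for_status()
for ds in resp.json():
    print(ds["name"], ds["type"], ds["url"])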

Common Queries

Prometheus Query Examples:

# Per-second request rate, averaged over the last minute
rate(http_requests_total[1m])

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Model prediction success rate (percentage)
(rate(model_predictions_total[5m]) - rate(model_prediction_errors_total[5m])) / rate(model_predictions_total[5m]) * 100

# Scraping success rate
(rate(scraping_requests_total[5m]) - rate(scraping_errors_total[5m])) / rate(scraping_requests_total[5m]) * 100
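
The same expressions can also be evaluated from scripts through Prometheus's instant-query endpoint, which is handy for ad-hoc checks or simple reporting. A small sketch, assuming Prometheus on localhost:9090:

# Evaluate the error-rate expression through Prometheus's instant-query API.
import requests

query = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    _, value = series["value"]
    print(f"error rate: {float(value):.2f}%")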

Troubleshooting

Common Issues

Prometheus Not Scraping Metrics:

# Check FastAPI metrics endpoint
curl http://localhost:8000/metrics

# Check Prometheus targets
# Visit http://localhost:9090/targets

Grafana Can't Connect to Prometheus:

# Verify Prometheus is running
docker-compose logs prometheus

# Check network connectivity
docker exec -it grafana ping prometheus

Missing Custom Metrics:

# Ensure metrics are being incremented in your code
model_predictions_total.inc()
scraping_requests_total.inc()
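
A quick way to tell whether a custom metric is registered at all, and whether it has ever been incremented, is to read it straight from the /metrics output. In the sketch below (assuming the API on localhost:8000), a missing line means the metric object was never created, while a value of 0.0 means the code that calls .inc() has not run yet:

# Inspect the raw /metrics text for the custom counters and their current values.
import requests

metrics_text = requests.get("http://localhost:8000/metrics", timeout=5).text
for name in ("model_predictions_total", "scraping_requests_total"):
    samples = [line for line in metrics_text.splitlines() if line.startswith(name + " ")]
    print(name, "->", samples if samples else "not exposed")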

Performance Optimization

Reduce Metric Cardinality:

  • Limit label values in custom metrics

  • Use appropriate metric types (Counter, Gauge, Histogram)

  • Avoid high-cardinality labels

Storage Optimization:

  • Set appropriate retention periods

  • Monitor disk usage for time-series data

  • Use metric aggregation for long-term storage

Best Practices

Metric Design

  1. Use Standard Names: Follow Prometheus naming conventions

  2. Appropriate Types: Choose Counter, Gauge, or Histogram based on use case

  3. Consistent Labels: Use consistent label names across metrics

  4. Documentation: Add help text to all custom metrics
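
Putting these guidelines together, a custom metric with a unit-suffixed name, help text, and a small bounded label set might look like the sketch below; the metric name and labels are illustrative, not part of the current codebase:

# Illustrative metric design: unit suffix in the name, help text, and a
# bounded label set to keep cardinality low.
from prometheus_client import Histogram

document_processing_duration_seconds = Histogram(
    "document_processing_duration_seconds",
    "Time spent processing a scraped document, by pipeline stage",
    labelnames=["stage"],                  # bounded values such as "clean", "tokenize", "predict"
    buckets=(0.1, 0.5, 1, 2, 5, 10),
)

# Observe per stage; avoid labels like document IDs or URLs, which explode cardinality.
document_processing_duration_seconds.labels(stage="predict").observe(0.42)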

Monitoring Strategy

  1. Start Simple: Begin with basic HTTP metrics

  2. Add Business Metrics: Include ML-specific metrics gradually

  3. Set Meaningful Alerts: Focus on actionable alerts

  4. Regular Review: Periodically review and clean up unused metrics

Security Considerations

  1. Change Default Passwords: Update the Grafana admin credentials (for example, via the GF_SECURITY_ADMIN_PASSWORD environment variable)

  2. Network Isolation: Use Docker networks for service communication

  3. Access Control: Implement proper authentication for production

  4. Data Retention: Set appropriate data retention policies

The monitoring stack provides comprehensive observability for the PTIIKInsight system, enabling proactive monitoring and performance optimization of both the infrastructure and ML operations.

Key Dashboard Metrics:

  • Response Time: Request duration percentiles (95th, 99th)

  • Error Rate: Failed requests percentage

  • Active Connections: Current active connections count

  • Throughput: Data transfer rates

  • System Resources: CPU and memory usage

Dashboard Panels:

  1. Request Volume: Total requests over time

  2. Response Time: Average response time trends

  3. Error Rate: HTTP error rate percentage

  4. Status Code Distribution: Breakdown of response codes

  5. Active Users: Current active connections

  6. System Health: Overall system status indicators

3. Challenges and Solutions

3.1 Current Service Limitations

Challenge Identified: The current ML services cover scraping and the display of scraped data. The team intended to add a model service for predictions but encountered significant obstacles.

Specific Issues:

  • Extended Build Times: Docker image build process consumes excessive time

  • Resource Constraints: Model integration requires substantial computational resources

  • Service Complexity: ML model deployment adds complexity to the monitoring stack

Current Service Architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Scraping      │    │   Data Display  │    │   Monitoring    │
│   Service       │───▶│   Service       │───▶│   Stack         │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Planned Future Architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Scraping      │    │   Data Display  │    │   ML Prediction │
│   Service       │───▶│   Service       │───▶│   Service       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                 │                       │
                                 ▼                       ▼
                         ┌─────────────────────────────────────┐
                         │        Monitoring Stack             │
                         │    (Prometheus + Grafana)           │
                         └─────────────────────────────────────┘

3.2 Solutions and Workarounds

Immediate Solutions:

  1. Optimized Build Process: Implement multi-stage Docker builds to reduce image size

  2. Resource Management: Configure resource limits and requests for containers

  3. Caching Strategy: Implement build caching to reduce subsequent build times

Future Enhancements:

  1. Model Service Integration: Gradual integration of ML prediction services

  2. Load Balancing: Implement load balancing for high availability

  3. Auto-scaling: Configure automatic scaling based on load metrics

4. Technical Specifications

4.1 System Requirements

Minimum Requirements:

  • CPU: 4 cores

  • RAM: 8 GB

  • Storage: 20 GB available space

  • Docker: Version 20.10+

  • Docker Compose: Version 2.0+

Recommended Requirements:

  • CPU: 8 cores

  • RAM: 16 GB

  • Storage: 50 GB SSD

  • Network: High-speed internet for Docker image downloads

4.2 Port Configuration

Service Ports:

  • FastAPI Application: 8000

  • Prometheus: 9090

  • Grafana: 3000

Network Security:

  • Internal communication via Docker network

  • External access controlled via port mapping

  • Authentication required for Grafana access

4.3 Data Retention

Prometheus Data Retention:

  • Default: 15 days

  • Configured: 200 hours (adjustable via the --storage.tsdb.retention.time command-line flag)

  • Storage: Time-series database (TSDB)

Grafana Data Persistence:

  • Dashboards: Stored in Grafana database

  • Configuration: Persistent via Docker volumes

  • Backup: Regular automated backups recommended
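
One simple way to automate those backups is to export each dashboard's JSON through Grafana's HTTP API. A rough sketch, assuming admin credentials on localhost:3000:

# Export every dashboard's JSON via the Grafana HTTP API as a lightweight backup.
import json
import pathlib
import requests

BASE = "http://localhost:3000"
AUTH = ("admin", "admin")          # replace with real credentials or an API token
outdir = pathlib.Path("grafana_backup")
outdir.mkdir(exist_ok=True)

for item in requests.get(f"{BASE}/api/search", params={"type": "dash-db"}, auth=AUTH, timeout=10).json():
    dash = requests.get(f"{BASE}/api/dashboards/uid/{item['uid']}", auth=AUTH, timeout=10).json()
    (outdir / f"{item['uid']}.json").write_text(json.dumps(dash["dashboard"], indent=2))
    print("saved", item["uid"], item.get("title"))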

5. Best Practices Implemented

5.1 Security Measures

  • Authentication: Grafana admin password configuration

  • Network Isolation: Services communicate via dedicated Docker network

  • Access Control: Limited external port exposure

5.2 Performance Optimization

  • Metric Collection: 15-second scrape interval, as configured in prometheus.yml

  • Data Retention: Configured retention policies to manage storage

  • Resource Limits: Container resource constraints for stability

5.3 Monitoring Standards

  • Metric Naming: Consistent naming conventions

  • Dashboard Organization: Logical grouping of metrics

  • Alert Configuration: Preparation for future alerting setup

6. Future Improvements

6.1 Enhanced Monitoring

  • Application Performance Monitoring (APM): Integrate distributed tracing

  • Log Aggregation: Add ELK stack or Loki for log management

  • Custom Metrics: Implement business-specific metrics

6.2 Alerting System

  • Alert Rules: Configure Prometheus alert rules

  • Notification Channels: Integrate Slack, email notifications

  • Escalation Policies: Define alert escalation procedures

6.3 Scalability Enhancements

  • Horizontal Scaling: Multi-instance Prometheus setup

  • Load Balancing: Implement Grafana load balancing

  • High Availability: Configure redundant monitoring infrastructure

Conclusion

The monitoring stack implementation successfully establishes comprehensive observability for the PTIIKInsight Machine Learning System. Despite challenges with ML model service integration due to build time constraints, the current implementation provides a solid foundation for monitoring and performance tracking.

The Prometheus and Grafana integration enables real-time monitoring of system metrics, HTTP performance, and service health. The containerized deployment ensures easy management and scalability, while the dashboard provides intuitive visualization of key performance indicators.

This monitoring infrastructure will support the continued development and operation of the PTIIKInsight platform, providing essential insights for performance optimization and system reliability.


This implementation demonstrates practical application of modern monitoring tools in machine learning systems and provides a template for similar projects requiring comprehensive observability solutions.
