Monitoring Stack
This document provides a comprehensive guide to the monitoring implementation in PTIIKInsight using Prometheus and Grafana for system observability and performance tracking.
Overview
The PTIIKInsight monitoring stack consists of:
Prometheus: Metrics collection and time-series database
Grafana: Visualization dashboards and alerting
FastAPI Instrumentator: Automatic metrics collection from API
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │    │   Prometheus    │    │     Grafana     │
│   Port: 8000    │────┤   Port: 9090    │────┤   Port: 3000    │
│                 │    │                 │    │                 │
│    /metrics     │    │  Scrapes every  │    │     Queries     │
│    endpoint     │    │   15 seconds    │    │   Prometheus    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Implementation Details
1. FastAPI Metrics Integration
The FastAPI application automatically exposes metrics using the prometheus-fastapi-instrumentator library:
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Histogram, Gauge
app = FastAPI(title="PTIIK Insight API", description="ML-powered topic analysis")
# Initialize Prometheus metrics
instrumentator = Instrumentator()
instrumentator.instrument(app).expose(app)
# Custom metrics for ML operations
model_predictions_total = Counter('model_predictions_total', 'Total number of model predictions')
model_prediction_errors_total = Counter('model_prediction_errors_total', 'Total number of prediction errors')
model_prediction_duration = Histogram('model_prediction_duration_seconds', 'Time spent on predictions')
model_accuracy = Gauge('model_accuracy', 'Current model accuracy')
scraping_requests_total = Counter('scraping_requests_total', 'Total number of scraping requests')
scraping_errors_total = Counter('scraping_errors_total', 'Total number of scraping errors')
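Continuing the snippet above, here is a minimal sketch of how these metrics might be updated inside a prediction endpoint. The /predict route, the PredictRequest model, and the run_topic_model helper are hypothetical illustrations, not part of the actual API:
from pydantic import BaseModel

class PredictRequest(BaseModel):
    texts: list[str]

@app.post("/predict")
async def predict(request: PredictRequest):
    model_predictions_total.inc()  # count every prediction attempt
    try:
        with model_prediction_duration.time():  # record wall-clock duration
            topics = run_topic_model(request.texts)  # placeholder for the real inference call
    except Exception:
        model_prediction_errors_total.inc()  # track failures separately
        raise
    return {"topics": topics}
Because the histogram is used as a context manager, the duration is observed even when the inference call raises.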
Available Metrics:
Standard HTTP metrics (request count, duration, status codes)
Custom ML metrics (predictions, errors, accuracy)
Application-specific metrics (scraping operations)
2. Prometheus Configuration
The Prometheus configuration file (monitoring/prometheus/prometheus.yml) defines how metrics are collected:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'fastapi-app'
    static_configs:
      - targets: ['fastapi-app:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s
Key Configuration:
Scrape Interval: 15 seconds between metric collections
Target: FastAPI application on port 8000
Metrics Path: the /metrics endpoint
Rule Files: custom alerting rules loaded from the rules/ directory
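To confirm that Prometheus has registered the target defined above, you can query its HTTP API. The following is a sketch using the requests package; the URL assumes the default port mapping from the Docker Compose file:
import requests

# List every active scrape target and its health as seen by Prometheus
resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
for target in resp.json()["data"]["activeTargets"]:
    print(target["labels"]["job"], target["scrapeUrl"], target["health"])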
3. Docker Compose Configuration
The monitoring services are defined in docker-compose.yml:
services:
  fastapi-app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: fastapi-scraper
    ports:
      - "8000:8000"
    volumes:
      - ./data:/app/data
      - ./model:/app/model
      - ./preprocessing:/app/preprocessing
      - ./api:/app/api
    restart: always

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/rules:/etc/prometheus/rules
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=200h"  # 200-hour retention (see Data Retention)
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana  # persist dashboards and settings in the named volume
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_USERS_DEFAULT_THEME=dark
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    depends_on:
      - prometheus
    restart: always

volumes:
  grafana-storage:
Monitoring Features
Available Metrics
HTTP Metrics (Automatic):
http_requests_total: Total HTTP requests by method and status
http_request_duration_seconds: Request processing time
http_request_size_bytes: Request payload size
http_response_size_bytes: Response payload size
Custom ML Metrics:
model_predictions_total: Count of topic predictions made
model_prediction_errors_total: Count of prediction failures
model_prediction_duration_seconds: Time spent on predictions
model_accuracy: Current model accuracy score
scraping_requests_total: Count of scraping operations
scraping_errors_total: Count of scraping failures
Grafana Dashboard Configuration
Grafana is pre-configured with:
Data Source: Prometheus connection
Provisioning: Automatic dashboard and data source setup
Theme: Dark theme by default
Plugins: Pie chart panel for better visualizations
Dashboard Features:
API request volume and response times
Error rate monitoring
ML model performance metrics
System resource utilization
Custom alerts and notifications
Alerting Rules
Custom alerting rules can be defined in monitoring/prometheus/rules/alerts.yml:
groups:
  - name: ptiik-insight-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 10% for 5 minutes"
      - alert: ModelPredictionFailures
        expr: rate(model_prediction_errors_total[5m]) / rate(model_predictions_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Model prediction failures detected"
          description: "Model prediction error rate is above 5%"
      - alert: APIDown
        expr: up{job="fastapi-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "FastAPI service has been down for more than 1 minute"
Access and Usage
Service URLs
Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (admin/admin)
API Metrics: http://localhost:8000/metrics
Starting the Monitoring Stack
# Start all services including monitoring
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f prometheus
docker-compose logs -f grafana
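After the containers are up, a quick reachability check can hit all three services. This is a sketch: the URLs assume the port mappings above, and it relies on Prometheus's /-/ready and Grafana's /api/health built-in health endpoints:
import requests

# Each service exposes a cheap endpoint suitable for a smoke test
endpoints = {
    "FastAPI metrics": "http://localhost:8000/metrics",
    "Prometheus": "http://localhost:9090/-/ready",
    "Grafana": "http://localhost:3000/api/health",
}
for name, url in endpoints.items():
    status = requests.get(url, timeout=5).status_code
    print(f"{name}: HTTP {status}")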
Grafana Setup
Access Grafana: Navigate to http://localhost:3000
Login: Use admin/admin (change password on first login)
Data Source: Prometheus should be auto-configured at http://prometheus:9090
Dashboards: Import or create custom dashboards for your metrics
Common Queries
Prometheus Query Examples:
# Request rate per minute
rate(http_requests_total[1m])
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Model prediction success rate percentage
(rate(model_predictions_total[5m]) - rate(model_prediction_errors_total[5m])) / rate(model_predictions_total[5m]) * 100
# Scraping success rate
(rate(scraping_requests_total[5m]) - rate(scraping_errors_total[5m])) / rate(scraping_requests_total[5m]) * 100
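Any of these expressions can also be evaluated outside the Prometheus UI through its instant-query API. A sketch, assuming the default port mapping:
import requests

# Evaluate the error-rate-percentage query from above as an instant query
query = 'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100'
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query}, timeout=5)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])  # value is a [timestamp, value] pair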
Troubleshooting
Common Issues
Prometheus Not Scraping Metrics:
# Check FastAPI metrics endpoint
curl http://localhost:8000/metrics
# Check Prometheus targets
# Visit http://localhost:9090/targets
Grafana Can't Connect to Prometheus:
# Verify Prometheus is running
docker-compose logs prometheus
# Check network connectivity
docker exec -it grafana ping prometheus
Missing Custom Metrics:
# Ensure metrics are being incremented in your code
model_predictions_total.inc()
scraping_requests_total.inc()
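If a metric is incremented but still missing from dashboards, check whether it appears in the raw exposition output at all. A sketch (note that labelled metrics only appear after their first .labels() call):
import requests

# Dump only the lines for a specific metric family from the raw exposition text
metrics_text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics_text.splitlines():
    if line.startswith("model_predictions_total"):
        print(line)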
Performance Optimization
Reduce Metric Cardinality:
Limit label values in custom metrics
Use appropriate metric types (Counter, Gauge, Histogram)
Avoid high-cardinality labels such as user IDs or raw URLs (see the sketch below)
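As an illustration of bounded cardinality, labels should come from small fixed sets such as HTTP methods or status classes. The metric name here is hypothetical:
from prometheus_client import Counter

# Labels drawn from small, fixed sets keep the number of time series bounded:
# ~5 methods x 5 status classes = at most 25 series for this metric
requests_by_class = Counter(
    "app_requests_by_class_total",
    "HTTP requests by method and status class",
    ["method", "status_class"],
)

requests_by_class.labels(method="GET", status_class="2xx").inc()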
Storage Optimization:
Set appropriate retention periods
Monitor disk usage for time-series data
Use metric aggregation for long-term storage
Best Practices
Metric Design
Use Standard Names: Follow Prometheus naming conventions
Appropriate Types: Choose Counter, Gauge, or Histogram based on use case
Consistent Labels: Use consistent label names across metrics
Documentation: Add help text to all custom metrics
Monitoring Strategy
Start Simple: Begin with basic HTTP metrics
Add Business Metrics: Include ML-specific metrics gradually
Set Meaningful Alerts: Focus on actionable alerts
Regular Review: Periodically review and clean up unused metrics
Security Considerations
Change Default Passwords: Update Grafana admin credentials
Network Isolation: Use Docker networks for service communication
Access Control: Implement proper authentication for production
Data Retention: Set appropriate data retention policies
The monitoring stack provides comprehensive observability for the PTIIKInsight system, enabling proactive monitoring and performance optimization of both the infrastructure and ML operations.
Key Metrics Tracked:
Response Time: Request duration percentiles (95th, 99th)
Error Rate: Failed requests percentage
Active Connections: Current active connections count
Throughput: Data transfer rates
System Resources: CPU and memory usage
Dashboard Panels:
Request Volume: Total requests over time
Response Time: Average response time trends
Error Rate: HTTP error rate percentage
Status Code Distribution: Breakdown of response codes
Active Users: Current active connections
System Health: Overall system status indicators
3. Challenges and Solutions
3.1 Current Service Limitations
Challenge Identified: The current deployment covers the scraping service and the display of scraped results. The team intended to add a model service for predictions but encountered significant obstacles.
Specific Issues:
Extended Build Times: Docker image build process consumes excessive time
Resource Constraints: Model integration requires substantial computational resources
Service Complexity: ML model deployment adds complexity to the monitoring stack
Current Service Architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Scraping     │    │  Data Display   │    │   Monitoring    │
│    Service      │───▶│    Service      │───▶│     Stack       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Planned Future Architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Scraping     │    │  Data Display   │    │  ML Prediction  │
│    Service      │───▶│    Service      │───▶│    Service      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                  ┌─────────────────────────────────────┐
                  │          Monitoring Stack           │
                  │       (Prometheus + Grafana)        │
                  └─────────────────────────────────────┘
3.2 Solutions and Workarounds
Immediate Solutions:
Optimized Build Process: Implement multi-stage Docker builds to reduce image size
Resource Management: Configure resource limits and requests for containers
Caching Strategy: Implement build caching to reduce subsequent build times
Future Enhancements:
Model Service Integration: Gradual integration of ML prediction services
Load Balancing: Implement load balancing for high-availability
Auto-scaling: Configure automatic scaling based on load metrics
4. Technical Specifications
4.1 System Requirements
Minimum Requirements:
CPU: 4 cores
RAM: 8 GB
Storage: 20 GB available space
Docker: Version 20.10+
Docker Compose: Version 2.0+
Recommended Requirements:
CPU: 8 cores
RAM: 16 GB
Storage: 50 GB SSD
Network: High-speed internet for Docker image downloads
4.2 Port Configuration
Service Ports:
FastAPI Application: 8000
Prometheus: 9090
Grafana: 3000
Network Security:
Internal communication via Docker network
External access controlled via port mapping
Authentication required for Grafana access
4.3 Data Retention
Prometheus Data Retention:
Default: 15 days
Configured: 200 hours via the --storage.tsdb.retention.time command-line flag
Storage: Time-series database (TSDB)
Grafana Data Persistence:
Dashboards: Stored in Grafana database
Configuration: Persistent via Docker volumes
Backup: Regular automated backups recommended
5. Best Practices Implemented
5.1 Security Measures
Authentication: Grafana admin password configuration
Network Isolation: Services communicate via dedicated Docker network
Access Control: Limited external port exposure
5.2 Performance Optimization
Metric Collection: Optimized 15-second scrape interval
Data Retention: Configured retention policies to manage storage
Resource Limits: Container resource constraints for stability
5.3 Monitoring Standards
Metric Naming: Consistent naming conventions
Dashboard Organization: Logical grouping of metrics
Alert Configuration: Preparation for future alerting setup
6. Future Improvements
6.1 Enhanced Monitoring
Application Performance Monitoring (APM): Integrate distributed tracing
Log Aggregation: Add ELK stack or Loki for log management
Custom Metrics: Implement business-specific metrics
6.2 Alerting System
Alert Rules: Configure Prometheus alert rules
Notification Channels: Integrate Slack, email notifications
Escalation Policies: Define alert escalation procedures
6.3 Scalability Enhancements
Horizontal Scaling: Multi-instance Prometheus setup
Load Balancing: Implement Grafana load balancing
High Availability: Configure redundant monitoring infrastructure
Conclusion
The monitoring stack implementation successfully establishes comprehensive observability for the PTIIKInsight Machine Learning System. Despite challenges with ML model service integration due to build time constraints, the current implementation provides a solid foundation for monitoring and performance tracking.
The Prometheus and Grafana integration enables real-time monitoring of system metrics, HTTP performance, and service health. The containerized deployment ensures easy management and scalability, while the dashboard provides intuitive visualization of key performance indicators.
This monitoring infrastructure will support the continued development and operation of the PTIIKInsight platform, providing essential insights for performance optimization and system reliability.
This implementation demonstrates practical application of modern monitoring tools in machine learning systems and provides a template for similar projects requiring comprehensive observability solutions.