DevOps Monitoring Project

In this guide, we'll set up monitoring and alerting for EC2 instances running Ubuntu, using Prometheus, Alertmanager, Node Exporter, and Blackbox Exporter.

Understanding the Flow Diagram

Components and Their Roles

VM-1:

Node Exporter:

  • Installed on VM-1 to expose system metrics such as CPU usage, memory usage, and disk space in a format that Prometheus can scrape.

Nginx:

  • A web server running on VM-1 that serves as an HTTP endpoint for testing availability and performance using Blackbox Exporter.

VM-2:

Prometheus:

  • Metrics Collection:

    • Continuously scrapes system metrics from Node Exporter on VM-1.

    • Scrapes the results of HTTP probes (for example, against Nginx) from Blackbox Exporter.

  • Alert Rules Evaluation:

    • Processes the collected metrics based on pre-defined rules (e.g., detecting high memory usage or service unavailability).

  • Data Flow:

    • Metrics from Node Exporter and Blackbox Exporter are combined for real-time analysis and visualization.

Blackbox Exporter:

  • Used for probing external targets like HTTP endpoints (e.g., Nginx) to check availability and performance.

  • Exposes probe success or failure as metrics, which Prometheus then scrapes.

Alertmanager:

  • Alert Routing:

    • Receives alerts from Prometheus when certain thresholds or conditions are breached (e.g., service downtime, high CPU usage).

    • Routes these alerts to configured notification channels (e.g., email).

  • Inhibition Rules:

    • Suppresses less critical alerts while a higher-priority alert for the same problem is active, so that only the most relevant notifications are sent.

Email Notifications:

  • Alertmanager integrates with an email system (e.g., Gmail).

  • Sends email alerts to notify administrators about critical issues like service downtime or resource exhaustion.


Prerequisites

  • Install Required Tools: Ensure that wget and tar are installed on both VMs.

  • Permissions: Ensure you have permissions to download, extract, and run the binaries.

  • Networking: Configure the firewall (on EC2, the security groups of the instances) to allow the following ports:

    • Prometheus: 9090

    • Alertmanager: 9093

    • Blackbox Exporter: 9115

    • Node Exporter: 9100
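
Once the services are running, it can help to confirm that these ports are actually reachable from another machine. The following is a minimal sketch rather than a definitive check: the function names `port_open` and `check_ports` are made up for illustration, and it assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available. A port also shows as closed when the service behind it has not been started yet.

```shell
#!/usr/bin/env bash

# Succeed if a TCP connection to host:port can be opened within 2 seconds.
# Assumes bash with /dev/tcp support and the coreutils `timeout` command.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print "<port> open" or "<port> closed" for each port given after the host.
check_ports() {
  local host="$1" port
  shift
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "${port} open"
    else
      echo "${port} closed"
    fi
  done
}
```

For example, `check_ports <vm-2-public-ip> 9090 9093 9115` covers Prometheus, Alertmanager, and Blackbox Exporter, and `check_ports <vm-1-public-ip> 9100` covers Node Exporter.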


Setup Steps

1. Create 2 EC2 Instances

  • Use t2.medium instances with 20GB storage and Ubuntu OS.

VM-1:

  • Services: Node Exporter, Nginx.

VM-2:

  • Services: Prometheus, Alertmanager, Blackbox Exporter.

Node Exporter Setup (VM-1)

Update the Packages:

sudo apt update -y

Download Node Exporter:

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz

Extract Node Exporter:

tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
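
The same download pattern recurs for every component in this guide, so a small helper can build the URL and check a tarball against a checksum file before it is extracted. This is a sketch under stated assumptions: `release_url` and `verify_tarball` are hypothetical names, and it assumes the release page offers a `sha256sums.txt` asset next to the tarball, which is worth confirming on the release page first.

```shell
#!/usr/bin/env bash

# Build the download URL for a Prometheus-project release tarball.
# Usage: release_url <project> <version>   e.g. release_url node_exporter 1.8.1
release_url() {
  local project="$1" version="$2"
  echo "https://github.com/prometheus/${project}/releases/download/v${version}/${project}-${version}.linux-amd64.tar.gz"
}

# Verify a tarball in the current directory against a checksum file whose
# lines have the "HASH  FILENAME" layout produced by sha256sum.
# Assumption: the checksum file was downloaded from the same release page.
# Usage: verify_tarball <tarball> <checksum-file>
verify_tarball() {
  local tarball="$1" sums="$2"
  # Keep only the line for this tarball, then let sha256sum compare the hashes.
  # If no line matches, sha256sum receives no input and reports failure.
  grep -F "  ${tarball}" "$sums" | sha256sum -c -
}
```

A typical sequence would be `wget "$(release_url node_exporter 1.8.1)"`, then downloading the checksum file, then `verify_tarball node_exporter-1.8.1.linux-amd64.tar.gz sha256sums.txt` before running `tar`.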

Set Up Node Exporter as a Systemd Service:

  • Create a Service File:
sudo vim /etc/systemd/system/node_exporter.service
  • Configuration:
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=ubuntu
ExecStart=/home/ubuntu/node_exporter-1.8.1.linux-amd64/node_exporter

[Install]
WantedBy=multi-user.target
  • Reload and Enable the Service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Check Service Status:

sudo systemctl status node_exporter

Verify Node Exporter: Open http://<public-ip>:9100/metrics in a browser and confirm that a list of metrics is returned.
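
The same verification can be done from a terminal by checking that a specific metric appears in the output. The helper below is an illustrative sketch: `has_metric` is a made-up name, and `node_cpu_seconds_total` is used as an example of a metric name exposed by current Node Exporter releases.

```shell
#!/usr/bin/env bash

# Succeed if the Prometheus text-format output on stdin contains at least one
# sample for the given metric name. Comment lines (HELP/TYPE) are skipped.
# Usage: curl -s http://<public-ip>:9100/metrics | has_metric node_cpu_seconds_total
has_metric() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)    # drop the label set, if any
      if (metric == name) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

Because the match is on the exact metric name, a check for a name that the exporter does not expose fails, which makes this a quick way to confirm the names used later in alert rules.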


Prometheus Setup (VM-2)

Update the Packages:

sudo apt update -y

Download Prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz

Extract Prometheus:

tar xvfz prometheus-2.52.0.linux-amd64.tar.gz

Move Files and Configure:

sudo mv prometheus-2.52.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.52.0.linux-amd64/promtool /usr/local/bin/

Create Prometheus User and Directories:

sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Copy the Configuration File:

sudo cp prometheus-2.52.0.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

Set Up as a Systemd Service:

  • Create a Service File:
sudo vim /etc/systemd/system/prometheus.service
  • Configuration:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/

[Install]
WantedBy=multi-user.target
  • Reload and Enable the Service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Access Prometheus Web Interface: http://<your-ec2-public-ip>:9090

Check Logs (Optional):

journalctl -u prometheus -f

Alertmanager Setup (VM-2)

Download and Install Alertmanager

Download Alertmanager:

wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz

Extract Alertmanager:

tar xvfz alertmanager-0.27.0.linux-amd64.tar.gz

Move Alertmanager Binaries:

sudo mv alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/

Create Alertmanager User and Directories

Create a Dedicated User:

sudo useradd --no-create-home --shell /bin/false alertmanager

Create Directories for Data and Configuration:

sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Configure Alertmanager

Move Configuration File:

sudo mv ~/alertmanager-0.27.0.linux-amd64/alertmanager.yml /etc/alertmanager/
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml

Create a Systemd Service for Alertmanager

Create the Service File:

sudo vim /etc/systemd/system/alertmanager.service

Add the Following Configuration:

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
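
The Prometheus, Alertmanager, and Blackbox Exporter unit files in this guide share one shape, so for scripted provisioning the file can be rendered from a function instead of typed into an editor. This is only a sketch: `render_unit` is a hypothetical helper, and it assumes the service group has the same name as the service user, as is the case for the users created with `useradd` in this guide.

```shell
#!/usr/bin/env bash

# Print a systemd unit file for a simple long-running service.
# Usage: render_unit <description> <user> <exec-start-command>
# Assumption: the group name equals the user name.
render_unit() {
  local description="$1" user="$2" exec_start="$3"
  cat <<EOF
[Unit]
Description=${description}
Wants=network-online.target
After=network-online.target

[Service]
User=${user}
Group=${user}
ExecStart=${exec_start}

[Install]
WantedBy=multi-user.target
EOF
}
```

The output can be written into place with, for example, `render_unit "Alertmanager" alertmanager "/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/" | sudo tee /etc/systemd/system/alertmanager.service > /dev/null`.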

Start and Enable Alertmanager

Reload Systemd Configuration:

sudo systemctl daemon-reload

Start the Service:

sudo systemctl start alertmanager

Enable the Service:

sudo systemctl enable alertmanager

Check Service Status:

sudo systemctl status alertmanager

Setting Up Blackbox Exporter

Step 1: Download Blackbox Exporter

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz

Step 2: Extract Blackbox Exporter

tar xvfz blackbox_exporter-0.25.0.linux-amd64.tar.gz

Step 3: Move Files to /usr/local/bin

sudo mv blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter /usr/local/bin/

Step 4: Create a Dedicated User to Run the Exporter

sudo useradd --no-create-home --shell /bin/false blackbox_exporter

Step 5: Set Up Configuration Directory

  1. Create the directory:

     sudo mkdir /etc/blackbox_exporter
    
  2. Move the configuration file (if required):

     sudo mv blackbox_exporter-0.25.0.linux-amd64/blackbox.yml /etc/blackbox_exporter/
    
  3. Set proper permissions:

     sudo chown blackbox_exporter:blackbox_exporter /etc/blackbox_exporter -R
    

Step 6: Create a Systemd Service File

  1. Open a new systemd service file:

     sudo vim /etc/systemd/system/blackbox_exporter.service
    
  2. Add the following content:

     [Unit]
     Description=Blackbox Exporter
     Wants=network-online.target
     After=network-online.target
    
     [Service]
     User=blackbox_exporter
     Group=blackbox_exporter
     Type=simple
     ExecStart=/usr/local/bin/blackbox_exporter \
       --config.file=/etc/blackbox_exporter/blackbox.yml
    
     [Install]
     WantedBy=multi-user.target
    

Step 7: Reload Systemd and Start the Service

  1. Reload systemd to recognize the new service:

     sudo systemctl daemon-reload
    
  2. Start the Blackbox Exporter service:

     sudo systemctl start blackbox_exporter
    
  3. Enable the service to start on boot:

     sudo systemctl enable blackbox_exporter
    

Step 8: Verify the Installation

  1. Check the service status:

     sudo systemctl status blackbox_exporter
    
  2. Test that the exporter is working by accessing it on its default port (9115):

     curl http://localhost:9115/metrics
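
The `/metrics` endpoint only shows that the exporter itself is up. To see whether a probe of a real target succeeds, the exporter's `/probe` endpoint can be queried directly. The helpers below are a sketch with made-up names (`probe_url`, `probe_ok`); they assume the `http_2xx` module from the default `blackbox.yml`, and that the target URL contains no `&` or `?` characters, which would need URL-encoding.

```shell
#!/usr/bin/env bash

# Build the URL for a manual probe through Blackbox Exporter.
# Usage: probe_url <exporter-host:port> <target> [module]
probe_url() {
  local exporter="$1" target="$2" module="${3:-http_2xx}"
  echo "http://${exporter}/probe?target=${target}&module=${module}"
}

# Succeed if the probe output on stdin reports "probe_success 1".
probe_ok() {
  grep -Eq '^probe_success 1$'
}
```

For example, `curl -s "$(probe_url localhost:9115 https://prometheus.io)" | probe_ok && echo "probe succeeded"` runs one probe and reports the result.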
    

Comprehensive Setup for Monitoring Applications with Prometheus and Alertmanager

Step 1: Clone the Project Repository

Download the repository of the project you want to monitor:

git clone https://github.com/imkiran13/Boardgame.git

Step 2: Install Java and Maven

Since the application is Java-based, install a JDK and Maven. Maven needs the compiler that ships with the JDK to build the project, so a JRE alone is not enough:

sudo apt install openjdk-17-jdk-headless -y
sudo apt install maven -y

Step 3: Package the Application into a JAR File

cd Boardgame/
mvn package

Navigate to the target directory and run the JAR file, replacing jar-file-name.jar with the name of the JAR that Maven produced:

cd target/
java -jar jar-file-name.jar

Access the application at http://<public-ip>:8080.

Step 4: Configure Prometheus Alert Rules

Go to the Prometheus configuration folder and create an alert rules file:

sudo vim /etc/prometheus/alert_rules.yml

Add the following content:

groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Endpoint {{ $labels.instance }} down
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: WebsiteDown
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Website down
          description: The website at {{ $labels.instance }} is down.

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: |
            Node memory is filling up (< 25% left)
            VALUE = {{ $value }}
            LABELS: {{ $labels }}

      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 50
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: Host out of disk space (instance {{ $labels.instance }})
          description: |
            Disk is almost full (< 50% left)
            VALUE = {{ $value }}
            LABELS: {{ $labels }}

      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{job="node_exporter", mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: |
            CPU load is > 80%
            VALUE = {{ $value }}
            LABELS: {{ $labels }}

      - alert: ServiceUnavailable
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service Unavailable (instance {{ $labels.instance }})
          description: |
            The service {{ $labels.job }} is not available
            VALUE = {{ $value }}
            LABELS: {{ $labels }}

      - alert: HighMemoryUsage
        expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High Memory Usage (instance {{ $labels.instance }})
          description: |
            Memory usage is > 90%
            VALUE = {{ $value }}
            LABELS: {{ $labels }}

      - alert: FileSystemFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: File System Almost Full (instance {{ $labels.instance }})
          description: |
            File system has < 10% free space
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
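
To get a feel for the memory thresholds, the value that the HostOutOfMemory rule compares against 25 can be reproduced on the monitored host itself, since Node Exporter derives its memory metrics from `/proc/meminfo`. The function below is an illustrative sketch with a made-up name, not part of any of the tools in this guide.

```shell
#!/usr/bin/env bash

# Print available memory as a percentage of total memory, reading
# /proc/meminfo-style text from stdin. Fails if MemTotal is missing.
# Usage: mem_available_percent < /proc/meminfo
mem_available_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%.1f\n", avail / total * 100
    }
  '
}
```

A host with 4000000 kB total and 1000000 kB available prints `25.0`, which sits exactly on the rule's threshold; anything lower would put the alert into the pending state.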

Step 5: Update Prometheus Configuration

Edit the Prometheus configuration file to include the alert rules file:

sudo vim /etc/prometheus/prometheus.yml

Add the following:

rule_files:
  - "alert_rules.yml"

Restart Prometheus:

sudo systemctl restart prometheus
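
A syntax error in either file prevents Prometheus from starting, so it is worth validating them before a restart. `promtool`, installed earlier alongside the `prometheus` binary, provides `check rules` and `check config` subcommands for this. The wrapper below is a sketch: the function name `check_prometheus_files` is made up, and the default paths are the ones used in this guide.

```shell
#!/usr/bin/env bash

# Validate the alert rules file and the main configuration with promtool.
# Returns 127 if promtool is not on the PATH.
# Usage: check_prometheus_files [config-file] [rules-file]
check_prometheus_files() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local rules="${2:-/etc/prometheus/alert_rules.yml}"
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  # Check the rules first, then the configuration that references them.
  promtool check rules "$rules" && promtool check config "$config"
}
```

Combined with the restart, this becomes `check_prometheus_files && sudo systemctl restart prometheus`, so the service is only restarted when both files pass.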

Step 6: Add Node Exporter and Blackbox Exporter Configurations

Edit the Prometheus configuration file:

sudo vim /etc/prometheus/prometheus.yml

Replace the contents of the file with the following, substituting the IP addresses of your own instances:

---
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - alert_rules.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node_exporter
    static_configs:
      - targets:
          - 13.127.62.132:9100    # Node Exporter IP & Port

  - job_name: blackbox
    metrics_path: /probe
    params:
      module:
        - http_2xx
    static_configs:
      - targets:
          - http://prometheus.io
          - https://prometheus.io
          - http://13.127.62.132:8080/   # Application Server IP & Port

    relabel_configs:
      - source_labels:
          - __address__
        target_label: __param_target
      - source_labels:
          - __param_target
        target_label: instance
      - target_label: __address__
        replacement: 43.205.120.90:9115  # Blackbox Exporter IP & Port

Restart Blackbox Exporter:

sudo systemctl restart blackbox_exporter

Restart Prometheus:

sudo systemctl restart prometheus

In the Prometheus web interface, go to Status > Targets and confirm that every target is listed as UP.
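
The same information is available from the Prometheus HTTP API at `/api/v1/targets`, which is convenient on a server without a browser. The helper below is a rough sketch with a made-up name; it assumes the compact JSON that Prometheus returns and simply counts the `health` values instead of parsing the document properly.

```shell
#!/usr/bin/env bash

# Summarise target health from the JSON returned by /api/v1/targets.
# Prints one line per health state, for example "3 up".
# Usage: curl -s http://localhost:9090/api/v1/targets | target_health_summary
target_health_summary() {
  grep -o '"health":"[a-z]*"' \
    | sed -E 's/"health":"([a-z]*)"/\1/' \
    | sort | uniq -c \
    | awk '{ print $1, $2 }'
}
```

Any line other than `up` in the output points to a target that needs attention on the Targets page.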

Step 7: Configure Email Notifications with Alertmanager

Edit the Alertmanager configuration file:

sudo vim /etc/alertmanager/alertmanager.yml

Add the following:

route:
  group_by:
    - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: email-notifications

receivers:
  - name: email-notifications
    email_configs:
      - to: imkiran13@gmail.com
        from: test@gmail.com
        smarthost: smtp.gmail.com:587
        auth_username: imkiran13@gmail.com
        auth_identity: imkiran13@gmail.com
        auth_password: "<your-gmail-app-password>"
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - dev
      - instance

Generate an app password for your Gmail account:

  1. Go to Manage My Google Account.

  2. Search for App Passwords.

  3. Generate and use the password in the configuration.

Restart Alertmanager:

sudo systemctl restart alertmanager

Restart Prometheus:

sudo systemctl restart prometheus

Step 8: Test the Configuration

  1. Stop the application running on port 8080.

  2. Verify the WebsiteDown alert on the Alerts page in Prometheus. It should move from pending to firing after about one minute.

  3. Check your email for notification alerts.
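
The email path can also be exercised on its own, without stopping anything, by posting a synthetic alert to Alertmanager's v2 API. The following is a minimal sketch: `test_alert_payload` is a hypothetical helper, and the alert name and the `manual-test` instance label are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash

# Print a JSON payload for Alertmanager's POST /api/v2/alerts endpoint.
# Usage: test_alert_payload [alertname] [severity]
# The "manual-test" instance label is a placeholder, not a real host.
test_alert_payload() {
  local name="${1:-TestAlert}" severity="${2:-warning}"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"manual-test"}}]\n' \
    "$name" "$severity"
}
```

Sending it looks like `test_alert_payload | curl -s -X POST -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts`. With the route shown earlier, the email should arrive after the 30-second group_wait, and the alert resolves on its own once Alertmanager's resolve timeout passes because no further updates are sent.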

This comprehensive setup ensures proactive monitoring and timely alerting for your EC2 instances. Happy monitoring!