
Prometheus - Testing the Accuracy of Exporters


Test whether node-exporter is correctly collecting the key metrics (CPU usage, file system usage, etc.) and record the results.


1. Test design

  • Collect metrics with Node exporter, cAdvisor, etc.
  • Collect the metrics of the server and the Docker containers (the application) with shell scripts.
  • Visualize the metric values collected directly by the shell scripts and compare them with the values shown in Grafana (a sketch for pulling the Prometheus side of the comparison follows this list).
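
Instead of reading values off a dashboard, the Grafana/Prometheus side of the comparison can also be pulled programmatically through the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 and that node-exporter exposes the standard node_cpu_seconds_total metric; adjust the address and the query to your environment.

import datetime
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus
# Typical "CPU busy" query against node-exporter (node_cpu_seconds_total assumed)
QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)'

end = datetime.datetime.now()
start = end - datetime.timedelta(minutes=10)

# Query a 10-minute range at 5-second resolution (matches the shell script interval)
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "5s",
    },
)
resp.raise_for_status()

# Each result series holds [unix_ts, value] pairs; print them for comparison with the CSV
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(datetime.datetime.fromtimestamp(float(ts)), round(float(value), 2))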

2. Node exporter test

1) Test method

  • Capture the output of the Linux top command.
  • Parse the output into a CSV keyed by timestamp.
  • Visualize the results and compare them (a plotting sketch follows the parsing script below).

2) Test results

3) Test details

(1) TOP to txt shell script

#!/bin/bash

OUTPUT_FILE="top_output.txt"
INTERVAL=5
COUNT=12

echo "Capturing top output every $INTERVAL seconds for $COUNT intervals"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Run top command and append output to the file at specified intervals
for ((i=1; i<=COUNT; i++))
do
    echo "Timestamp: $(date +"%Y-%m-%d %H:%M:%S")" >> "$OUTPUT_FILE"
    top -b -n 1 | head -n 20 >> "$OUTPUT_FILE"
    sleep "$INTERVAL"
done

echo "Capture complete."

(2) TOP output txt to csv

import re
import csv

# Define the input and output file names
input_file = "top_output.txt"
output_file = "top_output.csv"

# Regex patterns to extract relevant data
timestamp_pattern = re.compile(r"Timestamp: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
cpu_pattern = re.compile(r"%Cpu\(s\):\s+(\d+\.\d+) us,\s+(\d+\.\d+) sy,\s+(\d+\.\d+) ni,\s+(\d+\.\d+) id,\s+(\d+\.\d+) wa,\s+(\d+\.\d+) hi,\s+(\d+\.\d+) si,\s+(\d+\.\d+) st")
mem_pattern = re.compile(r"KiB Mem\s+:\s+(\d+)\+?\s*total,\s+(\d+)\+?\s*free,\s+(\d+)\+?\s*used,\s+(\d+)\+?\s*buff/cache")
swap_pattern = re.compile(r"KiB Swap:\s+(\d+)\+?\s*total,\s+(\d+)\+?\s*free,\s+(\d+)\+?\s*used\.\s+(\d+)\+?\s*avail Mem")

# Open the output CSV file for writing
with open(output_file, mode='w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    # Write the header
    csvwriter.writerow(["Timestamp", "User CPU Usage (%)", "System CPU Usage (%)", "Nice CPU Usage (%)",
                        "Idle CPU (%)", "IO Wait CPU (%)", "Hardware Interrupts (%)",
                        "Software Interrupts (%)", "Steal Time (%)",
                        "Memory Total (KiB)", "Memory Free (KiB)", "Memory Used (KiB)", "Memory Usage (%)",
                        "Swap Total (KiB)", "Swap Free (KiB)", "Swap Used (KiB)", "Swap Usage (%)"])

    # Read the input file
    with open(input_file, mode='r') as infile:
        lines = infile.readlines()
        i = 0
        while i < len(lines):
            line = lines[i].strip()

            # Check for timestamp
            timestamp_match = timestamp_pattern.match(line)

            if timestamp_match:
                timestamp = timestamp_match.group(1)
                cpu_usage = None
                system_cpu_usage = None
                nice_cpu_usage = None
                idle_cpu = None
                io_wait_cpu = None
                hardware_interrupts = None
                software_interrupts = None
                steal_time = None
                mem_total = None
                mem_free = None
                mem_used = None
                swap_total = None
                swap_free = None
                swap_used = None

                # Extract CPU usage
                cpu_match = cpu_pattern.search(lines[i + 3])
                if cpu_match:
                    cpu_usage = float(cpu_match.group(1))
                    system_cpu_usage = float(cpu_match.group(2))
                    nice_cpu_usage = float(cpu_match.group(3))
                    idle_cpu = float(cpu_match.group(4))
                    io_wait_cpu = float(cpu_match.group(5))
                    hardware_interrupts = float(cpu_match.group(6))
                    software_interrupts = float(cpu_match.group(7))
                    steal_time = float(cpu_match.group(8))
                else:
                    print("Failed to parse CPU usage from line:", lines[i + 3].strip())

                # Extract memory usage
                mem_match = mem_pattern.search(lines[i + 4])
                if mem_match:
                    mem_total = int(mem_match.group(1))
                    mem_free = int(mem_match.group(2))
                    mem_used = int(mem_match.group(3))
                else:
                    print("Failed to parse memory usage from line:", lines[i + 4].strip())

                # Extract swap usage
                swap_match = swap_pattern.search(lines[i + 5])
                if swap_match:
                    swap_total = int(swap_match.group(1))
                    swap_free = int(swap_match.group(2))
                    swap_used = int(swap_match.group(3))
                else:
                    print("Failed to parse swap usage from line:", lines[i + 5].strip())

                # Calculate usage percentages
                if mem_total is not None and mem_total > 0:
                    mem_usage = (float(mem_used) / mem_total) * 100
                else:
                    mem_usage = 0.0

                if swap_total is not None and swap_total > 0:
                    swap_usage = (float(swap_used) / swap_total) * 100
                else:
                    swap_usage = 0.0

                # Write to CSV, including the derived usage percentages
                csvwriter.writerow([timestamp, cpu_usage, system_cpu_usage, nice_cpu_usage,
                                    idle_cpu, io_wait_cpu, hardware_interrupts,
                                    software_interrupts, steal_time,
                                    mem_total, mem_free, mem_used, mem_usage,
                                    swap_total, swap_free, swap_used, swap_usage])

                # Skip past the lines already parsed; the loop then scans forward to the next timestamp
                i += 5
            else:
                # If line doesn't match timestamp, skip to the next line
                i += 1

print("Data has been written to", output_file)

3. cAdvisor test

Compare the values collected by docker stats with the values collected by cAdvisor (a sketch for pulling the cAdvisor side from Prometheus follows).
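
As with the node-exporter test, the cAdvisor side can be pulled from Prometheus. A minimal sketch, assuming cAdvisor is scraped by the same Prometheus instance at http://localhost:9090 and that the target container carries the label name="myapp" (both assumptions; adjust to your setup):

import datetime
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus
# cAdvisor's per-container CPU counter; its rate is CPU usage in cores, *100 for percent
QUERY = 'rate(container_cpu_usage_seconds_total{name="myapp"}[1m]) * 100'

end = datetime.datetime.now()
start = end - datetime.timedelta(minutes=10)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start.timestamp(), "end": end.timestamp(), "step": "5s"},
)
resp.raise_for_status()

# Print (time, value) pairs for comparison against the docker stats CPU % column
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(datetime.datetime.fromtimestamp(float(ts)), round(float(value), 2))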

3) Details

(1) docker stats collection shell script


#!/bin/bash

OUTPUT_FILE="docker_stats_output.txt"
INTERVAL=5
COUNT=12

echo "Capturing Docker stats every $INTERVAL seconds for $COUNT intervals"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Run docker stats and append output to the file at specified intervals
for ((i=1; i<=COUNT; i++))
do
    echo "Timestamp: $(date +"%Y-%m-%d %H:%M:%S")" >> "$OUTPUT_FILE"
    docker stats --no-stream >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    sleep "$INTERVAL"
done

echo "Capture complete."

(2) Python script that parses the output by timestamp
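
The script itself is not included here; the following is a minimal sketch of such a parser, assuming the docker_stats_output.txt layout produced by the capture script above and the default docker stats column order (CONTAINER ID, NAME, CPU %, MEM USAGE / LIMIT, MEM %, ...), which may differ between Docker versions.

import csv
import re

input_file = "docker_stats_output.txt"
output_file = "docker_stats_output.csv"

timestamp_pattern = re.compile(r"Timestamp: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

with open(output_file, mode="w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(["Timestamp", "Container", "CPU (%)", "Memory (%)"])

    current_timestamp = None
    with open(input_file) as infile:
        for line in infile:
            line = line.strip()
            ts_match = timestamp_pattern.match(line)
            if ts_match:
                current_timestamp = ts_match.group(1)
                continue
            # Skip blank lines and the repeated docker stats header row
            if not line or line.startswith("CONTAINER"):
                continue
            # Columns are separated by runs of 2+ spaces; default column order assumed
            fields = re.split(r"\s{2,}", line)
            if current_timestamp is None or len(fields) < 5:
                continue
            name = fields[1]
            cpu_percent = fields[2].rstrip("%")
            mem_percent = fields[4].rstrip("%")
            csvwriter.writerow([current_timestamp, name, cpu_percent, mem_percent])

print("Data has been written to", output_file)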