snmp monitor #monitoring-system/todo



Here I am documenting how to set up a monitoring system using open source software.

  • snmp_exporter
    • gets snmp metrics from endpoints
  • prometheus
    • stores the metric data
  • grafana
    • uses prometheus as data source and can be used to explore and visualize obtained metric data

Note that this is somewhat different from what is covered in the other page [[building-homelab-cluster-part-7]] using the kube-prometheus stack. This page covers how to set up a snmp monitoring system from scratch on docker containers.

---
title: monitoring system
---
graph LR
    User -- web UI access --> grafana[grafana:3000] -- data source --> prometheus[prometheus:9090] -- obtain snmp metric data --> snmp[snmp_exporter:9116]

Items to be covered are...

  • explanation of the monitoring system to be setup, with words and illustrations
    • multiple remote devices to monitor in multiple sites
    • monitoring system in one location
  • well-known technologies involved in the monitoring system
    • snmp v2c
      • this section should be very simple and short
    • docker, because the monitoring system will be running as docker containers using docker compose
      • this section should be very simple and short too
  • components of the monitoring system covered in this document
    • snmp agent, responder on the managed device
    • prometheus snmp exporter, a snmp manager which collects metric data from the managed devices
    • prometheus, an open source monitoring toolkit which scrapes metric data via the snmp manager and stores it
    • grafana, an open source visualization platform with rich features including alerting
  • building the monitoring system from ground up as a single docker compose
    • snmp agent, a simple example using Cisco IOS configuration
    • snmp manager, the prometheus snmp exporter
      • first explain about the ideas of the scrape target to be defined in the prometheus configuration file and the modules to be defined in the snmp exporter configuration file
        • illustrate this!
        • and explain in which order the document is going to cover
      • briefly go through how to generate the default snmp.yml file using the generator available in the official repository
        • add a note that the snmp manager customization on which metric to obtain must happen here
      • start with the default if_mib module, and add a section later about the customization
    • prometheus
      • scrape config in prometheus.yml
        • use file_sd_config to specify the monitor targets with necessary grouping and labeling
      • basic health check
        • navigate to "status > targets" to see the list of monitor targets
        • navigate to "status > tsdb status" to see the activity of the server by observing the numbers counting up
    • grafana
      • initial login credentials
      • mention the administration section and note adding teams and users as necessary
  • how to setup and use grafana
    • adding prometheus as the data source
    • explore, maybe use up metric just to see things working
    • create a dashboard using the up metric in the gauge format
      • preferably also create a dashboard for the interface utilization, introducing the equation to use, to show how to come up with a complex condition for monitoring and even firing an alert
    • configure notification settings
    • create an alert using discord webhook for example
  • how to add monitoring targets
    • add a note that there is no need to restart the prometheus server if you are just adding lines to the existing file_sd_configs target files
  • how to add a monitoring metric data to obtain
    • briefly touch upon OID and MIB
    • manually test the target OID to a test device using snmpwalk
    • add a new module in the generator.yml and also prepare required MIB file, and run the generator to generate the snmp.yml file
    • replace the snmp.yml file and restart the snmp exporter container
    • manually test the new module using curl to the snmp exporter
    • add a new test job in the scrape config on prometheus.yml and restart the prometheus container
    • check the target status on the prometheus web UI
    • revise the prometheus.yml configuration file along with the target yaml files as necessary
      • do not forget to show example!
  • issues I encountered
    • scrape intervals and timeouts
    • sample size

docker compose

In short, I put all three services above in a single docker compose, creating volumes for data persistence, and placing and mounting configuration files to specify what data to obtain from which targets.

Here is the docker compose file with image tags as of Q2 2024.

docker-compose.yml
services:
  prometheus:
    image: quay.io/prometheus/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus
      - ./snmp_target:/etc/prometheus/file_sd_config
      - type: bind
        source: ./config/prometheus.yml
        target: /etc/prometheus/prometheus.yml
        read_only: true
    command:
      - --web.enable-remote-write-receiver
      - --config.file=/etc/prometheus/prometheus.yml

  snmp:
    image: quay.io/prometheus/snmp-exporter:v0.26.0
    container_name: snmp
    ports:
      - "9116:9116"
    volumes:
      - type: bind
        source: ./config/snmp.yml
        target: /etc/snmp_exporter/snmp.yml
        read_only: true

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=your_admin_username_here
      - GF_SECURITY_ADMIN_PASSWORD=your_admin_password_here
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data: {}
  grafana-data: {}

snmp_exporter

https://github.com/prometheus/snmp_exporter

This snmp_exporter is the one that actually sends snmp requests to the monitoring target nodes. It's capable of snmp v3, but here I will just cover snmp v2c.

As you can see in the docker compose file, it's listening on tcp port 9116, and the prometheus server sends its requests there to have snmp_exporter run snmp get requests and return the collected metric data.

You have to specify the target, module, and auth parameters when sending a request. The target is straightforward. The module and auth are defined in the snmp_exporter configuration file and will be covered shortly.

curl "http://{snmp_exporter}:9116/snmp?module={module_name}&auth={auth_name}&target={ipaddr}"

# snmp_exporter: in the docker compose example above, it's going to be "snmp"
# and if trying it out from outside the docker containers, it's going to be the IP address of the server
# and of course, "localhost" would work if sending the request from the host machine running this docker compose

# module_name: name of the module defined in the snmp_exporter configuration file
# auth_name: snmpv2c community string defined in the snmp_exporter configuration file

# ipaddr: the monitoring target to send snmp get request to

# example if trying it out on the server running this docker compose
curl "http://localhost:9116/snmp?module=ifModule&auth=jGc9SRTxNNH4pZk&target=192.168.200.1"

If your configuration file defines the authentication group "jGc9SRTxNNH4pZk" and a module "ifModule" that collects metrics such as ifOperStatus (interface operational status) and ifAdminStatus (interface admin status), the snmp_exporter sends the relevant snmp get requests using the community string "jGc9SRTxNNH4pZk" against the target node 192.168.200.1.
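As a rough sketch of what prometheus does with this endpoint, the snippet below builds the same query URL as the curl example and parses one line of the exporter's standard Prometheus exposition output. The sample line and its labels are illustrative, not captured from a real device; this is a simplified illustration, not the actual scrape code.

```python
import re
from urllib.parse import urlencode


def snmp_query_url(exporter, module, auth, target):
    """Build the snmp_exporter query URL, just like the curl example."""
    query = urlencode({"module": module, "auth": auth, "target": target})
    return f"http://{exporter}:9116/snmp?{query}"


def parse_metric_line(line):
    """Parse one Prometheus exposition line into (name, labels, value)."""
    m = re.match(r'(\w+)\{([^}]*)\}\s+(\S+)', line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, value


url = snmp_query_url("localhost", "ifModule", "jGc9SRTxNNH4pZk", "192.168.200.1")

# A made-up but typical line for the ifModule described above.
sample = 'ifOperStatus{ifAlias="uplink",ifDescr="GigabitEthernet0/0",ifIndex="1"} 1'
name, labels, value = parse_metric_line(sample)
```

A value of 1 maps to "up" through the enum_values table shown later in the configuration file.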

By the way, the node being monitored should be configured something like this to allow others to get the snmp data on the system.

snmp-server community jGc9SRTxNNH4pZk RO
snmp-server host {ipaddr_of_the_server_running_snmp_exporter_container} version 2c jGc9SRTxNNH4pZk
---
title: snmp
---
graph LR
    prometheus -- request --> snmp[snmp_exporter:9116] -- snmp request --> t[monitor target:161/udp]

snmp and OID

There are countless kinds of data you can obtain using snmp, and each of those metrics is stored at a different address in the snmp system of the node being monitored.

The address of an snmp data location is called an OID. When you want to see the system description, you request the information at OID 1.3.6.1.2.1.1.1. When you want to see the system uptime, you request the information at OID 1.3.6.1.2.1.1.3.

Here are some examples of manually obtaining snmp data using snmpwalk.

$ snmpwalk -v 2c -c public localhost .1.3.6.1.2.1.1.1
iso.3.6.1.2.1.1.1.0 = STRING: "Linux d12 6.1.0-20-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64"

$ snmpwalk -v 2c -c public localhost .1.3.6.1.2.1.1.3
iso.3.6.1.2.1.1.3.0 = Timeticks: (183916) 0:30:39.16

OID and MIB

OID is a hierarchy like DNS where everything starts at the root ".". Try a website hosting OID information, for example: you can dig down first to .1, then .1.3, and all the way down to .1.3.6.1.2.1, where there are .1.3.6.1.2.1.1 for "system", .1.3.6.1.2.1.2 for "interfaces", and so on.

Have a look at "system", .1.3.6.1.2.1.1, and you'll see that the two OIDs mentioned earlier are "sysDescr" at .1.3.6.1.2.1.1.1 and "sysUpTime" at .1.3.6.1.2.1.1.3. A monitoring system can be configured to specify the target metric either by OID or by the OID's name, once the relevant dictionaries are imported. Such a dictionary file containing the OID definitions is a MIB file.
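The name lookup a MIB enables can be sketched with a tiny hand-made dictionary. The entries below are just the few OIDs mentioned above, not a real MIB; the longest known prefix of an OID gives its name, so an instance like .1.3.6.1.2.1.1.1.0 resolves through its parent.

```python
# A miniature "MIB": a few OID prefixes mapped to their names.
MINI_MIB = {
    "1.3.6.1.2.1.1": "system",
    "1.3.6.1.2.1.1.1": "sysDescr",
    "1.3.6.1.2.1.1.3": "sysUpTime",
    "1.3.6.1.2.1.2": "interfaces",
    "1.3.6.1.2.1.2.2.1.8": "ifOperStatus",
}


def name_of(oid):
    """Resolve an OID to the name of its longest known prefix."""
    parts = oid.strip(".").split(".")
    # Walk from the most specific prefix down to the least specific one.
    for i in range(len(parts), 0, -1):
        prefix = ".".join(parts[:i])
        if prefix in MINI_MIB:
            return MINI_MIB[prefix]
    return None
```

Real MIB files carry much more than names (types, descriptions, table indexes), which is why the generator covered below consumes them directly.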

module and auth in snmp_exporter

Ultimately, this is all you have to do to monitor a system.

  1. decide what data you want to collect
  2. identify the OID of the data
  3. send snmp request to the target device to obtain the data

Now, of course it's too much of a burden to maintain and execute, say, 30 lines of snmpwalk to collect interesting metric data for 10 devices each in 30 different office locations (9,000 lines!).

You also have to consider that there may be different sets of metrics (OIDs) to monitor for different roles of devices. Some device models offer fan rpm while some don't. On some network devices you may want to monitor BGP session status in addition to standard interface metrics.

Let's say you want a module named "systemModule" covering the metrics "sysDescr" and "sysUpTime", and another named "ifModule" covering "ifOperStatus" and "ifAdminStatus"; below is something you would see in the snmp_exporter configuration file.

Now you can run curl "http://localhost:9116/snmp?module=systemModule&auth=jGc9SRTxNNH4pZk&target=192.168.200.1" to get the system description and system uptime of the node 192.168.200.1 through snmp_exporter. The other module "ifModule" works the same way.

Still, it would be too complicated to write up this configuration file from scratch, right? As the comment line says, there is a generator tool available to generate the snmp_exporter configuration file.

# WARNING: This file was auto-generated using snmp_exporter generator, manual changes will be lost.
auths:
  jGc9SRTxNNH4pZk:
    community: jGc9SRTxNNH4pZk
    security_level: noAuthNoPriv
    auth_protocol: MD5
    priv_protocol: DES
    version: 2
modules:
  systemModule:
    walk:
      - 1.3.6.1.2.1.31.1.1
    get:
      - 1.3.6.1.2.1.1.1.0
      - 1.3.6.1.2.1.1.3.0
      - 1.3.6.1.2.1.1.5.0
      - 1.3.6.1.2.1.1.6.0
    metrics:
      - name: sysDescr
        oid: 1.3.6.1.2.1.1.1
        type: DisplayString
        help: A textual description of the entity - 1.3.6.1.2.1.1.1
      - name: sysUpTime
        oid: 1.3.6.1.2.1.1.3
        type: gauge
        help:
          The time (in hundredths of a second) since the network management portion
          of the system was last re-initialized. - 1.3.6.1.2.1.1.3
  ifModule:
    walk:
    get:
    metrics:
      - name: ifOperStatus
        oid: 1.3.6.1.2.1.2.2.1.8
        type: gauge
        help: The current operational state of the interface - 1.3.6.1.2.1.2.2.1.8
        indexes:
          - labelname: ifIndex
            type: gauge
        lookups:
          - labels:
              - ifIndex
            labelname: ifAlias
            oid: 1.3.6.1.2.1.31.1.1.1.18
            type: DisplayString
          - labels:
              - ifIndex
            labelname: ifDescr
            oid: 1.3.6.1.2.1.2.2.1.2
            type: DisplayString
          - labels:
              - ifIndex
            labelname: ifName
            oid: 1.3.6.1.2.1.31.1.1.1.1
            type: DisplayString
          - labels:
              - ifIndex
            labelname: ifPhysAddress
            oid: 1.3.6.1.2.1.2.2.1.6
            type: PhysAddress48
        enum_values:
          1: up
          2: down
          3: testing
          4: unknown
          5: dormant
          6: notPresent
          7: lowerLayerDown

generator for snmp_exporter

The default configuration is available on the snmp_exporter github repo, but to prepare different modules to meet your own needs, there is a generator tool you can use to generate the snmp_exporter configuration file from a simpler "generator.yml" file, following the format documented in the generator's README.

For example, if there are Cisco IOS, NXOS, and ASA devices, and also NEC IX devices, you might just use the default "if_mib" module available in the example "generator.yml" file, and add something like below to also capture CPU and memory utilization data.

The generated "snmp.yml" configuration file can then be used to run snmp_exporter, which will be ready to collect CPU and memory metric data from a NEC IX router via curl "http://localhost:9116/snmp?module=nec_ix_cpu_mem&auth=jGc9SRTxNNH4pZk&target=192.168.200.1".

Now, the revised steps to set up the snmp monitoring system would be:

  1. decide what data you want to collect
  2. identify the OID of the data
  3. prepare MIB files for the target metrics/OIDs
  4. edit "generator.yml" file and generate "snmp.yml" file
  5. run snmp_exporter using the generated "snmp.yml" file
generator.yml file, partial
---
auths:
  jGc9SRTxNNH4pZk:
    version: 2
    community: jGc9SRTxNNH4pZk
modules:
  ...
    ...
  if_mib:
    ...

  # nec ix router
  nec_ix_cpu_mem:
    walk:
      # cpu
      - picoSchedRtUtl5Sec
      - picoSchedRtUtl1Min
      # memory util
      - picoHeapUtil

  # cisco asa cpu
  asa_cpu:
    walk:
      - cpmCPUTotalMonIntervalValue  # 5 sec
      - cpmCPUTotal1minRev  # 1 min
      - cpmCPUTotal5minRev  # 5 min

  # cisco ios cpu
  cisco_ios_cpu:
    walk:
      - cpmCPUTotal5secRev  # 5 sec
      - cpmCPUTotal1minRev  # 1 min
      - cpmCPUTotal5minRev  # 5 min

  # cisco nxos memory
  cisco_nxos_memory:
    walk:
      - cempMemPoolHCUsed
      - cempMemPoolHCFree

  # cisco ios and asa memory
  cisco_ios_asa_memory:
    walk:
      - ciscoMemoryPoolUsed
      - ciscoMemoryPoolFree

how to use generator

Here is an example of running the generator.

git clone --branch v0.26.0 https://github.com/prometheus/snmp_exporter.git
cd snmp_exporter/generator

# place necessary mib files under ./mibs/
cp {necessary mib files} mibs/.

# where to get mib files?
# https://github.com/prometheus/snmp_exporter/tree/main/generator#where-to-get-mibs

# edit generator.yml file

# execute generator to generate snmp.yml file

# you must have these files and docker installed:
# ./generator.yml
# ./mibs/{mib files}
make docker-generate

# or, you can still run generator without docker by building it
# see the instruction - https://github.com/prometheus/snmp_exporter/blob/main/generator/README.md

running snmp_exporter on docker compose

You are now ready to run the snmp_exporter service once you have generated your snmp.yml file.

  • place the generated snmp.yml file at ./config/snmp.yml
  • prepare the ./docker-compose.yml file and docker compose up -d
  • now the host is listening on port 9116
    • test the added module against a target 192.168.210.50 assuming it's the IP address of a NXOS switch
    • curl "http://localhost:9116/snmp?module=cisco_nxos_memory&auth=jGc9SRTxNNH4pZk&target=192.168.210.50"
services:
  snmp:
    image: quay.io/prometheus/snmp-exporter:v0.26.0
    container_name: snmp
    ports:
      - "9116:9116"
    volumes:
      - type: bind
        source: ./config/snmp.yml
        target: /etc/snmp_exporter/snmp.yml
        read_only: true

prometheus