building homelab cluster part 7

In this part, I am going to set up a monitoring system for the cluster. I am going to install kube-prometheus, which comes with Grafana and other monitoring components.

kube-prometheus

https://github.com/prometheus-operator/kube-prometheus

install crds

As described in the quick start, the way to install kube-prometheus is to first apply everything inside the ./manifests/setup directory, and then the ./manifests directory.

The manifest files inside the ./manifests/setup directory are all custom resource definitions, so I will merge them into a single CRDs file and place it in my ./infrastructure/homelab/controllers/crds directory.

# clone the repository
mkdir -p ~/repos/github.com/prometheus-operator
cd ~/repos/github.com/prometheus-operator
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus

# confirm the version
git branch -lr

# change the version
git checkout release-0.13

# remove namespace manifest as this will be created separately
cd manifests/setup
rm namespace.yaml

# merge all the crds files
cat *.yaml > kube-prometheus-v0.13.yaml

# add '---' separator line
sed -i '/^apiVersion/i ---' kube-prometheus-v0.13.yaml

# copy
cp kube-prometheus-v0.13.yaml {homelab repo}/infrastructure/hyper-v/controllers/crds/.
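
As a quick sanity check on the merged file, the CRD document count should match the number of files that were merged, and a client-side dry run catches any YAML that got mangled by the concatenation. A minimal check, run from the same directory:

# count the CRD documents in the merged file
grep -c '^kind: CustomResourceDefinition$' kube-prometheus-v0.13.yaml

# optional: validate locally without changing anything on the cluster
kubectl apply --dry-run=client -f kube-prometheus-v0.13.yaml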

Now, I have changed the way I add new namespaces: they all go in ./clusters/{cluster name}/namespace so that the namespaces and secrets from the homelab-sops repository are not affected by the trial and error of adding and pulling back manifests in the infra-controllers space.

I am planning to access Grafana through the gateway, so I'm adding the gateway label.

./clusters/CLUSTERNAME/namespace/monitoring.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    service: monitoring
    type: infrastructure
    gateway-available: "yes"

I update the infra-controllers kustomization to include the new kube-prometheus CRDs file.

./infrastructure/CLUSTERNAME/controllers/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # CRDs
  - crds/gateway-v1.0.0.yaml
  - crds/directpv-v4.0.10.yaml
  - crds/cert-manager-v1.14.3.yaml
  - crds/kube-prometheus-v0.13.yaml
  # infra-controllers
  - sops.yaml
  - metallb.yaml
  - ngf.yaml
  - minio-operator.yaml
  - minio-tenant.yaml
  - cert-manager.yaml
  - gitlab-runner.yaml

Here is the result.

kubectl api-resources | grep monitoring
alertmanagerconfigs               amcfg                                           monitoring.coreos.com/v1alpha1           true         AlertmanagerConfig
alertmanagers                     am                                              monitoring.coreos.com/v1                 true         Alertmanager
podmonitors                       pmon                                            monitoring.coreos.com/v1                 true         PodMonitor
probes                            prb                                             monitoring.coreos.com/v1                 true         Probe
prometheusagents                  promagent                                       monitoring.coreos.com/v1alpha1           true         PrometheusAgent
prometheuses                      prom                                            monitoring.coreos.com/v1                 true         Prometheus
prometheusrules                   promrule                                        monitoring.coreos.com/v1                 true         PrometheusRule
scrapeconfigs                     scfg                                            monitoring.coreos.com/v1alpha1           true         ScrapeConfig
servicemonitors                   smon                                            monitoring.coreos.com/v1                 true         ServiceMonitor
thanosrulers                      ruler                                           monitoring.coreos.com/v1                 true         ThanosRuler
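
Since everything installed later depends on these CRDs, a wait like this (using a couple of the CRD names behind the output above) can confirm they are established before moving on:

kubectl wait --for=condition=Established --timeout=60s \
  crd/prometheuses.monitoring.coreos.com \
  crd/servicemonitors.monitoring.coreos.com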

resource manifests

Now, there are tons of files in the ./manifests directory of the kube-prometheus repo.

ls ~/repos/github.com/prometheus-operator/kube-prometheus/manifests
alertmanager-alertmanager.yaml                       kubernetesControlPlane-serviceMonitorCoreDNS.yaml                prometheusAdapter-podDisruptionBudget.yaml
alertmanager-networkPolicy.yaml                      kubernetesControlPlane-serviceMonitorKubeControllerManager.yaml  prometheusAdapter-roleBindingAuthReader.yaml
alertmanager-podDisruptionBudget.yaml                kubernetesControlPlane-serviceMonitorKubelet.yaml                prometheusAdapter-serviceAccount.yaml
alertmanager-prometheusRule.yaml                     kubernetesControlPlane-serviceMonitorKubeScheduler.yaml          prometheusAdapter-serviceMonitor.yaml
alertmanager-secret.yaml                             kubeStateMetrics-clusterRoleBinding.yaml                         prometheusAdapter-service.yaml
alertmanager-serviceAccount.yaml                     kubeStateMetrics-clusterRole.yaml                                prometheus-clusterRoleBinding.yaml
alertmanager-serviceMonitor.yaml                     kubeStateMetrics-deployment.yaml                                 prometheus-clusterRole.yaml
alertmanager-service.yaml                            kubeStateMetrics-networkPolicy.yaml                              prometheus-networkPolicy.yaml
blackboxExporter-clusterRoleBinding.yaml             kubeStateMetrics-prometheusRule.yaml                             prometheusOperator-clusterRoleBinding.yaml
blackboxExporter-clusterRole.yaml                    kubeStateMetrics-serviceAccount.yaml                             prometheusOperator-clusterRole.yaml
blackboxExporter-configuration.yaml                  kubeStateMetrics-serviceMonitor.yaml                             prometheusOperator-deployment.yaml
blackboxExporter-deployment.yaml                     kubeStateMetrics-service.yaml                                    prometheusOperator-networkPolicy.yaml
blackboxExporter-networkPolicy.yaml                  nodeExporter-clusterRoleBinding.yaml                             prometheusOperator-prometheusRule.yaml
blackboxExporter-serviceAccount.yaml                 nodeExporter-clusterRole.yaml                                    prometheusOperator-serviceAccount.yaml
blackboxExporter-serviceMonitor.yaml                 nodeExporter-daemonset.yaml                                      prometheusOperator-serviceMonitor.yaml
blackboxExporter-service.yaml                        nodeExporter-networkPolicy.yaml                                  prometheusOperator-service.yaml
grafana-config.yaml                                  nodeExporter-prometheusRule.yaml                                 prometheus-podDisruptionBudget.yaml
grafana-dashboardDatasources.yaml                    nodeExporter-serviceAccount.yaml                                 prometheus-prometheusRule.yaml
grafana-dashboardDefinitions.yaml                    nodeExporter-serviceMonitor.yaml                                 prometheus-prometheus.yaml
grafana-dashboardSources.yaml                        nodeExporter-service.yaml                                        prometheus-roleBindingConfig.yaml
grafana-deployment.yaml                              prometheusAdapter-apiService.yaml                                prometheus-roleBindingSpecificNamespaces.yaml
grafana-networkPolicy.yaml                           prometheusAdapter-clusterRoleAggregatedMetricsReader.yaml        prometheus-roleConfig.yaml
grafana-prometheusRule.yaml                          prometheusAdapter-clusterRoleBindingDelegator.yaml               prometheus-roleSpecificNamespaces.yaml
grafana-serviceAccount.yaml                          prometheusAdapter-clusterRoleBinding.yaml                        prometheus-serviceAccount.yaml
grafana-serviceMonitor.yaml                          prometheusAdapter-clusterRoleServerResources.yaml                prometheus-serviceMonitor.yaml
grafana-service.yaml                                 prometheusAdapter-clusterRole.yaml                               prometheus-service.yaml
kubePrometheus-prometheusRule.yaml                   prometheusAdapter-configMap.yaml                                 setup
kubernetesControlPlane-prometheusRule.yaml           prometheusAdapter-deployment.yaml
kubernetesControlPlane-serviceMonitorApiserver.yaml  prometheusAdapter-networkPolicy.yaml

So the naming convention looks like {component name}-{resource kind}.yaml.

Here is the list of different components included in kube-prometheus.

ls -1 | cut -d- -f1 | sort | uniq
setup
alertmanager
blackboxExporter
grafana
kubePrometheus
kubernetesControlPlane
kubeStateMetrics
nodeExporter
prometheus
prometheusAdapter
prometheusOperator

Let me remove the setup directory, as it has already been taken care of, and look at the rest.

As stated in the repository README, here is the list of components. Let me have a look at each one from the top.

  • Prometheus Operator
  • Prometheus
  • Alertmanager
  • Node Exporter
  • Prometheus Adapter
  • kube-state-metrics
  • Grafana

Prometheus Operator

These are the manifests for the service account "prometheus-operator" and what this account is allowed to do.

  • prometheusOperator-clusterRoleBinding.yaml
  • prometheusOperator-clusterRole.yaml
  • prometheusOperator-serviceAccount.yaml

These are the deployment and service to expose it, and the network policy to apply.

  • prometheusOperator-deployment.yaml
  • prometheusOperator-service.yaml
  • prometheusOperator-networkPolicy.yaml

These are the PrometheusRule with the alert settings, and the ServiceMonitor to apply to the Prometheus operator.

  • prometheusOperator-prometheusRule.yaml
  • prometheusOperator-serviceMonitor.yaml

Prometheus

These are for the service account "prometheus-k8s" and the roles defining what it is allowed to do.

  • prometheus-clusterRoleBinding.yaml
  • prometheus-clusterRole.yaml
  • prometheus-serviceAccount.yaml
  • prometheus-roleSpecificNamespaces.yaml
  • prometheus-roleBindingSpecificNamespaces.yaml
  • prometheus-roleConfig.yaml
  • prometheus-roleBindingConfig.yaml

These are the pods and services.

  • prometheus-prometheus.yaml
  • prometheus-service.yaml
  • prometheus-podDisruptionBudget.yaml
  • prometheus-networkPolicy.yaml

These are the Prometheus rules and the service monitor.

  • prometheus-serviceMonitor.yaml
  • prometheus-prometheusRule.yaml

Alertmanager

This is the service account "alertmanager-main".

  • alertmanager-serviceAccount.yaml

These are for pods and network policy.

  • alertmanager-alertmanager.yaml
  • alertmanager-service.yaml
  • alertmanager-podDisruptionBudget.yaml
  • alertmanager-networkPolicy.yaml

This one seems to be the config file "alertmanager.yaml".

  • alertmanager-secret.yaml

And the rules and monitor files.

  • alertmanager-prometheusRule.yaml
  • alertmanager-serviceMonitor.yaml

node-exporter

These are for service account "node-exporter" and roles.

  • nodeExporter-serviceAccount.yaml
  • nodeExporter-clusterRoleBinding.yaml
  • nodeExporter-clusterRole.yaml

These are for pods and network policy.

  • nodeExporter-daemonset.yaml
  • nodeExporter-service.yaml
  • nodeExporter-networkPolicy.yaml

And the usual, rules and service monitor.

  • nodeExporter-prometheusRule.yaml
  • nodeExporter-serviceMonitor.yaml

prometheus adapter

These are for the service account "prometheus-adapter", its roles, and the delegations set for the APIService.

  • prometheusAdapter-serviceAccount.yaml
  • prometheusAdapter-clusterRole.yaml
  • prometheusAdapter-clusterRoleBinding.yaml
  • prometheusAdapter-roleBindingAuthReader.yaml
  • prometheusAdapter-clusterRoleAggregatedMetricsReader.yaml
  • prometheusAdapter-clusterRoleBindingDelegator.yaml
  • prometheusAdapter-clusterRoleServerResources.yaml
  • prometheusAdapter-apiService.yaml

Pods and network policy.

  • prometheusAdapter-deployment.yaml
  • prometheusAdapter-configMap.yaml
  • prometheusAdapter-service.yaml
  • prometheusAdapter-podDisruptionBudget.yaml
  • prometheusAdapter-networkPolicy.yaml

And then service monitor.

  • prometheusAdapter-serviceMonitor.yaml
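
The apiService file is what registers prometheus-adapter as the resource metrics API (metrics.k8s.io), so once it is running, kubectl top is served from Prometheus data. A quick check after installation:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl top nodes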

kube-state-metrics

These are for service account "kube-state-metrics" and roles.

  • kubeStateMetrics-serviceAccount.yaml
  • kubeStateMetrics-clusterRole.yaml
  • kubeStateMetrics-clusterRoleBinding.yaml

Pods and network policy.

  • kubeStateMetrics-deployment.yaml
  • kubeStateMetrics-service.yaml
  • kubeStateMetrics-networkPolicy.yaml

And the rules and service monitor.

  • kubeStateMetrics-prometheusRule.yaml
  • kubeStateMetrics-serviceMonitor.yaml

Grafana

Here is the service account "grafana".

  • grafana-serviceAccount.yaml

Tons of Grafana dashboard definitions built into kube-prometheus.

  • grafana-dashboardDefinitions.yaml

The pods, including the configs that specify the Prometheus data source, and the network policy.

  • grafana-deployment.yaml
  • grafana-dashboardSources.yaml
  • grafana-dashboardDatasources.yaml
  • grafana-config.yaml
  • grafana-service.yaml
  • grafana-networkPolicy.yaml

And the rules and service monitor.

  • grafana-prometheusRule.yaml
  • grafana-serviceMonitor.yaml

blackbox-exporter

Continuing on to the components not listed in the README.

These are for service account "blackbox-exporter".

  • blackboxExporter-serviceAccount.yaml
  • blackboxExporter-clusterRole.yaml
  • blackboxExporter-clusterRoleBinding.yaml

Pods.

  • blackboxExporter-configuration.yaml
  • blackboxExporter-deployment.yaml
  • blackboxExporter-service.yaml
  • blackboxExporter-networkPolicy.yaml

And service monitor.

  • blackboxExporter-serviceMonitor.yaml

prometheus rules

This one says it's a general rule.

  • kubePrometheus-prometheusRule.yaml

service monitor

The PrometheusRule and the service monitors for services on the control plane.

  • kubernetesControlPlane-prometheusRule.yaml
  • kubernetesControlPlane-serviceMonitorApiserver.yaml
  • kubernetesControlPlane-serviceMonitorCoreDNS.yaml
  • kubernetesControlPlane-serviceMonitorKubeControllerManager.yaml
  • kubernetesControlPlane-serviceMonitorKubelet.yaml
  • kubernetesControlPlane-serviceMonitorKubeScheduler.yaml
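
Since so much of kube-prometheus is ServiceMonitors, here is a minimal sketch of what one looks like; the name, label, and port are illustrative, not values from the actual manifests. The Prometheus resource selects these through its serviceMonitorSelector and serviceMonitorNamespaceSelector fields, which kube-prometheus leaves empty to match everything.

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                          # illustrative
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app    # must match the target Service's labels
  endpoints:
    - port: metrics                          # named port on the Service to scrape
      interval: 30s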

installing components

Since the list is enormous, I will merge the manifests by component.

# prepare "monitoring" directory
cd {homelab repo}/infrastructure/hyper-v/controllers
mkdir monitoring

# back to the kube-prometheus repo
cd ~/repos/github.com/prometheus-operator/kube-prometheus/manifests/

# move grafana
cat grafana*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/grafana.yaml
rm grafana*.yaml

# move kube-state-metrics
cat kubeStateMetrics*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/kube-state-metrics.yaml
rm kubeStateMetrics*.yaml

# move prometheus-adapter
cat prometheusAdapter*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/prometheus-adapter.yaml
rm prometheusAdapter*.yaml

# move blackbox-exporter
cat blackboxExporter*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/blackbox-exporter.yaml
rm blackboxExporter*.yaml

# move node-exporter
cat nodeExporter*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/node-exporter.yaml
rm nodeExporter*.yaml

# move alertmanager
cat alertmanager*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/alertmanager.yaml
rm alertmanager*.yaml

# move prometheus operator
cat prometheusOperator-*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/operator.yaml
rm prometheusOperator-*.yaml

# move prometheus
cat prometheus-*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/prometheus.yaml
rm prometheus-*.yaml

# remaining rule
cat *prometheusRule.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/prometheusrule.yaml
rm *prometheusRule.yaml

# move service monitor for kubernetes
cat kubernetesControlPlane-serviceMonitor*.yaml > ~/repos/cp.blink-1x52.net/gitops/homelab/infrastructure/hyper-v/controllers/monitoring/kube-servicemonitor.yaml
rm kubernetesControlPlane-serviceMonitor*.yaml

# make sure no manifest is missed

# separate manifest resources
cd {homelab repo}/infrastructure/hyper-v/controllers/monitoring
sed -i '/^apiVersion/i ---' *.yaml

I don't think I had to do this... oh well. Now I update the infra-controllers kustomization to include the monitoring items.

./infrastructure/hyper-v/controllers
.
 |-kustomization.yaml
 |-minio-tenant-values.yaml
 |-metallb.yaml
 |-gitlab-runner.yaml
 |-cert-manager-values.yaml
 |-cert-manager.yaml
 |-minio-tenant.sh
 |-metallb.sh
 |-minio-operator.yaml
 |-default-values
 | |-minio-tenant-values.yaml
 | |-cert-manager-values.yaml
 | |-ngf-values.yaml
 | |-metallb-values.yaml
 | |-gitlab-runner-values.yaml
 |-ngf-values.yaml
 |-gitlab-runner.sh
 |-metallb-values.yaml
 |-cert-manager.sh
 |-crds
 | |-cert-manager-v1.14.3.yaml
 | |-gateway-v1.0.0.yaml
 | |-directpv-v4.0.10.yaml
 | |-kube-prometheus-v0.13.yaml
 |-minio-tenant.yaml
 |-sops.yaml
 |-gitlab-runner-values.yaml
 |-ngf.yaml
 |-monitoring
 | |-prometheus-adapter.yaml
 | |-node-exporter.yaml
 | |-alertmanager.yaml
 | |-kube-state-metrics.yaml
 | |-prometheus.yaml
 | |-grafana.yaml
 | |-prometheusrule.yaml
 | |-blackbox-exporter.yaml
 | |-operator.yaml
 |-ngf.sh
 |-minio-operator.sh
./infrastructure/hyper-v/controllers/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # CRDs
  - crds/gateway-v1.0.0.yaml
  - crds/directpv-v4.0.10.yaml
  - crds/cert-manager-v1.14.3.yaml
  - crds/kube-prometheus-v0.13.yaml
  # infra-controllers
  - sops.yaml
  - metallb.yaml
  - ngf.yaml
  - minio-operator.yaml
  - minio-tenant.yaml
  - cert-manager.yaml
  - gitlab-runner.yaml
  # monitoring
  - monitoring/operator.yaml
  - monitoring/prometheus.yaml
  - monitoring/prometheus-adapter.yaml
  - monitoring/prometheusrule.yaml
  - monitoring/alertmanager.yaml
  - monitoring/kube-state-metrics.yaml
  - monitoring/node-exporter.yaml
  - monitoring/blackbox-exporter.yaml
  - monitoring/kube-servicemonitor.yaml
  - monitoring/grafana.yaml

starting over

No, merging the manifests just to make them easy to add to the kustomization was not a good idea. Still, I cannot add over 80 manifest files one by one to the kustomization resources list.

What I will do instead is add another Flux Kustomization for the monitoring.

First, I clean up the monitoring directory in infra-controllers, and then create a separate monitoring directory and put all the manifests there.

# clean up what's added
rm -rf {homelab repo}/infrastructure/homelab/controllers/monitoring
mkdir {homelab repo}/infrastructure/homelab/monitoring

# back to the kube-prometheus repo
cd ~/repos/github.com/prometheus-operator/kube-prometheus/manifests
git stash
cp *.yaml {homelab repo}/infrastructure/homelab/monitoring/.

And here is the additional Flux Kustomization (ks) to watch and reconcile the resources.

./clusters/homelab/monitoring.yaml
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-monitoring
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-controllers
  interval: 1h
  retryInterval: 1m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure/homelab/monitoring
  prune: true
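
There is no kustomization.yaml in the new monitoring directory; as far as I understand, the kustomize-controller generates one on the fly that includes every manifest it finds under the path, so I don't have to maintain the 80+ entry resources list myself. Once this is committed and pushed, the reconciliation can be checked (and nudged) like this:

# check the new flux kustomization and force a reconcile if needed
flux get kustomizations
flux reconcile kustomization infra-monitoring --with-source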

installed resources

kubectl -n monitoring get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                    2/2     Running   0          3m22s
pod/alertmanager-main-1                    2/2     Running   0          3m22s
pod/alertmanager-main-2                    2/2     Running   0          3m22s
pod/blackbox-exporter-6cfc4bffb6-f42h8     3/3     Running   0          3m40s
pod/grafana-748964b847-fwt5p               1/1     Running   0          3m40s
pod/kube-state-metrics-6b4d48dcb4-8k4wc    3/3     Running   0          3m40s
pod/node-exporter-47flx                    2/2     Running   0          3m40s
pod/node-exporter-8g88d                    2/2     Running   0          3m40s
pod/node-exporter-gkqvf                    2/2     Running   0          3m40s
pod/node-exporter-v9mrt                    2/2     Running   0          3m40s
pod/node-exporter-xb2kq                    2/2     Running   0          3m40s
pod/prometheus-adapter-79c588b474-brvs7    1/1     Running   0          3m40s
pod/prometheus-adapter-79c588b474-zwwc9    1/1     Running   0          3m40s
pod/prometheus-k8s-0                       2/2     Running   0          3m21s
pod/prometheus-k8s-1                       2/2     Running   0          3m21s
pod/prometheus-operator-68f6c79f9d-w2bxs   2/2     Running   0          3m40s

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main       ClusterIP   10.110.53.41     <none>        9093/TCP,8080/TCP            3m40s
service/alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m22s
service/blackbox-exporter       ClusterIP   10.108.3.54      <none>        9115/TCP,19115/TCP           3m40s
service/grafana                 ClusterIP   10.101.187.130   <none>        3000/TCP                     3m40s
service/kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            3m40s
service/node-exporter           ClusterIP   None             <none>        9100/TCP                     3m40s
service/prometheus-adapter      ClusterIP   10.105.19.171    <none>        443/TCP                      3m40s
service/prometheus-k8s          ClusterIP   10.101.1.109     <none>        9090/TCP,8080/TCP            3m40s
service/prometheus-operated     ClusterIP   None             <none>        9090/TCP                     3m21s
service/prometheus-operator     ClusterIP   None             <none>        8443/TCP                     3m40s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/node-exporter   5         5         5       5            5           kubernetes.io/os=linux   3m40s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/blackbox-exporter     1/1     1            1           3m40s
deployment.apps/grafana               1/1     1            1           3m40s
deployment.apps/kube-state-metrics    1/1     1            1           3m40s
deployment.apps/prometheus-adapter    2/2     2            2           3m40s
deployment.apps/prometheus-operator   1/1     1            1           3m40s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/blackbox-exporter-6cfc4bffb6     1         1         1       3m40s
replicaset.apps/grafana-748964b847               1         1         1       3m40s
replicaset.apps/kube-state-metrics-6b4d48dcb4    1         1         1       3m40s
replicaset.apps/prometheus-adapter-79c588b474    2         2         2       3m40s
replicaset.apps/prometheus-operator-68f6c79f9d   1         1         1       3m40s

NAME                                 READY   AGE
statefulset.apps/alertmanager-main   3/3     3m22s
statefulset.apps/prometheus-k8s      2/2     3m21s

GUI access

https://github.com/prometheus-operator/kube-prometheus/blob/main/docs/access-ui.md

Prometheus, Grafana, and Alertmanager each have a UI you can access, and I am going to create gateway listeners and HTTPRoutes for them.

Here is one example for Grafana, adding a listener to the existing gateway file.

./infrastructure/CLUSTERNAME/configs/gateway.yaml
- name: https-grafana
  hostname: grafana.blink-1x52.net
  port: 443
  protocol: HTTPS
  allowedRoutes:
    namespaces:
      from: Selector
      selector:
        matchLabels:
          gateway-available: "yes"
  tls:
    mode: Terminate
    certificateRefs:
      - name: tls-grafana-20240307
        namespace: gateway
        kind: Secret

And I create a matching HTTPRoute like this.

Note that the sectionName "https-grafana" matches the listener name defined in the gateway, and the same goes for the hostname "grafana.blink-1x52.net".

The backend reference name "grafana" and its port match the Grafana service.

./infrastructure/CLUSTERNAME/configs/monitoring.yaml
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: gateway
      sectionName: https-grafana
      namespace: gateway
  hostnames:
    - "grafana.blink-1x52.net"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: grafana
          port: 3000

kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.110.53.41     <none>        9093/TCP,8080/TCP            111m
alertmanager-operated   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   111m
blackbox-exporter       ClusterIP   10.108.3.54      <none>        9115/TCP,19115/TCP           111m
grafana                 ClusterIP   10.101.187.130   <none>        3000/TCP                     111m
kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP            111m
node-exporter           ClusterIP   None             <none>        9100/TCP                     111m
prometheus-adapter      ClusterIP   10.105.19.171    <none>        443/TCP                      111m
prometheus-k8s          ClusterIP   10.101.1.109     <none>        9090/TCP,8080/TCP            111m
prometheus-operated     ClusterIP   None             <none>        9090/TCP                     111m
prometheus-operator     ClusterIP   None             <none>        8443/TCP                     111m
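
To confirm that the route attached to the gateway listener, the status conditions on the HTTPRoute can be checked; Accepted and ResolvedRefs should both be True:

kubectl -n monitoring describe httproute grafana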

Now, since the cluster is running Calico, which enforces NetworkPolicy, the default network policies that come with the resource manifests are in effect, and they prevent the gateway from reaching these services in the monitoring namespace. I can edit the existing network policy file and add another ingress rule to allow access from the "ngf" namespace.

./infrastructure/CLUSTERNAME/monitoring/grafana-networkPolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 9.5.3
  name: grafana
  namespace: monitoring
spec:
  egress:
    - {}
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - port: 3000
          protocol: TCP
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ngf
      ports:
        - port: 3000
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: grafana
      app.kubernetes.io/name: grafana
      app.kubernetes.io/part-of: kube-prometheus
  policyTypes:
    - Egress
    - Ingress

Now I have access to https://grafana.blink-1x52.net. I can use the default "admin:admin" to log in and set a new password.

And I add similar changes for prometheus and alertmanager.
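
For reference, the added ingress rule for Prometheus follows the same pattern, only the port differs; per the service list above, prometheus-k8s listens on 9090 and alertmanager-main on 9093. This is just the rule appended to the existing ingress list, not the full policy:

    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ngf
      ports:
        - port: 9090
          protocol: TCP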

pvc for grafana

I'd like Grafana to remember the changes I make and the dashboards I mark as favorites, so I am going to attach a PVC to the Grafana deployment. And since the PVC comes from DirectPV, I also set a node selector.

Below is part of the 250+ line grafana-deployment.yaml file. The volume "grafana-storage" is the default name used, and I changed it from an emptyDir to a PVC. The two nodeSelector lines are something I added.

./infrastructure/homelab/monitoring/grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
spec:
  template:
    spec:
      containers:
        - env: []
          image: grafana/grafana:9.5.3
          name: grafana
      nodeSelector:
        app.kubernetes.io/part-of: directpv
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc

I add the PVC in a separate file. I set the storage class name to directpv-min-io so that the requested volume gets served by DirectPV.

./infrastructure/homelab/monitoring/pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 3Gi
  storageClassName: directpv-min-io
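
Once Flux applies this, a quick check on the claim; depending on the storage class binding mode it may stay Pending until the Grafana pod that uses it gets scheduled.

kubectl -n monitoring get pvc grafana-pvc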

prometheus settings

As for persistence settings, the Prometheus kind has .spec.storage.volumeClaimTemplate available to set the PVC.

It appears that the default retention period is 24h, according to the prometheus pvc example file. I'm changing it to 48 days.

I'll cover the additional scrape config in the next section, but the change for it is also included here.

./infrastructure/homelab/monitoring/prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.46.0
  name: k8s
  namespace: monitoring
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: scrape.yaml
  alerting:
    alertmanagers:
      - apiVersion: v2
        name: alertmanager-main
        namespace: monitoring
        port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.46.0
  nodeSelector:
    app.kubernetes.io/part-of: directpv
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.46.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
  retention: "48d"
  ruleNamespaceSelector: {}
  ruleSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  storage:
    volumeClaimTemplate:
      apiVersion: v1
      kind: PersistentVolumeClaim
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 80Gi
        storageClassName: directpv-min-io
  version: 2.46.0
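
Once the operator picks this up, the prometheus-k8s statefulset gets recreated with the storage and retention settings; something like this can confirm it, assuming the names stay as above:

# PVC created from the volumeClaimTemplate
kubectl -n monitoring get pvc | grep prometheus-k8s
# retention flag the operator passes to prometheus
kubectl -n monitoring describe statefulset prometheus-k8s | grep retention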

additional scrape config

https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/additional-scrape-config.md

https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/patches/

If I have a node exporter running on 192.168.1.254:9100, I can prepare a scrape configuration YAML like this, turn it into a secret, and let Prometheus load it.

./infrastructure/homelab/configs/scrape/scrape.yaml
- job_name: "exporter_outside_k8s"
  static_configs:
    - targets: ["192.168.1.254:9100"]
      labels:
        service: node-exporter
        instance: node254

Generate the secret in the infra-monitoring Flux Kustomization directory.

./infrastructure/homelab/configs/scrape/monitoring.sh
kubectl create secret generic additional-scrape-configs \
    --from-file=scrape.yaml \
    --namespace=monitoring \
    --dry-run=client \
    -oyaml >>../../monitoring/additional-scrape-config.yaml

The settings required to use the additional scrape config are already in place in the Prometheus manifest at .spec.additionalScrapeConfigs.
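
To double-check what Prometheus will actually load, the scrape config can be decoded back out of the secret (the dot in the key name needs escaping in the jsonpath). After a reload, the new job should appear under Status > Targets in the Prometheus UI.

kubectl -n monitoring get secret additional-scrape-configs \
  -o jsonpath='{.data.scrape\.yaml}' | base64 -d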

I have a custom node exporter dashboard JSON definition to import as a Grafana dashboard, but since it's too long to share, I'll just skip it.

alerting

Log in to Grafana, navigate to Home > Alerting > Contact points, and add a contact point. In my case I added a Discord webhook destination. Make sure to test it.

Then navigate to Home > Alerting > Notification policies, edit the default policy, and change the destination from the default empty email to the new contact point just created and tested.

repository structure so far

I omitted lines not related to the kube-prometheus setup so the list won't be too long, given there are 80+ kube-prometheus manifest files.

gitops/homelab repository
.
 |-clusters
 | |-homelab
 | | |-infrastructure.yaml
 | | |-monitoring.yaml         # flux kustomization infra-monitoring
                               # so that I don't have to maintain a kustomize resources list with 80+ entries
 | | |-flux-system
 | | | |-kustomization.yaml
 | | | |-gotk-sync.yaml
 | | | |-gotk-components.yaml
 | | |-sops.yaml
 | | |-namespace
 | | | |-metallb.yaml
 | | | |-cert-manager.yaml
 | | | |-runner.yaml
 | | | |-monitoring.yaml       # monitoring namespace with label to use gateway
 | | | |-minio-operator.yaml
 | | | |-gateway.yaml
 | | | |-minio-tenant.yaml
 | | | |-ngf.yaml
 |-infrastructure
 | |-homelab
 | | |-configs
 | | | |-kustomization.yaml
 | | | |-metallb-config.yaml
 | | | |-issuer.yaml
 | | | |-monitoring.yaml
 | | | |-gateway.yaml
 | | | |-minio-tenant.yaml
 | | | |-scrape
 | | | | |-monitoring.sh       # script to convert the scrape conf file into secret and place it in infra-monitoring directory
 | | | | |-scrape.yaml         # prometheus scrape configuration file
 | | |-controllers
 | | | |-kustomization.yaml               # added kube-prometheus crds
 | | | |-crds
 | | | | |-kube-prometheus-v0.13.yaml     # kube-prometheus crds
 | | |-monitoring
 | | | |-grafana-networkPolicy.yaml       # add ingress rule to allow access from ngf
 | | | |-additional-scrape-config.yaml    # scrape config secret file
 | | | |-prometheus-prometheus.yaml       # add pvc, nodeSelector to choose nodes with directpv, retention settings, and additional scrape config
 | | | |-alertmanager-networkPolicy.yaml  # add ingress rule to allow access from ngf
 | | | |-pvc.yaml                         # directpv pvc for grafana
 | | | |-prometheus-networkPolicy.yaml    # add ingress rule to allow access from ngf
 | | | |-grafana-deployment.yaml          # change emptydir to directpv pvc, nodeSelector to choose nodes with directpv
 | | | |-... and 70+ more files...