building homelab cluster part 8

In the last part I set up the monitoring system, and in this part I am going to set up the logging system.

loki

https://github.com/grafana/loki

Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream.

components

Agent - An agent or client, for example Promtail, which is distributed with Loki, or the Grafana Agent. The agent scrapes logs, turns the logs into streams by adding labels, and pushes the streams to Loki through an HTTP API.

Loki - The main server, responsible for ingesting and storing logs and processing queries. It can be deployed in three different configurations, for more information see deployment modes.

Grafana - for querying and displaying log data. You can also query logs from the command line using LogCLI, or use the Loki API directly.

deployment modes

There are three: monolithic, simple scalable, and microservices. An illustration and description of each mode is available in the official docs and easy to understand.

https://grafana.com/docs/loki/latest/get-started/deployment-modes/

Simple scalable deployment mode: "It strikes a balance between deploying in monolithic mode or deploying each component as a separate microservice."

I am going with simple scalable mode.

loki values file

https://grafana.com/docs/loki/latest/setup/install/helm/

Here is the list of items you can see in the helm values file. I am omitting the enterprise section because that's not an option for me building a homelab.

  • loki
    • storage
      • bucketNames
        • chunks, ruler, admin
      • type: s3
      • s3 access details
    • memcached
      • chunk_cache (enabled: false)
      • results_cache (enabled: false)
  • monitoring
    • dashboard
    • rules (prometheus rules)
    • service monitor
    • self monitoring
    • loki canary
  • write
  • table manager (enabled: false, deprecated as per v2.9 doc)
  • read
  • backend
  • single binary (replicas 0)
  • ingress (enabled: false)
  • memberlist service
  • gateway (enabled: true) # changed this to false
  • network policy (enabled: false)
  • minio (enabled: false)
  • sidecar

preparing s3 buckets

I will just follow the default bucket names found in the values file and create each bucket on my minio tenant.

  • admin
  • chunks
  • ruler

I create a new group named loki-group and a user named loki, set a read-write policy on the group, and generate an access key and secret for the user.

The Loki helm chart does not seem to have an option to consume s3 credentials from a kubernetes secret the way the gitlab-runner helm chart does.
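For reference, here is a minimal sketch of that setup using the mc client. The alias myminio and the policy file loki-rw-policy.json are names I made up for illustration, and recent mc releases use "mc admin policy create/attach" where older ones use "add/set".

# point mc at the minio tenant (alias name is arbitrary)
mc alias set myminio https://s3.blink-1x52.net ADMIN_ACCESS_KEY ADMIN_SECRET_KEY

# create the three buckets loki expects by default
mc mb myminio/admin myminio/chunks myminio/ruler

# user, group, and a read-write policy scoped to the loki buckets
mc admin user add myminio loki LOKI_SECRET_KEY
mc admin group add myminio loki-group loki
mc admin policy create myminio loki-rw loki-rw-policy.json
mc admin policy attach myminio loki-rw --group loki-group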

preparing memcached

https://grafana.com/docs/loki/latest/operations/caching/

https://memcached.org/

Two memcached clusters are recommended; they are named chunk_cache and results_cache in the values file.

Here are the recommended memcached settings:

  • chunk_cache: --memory-limit=4096 --max-item-size=2m --conn-limit=1024
  • results_cache: --memory-limit=1024 --max-item-size=5m --conn-limit=1024

The concurrent connection limit defaults to 1024 as per the official wiki.

https://artifacthub.io/packages/helm/bitnami/memcached

# cd {homelab repo}/infrastructure/CLUSTERNAME/controllers/default-values

# confirm the version, 6.14.0 as of 2024-03-08 (memcached v1.6.24)
helm show chart oci://registry-1.docker.io/bitnamicharts/memcached

# get the values file
helm show values oci://registry-1.docker.io/bitnamicharts/memcached > memcached-values.yaml
cp memcached-values.yaml ../.

Here is the diff of my values file against the default. This one is for chunk_cache; I have another copy named results-memcached-values.yaml for results_cache with the args changed to -m 1024 and -I 5m.

./infrastructure/CLUSTERNAME/controllers/chunk-memcached-values.yaml

20c20
<   storageClass: "directpv-min-io"
---
>   storageClass: ""
80,81c80,81
<   registry: registry.blink-1x52.net
<   repository: cache-dockerio/bitnami/memcached
---
>   registry: docker.io
>   repository: bitnami/memcached
102c102
< architecture: high-availability
---
> architecture: standalone
130,134c130
< args:
<   - /run.sh
<   - -m 4096
<   - -I 2m
<   - --conn-limit=1024
---
> args: []
152c148
< replicaCount: 3
---
> replicaCount: 1
229,235c225
< resources:
<   requests:
<     cpu: 250m
<     memory: 256Mi
<   limits:
<     cpu: 1
<     memory: 2048Mi
---
> resources: {}
326,327c316
< nodeSelector:
<   app.kubernetes.io/part-of: directpv
---
> nodeSelector: {}
564c553
<   enabled: true
---
>   enabled: false
572c561
<   storageClass: "directpv-min-io"
---
>   storageClass: ""
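As a side note, the diffs in this post are generated with plain diff, my customized file first and the pristine default second, so "<" lines are my values and ">" lines are the chart defaults:

# run from the controllers directory, per the layout above
diff chunk-memcached-values.yaml default-values/memcached-values.yaml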

Here is my script to generate the flux helmrepo and the two helmreleases.

./infrastructure/CLUSTERNAME/controllers/memcached.sh
#!/bin/bash

# add flux helmrepo to the manifest
flux create source helm bitnami \
    --url=oci://registry-1.docker.io/bitnamicharts \
    --interval=1h0m0s \
    --export >memcached.yaml

# add flux helm release to the manifest including the customized values.yaml file
flux create helmrelease chunk-memcached \
    --interval=10m \
    --target-namespace=memcached \
    --source=HelmRepository/bitnami \
    --chart=memcached \
    --chart-version=6.14.0 \
    --values=chunk-memcached-values.yaml \
    --export >>memcached.yaml

# add flux helm release to the manifest including the customized values.yaml file
flux create helmrelease results-memcached \
    --interval=10m \
    --target-namespace=memcached \
    --source=HelmRepository/bitnami \
    --chart=memcached \
    --chart-version=6.14.0 \
    --values=results-memcached-values.yaml \
    --export >>memcached.yaml
Once flux reconciles the manifests, everything comes up:

kubectl get all -n memcached
NAME                                READY   STATUS    RESTARTS   AGE
pod/memcached-chunk-memcached-0     1/1     Running   0          66s
pod/memcached-chunk-memcached-1     1/1     Running   0          66s
pod/memcached-chunk-memcached-2     1/1     Running   0          66s
pod/memcached-results-memcached-0   1/1     Running   0          66s
pod/memcached-results-memcached-1   1/1     Running   0          66s
pod/memcached-results-memcached-2   1/1     Running   0          66s

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/memcached-chunk-memcached     ClusterIP   10.102.190.128   <none>        11211/TCP   66s
service/memcached-results-memcached   ClusterIP   10.104.25.227    <none>        11211/TCP   66s

NAME                                           READY   AGE
statefulset.apps/memcached-chunk-memcached     3/3     66s
statefulset.apps/memcached-results-memcached   3/3     66s
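To double check that the args took effect, you can query memcached's runtime settings from a throwaway pod. The pod name mctest is arbitrary; for the chunk cluster I expect maxbytes 4294967296 (4096 MB), maxconns 1024, and item_size_max 2097152 (2m).

# one-off busybox pod to dump memcached settings over the service
kubectl -n memcached run mctest --rm -it --restart=Never --image=busybox -- \
    sh -c "printf 'stats settings\r\nquit\r\n' | nc memcached-chunk-memcached 11211 | grep -E 'maxbytes|maxconns|item_size_max'"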

loki values file, helmrepo, and helmrelease

# add repo
helm repo add grafana https://grafana.github.io/helm-charts

# update
helm repo update

# confirm the version, 5.43.6 as of 2024-03-12
helm search repo grafana/loki
helm show chart grafana/loki
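And same as with memcached, pull the default values file to customize; the diff below is against this default.

# get the values file
helm show values grafana/loki --version=5.43.6 > loki-values.yaml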

Here is the diff from the original values file.

./infrastructure/homelab/controllers/loki-values.yaml
280c280
<       endpoint: s3.blink-1x52.net
---
>       endpoint: null
282,283c282,283
<       secretAccessKey: REDACTED
<       accessKeyId: REDACTED
---
>       secretAccessKey: null
>       accessKeyId: null
330,331c330,331
<       enabled: true
<       host: "memcached-chunk-memcache.memcached.svc"
---
>       enabled: false
>       host: ""
336,337c336,337
<       enabled: true
<       host: "memcached-results-memcached.memcached.svc"
---
>       enabled: false
>       host: ""
844,845c844
<   nodeSelector:
<     app.kubernetes.io/part-of: directpv
---
>   nodeSelector: {}
867c866
<     storageClass: directpv-min-io
---
>     storageClass: null
1023,1024c1022
<   nodeSelector:
<     app.kubernetes.io/part-of: directpv
---
>   nodeSelector: {}
1041c1039
<     storageClass: directpv-min-io
---
>     storageClass: null
1127,1128c1125
<   nodeSelector:
<     app.kubernetes.io/part-of: directpv
---
>   nodeSelector: {}
1144c1141
<     size: 50Gi
---
>     size: 10Gi
1150c1147
<     storageClass: directpv-min-io
---
>     storageClass: null
1292c1289
<   enabled: false
---
>   enabled: true
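The helmrepo and hr script follows the same pattern as the memcached one. Here is a sketch of mine; the file name and the dedicated "loki" target namespace are assumptions based on the sections above.

./infrastructure/homelab/controllers/loki.sh
#!/bin/bash

# add flux helmrepo to the manifest
flux create source helm grafana \
    --url=https://grafana.github.io/helm-charts \
    --interval=1h0m0s \
    --export >loki.yaml

# add flux helm release to the manifest including the customized values.yaml file
flux create helmrelease loki \
    --interval=10m \
    --target-namespace=loki \
    --source=HelmRepository/grafana \
    --chart=loki \
    --chart-version=5.43.6 \
    --values=loki-values.yaml \
    --export >>loki.yaml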

loki installed

I see a lot of things are running.

The flux helmrelease health check may exceed its max retries before everything becomes healthy, which leaves the flux hr status stuck in "not ready". Suspending and resuming the hr resolves it:

flux suspend hr loki
flux resume hr loki

kubectl -n loki get all
NAME                                                    READY   STATUS    RESTARTS      AGE
pod/loki-backend-0                                      2/2     Running   2 (11m ago)   12m
pod/loki-backend-1                                      2/2     Running   3 (10m ago)   12m
pod/loki-backend-2                                      2/2     Running   3 (10m ago)   12m
pod/loki-canary-4xrzj                                   1/1     Running   0             12m
pod/loki-canary-p9qll                                   1/1     Running   0             12m
pod/loki-canary-qtqrq                                   1/1     Running   0             12m
pod/loki-canary-vwhnc                                   1/1     Running   0             12m
pod/loki-canary-wtzx5                                   1/1     Running   0             12m
pod/loki-loki-grafana-agent-operator-59b5949888-9nrgv   1/1     Running   0             12m
pod/loki-loki-logs-86vv7                                2/2     Running   0             12m
pod/loki-loki-logs-bcdnh                                2/2     Running   0             12m
pod/loki-loki-logs-j56zb                                2/2     Running   0             12m
pod/loki-loki-logs-k5277                                2/2     Running   0             12m
pod/loki-loki-logs-pnlbd                                2/2     Running   0             12m
pod/loki-read-c9f55f985-hgscn                           1/1     Running   0             12m
pod/loki-read-c9f55f985-qlbgh                           1/1     Running   0             12m
pod/loki-read-c9f55f985-v56lg                           1/1     Running   0             12m
pod/loki-write-0                                        1/1     Running   0             12m
pod/loki-write-1                                        1/1     Running   0             12m
pod/loki-write-2                                        1/1     Running   0             12m

NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/loki-backend                ClusterIP   10.109.177.120   <none>        3100/TCP,9095/TCP   12m
service/loki-backend-headless       ClusterIP   None             <none>        3100/TCP,9095/TCP   12m
service/loki-canary                 ClusterIP   10.99.208.197    <none>        3500/TCP            12m
service/loki-memberlist             ClusterIP   None             <none>        7946/TCP            12m
service/loki-read                   ClusterIP   10.99.247.137    <none>        3100/TCP,9095/TCP   12m
service/loki-read-headless          ClusterIP   None             <none>        3100/TCP,9095/TCP   12m
service/loki-write                  ClusterIP   10.104.241.140   <none>        3100/TCP,9095/TCP   12m
service/loki-write-headless         ClusterIP   None             <none>        3100/TCP,9095/TCP   12m
service/query-scheduler-discovery   ClusterIP   None             <none>        3100/TCP,9095/TCP   12m

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-canary      5         5         5       5            5           <none>          12m
daemonset.apps/loki-loki-logs   5         5         5       5            5           <none>          12m

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-loki-grafana-agent-operator   1/1     1            1           12m
deployment.apps/loki-read                          3/3     3            3           12m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-loki-grafana-agent-operator-59b5949888   1         1         1       12m
replicaset.apps/loki-read-c9f55f985                           3         3         3       12m

NAME                            READY   AGE
statefulset.apps/loki-backend   3/3     12m
statefulset.apps/loki-write     3/3     12m

more changes on loki values file

wip

  • memcached service name - related to srv record lookup
  • s3 path settings for the buckets access
  • loki canary args to specify destination loki server

promtail

https://grafana.com/docs/loki/latest/send-data/promtail/installation/

Promtail is an agent which ships the contents of local logs to a private Grafana Loki instance or Grafana Cloud.

# confirm chart/version, 6.15.5 as of 2024-03-12
helm show chart grafana/promtail

# get values file
helm show values grafana/promtail --version=6.15.5 > promtail-values.yaml

Since the gateway is disabled in my loki installation, I modified the log push destination in the promtail values file.

415c415
<     - url: http://loki-gateway/loki/api/v1/push
---
>     - url: http://loki-write:3100/loki/api/v1/push

The helmrepo is the same as loki's, so I only prepared a script to generate the flux hr for promtail. It is installed in the same "loki" namespace.

./infrastructure/homelab/controllers/promtail.sh
#!/bin/bash

# add flux helm release to the manifest including the customized values.yaml file
flux create helmrelease promtail \
    --interval=10m \
    --target-namespace=loki \
    --source=HelmRepository/grafana \
    --chart=promtail \
    --chart-version=6.15.5 \
    --values=promtail-values.yaml \
    --export >promtail.yaml
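Once flux applies it, promtail should run one pod per node. Selecting by the chart's standard label avoids having to guess the generated release name:

# confirm one promtail pod per node, then check the logs for push errors
kubectl -n loki get pods -l app.kubernetes.io/name=promtail -o wide
kubectl -n loki logs -l app.kubernetes.io/name=promtail --tail=20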

check logs on grafana

Access grafana, navigate to Home > Connections > Your Connections > Data sources, and add a loki data source.

The server http url in my case is "http://loki-read.loki:3100", where "loki-read" is the svc name and the trailing ".loki" is the namespace, needed to reach the service in the "loki" namespace from the "monitoring" namespace where grafana runs.

Once that's added, navigate to Home > Explore, select loki as the data source, and run a query.
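If you prefer the command line over the Explore UI, a minimal logcli session against the read path looks something like this; the label selector is just an example, use whatever labels promtail attached in your setup.

# reach the read path locally
kubectl -n loki port-forward svc/loki-read 3100:3100 &

# list the available labels, then run a sample LogQL query
export LOKI_ADDR=http://127.0.0.1:3100
logcli labels
logcli query '{namespace="loki"}' --limit=10 --since=1h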