metal-stack v0.8.0 🤩

Updates on Gardener, FSLs and machine management


by Gerrit Schwerthelm on Sep 24, 2021

metal-stack v0.8.0 has been released and as always, there are many new things to show. 🤩

Check out the direct link to the release here.

Gardener 1.19 Compatibility

It’s a long time since we were able to update our Gardener dependencies, but finally metal-stack supports running with Gardener 1.19 including the API Server SNI feature!

Audit Logging

A lot of effort was put into our solution for audit logging (kudos to @mreiger 🚀). Even though we are still trying to find more general solutions with Gardener for audit logging, we can offer the following functionalities at the moment, all “toggle-able” through our shoot spec’s control plane feature gates:

  • Deployment of an audit-forwarder, streaming the API servers audit events directly into the shoot cluster
  • Configuration for sending logs directly to Splunk (this needs to be enabled through the controller configuration and can also be reconfigured from a user through a secret in the shoot cluster’s kube-system namespace)

Cloud Storage

In addition to the logging features, we are also moving forward with our duros-controller, which is an operator providing LightOS NVMe/TCP storage by Lightbits Labs to our shoot clusters. The operator is now capable of automatical API token renewal, managed resource health checking and it now also defines a proper cleanup and deletion flow.

We will present this storage solution in one of the next Gardener Community Meetings, so please check the #gardener Channel on Kubernetes’ Slack Workspace and keep the date. 🤓

Big shout out to our Israeli friends of Lightbits Labs for helping us bringing this storage solution to production-grade! ❤️

Improvements on Filesystem Layouts

In the last minor release we introduced FSLs (file system layouts), which allows users to deploy customizable disk layouts for machine provisioning. In this release we also support software RAIDs (managed through mdadm) to be defined in an FSL. This is how a FSL definition may look like:

id: raid
description: "raid layout example"
constraints:
  sizes:
    - c1-xlarge-x86
  images:
    debian: "*"
disks:
  - device: "/dev/sda"
    wipeonreinstall: true
    partitions:
      - number: 1
        label: "root"
        size: 0
        gpttype: "8300"
  - device: "/dev/sdb"
    wipeonreinstall: true
    partitions:
      - number: 1
        label: "root"
        size: 0
        gpttype: "8300"
raid:
  - arrayname: "/dev/md0"
    level: 1
    devices:
      - "/dev/sda1"
      - "/dev/sdb1"
    createoptions: ["--metadata", "1.0"]
filesystems:
  - path: "/"
    device: "/dev/md0"
    format: "ext4"
    label: "root"
  - path: "/tmp"
    device: "tmpfs"
    format: "tmpfs"
    mountoptions: ["defaults","noatime","nosuid","nodev","noexec","mode=1777","size=512M"]

The example illustrates how to put a system’s root partition on a raid device /dev/md0, where a the software RAID 1 is placed on the two disks /dev/sda and /dev/sdb.

Along with the new RAID capabilities, it is also possible to extend the size of a logical volume to the rest of free space of its volume group. It works in the same way as for disk partitioning by setting the size value to 0. Here is an example:

id: ci-runner
description: ci runner layout
constraints:
  images: {}
  sizes:
    - c1-xlarge-x86
disks:
  - device: /dev/nvme0n1
    partitions: []
    wipeonreinstall: true
  - device: /dev/nvme1n1
    partitions: []
    wipeonreinstall: true
logicalvolumes:
  - lvmtype: striped
    name: root
    size: 0
    volumegroup: vgroot
volumegroups:
  - devices:
      - /dev/nvme0n1
      - /dev/nvme1n1
    name: vgroot
    tags: []
filesystems:
  - createoptions: []
    device: /dev/vgroot/root
    format: ext4
    label: root
    mountoptions: []
    path: /
  - createoptions: []
    device: tmpfs
    format: tmpfs
    label: ""
    mountoptions: ["defaults","noatime","nosuid","nodev","noexec","mode=1777","size=512M"]
    path: /tmp

Updates on metalctl

The metalctl machine ipmi command is now able to show the current power state of a machine, easing the life for metal-stack operators who are managing the machine fleet. By default, the power status is picked up and reported by the bmc-catcher in a five minute update interval.

In addition to that, the machine issues command output was completely refactored. It’s more operator-friendly now, allowing issue filtering and showing the lock description (which is the usual way for operators to move a machine out of the available machine pool):

$ m machine issues --omit bmc-no-distinct-ipbmc-without-mac
ID                                  	POWER	LOCK	LOCK REASON                      	STATUS	LAST EVENT    	WHEN   	ISSUES
263d5a00-f10b-11e9-8000-3cecef408994	●    	🔒  	interfaces: Error set link eth...	💀    	PXE Booting   	24d 13h	- the machine is not sending events anymore (liveliness-dead)
                                    	     	    	                                 	      	              	       	- machine has an incomplete lifecycle (↻) (incomplete-cycles)
942a5c00-a77f-11e9-8000-ac1f6bd38c5a	●    	🔒  	CATERR CPU FEHLER                	💀    	Planned Reboot	92d 22h	- the machine is not sending events anymore (liveliness-dead)
00000000-0000-0000-0000-002590b8f968	●    	    	                                 	      	Phoned Home   	3s     	- machine phones home but not allocated (failed-machine-reclaim)
2ca51200-bdfa-11e9-8000-3cecef23002c	●    	    	                                 	      	Phoned Home   	47s    	- machine phones home but not allocated (failed-machine-reclaim)

All issues listed through metalctl machine issues are now documented in our docs. This should make all the issues of this command more understandable and provides ways to resolve these problems adequately.

Also, there is a new command machine power cycle which enables users to hard reset a machine through API, using our go-hal abstraction layer.

More Information

This is only a small extract of what went into our v0.8.0 release.

Please check out the release notes to find a full overview over every change that went part of this release.

As always, feel free to visit our Slack channel and ask if there are any questions. 😄