by Gerrit Schwerthelm on Sep 24, 2021
metal-stack v0.8.0 has been released and as always, there are many new things to show. 🤩
Check out the direct link to the release here.
Gardener 1.19 Compatibility
It has been a long time since we were last able to update our Gardener dependencies, but metal-stack finally supports running with Gardener 1.19, including the API Server SNI feature!
A lot of effort went into our audit logging solution (kudos to @mreiger 🚀). Even though we are still working with Gardener on more general approaches to audit logging, we can currently offer the following functionality, all of it “toggle-able” through the control plane feature gates in our shoot spec:
- Deployment of an audit-forwarder, streaming the API server's audit events directly into the shoot cluster
- Configuration for sending logs directly to Splunk (this needs to be enabled through the controller configuration and can also be reconfigured from a user through a secret in the shoot cluster’s
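Here is a minimal Python sketch of how these toggles combine. The gate names `cluster_audit` and `audit_to_splunk`, the `splunk_allowed_by_controller` flag and the component names are hypothetical stand-ins. They are not the actual fields of the shoot spec or of the controller configuration. The only behavior taken from the text is that Splunk output must be enabled on the controller side before a shoot can use it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlPlaneFeatures:
    """Hypothetical stand-in for the audit-related control plane feature gates of a shoot."""
    cluster_audit: bool = False    # deploy the audit-forwarder
    audit_to_splunk: bool = False  # ship audit logs to Splunk


def audit_components(features: ControlPlaneFeatures, splunk_allowed_by_controller: bool) -> List[str]:
    """Return the audit components that would be rolled out for one shoot (illustrative only)."""
    components: List[str] = []
    if features.cluster_audit:
        # Stream the API server's audit events into the shoot cluster.
        components.append("audit-forwarder")
    # Splunk output requires both the operator's controller configuration
    # and the shoot's own feature gate to be switched on.
    if features.audit_to_splunk and splunk_allowed_by_controller:
        components.append("splunk-output")
    return components
```

Keeping the operator-side switch separate from the shoot-side gate means a user can only opt into Splunk forwarding where the operator has allowed it.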
In addition to the logging features, we are also moving forward with our duros-controller, an operator that provides LightOS NVMe/TCP storage by Lightbits Labs to our shoot clusters. The operator is now capable of automatic API token renewal and managed resource health checks, and it now defines a proper cleanup and deletion flow.
We will present this storage solution in one of the upcoming Gardener Community Meetings, so please check the #gardener channel in the Kubernetes Slack workspace and save the date. 🤓
Big shout-out to our Israeli friends at Lightbits Labs for helping us bring this storage solution to production grade! ❤️
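The duros-controller now renews its API token automatically. A common pattern for this is to renew a token some safety margin before it expires and to requeue the reconciliation accordingly. The Python sketch below shows that generic pattern. The margin, the requeue cap and both function names are assumptions, not the duros-controller's actual logic.

```python
from datetime import datetime, timedelta

RENEWAL_MARGIN = timedelta(minutes=10)  # assumed safety margin before expiry
MAX_REQUEUE = timedelta(minutes=5)      # assumed upper bound between two checks


def needs_renewal(expires_at: datetime, now: datetime, margin: timedelta = RENEWAL_MARGIN) -> bool:
    """Renew once the token is within `margin` of its expiry (or already expired)."""
    return now >= expires_at - margin


def requeue_after(expires_at: datetime, now: datetime,
                  margin: timedelta = RENEWAL_MARGIN,
                  max_wait: timedelta = MAX_REQUEUE) -> timedelta:
    """How long a reconcile loop could wait before it has to look at the token again."""
    until_renewal = expires_at - margin - now
    if until_renewal <= timedelta(0):
        return timedelta(0)  # renew right away
    # Never sleep longer than max_wait, so other changes are still picked up regularly.
    return min(until_renewal, max_wait)
```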
Improvements on Filesystem Layouts
In the last minor release we introduced FSLs (filesystem layouts), which allow users to deploy customizable disk layouts for machine provisioning. With this release, software RAIDs (managed through mdadm) can also be defined in an FSL. This is what an FSL definition may look like:
description: "raid layout example"
- device: "/dev/sda"
- number: 1
- device: "/dev/sdb"
- number: 1
- arrayname: "/dev/md0"
createoptions: ["--metadata", "1.0"]
- path: "/"
- path: "/tmp"
The example illustrates how to put a system’s root partition on the RAID device /dev/md0, where a software RAID 1 is placed on the two disks /dev/sda and /dev/sdb.
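Here is a minimal Python sketch of a sanity check one might run over such a layout. It makes concrete how the disks, the RAID array and the root filesystem relate to each other. The data model (`Raid`, `check_layout`, `parent_disk`) and the member partition names are hypothetical and simplified. This is not metal-stack's FSL schema or its validation code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Raid:
    """Hypothetical, simplified model of a software RAID entry in a layout."""
    arrayname: str       # e.g. "/dev/md0"
    level: int           # 1 = mirroring
    devices: List[str]   # member partitions, e.g. ["/dev/sda1", "/dev/sdb1"]


def parent_disk(partition: str) -> str:
    """'/dev/sda1' -> '/dev/sda'. Only handles sdX-style names, not e.g. '/dev/nvme0n1p1'."""
    return partition.rstrip("0123456789")


def check_layout(disks: List[str], raids: List[Raid], filesystems: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the layout looks consistent."""
    problems: List[str] = []
    known_disks = set(disks)
    arrays = {r.arrayname for r in raids}
    for r in raids:
        member_disks = {parent_disk(d) for d in r.devices}
        unknown = member_disks - known_disks
        if unknown:
            problems.append(f"{r.arrayname}: members on undeclared disks {sorted(unknown)}")
        if r.level == 1 and len(member_disks) < 2:
            # A mirror whose members share one disk does not survive that disk failing.
            problems.append(f"{r.arrayname}: RAID 1 needs members on two different disks")
    for path, device in filesystems.items():
        # A filesystem must live on a RAID array or on a partition of a declared disk.
        if device not in arrays and parent_disk(device) not in known_disks:
            problems.append(f"{path}: device {device} is neither a RAID array nor on a declared disk")
    return problems
```

For the root-on-RAID case, `check_layout(["/dev/sda", "/dev/sdb"], [Raid("/dev/md0", 1, ["/dev/sda1", "/dev/sdb1"])], {"/": "/dev/md0"})` reports no problems.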
Along with the new RAID capabilities, it is now also possible to extend a logical volume over the remaining free space of its volume group. As with disk partitions, this is done by setting the size value to 0. Here is an example:
description: ci runner layout
- device: /dev/nvme0n1
- device: /dev/nvme1n1
- lvmtype: striped
- createoptions: 
- createoptions: 
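Here is a minimal Python sketch of the "size 0 means the rest of the free space" rule. Sizes are in MiB. The function name and the tuple input are hypothetical. The restriction that only one volume per group may use size 0 is an assumption made for this sketch.

```python
from typing import Dict, List, Tuple


def resolve_lv_sizes(vg_size_mib: int, volumes: List[Tuple[str, int]]) -> Dict[str, int]:
    """Resolve logical volume sizes in MiB; a size of 0 means 'use the remaining free space'.

    Assumption for this sketch: at most one volume per group may use size 0.
    """
    fixed = sum(size for _, size in volumes if size > 0)
    if fixed > vg_size_mib:
        raise ValueError("fixed-size volumes do not fit into the volume group")
    if sum(1 for _, size in volumes if size == 0) > 1:
        raise ValueError("only one volume may take the remaining space")
    remaining = vg_size_mib - fixed
    # Fixed sizes are kept as-is; the size-0 volume absorbs whatever is left.
    return {name: (size if size > 0 else remaining) for name, size in volumes}
```

Take a volume group of 1000 MiB with a fixed 200 MiB volume and one size-0 volume. The second volume ends up with the remaining 800 MiB.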
The metalctl machine ipmi command now shows the current power state of a machine, making life easier for metal-stack operators who manage the machine fleet. By default, the bmc-catcher picks up and reports the power status every five minutes.
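The power state is collected periodically rather than queried live, so the displayed value can lag behind the real state by up to one update interval. Here is a minimal Python sketch of how a client might flag a stale reading. The `reported_at` timestamp and the tolerance of two intervals are assumptions, not metalctl behavior.

```python
from datetime import datetime, timedelta

REPORT_INTERVAL = timedelta(minutes=5)  # default bmc-catcher update interval


def power_state_is_fresh(reported_at: datetime, now: datetime,
                         interval: timedelta = REPORT_INTERVAL, slack: float = 2.0) -> bool:
    """Treat a reported power state as fresh if it is at most `slack` report intervals old."""
    return now - reported_at <= interval * slack
```

A tolerance above one interval avoids flagging readings that are only slightly delayed. Anything older than two missed reports most likely points to a problem with the reporting itself.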
In addition, the output of the machine issues command was completely refactored. It is now more operator-friendly: it allows filtering issues and shows the lock description (locking is the usual way for operators to take a machine out of the available machine pool):
$ m machine issues --omit bmc-no-distinct-ipbmc-without-mac
ID POWER LOCK LOCK REASON STATUS LAST EVENT WHEN ISSUES
263d5a00-f10b-11e9-8000-3cecef408994 ● 🔒 interfaces: Error set link eth... 💀 PXE Booting 24d 13h - the machine is not sending events anymore (liveliness-dead)
- machine has an incomplete lifecycle (↻) (incomplete-cycles)
942a5c00-a77f-11e9-8000-ac1f6bd38c5a ● 🔒 CATERR CPU FEHLER 💀 Planned Reboot 92d 22h - the machine is not sending events anymore (liveliness-dead)
00000000-0000-0000-0000-002590b8f968 ● Phoned Home 3s - machine phones home but not allocated (failed-machine-reclaim)
2ca51200-bdfa-11e9-8000-3cecef23002c ● Phoned Home 47s - machine phones home but not allocated (failed-machine-reclaim)
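Here is a minimal Python sketch of the kind of filtering that omitting issue types implies. The issue IDs come from the output above. The function name and the rule that machines with no remaining issues are hidden are assumptions, not a description of metalctl's implementation.

```python
from typing import Dict, Iterable, List


def filter_issues(machines: Dict[str, List[str]], omit: Iterable[str]) -> Dict[str, List[str]]:
    """Drop omitted issue IDs and hide machines with no issues left (assumed behavior)."""
    omitted = set(omit)
    result: Dict[str, List[str]] = {}
    for machine_id, issues in machines.items():
        remaining = [issue for issue in issues if issue not in omitted]
        if remaining:
            result[machine_id] = remaining
    return result
```

For example, omitting `bmc-without-mac` and `bmc-no-distinct-ip` removes a machine whose only issue is `bmc-without-mac`. A machine with `liveliness-dead` and `incomplete-cycles` stays unchanged.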
All issues listed by metalctl machine issues are now documented in our docs. This should make them easier to understand, and the docs describe how to resolve each of these problems adequately.
There is also a new command, machine power cycle, which enables users to hard-reset a machine through the API, using our go-hal abstraction layer.
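The point of a hardware abstraction layer is that the API handler only talks to a vendor-neutral interface, while vendor-specific implementations do the actual work. Here is a minimal Python sketch of that pattern. `BMC`, `FakeVendorBMC` and `handle_power_cycle` are hypothetical names and do not reflect go-hal's actual (Go) API. The fake implementation only records calls instead of talking to real hardware.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BMC(ABC):
    """Vendor-neutral view of a machine's baseboard management controller (hypothetical)."""

    @abstractmethod
    def power_cycle(self) -> None:
        """Hard-reset the machine."""


class FakeVendorBMC(BMC):
    """Stand-in for a vendor-specific implementation; records calls instead of resetting hardware."""

    def __init__(self, vendor: str) -> None:
        self.vendor = vendor
        self.calls: List[str] = []

    def power_cycle(self) -> None:
        self.calls.append("power_cycle")


def handle_power_cycle(machine_id: str, bmcs: Dict[str, BMC]) -> str:
    """API-side handler: look up the machine's BMC and reset it through the abstraction."""
    bmc = bmcs.get(machine_id)
    if bmc is None:
        raise KeyError(f"unknown machine {machine_id}")
    bmc.power_cycle()
    return f"power cycle issued for {machine_id}"
```

Supporting a new hardware vendor then only means adding another `BMC` implementation. The API handler stays untouched.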
This is only a small excerpt of what went into our v0.8.0 release.
Please check out the release notes for a full overview of every change that went into this release.
As always, feel free to visit our Slack channel and ask if there are any questions. 😄