Terraform and deployments
To deploy all the resources on our kubernetes cluster, we use terraform (actually opentofu), which tries to be the language in which you can deploy anything and everything, so that if you need to quickly redeploy a whole machine you can do so with a simple command. In practice it doesn't quite work like that, but it's good enough for our needs.
You can find the terraform repo on gitlab, this page was originally taken from the README as it got too long.
Other resources
Basically that's it really; the rest of this page looks at how to deploy an application with our configuration.
Terraform and its woes
Before we get into making an app, I must briefly explain terraform and its benefits/issues/confusing behaviours.
This expects you to have a rough idea of how terraform works, but here is a quick explainer: terraform is built around resources, provided by providers. These resources have a state, stored locally in a state file: whether they are deployed, their generated values, etc. (note that resources can be literally anything, from random passwords to HTTP requests to kubernetes resources to DNS records). These resources can then be organised into modules, which (can) have outputs built from values the resources generate. There are also "data" sources, which are basically a reference to another resource but without the controls.
When deploying, terraform will check the state of all the currently deployed modules (even pinging servers if needed), find anything that has changed (e.g. new resources or updated values), and deploy those changes.
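For context, the day-to-day workflow is just a couple of commands (shown with the opentofu CLI; the terraform binary behaves the same):

```shell
# Preview: refresh state, diff it against the config and print the
# planned creates/updates/destroys without touching anything
tofu plan

# Build the dependency graph and apply the changes (asks for confirmation)
tofu apply
```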
Module structure
Modules are the core of terraform and can be a bit tricky to get your head around, as initially they are quite limited (e.g. there is no such thing as a global variable).
But basically, a module has a list of inputs and a list of outputs (and then providers). So it is expected that your module deploys some resources, which are then used to output something. E.g. in our ldap user module, the module generates a random password, creates the user and assigns them to a group, then outputs the username, email and password to be used later in another resource.
So, for example:

```hcl
module "user" {
  source = "../../utils/ldap/user"

  display   = "Example"
  username  = "example"
  group_ids = var.ldap_group_ids

  providers = {
    // Note: passing providers acts somewhat like passing global constants,
    // carrying configuration (e.g. which ldap server we mean) into the module
    lldap = lldap
  }
}

// You can then use the email via module.user.email, or the password via module.user.password
```

NOTE: All files within a module (the folder) are treated as one global namespace, similar to how Go packages work. You can reference variables, locals and resources throughout all files within a module, which makes it quite difficult to organise nicely.
The modules are usually structured in this way:

- init.tf — Usually where your providers go and, if you are lazy (like me), everything else
- vars.tf/variables.tf — Where you put all your variables. As explained in Variable madness, I really don't like this and so commonly ignore it
- outputs.tf — All your outputs go here
- *.tf — Anything else, if you want to split things out nicely into other files
Variable madness
Terraform variables suck.
Anyway, terraform requires a full definition for every variable:

```hcl
variable "my_var" {
  type        = string
  description = "Something"
  nullable    = false
  sensitive   = false # If something is sensitive MAKE THIS TRUE
}

// You can then reference it later with var.my_var
```

This defines something that must be input by the user, either through the module or your tfvars file (if in the root directory).
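For root variables, the corresponding tfvars entry is then just an assignment; a minimal sketch (terraform.tfvars is the default filename picked up automatically):

```hcl
# terraform.tfvars
my_var = "some value"
```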
Due to this verbosity, and the sometimes complex nature of the interfaces I like to create, I have used the object type, e.g.:

```hcl
variable "my_var" {
  type = object({
    l      = list(string)
    m      = map(bool)   # string -> bool. Same syntax as object, just more flexible
    s      = set(string) # Yes, this is different to list but uses the same []
    option = optional(string, "my_default")
  })
  // ...
}
```

But even then you can't define nice defaults for each sub-item, and the description is for the whole variable, so it has to be written like this. This is just raw pain and not particularly great syntax in my opinion. Also, you cannot mark individual sub-items as sensitive, which means the whole object must be defined as sensitive if it contains even one password.
And due to the lack of global constants, you must define every variable in every sub project and duplicate the types (yes there is no way to define a type to use throughout the project).
It is also recommended that you put all variables in a vars.tf file. Which sure does make sense for small modules, but if it's that small I find it easier to just chuck them at the top of the init.tf file (as the terraform syntax highlighter is soooo broken). Then if it's large, I find it more useful to put the variables where they are actually used — but then again this is confusing because the syntax and tooling are so bad.
Oh yeah, sorry, and then there are locals, which are constants you can define from resources/variables and will be calculated when the information is ready. E.g.

```hcl
locals {
  temp_val = "hi"
}

// Then you can reference it with local.temp_val
```

Depends on and its pains
One issue with terraform is that it's really slow on large projects, and the nature of the design encourages large projects (as you want to reference things throughout the smaller apps).
This is due to it having to create a dependency graph where objects wait on their dependencies. These dependencies can be defined by depends_on in any resource or just referencing a value from another resource.
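As a sketch of the two styles (the resource names here are illustrative):

```hcl
// Implicit dependency: referencing an attribute of the secret means
// terraform will create the secret before this config map
resource "kubernetes_config_map_v1" "app" {
  metadata {
    name      = "app-config"
    namespace = "default"
  }
  data = {
    SECRET_NAME = kubernetes_secret_v1.secret.metadata[0].name
  }
}

// Explicit dependency: use depends_on when nothing in this resource
// references the other one but the ordering still matters
resource "kubernetes_config_map_v1" "other" {
  metadata {
    name      = "other-config"
    namespace = "default"
  }
  data       = {}
  depends_on = [kubernetes_secret_v1.secret]
}
```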
This is really useful, as deployments are not actually rolled out until all the secrets are deployed. Due to my preference for not repeating myself, I heavily use the dependency inferred from referencing resource attributes, e.g. kubernetes_secret_v1.secret.metadata[0].name (yes, this is why I don't just do the simple thing and use the shorter name; it's good to know where the value comes from).
BUT you cannot always rely on this, as some things take time to actually deploy even after they report success. Therefore you may need to use timers instead. I usually don't bother: the numerous other issues with terraform mean a from-scratch deploy is never perfect anyway, so there's minimal point making it easy.
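If you do want a timer, the usual tool is the time_sleep resource from the hashicorp/time provider; a hedged sketch:

```hcl
// Wait a further 30 seconds after the helm release reports success,
// for anything the chart spins up asynchronously
resource "time_sleep" "wait_for_app" {
  create_duration = "30s"
  depends_on      = [helm_release.my_app]
}

// Resources that need the app to actually be up can then use
// depends_on = [time_sleep.wait_for_app]
```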
Timeouts
If something goes wrong during deployment, e.g. you make a typo, you will have to wait the FULL timeout time. This is really painful when you typo the hostname to the db causing the pod to crashloop in a helm config and you have to wait 10 minutes for terraform to give up. You can Ctrl-C, Ctrl-C, but this causes more issues as you will have to manually intervene and delete the helm chart/deployment before you run the command again.
Instead I recommend shortening the timeouts for the deployment/helm to one more applicable to the application. A lot of our first-party stuff usually deploys in a few seconds and if it doesn’t, something has gone very wrong.
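For helm, this is the timeout argument on helm_release, in seconds (the provider default is 300):

```hcl
resource "helm_release" "my_app" {
  // ...

  // Fail after 60 seconds instead of the default 5 minutes
  timeout = 60
}
```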
Commas or no commas?
The terraform syntax is… interesting. Commas are optional in most cases, so I would recommend not typing commas where they are optional.
BUT within lists/sets (basically between []) you have to type commas, and if this is across multiple lines PLEASE ADD TRAILING COMMAS. The reason? Git histories look sooooooo much better.
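So, for example:

```hcl
locals {
  // No commas needed between arguments...
  a = 1
  b = 2

  // ...but required inside [], with a trailing comma on multi-line
  // lists so later additions only touch one line in the git diff
  hosts = [
    "app.bathcs.com",
    "app.k8s.bathcs.com",
  ]
}
```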
How to create a basic project
Creating a namespace
There is a handy util module for this:
module "example_namespace" { source = "./utils/namespace"
name = "example"
enable_dns = true enable_mail = true enable_lldap = true bkp = { // ... }
providers = { kubernetes = kubernetes }}This allows you to enable or disable features for your namespace, e.g. if your pods need to communicate with the outside world, enabling the DNS. All of these features are disabled by default and its heavily encouraged to only enable the features if the namespace needs it.
The next thing to configure through this is backups. To reduce our dependence on, and costs from, our s3 provider, it is recommended that backups are disabled for all namespaces whose data can be regenerated (e.g. froom). If you do enable it, it is then encouraged that you disable backups for any database or pvc that you don't need backing up, with the k8up.io/backup=false annotation (you may notice that all valkey instances set this by default if you are using the app/valkey module).
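For example, opting a statefulset volume out of backups looks something like this (a sketch using the annotation named above):

```hcl
// Inside a kubernetes_stateful_set_v1 spec
volume_claim_template {
  metadata {
    name = "cache"
    annotations = {
      // Tell k8up to skip this volume; its contents can be regenerated
      "k8up.io/backup" = "false"
    }
  }
  spec {
    access_modes = ["ReadWriteOnce"]
    resources {
      requests = {
        storage = "1Gi"
      }
    }
  }
}
```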
Placement
Within this repository, it is tradition to put the namespace creation at the highest level, e.g. 20_apps.tf. This means that the apps themselves do not control the namespace they are created in. It is mostly just a personal preference from me after years of configuring k8s on terraform.
Using Helm
Helm is by far the easiest way to deploy third-party tools, and is used throughout this repo despite its drawbacks when combined with terraform (it's just so easy).
You just add helm to the required providers list (which defines which terraform providers you are integrating with):

```hcl
terraform {
  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "~>3.1.1"
    }
  }
}
```

You can then use the helm_release resource, which takes the form:
resource "helm_release" "my_app" { name = "my_app" namespace = var.namespace
repository = "https://charts.example.com" chart = "the_app" version = "version"
values = [yamlencode({ // Values go here written within the terraform config language })]}This deploys all the resources an application needs and manages and restarting if a secret or config map changes, providing all the configuration at your fingertips.
However this comes at a cost:
- SECRETS SHOULD NOT GO IN THE HELM CONFIG. This is a big one: all values are easily accessible unencrypted on the cluster, therefore any secrets MUST go in a kubernetes_secret_v1 object, and you should use a secretRef or similar to link it. If the helm chart does not support this, DO NOT USE IT. Helm also stores the history of all values, therefore if you put a secret in the helm values temporarily for testing, you must do a password rotation.
- You have no power over the types of resources or the structure in which they are deployed. This means that if a feature, or support for our strict network policies, is not implemented, you have to either not use the helm chart completely or fork your own (which we definitely don't want to do).
- If the helm chart gets deleted, all related pvcs might also get deleted (unless they have the Retain policy, which should be the case for everything).
- Sometimes they don't have the proper security contexts/network policies by default, so you will have to add them yourself (see the sections below).
Overall, helm is pretty good, just use with caution and understand what templates you are inflicting. Note, you will probably have to get pretty good at reading not only default values, but schemas and the templating language of helm itself, as sometimes the charts are not particularly well documented.
Using kubernetes
For this you need to understand a bit about how kubernetes is structured. I will assume that you are deploying a pod. If that pod needs storage attached (and not through SQL or Redis), then you will need to use a StatefulSet. If you have no storage, or are just communicating with a postgres server or redis, you can instead use a Deployment.

The difference between these two concepts is not particularly visible in the world of a single-node cluster, but basically: deployments are free to spin up another pod even if the previous one is still terminating or the node is non-responsive. On the other hand, statefulsets must ensure that no two pods are trying to access the same data, and therefore cannot automatically start replacements if a node goes down.

This also means that it is much easier to scale a deployment to multiple nodes, vs a statefulset, which must have separate volumes per pod.
Anyway, both statefulsets and deployments have a template configuration for creating the pods associated with them. This template gives pods labels, which the parent uses to monitor and track its associated pods. So the structure is as follows:
resource "kubernetes_deployment_v1" "my_deployment" { metadata { name = "my_deployment" namespace = var.namespace } spec { replicas = 1 selector { match_labels = { app = "the_deployment" } } // This doesn't really matter in a one node cluster with a replicas = 1 strategy { type = "RollingUpdate" } template { metadata { labels = { app = "the_deployment" } } spec { container { name = "my_deployment" image = "bathbcss/my_image:latest" image_pull_policy = "Always" // Should only be set if the above is "latest"
// This should be the default security context to comply with our pod security policies security_context { run_as_user = 1000 run_as_non_root = true allow_privilege_escalation = false seccomp_profile { type = "RuntimeDefault" } capabilities { drop = ["ALL"] } }
// This is how the pod is checked its alive, so if something happens, // e.g. job which causes it to become unresponsive, it will be automatically killed off and replaced liveness_probe { http_get { path = "/healthz" port = 8080 } initial_delay_seconds = 5 period_seconds = 10 }
// This allows the tracking of when the pod starts, so we wait until the pod is ready to receive requests startup_probe { http_get { path = "/healthz" port = 8080 } // 3 * 30 = 90 seconds to start failure_threshold = 30 // If it takes a while to startup, increase this time period_seconds = 3 }
// If exposing a port the port to expose port { container_port = 8080 // Make sure to give it a name so we can use the name in services name = "web" }
// env and env_from definitions } } } }}Quick side note: kubernetes_\*_v1is the preferred resouce name, any resource that does not have\_v1 on the end is deprecated and should not be used.
It is recommended that the version of the image is actually set and latest is not used, however to reduce the admin overhead, for internal projects it can be easier to set to latest with an image pull policy of Always. However this means if you want the latest version, you must have access to the cluster to restart a pod.
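Restarting to pick up a new latest image can be done from k9s, or with kubectl (the deployment and namespace names here are illustrative):

```shell
# Trigger a rolling restart; the Always pull policy re-pulls :latest
kubectl rollout restart deployment/my_deployment -n my-namespace
```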
If you are exposing pods, this should be tied with a Service, as seen below.
Security context
As you will notice in the example above, we have a security context set. This is required by our pod security policies; otherwise it will not deploy. In most cases you can copy either of the two following snippets, depending on whether it is within a kubernetes resource or a helm/kubernetes manifest resource:
resource "kubernetes_deployment_v1" "my_deployment" { // ....
security_context { run_as_user = 1000 run_as_non_root = true allow_privilege_escalation = false seccomp_profile { type = "RuntimeDefault" } capabilities { drop = ["ALL"] } }
// ....}Or
resource "kubernetes_manifest" "my_deployment" { manifest = { // ....
securityContext = { runAsUser = 1000 runAsNonRoot = true allowPrivilegeEscalation = false seccompProfile = { type = "RuntimeDefault" } capabilities = { drop = ["ALL"] } }
// .... }}Note that we are running as a user (not root), setting the default seccompProfile (you should only need the default unless you are doing weird things with the host machine) as well as dropping all capabilities (you may need to add some back in but I will leave to you as you probably know more than me — NOTE: Some are disabled by our pod security policy but can be override with baseline).
Liveness and startup probe
The liveness and startup probes are not necessary, but are nice to have. The liveness probe allows the cluster to detect if a pod becomes unresponsive, and then kill it if that is the case. Whereas a startup probe lets the cluster know exactly when the pod is able to receive requests.
Please see the kubernetes docs on probes for more information on the options, but in most cases the HTTP get option should suffice, which just looks for a 2xx status code.
resource "kubernetes_deployment_v1" "my_deployment" { // ....
liveness_probe { http_get { port = 8080 } initial_delay_seconds = 5 period_seconds = 10 }
startup_probe { http_get { port = 8080 } failure_threshold = 30 period_seconds = 3 }
// ....}Or
resource "kubernetes_manifest" "my_deployment" { manifest = { // ....
livenessProbe = { httpGet = { port = 8080 } initialDelaySeconds = 5 periodSeconds = 10 }
startupProbe = { httpGet = { port = 8080 } failureThreshold = 30 periodSeconds = 3 }
// .... }}Environmental variables
When configuring deployments, you will want to set environment variables. There are a few ways to do it, but note: ANY PASSWORDS/API KEYS GO IN SECRETS, not plain environment variables. I will show examples of how to do this below.
By default the env list can be used to set a single environmental variable e.g.
resource "kubernetes_deployment_v1" "my_deployment" { // ....
env { name = "TEST" value = "my_value" }
// ....}However, this is quite verbose and takes a lot of space, so if you are configuring a lot of variables or have secrets, you will want to use the env_from list. This allows you to reference a config map or secrets (this is the most basic form).
These look like:
resource "kubernetes_secret_v1" "module" { metadata { name = "deployment-db" namespace = var.namespace } data = { DATABASE_URL = module.database.url } type = "Opaque"}
resource "kubernetes_config_map_v1" "module" { metadata { name = "deployment-config" namespace = var.namespace } data = { PUBLIC_VALUE = "yoooo" ROCKET_ADDRESS = "::" }}
resource "kubernetes_deployment_v1" "my_deployment" { // ....
env_from { secret_ref { name = kubernetes_secret_v1.module.metadata[0].name } } env_from { config_map_ref { name = kubernetes_config_map_v1.module.metadata[0].name } }
// ....}Or
resource "kubernetes_manifest" "my_deployment" { manifest = { // ....
envFrom = [ { secretRef = { name = kubernetes_secret_v1.module.metadata[0].name } }, { configMapRef = { name = kubernetes_config_map_v1.module.metadata[0].name } }, ]
// .... }}Note that you can set the value of a environmental variable from a secret on an individual basis, which can be useful if you are storing environmental variables as well as files inside your secret. E.g:
resource "kubernetes_deployment_v1" "my_deployment" { // ....
env { name = "DB_PASSWORD" value_from { secret_key_ref { name = kubernetes_secret_v1.module.metadata[0].name key = "password" } } }
// ....}Third-party CRDs
Now this is where terraform becomes less good. When deploying through the kubernetes API, the provider checks that the CRDs your manifests reference (i.e. the apiVersion and kind) are installed and supported on the kubernetes cluster. This means that you won't even be able to run a plan if they aren't.

Basically it means that you need to comment out manifests that reference these resources until the CRDs are deployed (usually through a helm chart or something).
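A slightly less destructive alternative to commenting things out is gating such manifests behind a flag, so the first apply can run with them disabled; a hedged sketch (crds_installed is a made-up variable):

```hcl
variable "crds_installed" {
  type    = bool
  default = false
}

// With count = 0 the resource has no instances, so the plan-time CRD
// check never runs; flip the flag once the chart has deployed the CRDs
resource "kubernetes_manifest" "needs_crd" {
  count = var.crds_installed ? 1 : 0

  manifest = {
    // ...
  }
}
```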
PVCs
Please remember ANY PASSWORDS/API KEYS/CERTIFICATES GO IN SECRETS, not in the storage (this also means they are configurable by us, and yes, they can be mounted as read-only volumes).
So, kubernetes storage works around persistent volumes, which are requested by persistent volume claims. On our single-node k3s we are just using the k3s filesystem class. This means it's a bit basic, but it does the job.
Things to note:

- You probably shouldn't be manually creating persistent volume claims (and definitely not persistent volumes); use statefulsets instead.
- It's really hard to change persistent volumes after the fact, so please go through a testing phase if you are unsure about anything.
- K3s does not support the storage limit, so please DON'T RELY ON IT to stop abusive behaviour.
- If you are defining one yourself, do not accidentally make your deployment depend on the persistent volume claim, as the pvc will not be created until it is used in something. See the mailserver module for how to handle this. NOTE: If you reference the pvc config in your deployment, terraform will add it automatically to the depends_on list.
- If it is critical data, you will need to manually update the pv to "retain" its data if the pvc gets deleted. This just adds a bit of safety if you mess up a deployment. This can be done in k9s by an admin, updating the persistentVolumeReclaimPolicy to Retain instead of Delete.
So this is how you should be using pvcs, via statefulsets:

```hcl
resource "kubernetes_stateful_set_v1" "module" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    service_name = var.name
    replicas     = 1
    selector {
      match_labels = {
        app = var.name
      }
    }
    template {
      metadata {
        labels = {
          app = var.name
        }
      }
      spec {
        container {
          name  = "my_app"
          image = "bathbcss/my_app:1.0.0"

          // ...

          volume_mount {
            name       = "data"
            mount_path = "/data"
          }
        }
      }
    }
    volume_claim_template {
      metadata {
        name = "data"
      }
      spec {
        access_modes = ["ReadWriteOnce"]
        resources {
          requests = {
            storage = "10Gi"
          }
        }
      }
    }
  }
}
```

Services
Services are the first step to adding an ingress, and help give a common endpoint if you have multiple pods running (though we don't do this). Within the cluster, you can access a service at service_name.namespace.svc.cluster.local. Note that if you are accessing it from within the same namespace, you can just use service_name as the hostname (and it is better on a network policy basis).

For a service you just need a label to select on (note we have to define an app label for the statefulset/deployment anyway, so you can just use this).
So it should look like this:
resource "kubernetes_service_v1" "module" { metadata { name = var.name namespace = var.namespace } spec { port { port = 8080 name = "web" } selector = { app = var.name } type = "ClusterIP" }}And then you can just access the port by service_name:8080. Note that you can also set a target_port if you want to change the port from the deployment and service (but like why?).
Also giving it a name is important, so in the ingress we can just use the name instead of the port number itself, increasing readability.
Network policies
Now comes the pain. If a helm chart has a network policy, use that (but please actually read what permissions it gives).
Network policies either grant or block network "ingress" (traffic going into the pod) and "egress" (traffic going out of the pod). By default, due to our security posture, ingress to our pods is denied (from anything in the cluster), but egress to the outside world is allowed.
You then use either pod selectors or namespace selectors to allow traffic from a pod (which is how enable_dns, enable_lldap and enable_mail work). In most cases egress can be left alone, unless you want to block a pod from doing something in particular.
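For reference, a minimal policy of this shape in terraform (a sketch, not one of our actual policies):

```hcl
resource "kubernetes_network_policy_v1" "allow_web" {
  metadata {
    name      = "allow-web"
    namespace = var.namespace
  }
  spec {
    // The pods this policy protects
    pod_selector {
      match_labels = {
        app = var.name
      }
    }
    policy_types = ["Ingress"]

    ingress {
      from {
        // Only same-namespace pods carrying this label may connect
        pod_selector {
          match_labels = {
            "${var.name}-client" = "true"
          }
        }
      }
      ports {
        port     = "web"
        protocol = "TCP"
      }
    }
  }
}
```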
As talked about in the next section, for postgres and valkey we create these policies, limited to pods with the ${name}-${service}-client=true label, allowing only intra-namespace communication.
If you do have to write one, which I really hope you don’t, please read kubernetes documentation on Network Policies.
Postgres and Redis/Valkey
We have utilities for postgres and redis (through valkey due to redis being really hard to run).
If you require these, we highly recommend the above (though postgres needs moving away from bitnami, due to stability now requiring money). These set up the network policies you need as well as generating secure passwords, and you can look at the outputs.tf file to see the outputs you can use from a module.
NOTE: If using this, for any pod accessing the data, it must be in the same namespace and have the labels ${name}-postgresql-client=true and/or ${name}-valkey-client=true, otherwise the traffic will be denied by the network policies.
E.g.
module "database" { source = "../postgres"
namespace = var.namespace prefix = var.name username = "user" database = "db_name" size = "250Mi"
providers = { helm = helm }}
// Create a secret with module.database.url
resource "kubernetes_deployment_v1" "module" { metadata { name = var.name namespace = var.namespace } spec { selector { match_labels = { app = var.name } } template { metadata { labels = { app = var.name // This **MUST** be defined otherwise you'll get weird errors froom-pg-postgresql-client = true } } // ... } }}Side note: the namespace will also need to have enable_dns to be able to access it
Adding ingress
Adding ingress to a service basically means that you are making it accessible to the outside world.
If you are doing this, please consider security heavily:
- Can any user alter the DB? Are you doing proper type checking on the inputs?
- Do you make sure not to expose any secrets, e.g. db urls?
- Do you really need to expose this pod? Or can you leave it to k9s port forwarding?
Then you need to ask:
- Just expose it to people within the bath network, e.g. on *.k8s.bathcs.com
- Expose it to the whole world on *.bathcs.com (note coordination with backstage, to get them to update their traefik, will be necessary)
Please expose to the absolute minimum set of people.
If exposing to the full network, please note the flow is:

- app.bathcs.com <- You should accept this domain in the ingress
- app.bcss.su.bath.ac.uk <- You should accept this domain in the ingress
- app.k8s.bathcs.com <- This is the hostname the certificate should be issued for (as this is the hostname that backstage is requesting)
Depending on the chosen level, please look at the utils/ingress/* modules, as these handle most of this flow for you.
E.g. making it internal to bath uni only:
module "dns" { source = "../../utils/ingress/dns_flow"
subdomain = var.subdomain # We don't want public one, so don't specify cloudflare_zone_id k8s_cloudflare_zone_id = var.cloudflare_zone_id
# Expanded for effect domains = { uni = var.domains.uni k8s = var.domains.k8s }
providers = { cloudflare = cloudflare, }}
module "ingress" { source = "../../utils/ingress/tls"
name = var.subdomain entry_points = ["websecure"] namespace = var.namespace host = module.dns.hosts.k8s additional_ingress_hosts = [module.dns.hosts.uni] service = { name = kubernetes_service_v1.module.metadata[0].name port = "http" } cert_issuer = var.cert_issuer
providers = { kubernetes = kubernetes, }}If using a helm repo you may have to manually define the certificate, which is quite easy:
resource "kubernetes_manifest" "cert" { manifest = { apiVersion = "cert-manager.io/v1" kind = "Certificate" metadata = { name = "${var.name}-cert" namespace = var.namespace } spec = { secretName = "${var.name}-cert-secret" issuerRef = var.cert_issuer dnsNames = [var.host] } }}Which outputs the cert to the secret with name ${var.name}-cert-secret. You can also use kubernetes_manifest.cert.manifest.spec.secretName (you can guess what I prefer).
If you are making it publicly accessible, you can use the full flow, which does both the dns and ingress records:
module "ingress" { source = "../../utils/ingress/full_flow"
cloudflare_zone_id = var.cloudflare_zone_id namespace = var.namespace domains = var.domains subdomain = var.subdomain
service = { name = kubernetes_service_v1.module.metadata[0].name port = "web" } cert_issuer = var.cert_issuer
providers = { kubernetes = kubernetes, cloudflare = cloudflare }}Putting it behind authelia
If the application doesn't have oauth integration built in and you want to protect it behind authelia, you can add the auth-forwardauth-authelia@kubernetescrd middleware to require users to go through authelia.
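With a plain kubernetes ingress and traefik, that is an annotation on the ingress metadata; a hedged sketch (our ingress modules may already expose a cleaner way to set this):

```hcl
resource "kubernetes_ingress_v1" "app" {
  metadata {
    name      = var.name
    namespace = var.namespace
    annotations = {
      // Route requests through authelia's forward-auth middleware first
      "traefik.ingress.kubernetes.io/router.middlewares" = "auth-forwardauth-authelia@kubernetescrd"
    }
  }
  // ...
}
```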
You will then have to add an access control rule to authelia e.g.
```hcl
rules = [
  {
    domain  = "app.k8s.bathcs.com"
    subject = ["user:hw2210"]
  },
]
```

NOTE: Sadly we can't use groups, so we have to limit by user directly.
Cloudflare dns records
The cloudflare provider is an actual pain. You would've hoped it would be good, but it's not amazing (it has previously caused updates on every apply). When creating a dns record you should use the whole address, e.g. app.bathcs.com instead of app, as otherwise it will cause an update on the second apply, changing the name value (which we use to get the full domain).
It sucks because there's an additional, useless variable on top of the cloudflare_zone_id (yes, you can use a "data" lookup to solve this, but still).
Additionally, for long TXT records it will add extra quotes, and terraform will continually think the record needs changing unless you add the quotes into the data yourself.
Sending mail
Sending mail is somewhat weird.
Basically, just because I can, the authentication is managed by lldap (a really fast ldap implementation). This should not be confused with the ldap service that Authelia is hooked up to, as they are not the same service.
Once you’ve enabled mail in the namespace you then have to create an ldap user:
module "user" { source = "../../utils/ldap/user"
display = "Example" username = "example" group_ids = var.ldap_group_ids
providers = { lldap = lldap }}You can then use this to access the mail with module.user.name, module.user.password and module.user.email. NOTE due to it not being exposed, we must use the kubernetes domain certificate, which means that you normally have to disable tls verification it or set the expected domain to bathcs.com (as seen in authelia).
Creating an oauth client
I have created an OAuth client module for generating all the secret data you need and the client config (under the config output), which you can then pass up and into the authelia module (see grafana as an example).
But basically please read the authelia oidc docs for a full explanation of how it works.