Running in production¶
Review each section below before directing production traffic to Coxswain.
Replicas and availability¶
The Helm chart defaults to replicaCount: 1, which is fine for evaluation but inadequate for production: a single replica is a single point of failure, and the default PodDisruptionBudget (maxUnavailable: 1) combined with one replica means a voluntary disruption can take the entire data plane offline. Run at least two replicas. Leader election only coordinates status writes, so all replicas serve traffic independently.
helm upgrade coxswain oci://ghcr.io/coxswain-labs/charts/coxswain \
  --namespace coxswain-system \
  --set replicaCount=2
Verify the PodDisruptionBudget is in place:
kubectl -n coxswain-system get pdb
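For orientation, a PodDisruptionBudget with the default described above would look roughly like the following. This is a sketch, not the chart's rendered output: the resource name and the selector label are assumptions, so compare it against what `helm template` produces for your release.

```yaml
# Sketch only: name and selector are assumed, not copied from the chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coxswain              # hypothetical name
  namespace: coxswain-system
spec:
  maxUnavailable: 1           # the default stated above
  selector:
    matchLabels:
      app.kubernetes.io/name: coxswain   # assumed pod label
```

With one replica, maxUnavailable: 1 lets a node drain evict the only pod. With two replicas, the same budget keeps at least one pod serving during a voluntary disruption.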
Pod anti-affinity is not set by default. Add it via your values.yaml to spread replicas across nodes:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: coxswain
Resource requests and limits¶
The Helm chart defaults are sized for evaluation: requests 100m CPU / 128Mi memory, limits 500m CPU / 256Mi memory. Adjust for your expected traffic:
| Traffic level | CPU request | Memory request | Proxy threads |
|---|---|---|---|
| Light (< 1k rps) | 100m–250m | 128Mi | 2 |
| Medium (1k–10k rps) | 500m–1 | 128Mi–256Mi | 4 |
| Heavy (> 10k rps) | 2–4 | 256Mi–512Mi | ≥ CPU core count |
Set proxy.threads to match the CPU cores allocated to the container:
helm upgrade coxswain oci://ghcr.io/coxswain-labs/charts/coxswain \
  --namespace coxswain-system \
  --set resources.requests.cpu=500m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=2 \
  --set resources.limits.memory=512Mi \
  --set proxy.threads=4
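The same settings can be kept in a values file instead of repeated --set flags, which is easier to review and version. The fragment below is a direct translation of the flags above plus the replica count from earlier on this page; the file name is illustrative.

```yaml
# values-production.yaml (file name is illustrative)
replicaCount: 2
resources:
  requests:
    cpu: 500m
    memory: 128Mi
  limits:
    cpu: "2"          # quoted so it is always read as a quantity string
    memory: 512Mi
proxy:
  threads: 4
```

Pass it to the same helm upgrade command with `-f values-production.yaml` in place of the --set flags.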
Health probes¶
The default Helm chart wires both probes automatically. Do not disable them:
- Readiness (/readyz, port 8081): Coxswain reports Ready only after the initial routing table is built. Kubernetes will not send traffic to the pod until this probe passes.
- Liveness (/healthz, port 8081): always returns 200 while the process is running.
Verify the probes are present on the deployed pod:
kubectl -n coxswain-system get deploy coxswain \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
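As a reference for what to look for, the probes described above would correspond to a container spec fragment along these lines. Only the paths and the port come from this page. The timing fields (periodSeconds, failureThreshold, and so on) are left out because the chart's actual values are not stated here.

```yaml
# Assumed shape of the container's probes, based on the paths and port above.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
```

To inspect the liveness probe as well, repeat the kubectl command with `livenessProbe` in the jsonpath expression.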
Graceful shutdown¶
On SIGTERM, Coxswain drains in-flight connections for --proxy-shutdown-grace-period (default 30s), then forcibly closes any remaining connections after --proxy-shutdown-timeout (default 5s). Make sure the grace period aligns with your load balancer's connection draining timeout.
For long-lived connections (WebSocket, SSE), increase the grace period:
--proxy-shutdown-grace-period=60s
--proxy-shutdown-timeout=10s
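When raising these values, also check the pod's terminationGracePeriodSeconds. Kubernetes defaults it to 30 seconds and sends SIGKILL once it expires, regardless of what the process is doing. The fragment below is a sketch that assumes the grace period and the timeout run one after the other. The container name is hypothetical, and whether the chart exposes this field as a value is not covered on this page.

```yaml
# Pod template fragment (sketch). Container name is hypothetical.
spec:
  template:
    spec:
      # Should exceed grace period + timeout (60s + 10s here); otherwise the
      # kubelet may send SIGKILL before draining has finished.
      terminationGracePeriodSeconds: 75
      containers:
        - name: coxswain
          args:
            - --proxy-shutdown-grace-period=60s
            - --proxy-shutdown-timeout=10s
```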
Status address¶
Set --status-address to the external IP or hostname of your load balancer. Without it, Ingress.status and Gateway.status.addresses are left empty, which breaks cert-manager HTTP-01 challenges and external-dns.
--status-address=203.0.113.10
# or
--status-address=lb.example.com
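Once the flag is set, the address should appear in the status of the resources Coxswain manages. The excerpts below show the standard Ingress and Gateway API status shapes for the IP example above. They are an illustration of what to expect, not captured output.

```yaml
# Ingress: status.loadBalancer.ingress
status:
  loadBalancer:
    ingress:
      - ip: 203.0.113.10        # a hostname would use "hostname: lb.example.com"
---
# Gateway: status.addresses
status:
  addresses:
    - type: IPAddress           # a hostname would use "type: Hostname"
      value: 203.0.113.10
```

You can check either one with `kubectl get ingress <name> -o yaml` or `kubectl get gateway <name> -o yaml`.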
TLS¶
TLS Secrets must be in the correct namespace — for Ingress, the same namespace as the Ingress object; for Gateway, the same namespace as the Gateway unless a ReferenceGrant permits cross-namespace access. See the TLS guide for cert-manager setup.
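For the cross-namespace Gateway case, a ReferenceGrant lives in the namespace that holds the Secret and names the namespace the Gateway is in. The sketch below uses the standard Gateway API resource. All names and namespaces are hypothetical, and the API version should match the Gateway API CRDs installed in your cluster.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-tls        # hypothetical
  namespace: certs               # namespace that holds the Secret (hypothetical)
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: Gateway
      namespace: edge            # namespace of the Gateway (hypothetical)
  to:
    - group: ""                  # core API group
      kind: Secret
      name: example-tls          # optional; omit to allow every Secret in this namespace
```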
Observability¶
Configure a Prometheus scrape against the admin port (8082) — see the Observability reference for the ServiceMonitor and scrape_config examples. Alert on /readyz returning non-200 for more than one scrape interval.
Set --log-format=json for structured log ingestion and --log=warn in production to reduce noise.
RBAC¶
The default ClusterRole grants Coxswain cluster-wide:
- Read on services, endpoints, secrets, configmaps (core API group) and endpointslices (discovery.k8s.io).
- Read on ingresses, ingressclasses (networking.k8s.io).
- Read on gatewayclasses, gateways, httproutes, referencegrants, backendtlspolicies (gateway.networking.k8s.io).
- Status writes (*/status) on ingresses, gateways, httproutes, backendtlspolicies, and gatewayclasses.
A separate namespaced Role (in coxswain-system) grants get, create, patch on coordination.k8s.io/leases — used only for leader election. Review the rendered manifests with helm template or read deploy/manifests/rbac.yaml before deploying.
If Coxswain should only manage resources in a single namespace, set controller.watchNamespace. Note that this only restricts what the controller reads; the chart still installs the cluster-wide ClusterRole/ClusterRoleBinding. To scope RBAC as well, edit the rendered manifests by hand.
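If you do scope RBAC by hand, a namespaced Role can replace the cluster-wide rules for namespaced resources. The following is a rough sketch under several assumptions: the Role name and namespace are hypothetical, "read" is taken to mean get, list, and watch, and the status verbs are a guess. Check each rule against the rendered ClusterRole before relying on it.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: coxswain-reader          # hypothetical
  namespace: my-apps             # the watched namespace (hypothetical)
rules:
  - apiGroups: [""]
    resources: [services, endpoints, secrets, configmaps]
    verbs: [get, list, watch]
  - apiGroups: [discovery.k8s.io]   # EndpointSlice is served from this group
    resources: [endpointslices]
    verbs: [get, list, watch]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses/status]
    verbs: [update, patch]          # assumed verbs for status writes
  - apiGroups: [gateway.networking.k8s.io]
    resources: [gateways, httproutes, referencegrants, backendtlspolicies]
    verbs: [get, list, watch]
  - apiGroups: [gateway.networking.k8s.io]
    resources: [gateways/status, httproutes/status, backendtlspolicies/status]
    verbs: [update, patch]          # assumed verbs for status writes
```

IngressClass and GatewayClass are cluster-scoped, so access to them (including gatewayclasses/status) still requires a ClusterRole. The Role also needs a RoleBinding to the service account Coxswain runs as.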
Signed image verification¶
Every release image is signed with cosign. Verify the signature before deploying to a production cluster:
cosign verify \
  --certificate-identity-regexp \
  "https://github.com/coxswain-labs/coxswain/.github/workflows/release.yml" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ghcr.io/coxswain-labs/coxswain:vX.Y.Z
See Verifying releases for the cosign verification flow for both the image and the Helm chart.