Purpose
Assess EKS control-plane and nodegroup readiness, with optional kubectl node health checks.
Location
cloud/aws/eks/cluster-healthcheck.sh
Preconditions
- Required tools: bash, aws (and kubectl for kubectl checks)
- Required permissions: eks:DescribeCluster, eks:ListNodegroups, eks:DescribeNodegroup
- Required environment variables: none
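A minimal pre-flight sketch for these preconditions; the cluster name, region, and the DescribeCluster probe are illustrative, not part of the script's interface:

```bash
#!/usr/bin/env bash
# Sketch: confirm required tools are installed and the read-only EKS permissions work.
# CLUSTER and REGION are placeholder values.
set -euo pipefail

CLUSTER="prod-eks"
REGION="us-east-1"

# kubectl is only needed when the healthcheck runs with --kubectl-check.
for tool in bash aws kubectl; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing tool: $tool" >&2
done

# A DescribeCluster call exercises the core read-only permission the healthcheck relies on.
aws eks describe-cluster --name "$CLUSTER" --region "$REGION" \
  --query 'cluster.status' --output text
```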
Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| --cluster-name NAME | Yes | N/A | Cluster name |
| --require-nodegroups | No | false | Fail when nodegroup count is zero |
| --kubectl-check | No | false | Include kubectl connectivity/node readiness checks |
| --kubeconfig PATH | No | default kubeconfig | Kubeconfig path for kubectl checks |
| --max-unready-nodes N | No | 0 | Allowed unready nodes |
| --region REGION, --profile PROFILE | No | AWS defaults | AWS context override |
| --strict | No | false | Fail on WARN status |
| --json | No | false | JSON output |
Scenarios
- Happy path: cluster is ACTIVE, nodegroups healthy, optional node readiness within threshold.
- Common operational path: pre-deployment gate in CI/CD pipelines.
- Failure path: cluster unavailable, non-ACTIVE nodegroups, or kubectl connectivity failures.
- Recovery/rollback path: remediate cluster/nodegroup/network issues and rerun checks.
Usage
cloud/aws/eks/cluster-healthcheck.sh --cluster-name prod-eks
cloud/aws/eks/cluster-healthcheck.sh --cluster-name prod-eks --require-nodegroups --strict
cloud/aws/eks/cluster-healthcheck.sh --cluster-name prod-eks --kubectl-check --kubeconfig ~/.kube/prod-config --max-unready-nodes 1 --json
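As the pre-deployment gate mentioned under Scenarios, the script's exit code can stop a pipeline directly. A hedged sketch for a generic CI step; the cluster name and the evidence file name are illustrative:

```bash
# Sketch: block the deploy stage when the healthcheck reports FAIL (or WARN under --strict).
if ! cloud/aws/eks/cluster-healthcheck.sh --cluster-name prod-eks --require-nodegroups --strict --json > healthcheck.json; then
  echo "EKS healthcheck failed; aborting deployment" >&2
  exit 1
fi
# healthcheck.json can be attached to deployment evidence (see Security Notes).
```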
Behavior
- Main execution flow (see the sketch after these notes):
  - validates AWS CLI and cluster visibility
  - checks cluster status/version/endpoint access/OIDC
  - checks nodegroup count and ACTIVE status
  - optionally checks kubectl connectivity and unready node count
  - emits PASS/WARN/FAIL summary
- Idempotency notes: read-only diagnostics.
- Side effects: none.
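The flow above maps onto standard read-only AWS CLI and kubectl calls. The following is a simplified sketch of the checks, not the script itself; the cluster name is a placeholder and PASS/WARN/FAIL aggregation is omitted:

```bash
#!/usr/bin/env bash
# Sketch of the check sequence; the real script adds argument parsing and status aggregation.
set -euo pipefail
CLUSTER="prod-eks"

# Control-plane status, version, endpoint access, and OIDC issuer.
aws eks describe-cluster --name "$CLUSTER" \
  --query 'cluster.{status:status,version:version,publicEndpoint:resourcesVpcConfig.endpointPublicAccess,oidc:identity.oidc.issuer}'

# Nodegroup count and per-nodegroup status.
for ng in $(aws eks list-nodegroups --cluster-name "$CLUSTER" --query 'nodegroups[]' --output text); do
  echo "$ng: $(aws eks describe-nodegroup --cluster-name "$CLUSTER" --nodegroup-name "$ng" \
    --query 'nodegroup.status' --output text)"
done

# Optional kubectl connectivity and unready node count.
kubectl get nodes --no-headers | awk '$2 != "Ready" {count++} END {print count+0, "unready nodes"}'
```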
Output
- Standard output format: table by default, JSON with --json.
- Exit codes:
  - 0: no FAIL checks (and no WARN in strict mode)
  - 1: FAIL checks present, or WARN in strict mode
  - 2: invalid arguments
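A hedged sketch of how the exit code can follow from aggregated results; fail_count, warn_count, and STRICT are assumed names, not taken from the script:

```bash
# Sketch: map aggregated check results to the documented exit codes.
exit_code=0
if (( fail_count > 0 )); then
  exit_code=1                                   # any FAIL check fails the run
elif [[ "$STRICT" == "true" ]] && (( warn_count > 0 )); then
  exit_code=1                                   # WARN escalates to failure in strict mode
fi
exit "$exit_code"                               # exit code 2 is returned earlier, on invalid arguments
```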
Failure Modes
- Common errors and likely causes:
- cluster not found/inaccessible
- nodegroups not ACTIVE
- kubectl context/auth/network issues
- Recovery and rollback steps:
- verify AWS account/region/profile
- resolve nodegroup or control-plane issues
- refresh kubeconfig and retry kubectl checks
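For the kubectl-related recovery step, refreshing the kubeconfig is typically done with the standard AWS CLI helper; the cluster name, region, and kubeconfig path below are placeholders:

```bash
# Regenerate the kubeconfig for the cluster, then rerun the kubectl checks.
aws eks update-kubeconfig --name prod-eks --region us-east-1 --kubeconfig ~/.kube/prod-config
cloud/aws/eks/cluster-healthcheck.sh --cluster-name prod-eks --kubectl-check --kubeconfig ~/.kube/prod-config
```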
Security Notes
- Secret handling: no secret output; kubeconfig access should be controlled.
- Least-privilege requirements: read-only EKS permissions for healthcheck path.
- Audit/logging expectations: healthcheck results should be attached to deployment evidence.
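A least-privilege policy for the healthcheck path could look like the sketch below; the policy name is illustrative and Resource can be narrowed to specific cluster and nodegroup ARNs:

```bash
# Sketch: create a read-only IAM policy covering only the EKS calls listed under Preconditions.
cat > /tmp/eks-healthcheck-readonly.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListNodegroups",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy \
  --policy-name eks-healthcheck-readonly \
  --policy-document file:///tmp/eks-healthcheck-readonly.json
```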
Testing
- Unit tests:
- status aggregation and strict-mode behavior
- Integration tests:
- healthy/unhealthy cluster fixtures in sandbox
- Manual verification:
- compare output with direct describe-cluster/describe-nodegroup and kubectl node status
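For the manual comparison, the corresponding direct calls are roughly as follows; the cluster and nodegroup names are placeholders:

```bash
# Control-plane status as reported by EKS.
aws eks describe-cluster --name prod-eks --query 'cluster.status' --output text

# Nodegroup list, and the status of one nodegroup.
aws eks list-nodegroups --cluster-name prod-eks --output text
aws eks describe-nodegroup --cluster-name prod-eks --nodegroup-name workers \
  --query 'nodegroup.status' --output text

# Node readiness as seen by kubectl.
kubectl get nodes -o wide
```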