{
  "skills": [
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Deep Ansible operational knowledge covering dangerous module classification, idempotency violations, inventory targeting risks, privilege escalation patterns, and handler ordering pitfalls.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "ansible",
      "path": "skills/ansible",
      "scenario_count": 1,
      "skill_body": "## Dangerous module classification\n\n### Destructive modules (CRITICAL)\n- `file` with `state: absent` \u2014 deletes files or entire directory trees; recursive deletion on wrong path is catastrophic\n- `user` with `state: absent` and `remove: true` \u2014 deletes user account AND home directory with all data\n- `mysql_db` / `postgresql_db` with `state: absent` \u2014 drops the database; irreversible without backup\n- `ec2_instance` / `gce_instance` with `state: absent` \u2014 terminates cloud instances\n- `docker_container` with `state: absent` and `keep_volumes: false` \u2014 destroys container AND its data volumes\n- `k8s` with `state: absent` \u2014 deletes Kubernetes resources; can cascade through owner references\n\n### Non-idempotent modules (HIGH)\n- `command` \u2014 executes arbitrary shell commands; has no built-in idempotency; runs every time regardless of current state\n- `shell` \u2014 same as command but through a shell interpreter; even more dangerous because of pipe/redirect side effects\n- `raw` \u2014 sends raw commands over SSH without module wrapper; no change detection, no error handling, no idempotency\n- `script` \u2014 uploads and executes a local script on remote host; runs every time, no state check\n- `expect` \u2014 automates interactive commands; inherently non-idempotent and fragile\n\n### Sensitive modules (HIGH)\n- `copy` with `content` parameter containing secrets \u2014 secrets end up in Ansible logs, facts cache, and potentially in version control\n- `template` rendering credential files \u2014 verify `mode: '0600'` and `owner` are set; world-readable credential files are a common oversight\n- `lineinfile` / `blockinfile` \u2014 modifies files in place; repeated runs with wrong `regexp` can duplicate lines or corrupt config files\n- `cron` \u2014 installs cron jobs; missing `name` parameter causes duplicate entries on every run\n- `authorized_key` with `exclusive: true` \u2014 removes ALL other SSH keys for 
the user; can lock out other admins\n\n### Safe modules (LOW risk when used correctly)\n- `apt` / `yum` / `dnf` with `state: present` \u2014 idempotent package installation\n- `service` / `systemd` with `state: started/stopped` \u2014 idempotent service management\n- `file` with `state: directory` \u2014 idempotent directory creation; `state: file` only manages attributes of an existing file and never creates one\n- `template` / `copy` with `backup: yes` \u2014 creates backup before overwriting\n\n## Idempotency violations\n\n### Common anti-patterns\n- `command` or `shell` without `creates` or `removes` guards \u2014 runs every time; use `creates: /path/to/output` to skip if file already exists\n- `command` or `shell` without `changed_when` \u2014 always reports \"changed\" even if the command was a no-op; misleads operators about actual system state\n- `shell: \"echo 'line' >> /etc/config\"` \u2014 appends the line on EVERY run; use `lineinfile` instead for idempotent file editing\n- Task using `register` result but no conditional on next task \u2014 downstream tasks always execute regardless of whether the registered command did anything\n- `when: result.rc == 0` after a `command` that always succeeds \u2014 the conditional provides no useful gating\n\n### Missing guard patterns\n- Tasks that should use `when: ansible_facts['os_family'] == 'Debian'` but apply universally \u2014 OS-specific commands break on wrong distribution\n- Package installation without version pinning \u2014 `apt: name=nginx` installs whatever the latest version is; use `name=nginx=1.24.0-1` for deterministic deployments\n- Tasks that check `stat` for file existence but don't use `register` + `when` to gate subsequent steps\n\n## Inventory targeting risks\n\n### Production targeting (CRITICAL)\n- Play with `hosts: all` = CRITICAL \u2014 targets every host in the inventory; in mixed environments this includes production, staging, dev, and infrastructure hosts simultaneously\n- Play targeting production group without `--limit` in CI/CD = HIGH \u2014 
no guardrail against running on all production servers\n- Play targeting parent group that contains production subgroup = HIGH \u2014 `hosts: webservers` may include `prod-webservers` and `staging-webservers`\n- `serial: 100%` or no `serial` on production group = HIGH \u2014 all hosts process simultaneously; a bad task takes down every server at once\n- Missing `max_fail_percentage` \u2014 one host failure doesn't stop the play from destroying remaining hosts\n\n### Safe targeting patterns\n- `serial: 1` or `serial: \"25%\"` for production \u2014 processes hosts in batches; limits blast radius of a bad change\n- `max_fail_percentage: 10` \u2014 stops the play if more than 10% of hosts fail; prevents cascading failure\n- `any_errors_fatal: true` for critical tasks \u2014 stops ALL hosts on the first error anywhere\n- `run_once: true` for tasks that should only execute on one host (database migrations, cluster operations)\n\n## Privilege escalation patterns\n\n### Dangerous escalation (HIGH)\n- `become: true` at play level without `become_user` \u2014 escalates to root for EVERY task in the play, including tasks that don't need root\n- Tasks modifying `/etc/sudoers` or `/etc/sudoers.d/` \u2014 syntax error in sudoers file locks out ALL sudo access; always use `visudo --check` or `validate` parameter\n- Tasks modifying PAM configuration (`/etc/pam.d/`) \u2014 misconfiguration can lock out all logins, including SSH\n- Tasks modifying SSH configuration (`/etc/ssh/sshd_config`) without `validate: 'sshd -t -f %s'` \u2014 syntax error locks out SSH access permanently on next restart\n- Tasks modifying firewall rules (`iptables`, `ufw`, `firewalld`) without ensuring SSH port remains open \u2014 can lock out Ansible itself\n\n### Best practices\n- Use `become: true` at task level, not play level \u2014 only escalate when needed\n- Always use `validate` parameter on critical config files: `validate: 'nginx -t -c %s'`, `validate: 'sshd -t -f %s'`, `validate: 'visudo -cf 
%s'`\n- Use `backup: yes` on critical file modifications \u2014 creates timestamped backup before overwriting\n\n## Variable precedence risks\n\n### Precedence conflicts (MEDIUM)\n- `extra-vars` (-e) always win over everything \u2014 can silently override role defaults, group_vars, and host_vars without warning\n- `group_vars/all` overriding role defaults unexpectedly \u2014 role developers expect their defaults to apply unless explicitly overridden\n- Multiple group memberships causing variable merging conflicts \u2014 host in both `webservers` and `production` groups may get conflicting variable values depending on group file ordering\n- `set_fact` overriding variables mid-play \u2014 subsequent tasks get the new value, but handlers still see the original value\n- `include_vars` with `hash_behaviour: merge` vs `replace` \u2014 merge combines dictionaries, replace overwrites entirely; default is replace, which can lose nested keys\n\n### Variable safety\n- Sensitive variables (`passwords`, `api_keys`, `tokens`) should be in `ansible-vault` encrypted files, not plain text group_vars\n- Set `no_log: true` on tasks that handle secrets \u2014 prevents secret values from appearing in Ansible output and logs\n- `default()` filter on optional variables \u2014 prevents `undefined variable` errors from crashing the play\n\n## Handler ordering risks\n\n### Handler pitfalls (MEDIUM)\n- Handlers execute in DEFINITION order, not notification order \u2014 if handler A is defined after handler B but notified first, B still runs before A\n- Multiple tasks notifying the same handler \u2014 handler runs only ONCE when handlers are flushed (by default at the end of the play's task section), not once per notification; this is usually desired but can surprise operators\n- `flush_handlers` in the middle of a play \u2014 forces all pending handlers to execute immediately; a failing handler fails the host like any failing task, so that host's remaining tasks do not run\n- Handler that restarts a service while a health 
check task follows immediately \u2014 the service may not be ready when the health check runs; add a `wait_for` task after flush\n- Handlers in included roles \u2014 handler names must be globally unique across all roles; duplicate names cause only one handler to execute\n\n### Block error handling\n- `rescue` block only catches task failures, not handler failures \u2014 a failing handler in a `block` is NOT caught by `rescue`\n- `always` block runs regardless of task or rescue outcome \u2014 use for cleanup that must happen (remove temp files, restore config)\n- Nested blocks with error handling \u2014 inner rescue can mask errors from outer block; keep nesting shallow\n\n## Ansible-specific failure modes\n\n- SSH connection timeout on large inventories \u2014 default `forks: 5` means only 5 hosts processed in parallel; increase for large inventories but watch for connection storms\n- `gather_facts: true` (default) adds 5-15 seconds per host \u2014 disable with `gather_facts: false` when facts aren't needed; use `setup` module selectively for specific fact subsets\n- Ansible control node running out of memory on very large inventories (>1000 hosts) \u2014 each host fork uses 50-100MB; limit forks or use `mitogen` strategy for lower memory footprint\n- Python version mismatch between control node and target \u2014 Ansible requires Python on target hosts; Python 2 EOL means some modules behave differently on legacy hosts\n- `ansible.cfg` in current directory overriding system config \u2014 CI/CD runners may pick up unexpected config from repo checkout",
      "skill_id": "ansible",
      "tags": [
        "ansible",
        "automation",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/ansible",
      "token_budget": 1600,
      "tool": null,
      "tool_label": "ansible",
      "trigger_content_patterns": [
        "hosts",
        "tasks",
        "roles",
        "become",
        "ansible",
        "playbook",
        "handlers"
      ],
      "triggers": [
        ".yml",
        ".yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "ArgoCD sync and application-set guidance for GitOps delivery changes across shared clusters.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "argocd",
      "path": "skills/argocd",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- `syncPolicy.automated.prune: true` in shared namespaces can delete resources outside the immediate change set = HIGH\n- `selfHeal: true` can instantly revert emergency hotfixes and confuse incident response = MEDIUM\n- ApplicationSet generator changes fan out to every generated application and can widen blast radius cluster-wide = CRITICAL\n- ArgoCD Project repository or destination allowlists widened to `*` remove key deployment guardrails = HIGH\n\n## Review cues\n\n- Check whether ArgoCD automation settings change the blast radius beyond the named application.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "argocd",
      "tags": [
        "argocd",
        "gitops",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/argocd",
      "token_budget": 1350,
      "tool": null,
      "tool_label": "argocd",
      "trigger_content_patterns": [
        "argoproj.io/v1alpha1"
      ],
      "triggers": [
        "argocd-application.yaml",
        "app-of-apps.yaml",
        "argocd-project.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "AWS CDK guidance for logical IDs, removal policies, and synth-time environment drift.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "aws-cdk",
      "path": "skills/aws-cdk",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Logical ID changes force CloudFormation replacement even when code edits look cosmetic = HIGH\n- `RemovalPolicy.DESTROY` on stateful constructs risks permanent data loss = CRITICAL\n- Context lookups can differ between synth environments and change plans unexpectedly = MEDIUM\n- Broad IAM `PolicyStatement` grants expose entire accounts or regions = HIGH\n\n## Review cues\n\n- Review synthesized logical IDs, removal policies, and IAM grants together for CDK changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "aws-cdk",
      "tags": [
        "aws",
        "cdk",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/aws-cdk",
      "token_budget": 1350,
      "tool": null,
      "tool_label": "aws-cdk",
      "trigger_content_patterns": [
        "aws-cdk-lib",
        "PolicyStatement"
      ],
      "triggers": [
        "cdk.json",
        "cdk.context.json"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Azure Bicep guidance for deployment modes, secret exposure, and subscription-target drift.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "bicep",
      "path": "skills/bicep",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Resource or module renames can change deployment identity and trigger destructive replacement = HIGH\n- `existing` resources pointed at the wrong subscription or resource group create broken references = HIGH\n- Key Vault secrets or sensitive outputs emitted as plain strings leak credentials = CRITICAL\n- Complete-mode deployments delete unmanaged resources in the target scope = CRITICAL\n\n## Review cues\n\n- Review Bicep deployment mode, target scope, and secret handling together before approving.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "bicep",
      "tags": [
        "azure",
        "bicep",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/bicep",
      "token_budget": 1300,
      "tool": null,
      "tool_label": "bicep",
      "trigger_content_patterns": [],
      "triggers": [
        "main.bicep",
        "infra.bicep",
        ".bicep"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Cert-Manager issuance and renewal guidance for issuer, solver, and secret-rotation changes.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "cert-manager",
      "path": "skills/cert-manager",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Replacing a ClusterIssuer can break renewals for every dependent certificate = CRITICAL\n- DNS01 or HTTP01 solver changes can stall issuance and leave certificates to expire = HIGH\n- Secret name rotation without workload coordination causes immediate TLS outages = HIGH\n- Certificate durations outside issuer policy can trigger noisy renewal loops = MEDIUM\n\n## Review cues\n\n- Review issuer scope, solver reachability, and secret consumers together for cert-manager changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "cert-manager",
      "tags": [
        "cert-manager",
        "tls",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/cert-manager",
      "token_budget": 1250,
      "tool": null,
      "tool_label": "cert-manager",
      "trigger_content_patterns": [
        "cert-manager.io/"
      ],
      "triggers": [
        "cert-manager-certificate.yaml",
        "clusterissuer.yaml",
        "issuer.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Deep CloudFormation risk intelligence covering resource replacement detection, deletion policies, drift patterns, stack dependencies, IAM resource risks, and service quota awareness.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "cloudformation",
      "path": "skills/cloudformation",
      "scenario_count": 1,
      "skill_body": "## Resource update behavior\n\n### Replacement-required updates (CRITICAL)\n- Any resource change that triggers `Replacement` instead of `Update` = CRITICAL \u2014 CloudFormation destroys the old resource and creates a new one; for stateful resources (RDS, DynamoDB, ElastiCache) this means DATA LOSS unless DeletionPolicy is Retain or Snapshot\n- Common replacement triggers:\n  - RDS: `Engine`, `DBInstanceIdentifier`, `MasterUsername`, `AvailabilityZone`, `StorageEncrypted` changes all force replacement\n  - DynamoDB: `TableName`, `KeySchema` (partition/sort key), `BillingMode` change from PAY_PER_REQUEST to PROVISIONED (on some versions)\n  - EC2: `InstanceType` change on instances without stop/start support, `ImageId` change, `AvailabilityZone` change\n  - ElastiCache: `Engine`, `CacheNodeType` on some cluster modes, `NumCacheClusters` reduction\n  - Lambda: `FunctionName` change forces replacement; `Runtime`, `Handler`, `Code` are in-place updates\n  - S3: `BucketName` change = REPLACEMENT \u2014 and S3 bucket names are globally unique; the old name may not be reclaimable\n\n### In-place update risks (HIGH)\n- RDS `MultiAZ` change = triggers brief failover (60-120 seconds downtime)\n- RDS `AllocatedStorage` increase = online for most engines; DECREASE is not supported and fails\n- EC2 `InstanceType` change = requires stop/start; brief downtime during transition\n- Auto Scaling Group `LaunchTemplate` version change = triggers rolling replacement of instances based on update policy\n- ECS `TaskDefinition` revision change = triggers new deployment; old tasks drain based on `DeregistrationDelay`\n\n### No-interruption updates (LOW)\n- Tag changes on most resources = no interruption, no replacement\n- CloudWatch alarm threshold changes = immediate effect\n- Lambda environment variable changes = next invocation uses new values\n- SNS topic `DisplayName` change = no operational impact\n\n## Deletion policy analysis\n\n### Missing DeletionPolicy 
(CRITICAL)\n- RDS instance without `DeletionPolicy: Snapshot` or `DeletionPolicy: Retain` = CRITICAL \u2014 stack deletion or resource removal destroys the database with no backup; DEFAULT behavior is Delete\n- DynamoDB table without `DeletionPolicy: Retain` = CRITICAL \u2014 table and all data destroyed on removal\n- S3 bucket without `DeletionPolicy: Retain` = HIGH \u2014 bucket must be empty before CloudFormation can delete it; if not empty, stack deletion fails and leaves the stack in DELETE_FAILED state\n- EBS volumes without `DeletionPolicy: Snapshot` = HIGH \u2014 data volumes destroyed without backup\n- ElastiCache without `DeletionPolicy: Snapshot` = MEDIUM \u2014 cache data lost; acceptable if cache is rebuilt from persistent data source\n- Elasticsearch/OpenSearch domain without snapshot = HIGH \u2014 index data lost\n\n### DeletionPolicy changes (HIGH)\n- `DeletionPolicy` changed from `Retain` to `Delete` = CRITICAL \u2014 previously protected resource is now vulnerable to stack deletion\n- `DeletionPolicy` removed entirely = CRITICAL \u2014 reverts to default behavior which is Delete for most resources\n- `UpdateReplacePolicy` missing when DeletionPolicy is set = MEDIUM \u2014 DeletionPolicy protects against stack deletion, but UpdateReplacePolicy protects against in-stack replacement; they serve different purposes and both should be set on stateful resources\n\n## Drift-prone patterns\n\n### Console-modified resources (HIGH)\n- Resources frequently modified via AWS Console (security groups, IAM policies, S3 bucket policies) \u2014 console changes create drift between actual state and template state; next stack update may REVERT console changes silently\n- Resources with `Metadata` section that doesn't match actual metadata = drift indicator\n- Auto Scaling Group with manually adjusted `DesiredCapacity` \u2014 stack update resets to template value\n- RDS instance with manually applied parameter group changes \u2014 stack update may revert to 
template-defined parameter group\n\n### Parameter default masking (MEDIUM)\n- Parameters with `Default` values that operators override via console \u2014 the override is not captured in the template; stack recreation uses the default, not the overridden value\n- `Conditions` that reference parameters \u2014 changing a parameter default can flip conditions, enabling or disabling entire resource branches\n- `Mappings` used with `AWS::Region` \u2014 ensure all required regions have entries; deploying to a new region with a missing mapping entry causes creation failure\n\n## Stack dependency risks\n\n### Cross-stack references (HIGH)\n- `Fn::ImportValue` referencing another stack's `Export` = HIGH \u2014 creates an implicit dependency; the exporting stack cannot be updated to change or remove the export while any stack imports it; this creates stack lock-in\n- `Export` name change = CRITICAL \u2014 all importing stacks immediately break; CloudFormation prevents this but nested reference chains can be complex\n- Stack output used as parameter in another stack (manual wiring) = MEDIUM \u2014 less tightly coupled but requires deployment ordering discipline\n\n### Nested stack risks (HIGH)\n- Nested stack template URL change = HIGH \u2014 triggers nested stack update; ALL resources in the nested stack are re-evaluated\n- Nested stack parameter change = propagates through the nested stack; side effects depend on which resources consume the parameter\n- Parent stack rollback with nested stacks = complex \u2014 nested stack may succeed while parent fails; leaves inconsistent state\n- Nested stack with `DeletionPolicy: Retain` on resources inside = the nested stack itself may be deleted but retained resources persist as orphans outside CloudFormation management\n\n## IAM resource risks\n\n### Overly broad policies (CRITICAL)\n- `AWS::IAM::Policy` with `Action: \"*\"` = CRITICAL \u2014 administrative access to every AWS service\n- `AWS::IAM::Policy` with `Resource: \"*\"` on 
sensitive actions = CRITICAL \u2014 action applies to every resource in the account\n- `AWS::IAM::Role` with trust policy allowing `Principal: \"*\"` = CRITICAL \u2014 any AWS account can assume the role\n- `AWS::IAM::Role` trust policy without `Condition` restricting `sts:ExternalId` for cross-account roles = HIGH \u2014 susceptible to confused deputy attack\n- `AWS::IAM::ManagedPolicy` with `Path: \"/\"` attached to multiple roles = MEDIUM \u2014 broad attachment scope; change affects all attached roles\n\n### IAM boundary and constraints\n- Missing `PermissionsBoundary` on IAM roles created by the stack = MEDIUM \u2014 roles have no upper bound on permissions; a policy change can escalate beyond intended scope\n- `AWS::IAM::InstanceProfile` change = HIGH \u2014 requires instance stop/start or replacement to take effect\n- Service-linked role manipulation = HIGH \u2014 these are managed by AWS services; manual modification can break service functionality\n\n## Service quota risks\n\n### Capacity limits (MEDIUM)\n- Creating multiple resources of the same type \u2014 check against account-level service quotas:\n  - VPCs per region: default 5\n  - Elastic IPs per region: default 5\n  - RDS instances per region: default 40\n  - Lambda concurrent executions: default 1000\n  - S3 buckets per account: default 100\n  - CloudFormation stacks per region: default 2,000\n- Deployments may succeed in development (fewer resources) but hit quota limits in production (more resources)\n- Quota increases require AWS support requests and take 1-5 business days\n\n### Template limits\n- Template body size limit: 51,200 bytes (direct upload) or 1 MB (S3 URL)\n- Resource limit per template: 500 resources \u2014 approaching this limit indicates the stack should be decomposed\n- Output limit: 200 outputs per stack\n- Parameter limit: 200 parameters per stack\n- Mapping limit: 200 mappings with 200 key-value pairs each\n\n## CloudFormation-specific failure modes\n\n### 
Stack update failures\n- Stack in `UPDATE_ROLLBACK_FAILED` state = CRITICAL \u2014 stack is stuck; requires manual intervention via `ContinueUpdateRollback` with resources to skip, or potentially stack recreation\n- Circular dependency between resources = creation failure \u2014 `DependsOn` chains that form a cycle prevent stack creation; CloudFormation detects this at validation time\n- Insufficient IAM permissions for CloudFormation execution role = creation/update failure \u2014 CloudFormation needs permission to create EVERY resource type in the template\n- Resource creation order dependency not expressed in template \u2014 CloudFormation creates resources in parallel unless `DependsOn` specifies ordering; missing DependsOn on dependent resources causes race conditions\n\n### Rollback risks\n- Stack update with `--no-rollback` flag = CRITICAL \u2014 if update fails, stack remains in `UPDATE_FAILED` state with partial changes applied; no automatic recovery\n- Stack creation with `--on-failure DO_NOTHING` = HIGH \u2014 failed stack persists with partial resources; useful for debugging but dangerous in automation\n- Rollback of a stack that created resources with external dependencies (data was written to the new database, DNS was pointed to new load balancer) = the rollback destroys the new resources but the external dependencies are not rolled back\n- Change set with `Replacement: True` on multiple resources = HIGH \u2014 if any replacement fails mid-update, rollback must recreate the ORIGINAL resources which may fail (name conflicts, quota limits)",
      "skill_id": "cloudformation",
      "tags": [
        "cloudformation",
        "aws",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/cloudformation",
      "token_budget": 1500,
      "tool": null,
      "tool_label": "cloudformation",
      "trigger_content_patterns": [
        "AWSTemplateFormatVersion",
        "Resources",
        "AWS::",
        "CloudFormation"
      ],
      "triggers": [
        ".yaml",
        ".yml",
        ".json",
        ".template"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Crossplane composition guidance for control-plane fan-out, provider config, and managed resource safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "crossplane",
      "path": "skills/crossplane",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Composition patch changes on network or database fields can fan out to every bound claim = CRITICAL\n- Removing fields from an XRD schema breaks existing claims and composition compatibility = HIGH\n- ProviderConfig credential-source changes can orphan reconciles or point resources at the wrong account = HIGH\n- `deletionPolicy: Delete` on managed resources means claim removal deletes cloud assets too = CRITICAL\n\n## Review cues\n\n- Review Crossplane changes as control-plane mutations, not just single-resource YAML edits.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "crossplane",
      "tags": [
        "crossplane",
        "platform",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/crossplane",
      "token_budget": 1450,
      "tool": null,
      "tool_label": "crossplane",
      "trigger_content_patterns": [
        "apiextensions.crossplane.io",
        "pkg.crossplane.io"
      ],
      "triggers": [
        "composition.yaml",
        "xrd.yaml",
        "claim.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Datadog monitor guidance for threshold drift, no-data handling, and alert-routing changes.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "datadog-monitors",
      "path": "skills/datadog-monitors",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Loosening alert thresholds or evaluation windows can suppress paging on real incidents = HIGH\n- Disabling `notify_no_data` on heartbeat-style monitors hides telemetry loss = HIGH\n- Removing renotify behavior on sev1 services lengthens incident response = MEDIUM\n- Composite monitor dependency changes can invert alert logic for multiple downstream teams = HIGH\n\n## Review cues\n\n- Review query logic, no-data handling, and alert routing together for Datadog monitor changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "datadog-monitors",
      "tags": [
        "datadog",
        "monitoring",
        "alerts"
      ],
      "test_suite_path": "tests/skill-tests/datadog-monitors",
      "token_budget": 1200,
      "tool": null,
      "tool_label": "datadog-monitors",
      "trigger_content_patterns": [],
      "triggers": [
        "datadog-monitor.json",
        "datadog-monitor.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Container image and build risk knowledge covering Dockerfile security patterns, image provenance, multi-stage build risks, compose file analysis, and runtime container security.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "docker",
      "path": "skills/docker",
      "scenario_count": 1,
      "skill_body": "## Dockerfile risk patterns\n\n### Running as root (CRITICAL)\n- No `USER` instruction in the Dockerfile = CRITICAL \u2014 container runs as root by default; if the container is compromised, the attacker has root-level access within the container and potentially the host (with certain volume mounts or kernel exploits)\n- `USER root` set after a non-root USER instruction = HIGH \u2014 reverts to root for subsequent layers; common mistake when installing system packages late in the build\n- `USER` instruction only in intermediate build stage, not in final runtime stage = CRITICAL \u2014 multi-stage build where the builder runs as non-root but the final image runs as root\n\n### Dangerous instructions (HIGH)\n- `COPY . .` or `ADD . .` without `.dockerignore` = HIGH \u2014 copies EVERYTHING from the build context including `.git/`, `.env`, `node_modules/`, credentials files, and local development configs into the image\n- `ADD` with remote URL = HIGH \u2014 downloads and extracts files from the internet at build time; URL content can change between builds (non-deterministic); use `COPY` + explicit `curl`/`wget` with checksum verification instead\n- `RUN chmod 777` on any directory = HIGH \u2014 world-writable directories inside the container; any process can modify files\n- `RUN apt-get install -y` without `--no-install-recommends` = MEDIUM \u2014 installs unnecessary recommended packages, increasing image size and attack surface\n- `RUN curl | sh` or `RUN wget -O - | sh` = CRITICAL \u2014 executes remote scripts without verification; classic supply chain attack vector; download, verify checksum, then execute as separate steps\n- `ENV` with secret values (passwords, tokens, API keys) = CRITICAL \u2014 environment variables are baked into the image layer metadata; anyone with image access can extract them with `docker inspect` or `docker history`\n\n### Unpinned dependencies (HIGH)\n- Base image with `latest` tag: `FROM node:latest` = HIGH \u2014 
image content changes without notice; builds are non-reproducible; a base image update can break the application silently\n- Base image with major version only: `FROM python:3` = MEDIUM \u2014 resolves to latest 3.x minor and patch; `FROM python:3.11.9-slim-bookworm` is deterministic\n- Base image without digest: prefer `FROM python:3.11.9@sha256:abc123...` = MEDIUM \u2014 even pinned tags can be overwritten in registries; digest pinning guarantees exact image content\n- `RUN pip install package` without version pin = MEDIUM \u2014 installs latest version at build time; breaks when package releases a backward-incompatible update\n- `RUN npm install` without `package-lock.json` copied first = MEDIUM \u2014 dependency resolution is non-deterministic without lockfile; different builds may get different dependency versions\n- `RUN apt-get update && apt-get install -y package` without version pin = LOW-MEDIUM \u2014 acceptable for system packages but note that builds are not perfectly reproducible\n\n### Layer optimization (LOW-MEDIUM)\n- Separate `RUN apt-get update` and `RUN apt-get install` = MEDIUM \u2014 if the install layer is cached but update is not, the cache uses a stale package index; always combine: `RUN apt-get update && apt-get install -y ... 
&& rm -rf /var/lib/apt/lists/*`\n- Package manager cache not cleaned in the same RUN instruction = LOW \u2014 increases image size unnecessarily; cache files persist in the layer\n- `COPY` of large directories before `COPY` of dependency lockfiles = MEDIUM \u2014 invalidates Docker layer cache on every source code change; copy lockfiles first, install dependencies, then copy source code\n- More than 15 layers in the final image = LOW \u2014 excessive layers increase pull time and storage; combine related RUN instructions\n\n## Image provenance\n\n### Registry trust (HIGH)\n- Image from Docker Hub without official or verified publisher status = MEDIUM \u2014 community images may contain malware, crypto miners, or backdoors; prefer official images or your organization's private registry\n- Image from a public registry not in the organization's approved list = HIGH \u2014 supply chain risk; all base images should come from a curated allowlist\n- Image without a vulnerability scan report = MEDIUM \u2014 use `docker scout`, `trivy`, or `grype` to scan before deployment\n- Image with known CRITICAL CVEs in scan results = HIGH \u2014 vulnerabilities in base image or installed packages\n\n### Image signing and verification\n- Images not signed with Docker Content Trust or Cosign/Sigstore = MEDIUM \u2014 no guarantee the image wasn't tampered with between build and deployment\n- Image pulled by tag (not digest) in production K8s manifests = MEDIUM \u2014 tag can be overwritten in the registry; use `image@sha256:...` for immutable references\n- Multi-arch image used without specifying platform = LOW \u2014 Docker selects the platform automatically, but behavior may vary between build environments (CI vs local development)\n\n## Multi-stage build risks\n\n### Secret leakage between stages (CRITICAL)\n- `COPY --from=builder /app/.env /app/.env` = CRITICAL \u2014 copies secrets from build stage into runtime image; secrets that should only exist during build (npm tokens, pip 
credentials) end up in the final image\n- Build arguments (`ARG`) with secret values without using `--mount=type=secret` = HIGH \u2014 ARG values are visible in image layer metadata; use BuildKit secrets mount: `RUN --mount=type=secret,id=npm,target=/root/.npmrc npm install`\n- `COPY --from=builder /root/.aws /root/.aws` = CRITICAL \u2014 AWS credentials from builder stage leaked into runtime image\n\n### Stage dependency issues\n- Final stage `COPY --from=builder` missing critical runtime files = causes runtime failure (not a security risk but a reliability risk)\n- Final stage inheriting a different base image than expected \u2014 `FROM node:slim` (runtime) vs `FROM node:latest` (build); native dependencies compiled in build stage may not work on slim runtime (missing shared libraries)\n- Build stage with `RUN npm run build` but missing build artifacts in COPY = build succeeds but deployed image is broken\n\n## Docker Compose analysis\n\n### Security risks in Compose (HIGH)\n- `privileged: true` on any service = CRITICAL \u2014 container gets full access to host kernel; equivalent to running as root on the host machine\n- `network_mode: host` = HIGH \u2014 container shares the host's network namespace; can bind to any host port and see all network traffic\n- `pid: host` = HIGH \u2014 container can see all host processes; combined with `privileged`, allows container escape\n- `volumes` mounting sensitive host paths:\n  - `/var/run/docker.sock:/var/run/docker.sock` = CRITICAL \u2014 container can control the Docker daemon; can spawn new privileged containers, access any volume, and effectively has root on the host\n  - `/etc/shadow`, `/etc/passwd` = CRITICAL \u2014 host authentication files exposed\n  - `/` or `/root` = CRITICAL \u2014 entire host filesystem or root home directory accessible\n  - `/var/log` = MEDIUM \u2014 host logs may contain sensitive information\n- `cap_add: [SYS_ADMIN]` or `cap_add: [ALL]` = HIGH \u2014 Linux capabilities that approach 
root-level access\n\n### Configuration risks (MEDIUM)\n- `restart: always` without health check = MEDIUM \u2014 container restarts indefinitely even if the application is broken; combine with `healthcheck` to prevent restart loops\n- `ports` exposing database or cache ports to the host (3306, 5432, 6379, 27017) = HIGH \u2014 database accessible from the host network; should use internal Docker networking only\n- `environment` section with inline secrets instead of `secrets` or `.env` file reference = HIGH \u2014 secrets visible in `docker compose config` output and in the compose file in version control\n- Missing `mem_limit` or `deploy.resources.limits.memory` = MEDIUM \u2014 container can consume all host memory; always set memory limits in production\n- `depends_on` without `condition: service_healthy` = MEDIUM \u2014 dependent service starts before dependency is ready; only checks that the container started, not that the application inside is healthy\n\n### Networking risks\n- Services on the default bridge network in production = MEDIUM \u2014 all containers can communicate with each other; use custom networks to isolate services\n- `expose` without corresponding `ports` = informational \u2014 container port is documented but not published to host; not a risk, but verify it's intentional\n- Multiple services binding to the same host port = will fail at startup \u2014 `ports: \"8080:80\"` on two services causes port conflict; use different host ports or a reverse proxy\n\n## Runtime container security\n\n### Volume and data risks\n- Named volume removed from compose file = HIGH \u2014 data in the volume persists on the host but is no longer managed by compose; orphaned data\n- Anonymous volumes (no name specified) = MEDIUM \u2014 data is lost when the container is recreated; not suitable for persistent data\n- `tmpfs` mount for `/tmp` = positive \u2014 prevents temporary file persistence; good security practice\n- Volume mount with `:rw` (default) when `:ro` 
would suffice = LOW \u2014 principle of least privilege; read-only mounts reduce the risk of container writing to host filesystem\n\n### Container lifecycle\n- `stop_grace_period` too short (< 10s) for services with in-flight requests = MEDIUM \u2014 application may not finish processing requests before SIGKILL; default 10s is usually sufficient but increase for long-running operations\n- No `logging` driver configuration = LOW \u2014 defaults to `json-file` which can fill disk; configure `max-size` and `max-file` options or use a centralized logging driver\n- `init: true` not set for services that spawn child processes = MEDIUM \u2014 zombie processes accumulate; `init: true` adds a tiny init system that reaps child processes properly\n\n## Dockerfile change risk assessment\n\n| Change type | Risk level | Rationale |\n|---|---|---|\n| Base image tag change | HIGH | New OS, new packages, new vulnerabilities, potential breaking changes |\n| Base image digest change | MEDIUM | Controlled update, but content changes |\n| USER instruction added/changed | HIGH | Affects permission model for all subsequent instructions |\n| EXPOSE port change | MEDIUM | May require corresponding K8s service/ingress update |\n| COPY/ADD source path change | MEDIUM | Different files included in image |\n| RUN with package install | MEDIUM | New dependencies, new attack surface |\n| ENV change | LOW-HIGH | Depends on variable (PORT vs SECRET) |\n| ENTRYPOINT/CMD change | HIGH | Changes how the container starts; wrong entrypoint = broken container |\n| HEALTHCHECK change | MEDIUM | Affects readiness detection in orchestrators |\n| .dockerignore change | MEDIUM | Affects what enters the build context |",
      "skill_id": "docker",
      "tags": [
        "docker",
        "containers",
        "supply-chain"
      ],
      "test_suite_path": "tests/skill-tests/docker",
      "token_budget": 1200,
      "tool": null,
      "tool_label": "docker",
      "trigger_content_patterns": [],
      "triggers": [
        "Dockerfile",
        ".dockerfile",
        "docker-compose.yml",
        "docker-compose.yaml",
        "compose.yml",
        "compose.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Flux GitOps guidance for reconciliation, pruning, and source-driven rollout safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "flux",
      "path": "skills/flux",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Flux Kustomization `prune: true` can delete manually-created but still-needed resources = HIGH\n- Suspending or resuming the wrong source blocks reconciliation across multiple environments = HIGH\n- HelmRelease `valuesFrom` changes can silently swap runtime configuration = MEDIUM\n- Aggressive reconcile interval reductions can thundering-herd the control plane = MEDIUM\n\n## Review cues\n\n- Inspect source objects, reconciliation intervals, and pruning behavior together for Flux changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "flux",
      "tags": [
        "flux",
        "gitops",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/flux",
      "token_budget": 1300,
      "tool": null,
      "tool_label": "flux",
      "trigger_content_patterns": [
        "source.toolkit.fluxcd.io",
        "helm.toolkit.fluxcd.io",
        "kustomize.toolkit.fluxcd.io"
      ],
      "triggers": [
        "gotk-sync.yaml",
        "flux-kustomization.yaml",
        "helmrelease.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": true,
      "author": "DeployWhisper",
      "description": "Git-based change context intelligence covering commit analysis, sensitive file detection, branch risk signals, author patterns, and co-change analysis. This skill is always loaded because Git context enriches every other tool's analysis.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "git",
      "path": "skills/git",
      "scenario_count": 1,
      "skill_body": "## Sensitive file detection\n\n### Auto-block from LLM transmission (CRITICAL)\nThese file patterns must NEVER be sent to external LLM providers. If detected in uploaded files, exclude content from LLM payload and display a prominent warning to the user.\n\n- `.env`, `.env.local`, `.env.production`, `.env.*` \u2014 environment variables often contain API keys, database credentials, and secrets\n- `*.pem`, `*.key`, `*.crt`, `*.p12`, `*.pfx` \u2014 TLS certificates and private keys\n- `id_rsa`, `id_ed25519`, `id_ecdsa`, `*.pub` (private key counterparts) \u2014 SSH keys\n- `kubeconfig`, `.kube/config`, `kube.config` \u2014 Kubernetes cluster credentials with admin access\n- `credentials`, `credentials.json`, `credentials.xml` \u2014 generic credential stores\n- `*.tfstate`, `*.tfstate.backup` \u2014 Terraform state files contain every resource attribute including passwords, connection strings, and private IPs\n- `terraform.tfvars` containing `password`, `secret`, `token`, `key` variables \u2014 inspect variable names, not just filename\n- `aws_credentials`, `.aws/credentials`, `.aws/config` \u2014 AWS access keys and session tokens\n- `gcloud-service-account*.json`, `*-sa-key.json` \u2014 GCP service account keys\n- `.npmrc`, `.pypirc` with auth tokens \u2014 package registry credentials\n- `vault-token`, `.vault-token`, `vault.json` \u2014 HashiCorp Vault access tokens\n- `docker-compose*.yml` with `environment` sections containing hardcoded secrets\n\n### Sensitive content patterns (HIGH)\nEven in non-sensitive filenames, flag if content contains:\n- Strings matching `AKIA[0-9A-Z]{16}` \u2014 AWS access key ID pattern\n- Strings matching `ghp_[a-zA-Z0-9]{36}` \u2014 GitHub personal access token\n- Strings matching `sk-[a-zA-Z0-9]{48}` \u2014 OpenAI API key pattern\n- Strings matching `xox[bpors]-[a-zA-Z0-9-]+` \u2014 Slack token pattern\n- Variables named `password`, `secret`, `token`, `api_key`, `apikey`, `access_key`, `private_key` with 
string literal values\n- Base64-encoded blocks longer than 100 characters in YAML/JSON values \u2014 may be encoded credentials\n\n## Commit context analysis\n\n### Commit message signals (MEDIUM)\n- Commit message containing `hotfix`, `urgent`, `emergency`, `ASAP`, `quick fix` = MEDIUM \u2014 rushed changes are more likely to have errors; flag for extra review attention\n- Commit message containing `revert`, `rollback`, `undo` = informational \u2014 indicates a previous change caused problems; the original change and this revert should be understood together\n- Commit message referencing a ticket/issue (e.g., `JIRA-1234`, `#456`, `fixes #789`) = positive signal \u2014 change is tracked and has context\n- Commit message with no ticket reference on infrastructure files = MEDIUM \u2014 untracked infrastructure change; may bypass change management process\n- Very short commit messages (`fix`, `update`, `test`, `wip`) on production infrastructure files = MEDIUM \u2014 suggests the change wasn't carefully considered\n- Commit message mentioning `temporary`, `hack`, `workaround`, `TODO` = MEDIUM \u2014 indicates technical debt being introduced intentionally\n\n### Change magnitude signals\n- Single commit touching more than 10 infrastructure files = HIGH \u2014 large blast radius change; should be broken into smaller, reviewable chunks\n- Single commit mixing infrastructure changes with application code changes = MEDIUM \u2014 infrastructure and application should be deployed independently; coupled changes increase rollback complexity\n- Commit with more lines deleted than added in infrastructure files = informational \u2014 net reduction in infrastructure; verify nothing critical was removed\n- Empty commit or merge commit with no diff = informational \u2014 may be a pipeline trigger or branch synchronization; no analysis needed\n\n## Branch risk signals\n\n### Deployment source risks (HIGH)\n- Deploying from a branch other than `main`, `master`, or `release/*` = HIGH 
\u2014 non-standard deployment source; may contain unreviewed or in-progress changes\n- Deploying from a branch with `force-push` in its recent history = CRITICAL \u2014 force-push rewrites history; commits may have been removed or altered without review; the deployment artifact may not match what was code-reviewed\n- Deploying from a branch that is behind `main` by more than 20 commits = MEDIUM \u2014 stale branch; infrastructure may have changed significantly since the branch was created; merge conflicts or missing dependencies are likely\n- Deploying from a branch with unresolved merge conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) = CRITICAL \u2014 conflict markers in YAML or HCL files cause parse failures; in some cases they may be silently ignored by lenient parsers, creating malformed infrastructure\n\n### Branch hygiene signals\n- Branch name containing `experiment`, `test`, `poc`, `spike` = HIGH if deploying to production \u2014 these branch names indicate exploratory work that shouldn't reach production\n- Multiple unreviewed commits on the branch = MEDIUM \u2014 changes may not have been peer-reviewed; especially risky for infrastructure modifications\n- Branch with no associated pull request = MEDIUM \u2014 bypasses the code review process; direct pushes to deploy branches should require justification\n\n## Author risk signals\n\n### Author context (MEDIUM)\n- First-time contributor to infrastructure files = HIGH \u2014 new to the IaC codebase; changes deserve extra review attention regardless of the contributor's seniority\n- Changes to infrastructure files by an author who primarily commits application code = MEDIUM \u2014 infrastructure requires different expertise; application developers may not understand Terraform state implications or K8s resource management nuances\n- Commits outside normal working hours (22:00-06:00 local time) on non-incident branches = MEDIUM \u2014 late-night changes correlate with higher error rates; not a hard rule but 
worth flagging\n- Multiple authors modifying the same infrastructure file in the same PR = informational \u2014 indicates collaborative infrastructure change; verify changes don't conflict\n\n### Review status\n- PR with zero approvals deploying to production = HIGH \u2014 no peer review on infrastructure change\n- PR with approvals from team members who don't own the affected infrastructure = MEDIUM \u2014 approval may lack domain expertise\n- PR approved more than 5 days ago without re-approval = MEDIUM \u2014 infrastructure context may have changed since the review; consider re-review\n\n## Co-change analysis\n\n### Missing co-changes (HIGH)\n- Terraform security group change without corresponding application configuration change = informational \u2014 new ports opened may need application config to use them\n- Kubernetes deployment image change without ConfigMap or Secret update = informational \u2014 new application version may expect new configuration\n- Ansible role update without inventory change = informational \u2014 new role variables may need inventory-level overrides\n- Jenkinsfile deploy stage change without corresponding infrastructure change = informational \u2014 deployment process changed but infrastructure target is the same; verify compatibility\n- Dockerfile base image change without Kubernetes resource limit update = MEDIUM \u2014 new base image may have different memory/CPU footprint; resource limits may need adjustment\n\n### Historical co-change patterns\n- Files that historically change together (detected from git log) but only one is present in the current changeset = MEDIUM \u2014 potential missing co-change; common example: Terraform module source changed but module variable file not updated\n- Infrastructure files that always change in pairs (e.g., `main.tf` and `variables.tf`) but only one is modified = LOW \u2014 may indicate incomplete change",
      "skill_id": "git",
      "tags": [
        "git",
        "diff",
        "review"
      ],
      "test_suite_path": "tests/skill-tests/git",
      "token_budget": 1200,
      "tool": null,
      "tool_label": "git",
      "trigger_content_patterns": [],
      "triggers": [
        ".diff",
        ".patch",
        ".gitdiff"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Helm chart rollout guidance covering hooks, chart drift, and value-driven production failures.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "helm",
      "path": "skills/helm",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- `post-install` or `pre-upgrade` hooks that mutate shared databases can turn a chart upgrade into a production outage = HIGH\n- Service selector or immutable field changes force resource replacement and can strand live traffic = HIGH\n- Floating chart dependencies or `image.tag: latest` break reproducibility between environments = MEDIUM\n- Scaling critical workloads to zero through values files removes live capacity immediately = HIGH\n\n## Review cues\n\n- Review rendered manifests, hooks, and dependency updates together before approving a Helm rollout.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "helm",
      "tags": [
        "helm",
        "kubernetes",
        "gitops"
      ],
      "test_suite_path": "tests/skill-tests/helm",
      "token_budget": 1400,
      "tool": null,
      "tool_label": "helm",
      "trigger_content_patterns": [
        "apiVersion: v2",
        "dependencies:"
      ],
      "triggers": [
        "Chart.yaml",
        "values.yaml",
        "values-production.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Helmfile guidance for environment inheritance, release targeting, and shared values safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "helmfile",
      "path": "skills/helmfile",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Environment value inheritance can change many releases at once = HIGH\n- `helmDefaults.atomic: false` leaves partial failed rollouts behind = MEDIUM\n- Release selector or ordering changes can affect unintended workloads = HIGH\n- Shared secret or values file path changes break multiple environments together = HIGH\n\n## Review cues\n\n- Review environment inheritance and release targeting together before approving Helmfile changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "helmfile",
      "tags": [
        "helmfile",
        "helm",
        "gitops"
      ],
      "test_suite_path": "tests/skill-tests/helmfile",
      "token_budget": 1300,
      "tool": null,
      "tool_label": "helmfile",
      "trigger_content_patterns": [],
      "triggers": [
        "helmfile.yaml",
        "helmfile.yml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Istio traffic-management and policy guidance for routing, mTLS, and authorization changes.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "istio",
      "path": "skills/istio",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- VirtualService host, match, or gateway rewrites can blackhole traffic across multiple services = CRITICAL\n- DestinationRule TLS mode mismatches commonly surface as cascading 503 errors = HIGH\n- AuthorizationPolicy allow rules with broad principals or namespaces expand lateral access = HIGH\n- PeerAuthentication set to `STRICT` before workloads are mesh-ready can trigger downtime = HIGH\n\n## Review cues\n\n- Review Istio routing and policy changes together because safe config depends on mesh-wide consistency.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "istio",
      "tags": [
        "istio",
        "service-mesh",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/istio",
      "token_budget": 1350,
      "tool": null,
      "tool_label": "istio",
      "trigger_content_patterns": [
        "networking.istio.io",
        "security.istio.io"
      ],
      "triggers": [
        "virtualservice.yaml",
        "destinationrule.yaml",
        "authorizationpolicy.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Deep Jenkins pipeline safety knowledge covering approval gate analysis, credential exposure patterns, agent security, deployment stage risks, and shared library vulnerabilities.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "jenkins",
      "path": "skills/jenkins",
      "scenario_count": 1,
      "skill_body": "## Approval gate analysis\n\n### Removed gates (CRITICAL)\n- `input` step removed from before a production deploy stage = CRITICAL \u2014 human approval was required and is now bypassed; deployments go straight to production without review\n- `input` step with `submitter` parameter removed or changed = HIGH \u2014 changes who can approve; removing submitter means ANYONE can approve\n- `timeout` removed from `input` step = MEDIUM \u2014 without timeout, a stale approval request blocks the pipeline forever; with timeout, auto-rejection prevents zombie pipelines\n- `input` step moved from before deploy to after deploy = CRITICAL \u2014 approval happens after the change is already in production; defeats the purpose entirely\n\n### Weakened gates (HIGH)\n- `input` step moved inside a `parallel` block = HIGH \u2014 approval may be requested simultaneously for multiple environments; operator can accidentally approve prod thinking they're approving staging\n- `input` message changed to be less specific = MEDIUM \u2014 vague approval messages (\"Continue?\") don't communicate what's being approved; should specify environment, version, and change summary\n- `when` condition on deploy stage changed from `branch 'main'` to `branch '*'` = HIGH \u2014 production deploy now triggers from any branch, including feature branches and PRs\n- `when` condition removed entirely from deploy stage = CRITICAL \u2014 deploy stage runs unconditionally on every pipeline execution\n\n## Credential exposure patterns\n\n### Direct exposure (CRITICAL)\n- Environment variable set to credential value without `credentials()` helper \u2014 `env.DB_PASSWORD = 'actual-password'` exposes the password in pipeline logs, Blue Ocean UI, and build artifacts\n- `echo` or `println` of a variable that contains credentials = CRITICAL \u2014 Jenkins masks known credential IDs but not arbitrary variables containing secrets\n- `sh` step with credentials interpolated in the command string \u2014 
`sh \"curl -u ${USER}:${PASSWORD} https://api...\"` exposes credentials in the shell process listing and Jenkins build log\n- `writeFile` with credential content without subsequent `archiveArtifacts` exclusion = HIGH \u2014 credential file may end up in build artifacts\n\n### Proper credential patterns\n- `credentials('credential-id')` binding in `environment` block \u2014 Jenkins auto-masks the value in logs\n- `withCredentials([...])` block wrapping only the steps that need access \u2014 limits exposure scope\n- `sshagent(['ssh-key-id'])` for SSH operations \u2014 key is loaded into agent memory, not written to disk\n- `secretText` / `usernamePassword` / `certificate` credential types \u2014 each has specific binding syntax\n\n### CI/CD secret leaks (HIGH)\n- `Jenkinsfile` committed to repo with hardcoded secrets = CRITICAL \u2014 secrets are in version control history permanently; even after removal, they exist in git history\n- `parameters` block with `defaultValue` containing secrets = HIGH \u2014 default values are visible in the Jenkins UI and API\n- `stash`/`unstash` of files containing secrets between stages = MEDIUM \u2014 stashed files are stored on the Jenkins controller; accessible to other builds if not cleaned up\n- `archiveArtifacts` including `.env` files, config files with secrets, or credential files = CRITICAL \u2014 artifacts are downloadable by anyone with build access\n\n## Agent security\n\n### Execution environment risks (HIGH)\n- `agent any` in production pipeline = HIGH \u2014 pipeline runs on ANY available agent, including untrusted or shared agents; production pipelines should target specific labeled agents\n- `agent { label 'master' }` or `agent { label 'built-in' }` = CRITICAL \u2014 running on the Jenkins controller exposes the entire Jenkins configuration, credential store, and all job definitions\n- `agent { docker { image 'untrusted:latest' } }` = HIGH \u2014 running build in an untrusted container image; supply chain attack 
vector\n- `agent` changed from specific label to `any` = HIGH \u2014 regression in agent targeting; production builds may run on development agents\n\n### Sandbox restrictions\n- `@NonCPS` annotation on methods = HIGH \u2014 opts the method out of the CPS transform, so it runs as plain Groovy on the Jenkins controller, cannot safely call pipeline steps, and cannot resume after a controller restart; it does not by itself bypass the Groovy sandbox, so review what the method does\n- `@Grab` annotation importing external dependencies = CRITICAL \u2014 downloads and executes arbitrary code from Maven Central at runtime\n- Script approval requests in the Jenkins admin console = MEDIUM \u2014 indicates the pipeline is using APIs not in the sandbox whitelist; review what API access is being requested\n- `load` step loading external Groovy scripts = HIGH \u2014 loaded scripts bypass the Jenkinsfile sandbox unless also sandboxed\n\n## Deployment stage risks\n\n### Deployment patterns (HIGH)\n- Deploy stage without preceding test stage = HIGH \u2014 deploying untested code; the pipeline structure should enforce test \u2192 build \u2192 deploy ordering\n- Deploy stage in `parallel` block with test stage = HIGH \u2014 deploy runs simultaneously with tests, not after them; deploy may complete before tests fail\n- Deploy stage without `post { failure { ... 
} }` rollback block = HIGH \u2014 if deployment fails mid-way, there's no automated recovery\n- Deploy stage using `sh 'kubectl apply'` without `--dry-run` validation step first = MEDIUM \u2014 applying directly without pre-validation; add a dry-run stage before the actual apply\n\n### Rollback and recovery\n- Missing `post { failure { } }` block = MEDIUM \u2014 no automated action on pipeline failure; at minimum should notify a Slack channel or PagerDuty\n- Missing `post { always { } }` cleanup block = MEDIUM \u2014 temporary files, Docker containers, and test artifacts are not cleaned up; causes disk space issues over time\n- `post { unstable { } }` not defined = LOW \u2014 unstable builds (test failures that don't fail the pipeline) may need different handling than full failures\n- Retry block with `retry(count)` > 3 = MEDIUM \u2014 excessive retries can mask transient errors and delay failure notification; 3 retries is usually sufficient\n\n### Canary and progressive deployment\n- Deploy stage going from canary/blue-green to direct deployment = CRITICAL \u2014 regression in deployment safety; removes the ability to test with a subset of traffic before full rollout\n- `sleep` step used for manual canary validation = MEDIUM \u2014 fragile; pipeline blocks for a fixed duration regardless of whether canary is healthy; use a health check loop instead\n- Weight-based traffic shifting without health check gate between increments = HIGH \u2014 traffic shifts to new version even if it's unhealthy\n\n## Shared library risks\n\n### Library version changes (HIGH)\n- `@Library('my-shared-lib@main')` \u2014 always uses latest main branch; library change affects ALL pipelines using it simultaneously; a bug in the library breaks every pipeline at once\n- Library version changed from pinned tag to branch reference = HIGH \u2014 moves from deterministic to non-deterministic behavior\n- Library version changed from one tag to another = MEDIUM \u2014 review the library changelog 
for breaking changes\n- New `@Library` import added = MEDIUM \u2014 introduces dependency on external code; verify the library source is trusted\n- Library function signature changed \u2014 downstream pipelines using old signature will fail; library authors should maintain backward compatibility\n\n### Library security\n- Shared library with `@Grab` dependencies = CRITICAL \u2014 external dependency resolution at runtime; can be hijacked via dependency confusion\n- Shared library accessing `Jenkins.instance` = CRITICAL \u2014 has full access to the Jenkins controller, all credentials, all job configurations\n- Shared library modifying global state = HIGH \u2014 `env` modifications in a library function affect all subsequent stages in the calling pipeline\n- Untrusted shared library (not configured as \"trusted\" in Jenkins admin) = runs in sandbox \u2014 but sandbox escapes exist; review carefully\n\n## Pipeline structure risks\n\n### Resource management\n- No `timeout(time: X, unit: 'MINUTES')` on the pipeline = MEDIUM \u2014 a stuck build can run indefinitely, consuming an agent slot\n- No `timestamps()` option = LOW \u2014 makes debugging timing issues difficult; always enable\n- `disableConcurrentBuilds()` removed = HIGH \u2014 concurrent builds of the same pipeline can cause race conditions on shared resources (same deployment target, same Docker registry tag)\n- `buildDiscarder(logRotator(...))` removed or retention increased significantly = MEDIUM \u2014 Jenkins controller disk fills up with build logs and artifacts\n\n### Parameter risks\n- Pipeline parameter type changed from `choice` to `string` = MEDIUM \u2014 removes input validation; users can now enter arbitrary values instead of picking from approved list\n- New `booleanParam` defaulting to `true` for a destructive action = HIGH \u2014 destructive action is ON by default; operators must actively opt out\n- Parameter name changed \u2014 all upstream jobs, trigger configurations, and scripts 
referencing the old parameter name will break silently (pass null/empty instead of failing)",
      "skill_id": "jenkins",
      "tags": [
        "jenkins",
        "ci-cd",
        "pipelines"
      ],
      "test_suite_path": "tests/skill-tests/jenkins",
      "token_budget": 1400,
      "tool": null,
      "tool_label": "jenkins",
      "trigger_content_patterns": [
        "pipeline",
        "stage",
        "agent",
        "steps",
        "post",
        "input",
        "parallel"
      ],
      "triggers": [
        "Jenkinsfile",
        ".jenkinsfile",
        "jenkins.groovy"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Jsonnet guidance for import-graph drift, hidden defaults, and rendered secret exposure.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "jsonnet",
      "path": "skills/jsonnet",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Import graph changes can rewrite many generated manifests indirectly = HIGH\n- Hidden fields or local mixins can mask production-only config drift = MEDIUM\n- Evaluated secret literals leak directly into rendered output and review artifacts = CRITICAL\n- Deleting list elements in generators can remove live permissions or routes = HIGH\n\n## Review cues\n\n- Review rendered output and source-level abstraction changes together for Jsonnet edits.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "jsonnet",
      "tags": [
        "jsonnet",
        "templating",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/jsonnet",
      "token_budget": 1200,
      "tool": null,
      "tool_label": "jsonnet",
      "trigger_content_patterns": [],
      "triggers": [
        "jsonnetfile.json",
        "config.jsonnet",
        "main.jsonnet",
        ".jsonnet"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Deep Kubernetes operational knowledge covering workload safety, rolling update risks, RBAC escalation, network policy gaps, and resource management pitfalls.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "kubernetes",
      "path": "skills/kubernetes",
      "scenario_count": 1,
      "skill_body": "## Critical risk patterns\n\n### Workload security (CRITICAL)\n- Container running as root (`securityContext.runAsUser: 0` or missing `runAsNonRoot: true`) = CRITICAL \u2014 container escape vulnerabilities grant host-level access\n- Privileged container (`securityContext.privileged: true`) = CRITICAL \u2014 full access to host kernel, devices, and network stack; equivalent to root on the node\n- `hostNetwork: true` = CRITICAL \u2014 container shares the node's network namespace; can intercept traffic from other pods on the same node\n- `hostPID: true` or `hostIPC: true` = HIGH \u2014 container can see and signal all processes on the host node\n- Missing `securityContext.readOnlyRootFilesystem: true` = MEDIUM \u2014 writable filesystem increases attack surface for malware persistence\n- Container image with `latest` tag = HIGH \u2014 non-deterministic deployments; the same manifest can produce different containers on different nodes\n- Container image from untrusted registry (not your private ECR/GCR/ACR) = HIGH \u2014 supply chain attack vector\n- Image without digest pinning (using tag instead of `image@sha256:...`) = MEDIUM \u2014 tag can be overwritten in the registry\n\n### Missing resource controls (HIGH)\n- No `resources.limits.memory` set = HIGH \u2014 a single pod can consume all node memory and trigger OOM kills on other pods via the kernel OOM killer\n- No `resources.limits.cpu` set = MEDIUM \u2014 pod can starve other workloads of CPU; less severe than memory because CPU is compressible\n- No `resources.requests` set = HIGH \u2014 scheduler cannot make informed placement decisions; pods may land on overcommitted nodes\n- `resources.requests` much lower than `resources.limits` (>4x ratio) = MEDIUM \u2014 indicates over-commitment; the pod claims little but uses a lot, causing node pressure\n- `resources.limits.memory` lower than application baseline = HIGH \u2014 pod will be OOM-killed repeatedly, causing CrashLoopBackOff\n\n### 
Replica and availability risks (HIGH)\n- `spec.replicas: 1` in production = HIGH \u2014 single point of failure; any pod disruption causes full outage\n- `spec.replicas` reduced from current value = MEDIUM \u2014 capacity reduction during a change is risky; validate that remaining capacity handles peak load\n- No `PodDisruptionBudget` for production workloads = HIGH \u2014 voluntary disruptions (node drains, cluster upgrades) can evict all pods simultaneously\n- PDB with `maxUnavailable: 100%` or `minAvailable: 0` = CRITICAL \u2014 defeats the purpose of the PDB; all pods can be evicted at once\n\n## Rolling update risks\n\n### Deployment strategy\n- `strategy.rollingUpdate.maxUnavailable` set too high (>25%) = HIGH \u2014 too many pods terminate before replacements are ready; causes capacity dip during rollout\n- `strategy.rollingUpdate.maxSurge: 0` with `maxUnavailable: 0` = CRITICAL \u2014 invalid configuration; the API server rejects the Deployment because a rollout could neither create new pods nor remove old ones\n- `strategy.type: Recreate` in production = CRITICAL \u2014 all old pods are killed before new pods start; guarantees downtime during deployment\n- Missing `minReadySeconds` = MEDIUM \u2014 new pods count as available the moment they become ready; a pod that passes its readiness probe once but fails under load will still receive traffic\n\n### Probe configuration\n- No `readinessProbe` defined = CRITICAL \u2014 Kubernetes sends traffic to pods that may not be ready to serve; causes errors during rollout and after restarts\n- No `livenessProbe` defined = MEDIUM \u2014 stuck/deadlocked pods are never restarted; process is running but not functional\n- `livenessProbe` with aggressive timing (`periodSeconds < 5`, `failureThreshold < 3`) = HIGH \u2014 healthy but briefly slow pods get killed unnecessarily, causing restart loops\n- `livenessProbe` and `readinessProbe` pointing to the same endpoint with same thresholds = MEDIUM \u2014 when the service is degraded, you want it removed from load balancer (readiness) but not killed 
(liveness); same config means degraded = killed\n- `startupProbe` missing on slow-starting applications = HIGH \u2014 liveness probe kills the pod before the application finishes initialization\n- `initialDelaySeconds` too short for applications with long startup (JVM, .NET, ML model loading) = HIGH \u2014 pod killed during warmup\n\n### Image and container changes\n- Image tag change (e.g., `v2.14.1` \u2192 `v2.15.0`) = MEDIUM-HIGH \u2014 new code rolling into production; risk scales with change magnitude\n- Base image change (e.g., `node:18-alpine` \u2192 `node:20-alpine`) = HIGH \u2014 runtime version change can introduce subtle behavior differences\n- `imagePullPolicy: Never` with a tag (not digest) = HIGH \u2014 uses whatever image is cached on the node; different nodes may run different versions\n- `imagePullPolicy: Always` with `latest` tag = CRITICAL \u2014 every pod restart pulls whatever is currently tagged latest; non-deterministic\n\n## RBAC and access control\n\n### Role escalation risks\n- `ClusterRole` with `verbs: [\"*\"]` on any resource = CRITICAL \u2014 wildcard permissions grant full control\n- `ClusterRole` with `resources: [\"*\"]` = CRITICAL \u2014 applies to every resource type in the cluster\n- `ClusterRoleBinding` granting cluster-admin to a ServiceAccount used by a workload = CRITICAL \u2014 compromised pod gets full cluster access\n- New `RoleBinding` or `ClusterRoleBinding` creation = HIGH \u2014 always review who/what is getting access and to what resources\n- ServiceAccount with `automountServiceAccountToken: true` (default) in pods that don't need API access = MEDIUM \u2014 unnecessary credential exposure\n\n### Secret management\n- `Secret` data changed = HIGH \u2014 verify the secret content is correct; wrong database password or API key causes runtime failures across all pods mounting the secret\n- `Secret` referenced in environment variables instead of volume mounts = MEDIUM \u2014 environment variables appear in process 
listings, crash dumps, and log output\n- `ConfigMap` change that is mounted as a volume = MEDIUM \u2014 existing pods see the change after kubelet sync delay (60-90 seconds by default); no restart needed but timing is unpredictable\n- `ConfigMap` change referenced via `envFrom` = HIGH \u2014 requires pod restart to pick up changes; running pods continue with old values until restarted\n\n## Network policy risks\n\n- Production namespace without any `NetworkPolicy` = HIGH \u2014 all pods can communicate with all other pods in the cluster; no microsegmentation\n- `NetworkPolicy` with empty `ingress` or `egress` rules = MEDIUM \u2014 blocks all traffic in that direction; can isolate pods unintentionally\n- `NetworkPolicy` with `podSelector: {}` (empty selector) = note \u2014 selects ALL pods in the namespace; verify this is intentional\n- Removing a `NetworkPolicy` = HIGH \u2014 instantly opens traffic that was previously restricted\n- `NetworkPolicy` referencing a label that no pod currently has = MEDIUM \u2014 policy exists but has no effect; may indicate a misconfiguration\n\n## Resource management pitfalls\n\n### HPA and scaling\n- `HorizontalPodAutoscaler` targeting the same deployment as a manual `replicas` field = CRITICAL \u2014 HPA and manual replica count fight each other; HPA overwrites manual changes\n- HPA `minReplicas: 1` in production = HIGH \u2014 autoscaler can scale down to single instance, creating a SPOF\n- HPA `maxReplicas` too high without corresponding node capacity = MEDIUM \u2014 pods will be stuck in Pending if cluster autoscaler can't provision nodes fast enough\n- HPA `targetCPUUtilizationPercentage` too low (<30%) = MEDIUM \u2014 wasteful over-provisioning; too high (>80%) = HIGH \u2014 insufficient headroom for traffic spikes, pods may become unresponsive before new ones are ready\n- VPA and HPA targeting the same resource on CPU = CRITICAL \u2014 conflicting recommendations cause flapping\n\n### Storage\n- `PersistentVolumeClaim` access 
mode change = HIGH \u2014 may require PV recreation, causing data access interruption\n- `PersistentVolume` reclaim policy `Delete` on production data = CRITICAL \u2014 volume and data destroyed when PVC is deleted\n- `StorageClass` change on existing PVC = not supported \u2014 requires PVC recreation and data migration\n- `emptyDir` used for data that must survive pod restarts = HIGH \u2014 data is lost when pod is evicted, rescheduled, or OOM-killed\n\n## Namespace and context\n\n- Changes targeting `kube-system` namespace = CRITICAL \u2014 core cluster components; mistakes here affect the entire cluster\n- Changes targeting `default` namespace in production = MEDIUM \u2014 indicates poor namespace hygiene; production workloads should have dedicated namespaces\n- Resource quotas or limit ranges being reduced = HIGH \u2014 may cause existing pods to exceed new limits; new pods may fail to schedule\n- Namespace deletion = CRITICAL \u2014 destroys all resources in the namespace including persistent volume claims (data loss)",
      "skill_id": "kubernetes",
      "tags": [
        "kubernetes",
        "containers",
        "orchestration"
      ],
      "test_suite_path": "tests/skill-tests/kubernetes",
      "token_budget": 1800,
      "tool": null,
      "tool_label": "kubernetes",
      "trigger_content_patterns": [
        "apiVersion",
        "kind",
        "metadata",
        "spec.containers",
        "spec.replicas"
      ],
      "triggers": [
        ".yaml",
        ".yml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Kustomize overlay guidance for name transforms, patch targeting, and namespace drift.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "kustomize",
      "path": "skills/kustomize",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- NamePrefix or NameSuffix changes rewrite object identities and can break references = HIGH\n- Broad strategic merge or JSON patches can mutate unintended resources = HIGH\n- Image tag swaps without digest pinning reintroduce unreviewed drift = MEDIUM\n- Namespace transformer changes can re-home shared resources unexpectedly = HIGH\n\n## Review cues\n\n- Review Kustomize overlays as graph-wide mutations rather than isolated file edits.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "kustomize",
      "tags": [
        "kustomize",
        "kubernetes",
        "gitops"
      ],
      "test_suite_path": "tests/skill-tests/kustomize",
      "token_budget": 1250,
      "tool": null,
      "tool_label": "kustomize",
      "trigger_content_patterns": [],
      "triggers": [
        "kustomization.yaml",
        "kustomization.yml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Nginx Ingress controller guidance for routing, annotations, and TLS handling.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "nginx-ingress",
      "path": "skills/nginx-ingress",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- `nginx.ingress.kubernetes.io/configuration-snippet` can inject unsafe directives and bypass shared controls = HIGH\n- Path regex or default-backend rewrites can shadow unrelated routes and break many services = HIGH\n- Missing body-size or timeout tuning on large uploads creates production-only failures = MEDIUM\n- TLS host and secret mismatches cause certificate fallback and immediate client trust errors = HIGH\n\n## Review cues\n\n- Check Nginx annotations and route precedence, not just the host/path diff, before approving.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "nginx-ingress",
      "tags": [
        "nginx",
        "ingress",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/nginx-ingress",
      "token_budget": 1250,
      "tool": null,
      "tool_label": "nginx-ingress",
      "trigger_content_patterns": [
        "nginx.ingress.kubernetes.io"
      ],
      "triggers": [
        "nginx-ingress.yaml",
        "ingress-nginx.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "OPA Gatekeeper policy guidance for deny rollouts, match scope, and inventory sync safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "opa-gatekeeper",
      "path": "skills/opa-gatekeeper",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- ConstraintTemplate or rego errors can remove enforcement when audit failures are ignored = HIGH\n- Broad match exclusions take critical namespaces out of policy coverage = HIGH\n- Rolling out deny policies without dry-run validation can block deployments cluster-wide = CRITICAL\n- Sync config omissions mean policies evaluate stale inventory and create false confidence = MEDIUM\n\n## Review cues\n\n- Review Gatekeeper changes as policy rollouts with cluster-wide blast radius, not isolated YAML edits.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "opa-gatekeeper",
      "tags": [
        "opa",
        "gatekeeper",
        "policy"
      ],
      "test_suite_path": "tests/skill-tests/opa-gatekeeper",
      "token_budget": 1350,
      "tool": null,
      "tool_label": "opa-gatekeeper",
      "trigger_content_patterns": [
        "templates.gatekeeper.sh",
        "constraints.gatekeeper.sh"
      ],
      "triggers": [
        "constrainttemplate.yaml",
        "constraint.yaml",
        "gatekeeper-policy.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Prometheus rule guidance for alert timing, recording rules, and query-cardinality safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "prometheus-rules",
      "path": "skills/prometheus-rules",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Extending alert `for:` windows delays paging and can hide fast-burning incidents = HIGH\n- Recording rule renames break downstream dashboards, alerts, and SLO calculations = MEDIUM\n- Unbounded joins or label expansions can overload Prometheus and remote-write pipelines = HIGH\n- Severity label downgrades reduce response urgency even when the underlying blast radius is unchanged = HIGH\n\n## Review cues\n\n- Review Prometheus rule semantics, cardinality impact, and downstream consumers before merging.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "prometheus-rules",
      "tags": [
        "prometheus",
        "monitoring",
        "alerts"
      ],
      "test_suite_path": "tests/skill-tests/prometheus-rules",
      "token_budget": 1250,
      "tool": null,
      "tool_label": "prometheus-rules",
      "trigger_content_patterns": [
        "kind: PrometheusRule"
      ],
      "triggers": [
        "prometheus-rules.yaml",
        "alerting-rules.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Pulumi stack guidance for aliasing, protection, and stateful replacement risks.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "pulumi",
      "path": "skills/pulumi",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Resource renames without aliases force replacements and can recreate live infrastructure unexpectedly = HIGH\n- Turning `protect` off on databases, buckets, or queues removes a key deletion backstop = HIGH\n- Promoting secret config into plain-text stack values leaks sensitive data into state and logs = CRITICAL\n- Preview output can miss provider-computed replacements, so review replacement plans conservatively = MEDIUM\n\n## Review cues\n\n- Look for alias coverage, stack-secret handling, and protection changes before approving Pulumi updates.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "pulumi",
      "tags": [
        "pulumi",
        "iac",
        "cloud"
      ],
      "test_suite_path": "tests/skill-tests/pulumi",
      "token_budget": 1400,
      "tool": null,
      "tool_label": "pulumi",
      "trigger_content_patterns": [
        "pulumi config",
        "@pulumi/"
      ],
      "triggers": [
        "Pulumi.yaml",
        "Pulumi.dev.yaml",
        "Pulumi.prod.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Pulumi Azure guidance for resource-group blast radius, identities, and recovery settings.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "pulumi-azure",
      "path": "skills/pulumi-azure",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Resource group replacement cascades to every contained Azure resource = CRITICAL\n- Managed identity or role-assignment drift can break runtime access immediately = HIGH\n- Key Vault soft-delete or purge-protection changes alter recovery guarantees = HIGH\n- Regional SKU changes can require destructive replacement instead of in-place updates = HIGH\n\n## Review cues\n\n- Review resource-group scope, identity changes, and recovery settings together for Pulumi Azure updates.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "pulumi-azure",
      "tags": [
        "pulumi",
        "azure",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/pulumi-azure",
      "token_budget": 1400,
      "tool": null,
      "tool_label": "pulumi-azure",
      "trigger_content_patterns": [
        "@pulumi/azure-native",
        "pulumi_azure_native"
      ],
      "triggers": [
        "Pulumi.azure.yaml",
        "pulumi-azure.ts"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Pulumi GCP guidance for IAM authority, project targeting, and state exposure risks.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "pulumi-gcp",
      "path": "skills/pulumi-gcp",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Authoritative IAM bindings can remove required members and lock out workloads or operators = HIGH\n- Cloud SQL or GKE replacements from region or name drift introduce avoidable downtime = HIGH\n- Project or folder target changes move blast radius to the wrong tenant = CRITICAL\n- Decrypting secrets into plain config or logs exposes sensitive state = CRITICAL\n\n## Review cues\n\n- Review project targeting, IAM authority, and replacement indicators together for Pulumi GCP changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "pulumi-gcp",
      "tags": [
        "pulumi",
        "gcp",
        "iac"
      ],
      "test_suite_path": "tests/skill-tests/pulumi-gcp",
      "token_budget": 1400,
      "tool": null,
      "tool_label": "pulumi-gcp",
      "trigger_content_patterns": [
        "@pulumi/gcp",
        "pulumi_gcp"
      ],
      "triggers": [
        "Pulumi.gcp.yaml",
        "pulumi-gcp.ts"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Tanka guidance for environment fan-out, cluster targeting, and Jsonnet-driven drift.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "tanka",
      "path": "skills/tanka",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Environment-level library changes fan out to every rendered object = HIGH\n- Server-side apply or force behavior can overwrite fields owned by other controllers = HIGH\n- Namespace or environment target drift deploys to the wrong cluster = CRITICAL\n- Hidden Jsonnet defaults make destructive deletions hard to spot in review = MEDIUM\n\n## Review cues\n\n- Review Tanka environment targets and rendered diff shape together before merge.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "tanka",
      "tags": [
        "tanka",
        "jsonnet",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/tanka",
      "token_budget": 1300,
      "tool": null,
      "tool_label": "tanka",
      "trigger_content_patterns": [
        "tanka.dev/v1alpha1",
        "tk.libsonnet"
      ],
      "triggers": [
        "spec.json",
        "environments/default/main.jsonnet"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Tekton pipeline guidance for credentials, finally tasks, and shared-workspace safety.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "tekton",
      "path": "skills/tekton",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- Mounting shared credentials into every task leaks secrets far beyond the intended build step = HIGH\n- Changes to `finally` tasks can skip cleanup, approval, or promotion gates = HIGH\n- Shared PVC workspaces across concurrent runs create artifact races and nondeterministic builds = MEDIUM\n- Floating task image tags change pipeline behavior outside code review = HIGH\n\n## Review cues\n\n- Review credential scope, shared workspace usage, and finally-task behavior together for Tekton changes.\n- Prefer deterministic roll-forward or rollback steps over hand-wavy remediation notes.",
      "skill_id": "tekton",
      "tags": [
        "tekton",
        "ci-cd",
        "kubernetes"
      ],
      "test_suite_path": "tests/skill-tests/tekton",
      "token_budget": 1250,
      "tool": null,
      "tool_label": "tekton",
      "trigger_content_patterns": [
        "tekton.dev/"
      ],
      "triggers": [
        "pipeline.yaml",
        "pipelinerun.yaml",
        "task.yaml"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "DeployWhisper",
      "description": "Deep Terraform risk knowledge covering provider-specific patterns, state operations, lifecycle rules, and common failure modes across AWS, GCP, and Azure.",
      "featured": false,
      "license": "MIT",
      "maintainer": null,
      "name": "terraform",
      "path": "skills/terraform",
      "scenario_count": 1,
      "skill_body": "## Critical risk patterns\n\n### Security exposure (CRITICAL)\n- Security group or firewall rule with `0.0.0.0/0` or `::/0` on any port other than 80/443 = CRITICAL \u2014 database ports (3306, 5432, 6379, 27017) exposed to the internet are an immediate data breach risk\n- IAM policy with `Action: \"*\"` or `Resource: \"*\"` = CRITICAL \u2014 grants god-mode access; every resource in the account is exposed\n- S3 bucket without `block_public_access` enabled = CRITICAL \u2014 data exfiltration risk; default should always be block-all\n- S3 bucket policy with `Principal: \"*\"` = CRITICAL \u2014 public read/write to the bucket\n- KMS key policy with overly broad access grants = HIGH \u2014 encryption key compromise affects all encrypted resources\n- RDS instance with `publicly_accessible = true` = CRITICAL \u2014 database directly reachable from the internet\n- EC2 instance or cloud VM in a public subnet without security group restriction = HIGH\n- IAM role with `sts:AssumeRole` trust policy allowing external accounts without conditions = CRITICAL \u2014 cross-account privilege escalation\n\n### Data loss risk (CRITICAL)\n- RDS instance without `deletion_protection = true` = CRITICAL \u2014 a `terraform destroy` or accidental removal deletes the database permanently\n- RDS instance without `final_snapshot_identifier` = HIGH \u2014 no backup taken before deletion\n- DynamoDB table without `point_in_time_recovery` enabled = HIGH \u2014 no recovery from accidental data corruption\n- S3 bucket without versioning enabled containing production data = HIGH \u2014 no recovery from accidental overwrites or deletes\n- EBS volume with `delete_on_termination = true` on production data volumes = HIGH\n- ElastiCache cluster without `snapshot_retention_limit > 0` = MEDIUM \u2014 no recovery from cache corruption or accidental flush\n\n### Network risks (HIGH)\n- VPC peering connection or transit gateway route change affecting production subnets = HIGH \u2014 can 
break inter-service communication\n- Route table modification removing default routes = CRITICAL \u2014 isolates all resources in the subnet\n- NAT gateway removal or replacement = HIGH \u2014 breaks outbound internet access for private subnets\n- Load balancer listener rule changes = MEDIUM \u2014 can misroute traffic if ordering is wrong\n- DNS record changes (Route53, Cloud DNS) = HIGH \u2014 propagation delays mean rollback takes 60-300 seconds depending on TTL\n\n## State-sensitive operations\n\n### Lifecycle rules\n- Resources with `prevent_destroy = true` \u2014 flag as CRITICAL if a `destroy` action appears for this resource; Terraform will error but CI pipelines may not surface this clearly\n- Resources with `ignore_changes` on security-relevant attributes \u2014 warn that drift may exist between actual state and desired state; the ignored attributes could have been manually changed to something dangerous\n- Resources with `create_before_destroy = true` \u2014 during replacement, both old and new resources exist simultaneously; watch for naming conflicts, IP address changes, and brief service duplication\n- Resources with `replace_triggered_by` \u2014 replacement cascades to dependent resources; verify the full chain is understood\n\n### State operations\n- Backend configuration changes (switching from local to S3, changing bucket/key, changing workspace) = CRITICAL \u2014 risk of state loss, state corruption, or resources becoming orphaned (exist in cloud but unknown to Terraform)\n- `terraform state mv` or `terraform state rm` commands in CI/CD = CRITICAL \u2014 manual state manipulation can orphan or duplicate resources\n- State lock contention \u2014 concurrent `terraform apply` from multiple CI pipelines on the same state file causes lock failures and potential corruption\n- Large state files (>10MB) indicate poor module decomposition and increase plan/apply latency\n\n### Provider version changes\n- Provider major version upgrade (e.g., AWS provider 
4.x \u2192 5.x) = HIGH \u2014 may change resource behavior, rename attributes, or require state migration\n- Provider minor version upgrade = LOW \u2014 but check changelog for deprecation warnings\n- Terraform core version upgrade = MEDIUM \u2014 new versions may change plan behavior, especially around `moved` blocks and refactoring\n\n## Common failure modes\n\n### Plan/apply divergence\n- Data sources that reference frequently changing values (e.g., `aws_ami` with `most_recent = true`) \u2014 the AMI ID can change between plan and apply, causing unexpected instance recreation\n- Resources depending on external state that changes outside Terraform \u2014 plan shows no-op but apply triggers changes\n- `timestamp()` or `uuid()` functions in resource attributes \u2014 forces replacement on every apply\n\n### Module pitfalls\n- Module version upgrade that changes resource addresses internally \u2014 forces recreation of resources that should be updated in-place (e.g., renaming a resource inside a module from `aws_instance.web` to `aws_instance.app` destroys and recreates the instance)\n- `count` vs `for_each` migration \u2014 switching from `count` to `for_each` on existing resources forces destruction and recreation of ALL instances because the state key format changes (numeric index vs string key)\n- `count` index shift \u2014 removing an item from a list used with `count` causes all subsequent resources to shift indices, triggering cascading destroys and recreates (e.g., removing server[1] causes server[2] to become server[1], which Terraform interprets as a replacement)\n\n### Timing and ordering\n- Resources that require propagation time \u2014 IAM policies in AWS take 5-10 seconds to propagate; applying an EC2 instance immediately after an IAM role creation may fail with access denied\n- RDS modifications with `apply_immediately = true` \u2014 triggers immediate reboot; without it, changes are deferred to the next maintenance window\n- Auto Scaling Group 
changes \u2014 `min_size`/`max_size`/`desired_capacity` changes take effect gradually; setting desired to 0 kills all instances immediately\n- ECS service deployments \u2014 if `deployment_minimum_healthy_percent` is too low, rolling deployments may cause downtime\n\n## Provider-specific risks\n\n### AWS\n- Eventual consistency on IAM \u2014 role/policy propagation takes 5-10 seconds across all regions; race conditions are common in rapid provisioning\n- RDS Multi-AZ failover during modification \u2014 some changes trigger an automatic failover, causing 60-120 seconds of downtime\n- Lambda function code changes \u2014 zip hash changes trigger a full redeployment; cold start latency spikes during transition\n- ECS task definition \u2014 new revision creates a new resource, old revision is deregistered; in-flight requests may be dropped if drain timeout is too short\n- CloudFront distribution changes take 15-30 minutes to propagate globally\n\n### GCP\n- Project-level IAM bindings are AUTHORITATIVE \u2014 `google_project_iam_binding` removes all other members for the specified role; use `google_project_iam_member` for additive bindings\n- GKE cluster upgrades are disruptive \u2014 node pool replacement may drain all pods; surge upgrade settings control blast radius\n- Cloud SQL instance names are reserved for up to a week after deletion \u2014 you cannot recreate an instance with the same name in the same project immediately\n\n### Azure\n- Resource group deletion cascades to ALL contained resources \u2014 deleting a resource group is equivalent to deleting every resource inside it\n- Azure Policy assignments take 5-15 minutes to evaluate \u2014 new resources may temporarily violate policies\n- Key Vault soft-delete is enabled by default \u2014 deleted vaults retain the name for the retention period (90 days by default), blocking recreation with the same name\n- App Service plan tier changes may cause brief downtime during scale operation\n\n## Risk weight reference\n\n| Resource type | Base risk weight | Rationale 
|\n|---|---|---|\n| Security group / firewall rule | 0.90 | Direct network exposure |\n| IAM policy / role | 0.90 | Access control, blast radius if compromised |\n| RDS / Cloud SQL / database | 0.95 | Data loss, downtime |\n| S3 / GCS / storage bucket | 0.80 | Data exposure, lifecycle |\n| VPC / network | 0.85 | Infrastructure connectivity |\n| EC2 / VM / compute | 0.50 | Replaceable, stateless (usually) |\n| Lambda / Cloud Function | 0.40 | Stateless, fast rollback |\n| Load balancer | 0.70 | Traffic routing, potential downtime |\n| DNS record | 0.75 | Propagation delay makes rollback slow |\n| Tags / labels | 0.05 | Cosmetic, no operational impact |\n| CloudWatch / monitoring | 0.15 | Observability, not runtime |\n| SNS / SQS / messaging | 0.60 | Message loss potential |",
      "skill_id": "terraform",
      "tags": [
        "terraform",
        "iac",
        "infrastructure"
      ],
      "test_suite_path": "tests/skill-tests/terraform",
      "token_budget": 1800,
      "tool": null,
      "tool_label": "terraform",
      "trigger_content_patterns": [],
      "triggers": [
        ".tf",
        ".tfvars",
        ".tfvars.json",
        "terraform-plan.json",
        "tfplan.json"
      ],
      "version": "1.0.0"
    },
    {
      "always_load": false,
      "author": "Infrastructure Guild Community",
      "description": "Community-authored Terragrunt guidance for include hierarchy drift, dependency output coupling, and run-all blast radius review.",
      "featured": true,
      "license": "MIT",
      "maintainer": "Terragrunt Maintainers",
      "name": "terragrunt",
      "path": "skills/terragrunt",
      "scenario_count": 3,
      "skill_body": "## Critical risk patterns\n\n- `run-all apply` or root include changes can fan out across many stacks at once = HIGH\n- Dependency output contract changes can break downstream stacks even when the local diff looks small = HIGH\n- Backend, remote state, or generate block changes can orphan state or rewrite provider configuration = CRITICAL\n- Deep include hierarchy overrides can silently change locals, inputs, or hooks for every child module = HIGH\n\n## Review cues\n\n- Review parent includes, dependency blocks, and generated provider files together before approving Terragrunt changes.\n- Prefer stack-by-stack blast-radius notes over generic Terraform advice when the change alters shared Terragrunt scaffolding.",
      "skill_id": "terragrunt",
      "tags": [
        "terragrunt",
        "terraform",
        "iac",
        "community"
      ],
      "test_suite_path": "tests/skill-tests/terragrunt",
      "token_budget": 1400,
      "tool": "terraform",
      "tool_label": "terraform",
      "trigger_content_patterns": [],
      "triggers": [
        "terragrunt.hcl"
      ],
      "version": "1.0.0"
    }
  ]
}
