Skip to content

Build the CI/CD workflow

The CI/CD workflow runs the instrumented pipeline and, on failure, queries Okahu for traces and calls the Kahu SRE Agent for root cause analysis.

Workflow overview

name: CI/CD Deploy Issue Summary
on:
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

Step 1: Install and configure

      - name: Install dependencies
        run: pip install monocle_apptrace pyyaml

      - name: Load .env and set exporters
        run: |
          EXPORTERS="file"
          if [ -n "$OKAHU_API_KEY" ]; then
            EXPORTERS="file,okahu"
          fi
          echo "MONOCLE_EXPORTER=$EXPORTERS" >> $GITHUB_ENV

Step 2: Run the pipeline with Monocle

      - name: Run deployment pipeline
        id: deploy
        continue-on-error: true
        run: |
          python -m monocle_apptrace --config okahu.yaml deploy_app.py \
            > output.txt 2>&1; EXIT_CODE=$?
          cat output.txt
          echo "deploy_exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT
          exit $EXIT_CODE

The continue-on-error: true pattern lets the workflow proceed to trace collection and SRE agent analysis after a failure.

Step 3: Query Okahu for traces

      - name: Resolve traces for this run
        if: steps.deploy.outcome == 'failure'
        run: |
          # Wait for trace ingestion
          sleep 10

          # Query by GitHub run ID
          TRACES_RESPONSE=$(curl -s \
            "${API_BASE}/api/v1/apps/${APP}/traces\
?duration_fact=test_runs&fact_ids=github_${GITHUB_RUN_ID}" \
            -H "x-api-key: ${OKAHU_API_KEY}")

          TRACE_IDS=$(echo "$TRACES_RESPONSE" \
            | jq -r '.traces[]?.trace_id // empty')

Monocle automatically tags traces with the GitHub run ID, making them queryable via the Okahu API.

Step 4: Call the Kahu SRE Agent

      - name: Call Kahu SRE Agent
        if: steps.deploy.outcome == 'failure'
        run: |
          QUERY="Investigate deploy failure for this CI/CD run. \
For each trace, look at the spans in detail — especially \
the data.output events which contain the full console output \
and error messages from each deployment step."

          RESPONSE=$(curl -s -X POST "${SRE_URL}" \
            -H "Content-Type: application/json" \
            -H "x-api-key: ${OKAHU_API_KEY}" \
            -d "$(jq -n --arg q "$QUERY" '{query: $q}')")

Step 5: Create GitHub issue with analysis

      - name: Create GitHub Issue with Kahu response
        if: steps.deploy.outcome == 'failure'
        run: |
          gh issue create \
            --title "Deploy Failure: deploy_app.py - $(date -u)" \
            --body-file issue_body.md

The issue contains the Kahu SRE Agent's analysis — a structured summary with root cause, affected steps, and recommended fixes.

No manual investigation needed

The full loop runs automatically: deploy → fail → collect traces → AI analysis → GitHub issue with root cause. The on-call engineer gets an issue with actionable analysis, not just a stack trace.