Troubleshooting Guide - Deployment & Operations

Common deployment and runtime issues when running Quarkus Flow workflow applications with Data Index.

This guide is for operators and administrators deploying and running Data Index with Quarkus Flow applications. For build and development issues, see Developer Troubleshooting.

Deployment Issues

Pods stuck in ImagePullBackOff

Symptom:

$ kubectl get pods -n workflows
NAME                            READY   STATUS             RESTARTS   AGE
my-workflow-app-xxx             0/1     ImagePullBackOff   0          2m

Cause: Container image not available in the cluster.

Solution:

For KIND clusters, load the image manually:

# Check image name
docker images | grep my-workflow-app

# Load to KIND
kind load docker-image local/my-workflow-app:1.0.0 --name data-index-test

For production clusters, verify:
- Image registry is accessible
- Image pull secrets are configured
- Image name and tag are correct

The quarkus-kind extension does NOT automatically load images for local KIND development. Manual loading is required.
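
When several pods are affected, it can help to list only those with image pull problems. The following is a minimal sketch, not part of the Data Index tooling: pods_with_pull_errors is a hypothetical helper name, and it assumes the default column layout of kubectl get pods (STATUS in the third column).

```shell
# Hypothetical helper: print the names of pods whose STATUS column reports
# an image pull failure. Reads `kubectl get pods` output on stdin and assumes
# the default column order (NAME READY STATUS RESTARTS AGE).
pods_with_pull_errors() {
  awk 'NR > 1 && ($3 == "ImagePullBackOff" || $3 == "ErrImagePull") { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n workflows | pods_with_pull_errors
```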

Pods crash with Exit Code 137

Symptom:

$ kubectl describe pod my-workflow-app-xxx
...
Exit Code:    137
Reason:       Error

Cause: Out of memory (OOM). The container exceeded its memory limit and was terminated with SIGKILL, typically by the kernel OOM killer.
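
Exit codes above 128 encode the signal that terminated the process: the code is 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). The arithmetic can be sketched as follows; exit_code_signal is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: print the signal number encoded in a container exit
# code (exit code = 128 + signal), or 0 if the code does not encode a signal.
exit_code_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

exit_code_signal 137   # 137 - 128 = 9, the number of SIGKILL
```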

Solution:

Increase memory limits in the deployment manifest:

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"

Quarkus Flow apps typically need at least 512Mi to start reliably.

If using the Quarkus Kubernetes extension, configure the same values in application.properties:

quarkus.kubernetes.resources.requests.memory=512Mi
quarkus.kubernetes.resources.limits.memory=1Gi

Data Index service not starting

Symptom:

Data Index pod in CrashLoopBackOff or failing readiness probes.

Solution:

Check logs for errors:

kubectl logs -n data-index -l app=data-index-service

Verify database connection:

# Check if database pod is running
kubectl get pods -n postgresql

# Test connection from Data Index pod
kubectl exec -n data-index deployment/data-index-service -- \
  curl -v telnet://postgresql.postgresql.svc.cluster.local:5432

Verify environment variables are set:

kubectl get deployment -n data-index data-index-service -o yaml | grep -A 10 env:

Event Collection Issues

Events in logs but not in Data Index

Symptom:

JSON events appear in workflow pod logs, but GraphQL queries return no results.

Debug steps:

Check FluentBit is running:

kubectl get pods -n logging
kubectl logs -n logging -l app=fluent-bit --tail=50

Check FluentBit is collecting events:
```
kubectl logs -n logging -l app=fluent-bit | grep "your-workflow-name"
```
If empty, FluentBit is not capturing events. Check:
- Workflow pod is in the workflows namespace (FluentBit filters by namespace)
- FluentBit has the correct log path: /var/log/containers/*_workflows_*.log
- FluentBit DaemonSet is running on the same node as the workflow pods

Check raw events in PostgreSQL:
```
kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "SELECT COUNT(*) FROM workflow_events_raw;"
```
If zero, FluentBit is not inserting into the database. Check:
- PostgreSQL connection from FluentBit pod
- FluentBit output configuration
- PostgreSQL credentials in FluentBit ConfigMap

Check normalized tables:
```
kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "SELECT id, name, status FROM workflow_instances;"
```
If raw tables have data but normalized tables are empty, triggers may have failed. Check:
- PostgreSQL logs for trigger errors: kubectl logs -n postgresql postgresql-0
- Trigger functions exist: \df normalize_workflow_event
- Triggers are attached: \d workflow_events_raw
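
The three checks above narrow the failure down to one stage of the pipeline. As a sketch of that decision logic (the function name and stage labels are hypothetical, and the three counts are assumed to have been collected with the commands above):

```shell
# Hypothetical helper: map the three counts gathered above to the first
# pipeline stage that appears broken.
#   $1 = matching lines in the FluentBit logs
#   $2 = row count in workflow_events_raw
#   $3 = row count in workflow_instances
locate_pipeline_break() {
  if [ "$1" -eq 0 ]; then
    echo "collection: FluentBit is not capturing events"
  elif [ "$2" -eq 0 ]; then
    echo "output: FluentBit is not inserting into PostgreSQL"
  elif [ "$3" -eq 0 ]; then
    echo "normalization: triggers are not populating workflow_instances"
  else
    echo "ok: events reach the normalized tables"
  fi
}

# Example: events collected and stored raw, but nothing normalized.
locate_pipeline_break 12 12 0
```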

Wrong namespace - FluentBit not collecting

Symptom:

Events in pod logs, but FluentBit not collecting them.

Cause: Workflow pods are not in the workflows namespace, which is the only namespace FluentBit watches.

Solution:

Option A: Deploy workflow app to workflows namespace:

kubectl create namespace workflows
kubectl apply -f workflow-app.yaml -n workflows

Option B: Update FluentBit to watch your namespace:

# Edit FluentBit ConfigMap
kubectl edit configmap -n logging fluent-bit-config

# Change log path filter:
# From: Path /var/log/containers/*_workflows_*.log
# To:   Path /var/log/containers/*_YOUR_NAMESPACE_*.log

# Restart FluentBit pods
kubectl rollout restart daemonset/fluent-bit -n logging
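
The path filter works because container log files under /var/log/containers are named <pod>_<namespace>_<container>-<id>.log. A minimal sketch for checking a pattern before editing the ConfigMap; both function names are hypothetical, and the namespace in the example call is a placeholder:

```shell
# Hypothetical helper: build the FluentBit Path glob for one namespace.
fluentbit_path_for_namespace() {
  printf '/var/log/containers/*_%s_*.log\n' "$1"
}

# Hypothetical helper: succeed if a container log file belongs to the namespace.
# Relies on the <pod>_<namespace>_<container>-<id>.log naming convention.
log_file_in_namespace() {
  case "${1##*/}" in
    *_"$2"_*.log) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the Path value to put in the ConfigMap for a namespace.
fluentbit_path_for_namespace my-namespace
```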

Events delayed or not appearing

Symptom:

Workflow executes, but Data Index doesn’t show it for 10+ seconds.

Cause: Normal propagation delay.

Expected latency:

FluentBit flush interval: 1 second
PostgreSQL insert + trigger: < 1ms
Total end-to-end: 5-10 seconds

If latency is higher:

Check FluentBit flush interval:

kubectl get configmap -n logging fluent-bit-config -o yaml | grep Flush

Check PostgreSQL performance:

kubectl top pod -n postgresql
kubectl logs -n postgresql postgresql-0 | grep -i slow

Check network latency:

# From FluentBit pod to PostgreSQL
kubectl exec -n logging <fluentbit-pod> -- \
  time nc -zv postgresql.postgresql.svc.cluster.local 5432
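
To measure the propagation delay for a specific workflow instance rather than estimating it, one option is to poll until the instance shows up. A sketch, assuming a POSIX shell; wait_until and instance_visible are hypothetical names, and the query in the usage comment reuses the credentials shown elsewhere in this guide. The printed value counts one-second sleeps, so it is only approximate.

```shell
# Hypothetical helper: run a command once per second until it succeeds or
# the limit (in seconds) is reached. Prints the number of one-second waits;
# returns 1 if the limit is reached first.
wait_until() {
  limit=$1
  shift
  waited=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$waited"
}

# Usage (requires cluster access; replace YOUR_INSTANCE_ID):
#   instance_visible() {
#     kubectl exec -n postgresql postgresql-0 -- \
#       env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex -tA \
#       -c "SELECT 1 FROM workflow_instances WHERE id = 'YOUR_INSTANCE_ID'" | grep -q 1
#   }
#   wait_until 30 instance_visible
```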

Data Query Issues

GraphQL schema empty or returns null

Symptom:

GraphQL UI loads but schema is empty, or queries return {"data": null}.

Cause: Storage adapter not initialized or database connection failed.

Solution:

Check Data Index logs:

kubectl logs -n data-index -l app=data-index-service | grep -i "error\|exception"

Verify database schema is initialized:

kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "\dt"

# Should show:
# - workflow_events_raw
# - task_events_raw
# - workflow_instances
# - task_instances

If tables are missing, run manual schema initialization:

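# See deployment guide for schema initialization.
# psql -f reads the path inside the PostgreSQL pod, so copy the file there
# first (for example with kubectl cp).
kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -f /path/to/V1__initial_schema.sql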

Verify database connection:

kubectl exec -n data-index deployment/data-index-service -- \
  env | grep QUARKUS_DATASOURCE
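
Instead of reading the grep output by eye, the check can be turned into a list of missing settings. A sketch: missing_env_vars is a hypothetical helper, and the variable names in the usage comment are the standard Quarkus datasource environment variables, which is an assumption about how this deployment is configured.

```shell
# Hypothetical helper: read `env` output on stdin and print each required
# variable name (passed as arguments) that is not set.
missing_env_vars() {
  env_dump=$(cat)
  for name in "$@"; do
    printf '%s\n' "$env_dump" | grep -q "^${name}=" || echo "$name"
  done
}

# Usage (requires cluster access; variable names are an assumption):
#   kubectl exec -n data-index deployment/data-index-service -- env |
#     missing_env_vars QUARKUS_DATASOURCE_JDBC_URL QUARKUS_DATASOURCE_USERNAME
```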

Duplicate task executions in GraphQL

Symptom:

GraphQL returns multiple TaskExecution objects for the same task position.

Cause: Trigger normalization logic may not be handling task events correctly.

Workaround:

Filter on status=COMPLETED in the query, or keep only the entry with the latest endDate on the client side:

{
  getWorkflowInstance(id: "xxx") {
    taskExecutions(filter: {status: COMPLETED}) {
      taskPosition
      status
    }
  }
}
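
For the client-side variant, the deduplication can be sketched in shell once the task executions have been flattened to one row per line. The row layout, the function name latest_per_position, and the sample position values are all hypothetical; the sketch assumes every row has an endDate in ISO-8601 UTC form, so rows for still-running tasks would need separate handling.

```shell
# Hypothetical helper: keep only the most recent row per task position.
# stdin: one "<taskPosition> <endDate> <status>" row per line, where endDate
# is an ISO-8601 UTC timestamp (these sort correctly as plain strings).
latest_per_position() {
  # Sort by position, then by endDate descending; keep the first row per position.
  LC_ALL=C sort -k1,1 -k2,2r | awk '!seen[$1]++'
}

latest_per_position <<'EOF'
do/0 2024-01-01T10:00:00Z COMPLETED
do/0 2024-01-01T10:00:05Z COMPLETED
do/1 2024-01-01T10:00:06Z COMPLETED
EOF
```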

Long-term fix:

Verify that the database trigger uses an UPDATE-first pattern with an "end" IS NULL condition. See the database schema documentation.

Missing task executions

Symptom:

GraphQL returns workflow but taskExecutions is empty.

Cause: Task rows are missing for the instance, or their instance_id does not reference a row in workflow_instances.

Debug:

Check that task rows exist for the instance:

kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "SELECT task_position, status FROM task_instances WHERE instance_id = 'YOUR_INSTANCE_ID';"

Check foreign key relationship:

kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "\d task_instances"

Verify instance_id has foreign key to workflow_instances(id).

Check for orphaned tasks:

kubectl exec -n postgresql postgresql-0 -- \
  env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
  -c "SELECT COUNT(*) FROM task_instances t WHERE NOT EXISTS (SELECT 1 FROM workflow_instances w WHERE w.id = t.instance_id);"

Component Health Checks

Check all components

# Data Index service
kubectl get pods -n data-index
kubectl logs -n data-index -l app=data-index-service --tail=50

# FluentBit
kubectl get pods -n logging
kubectl logs -n logging -l app=fluent-bit --tail=50

# PostgreSQL
kubectl get pods -n postgresql
kubectl logs -n postgresql postgresql-0 --tail=50

# Workflow applications
kubectl get pods -n workflows
kubectl logs -n workflows -l app.kubernetes.io/name=<your-app> --tail=50
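
To scan all four namespaces quickly, the pod listings can be filtered down to pods that are not fully ready. A sketch only: not_ready_pods is a hypothetical helper, it assumes the default kubectl get pods column layout, and it also reports pods of finished Jobs (for example 0/1 Completed), which are not necessarily a problem.

```shell
# Hypothetical helper: print "<name> <status>" for every pod whose READY
# column (ready/total) shows fewer ready containers than total.
# Reads `kubectl get pods` output on stdin.
not_ready_pods() {
  awk 'NR > 1 { split($2, ready, "/"); if (ready[1] != ready[2]) print $1, $3 }'
}

# Usage (requires cluster access):
#   for ns in data-index logging postgresql workflows; do
#     kubectl get pods -n "$ns" | not_ready_pods
#   done
```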

Check service connectivity

# GraphQL API (port-forward blocks until interrupted; run curl from a second terminal)
kubectl port-forward -n data-index svc/data-index-service 8080:8080
curl http://localhost:8080/q/health

# PostgreSQL from Data Index
kubectl exec -n data-index deployment/data-index-service -- \
  nc -zv postgresql.postgresql.svc.cluster.local 5432

# PostgreSQL from FluentBit
kubectl exec -n logging <fluentbit-pod> -- \
  nc -zv postgresql.postgresql.svc.cluster.local 5432

Elasticsearch-Specific Issues

Transform in failed state

Symptom:

$ kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | jq '.transforms[0].state'
"failed"

Cause: Transform script error, usually due to null field values or incorrect field references.

Solution:

Check transform error details:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | \
  jq '.transforms[0].reason'

Check Elasticsearch logs:

kubectl logs -n elasticsearch elasticsearch-0 | grep -i "transform.*failed"

Stop and delete failed transform:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -X POST 'http://localhost:9200/_transform/workflow-instances-transform/_stop?force=true'

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -X DELETE 'http://localhost:9200/_transform/workflow-instances-transform'

Restart Data Index service to recreate transform:

kubectl rollout restart deployment/data-index-service -n data-index

Events in Elasticsearch but normalized indices empty

Symptom:

Raw indices (workflow-events-*, task-events-*) have events, but normalized indices (workflow-instances, task-executions) are empty.

Debug steps:

Check raw events exist:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/workflow-events-*/_count'

Check transform status:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | \
  jq '{state: .transforms[0].state, docs_processed: .transforms[0].stats.documents_processed}'

If transform stopped, start it:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -X POST 'http://localhost:9200/_transform/workflow-instances-transform/_start'

If transform failed, see "Transform in failed state" above.
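
Which action applies depends on the state reported by the _stats call. A sketch of that mapping; transform_next_step is a hypothetical helper, and the state names are those reported by the Elasticsearch transform stats API:

```shell
# Hypothetical helper: suggest the next step for a given transform state.
transform_next_step() {
  case "$1" in
    failed)           echo "stop with force=true, delete, then restart Data Index" ;;
    stopped)          echo "start the transform" ;;
    started|indexing) echo "no action: transform is running" ;;
    *)                echo "inspect _stats output and Elasticsearch logs" ;;
  esac
}

# Usage (requires cluster access and jq):
#   state=$(kubectl exec -n elasticsearch elasticsearch-0 -- \
#     curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' |
#     jq -r '.transforms[0].state')
#   transform_next_step "$state"
transform_next_step stopped
```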

FluentBit cannot connect to Elasticsearch

Symptom:

FluentBit logs show connection errors:

[warn] [net] getaddrinfo(host='...', err=4): Domain name not found
[error] [output:es:es.0] HTTP status=0 - No response from server

Cause: Incorrect Elasticsearch hostname or port configuration.

Solution:

Verify Elasticsearch service:
```
kubectl get svc -n elasticsearch
```

Update FluentBit DaemonSet environment variables:

# For ECK (Elastic Cloud on Kubernetes)
kubectl set env daemonset/fluent-bit -n logging \
  ELASTICSEARCH_HOST=<elasticsearch-service-name>.<namespace>.svc.cluster.local

# For plain StatefulSet
kubectl set env daemonset/fluent-bit -n logging \
  ELASTICSEARCH_HOST=elasticsearch-0.elasticsearch-es-http.elasticsearch.svc.cluster.local

Restart FluentBit:

kubectl rollout restart daemonset/fluent-bit -n logging
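
"Domain name not found" errors often come from a mistyped service DNS name. In-cluster service names follow the form <service>.<namespace>.svc.<cluster-domain>, which can be assembled from the kubectl get svc output. A small sketch; service_fqdn is a hypothetical helper, and cluster.local is assumed as the cluster domain unless a third argument is given:

```shell
# Hypothetical helper: build the in-cluster DNS name of a service.
#   $1 = service name, $2 = namespace, $3 = cluster domain (default: cluster.local)
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage (requires cluster access; replace the service name and namespace):
#   kubectl set env daemonset/fluent-bit -n logging \
#     ELASTICSEARCH_HOST="$(service_fqdn <elasticsearch-service-name> <namespace>)"
service_fqdn postgresql postgresql
```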

Workflow status showing null

Symptom:

GraphQL returns status: null for workflows or tasks that should have status.

Cause: Transform using wrong field name for status aggregation.

Expected behavior: Status should show COMPLETED, RUNNING, FAULTED, FAILED, etc.

Debug:

Check raw event status field:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/workflow-events-*/_search?size=1' | \
  jq '.hits.hits[0]._source.status'

Check normalized instance status:

kubectl exec -n elasticsearch elasticsearch-0 -- \
  curl -s 'http://localhost:9200/workflow-instances/_search?size=1' | \
  jq '.hits.hits[0]._source.status'

If the normalized status is an object instead of a string, the transform needs updating. Redeploy the Data Index service with the latest schema.

Getting Help

For build and development issues:
- See: Developer Troubleshooting

For configuration:
- See: Configuration Reference
- See: PostgreSQL Deployment
- See: Elasticsearch Deployment
- See: Local Development

For architecture:
- See: PostgreSQL Mode Architecture
- See: Elasticsearch Mode Architecture