Troubleshooting Guide - Deployment & Operations
Common deployment and runtime issues when running Quarkus Flow workflow applications with Data Index.
This guide is for operators and administrators deploying and running Data Index with Quarkus Flow applications. For build and development issues, see Developer Troubleshooting.
Deployment Issues
Pods stuck in ImagePullBackOff
Symptom:
$ kubectl get pods -n workflows
NAME READY STATUS RESTARTS AGE
my-workflow-app-xxx 0/1 ImagePullBackOff 0 2m
Cause: Container image not available in the cluster.
Solution:
For KIND clusters, load the image manually:
# Check image name
docker images | grep my-workflow-app
# Load to KIND
kind load docker-image local/my-workflow-app:1.0.0 --name data-index-test
For production clusters, verify:
- Image registry is accessible
- Image pull secrets are configured
- Image name and tag are correct
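The three production checks can be gathered in one pass. The following is a minimal sketch, not part of the original guide: the function name `image_pull_report` is hypothetical, and the default namespace `workflows` is taken from the symptom output above.

```shell
# Print the facts needed to debug ImagePullBackOff for one pod.
# Usage: image_pull_report POD [NAMESPACE]   (namespace defaults to "workflows")
image_pull_report() {
  pod=$1
  ns=${2:-workflows}

  echo "Images requested:"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'

  echo "Image pull secrets:"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.imagePullSecrets[*]}{.name}{"\n"}{end}'

  echo "Recent events:"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
}

# Example (pod name is a placeholder):
# image_pull_report my-workflow-app-xxx workflows
```

An empty "Image pull secrets" section for a private registry, or an image reference that differs from what `docker images` shows, usually points to the failing check.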
Pods crash with Exit Code 137
Symptom:
$ kubectl describe pod my-workflow-app-xxx
...
Exit Code: 137
Reason: Error
Cause: Out of memory (OOM) - container killed by Kubernetes.
Solution:
Increase memory limits in the deployment manifest:
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
Quarkus Flow apps typically need at least 512Mi to start reliably.
If you use the Quarkus Kubernetes extension, configure the same values in application.properties:
quarkus.kubernetes.resources.requests.memory=512Mi
quarkus.kubernetes.resources.limits.memory=1Gi
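Before raising limits, it can help to confirm that the kill was really memory-related. Exit code 137 only says the process received SIGKILL (128 + 9); a last termination reason of OOMKilled is the stronger signal. The sketch below uses hypothetical helper names (`signal_from_exit_code`, `last_termination`) and assumes the pod runs in the `workflows` namespace unless told otherwise.

```shell
# Exit codes above 128 mean the process was terminated by signal (code - 128).
# Prints the signal number, or 0 if the code does not encode a signal.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}

# Print the last termination reason and exit code of each container in a pod.
# Usage: last_termination POD [NAMESPACE]
last_termination() {
  kubectl get pod "$1" -n "${2:-workflows}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")\n"}{end}'
}

# Example (pod name is a placeholder):
# signal_from_exit_code 137        # prints 9 (SIGKILL)
# last_termination my-workflow-app-xxx
```

If the reason is not OOMKilled, check for other sources of SIGKILL (for example a failing liveness probe) before changing memory settings.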
Data Index service not starting
Symptom:
Data Index pod in CrashLoopBackOff or failing readiness probes.
Solution:
- Check logs for errors:
    kubectl logs -n data-index -l app=data-index-service
- Verify database connection:
    # Check if database pod is running
    kubectl get pods -n postgresql
    # Test connection from Data Index pod
    kubectl exec -n data-index deployment/data-index-service -- \
      curl -v telnet://postgresql.postgresql.svc.cluster.local:5432
- Verify environment variables are set:
    kubectl get deployment -n data-index data-index-service -o yaml | grep -A 10 env:
Event Collection Issues
Events in logs but not in Data Index
Symptom:
JSON events appear in workflow pod logs, but GraphQL queries return empty.
Debug steps:
- Check FluentBit is running:
    kubectl get pods -n logging
    kubectl logs -n logging -l app=fluent-bit --tail=50
- Check FluentBit is collecting events:
    kubectl logs -n logging -l app=fluent-bit | grep "your-workflow-name"
  If empty, FluentBit is not capturing events. Check:
  - Workflow pod is in the workflows namespace (FluentBit filters by namespace)
  - FluentBit has the correct log path: /var/log/containers/*_workflows_*.log
  - FluentBit DaemonSet is running on the same node as workflow pods
- Check raw events in PostgreSQL:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "SELECT COUNT(*) FROM workflow_events_raw;"
  If zero, FluentBit is not inserting into the database. Check:
  - PostgreSQL connection from FluentBit pod
  - FluentBit output configuration
  - PostgreSQL credentials in FluentBit ConfigMap
- Check normalized tables:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "SELECT id, name, status FROM workflow_instances;"
  If raw tables have data but normalized tables are empty, triggers may have failed. Check:
  - PostgreSQL logs for trigger errors: kubectl logs -n postgresql postgresql-0
  - Trigger functions exist (in psql): \df normalize_workflow_event
  - Triggers are attached (in psql): \d workflow_events_raw
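The debug steps above amount to a decision over three numbers: how many FluentBit log lines mention the workflow, how many rows are in the raw table, and how many rows are in the normalized table. The sketch below encodes that decision; the function name `locate_event_gap` and the message wording are illustrative, not part of Data Index.

```shell
# Report the first pipeline stage where events stop flowing.
# Usage: locate_event_gap FLUENTBIT_MATCHES RAW_ROWS NORMALIZED_ROWS
locate_event_gap() {
  fb_matches=$1
  raw_rows=$2
  normalized_rows=$3

  if [ "$fb_matches" -eq 0 ]; then
    echo "FluentBit is not capturing events: check namespace, log path, DaemonSet placement"
  elif [ "$raw_rows" -eq 0 ]; then
    echo "FluentBit is not inserting into PostgreSQL: check connection, output config, credentials"
  elif [ "$normalized_rows" -eq 0 ]; then
    echo "Triggers are not normalizing events: check PostgreSQL logs, trigger functions, trigger attachment"
  else
    echo "Pipeline has data at every stage: investigate Data Index or the GraphQL query"
  fi
}

# Gathering the inputs with the commands from the steps above
# (psql -tA prints the bare value without headers or alignment):
# fb=$(kubectl logs -n logging -l app=fluent-bit | grep -c "your-workflow-name")
# raw=$(kubectl exec -n postgresql postgresql-0 -- \
#   env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
#   -tAc "SELECT COUNT(*) FROM workflow_events_raw;")
# norm=$(kubectl exec -n postgresql postgresql-0 -- \
#   env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
#   -tAc "SELECT COUNT(*) FROM workflow_instances;")
# locate_event_gap "$fb" "$raw" "$norm"
```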
Wrong namespace - FluentBit not collecting
Symptom:
Events in pod logs, but FluentBit not collecting them.
Cause: Workflow pods are not in the workflows namespace, and FluentBit only watches workflows.
Solution:
Option A: Deploy workflow app to workflows namespace:
kubectl create namespace workflows
kubectl apply -f workflow-app.yaml -n workflows
Option B: Update FluentBit to watch your namespace:
# Edit FluentBit ConfigMap
kubectl edit configmap -n logging fluent-bit-config
# Change log path filter:
# From: Path /var/log/containers/*_workflows_*.log
# To: Path /var/log/containers/*_YOUR_NAMESPACE_*.log
# Restart FluentBit pods
kubectl rollout restart daemonset/fluent-bit -n logging
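To see why the Path filter selects by namespace: Kubernetes names container log files `<pod>_<namespace>_<container>-<id>.log`, and neither pod names nor namespaces can contain underscores, so `*_NAMESPACE_*.log` matches exactly the pods of one namespace. The helper below is a small sketch for checking a file name against a namespace; `log_in_namespace` is a hypothetical name.

```shell
# Succeed if a container log file belongs to the given namespace,
# using the same *_NAMESPACE_*.log shape as the FluentBit Path filter.
# Usage: log_in_namespace FILE NAMESPACE
log_in_namespace() {
  case "$(basename "$1")" in
    *_"$2"_*.log) return 0 ;;
    *)            return 1 ;;
  esac
}

# Example, run on a node (or in a pod that mounts /var/log/containers):
# for f in /var/log/containers/*.log; do
#   log_in_namespace "$f" YOUR_NAMESPACE && echo "$f"
# done
```

If the loop prints nothing for your namespace on a node that runs your workflow pods, the Path filter will not match anything there either.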
Events delayed or not appearing
Symptom:
Workflow executes, but Data Index doesn’t show it for 10+ seconds.
Cause: Normal propagation delay.
Expected latency:
- FluentBit flush interval: 1 second
- PostgreSQL insert + trigger: < 1ms
- Total end-to-end: 5-10 seconds
If latency is higher:
- Check FluentBit flush interval:
    kubectl get configmap -n logging fluent-bit-config -o yaml | grep Flush
- Check PostgreSQL performance:
    kubectl top pod -n postgresql
    kubectl logs -n postgresql postgresql-0 | grep -i slow
- Check network latency:
    # From FluentBit pod to PostgreSQL
    kubectl exec -n logging <fluentbit-pod> -- \
      time nc -zv postgresql.postgresql.svc.cluster.local 5432
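To compare actual latency with the 5-10 second expectation, one option is to poll until the instance becomes visible and record the elapsed time. The following is a rough sketch: `wait_until` and `instance_visible` are hypothetical helpers, and the measurement has one-second resolution plus the overhead of each `kubectl exec` call.

```shell
# Poll a command once per second until it succeeds, then print elapsed seconds.
# Returns 1 if the command has not succeeded after TIMEOUT_SECONDS.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() {
  wu_timeout=$1
  shift
  wu_start=$(date +%s)
  while ! "$@" >/dev/null 2>&1; do
    wu_now=$(date +%s)
    if [ $((wu_now - wu_start)) -ge "$wu_timeout" ]; then
      echo "timed out after ${wu_timeout}s" >&2
      return 1
    fi
    sleep 1
  done
  echo $(( $(date +%s) - wu_start ))
}

# Example: start the workflow, then immediately run
# instance_visible() {
#   kubectl exec -n postgresql postgresql-0 -- \
#     env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
#     -tAc "SELECT 1 FROM workflow_instances WHERE id = 'YOUR_INSTANCE_ID'" | grep -q 1
# }
# wait_until 60 instance_visible
```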
Data Query Issues
GraphQL schema empty or returns null
Symptom:
GraphQL UI loads but schema is empty, or queries return {"data": null}.
Cause: Storage adapter not initialized or database connection failed.
Solution:
- Check Data Index logs:
    kubectl logs -n data-index -l app=data-index-service | grep -i "error\|exception"
- Verify database schema is initialized:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "\dt"
    # Should show:
    # - workflow_events_raw
    # - task_events_raw
    # - workflow_instances
    # - task_instances
  If tables are missing, run manual schema initialization:
    # See deployment guide for schema initialization
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -f /path/to/V1__initial_schema.sql
- Verify database connection:
    kubectl exec -n data-index deployment/data-index-service -- \
      env | grep QUARKUS_DATASOURCE
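The schema check can be made mechanical by comparing the tables present against the four expected ones. This is a minimal sketch; `missing_tables` is a hypothetical helper, and the example query assumes the tables live in the `public` schema, which the guide does not state.

```shell
# Read table names from stdin (one per line) and print every expected
# Data Index table that is absent. Prints nothing if the schema is complete.
missing_tables() {
  mt_present=$(cat)
  for mt_table in workflow_events_raw task_events_raw workflow_instances task_instances; do
    printf '%s\n' "$mt_present" | grep -qx "$mt_table" || echo "$mt_table"
  done
}

# Example (schema name "public" is an assumption):
# kubectl exec -n postgresql postgresql-0 -- \
#   env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
#   -tAc "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" | missing_tables
```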
Duplicate task executions in GraphQL
Symptom:
GraphQL returns multiple TaskExecution objects for the same task position.
Cause: Trigger normalization logic may not be handling task events correctly.
Workaround:
Filter by status=COMPLETED in the query, or keep only the entry with the latest endDate on the client side:
{
  getWorkflowInstance(id: "xxx") {
    taskExecutions(filter: {status: COMPLETED}) {
      taskPosition
      status
    }
  }
}
Long-term fix:
Verify database trigger uses UPDATE-first pattern with "end" IS NULL condition. See database schema documentation.
Missing task executions
Symptom:
GraphQL returns workflow but taskExecutions is empty.
Cause: Foreign key relationship or data issue.
Debug:
- Check raw data exists:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "SELECT task_position, status FROM task_instances WHERE instance_id = 'YOUR_INSTANCE_ID';"
- Check foreign key relationship:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "\d task_instances"
  Verify instance_id has a foreign key to workflow_instances(id).
- Check for orphaned tasks:
    kubectl exec -n postgresql postgresql-0 -- \
      env PGPASSWORD=dataindex123 psql -U dataindex -d dataindex \
      -c "SELECT COUNT(*) FROM task_instances t WHERE NOT EXISTS (SELECT 1 FROM workflow_instances w WHERE w.id = t.instance_id);"
Component Health Checks
Check all components
# Data Index service
kubectl get pods -n data-index
kubectl logs -n data-index -l app=data-index-service --tail=50
# FluentBit
kubectl get pods -n logging
kubectl logs -n logging -l app=fluent-bit --tail=50
# PostgreSQL
kubectl get pods -n postgresql
kubectl logs -n postgresql postgresql-0 --tail=50
# Workflow applications
kubectl get pods -n workflows
kubectl logs -n workflows -l app.kubernetes.io/name=<your-app> --tail=50
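For a quick overview, the per-namespace pod listings can be reduced to only the pods that need attention. This is a sketch with a hypothetical helper name, `unhealthy_pods`. It looks only at the STATUS column, so a pod that is Running but not Ready (for example 0/1) is not flagged.

```shell
# List pods whose STATUS is neither Running nor Completed.
# Usage: unhealthy_pods [NAMESPACE...]
# Defaults to the four namespaces used in this guide.
unhealthy_pods() {
  if [ $# -eq 0 ]; then
    set -- data-index logging postgresql workflows
  fi
  for ns in "$@"; do
    # Columns of `kubectl get pods`: NAME READY STATUS RESTARTS AGE
    kubectl get pods -n "$ns" --no-headers 2>/dev/null |
      awk -v ns="$ns" '$3 != "Running" && $3 != "Completed" { print ns "/" $1 ": " $3 }'
  done
}

# Example:
# unhealthy_pods                  # all four namespaces
# unhealthy_pods workflows        # a single namespace
```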
Check service connectivity
# GraphQL API (port-forward blocks, so run it in a separate terminal)
kubectl port-forward -n data-index svc/data-index-service 8080:8080
# In another terminal:
curl http://localhost:8080/q/health
# PostgreSQL from Data Index
kubectl exec -n data-index deployment/data-index-service -- \
nc -zv postgresql.postgresql.svc.cluster.local 5432
# PostgreSQL from FluentBit
kubectl exec -n logging <fluentbit-pod> -- \
nc -zv postgresql.postgresql.svc.cluster.local 5432
Elasticsearch-Specific Issues
Transform in failed state
Symptom:
$ kubectl exec -n elasticsearch elasticsearch-0 -- \
curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | jq '.transforms[0].state'
"failed"
Cause: Transform script error, usually due to null field values or incorrect field references.
Solution:
- Check transform error details (the failure reason is reported by the _stats endpoint):
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | \
      jq '.transforms[0].reason'
- Check Elasticsearch logs:
    kubectl logs -n elasticsearch elasticsearch-0 | grep -i "transform.*failed"
- Stop and delete failed transform:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -X POST 'http://localhost:9200/_transform/workflow-instances-transform/_stop?force=true'
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -X DELETE 'http://localhost:9200/_transform/workflow-instances-transform'
- Restart Data Index service to recreate transform:
    kubectl rollout restart deployment/data-index-service -n data-index
Events in Elasticsearch but normalized indices empty
Symptom:
Raw indices (workflow-events-*, task-events-*) have events, but normalized indices (workflow-instances, task-executions) are empty.
Debug steps:
- Check raw events exist:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -s 'http://localhost:9200/workflow-events-*/_count'
- Check transform status:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | \
      jq '{state: .transforms[0].state, docs_processed: .transforms[0].stats.documents_processed}'
- If transform stopped, start it:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -X POST 'http://localhost:9200/_transform/workflow-instances-transform/_start'
- If transform failed, see "Transform in failed state" above.
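The last two steps depend on the transform state. The sketch below maps each state that Elasticsearch reports to the action described in this guide; `transform_next_step` and its one-word results are illustrative names, not part of any tool.

```shell
# Map an Elasticsearch transform state to the next troubleshooting action.
# Usage: transform_next_step STATE
transform_next_step() {
  case "$1" in
    started|indexing)  echo "ok" ;;      # transform is running
    stopped)           echo "start" ;;   # POST .../_start
    failed)            echo "reset" ;;   # see "Transform in failed state"
    stopping|aborting) echo "wait" ;;    # transitional state, check again shortly
    *)                 echo "unknown" ;;
  esac
}

# Example:
# state=$(kubectl exec -n elasticsearch elasticsearch-0 -- \
#   curl -s 'http://localhost:9200/_transform/workflow-instances-transform/_stats' | \
#   jq -r '.transforms[0].state')
# transform_next_step "$state"
```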
FluentBit cannot connect to Elasticsearch
Symptom:
FluentBit logs show connection errors:
[warn] [net] getaddrinfo(host='...', err=4): Domain name not found
[error] [output:es:es.0] HTTP status=0 - No response from server
Cause: Incorrect Elasticsearch hostname or port configuration.
Solution:
- Verify Elasticsearch service:
    kubectl get svc -n elasticsearch
- Update FluentBit DaemonSet environment variables:
    # For ECK (Elastic Cloud on Kubernetes)
    kubectl set env daemonset/fluent-bit -n logging \
      ELASTICSEARCH_HOST=<elasticsearch-service-name>.<namespace>.svc.cluster.local
    # For plain StatefulSet
    kubectl set env daemonset/fluent-bit -n logging \
      ELASTICSEARCH_HOST=elasticsearch-0.elasticsearch-es-http.elasticsearch.svc.cluster.local
- Restart FluentBit:
    kubectl rollout restart daemonset/fluent-bit -n logging
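Because the `getaddrinfo` error means the name itself did not resolve, it can be worth building the service name from the output of `kubectl get svc` and testing resolution before changing the DaemonSet. The following is a small sketch: `service_fqdn` is a hypothetical helper, `cluster.local` is the common default cluster domain and may differ in your cluster, and the service name in the example is a placeholder.

```shell
# Build the in-cluster DNS name of a Service.
# Usage: service_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Example: resolve the name from a throwaway pod in the FluentBit namespace
# (service name is a placeholder; use the one shown by `kubectl get svc`):
# kubectl run dns-check --rm -it --restart=Never --image=busybox -n logging -- \
#   nslookup "$(service_fqdn elasticsearch-es-http elasticsearch)"
```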
Workflow status showing null
Symptom:
GraphQL returns status: null for workflows or tasks that should have status.
Cause: Transform using wrong field name for status aggregation.
Expected behavior: Status should show COMPLETED, RUNNING, FAULTED, FAILED, etc.
Debug:
- Check raw event status field:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -s 'http://localhost:9200/workflow-events-*/_search?size=1' | \
      jq '.hits.hits[0]._source.status'
- Check normalized instance status:
    kubectl exec -n elasticsearch elasticsearch-0 -- \
      curl -s 'http://localhost:9200/workflow-instances/_search?size=1' | \
      jq '.hits.hits[0]._source.status'
- If the normalized status is an object instead of a string, the transform needs updating. Redeploy the Data Index service with the latest schema.
Getting Help
For build and development issues:
- See: Developer Troubleshooting
For configuration:
- See: Configuration Reference
- See: PostgreSQL Deployment
- See: Elasticsearch Deployment
- See: Local Development
For architecture:
- See: PostgreSQL Mode Architecture
- See: Elasticsearch Mode Architecture