How Data Index Works
Data Index is a read-only query service that captures and normalizes workflow execution events from Quarkus Flow applications.
High-Level Architecture
Data Index supports two storage backends, each with different characteristics:
Quarkus Flow App
↓ (structured logging → stdout)
FluentBit DaemonSet
↓ (tail container logs)
Storage Backend
├─ PostgreSQL (triggers) → < 1ms normalization
└─ Elasticsearch (transforms) → ~1s normalization
↓
GraphQL API
Storage Backends
Data Index provides two production-ready storage options:
| Backend | Best For | Architecture | Status |
|---|---|---|---|
| PostgreSQL (MODE 1) | < 50K workflows/day, ACID transactions, simple deployment | Trigger-based normalization | ✅ Production Ready |
| Elasticsearch (MODE 2) | 100K+ workflows/day, full-text search, analytics | Transform-based normalization | ✅ Production Ready |
Decision Matrix
Choose based on your requirements:
| Requirement | PostgreSQL | Elasticsearch |
|---|---|---|
| ACID transactions | ✅ Yes | ❌ Eventual consistency |
| Real-time normalization | ✅ < 1ms | ⚠️ ~1s |
| Full-text search | ⚠️ Limited (LIKE queries) | ✅ Advanced |
| Throughput | < 50K workflows/day | 100K+ workflows/day |
| Deployment complexity | ⭐⭐ Medium | ⭐⭐⭐ Higher |
| Horizontal scaling | ❌ Vertical only | ✅ Yes |
| JSON field queries | ✅ JSONB operators | ✅ Nested field queries |
| Analytics | ⚠️ Basic aggregations | ✅ Elasticsearch aggregations |
| Operational familiarity | ✅ Common skill set | ⚠️ Requires ES expertise |
Quick recommendations:
- Start with PostgreSQL if you’re unsure - easier to operate and sufficient for most use cases
- Choose Elasticsearch if you need full-text search, advanced analytics, or high throughput (>50K workflows/day)
- Both backends provide the same GraphQL API - you can switch later if needed
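The recommendations above can be read as a simple decision rule. The sketch below is illustrative only: the function name `choose_backend` and its parameters are hypothetical, and the 50K workflows/day threshold is copied from the quick recommendations rather than derived from any benchmark.

```python
def choose_backend(workflows_per_day: int,
                   needs_full_text_search: bool = False,
                   needs_advanced_analytics: bool = False) -> str:
    """Suggest a storage backend following the quick recommendations.

    Hypothetical helper, not part of Data Index.
    """
    # Elasticsearch is suggested for full-text search or advanced analytics.
    if needs_full_text_search or needs_advanced_analytics:
        return "elasticsearch"
    # Elasticsearch is also suggested for high throughput (>50K workflows/day).
    if workflows_per_day > 50_000:
        return "elasticsearch"
    # Otherwise start with PostgreSQL: easier to operate, enough for most cases.
    return "postgresql"
```

Because both backends expose the same GraphQL API, the result of such a rule is a starting point rather than a permanent commitment.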
Key Components
1. Quarkus Flow Applications
Applications with structured logging enabled write JSON events to stdout:
{"instanceId":"01KQ...", "eventType":"io.serverlessworkflow.workflow.started.v1", "timestamp":1777298089.549604, ...}
Critical configuration:
- quarkus.flow.structured-logging.enabled=true
- quarkus.flow.structured-logging.timestamp-format=epoch-seconds
- Console handler for JSON output
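To make the `epoch-seconds` timestamp format concrete, here is a minimal sketch of how a consumer might read one event line. It assumes only the three fields shown in the sample event; the `parse_event` helper, the added `time` key, and the `example-instance` ID are hypothetical placeholders.

```python
import json
from datetime import datetime, timezone


def parse_event(line: str) -> dict:
    """Parse one structured-logging line and attach a UTC datetime."""
    event = json.loads(line)
    # epoch-seconds: fractional seconds since 1970-01-01T00:00:00Z
    event["time"] = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return event


# Placeholder instance ID; the timestamp value is the one from the sample event.
sample = (
    '{"instanceId": "example-instance", '
    '"eventType": "io.serverlessworkflow.workflow.started.v1", '
    '"timestamp": 1777298089.549604}'
)
event = parse_event(sample)
print(event["eventType"], event["time"].isoformat())
```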
2. FluentBit DaemonSet
FluentBit runs on each Kubernetes node and:
- Tails /var/log/containers/workflows.log (pods in workflows namespace)
- Filters lines to extract only JSON events (not regular app logs)
- Parses JSON and extracts fields
- Forwards to storage backend (PostgreSQL or Elasticsearch)
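The filter step can be pictured as follows. This is a Python sketch of the idea only, not the actual FluentBit filter: it assumes the log message has already been separated from any container-runtime prefix, and that workflow events can be recognized by the presence of `eventType` and `instanceId`. The function name `extract_event` is hypothetical.

```python
import json
from typing import Optional


def extract_event(line: str) -> Optional[dict]:
    """Return the parsed workflow event, or None for regular app logs."""
    text = line.strip()
    if not text.startswith("{"):
        return None  # plain-text application log line
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return None  # looked like JSON but was not valid
    if not isinstance(record, dict):
        return None
    # Assumption: workflow events always carry both of these fields.
    if "eventType" in record and "instanceId" in record:
        return record
    return None
```

Lines that return `None` are dropped; everything else moves on to field extraction and forwarding.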
Configuration:
- fluent-bit.conf - Input, filter, output configuration
- flatten-event.lua - Lua script to flatten nested JSON
- Deployed via ConfigMap + DaemonSet
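Flattening nested JSON generally means turning nested objects into top-level keys. The sketch below shows the technique in Python for illustration; it is not a port of flatten-event.lua, and the underscore separator and the sample field names are assumptions.

```python
def flatten(obj: dict, parent: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts into a single-level dict."""
    flat = {}
    for key, value in obj.items():
        # Build the compound key, e.g. "workflow" + "name" -> "workflow_name".
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


nested = {"instanceId": "wf-1", "workflow": {"name": "order", "version": "1.0"}}
print(flatten(nested))
```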
3. Storage Backend
PostgreSQL Mode
- Raw tables - Store complete events in JSON format
- Real-time normalization - Events normalized in less than 1ms
- Normalized tables - Optimized for querying
See PostgreSQL Mode Architecture for details.
Elasticsearch Mode
- Raw indices - Store complete events
- Automated transforms - Extract fields and write to normalized indices (~1s)
- Normalized indices - Optimized for querying and full-text search
See Elasticsearch Mode Architecture for details.
Event Lifecycle
1. Workflow executes in Quarkus Flow app
2. Structured logging writes JSON event to stdout
3. Kubernetes captures stdout to /var/log/containers/POD_NAME.log
4. FluentBit tails log file, extracts JSON events
5. Storage backend receives and normalizes events:
   - PostgreSQL: Events normalized in real-time (< 1ms)
   - Elasticsearch: Events normalized asynchronously (~1s)
6. Data Index queries normalized data via storage adapter
7. GraphQL returns data to user
Event Processing Time
| Metric | PostgreSQL (MODE 1) | Elasticsearch (MODE 2) |
|---|---|---|
| Normalization | < 1ms (database triggers) | ~1s (transforms) |
| End-to-end latency | 5-10 seconds | 5-10 seconds |
| FluentBit collection interval | 5 seconds | 5 seconds |
| Data consistency | Immediate (ACID) | Eventual (~1s delay) |
Note: End-to-end latency includes FluentBit collection interval (5s), log parsing, network transit, and storage processing. The primary difference is normalization speed: PostgreSQL triggers are synchronous (< 1ms), while Elasticsearch transforms run asynchronously (~1s).
Key Design Features
Real-Time Processing
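Events are normalized by the storage backend itself as they arrive:
- No separate event processor service required
- No polling or batch processing
- Normalization completes in under 1ms in PostgreSQL mode and in about 1s in Elasticsearch mode; end-to-end latency from event to query (5-10 seconds) is dominated by the FluentBit collection interval
- Handles duplicates and out-of-order events automatically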
What Data Index Does NOT Do
Data Index is read-only. It does NOT:
Architecture Differences
PostgreSQL (MODE 1) - Trigger-Based Normalization
Data Flow:
FluentBit → PostgreSQL raw tables (workflow_events_raw, task_events_raw)
↓ (BEFORE INSERT triggers, < 1ms)
Normalized tables (workflow_instances, task_instances)
↓ (JPA/Hibernate)
GraphQL API
Key characteristics:
- Synchronous normalization - Triggers execute during INSERT (< 1ms)
- ACID guarantees - Transactions ensure consistency
- Upsert logic - COALESCE handles out-of-order events
- Single writer - Vertical scaling only
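The COALESCE-based upsert can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL and runs the upsert directly rather than from a trigger; the column names, the `normalize` helper, and the `completed.v1` event type are assumptions for illustration, not the actual Data Index schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE workflow_instances (
    instance_id TEXT PRIMARY KEY,
    start_time  REAL,
    end_time    REAL
)
"""

# COALESCE keeps a value that is already set and fills in one that is missing,
# so the final row does not depend on the order in which events arrive.
UPSERT = """
INSERT INTO workflow_instances (instance_id, start_time, end_time)
VALUES (?, ?, ?)
ON CONFLICT(instance_id) DO UPDATE SET
    start_time = COALESCE(workflow_instances.start_time, excluded.start_time),
    end_time   = COALESCE(workflow_instances.end_time,   excluded.end_time)
"""


def normalize(conn: sqlite3.Connection, event: dict) -> None:
    """Apply one raw event to the normalized table (hypothetical helper)."""
    kind, ts = event["eventType"], event["timestamp"]
    start = ts if kind.endswith(".started.v1") else None
    end = ts if kind.endswith(".completed.v1") else None
    conn.execute(UPSERT, (event["instanceId"], start, end))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
started = {"instanceId": "wf-1", "timestamp": 10.0,
           "eventType": "io.serverlessworkflow.workflow.started.v1"}
completed = {"instanceId": "wf-1", "timestamp": 20.0,
             "eventType": "io.serverlessworkflow.workflow.completed.v1"}
# Out of order (completed first), then a duplicate of the started event.
for ev in (completed, started, started):
    normalize(conn, ev)
rows = conn.execute(
    "SELECT instance_id, start_time, end_time FROM workflow_instances"
).fetchall()
print(rows)
```

In this sketch, an out-of-order arrival and a duplicate both converge on a single row per instance, which is the property the trigger-based design relies on.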
Elasticsearch (MODE 2) - Transform-Based Normalization
Data Flow:
FluentBit → Elasticsearch raw indices (workflow-instance-events-raw-*, task-execution-events-raw-*)
↓ (Continuous transforms, ~1s)
Normalized indices (workflow-instances, task-executions)
↓ (Elasticsearch Java Client)
GraphQL API
Key characteristics:
- Asynchronous normalization - Transforms run independently (~1s delay)
- Eventual consistency - Raw events appear before normalized documents
- Aggregation logic - scripted_metric handles out-of-order events
- Horizontal scaling - Distributed across Elasticsearch cluster
- ILM retention - Automatic deletion of raw events after 7 days
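The reason an aggregation tolerates out-of-order events is that it reduces the whole set of raw events per instance, so arrival order does not matter. The following Python sketch mimics that idea in memory; it is not the transform's scripted_metric script, and the output field names, status values, and `completed.v1` event type are assumptions.

```python
from collections import defaultdict


def reduce_events(raw_events: list) -> dict:
    """Group raw events by instance and reduce each group to one document."""
    grouped = defaultdict(list)
    for event in raw_events:
        grouped[event["instanceId"]].append(event)

    docs = {}
    for instance_id, events in grouped.items():
        starts = [e["timestamp"] for e in events
                  if e["eventType"].endswith(".started.v1")]
        ends = [e["timestamp"] for e in events
                if e["eventType"].endswith(".completed.v1")]
        docs[instance_id] = {
            "instanceId": instance_id,
            # min/max make duplicates and ordering irrelevant.
            "startTime": min(starts) if starts else None,
            "endTime": max(ends) if ends else None,
            "status": "COMPLETED" if ends else "RUNNING",
        }
    return docs


events = [
    {"instanceId": "wf-1", "timestamp": 20.0,
     "eventType": "io.serverlessworkflow.workflow.completed.v1"},
    {"instanceId": "wf-1", "timestamp": 10.0,
     "eventType": "io.serverlessworkflow.workflow.started.v1"},
    {"instanceId": "wf-2", "timestamp": 15.0,
     "eventType": "io.serverlessworkflow.workflow.started.v1"},
]
docs = reduce_events(events)
print(docs["wf-1"], docs["wf-2"])
```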
Next Steps
- PostgreSQL Mode Details - Real-time normalization
- Elasticsearch Mode Details - High-throughput processing
- Deployment Guide - Choose and deploy