How Data Index Works

Data Index is a read-only query service that captures and normalizes workflow execution events from Quarkus Flow applications.

High-Level Architecture

Data Index supports two storage backends, each with different characteristics:

Quarkus Flow App
    ↓ (structured logging → stdout)
FluentBit DaemonSet
    ↓ (tail container logs)
Storage Backend
    ├─ PostgreSQL (triggers) → < 1ms normalization
    └─ Elasticsearch (transforms) → ~1s normalization
    ↓
GraphQL API

Storage Backends

Data Index provides two production-ready storage options:

Backend	Best For	Architecture	Status
PostgreSQL (MODE 1)	< 50K workflows/day, ACID transactions, simple deployment	PostgreSQL Mode Details	✅ Production Ready
Elasticsearch (MODE 2)	100K+ workflows/day, full-text search, analytics	Elasticsearch Mode Details	✅ Production Ready

Decision Matrix

Choose based on your requirements:

Requirement	PostgreSQL	Elasticsearch
ACID transactions	✅ Yes	❌ Eventual consistency
Real-time normalization	✅ < 1ms	⚠️ ~1s
Full-text search	⚠️ Limited (LIKE queries)	✅ Advanced
Throughput	< 50K workflows/day	100K+ workflows/day
Deployment complexity	⭐⭐ Medium	⭐⭐⭐ Higher
Horizontal scaling	❌ Vertical only	✅ Yes
JSON field queries	✅ JSONB operators	✅ Nested field queries
Analytics	⚠️ Basic aggregations	✅ Elasticsearch aggregations
Operational familiarity	✅ Common skill set	⚠️ Requires ES expertise

Quick recommendations:

Start with PostgreSQL if you’re unsure - easier to operate and sufficient for most use cases
Choose Elasticsearch if you need full-text search, advanced analytics, or high throughput (>50K workflows/day)
Both backends provide the same GraphQL API - you can switch later if needed
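
The recommendations above can be written down as a small helper. This is only a sketch: the function name, the parameter names, and the use of 50,000 workflows/day as a hard cut-off are illustrative assumptions taken from the decision matrix, not part of Data Index.

```python
def recommend_backend(workflows_per_day: int,
                      needs_full_text_search: bool = False,
                      needs_advanced_analytics: bool = False) -> str:
    """Return "postgresql" or "elasticsearch" following the quick recommendations."""
    high_throughput = 50_000  # assumed cut-off, taken from the decision matrix

    # Full-text search and advanced analytics are Elasticsearch strengths.
    if needs_full_text_search or needs_advanced_analytics:
        return "elasticsearch"
    # Above the cut-off, horizontal scaling becomes the deciding factor.
    if workflows_per_day > high_throughput:
        return "elasticsearch"
    # Default choice when unsure: simpler to operate.
    return "postgresql"


print(recommend_backend(10_000))
print(recommend_backend(200_000))
```

Because both backends expose the same GraphQL API, a choice made this way can be revisited later.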

Key Components

1. Quarkus Flow Applications

Applications with structured logging enabled write JSON events to stdout:

{"instanceId":"01KQ...", "eventType":"io.serverlessworkflow.workflow.started.v1", "timestamp":1777298089.549604, ...}

Critical configuration:

quarkus.flow.structured-logging.enabled=true
quarkus.flow.structured-logging.timestamp-format=epoch-seconds
Console handler for JSON output
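
To show what the epoch-seconds format means for a consumer, here is a minimal Python sketch that parses one event line and converts its timestamp to a UTC datetime. The sample line is hypothetical: only the three field names and the event type come from the example above, the instance ID is a placeholder, and real events carry more fields.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample line; "example-instance-1" is a placeholder ID.
line = (
    '{"instanceId":"example-instance-1",'
    '"eventType":"io.serverlessworkflow.workflow.started.v1",'
    '"timestamp":1777298089.549604}'
)


def parse_event(raw: str) -> dict:
    """Parse one JSON event line and add a timezone-aware 'time' field."""
    event = json.loads(raw)
    # epoch-seconds: fractional seconds since 1970-01-01T00:00:00Z
    event["time"] = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return event


event = parse_event(line)
print(event["eventType"], event["time"].isoformat())
```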

2. FluentBit DaemonSet

FluentBit runs on each Kubernetes node and:

Tails /var/log/containers/*_workflows_*.log (pods in the workflows namespace)
Filters lines to extract only JSON events (not regular app logs)
Parses JSON and extracts fields
Forwards to storage backend (PostgreSQL or Elasticsearch)

Configuration:

fluent-bit.conf - Input, filter, output configuration
flatten-event.lua - Lua script to flatten nested JSON
Deployed via ConfigMap + DaemonSet
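
The filter and flatten steps can be illustrated in Python. This is not the content of fluent-bit.conf or flatten-event.lua, which are not reproduced here. It is a sketch under stated assumptions: the line has already been stripped of the container runtime prefix, a workflow event is recognized by the presence of an eventType field, and nested keys are joined with an underscore.

```python
import json


def extract_event(line: str):
    """Return the parsed event if the line is a JSON workflow event, else None."""
    line = line.strip()
    if not line.startswith("{"):
        return None  # regular application log line
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # looked like JSON but was not valid
    # Assumption: workflow events are the JSON objects that carry an eventType.
    if not isinstance(event, dict) or "eventType" not in event:
        return None
    return event


def flatten(obj: dict, prefix: str = "", sep: str = "_") -> dict:
    """Flatten nested objects, e.g. {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical log lines: one plain log message, one nested event.
lines = [
    "2026-01-01 12:00:00 INFO Application started",
    '{"instanceId":"i-1","eventType":"x","error":{"type":"T","detail":{"code":5}}}',
]
events = [flatten(e) for e in map(extract_event, lines) if e is not None]
print(events)
```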

3. Storage Backend

PostgreSQL Mode

Raw tables - Store complete events in JSON format
Real-time normalization - Events normalized in less than 1ms
Normalized tables - Optimized for querying

See PostgreSQL Mode Architecture for details.

Elasticsearch Mode

Raw indices - Store complete events
Automated transforms - Extract fields and write to normalized indices (~1s)
Normalized indices - Optimized for querying and full-text search

See Elasticsearch Mode Architecture for details.

4. Data Index Service

Quarkus application providing:

GraphQL API - Query workflow instances and task executions
Storage adapter - JPA for PostgreSQL, ES Client for Elasticsearch
SmallRye GraphQL - GraphQL schema and resolvers
Health checks - Liveness and readiness probes
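
A client talks to the service with an ordinary GraphQL-over-HTTP POST. The sketch below only builds the request. The query field names (workflowInstances, id, status), the /graphql path, and the host and port are assumptions for illustration, so check the deployed schema before relying on them.

```python
import json
import urllib.request

# Hypothetical query: these field names are placeholders, not the real schema.
QUERY = """
query {
  workflowInstances {
    id
    status
  }
}
"""


def build_graphql_request(base_url: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/graphql",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request("http://localhost:8080", QUERY)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```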

Event Lifecycle

Workflow executes in Quarkus Flow app
Structured logging writes JSON event to stdout
Kubernetes captures stdout to /var/log/containers/POD_NAME.log
FluentBit tails log file, extracts JSON events
Storage backend receives and normalizes events:
- PostgreSQL: Events normalized in real-time (< 1ms)
- Elasticsearch: Events normalized asynchronously (~1s)
Data Index queries normalized data via storage adapter
GraphQL returns data to user

Event Processing Time

Metric	PostgreSQL (MODE 1)	Elasticsearch (MODE 2)
Normalization	< 1ms (database triggers)	~1s (transforms)
End-to-end latency	5-10 seconds	5-10 seconds
FluentBit collection interval	5 seconds	5 seconds
Data consistency	Immediate (ACID)	Eventual (~1s delay)

Note: End-to-end latency includes FluentBit collection interval (5s), log parsing, network transit, and storage processing. The primary difference is normalization speed: PostgreSQL triggers are synchronous (< 1ms), while Elasticsearch transforms run asynchronously (~1s).

Key Design Features

Real-Time Processing

Events are normalized immediately as they arrive:

No separate event processor service required
No polling or batch processing
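Fast normalization once an event reaches storage: < 1ms in PostgreSQL, ~1s in Elasticsearch (end-to-end latency is still 5-10 seconds because of the FluentBit collection interval)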
Handles duplicates and out-of-order events automatically

Flexible Data Storage

Workflow input and output data stored as JSON:

Flexible - Any workflow schema supported
Queryable - Can filter by JSON fields when needed
GraphQL - Exposed as JSON strings for client parsing

Two-Stage Storage

Events stored in both raw and normalized formats:

Raw storage - Preserve original events for debugging and audit
Normalized storage - Optimized for fast querying
Replay capability - Can reprocess raw events if needed

What Data Index Does NOT Do

Data Index is read-only. It does NOT:

❌ Execute workflows
❌ Modify workflow state
❌ Provide workflow management (start/stop/retry)
❌ Store workflow definitions
❌ Require a separate event processor service

Architecture Differences

PostgreSQL (MODE 1) - Trigger-Based Normalization

Data Flow:

FluentBit → PostgreSQL raw tables (workflow_events_raw, task_events_raw)
                ↓ (BEFORE INSERT triggers, < 1ms)
            Normalized tables (workflow_instances, task_instances)
                ↓ (JPA/Hibernate)
            GraphQL API

Key characteristics:

Synchronous normalization - Triggers execute during INSERT (< 1ms)
ACID guarantees - Transactions ensure consistency
Upsert logic - COALESCE handles out-of-order events
Single writer - Vertical scaling only
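
The COALESCE upsert idea can be tried out with an in-memory SQLite database, which accepts the same INSERT ... ON CONFLICT ... DO UPDATE form as PostgreSQL. This is a sketch, not the actual trigger code. The columns start_time and end_time and the completed event type are assumptions, and the upsert is issued directly from Python rather than from a trigger on the raw table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workflow_instances (
        id TEXT PRIMARY KEY,
        start_time REAL,   -- assumed column
        end_time REAL      -- assumed column
    )
""")

# Each event sets only the column it knows about. COALESCE keeps the stored
# value whenever the incoming value is NULL, so arrival order does not matter.
UPSERT = """
    INSERT INTO workflow_instances (id, start_time, end_time)
    VALUES (:id, :start_time, :end_time)
    ON CONFLICT (id) DO UPDATE SET
        start_time = COALESCE(excluded.start_time, workflow_instances.start_time),
        end_time   = COALESCE(excluded.end_time,   workflow_instances.end_time)
"""


def apply_event(event: dict) -> None:
    is_start = event["eventType"].endswith("workflow.started.v1")
    conn.execute(UPSERT, {
        "id": event["instanceId"],
        "start_time": event["timestamp"] if is_start else None,
        "end_time": None if is_start else event["timestamp"],
    })


STARTED = "io.serverlessworkflow.workflow.started.v1"
COMPLETED = "io.serverlessworkflow.workflow.completed.v1"  # assumed event type

# The completed event arrives first, then started, then a duplicate started.
for event_type, ts in [(COMPLETED, 20.0), (STARTED, 10.0), (STARTED, 10.0)]:
    apply_event({"instanceId": "wf-1", "eventType": event_type, "timestamp": ts})

rows = conn.execute(
    "SELECT id, start_time, end_time FROM workflow_instances"
).fetchall()
print(rows)
```

The late started event fills in start_time without erasing the end_time written earlier, and the duplicate leaves the row unchanged.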

Elasticsearch (MODE 2) - Transform-Based Normalization

Data Flow:

FluentBit → Elasticsearch raw indices (workflow-instance-events-raw-*, task-execution-events-raw-*)
                ↓ (Continuous transforms, ~1s)
            Normalized indices (workflow-instances, task-executions)
                ↓ (Elasticsearch Java Client)
            GraphQL API

Key characteristics:

Asynchronous normalization - Transforms run independently (~1s delay)
Eventual consistency - Raw events appear before normalized documents
Aggregation logic - scripted_metric handles out-of-order events
Horizontal scaling - Distributed across Elasticsearch cluster
ILM retention - Automatic deletion of raw events after 7 days
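
A scripted_metric aggregation works in phases: each shard folds its documents into a partial state, and a final reduce step merges the partial states. The Python analogy below shows why that tolerates out-of-order events. It is not the Painless script used by the transforms. The state fields start and end and the completed event type are assumptions.

```python
from functools import reduce


def map_event(state: dict, event: dict) -> dict:
    """Per-document step: fold one raw event into a partial state."""
    ts = event["timestamp"]
    if event["eventType"].endswith("workflow.started.v1"):
        state["start"] = ts if state.get("start") is None else min(state["start"], ts)
    else:
        state["end"] = ts if state.get("end") is None else max(state["end"], ts)
    return state


def merge_states(a: dict, b: dict) -> dict:
    """Reduce step: combine two partial states; the result is order-independent."""
    def pick(fn, x, y):
        known = [v for v in (x, y) if v is not None]
        return fn(known) if known else None
    return {
        "start": pick(min, a.get("start"), b.get("start")),
        "end": pick(max, a.get("end"), b.get("end")),
    }


STARTED = "io.serverlessworkflow.workflow.started.v1"
COMPLETED = "io.serverlessworkflow.workflow.completed.v1"  # assumed event type

# Two hypothetical shards; the completed event is seen before the started one.
shard_a = [{"eventType": COMPLETED, "timestamp": 20.0}]
shard_b = [{"eventType": STARTED, "timestamp": 10.0},
           {"eventType": STARTED, "timestamp": 10.0}]  # duplicate

state_a = reduce(map_event, shard_a, {})
state_b = reduce(map_event, shard_b, {})
merged = merge_states(state_a, state_b)
print(merged)
```

Because min and max are commutative, merging the states in either order yields the same normalized document.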

Next Steps

PostgreSQL Mode Details - Real-time normalization
Elasticsearch Mode Details - High-throughput processing
Deployment Guide - Choose and deploy