How Data Index Works

Data Index is a read-only query service that captures and normalizes workflow execution events from Quarkus Flow applications.

High-Level Architecture

Events flow from the application through FluentBit into one of two storage backends, and are then served through the GraphQL API:

Quarkus Flow App
    ↓ (structured logging → stdout)
FluentBit DaemonSet
    ↓ (tail container logs)
Storage Backend
    ├─ PostgreSQL (triggers) → < 1ms normalization
    └─ Elasticsearch (transforms) → ~1s normalization
    ↓
GraphQL API

Storage Backends

Data Index provides two production-ready storage options:

Backend                | Best For                                                  | Architecture               | Status
PostgreSQL (MODE 1)    | < 50K workflows/day, ACID transactions, simple deployment | PostgreSQL Mode Details    | ✅ Production Ready
Elasticsearch (MODE 2) | 100K+ workflows/day, full-text search, analytics          | Elasticsearch Mode Details | ✅ Production Ready

Decision Matrix

Choose based on your requirements:

Requirement             | PostgreSQL                | Elasticsearch
ACID transactions       | ✅ Yes                    | ❌ Eventual consistency
Real-time normalization | ✅ < 1ms                  | ⚠️ ~1s
Full-text search        | ⚠️ Limited (LIKE queries) | ✅ Advanced
Throughput              | < 50K workflows/day       | 100K+ workflows/day
Deployment complexity   | ⭐⭐ Medium               | ⭐⭐⭐ Higher
Horizontal scaling      | ❌ Vertical only          | ✅ Yes
JSON field queries      | ✅ JSONB operators        | ✅ Nested field queries
Analytics               | ⚠️ Basic aggregations     | ✅ Elasticsearch aggregations
Operational familiarity | ✅ Common skill set       | ⚠️ Requires ES expertise

Quick recommendations:

  • Start with PostgreSQL if you’re unsure - easier to operate and sufficient for most use cases

  • Choose Elasticsearch if you need full-text search, advanced analytics, or high throughput (>50K workflows/day)

  • Both backends provide the same GraphQL API - you can switch later if needed
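The quick recommendations above can be written down as a small decision helper. This is only a sketch: the function name and parameters are hypothetical, and the 50K threshold is taken directly from the recommendation text rather than from any Data Index API.

```python
from typing import Optional


def recommend_backend(
    workflows_per_day: Optional[int] = None,
    needs_full_text_search: bool = False,
    needs_advanced_analytics: bool = False,
) -> str:
    """Encode the quick recommendations as a rule of thumb (hypothetical helper)."""
    # Full-text search and advanced analytics point to Elasticsearch.
    if needs_full_text_search or needs_advanced_analytics:
        return "elasticsearch"
    # High throughput (> 50K workflows/day) also points to Elasticsearch.
    if workflows_per_day is not None and workflows_per_day > 50_000:
        return "elasticsearch"
    # Unsure, or modest requirements: start with PostgreSQL.
    return "postgresql"


print(recommend_backend())
print(recommend_backend(workflows_per_day=120_000))
print(recommend_backend(workflows_per_day=10_000, needs_full_text_search=True))
```

Because both backends expose the same GraphQL API, a wrong first guess here mainly costs a data migration, not client changes.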

Key Components

1. Quarkus Flow Applications

Applications with structured logging enabled write JSON events to stdout:

{"instanceId":"01KQ...", "eventType":"io.serverlessworkflow.workflow.started.v1", "timestamp":1777298089.549604, ...}

Critical configuration:

  • quarkus.flow.structured-logging.enabled=true

  • quarkus.flow.structured-logging.timestamp-format=epoch-seconds

  • Console handler for JSON output
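To make the epoch-seconds timestamp format concrete, the sketch below parses one event line the way a downstream consumer might. The instanceId value is a placeholder (the real ID in the sample above is truncated), and the parse_event helper is hypothetical, not part of Quarkus Flow.

```python
import json
from datetime import datetime, timezone

# Hypothetical event line; "example-instance-1" is a placeholder ID.
line = (
    '{"instanceId": "example-instance-1", '
    '"eventType": "io.serverlessworkflow.workflow.started.v1", '
    '"timestamp": 1777298089.549604}'
)


def parse_event(raw_line: str) -> dict:
    """Parse a structured-logging line and convert its timestamp to UTC."""
    event = json.loads(raw_line)
    # epoch-seconds: seconds since 1970-01-01 UTC, with a fractional part.
    event["timestamp"] = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return event


event = parse_event(line)
print(event["eventType"], event["timestamp"].isoformat())
```

A numeric epoch value avoids time zone and locale ambiguity when the line is later parsed by the log collector and the storage backend.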

2. FluentBit DaemonSet

FluentBit runs on each Kubernetes node and:

  • Tails container log files under /var/log/containers/ for pods in the workflows namespace

  • Filters lines to extract only JSON events (not regular app logs)

  • Parses JSON and extracts fields

  • Forwards to storage backend (PostgreSQL or Elasticsearch)

Configuration:

  • fluent-bit.conf - Input, filter, output configuration

  • flatten-event.lua - Lua script to flatten nested JSON

  • Deployed via ConfigMap + DaemonSet
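The filter and flatten steps can be illustrated in Python. This is a rough stand-in for what fluent-bit.conf and flatten-event.lua do, not a copy of them: the function names, the underscore separator, and the nested workflow field are assumptions, and the container runtime's log prefix is assumed to have been stripped already.

```python
import json
from typing import Optional


def extract_event(line: str) -> Optional[dict]:
    """Return the parsed event if the line is a JSON workflow event, else None."""
    line = line.strip()
    if not line.startswith("{"):
        return None  # regular application log line
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # looked like JSON but was not
    if not isinstance(event, dict) or "eventType" not in event:
        return None  # JSON, but not a workflow event
    return event


def flatten(obj: dict, prefix: str = "", sep: str = "_") -> dict:
    """Flatten nested objects: {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical input: one plain log line and one structured event.
lines = [
    "2026-04-27 13:54:49 INFO  Starting application",
    '{"instanceId": "example-instance-1", '
    '"eventType": "io.serverlessworkflow.workflow.started.v1", '
    '"workflow": {"name": "order", "version": "1.0"}}',
]

events = [flatten(e) for e in map(extract_event, lines) if e is not None]
print(events)
```

Flattening before forwarding means the storage backend receives top-level columns or fields and does not have to unpack nested JSON during normalization.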

3. Storage Backend

PostgreSQL Mode

  • Raw tables - Store complete events in JSON format

  • Real-time normalization - Events normalized in less than 1ms

  • Normalized tables - Optimized for querying

See PostgreSQL Mode Architecture for details.

Elasticsearch Mode

  • Raw indices - Store complete events

  • Automated transforms - Extract fields and write to normalized indices (~1s)

  • Normalized indices - Optimized for querying and full-text search

4. Data Index Service

Quarkus application providing:

  • GraphQL API - Query workflow instances and task executions

  • Storage adapter - JPA for PostgreSQL, Elasticsearch Java Client for Elasticsearch

  • SmallRye GraphQL - GraphQL schema and resolvers

  • Health checks - Liveness and readiness probes

Event Lifecycle

  1. Workflow executes in Quarkus Flow app

  2. Structured logging writes JSON event to stdout

  3. Kubernetes captures stdout to /var/log/containers/<pod>_<namespace>_<container>-<container-id>.log

  4. FluentBit tails log file, extracts JSON events

  5. Storage backend receives and normalizes events:

    • PostgreSQL: Events normalized in real-time (< 1ms)

    • Elasticsearch: Events normalized asynchronously (~1s)

  6. Data Index queries normalized data via storage adapter

  7. GraphQL returns data to user

Event Processing Time

Metric                        | PostgreSQL (MODE 1)       | Elasticsearch (MODE 2)
Normalization                 | < 1ms (database triggers) | ~1s (transforms)
End-to-end latency            | 5-10 seconds              | 5-10 seconds
FluentBit collection interval | 5 seconds                 | 5 seconds
Data consistency              | Immediate (ACID)          | Eventual (~1s delay)

Note: End-to-end latency includes FluentBit collection interval (5s), log parsing, network transit, and storage processing. The primary difference is normalization speed: PostgreSQL triggers are synchronous (< 1ms), while Elasticsearch transforms run asynchronously (~1s).

Key Design Features

Real-Time Processing

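Events are normalized as soon as they reach the storage backend:

  • No separate event processor service required

  • No polling or batch jobs in Data Index itself; normalization runs inside the storage backend

  • Sub-second normalization in PostgreSQL and ~1s in Elasticsearch; end-to-end latency from event to query is still 5-10 seconds because of the FluentBit collection interval

  • Handles duplicates and out-of-order events automatically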

Flexible Data Storage

Workflow input and output data stored as JSON:

  • Flexible - Any workflow schema supported

  • Queryable - Can filter by JSON fields when needed

  • GraphQL - Exposed as JSON strings for client parsing

Two-Stage Storage

Events stored in both raw and normalized formats:

  • Raw storage - Preserve original events for debugging and audit

  • Normalized storage - Optimized for fast querying

  • Replay capability - Can reprocess raw events if needed

What Data Index Does NOT Do

Data Index is read-only. It does NOT:

  • ❌ Execute workflows

  • ❌ Modify workflow state

  • ❌ Provide workflow management (start/stop/retry)

  • ❌ Store workflow definitions

  • ❌ Require a separate event processor service

Architecture Differences

PostgreSQL (MODE 1) - Trigger-Based Normalization

Data Flow:

FluentBit → PostgreSQL raw tables (workflow_events_raw, task_events_raw)
                ↓ (BEFORE INSERT triggers, < 1ms)
            Normalized tables (workflow_instances, task_instances)
                ↓ (JPA/Hibernate)
            GraphQL API

Key characteristics:

  • Synchronous normalization - Triggers execute during INSERT (< 1ms)

  • ACID guarantees - Transactions ensure consistency

  • Upsert logic - COALESCE handles out-of-order events

  • Single writer - Vertical scaling only
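The COALESCE-based upsert can be sketched with SQLite standing in for PostgreSQL, since both accept the same ON CONFLICT ... DO UPDATE form. The column names and status values are assumptions for illustration, and the statement is issued directly from Python here rather than from a database trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE workflow_instances (
        id         TEXT PRIMARY KEY,
        status     TEXT,
        start_time REAL,
        end_time   REAL
    )
    """
)

# COALESCE keeps an already-stored value when the incoming event lacks it,
# so arrival order does not matter and duplicates are harmless.
UPSERT = """
    INSERT INTO workflow_instances (id, status, start_time, end_time)
    VALUES (:id, :status, :start_time, :end_time)
    ON CONFLICT(id) DO UPDATE SET
        start_time = COALESCE(excluded.start_time, workflow_instances.start_time),
        end_time   = COALESCE(excluded.end_time, workflow_instances.end_time),
        status     = CASE
            WHEN COALESCE(excluded.end_time, workflow_instances.end_time) IS NOT NULL
            THEN 'COMPLETED'
            ELSE 'RUNNING'
        END
"""

# Hypothetical events: "completed" arrives first, then "started" twice.
events = [
    {"id": "example-instance-1", "status": "COMPLETED", "start_time": None, "end_time": 105.0},
    {"id": "example-instance-1", "status": "RUNNING", "start_time": 100.0, "end_time": None},
    {"id": "example-instance-1", "status": "RUNNING", "start_time": 100.0, "end_time": None},
]
for event in events:
    conn.execute(UPSERT, event)

row = conn.execute(
    "SELECT id, status, start_time, end_time FROM workflow_instances"
).fetchone()
print(row)
```

The late "started" event fills in start_time without resetting the status to RUNNING, and the duplicate leaves the row unchanged.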

Elasticsearch (MODE 2) - Transform-Based Normalization

Data Flow:

FluentBit → Elasticsearch raw indices (workflow-instance-events-raw-*, task-execution-events-raw-*)
                ↓ (Continuous transforms, ~1s)
            Normalized indices (workflow-instances, task-executions)
                ↓ (Elasticsearch Java Client)
            GraphQL API

Key characteristics:

  • Asynchronous normalization - Transforms run independently (~1s delay)

  • Eventual consistency - Raw events appear before normalized documents

  • Aggregation logic - scripted_metric handles out-of-order events

  • Horizontal scaling - Distributed across Elasticsearch cluster

  • ILM retention - Automatic deletion of raw events after 7 days
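A scripted_metric aggregation runs in four phases (init, map, combine, reduce), and the out-of-order handling can be emulated in plain Python. The sketch below is an approximation under stated assumptions: real transform scripts are written in Painless and are not shown here, the field names are illustrative, and the completed event type is assumed by analogy with the started event type.

```python
from collections import defaultdict

STARTED = "io.serverlessworkflow.workflow.started.v1"
COMPLETED = "io.serverlessworkflow.workflow.completed.v1"  # assumed event type


def init_state() -> dict:
    """init phase: empty per-shard state for one workflow instance."""
    return {"start": None, "end": None}


def map_event(state: dict, event: dict) -> None:
    """map phase: fold one raw event into the state, in any order."""
    ts = event["timestamp"]
    if event["eventType"] == STARTED:
        state["start"] = ts if state["start"] is None else min(state["start"], ts)
    elif event["eventType"] == COMPLETED:
        state["end"] = ts if state["end"] is None else max(state["end"], ts)


def combine(state: dict) -> dict:
    """combine phase: return the shard-level partial result."""
    return state


def reduce_states(states: list) -> dict:
    """reduce phase: merge the partial results from all shards."""
    starts = [s["start"] for s in states if s["start"] is not None]
    ends = [s["end"] for s in states if s["end"] is not None]
    end = max(ends) if ends else None
    return {
        "start": min(starts) if starts else None,
        "end": end,
        "status": "COMPLETED" if end is not None else "RUNNING",
    }


def normalize(shards: list) -> dict:
    """Group raw events by instanceId and build one normalized document each."""
    partials = defaultdict(list)
    for shard in shards:
        states = defaultdict(init_state)
        for event in shard:
            map_event(states[event["instanceId"]], event)
        for instance_id, state in states.items():
            partials[instance_id].append(combine(state))
    return {iid: reduce_states(states) for iid, states in partials.items()}


# Hypothetical raw events: "completed" lands on one shard, "started" on another.
shards = [
    [{"instanceId": "example-instance-1", "eventType": COMPLETED, "timestamp": 105.0}],
    [{"instanceId": "example-instance-1", "eventType": STARTED, "timestamp": 100.0}],
]
documents = normalize(shards)
print(documents)
```

Because each phase only takes minimums and maximums over timestamps, the result is the same regardless of which shard holds which event or the order in which events were indexed.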

Next Steps