PostgreSQL Mode Architecture

Data Index with PostgreSQL storage uses trigger-based normalization for real-time event processing.

Architecture Diagram

Quarkus Flow App
    ↓ (structured logging → stdout)
FluentBit DaemonSet
    ↓ (tail /var/log/containers/)
PostgreSQL Raw Tables (JSONB)
    ↓ (BEFORE INSERT triggers)
PostgreSQL Normalized Tables
    ↓ (JPA/Hibernate)
GraphQL API

Data Flow

  1. Quarkus Flow emits events - JSON to stdout

  2. Kubernetes captures logs - /var/log/containers/POD_NAME.log

  3. FluentBit collects - Tails log files, filters JSON events

  4. INSERT to raw tables - workflow_events_raw, task_events_raw (JSONB columns)

  5. Triggers fire immediately - Extract fields from JSONB

  6. UPSERT to normalized tables - workflow_instances, task_instances

  7. GraphQL queries - Via JPA entities
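Steps 4-6 above can be sketched as a PostgreSQL trigger on the raw table. The table names (`workflow_events_raw`, `workflow_instances`) come from this document; the function name and the column names (`payload`, `instance_id`, `state`, `last_update`) are illustrative assumptions, not the actual schema:

```sql
-- Sketch only: normalize each raw event as it arrives.
-- Column/function names are assumptions; table names are from the doc.
CREATE OR REPLACE FUNCTION extract_workflow_event() RETURNS trigger AS $$
BEGIN
    INSERT INTO workflow_instances (instance_id, state, last_update)
    VALUES (NEW.payload->>'id',
            NEW.payload->>'state',
            to_timestamp((NEW.payload->>'timestamp')::bigint))
    ON CONFLICT (instance_id) DO UPDATE
        SET state       = EXCLUDED.state,
            last_update = EXCLUDED.last_update;
    RETURN NEW;  -- the raw row is still stored for debugging/replay
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER workflow_events_raw_normalize
    BEFORE INSERT ON workflow_events_raw
    FOR EACH ROW EXECUTE FUNCTION extract_workflow_event();
```

Because the trigger runs inside the same INSERT transaction, the raw write and the normalized upsert commit (or roll back) together, which is where the ACID guarantee comes from.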

Key Characteristics

Characteristic   Details
Latency          < 1ms for normalization, 5-10s end-to-end
Consistency      ACID transactions, guaranteed consistency
Throughput       < 50K workflows/day (PostgreSQL write limit)
Complexity       Simple - no separate event processor service
Search           Limited - basic filtering, no full-text search
Status           ✅ Production Ready

Real-Time Normalization

Events are normalized in real-time as they arrive:

  1. FluentBit writes raw events - Complete events stored as JSON for debugging

  2. Events normalized automatically - Fields extracted and stored in optimized tables

  3. Immediate availability - Data ready for querying in less than 1ms

Benefits:

  • Real-time processing (<1ms latency)

  • No separate event processor service

  • ACID transaction guarantees

  • Handles duplicates and out-of-order events automatically

Trade-offs:

  • PostgreSQL-specific implementation

  • Limited throughput vs. Elasticsearch mode (< 50K workflows/day)

  • Schema changes require database updates

Raw Event Storage

Raw events are preserved in their original JSON format:

  • workflow_events_raw - All workflow-related events

  • task_events_raw - All task-related events
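A raw table in this design needs little more than a JSONB column; the exact columns below are an illustrative sketch, not the real schema:

```sql
-- Sketch of a raw event table; actual column names may differ.
CREATE TABLE workflow_events_raw (
    id         BIGSERIAL    PRIMARY KEY,
    payload    JSONB        NOT NULL,        -- complete event as emitted
    created_at TIMESTAMPTZ  NOT NULL DEFAULT now()
);
-- task_events_raw follows the same shape.
```

Keeping the payload as untyped JSONB is what lets the table accept any event structure without schema changes.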

Benefits:

  • Debugging - Original events preserved for troubleshooting

  • Replay - Can reprocess if normalization logic changes

  • Audit - Complete event history maintained

  • Flexibility - Accepts any event structure without schema changes

Normalized Tables

Events are automatically normalized to optimized tables for querying:

  • workflow_instances - One row per workflow execution

  • task_instances - One row per task execution

How it works:

  • Fields extracted automatically from raw events

  • Duplicates handled transparently

  • Out-of-order events resolved correctly

  • Immutable fields (like start time) preserved from first event

  • Mutable fields (like status, output) updated from latest event
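The immutable/mutable split and the out-of-order handling described above map naturally onto a single UPSERT. The column names (`start_time`, `state`, `output`, `last_event_ts`) are assumptions used for illustration:

```sql
-- Sketch only: column names are assumptions.
-- start_time is set once (first event wins); state/output track the
-- latest event; the WHERE guard discards duplicates and stale events.
INSERT INTO workflow_instances (instance_id, start_time, state, output, last_event_ts)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (instance_id) DO UPDATE
    SET state         = EXCLUDED.state,
        output        = EXCLUDED.output,
        last_event_ts = EXCLUDED.last_event_ts
    WHERE workflow_instances.last_event_ts <= EXCLUDED.last_event_ts;
```

Note that `start_time` is deliberately absent from the `DO UPDATE` clause, so the value from the first event is preserved even if a later event carries a different one.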

Configuration

Quarkus Flow Application

# Structured logging
quarkus.flow.structured-logging.enabled=true
quarkus.flow.structured-logging.timestamp-format=epoch-seconds

FluentBit

[INPUT]
    Name              tail
    Path              /var/log/containers/*_workflows_*.log
    Parser            docker

[FILTER]
    Name              grep
    Match             *
    Regex             log \{.*"eventType".*\}

[OUTPUT]
    Name              pgsql
    Match             *
    Host              postgresql
    Database          dataindex
    Table             workflow_events_raw

Data Index Service

quarkus.datasource.db-kind=postgresql
quarkus.datasource.jdbc.url=jdbc:postgresql://postgresql:5432/dataindex
quarkus.hibernate-orm.database.generation=none

Schema initialization is performed manually in production using SQL migration scripts from the data-index-storage-migrations module. Development mode can optionally use Flyway for automatic migrations.
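If Flyway is used in dev mode, the opt-in might look like the following profile-scoped properties. This is a sketch assuming the migration scripts are on the classpath under the default `db/migration` location; adjust to wherever the data-index-storage-migrations scripts actually land:

```properties
# Dev mode only (illustrative): apply migrations automatically at startup.
%dev.quarkus.flyway.migrate-at-start=true
%dev.quarkus.flyway.locations=classpath:db/migration
```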

Scaling Considerations

PostgreSQL mode scales well for moderate workloads:

Vertical scaling:

  • Increase PostgreSQL instance size

  • More CPU/memory for trigger processing

  • SSD storage for write performance

Horizontal scaling:

  • Read replicas for GraphQL queries

  • Connection pooling in Data Index service

  • Multiple Data Index instances (stateless)

Limitations:

  • Single PostgreSQL writer (triggers can't be distributed)

  • ~50K workflows/day practical limit

  • For higher throughput, consider Elasticsearch mode