# Event Reliability and Loss Prevention
Data Index uses stdout-based log collection with FluentBit. This guide describes potential event loss scenarios and mitigation strategies.
## Event Flow and Guarantees

```text
Quarkus Flow App
    ↓ (stdout write - OS buffer)
Kernel Log Buffer
    ↓ (flush to disk)
/var/log/containers/<pod>_<namespace>_<container>.log
    ↓ (FluentBit tail with position tracking)
FluentBit Memory Buffer
    ↓ (storage output with retry)
Storage Backend (PostgreSQL or Elasticsearch)
    ↓ (normalization: triggers or transforms)
Normalized Data
```
Critical points of failure:

- App crash before stdout flush
- Node termination before log written to disk
- Log rotation before FluentBit reads
- FluentBit buffer overflow
- FluentBit crash before committing position
- Storage backend unavailability
- Parse/filter errors
## Event Loss Scenarios

### 1. Application Crashes Before Stdout Flush
Scenario:

```text
App: workflow.started event → stdout buffer
App: CRASH (before buffer flush)

Result: Event never written to /var/log/containers/
```

**Risk:** Low

**Reason:** The OS flushes stdout on newline for line-buffered output.
Mitigation:

- Quarkus Flow events always end with `\n` (newline)
- The OS typically flushes line-buffered output immediately
- If the app crashes mid-workflow, the workflow will be re-executed (idempotent)
Monitoring:

```bash
kubectl get pods -n workflows --field-selector=status.phase=Failed
```
### 2. Node Termination Before Disk Write
Scenario:

```text
App: Event → stdout → OS buffer → kernel
Node: SIGTERM (drain starts)
Node: SIGKILL after 30s (grace period)

Result: In-flight events lost if not flushed to /var/log/
```

**Risk:** Medium
Mitigation: set the pod's `terminationGracePeriodSeconds`:

```yaml
spec:
  terminationGracePeriodSeconds: 60   # Allow more time
  containers:
    - name: workflow-app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # Let stdout flush
```
Monitor pod evictions:

```bash
kubectl get events --field-selector reason=Evicted
```
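To confirm the grace period actually in effect for a running pod (the pod name is a placeholder):

```bash
# Print the effective termination grace period for a pod
kubectl get pod -n workflows <pod-name> \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```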
### 3. Log Rotation Before FluentBit Reads
Scenario:

```text
Kubernetes: Rotates /var/log/containers/pod.log (size > 10MB)
  - pod.log → pod.log.1
  - New pod.log created
FluentBit: Still reading pod.log.1 (tracked in DB)
Kubernetes: Deletes pod.log.6 (max 5 rotated files)

Result: Events in pod.log.6 lost if FluentBit didn't read them
```

**Risk:** High if FluentBit falls behind
Kubernetes defaults:

- Max log file size: 10MB
- Max backup files: 5
- Total retention: ~50MB per container
Mitigation:

Increase Kubernetes log retention (node-level, requires cluster admin):

```yaml
# /var/lib/kubelet/config.yaml
containerLogMaxSize: 100Mi   # Default: 10Mi
containerLogMaxFiles: 10     # Default: 5
```
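To verify the settings took effect, the kubelet's running configuration can be read through the node proxy subresource, assuming your RBAC allows it (node name is a placeholder):

```bash
# Dump the effective kubelet config for a node and pick out the log settings
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | \
  grep -o '"containerLogMax[^,}]*'
```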
Increase FluentBit processing speed:

```ini
[INPUT]
    Refresh_Interval 1     # Re-scan for new log files every 1s (default: 60s)
    Mem_Buf_Limit    20MB  # Larger memory buffer

[OUTPUT]
    Workers 2              # Parallel writes
```
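To check whether FluentBit is falling behind, compare its recorded read offsets against the actual files. A sketch, assuming the position DB lives at `/tail-db/fluent-bit-kube.db` and `sqlite3` is available in the image (it is not in all FluentBit builds):

```bash
POD=$(kubectl get pods -n logging -l app=workflows-fluent-bit \
  -o jsonpath='{.items[0].metadata.name}')

# Offsets FluentBit has committed per file (in_tail_files is the tail
# plugin's SQLite position table)
kubectl exec -n logging "$POD" -- \
  sqlite3 /tail-db/fluent-bit-kube.db 'SELECT name, "offset" FROM in_tail_files;'

# Actual file sizes; a persistent gap vs. the offsets above means lag
kubectl exec -n logging "$POD" -- ls -l /var/log/containers/
```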
### 4. FluentBit Buffer Overflow
Scenario:

```text
FluentBit: Reading logs faster than storage can accept
FluentBit: Memory buffer fills up (Mem_Buf_Limit: 5MB)
FluentBit: Drops oldest records to make room

Result: Events lost
```

**Risk:** High under load
Mitigation: increase the memory buffer and enable filesystem buffering so chunks spill to disk instead of being dropped:

```ini
[INPUT]
    Mem_Buf_Limit 20MB         # Default: 5MB
    storage.type  filesystem   # Required for the [SERVICE] storage.path below to apply

[SERVICE]
    storage.metrics       on
    storage.path          /tail-db/storage
    storage.max_chunks_up 256
```
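With `storage.metrics on`, chunk usage is exposed through FluentBit's built-in HTTP server (this also requires `HTTP_Server On` in `[SERVICE]`); a quick way to watch it:

```bash
POD=$(kubectl get pods -n logging -l app=workflows-fluent-bit \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n logging "$POD" 2020:2020 &

# mem_chunks / fs_chunks growing over time means the output cannot keep up
curl -s http://127.0.0.1:2020/api/v1/storage
```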
### 5. FluentBit Crash Before Position Commit
Scenario:

```text
FluentBit: Reads events from pod.log
FluentBit: Sends to storage successfully
FluentBit: CRASH before updating position in /tail-db/fluent-bit-kube.db
FluentBit: Restarts, re-reads from old position

Result: Duplicate events (NOT loss, but duplication)
```

**Risk:** Low (duplicates are handled by normalization)
Mitigation:

- PostgreSQL triggers use UPSERT: `ON CONFLICT (id) DO UPDATE` (see the sketch below)
- Elasticsearch transforms use document ID deduplication
- Duplicate events update existing records (idempotent)
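To illustrate the UPSERT pattern, here is a minimal sketch of a normalization trigger; the function name, column mappings, and JSON keys are assumptions, not the actual trigger shipped with Data Index:

```sql
-- Sketch only: shows how ON CONFLICT makes re-delivered events idempotent
CREATE OR REPLACE FUNCTION upsert_workflow_instance() RETURNS trigger AS $$
BEGIN
    INSERT INTO workflow_instances (id, name, status, start)
    VALUES (NEW.data->>'id', NEW.data->>'name', NEW.data->>'status', NOW())
    ON CONFLICT (id) DO UPDATE
        SET status = EXCLUDED.status;   -- a duplicate re-applies the same state
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER normalize_workflow_event
    AFTER INSERT ON workflow_events_raw
    FOR EACH ROW EXECUTE FUNCTION upsert_workflow_instance();
```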
Monitor for crash loops (`CrashLoopBackOff` is a container state, not a pod phase, so filter the output rather than `status.phase`):

```bash
kubectl get pods -n logging -l app=workflows-fluent-bit | grep CrashLoopBackOff
```
### 6. Storage Backend Unavailability
Scenario:

```text
FluentBit: Tries to write event to storage
Storage: Connection refused / timeout
FluentBit: Retries up to Retry_Limit (5)
FluentBit: Gives up after 5 retries

Result: Event lost
```

**Risk:** High if storage is down for an extended period
Current configuration:

```ini
[OUTPUT]
    Async       Off   # Blocking mode (wait for storage)
    Retry_Limit 5     # Retry 5 times before giving up
```
Mitigation: increase the retry limit:

```ini
[OUTPUT]
    Retry_Limit False   # Infinite retries (never discard)
```

> **Warning:** This blocks FluentBit input if storage is down long-term, causing buffer overflow.
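Pairing infinite retries with filesystem buffering bounds that risk: the backlog spills to disk and the input pauses rather than dropping records. A sketch, assuming a recent FluentBit release (`storage.pause_on_chunks_overlimit` does not exist in older versions):

```ini
[SERVICE]
    storage.path          /tail-db/storage
    storage.max_chunks_up 128

[INPUT]
    storage.type                      filesystem
    storage.pause_on_chunks_overlimit on    # pause the tail instead of dropping

[OUTPUT]
    Retry_Limit              False
    storage.total_limit_size 2G             # cap the on-disk backlog for this output
```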
### 7. JSON Parse Failures
Scenario:

```text
App: Outputs truncated/malformed JSON
FluentBit: Fails to parse as JSON
FluentBit: Skips line

Result: Event lost
```

**Risk:** Low (Quarkus Flow uses a structured logging library)
Mitigation:
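One option is to do the JSON parsing in a filter rather than in the tail input, so that lines that fail to parse pass through as raw records (which can be inspected later) instead of being skipped. A minimal sketch; the tag pattern and key name are assumptions about the pipeline:

```ini
[FILTER]
    Name         parser
    Match        kube.*
    Key_Name     log
    Parser       json
    Reserve_Data On   # keep original fields; on parse failure the record passes through unchanged
```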
## Reliability Guarantees
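The table below summarizes the scenarios above:

| # | Scenario | Risk | Outcome |
|---|----------|------|---------|
| 1 | App crash before stdout flush | Low | Loss |
| 2 | Node termination before disk write | Medium | Loss |
| 3 | Log rotation before FluentBit reads | High if FluentBit falls behind | Loss |
| 4 | FluentBit buffer overflow | High under load | Loss |
| 5 | FluentBit crash before position commit | Low | Duplication (handled by normalization) |
| 6 | Storage backend unavailability | High for extended outages | Loss after retries are exhausted |
| 7 | JSON parse failure | Low | Loss |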
## Production Recommendations

### Minimal Configuration (Acceptable Loss Risk)
Default configuration:

- `Mem_Buf_Limit`: 5MB
- `Retry_Limit`: 5
- `Async`: Off
- Position tracking enabled

Expected loss: < 0.1% under normal conditions
### Recommended Configuration (Low Loss Risk)

```ini
[INPUT]
    Mem_Buf_Limit    20MB
    storage.type     filesystem
    Refresh_Interval 1

[OUTPUT]
    Retry_Limit False   # Infinite retries
    Workers     2

[SERVICE]
    storage.path          /tail-db/storage
    storage.max_chunks_up 512
```
DaemonSet resources:

```yaml
# Container spec:
volumeMounts:
  - name: storage-buffer
    mountPath: /tail-db/storage
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "1Gi"

# Pod spec:
volumes:
  - name: storage-buffer
    emptyDir:
      sizeLimit: 2Gi
```

Expected loss: < 0.01% under normal conditions
### High Reliability (Near-Zero Loss)

If event loss is unacceptable, consider:
**Option 1: Dual Write (App-Level)**

```text
Quarkus Flow → stdout (for observability)
      ↓
      → Storage (direct insert via JDBC/ES client)
```

**Pros:** No intermediary, guaranteed delivery
**Cons:** App coupled to data-index, requires a connection pool
**Option 2: File-Based with Persistent Volumes**

```text
Quarkus Flow → /data/events.log (PersistentVolume)
                     ↓
               FluentBit → tail → Storage
```

**Pros:** Events survive pod/node restarts
**Cons:** Requires PV provisioning, slower I/O
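For this option, the FluentBit input tails the file on the PersistentVolume instead of `/var/log/containers/`; a sketch with assumed paths:

```ini
[INPUT]
    Name tail
    Path /data/events.log
    DB   /tail-db/events.db   # position tracking survives FluentBit restarts
```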
## Monitoring and Alerting

### Key Metrics

1. FluentBit health:

```bash
kubectl get pods -n logging -l app=workflows-fluent-bit
```
2. Buffer usage (bytes read vs. bytes delivered):

```text
fluentbit_input_bytes_total - fluentbit_output_proc_bytes_total
```
3. Retry rate:

```text
rate(fluentbit_output_retries_total[5m]) > 0
```
4. Event count (PostgreSQL example):

```sql
SELECT
    (SELECT COUNT(*) FROM workflow_events_raw) AS raw_events,
    (SELECT COUNT(*) FROM workflow_instances)  AS workflows,
    (SELECT COUNT(*) FROM task_events_raw)     AS task_events,
    (SELECT COUNT(*) FROM task_instances)      AS tasks;
```
5. Log rotation rate:

```bash
ls -lh /var/log/containers/*_workflows_*.log*
```
### Event Loss Detection

Check for incomplete workflows:

```sql
-- Workflows that started but never completed/faulted
SELECT id, name, status, start, "end"
FROM workflow_instances
WHERE status = 'RUNNING'
  AND start < NOW() - INTERVAL '1 hour';
```
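Raw events can also be cross-checked against normalized rows to spot normalization gaps. A sketch, assuming the raw `data` column is `jsonb` and carries the instance id under an `id` key:

```sql
-- Raw workflow events that never produced a normalized instance
SELECT COUNT(*) AS unnormalized
FROM workflow_events_raw r
WHERE NOT EXISTS (
    SELECT 1
    FROM workflow_instances wi
    WHERE wi.id = r.data->>'id'
);
```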
Verify task counts:

```sql
-- Compare task count to the expected count for the workflow
SELECT
    wi.id,
    wi.name,
    COUNT(ti.task_execution_id) AS task_count
FROM workflow_instances wi
LEFT JOIN task_instances ti ON wi.id = ti.instance_id
WHERE wi.name = 'simple-set'
GROUP BY wi.id, wi.name
HAVING COUNT(ti.task_execution_id) != 2;   -- Expected: 2 tasks
```
## Disaster Recovery

If events are lost:

1. Check FluentBit logs for errors:

```bash
kubectl logs -n logging -l app=workflows-fluent-bit --tail=1000 > fluentbit.log
grep -i "error\|fail\|drop\|overflow" fluentbit.log
```
2. Check whether the events still exist in the container logs (the glob must be expanded by a shell inside the container, so wrap it in `sh -c`):

```bash
kubectl exec -n logging <fluentbit-pod> -- \
  sh -c 'grep "eventType" /var/log/containers/*_workflows_*.log' | tail -100
```
3. Manually replay from container logs (if still available):

```bash
# Extract missed events
kubectl exec -n logging <fluentbit-pod> -- \
  grep "eventType.*workflow.started" /var/log/containers/pod.log.2 > missed_events.json

# Insert manually (PostgreSQL example); note kubectl exec needs -i to
# forward the piped SQL, and the event JSON must not contain single quotes
while IFS= read -r event; do
  echo "INSERT INTO workflow_events_raw (tag, time, data) VALUES ('workflow.started', NOW(), '${event}');" | \
    kubectl exec -i -n postgresql postgresql-0 -- psql -U dataindex -d dataindex
done < missed_events.json
```
## Summary
Event loss is possible in stdout-based log collection, but can be minimized to < 0.01% with proper configuration and monitoring.
For most use cases, the recommended FluentBit configuration provides sufficient reliability with good operational simplicity.
For critical workflows where event loss is unacceptable, consider dual-write or file-based logging with persistent volumes.