Skip to content

Database Analysis for Atlas Verify

This document analyzes open source databases suitable for storing, querying, and updating bibliographic verification data on Kubernetes.

See also:

Requirements

Data Characteristics

AspectRequirement
Volume~10M raw records, ~1M profiles, ~100M potential candidates
VelocityInitial batch import, then ~10k updates/day
VarietyJSONB (heterogeneous raw data) + relational (verified links)
VeracityComplete audit trail, decision immutability

Access Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                          MAIN ACCESS PATTERNS                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  READS (80%)                                                                 │
│  ───────────                                                                 │
│  1. Pending candidates by profile (paginated, filtered, sorted)             │
│  2. Verified publications by profile (paginated, full-text search)          │
│  3. Candidate detail with joined raw sources                                │
│  4. Aggregated statistics by profile                                        │
│  5. Full-text search on titles and authors                                  │
│                                                                              │
│  WRITES (20%)                                                                │
│  ────────────                                                                │
│  1. Batch import of raw records (bulk INSERT)                               │
│  2. Candidate creation (bulk INSERT after matching)                         │
│  3. Verification decisions (INSERT + UPDATE status)                         │
│  4. User profile update (occasional UPDATE)                                 │
│                                                                              │
│  SPECIAL PATTERNS                                                            │
│  ────────────────                                                            │
│  1. Deduplication by checksum (UPSERT)                                      │
│  2. JSONB queries (field extraction, GIN indexes)                           │
│  3. Time-series for audit trail (append-only)                               │
│  4. Multilingual full-text search                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Kubernetes Requirements

AspectRequirement
High availabilitySynchronous replication, automatic failover
ScalabilityHorizontal scale for reads
BackupPoint-in-time recovery, S3 backups
ObservabilityPrometheus metrics, structured logs
OperationsMature Helm charts, Kubernetes operators

Solution Analysis

1. PostgreSQL

Overview

PostgreSQL is the most complete open source relational database, with excellent JSONB support and powerful extensions.

Strengths for Atlas Verify

AspectEvaluation
JSONBExcellent - GIN indexes, operators, functions
RelationalFull ACID, FK, triggers
Full-textGood - tsvector, multilingual
ScalabilityMedium - read replication, no native sharding
KubernetesMature operators (CloudNativePG, Zalando, CrunchyData)

Kubernetes Operators

┌─────────────────────────────────────────────────────────────────────────────┐
│                      POSTGRESQL KUBERNETES OPERATORS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CLOUDNATIVEPG                                                               │
│  ─────────────                                                               │
│  Maintained by: EDB (EnterpriseDB)                                          │
│  Maturity     : Production-ready, CNCF Sandbox                              │
│  Features     : HA, S3 backups, PITR, rolling updates, connection pooling   │
│  Helm         : cloudnative-pg/cloudnative-pg                               │
│  Docs         : https://cloudnative-pg.io                                   │
│                                                                              │
│  ZALANDO POSTGRES OPERATOR                                                   │
│  ────────────────────────                                                    │
│  Maintained by: Zalando                                                     │
│  Maturity     : Production-ready, used internally                           │
│  Features     : HA (Patroni), WAL-G backups, connection pooling (PgBouncer) │
│  Helm         : postgres-operator-charts/postgres-operator                  │
│  Docs         : https://github.com/zalando/postgres-operator                │
│                                                                              │
│  CRUNCHYDATA PGO                                                             │
│  ───────────────                                                             │
│  Maintained by: Crunchy Data                                                │
│  Maturity     : Enterprise-grade                                            │
│  Features     : HA, pgBackRest, built-in monitoring, pgBouncer              │
│  Helm         : Proprietary (kubectl apply)                                 │
│  Docs         : https://access.crunchydata.com/documentation/postgres-operator│
│                                                                              │
│  RECOMMENDATION: CloudNativePG                                               │
│  - Cloud-native design (not a port)                                         │
│  - Best Kubernetes integration (clean CRDs)                                 │
│  - Active community, CNCF backing                                           │
│  - Excellent documentation                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Optimized Schema

sql
-- Required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";     -- Text similarity
CREATE EXTENSION IF NOT EXISTS "btree_gin";   -- Composite indexes

-- ═══════════════════════════════════════════════════════════════════════════
-- RAW RECORDS (immutable data)
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE raw_records (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  source TEXT NOT NULL,
  source_id TEXT NOT NULL,
  entity_type TEXT NOT NULL,
  data JSONB NOT NULL,
  fetched_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  checksum TEXT NOT NULL,

  -- Uniqueness constraint for deduplication
  CONSTRAINT raw_records_source_unique UNIQUE (source, source_id, checksum)
);

-- GIN index for JSONB queries
CREATE INDEX idx_raw_records_data ON raw_records USING GIN (data);

-- Index for search by source
CREATE INDEX idx_raw_records_source ON raw_records (source, fetched_at DESC);

-- Date-based partitioning (optional for large volumes)
-- CREATE TABLE raw_records (...) PARTITION BY RANGE (fetched_at);

-- ═══════════════════════════════════════════════════════════════════════════
-- AUTHOR PROFILES
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE author_profiles (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  primary_orcid TEXT UNIQUE,
  display_name TEXT NOT NULL,
  email TEXT,
  institution_id UUID,
  avatar_url TEXT,
  bio TEXT,
  settings JSONB NOT NULL DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_login_at TIMESTAMPTZ
);

CREATE INDEX idx_profiles_email ON author_profiles (email);
CREATE INDEX idx_profiles_name_trgm ON author_profiles
  USING GIN (display_name gin_trgm_ops);

-- ═══════════════════════════════════════════════════════════════════════════
-- PROFILE IDENTITIES
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE profile_identities (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  profile_id UUID NOT NULL REFERENCES author_profiles(id) ON DELETE CASCADE,
  identifier_type TEXT NOT NULL,
  identifier_value TEXT NOT NULL,
  is_primary BOOLEAN NOT NULL DEFAULT FALSE,
  verified_at TIMESTAMPTZ,
  verification_method TEXT,

  CONSTRAINT identity_unique UNIQUE (identifier_type, identifier_value)
);

CREATE INDEX idx_identities_profile ON profile_identities (profile_id);
CREATE INDEX idx_identities_lookup ON profile_identities (identifier_type, identifier_value);

-- ═══════════════════════════════════════════════════════════════════════════
-- CANDIDATE MATCHES
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE candidate_matches (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  raw_record_id UUID NOT NULL REFERENCES raw_records(id),
  profile_id UUID NOT NULL REFERENCES author_profiles(id) ON DELETE CASCADE,
  entity_type TEXT NOT NULL,
  match_score NUMERIC(4,3) NOT NULL CHECK (match_score BETWEEN 0 AND 1),
  match_reasons JSONB NOT NULL DEFAULT '[]',
  status TEXT NOT NULL DEFAULT 'pending',
  metadata JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  processed_at TIMESTAMPTZ,

  CONSTRAINT candidate_unique UNIQUE (raw_record_id, profile_id)
);

-- Main index for verification queue
CREATE INDEX idx_candidates_queue ON candidate_matches (profile_id, status, match_score DESC)
  WHERE status IN ('pending', 'uncertain');

-- Index for statistics
CREATE INDEX idx_candidates_stats ON candidate_matches (profile_id, status);

-- ═══════════════════════════════════════════════════════════════════════════
-- VERIFICATION ACTS (immutable audit trail)
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE verification_acts (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  candidate_match_id UUID NOT NULL REFERENCES candidate_matches(id),
  profile_id UUID NOT NULL REFERENCES author_profiles(id),
  decision TEXT NOT NULL,
  confidence TEXT,
  notes TEXT,
  evidence JSONB NOT NULL DEFAULT '[]',
  decided_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  decided_by TEXT NOT NULL,
  decided_via TEXT NOT NULL,
  client_info JSONB
);

-- Index for history by profile
CREATE INDEX idx_verifications_profile ON verification_acts (profile_id, decided_at DESC);

-- Index for audit by candidate
CREATE INDEX idx_verifications_candidate ON verification_acts (candidate_match_id, decided_at DESC);

-- ═══════════════════════════════════════════════════════════════════════════
-- CANONICAL WORKS
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE canonical_works (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  primary_doi TEXT UNIQUE,
  title TEXT NOT NULL,
  publication_date DATE,
  type TEXT,
  venue_id UUID,
  merged_from JSONB NOT NULL DEFAULT '[]',
  merge_status TEXT NOT NULL DEFAULT 'auto',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Full-text search on titles
CREATE INDEX idx_works_title_fts ON canonical_works
  USING GIN (to_tsvector('english', title));

CREATE INDEX idx_works_date ON canonical_works (publication_date DESC);

-- ═══════════════════════════════════════════════════════════════════════════
-- WORK AUTHORSHIPS
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE work_authorships (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  work_id UUID NOT NULL REFERENCES canonical_works(id) ON DELETE CASCADE,
  profile_id UUID NOT NULL REFERENCES author_profiles(id) ON DELETE CASCADE,
  position INTEGER,
  is_corresponding BOOLEAN DEFAULT FALSE,
  affiliation_at_time TEXT,
  verification_act_id UUID REFERENCES verification_acts(id),
  raw_contributions JSONB NOT NULL DEFAULT '[]',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  CONSTRAINT authorship_unique UNIQUE (work_id, profile_id)
);

CREATE INDEX idx_authorships_profile ON work_authorships (profile_id);
CREATE INDEX idx_authorships_work ON work_authorships (work_id);

-- ═══════════════════════════════════════════════════════════════════════════
-- NOTIFICATIONS
-- ═══════════════════════════════════════════════════════════════════════════

CREATE TABLE notifications (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  profile_id UUID NOT NULL REFERENCES author_profiles(id) ON DELETE CASCADE,
  type TEXT NOT NULL,
  title TEXT NOT NULL,
  body TEXT,
  resource_type TEXT,
  resource_id UUID,
  read BOOLEAN NOT NULL DEFAULT FALSE,
  read_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_notifications_unread ON notifications (profile_id, created_at DESC)
  WHERE read = FALSE;

-- ═══════════════════════════════════════════════════════════════════════════
-- MATERIALIZED VIEWS FOR STATISTICS
-- ═══════════════════════════════════════════════════════════════════════════

CREATE MATERIALIZED VIEW profile_stats AS
SELECT
  p.id AS profile_id,
  COUNT(DISTINCT wa.work_id) AS works_count,
  COUNT(DISTINCT cm.id) FILTER (WHERE cm.status = 'pending') AS pending_count,
  COUNT(DISTINCT cm.id) FILTER (WHERE cm.status IN ('confirmed', 'auto_confirmed')) AS confirmed_count,
  COALESCE(
    COUNT(DISTINCT cm.id) FILTER (WHERE cm.status IN ('confirmed', 'auto_confirmed', 'rejected'))::NUMERIC /
    NULLIF(COUNT(DISTINCT cm.id), 0),
    0
  ) AS verification_rate
FROM author_profiles p
LEFT JOIN candidate_matches cm ON cm.profile_id = p.id
LEFT JOIN work_authorships wa ON wa.profile_id = p.id
GROUP BY p.id;

CREATE UNIQUE INDEX idx_profile_stats ON profile_stats (profile_id);

-- Automatic refresh via pg_cron or application job
-- SELECT cron.schedule('refresh-stats', '*/15 * * * *', 'REFRESH MATERIALIZED VIEW CONCURRENTLY profile_stats');

CloudNativePG Manifest

yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: atlas-verify-db
  namespace: atlas-verify
spec:
  instances: 3

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
      effective_cache_size: "1GB"
      work_mem: "16MB"
      maintenance_work_mem: "128MB"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      # JSONB optimizations
      default_statistics_target: "100"

  storage:
    size: 50Gi
    storageClass: local-path  # or ceph-block, longhorn, etc.

  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "2Gi"
      cpu: "2"

  backup:
    barmanObjectStore:
      destinationPath: "s3://atlas-backups/postgres"
      endpointURL: "http://minio.minio:9000"
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
      data:
        compression: gzip
    retentionPolicy: "30d"

  monitoring:
    enablePodMonitor: true

2. CockroachDB

Overview

CockroachDB is a distributed PostgreSQL-compatible database with automatic sharding and multi-region replication.

Strengths

AspectEvaluation
JSONBGood - PostgreSQL compatible
RelationalGood - standard SQL, some limitations
Full-textMedium - basic, no tsvector
ScalabilityExcellent - auto sharding, horizontal scale
KubernetesGood - official operator

When to Choose CockroachDB

┌─────────────────────────────────────────────────────────────────────────────┐
│                        COCKROACHDB - USE CASES                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  RECOMMENDED IF:                                                             │
│  ───────────────                                                             │
│  ✓ Volume > 100M records with rapid growth                                  │
│  ✓ Multi-region (Europe, US, Asia)                                          │
│  ✓ Horizontal scale imperative                                              │
│  ✓ Team familiar with PostgreSQL                                            │
│                                                                              │
│  AVOID IF:                                                                   │
│  ──────────                                                                  │
│  ✗ Advanced full-text search required                                       │
│  ✗ Specific PostgreSQL extensions (pg_trgm, etc.)                           │
│  ✗ Cost constraints (more complex to operate)                               │
│  ✗ Volume < 10M records (over-engineering)                                  │
│                                                                              │
│  FOR ATLAS VERIFY:                                                           │
│  ─────────────────                                                           │
│  Estimated volume: 10M records → PostgreSQL sufficient                      │
│  Full-text: important → PostgreSQL preferred                                │
│  Regions: single-region likely → PostgreSQL sufficient                      │
│                                                                              │
│  VERDICT: Not recommended for v1, reconsider if massive scale needed        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3. MongoDB

Overview

MongoDB is the most popular document database, with a flexible model and excellent horizontal scalability.

Strengths

AspectEvaluation
DocumentsExcellent - native model, schemaless
RelationalWeak - no FK, limited joins
Full-textGood - Atlas Search / lucene-based
ScalabilityExcellent - native sharding
KubernetesGood - Community Operator, MongoDB Ops Manager

Document Model for Atlas Verify

javascript
// Collection: raw_records
{
  _id: ObjectId("..."),
  source: "openalex",
  sourceId: "W2741809807",
  entityType: "work",
  data: {
    // Complete raw data from source
  },
  fetchedAt: ISODate("2025-01-24T10:00:00Z"),
  checksum: "sha256:..."
}

// Collection: author_profiles
{
  _id: ObjectId("..."),
  primaryOrcid: "0000-0002-1825-0097",
  displayName: "Marie Curie",
  email: "marie.curie@univ.fr",
  identities: [
    { type: "orcid", value: "0000-0002-1825-0097", isPrimary: true },
    { type: "hal_author", value: "marie-curie" }
  ],
  settings: {
    autoConfirmThreshold: 0.95
  },
  createdAt: ISODate("..."),
  updatedAt: ISODate("...")
}

// Collection: candidate_matches
{
  _id: ObjectId("..."),
  rawRecordId: ObjectId("..."),
  profileId: ObjectId("..."),
  matchScore: 0.92,
  matchReasons: [
    { type: "orcid_claimed", weight: 0.95 },
    { type: "affiliation_match", weight: 0.4 }
  ],
  status: "pending",
  // Denormalization to avoid lookups
  rawRecordSummary: {
    title: "Attention Is All You Need",
    source: "openalex",
    doi: "10.48550/arXiv.1706.03762"
  },
  createdAt: ISODate("...")
}

// Collection: verification_acts (append-only)
{
  _id: ObjectId("..."),
  candidateMatchId: ObjectId("..."),
  profileId: ObjectId("..."),
  decision: "confirm",
  confidence: "certain",
  notes: "This is indeed my article",
  evidence: [],
  decidedAt: ISODate("..."),
  decidedBy: "user:123",
  decidedVia: "web"
}

Analysis

┌─────────────────────────────────────────────────────────────────────────────┐
│                      MONGODB - ATLAS VERIFY ANALYSIS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ADVANTAGES:                                                                 │
│  ───────────                                                                 │
│  ✓ Natural model for raw_records (heterogeneous documents)                  │
│  ✓ Performant denormalization (embed summaries)                             │
│  ✓ Change streams for real-time                                             │
│  ✓ Excellent Atlas Search for full-text                                     │
│                                                                              │
│  DISADVANTAGES:                                                              │
│  ──────────────                                                              │
│  ✗ No true multi-document transactions (v4.0+ limited)                      │
│  ✗ Manual referential integrity                                             │
│  ✗ Less mature Kubernetes operator than PostgreSQL                          │
│  ✗ License cost for advanced features (Atlas)                               │
│  ✗ More complex to guarantee audit trail                                    │
│                                                                              │
│  VERDICT:                                                                    │
│  ────────                                                                    │
│  MongoDB would be suitable for raw_records storage only,                    │
│  but the need for strong integrity for verification decisions               │
│  favors PostgreSQL.                                                          │
│                                                                              │
│  HYBRID OPTION POSSIBLE:                                                     │
│  - MongoDB for raw_records (volume, flexibility)                            │
│  - PostgreSQL for profiles, candidates, verifications (integrity)           │
│  → Increased operational complexity, avoid unless proven need               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

4. ScyllaDB

Overview

ScyllaDB is a C++ rewrite of Cassandra, optimized for performance and massive scalability.

Strengths

AspectEvaluation
Wide columnsExcellent - Cassandra model
ScalabilityExcellent - linear scale
PerformanceExcellent - very low latency
FlexibilityWeak - rigid query model
KubernetesGood - Scylla Operator

Analysis

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SCYLLADB - ATLAS VERIFY ANALYSIS                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IDEAL USE CASES:                                                            │
│  ─────────────────                                                           │
│  - Very high-throughput time-series                                         │
│  - Logs / events                                                             │
│  - Persistent distributed cache                                             │
│                                                                              │
│  FOR ATLAS VERIFY:                                                           │
│  ─────────────────                                                           │
│  ✗ Query model too rigid for our case                                       │
│  ✗ Not suitable for ad-hoc queries (dynamic filters)                        │
│  ✗ No joins                                                                  │
│  ✗ Over-engineering for expected volume                                     │
│                                                                              │
│  POTENTIALLY USEFUL FOR:                                                     │
│  ─────────────────────────                                                   │
│  - Metrics storage (if huge volume)                                         │
│  - Audit logs (if very long retention)                                      │
│                                                                              │
│  VERDICT: Not recommended                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

5. Search Solutions

Meilisearch vs Elasticsearch vs Typesense

For full-text search on titles and author names, a dedicated solution can complement PostgreSQL.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SEARCH ENGINES COMPARED                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                   │ Meilisearch │ Elasticsearch │ Typesense │               │
│  ─────────────────┼─────────────┼───────────────┼───────────┤               │
│  Ease of use      │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐         │ ⭐⭐⭐⭐    │               │
│  Performance      │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐       │ ⭐⭐⭐⭐⭐  │               │
│  Features         │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐     │ ⭐⭐⭐⭐    │               │
│  Kubernetes       │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐     │ ⭐⭐⭐      │               │
│  Resources        │ ⭐⭐⭐⭐⭐    │ ⭐⭐          │ ⭐⭐⭐⭐    │               │
│  Typo-tolerance   │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐        │ ⭐⭐⭐⭐⭐  │               │
│  Multi-tenant     │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐     │ ⭐⭐⭐⭐    │               │
│  License          │ MIT          │ Dual (SSPL)   │ GPL-3      │               │
│  ─────────────────┴─────────────┴───────────────┴───────────┘               │
│                                                                              │
│  RECOMMENDATION: Meilisearch                                                 │
│  ────────────────────────────                                                │
│  - Very easy to configure and operate                                       │
│  - Excellent for title/name search                                          │
│  - Low memory footprint                                                      │
│  - Native typo-tolerance (important for author names)                       │
│  - Official Helm chart available                                             │
│                                                                              │
│  ALTERNATIVE: PostgreSQL full-text sufficient for v1                        │
│  - pg_trgm + tsvector cover 80% of needs                                    │
│  - Add Meilisearch if search UX insufficient                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Meilisearch Manifest

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: meilisearch
  namespace: atlas-verify
spec:
  serviceName: meilisearch
  replicas: 1
  selector:
    matchLabels:
      app: meilisearch
  template:
    metadata:
      labels:
        app: meilisearch
    spec:
      containers:
        - name: meilisearch
          image: getmeili/meilisearch:v1.6
          ports:
            - containerPort: 7700
          env:
            - name: MEILI_ENV
              value: "production"
            - name: MEILI_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: meilisearch-secrets
                  key: master-key
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "1Gi"
              cpu: "1"
          volumeMounts:
            - name: data
              mountPath: /meili_data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

---
apiVersion: v1
kind: Service
metadata:
  name: meilisearch
  namespace: atlas-verify
spec:
  selector:
    app: meilisearch
  ports:
    - port: 7700
      targetPort: 7700

6. Redis

Redis is essential as a cache and job queue, not as a primary database.

Uses in Atlas Verify

UseJustification
SessionsJWT/session cookie storage
CacheMatching results, frequent profiles
Rate limitingCounters per user/source
Job queueBullMQ for imports and notifications
Pub/SubReal-time notifications (SSE backend)

Redis Manifest (Bitnami)

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atlas-verify

---
# Via Helm: bitnami/redis
# helm install redis bitnami/redis -n atlas-verify -f values.yaml

# values.yaml
architecture: standalone  # or 'replication' for HA

auth:
  enabled: true
  existingSecret: redis-secrets
  existingSecretPasswordKey: password

master:
  persistence:
    enabled: true
    size: 8Gi

  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "500m"

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Final Recommendation

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RECOMMENDED DATA ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        KUBERNETES CLUSTER                            │   │
│  │                                                                      │   │
│  │  ┌─────────────────┐     ┌─────────────────┐                        │   │
│  │  │  PostgreSQL 16  │     │     Redis 7     │                        │   │
│  │  │  (CloudNativePG)│     │   (Bitnami)     │                        │   │
│  │  │                 │     │                 │                        │   │
│  │  │  • raw_records  │     │  • Sessions     │                        │   │
│  │  │  • profiles     │     │  • Cache        │                        │   │
│  │  │  • candidates   │     │  • Rate limits  │                        │   │
│  │  │  • verifications│     │  • BullMQ jobs  │                        │   │
│  │  │  • works        │     │                 │                        │   │
│  │  │                 │     │                 │                        │   │
│  │  │  3 instances HA │     │  1-3 instances  │                        │   │
│  │  │  50 Gi storage  │     │  8 Gi storage   │                        │   │
│  │  └────────┬────────┘     └────────┬────────┘                        │   │
│  │           │                       │                                  │   │
│  │           │    ┌──────────────────┘                                  │   │
│  │           │    │                                                     │   │
│  │           ▼    ▼                                                     │   │
│  │  ┌─────────────────┐     ┌─────────────────┐                        │   │
│  │  │   API Backend   │     │     Worker      │                        │   │
│  │  │     (Hono)      │     │   (BullMQ)      │                        │   │
│  │  │                 │     │                 │                        │   │
│  │  │  • REST API     │     │  • Imports      │                        │   │
│  │  │  • Auth ORCID   │     │  • Matching     │                        │   │
│  │  │  • SSE notifs   │     │  • Emails       │                        │   │
│  │  └─────────────────┘     └─────────────────┘                        │   │
│  │                                                                      │   │
│  │  ┌─────────────────┐     ┌─────────────────┐                        │   │
│  │  │    Frontend     │     │     MinIO       │                        │   │
│  │  │  (SvelteKit)    │     │   (Backups)     │                        │   │
│  │  └─────────────────┘     └─────────────────┘                        │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  OPTIONAL COMPONENTS (v2+):                                                 │
│  ──────────────────────────                                                  │
│  • Meilisearch: if PostgreSQL full-text search insufficient                │
│  • TimescaleDB: if time-series metrics important                           │
│  • Kafka/NATS: if event-driven architecture needed                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Justification

ChoiceReason
PostgreSQLJSONB + relational + full-text in single database, mature operators
CloudNativePGCloud-native operator, excellent documentation, CNCF backing
RedisStandard for cache/sessions, mature BullMQ
Not MongoDBTransactional integrity important, PostgreSQL JSONB sufficient
Not CockroachDBVolume too low to justify complexity
Not Meilisearch v1PostgreSQL pg_trgm + tsvector sufficient initially

Resource Estimation

ComponentInstancesCPUMemoryStorage
PostgreSQL32 cores4 Gi50 Gi
Redis1-30.5 core512 Mi8 Gi
API Backend2-40.5 core512 Mi-
Worker1-20.5 core512 Mi-
Frontend20.25 core256 Mi-
Total~10 pods~6 cores~8 Gi~60 Gi

Future Evolution

v1 (launch)          v2 (scale)              v3 (enterprise)
─────────────        ──────────              ────────────────
PostgreSQL           PostgreSQL              CockroachDB
Redis                Redis Cluster           Redis Cluster
-                    Meilisearch             Meilisearch
-                    -                       Kafka
-                    -                       Grafana Loki

Complete Kubernetes Infrastructure

Namespace and Secrets

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: atlas-verify
  labels:
    app.kubernetes.io/name: atlas-verify

---
apiVersion: v1
kind: Secret
metadata:
  name: atlas-verify-secrets
  namespace: atlas-verify
type: Opaque
stringData:
  ORCID_CLIENT_ID: "APP-XXXX"
  ORCID_CLIENT_SECRET: "secret"
  SESSION_SECRET: "random-32-char-secret"
  DATABASE_URL: "postgresql://user:pass@atlas-verify-db-rw:5432/atlas"
  REDIS_URL: "redis://:pass@redis-master:6379"

---
apiVersion: v1
kind: Secret
metadata:
  name: s3-creds
  namespace: atlas-verify
type: Opaque
stringData:
  ACCESS_KEY_ID: "minio-access-key"
  SECRET_ACCESS_KEY: "minio-secret-key"

API Deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlas-verify-api
  namespace: atlas-verify
spec:
  replicas: 2
  selector:
    matchLabels:
      app: atlas-verify-api
  template:
    metadata:
      labels:
        app: atlas-verify-api
    spec:
      containers:
        - name: api
          image: ghcr.io/univ-lehavre/atlas-verify-api:latest
          ports:
            - containerPort: 3000
          envFrom:
            - secretRef:
                name: atlas-verify-secrets
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3000"
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5

---
apiVersion: v1
kind: Service
metadata:
  name: atlas-verify-api
  namespace: atlas-verify
spec:
  selector:
    app: atlas-verify-api
  ports:
    - port: 3000
      targetPort: 3000

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atlas-verify
  namespace: atlas-verify
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - verify.atlas.univ-lehavre.fr
      secretName: atlas-verify-tls
  rules:
    - host: verify.atlas.univ-lehavre.fr
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: atlas-verify-api
                port:
                  number: 3000

Worker Deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlas-verify-worker
  namespace: atlas-verify
spec:
  replicas: 1
  selector:
    matchLabels:
      app: atlas-verify-worker
  template:
    metadata:
      labels:
        app: atlas-verify-worker
    spec:
      containers:
        - name: worker
          image: ghcr.io/univ-lehavre/atlas-verify-worker:latest
          envFrom:
            - secretRef:
                name: atlas-verify-secrets
          env:
            - name: NODE_ENV
              value: "production"
            - name: WORKER_CONCURRENCY
              value: "5"
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"

Conclusion

For Atlas Verify, the recommendation is:

  1. PostgreSQL 16 with CloudNativePG as primary database

    • JSONB for heterogeneous raw data
    • Strict relational for verification integrity
    • Built-in full-text search
  2. Redis for cache, sessions, and jobs

  3. Modular monolithic architecture rather than microservices

    • Simpler to operate
    • Easy transactions
    • Evolvable to microservices if needed
  4. Meilisearch as option if PostgreSQL full-text search proves insufficient

This architecture comfortably supports 10M+ records and 100k+ users with modest resources (~6 cores, 8 Gi RAM).