Advanced Databases and Multi-source Search

This document analyzes specialized databases, search engines, and federation architectures to meet the complex needs of bibliographic search.

See also:
Databases - PostgreSQL/MongoDB analysis for primary storage
Author Verification - Data model and verification workflows
Researcher Profile - Career reconstruction, expertise, collaborations
Unified Schema - OpenAPI specification and entities

Researcher Use Cases

Identified Needs

┌─────────────────────────────────────────────────────────────────────────────┐
│                           RESEARCHER NEEDS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DISCOVERY                                                                   │
│  ─────────                                                                   │
│  "Who are the experts in ethical machine learning in Paris?"                │
│  "Which articles cite my work and work on similar topics?"                  │
│  "What are the emerging trends in my field?"                                │
│  "Who collaborates with whom in my thematic network?"                       │
│                                                                              │
│  EXPLORATION                                                                 │
│  ───────────                                                                 │
│  "Show me the evolution of CRISPR publications since 2012"                  │
│  "Which European labs work on solid-state batteries?"                       │
│  "Visualize this researcher's co-author graph"                              │
│  "Which funders support this field?"                                        │
│                                                                              │
│  MONITORING                                                                  │
│  ──────────                                                                  │
│  "Alert me when someone cites my article"                                   │
│  "New preprints in my field this week"                                      │
│  "Evolution of my competitors' h-index"                                     │
│                                                                              │
│  ANALYSIS                                                                    │
│  ────────                                                                    │
│  "Semantic similarity between two article corpora"                          │
│  "Clustering of my laboratory's research topics"                            │
│  "Impact prediction for a preprint"                                         │
│                                                                              │
│  REPORTING                                                                   │
│  ─────────                                                                   │
│  "Bibliometric export for HCERES evaluation"                                │
│  "Publication statistics by team"                                           │
│  "Mapping of international collaborations"                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Translated Query Types

User Need	Data Dimensions	Required Indexes
Expert on a topic in a city	Text + Geo + Graph	Full-text, Spatial, Graph
Temporal evolution of a field	Time-series + Text	Temporal, Full-text
Similar articles	Semantic vectors	Vector (ANN)
Collaboration network	Graph	Graph traversal
Emerging trends	Time-series + NLP	Temporal, Vector
Impact prediction	ML features	Vector, Aggregations

Multi-model Databases

ArangoDB

Overview

ArangoDB is a native multi-model database combining documents, graphs, and search in a single system.

┌─────────────────────────────────────────────────────────────────────────────┐
│                              ARANGODB                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SUPPORTED MODELS                                                            │
│  ────────────────                                                            │
│  ✓ Documents (JSON) - Like MongoDB                                          │
│  ✓ Graphs - Native traversals, faster than Neo4j in some cases             │
│  ✓ Key-Value - Built-in cache                                               │
│  ✓ Search (ArangoSearch) - Full-text with BM25 scoring                      │
│  ✓ GeoJSON - Spatial indexes                                                │
│                                                                              │
│  QUERY LANGUAGE: AQL                                                         │
│  ───────────────────                                                         │
│  Unified for all models, similar to SQL with graph extensions               │
│                                                                              │
│  EXAMPLE - Experts on a topic in a region:                                  │
│  ──────────────────────────────────────────────────────────────────────     │
│  FOR author IN authors                                                       │
│    FILTER GEO_DISTANCE(author.location, @parisCenter) < 50000               │
│    LET expertise = (                                                         │
│      FOR v, e IN 1..1 OUTBOUND author authored                              │
│        FOR work IN works FILTER work._id == v._id                           │
│          FILTER ANALYZER(work.title, "text_en") LIKE "%machine learning%"   │
│        RETURN 1                                                              │
│    )                                                                         │
│    FILTER LENGTH(expertise) > 5                                             │
│    SORT LENGTH(expertise) DESC                                               │
│    RETURN { author, expertiseCount: LENGTH(expertise) }                     │
│                                                                              │
│  KUBERNETES                                                                  │
│  ──────────                                                                  │
│  • ArangoDB Kubernetes Operator (official)                                  │
│  • Helm chart: arangodb/kube-arangodb                                       │
│  • Modes: Single, ActiveFailover, Cluster (sharding)                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Model for Atlas

javascript

// Collection: works (documents)
{
  _key: "W2741809807",
  title: "Attention Is All You Need",
  abstract: "The dominant sequence transduction models...",
  year: 2017,
  doi: "10.48550/arXiv.1706.03762",
  venue: "NeurIPS",
  citationCount: 89000,
  embedding: [0.12, -0.34, ...],  // SPECTER vector 768d
  location: null,  // For works with location (conferences)
  keywords: ["transformer", "attention", "NLP"],
  _source: "openalex"
}

// Collection: authors (documents)
{
  _key: "A123456",
  displayName: "Ashish Vaswani",
  orcid: "0000-0002-1234-5678",
  affiliations: [
    {
      institution: "Google Brain",
      ror: "03yrm5c26",
      location: { type: "Point", coordinates: [37.4220, -122.0841] },
      startYear: 2015
    }
  ],
  hIndex: 45,
  topics: ["deep learning", "NLP", "transformers"]
}

// Collection: institutions (documents)
{
  _key: "I27837315",
  name: "Google",
  ror: "03yrm5c26",
  type: "company",
  location: {
    type: "Point",
    coordinates: [-122.0841, 37.4220]  // Mountain View
  },
  country: "US"
}

// Edge collection: authored (graph)
{
  _from: "authors/A123456",
  _to: "works/W2741809807",
  position: 1,
  isCorresponding: true,
  affiliationAtTime: "Google Brain"
}

// Edge collection: cites (graph)
{
  _from: "works/W2741809807",
  _to: "works/W1234567890",
  context: "building upon the work of..."
}

// Edge collection: affiliated (graph)
{
  _from: "authors/A123456",
  _to: "institutions/I27837315",
  startYear: 2015,
  endYear: null,
  role: "Research Scientist"
}

// Edge collection: collaborated (graph, computed)
{
  _from: "authors/A123456",
  _to: "authors/A789012",
  coauthorships: 15,
  firstCollab: 2015,
  lastCollab: 2024
}

Advanced AQL Queries

sql

// 1. Find experts on a topic in a geographic area
FOR author IN authors
  // Geographic filter: authors within 50km radius of Paris
  FILTER author.affiliations[*].location != null
  LET currentAffiliation = FIRST(
    FOR aff IN author.affiliations
      FILTER aff.endYear == null
      RETURN aff
  )
  FILTER currentAffiliation != null
  FILTER GEO_DISTANCE(
    currentAffiliation.location,
    GEO_POINT(2.3522, 48.8566)  // Paris
  ) < 50000

  // Count publications on the topic
  LET relevantWorks = (
    FOR v IN 1..1 OUTBOUND author authored
      SEARCH ANALYZER(v.title IN TOKENS("machine learning ethics", "text_en"), "text_en")
         OR ANALYZER(v.abstract IN TOKENS("machine learning ethics", "text_en"), "text_en")
      RETURN v
  )

  FILTER LENGTH(relevantWorks) >= 3

  // Expertise score
  LET expertiseScore = SUM(relevantWorks[*].citationCount) / LENGTH(relevantWorks)

  SORT expertiseScore DESC
  LIMIT 20

  RETURN {
    author: author.displayName,
    orcid: author.orcid,
    institution: currentAffiliation.institution,
    publicationsOnTopic: LENGTH(relevantWorks),
    avgCitations: expertiseScore,
    hIndex: author.hIndex
  }

// 2. Co-author graph with depth
FOR author IN authors
  FILTER author._key == "A123456"

  // Traverse co-author graph up to depth 2
  FOR v, e, p IN 1..2 OUTBOUND author collaborated
    OPTIONS { bfs: true, uniqueVertices: "global" }

    LET distance = LENGTH(p.edges)

    RETURN {
      coauthor: v.displayName,
      distance: distance,
      sharedWorks: e.coauthorships,
      path: p.vertices[*].displayName
    }

// 3. Temporal evolution of a field
FOR work IN works
  SEARCH ANALYZER(work.title IN TOKENS("CRISPR gene editing", "text_en"), "text_en")
     OR ANALYZER(work.abstract IN TOKENS("CRISPR gene editing", "text_en"), "text_en")

  COLLECT year = work.year INTO worksPerYear

  SORT year ASC

  RETURN {
    year: year,
    count: LENGTH(worksPerYear),
    totalCitations: SUM(worksPerYear[*].work.citationCount),
    topPapers: (
      FOR w IN worksPerYear[*].work
        SORT w.citationCount DESC
        LIMIT 3
        RETURN { title: w.title, citations: w.citationCount }
    )
  }

// 4. Similar articles (nearest neighbors on embeddings)
LET targetWork = DOCUMENT("works/W2741809807")

FOR work IN works
  FILTER work._key != targetWork._key

  // Cosine distance on embeddings
  LET similarity = (
    LET dotProduct = SUM(
      FOR i IN 0..LENGTH(targetWork.embedding)-1
        RETURN targetWork.embedding[i] * work.embedding[i]
    )
    LET normA = SQRT(SUM(FOR x IN targetWork.embedding RETURN x*x))
    LET normB = SQRT(SUM(FOR x IN work.embedding RETURN x*x))
    RETURN dotProduct / (normA * normB)
  )[0]

  FILTER similarity > 0.7

  SORT similarity DESC
  LIMIT 10

  RETURN {
    title: work.title,
    year: work.year,
    similarity: similarity,
    doi: work.doi
  }

// 5. Citation path between two articles
FOR path IN OUTBOUND K_SHORTEST_PATHS
  "works/W2741809807" TO "works/W9999999"
  cites
  OPTIONS { weightAttribute: "weight" }
  LIMIT 5

  RETURN {
    length: LENGTH(path.edges),
    papers: path.vertices[*].title
  }

ArangoDB Evaluation

Aspect	Score	Notes
Native multi-model	⭐⭐⭐⭐⭐	Unified Documents + Graph + Search
Graph	⭐⭐⭐⭐	Very good, but Neo4j more mature
Full-text	⭐⭐⭐⭐	ArangoSearch based on IResearch
Geospatial	⭐⭐⭐⭐	Native GeoJSON
Vector search	⭐⭐⭐	Possible but not optimized (no ANN)
Kubernetes	⭐⭐⭐⭐	Mature official operator
Scalability	⭐⭐⭐⭐	Sharding, but complex
Community	⭐⭐⭐	Smaller than Postgres/Mongo

SurrealDB

Overview

SurrealDB is a next-generation multi-model database with SQL-like queries and modern features.

┌─────────────────────────────────────────────────────────────────────────────┐
│                             SURREALDB                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CHARACTERISTICS                                                             │
│  ───────────────                                                             │
│  • Documents, Graphs, Relations, Time-series                                │
│  • SurrealQL (SQL-like with extensions)                                     │
│  • Granular row-level permissions                                           │
│  • Real-time subscriptions (WebSocket)                                      │
│  • Embeddable or server                                                      │
│                                                                              │
│  STRENGTHS                                                                   │
│  ─────────                                                                   │
│  ✓ Record links (native relations)                                          │
│  ✓ Computed fields                                                           │
│  ✓ Events and triggers                                                       │
│  ✓ Built-in full-text search                                                │
│  ✓ Geospatial                                                                │
│                                                                              │
│  WEAKNESSES                                                                  │
│  ──────────                                                                  │
│  ✗ Very young (v1.0 end of 2023)                                            │
│  ✗ Not yet production-ready for large volumes                               │
│  ✗ No official Kubernetes operator                                          │
│  ✗ Basic vector search                                                       │
│                                                                              │
│  VERDICT: Worth watching, not recommended for immediate production          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Specialized Databases

Vector Databases (Embeddings / Semantic Search)

Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│                       VECTOR DATABASES COMPARED                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                 │ Milvus  │ Qdrant  │ Weaviate │ Pinecone │ pgvector │     │
│  ───────────────┼─────────┼─────────┼──────────┼──────────┼──────────┤     │
│  Performance    │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐⭐│ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐     │     │
│  Scalability    │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐     │     │
│  Kubernetes     │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐   │ N/A      │ ⭐⭐⭐⭐⭐  │     │
│  Open Source    │ ✅       │ ✅       │ ✅        │ ❌       │ ✅        │     │
│  Hybrid filters │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐⭐│ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐   │     │
│  Multi-tenancy  │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐⭐│ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐     │     │
│  PG Integration │ ❌       │ ❌       │ ❌        │ ❌       │ ✅        │     │
│  Simplicity     │ ⭐⭐⭐    │ ⭐⭐⭐⭐  │ ⭐⭐⭐     │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐⭐  │     │
│  ───────────────┴─────────┴─────────┴──────────┴──────────┴──────────┘     │
│                                                                              │
│  ATLAS RECOMMENDATION                                                        │
│  ────────────────────                                                        │
│                                                                              │
│  Option A: pgvector (simplicity)                                            │
│  ───────────────────────────────                                             │
│  • PostgreSQL extension, no new database                                    │
│  • Sufficient for < 10M vectors                                             │
│  • HNSW index available since v0.5                                          │
│  • Native SQL filtering                                                      │
│                                                                              │
│  Option B: Qdrant (performance + features)                                  │
│  ─────────────────────────────────────────                                   │
│  • Best performance/simplicity ratio                                        │
│  • Excellent hybrid filtering                                               │
│  • Rust, very memory efficient                                              │
│  • Official Helm chart                                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Qdrant for Atlas

yaml

# Helm install
# helm repo add qdrant https://qdrant.github.io/qdrant-helm
# helm install qdrant qdrant/qdrant -n atlas-verify -f values.yaml

# values.yaml
replicaCount: 1

persistence:
  enabled: true
  size: 20Gi

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "2"

config:
  cluster:
    enabled: false  # Single node to start
  storage:
    performance:
      optimizers_count: 2

typescript

// Qdrant client
import { QdrantClient } from '@qdrant/js-client-rest';

const qdrant = new QdrantClient({ url: 'http://qdrant:6333' });

// Create collection for article embeddings
await qdrant.createCollection('work_embeddings', {
  vectors: {
    specter: { size: 768, distance: 'Cosine' },   // SPECTER embeddings
    title: { size: 384, distance: 'Cosine' },     // Title embeddings (MiniLM)
  },
  // Payload for hybrid filtering
  payload_schema: {
    year: { data_type: 'Integer', indexed: true },
    source: { data_type: 'Keyword', indexed: true },
    venue_type: { data_type: 'Keyword', indexed: true },
    citation_count: { data_type: 'Integer', indexed: true },
  },
});

// Semantic search with filters
const results = await qdrant.search('work_embeddings', {
  vector: {
    name: 'specter',
    vector: queryEmbedding,  // 768d vector
  },
  filter: {
    must: [
      { key: 'year', range: { gte: 2020 } },
      { key: 'source', match: { value: 'openalex' } },
    ],
  },
  limit: 20,
  with_payload: true,
});

pgvector (integrated alternative)

sql

-- pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embeddings
ALTER TABLE canonical_works
  ADD COLUMN embedding vector(768);

-- HNSW index for efficient ANN search
CREATE INDEX ON canonical_works
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Similarity search with SQL filters
SELECT
  id, title, year,
  1 - (embedding <=> $1::vector) AS similarity
FROM canonical_works
WHERE year >= 2020
  AND type = 'article'
ORDER BY embedding <=> $1::vector
LIMIT 20;

Time-series Databases

InfluxDB vs TimescaleDB vs QuestDB

┌─────────────────────────────────────────────────────────────────────────────┐
│                      TIME-SERIES DATABASES COMPARED                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                   │ InfluxDB 3 │ TimescaleDB │ QuestDB │ ClickHouse │       │
│  ─────────────────┼────────────┼─────────────┼─────────┼────────────┤       │
│  Language         │ SQL + Flux │ SQL         │ SQL     │ SQL        │       │
│  Compression      │ ⭐⭐⭐⭐⭐   │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐⭐   │       │
│  Ingestion        │ ⭐⭐⭐⭐⭐   │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐⭐   │       │
│  Aggregations     │ ⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐  │ ⭐⭐⭐⭐⭐   │       │
│  PostgreSQL compat│ ❌          │ ✅          │ ✅       │ ❌         │       │
│  Kubernetes       │ ⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐    │ ⭐⭐⭐⭐    │       │
│  Open Source      │ ⚠️ (OSS 2) │ ✅          │ ✅       │ ✅         │       │
│  ─────────────────┴────────────┴─────────────┴─────────┴────────────┘       │
│                                                                              │
│  ATLAS USE CASES                                                             │
│  ───────────────                                                             │
│                                                                              │
│  Metrics to store:                                                           │
│  • Citation evolution by article (daily)                                    │
│  • Publications by field (weekly)                                           │
│  • Author h-index (monthly)                                                 │
│  • Thematic trends (rolling window)                                         │
│  • Verification audit logs                                                   │
│                                                                              │
│  RECOMMENDATION: TimescaleDB                                                 │
│  ────────────────────────────                                                │
│  • PostgreSQL extension → single database to manage                         │
│  • Standard SQL                                                              │
│  • Continuous aggregates for rollups                                        │
│  • Automatic compression                                                     │
│  • Official Helm chart (timescale/timescaledb-single)                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

TimescaleDB Schema for Metrics

sql

-- TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Citation metrics by article
CREATE TABLE citation_metrics (
  time TIMESTAMPTZ NOT NULL,
  work_id UUID NOT NULL,
  citation_count INTEGER,
  reference_count INTEGER,
  influential_count INTEGER,
  source TEXT
);

SELECT create_hypertable('citation_metrics', 'time');

-- Compression after 7 days
ALTER TABLE citation_metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'work_id'
);

SELECT add_compression_policy('citation_metrics', INTERVAL '7 days');

-- Continuous aggregate for weekly trends
CREATE MATERIALIZED VIEW citation_weekly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 week', time) AS week,
  work_id,
  MAX(citation_count) AS max_citations,
  AVG(citation_count) AS avg_citations
FROM citation_metrics
GROUP BY week, work_id;

-- Metrics by field
CREATE TABLE domain_metrics (
  time TIMESTAMPTZ NOT NULL,
  domain TEXT NOT NULL,
  publication_count INTEGER,
  preprint_count INTEGER,
  avg_citations NUMERIC,
  top_venues JSONB
);

SELECT create_hypertable('domain_metrics', 'time');

-- Query: field evolution
SELECT
  time_bucket('1 month', time) AS month,
  domain,
  SUM(publication_count) AS total_publications,
  AVG(avg_citations) AS avg_citations
FROM domain_metrics
WHERE domain = 'machine learning'
  AND time > NOW() - INTERVAL '5 years'
GROUP BY month, domain
ORDER BY month;

Geospatial Databases

PostGIS vs H3 vs DuckDB Spatial

┌─────────────────────────────────────────────────────────────────────────────┐
│                          GEOSPATIAL OPTIONS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  POSTGIS (PostgreSQL Extension)                                              │
│  ──────────────────────────────                                              │
│  ✓ De facto standard                                                         │
│  ✓ All geometry types                                                       │
│  ✓ Very efficient GIST indexes                                              │
│  ✓ Complete spatial functions                                               │
│  → RECOMMENDED for Atlas (already on PostgreSQL)                            │
│                                                                              │
│  H3 (Uber Hexagonal Grid)                                                    │
│  ────────────────────────                                                    │
│  ✓ Hierarchical hexagonal indexing                                          │
│  ✓ Excellent for geo aggregations                                           │
│  ✓ PostgreSQL extension available                                           │
│  → Useful for heatmaps and geographic clustering                            │
│                                                                              │
│  ATLAS QUERIES                                                               │
│  ─────────────                                                               │
│  • Institutions within a radius                                             │
│  • Geographic collaborations                                                 │
│  • Publications heatmap by region                                           │
│  • Lab clustering by proximity                                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

PostGIS Schema for Atlas

sql

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS h3;  -- Optional for aggregations

-- Institutions with location
ALTER TABLE institutions
  ADD COLUMN location GEOMETRY(Point, 4326);

CREATE INDEX idx_institutions_geo ON institutions USING GIST (location);

-- Query: labs within 100km radius of Paris
SELECT
  i.name,
  i.country,
  ST_Distance(
    i.location::geography,
    ST_MakePoint(2.3522, 48.8566)::geography
  ) / 1000 AS distance_km,
  COUNT(DISTINCT wa.work_id) AS publications
FROM institutions i
JOIN work_authorships wa ON wa.affiliation_institution_id = i.id
WHERE ST_DWithin(
  i.location::geography,
  ST_MakePoint(2.3522, 48.8566)::geography,
  100000  -- 100km in meters
)
GROUP BY i.id
ORDER BY distance_km;

-- H3 aggregation for heatmap
SELECT
  h3_lat_lng_to_cell(
    ST_Y(i.location),
    ST_X(i.location),
    5  -- H3 resolution
  ) AS h3_cell,
  COUNT(*) AS institution_count,
  SUM(i.works_count) AS total_works
FROM institutions i
WHERE i.location IS NOT NULL
GROUP BY h3_cell;

Search Engines

Elasticsearch vs OpenSearch vs Meilisearch

┌─────────────────────────────────────────────────────────────────────────────┐
│                       SEARCH ENGINES COMPARED                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                   │ OpenSearch  │ Elasticsearch │ Meilisearch │ Typesense │ │
│  ─────────────────┼─────────────┼───────────────┼─────────────┼───────────┤ │
│  License          │ Apache 2.0  │ SSPL/Elastic  │ MIT         │ GPL-3     │ │
│  Full-text        │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐  │ │
│  Aggregations     │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐        │ ⭐⭐⭐     │ │
│  Vector search    │ ⭐⭐⭐⭐     │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐        │ ⭐⭐⭐     │ │
│  Multitenancy     │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐⭐      │ ⭐⭐⭐⭐   │ │
│  Kubernetes       │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐⭐      │ ⭐⭐⭐     │ │
│  Resources        │ ⭐⭐         │ ⭐⭐           │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐   │ │
│  Simplicity       │ ⭐⭐⭐       │ ⭐⭐⭐         │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐   │ │
│  Advanced scoring │ ⭐⭐⭐⭐⭐    │ ⭐⭐⭐⭐⭐      │ ⭐⭐⭐        │ ⭐⭐⭐     │ │
│  Built-in ML      │ ⭐⭐⭐⭐     │ ⭐⭐⭐⭐⭐      │ ❌           │ ❌        │ │
│  ─────────────────┴─────────────┴───────────────┴─────────────┴───────────┘ │
│                                                                              │
│  ATLAS RECOMMENDATION                                                        │
│  ────────────────────                                                        │
│                                                                              │
│  OpenSearch for advanced search                                              │
│  ────────────────────────────────                                            │
│  • Apache 2.0 fork of Elasticsearch                                         │
│  • Equivalent features                                                       │
│  • No license risk                                                           │
│  • Excellent AWS support (if cloud)                                          │
│  • Helm chart: opensearch-project/opensearch                                │
│  • Operator: opensearch-project/opensearch-k8s-operator                     │
│                                                                              │
│  Meilisearch as complement for instant UX                                   │
│  ────────────────────────────────────────────                                │
│  • Ultra-fast autocomplete                                                  │
│  • Typo-tolerance                                                           │
│  • Very low footprint                                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

OpenSearch Architecture for Atlas

yaml

# OpenSearch cluster manifest
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: atlas-search
  namespace: atlas-verify
spec:
  general:
    serviceName: atlas-search
    version: 2.11.0
    httpPort: 9200
    pluginsList:
      - analysis-icu
      - analysis-kuromoji  # Japanese
      - analysis-smartcn   # Chinese

  dashboards:
    enable: true
    replicas: 1

  nodePools:
    - component: masters
      replicas: 3
      roles:
        - master
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
      persistence:
        emptyDir: {}

    - component: data
      replicas: 2
      roles:
        - data
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
      persistence:
        pvc:
          storageClass: local-path
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi

OpenSearch Multi-entity Mapping

json

{
  "settings": {
    "analysis": {
      "analyzer": {
        "title_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "synonym_filter"]
        },
        "author_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "name_synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "ML, machine learning",
            "DL, deep learning",
            "NLP, natural language processing",
            "AI, artificial intelligence"
          ]
        },
        "name_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "university, univ",
            "laboratory, lab"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "_entity_type": { "type": "keyword" },

      // ══════════════════════════════════════════════════════
      // WORK FIELDS
      // ══════════════════════════════════════════════════════
      "title": {
        "type": "text",
        "analyzer": "title_analyzer",
        "fields": {
          "exact": { "type": "keyword" },
          "suggest": {
            "type": "completion",
            "contexts": [
              { "name": "source", "type": "category" }
            ]
          }
        }
      },
      "abstract": {
        "type": "text",
        "analyzer": "english"
      },
      "year": { "type": "integer" },
      "doi": { "type": "keyword" },
      "work_type": { "type": "keyword" },
      "venue": {
        "type": "text",
        "fields": { "exact": { "type": "keyword" } }
      },
      "citation_count": { "type": "integer" },
      "keywords": { "type": "keyword" },
      "source": { "type": "keyword" },

      // Embeddings for vector search
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib"
        }
      },

      // ══════════════════════════════════════════════════════
      // AUTHOR FIELDS
      // ══════════════════════════════════════════════════════
      "display_name": {
        "type": "text",
        "analyzer": "author_analyzer",
        "fields": {
          "exact": { "type": "keyword" },
          "suggest": { "type": "completion" }
        }
      },
      "orcid": { "type": "keyword" },
      "h_index": { "type": "integer" },
      "works_count": { "type": "integer" },
      "affiliations": {
        "type": "nested",
        "properties": {
          "institution": { "type": "text" },
          "ror": { "type": "keyword" },
          "country": { "type": "keyword" }
        }
      },
      "topics": { "type": "keyword" },

      // ══════════════════════════════════════════════════════
      // INSTITUTION FIELDS
      // ══════════════════════════════════════════════════════
      "institution_name": {
        "type": "text",
        "analyzer": "author_analyzer",
        "fields": { "exact": { "type": "keyword" } }
      },
      "ror": { "type": "keyword" },
      "institution_type": { "type": "keyword" },
      "country": { "type": "keyword" },
      "location": { "type": "geo_point" },

      // ══════════════════════════════════════════════════════
      // COMMON
      // ══════════════════════════════════════════════════════
      "created_at": { "type": "date" },
      "updated_at": { "type": "date" }
    }
  }
}

Advanced OpenSearch Queries

typescript

// OpenSearch client
import { Client } from '@opensearch-project/opensearch';

const client = new Client({ node: 'http://atlas-search:9200' });

// ══════════════════════════════════════════════════════════════════════════
// 1. MULTI-ENTITY SEARCH
// ══════════════════════════════════════════════════════════════════════════

const multiEntitySearch = async (query: string) => {
  const response = await client.search({
    index: 'atlas',
    body: {
      query: {
        bool: {
          should: [
            // Search on works (title, abstract)
            {
              bool: {
                must: { multi_match: { query, fields: ['title^3', 'abstract', 'keywords^2'] } },
                filter: { term: { _entity_type: 'work' } }
              }
            },
            // Search on authors
            {
              bool: {
                must: { match: { display_name: { query, fuzziness: 'AUTO' } } },
                filter: { term: { _entity_type: 'author' } }
              }
            },
            // Search on institutions
            {
              bool: {
                must: { match: { institution_name: { query, fuzziness: 'AUTO' } } },
                filter: { term: { _entity_type: 'institution' } }
              }
            }
          ]
        }
      },
      // Aggregations by entity type
      aggs: {
        by_entity_type: {
          terms: { field: '_entity_type' },
          aggs: {
            top_hits: { top_hits: { size: 5 } }
          }
        }
      },
      highlight: {
        fields: {
          title: {},
          abstract: { fragment_size: 200 },
          display_name: {},
          institution_name: {}
        }
      }
    }
  });

  return response.body;
};

// ══════════════════════════════════════════════════════════════════════════
// 2. EXPERT ON A TOPIC IN AN AREA
// ══════════════════════════════════════════════════════════════════════════

const findExperts = async (topic: string, location: { lat: number; lon: number }, radiusKm: number) => {
  const response = await client.search({
    index: 'atlas',
    body: {
      query: {
        bool: {
          must: [
            { term: { _entity_type: 'author' } },
            { match: { topics: topic } }
          ],
          filter: [
            {
              nested: {
                path: 'affiliations',
                query: {
                  geo_distance: {
                    distance: `${radiusKm}km`,
                    'affiliations.location': location
                  }
                }
              }
            }
          ]
        }
      },
      sort: [
        { h_index: 'desc' },
        { works_count: 'desc' }
      ],
      size: 20
    }
  });

  return response.body.hits.hits;
};

// ══════════════════════════════════════════════════════════════════════════
// 3. SEMANTIC SEARCH (VECTOR + KEYWORDS)
// ══════════════════════════════════════════════════════════════════════════

const semanticSearch = async (queryEmbedding: number[], keywords: string[], filters: any) => {
  const response = await client.search({
    index: 'atlas',
    body: {
      query: {
        bool: {
          must: [
            { term: { _entity_type: 'work' } },
            // Vector search (k-NN)
            {
              knn: {
                embedding: {
                  vector: queryEmbedding,
                  k: 50
                }
              }
            }
          ],
          should: [
            // Boost if keywords match
            { terms: { keywords: keywords, boost: 2 } }
          ],
          filter: filters
        }
      },
      // Rescoring to combine BM25 and vector
      rescore: {
        window_size: 100,
        query: {
          rescore_query: {
            multi_match: {
              query: keywords.join(' '),
              fields: ['title^2', 'abstract']
            }
          },
          query_weight: 0.3,
          rescore_query_weight: 0.7
        }
      }
    }
  });

  return response.body.hits.hits;
};

// ══════════════════════════════════════════════════════════════════════════
// 4. MULTI-ENTITY AUTOCOMPLETE
// ══════════════════════════════════════════════════════════════════════════

const autocomplete = async (prefix: string) => {
  const response = await client.search({
    index: 'atlas',
    body: {
      suggest: {
        title_suggest: {
          prefix,
          completion: {
            field: 'title.suggest',
            size: 5,
            contexts: { source: ['openalex', 'crossref'] }
          }
        },
        author_suggest: {
          prefix,
          completion: {
            field: 'display_name.suggest',
            size: 5
          }
        }
      }
    }
  });

  return {
    works: response.body.suggest.title_suggest[0].options,
    authors: response.body.suggest.author_suggest[0].options
  };
};

// ══════════════════════════════════════════════════════════════════════════
// 5. FACETS AND AGGREGATIONS
// ══════════════════════════════════════════════════════════════════════════

const searchWithFacets = async (query: string) => {
  const response = await client.search({
    index: 'atlas',
    body: {
      query: {
        bool: {
          must: [
            { term: { _entity_type: 'work' } },
            { multi_match: { query, fields: ['title', 'abstract'] } }
          ]
        }
      },
      aggs: {
        // Facet by year
        by_year: {
          histogram: {
            field: 'year',
            interval: 1,
            min_doc_count: 1
          }
        },
        // Facet by type
        by_type: {
          terms: { field: 'work_type', size: 10 }
        },
        // Facet by source
        by_source: {
          terms: { field: 'source', size: 10 }
        },
        // Facet by venue (top 20)
        by_venue: {
          terms: { field: 'venue.exact', size: 20 }
        },
        // Citation stats
        citation_stats: {
          stats: { field: 'citation_count' }
        },
        // Citation distribution
        citation_ranges: {
          range: {
            field: 'citation_count',
            ranges: [
              { to: 10 },
              { from: 10, to: 50 },
              { from: 50, to: 100 },
              { from: 100, to: 500 },
              { from: 500 }
            ]
          }
        }
      },
      size: 20
    }
  });

  return {
    hits: response.body.hits,
    facets: response.body.aggregations
  };
};

Multi-database Federation Architecture

Problem Statement

How to execute a single query across multiple databases (PostgreSQL, OpenSearch, Qdrant, TimescaleDB)?

┌─────────────────────────────────────────────────────────────────────────────┐
│                           QUERY FEDERATION                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  USER QUERY                                                                  │
│  ──────────                                                                  │
│  "Experts in ethical machine learning, in Paris, with recent highly         │
│   cited publications, and articles similar to my draft"                     │
│                                                                              │
│  DECOMPOSITION                                                               │
│  ─────────────                                                               │
│  1. Full-text "machine learning ethics" → OpenSearch                        │
│  2. Geo "Paris 50km" → PostgreSQL/PostGIS                                   │
│  3. Time-series "recent publications" → TimescaleDB                         │
│  4. Vector "similar to draft" → Qdrant                                      │
│  5. Graph "co-author network" → PostgreSQL or ArangoDB                      │
│  6. Final aggregation → Application layer                                   │
│                                                                              │
│  APPROACHES                                                                  │
│  ──────────                                                                  │
│  A. Query Federation (application layer)                                    │
│  B. Data Virtualization (Trino/Presto)                                      │
│  C. Polyglot Persistence with Event Sourcing                                │
│  D. GraphQL Federation                                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Approach A: Query Federation (Application Layer)

typescript

// ═══════════════════════════════════════════════════════════════════════════
// FEDERATED QUERY SERVICE
// ═══════════════════════════════════════════════════════════════════════════

import { Effect, pipe } from 'effect';

interface FederatedQuery {
  text?: string;              // Full-text query
  embedding?: number[];       // Vector query
  geo?: {
    lat: number;
    lon: number;
    radiusKm: number;
  };
  timeRange?: {
    from: Date;
    to: Date;
  };
  filters?: Record<string, unknown>;
  entityTypes?: ('work' | 'author' | 'institution')[];
  limit?: number;
}

interface QueryPlan {
  steps: QueryStep[];
  mergeStrategy: 'union' | 'intersection' | 'weighted';
}

interface QueryStep {
  source: 'postgres' | 'opensearch' | 'qdrant' | 'timescale';
  query: unknown;
  weight: number;
  dependsOn?: string[];
}

class FederatedQueryService {
  constructor(
    private readonly postgres: PostgresClient,
    private readonly opensearch: OpenSearchClient,
    private readonly qdrant: QdrantClient,
    private readonly timescale: TimescaleClient
  ) {}

  /**
   * Executes a federated query across all relevant sources
   */
  execute = (query: FederatedQuery): Effect.Effect<FederatedResult, QueryError> => {
    return Effect.gen(this, function* () {
      // 1. Query planning
      const plan = this.planQuery(query);

      // 2. Parallel execution of independent queries
      const independentSteps = plan.steps.filter(s => !s.dependsOn?.length);
      const dependentSteps = plan.steps.filter(s => s.dependsOn?.length);

      // Parallel execution
      const independentResults = yield* Effect.all(
        independentSteps.map(step => this.executeStep(step)),
        { concurrency: 'unbounded' }
      );

      // 3. Sequential execution of dependent steps
      let allResults = new Map<string, StepResult>();
      independentResults.forEach((result, i) => {
        allResults.set(independentSteps[i].source, result);
      });

      for (const step of dependentSteps) {
        const result = yield* this.executeStep(step, allResults);
        allResults.set(step.source, result);
      }

      // 4. Results merging
      return this.mergeResults(allResults, plan.mergeStrategy, query);
    });
  };

  private planQuery(query: FederatedQuery): QueryPlan {
    const steps: QueryStep[] = [];

    // Full-text → OpenSearch
    if (query.text) {
      steps.push({
        source: 'opensearch',
        query: this.buildOpenSearchQuery(query),
        weight: 0.4
      });
    }

    // Vector similarity → Qdrant
    if (query.embedding) {
      steps.push({
        source: 'qdrant',
        query: this.buildQdrantQuery(query),
        weight: 0.3
      });
    }

    // Geo filter → PostgreSQL/PostGIS
    if (query.geo) {
      steps.push({
        source: 'postgres',
        query: this.buildPostGISQuery(query),
        weight: 0.2
      });
    }

    // Time range aggregations → TimescaleDB
    if (query.timeRange) {
      steps.push({
        source: 'timescale',
        query: this.buildTimescaleQuery(query),
        weight: 0.1
      });
    }

    return {
      steps,
      mergeStrategy: query.embedding ? 'weighted' : 'intersection'
    };
  }

  private buildOpenSearchQuery(query: FederatedQuery): OpenSearchQuery {
    return {
      bool: {
        must: [
          { multi_match: { query: query.text, fields: ['title^3', 'abstract', 'display_name^2'] } }
        ],
        filter: query.entityTypes
          ? [{ terms: { _entity_type: query.entityTypes } }]
          : []
      }
    };
  }

  private buildQdrantQuery(query: FederatedQuery): QdrantQuery {
    return {
      vector: query.embedding!,
      limit: query.limit || 100,
      filter: query.filters
    };
  }

  private buildPostGISQuery(query: FederatedQuery): string {
    return `
      SELECT id, ST_Distance(location::geography, ST_MakePoint($1, $2)::geography) as distance
      FROM institutions
      WHERE ST_DWithin(location::geography, ST_MakePoint($1, $2)::geography, $3)
      ORDER BY distance
    `;
  }

  private buildTimescaleQuery(query: FederatedQuery): string {
    return `
      SELECT work_id, MAX(citation_count) as max_citations
      FROM citation_metrics
      WHERE time BETWEEN $1 AND $2
      GROUP BY work_id
      HAVING MAX(citation_count) > 10
      ORDER BY max_citations DESC
    `;
  }

  private async executeStep(
    step: QueryStep,
    previousResults?: Map<string, StepResult>
  ): Promise<StepResult> {
    switch (step.source) {
      case 'opensearch':
        return this.opensearch.search(step.query);
      case 'qdrant':
        return this.qdrant.search(step.query);
      case 'postgres':
        return this.postgres.query(step.query);
      case 'timescale':
        return this.timescale.query(step.query);
    }
  }

  private mergeResults(
    results: Map<string, StepResult>,
    strategy: 'union' | 'intersection' | 'weighted',
    query: FederatedQuery
  ): FederatedResult {
    // Collect all IDs from each source
    const idScores = new Map<string, { score: number; sources: string[] }>();

    for (const [source, result] of results) {
      const weight = this.getSourceWeight(source, query);

      for (const hit of result.hits) {
        const existing = idScores.get(hit.id) || { score: 0, sources: [] };
        existing.score += hit.score * weight;
        existing.sources.push(source);
        idScores.set(hit.id, existing);
      }
    }

    // Apply merge strategy
    let finalIds: string[];

    switch (strategy) {
      case 'union':
        finalIds = Array.from(idScores.keys());
        break;

      case 'intersection':
        const sourceCount = results.size;
        finalIds = Array.from(idScores.entries())
          .filter(([_, v]) => v.sources.length === sourceCount)
          .map(([id]) => id);
        break;

      case 'weighted':
        finalIds = Array.from(idScores.entries())
          .sort((a, b) => b[1].score - a[1].score)
          .slice(0, query.limit || 50)
          .map(([id]) => id);
        break;
    }

    // Fetch full entities from primary store
    return {
      ids: finalIds,
      scores: Object.fromEntries(
        finalIds.map(id => [id, idScores.get(id)!.score])
      ),
      sources: Object.fromEntries(
        finalIds.map(id => [id, idScores.get(id)!.sources])
      )
    };
  }

  private getSourceWeight(source: string, query: FederatedQuery): number {
    // Dynamic weights based on query type
    if (query.embedding && source === 'qdrant') return 0.5;
    if (query.text && source === 'opensearch') return 0.4;
    if (query.geo && source === 'postgres') return 0.3;
    return 0.2;
  }
}

Approach B: Trino (Data Virtualization)

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TRINO (PRESTO FORK)                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PRINCIPLE                                                                   │
│  ─────────                                                                   │
│  Distributed SQL engine that federates multiple data sources                │
│  via connectors (catalog).                                                  │
│                                                                              │
│  ARCHITECTURE                                                                │
│  ────────────                                                                │
│                                                                              │
│  ┌─────────┐                                                                 │
│  │  Client │                                                                 │
│  │  (SQL)  │                                                                 │
│  └────┬────┘                                                                 │
│       │                                                                      │
│       ▼                                                                      │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │              TRINO COORDINATOR                           │                │
│  │  • Parse SQL                                             │                │
│  │  • Plan query                                            │                │
│  │  • Distribute to workers                                 │                │
│  └────────────────────────┬────────────────────────────────┘                │
│                           │                                                  │
│       ┌───────────────────┼───────────────────┐                             │
│       ▼                   ▼                   ▼                             │
│  ┌─────────┐         ┌─────────┐         ┌─────────┐                       │
│  │ Worker  │         │ Worker  │         │ Worker  │                       │
│  └────┬────┘         └────┬────┘         └────┬────┘                       │
│       │                   │                   │                             │
│       ▼                   ▼                   ▼                             │
│  ┌─────────┐         ┌─────────┐         ┌─────────┐                       │
│  │PostgreSQL│        │OpenSearch│        │  S3     │                       │
│  │ Catalog │         │ Catalog │         │ Catalog │                       │
│  └─────────┘         └─────────┘         └─────────┘                       │
│                                                                              │
│  AVAILABLE CONNECTORS                                                        │
│  ────────────────────                                                        │
│  ✓ PostgreSQL                                                                │
│  ✓ MongoDB                                                                   │
│  ✓ Elasticsearch/OpenSearch                                                 │
│  ✓ Redis                                                                     │
│  ✓ S3 (Parquet, ORC, Avro)                                                  │
│  ✓ ClickHouse                                                                │
│  ✗ Qdrant (no native connector)                                             │
│  ✗ InfluxDB/TimescaleDB (limited)                                           │
│                                                                              │
│  KUBERNETES                                                                  │
│  ──────────                                                                  │
│  Helm chart: trino/trino                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Trino Configuration for Atlas

yaml

# catalogs/postgres.properties
connector.name=postgresql
connection-url=jdbc:postgresql://atlas-verify-db:5432/atlas
connection-user=${PG_USER}
connection-password=${PG_PASSWORD}

# catalogs/opensearch.properties
connector.name=opensearch
opensearch.host=atlas-search
opensearch.port=9200

# catalogs/mongodb.properties (if used for raw records)
connector.name=mongodb
mongodb.connection-url=mongodb://mongodb:27017
mongodb.schema-collection=_schema

sql

-- Federated Trino query
SELECT
  p.display_name,
  p.h_index,
  os.title,
  os.citation_count
FROM postgres.public.author_profiles p
JOIN opensearch.atlas.works os ON os.author_id = p.id
WHERE p.institution_country = 'FR'
  AND os.year >= 2020
ORDER BY os.citation_count DESC
LIMIT 50;

Approach C: GraphQL Federation

typescript

// ═══════════════════════════════════════════════════════════════════════════
// GRAPHQL SCHEMA FEDERATION
// ═══════════════════════════════════════════════════════════════════════════

// Unified GraphQL schema
const typeDefs = `#graphql
  type Query {
    # Multi-entity search
    search(
      query: String!
      types: [EntityType!]
      filters: SearchFilters
      limit: Int = 20
    ): SearchResult!

    # Expert search
    findExperts(
      topic: String!
      location: GeoInput
      minHIndex: Int
    ): [Author!]!

    # Similar articles
    similarWorks(
      workId: ID!
      limit: Int = 10
    ): [Work!]!

    # Temporal evolution
    domainTrends(
      domain: String!
      fromYear: Int!
      toYear: Int!
    ): [YearlyStats!]!
  }

  type SearchResult {
    works: [Work!]!
    authors: [Author!]!
    institutions: [Institution!]!
    totalCount: Int!
    facets: Facets!
  }

  type Work {
    id: ID!
    title: String!
    abstract: String
    year: Int
    doi: String
    citationCount: Int

    # Relations (resolved from GraphDB or SQL)
    authors: [Author!]!
    venue: Venue
    references: [Work!]!
    citations: [Work!]!

    # Enriched data
    similarity(to: ID!): Float @resolver(source: "qdrant")
    citationTrend: [CitationPoint!]! @resolver(source: "timescale")
  }

  type Author {
    id: ID!
    displayName: String!
    orcid: String
    hIndex: Int

    # Relations
    affiliations: [Affiliation!]!
    works(limit: Int = 10): [Work!]!
    coauthors(depth: Int = 1): [Author!]! @resolver(source: "graph")

    # Enriched data
    expertiseTopics: [String!]! @resolver(source: "ml")
    location: GeoPoint @resolver(source: "postgis")
  }

  type Institution {
    id: ID!
    name: String!
    ror: String
    country: String
    location: GeoPoint

    # Relations
    authors: [Author!]!
    works: [Work!]!

    # Enriched data
    publicationTrend: [YearlyStats!]! @resolver(source: "timescale")
  }

  input GeoInput {
    lat: Float!
    lon: Float!
    radiusKm: Float!
  }

  input SearchFilters {
    yearFrom: Int
    yearTo: Int
    sources: [String!]
    types: [String!]
    openAccess: Boolean
    minCitations: Int
  }

  enum EntityType {
    WORK
    AUTHOR
    INSTITUTION
  }
`;

// Resolvers with federation
const resolvers = {
  Query: {
    search: async (_, args, context) => {
      // Parallel execution on OpenSearch
      const [works, authors, institutions] = await Promise.all([
        context.dataSources.opensearch.searchWorks(args),
        context.dataSources.opensearch.searchAuthors(args),
        context.dataSources.opensearch.searchInstitutions(args)
      ]);

      return {
        works,
        authors,
        institutions,
        totalCount: works.length + authors.length + institutions.length,
        facets: await context.dataSources.opensearch.getFacets(args)
      };
    },

    findExperts: async (_, { topic, location, minHIndex }, context) => {
      // 1. Full-text search for the topic
      const candidates = await context.dataSources.opensearch.searchAuthors({
        query: topic,
        minHIndex
      });

      if (!location) return candidates;

      // 2. Geographic filter
      const authorIds = candidates.map(a => a.id);
      const geoFiltered = await context.dataSources.postgres.filterByLocation(
        authorIds,
        location
      );

      return candidates.filter(a => geoFiltered.includes(a.id));
    },

    similarWorks: async (_, { workId, limit }, context) => {
      // 1. Get work embedding
      const embedding = await context.dataSources.postgres.getWorkEmbedding(workId);

      // 2. ANN search in Qdrant
      const similarIds = await context.dataSources.qdrant.searchSimilar(
        embedding,
        limit,
        { excludeIds: [workId] }
      );

      // 3. Get details from PostgreSQL
      return context.dataSources.postgres.getWorksByIds(similarIds);
    }
  },

  Work: {
    authors: async (work, _, context) => {
      return context.dataSources.postgres.getWorkAuthors(work.id);
    },

    citationTrend: async (work, _, context) => {
      return context.dataSources.timescale.getCitationTrend(work.id);
    },

    similarity: async (work, { to }, context) => {
      const [emb1, emb2] = await Promise.all([
        context.dataSources.postgres.getWorkEmbedding(work.id),
        context.dataSources.postgres.getWorkEmbedding(to)
      ]);
      return cosineSimilarity(emb1, emb2);
    }
  },

  Author: {
    coauthors: async (author, { depth }, context) => {
      // Graph traversal
      return context.dataSources.graph.getCoauthorNetwork(author.id, depth);
    },

    location: async (author, _, context) => {
      const affiliation = await context.dataSources.postgres.getCurrentAffiliation(author.id);
      if (!affiliation?.institutionId) return null;

      return context.dataSources.postgres.getInstitutionLocation(affiliation.institutionId);
    }
  }
};

Recommended Multi-database Architecture

Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                      ATLAS MULTI-DATABASE ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                              ┌─────────────────┐                            │
│                              │   API Gateway   │                            │
│                              │   (GraphQL)     │                            │
│                              └────────┬────────┘                            │
│                                       │                                      │
│                              ┌────────┴────────┐                            │
│                              │  Query Planner  │                            │
│                              │  & Federation   │                            │
│                              └────────┬────────┘                            │
│                                       │                                      │
│          ┌────────────────────────────┼────────────────────────────┐        │
│          │                            │                            │        │
│          ▼                            ▼                            ▼        │
│  ┌───────────────┐          ┌───────────────┐          ┌───────────────┐   │
│  │  PostgreSQL   │          │  OpenSearch   │          │    Qdrant     │   │
│  │  + PostGIS    │          │               │          │   (vectors)   │   │
│  │  + TimescaleDB│          │               │          │               │   │
│  │  + pgvector   │          │               │          │               │   │
│  ├───────────────┤          ├───────────────┤          ├───────────────┤   │
│  │ • Entities    │          │ • Full-text   │          │ • Semantic    │   │
│  │ • Relations   │          │ • Facets      │          │   search      │   │
│  │ • Geo         │          │ • Autocomplete│          │ • Similarity  │   │
│  │ • Time-series │          │ • Aggregations│          │ • Clustering  │   │
│  │ • Audit trail │          │               │          │               │   │
│  └───────────────┘          └───────────────┘          └───────────────┘   │
│          │                            │                            │        │
│          └────────────────────────────┼────────────────────────────┘        │
│                                       │                                      │
│                              ┌────────┴────────┐                            │
│                              │     Redis       │                            │
│                              │  (cache, queue) │                            │
│                              └─────────────────┘                            │
│                                                                              │
│  DATA FLOW                                                                   │
│  ─────────                                                                   │
│  1. Ingestion → PostgreSQL (source of truth)                                │
│  2. CDC/Sync → OpenSearch (full-text index)                                 │
│  3. CDC/Sync → Qdrant (embeddings index)                                    │
│  4. Queries → Federation layer → Multiple backends                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Data Synchronization (CDC)

typescript

// ═══════════════════════════════════════════════════════════════════════════
// CHANGE DATA CAPTURE with Debezium
// ═══════════════════════════════════════════════════════════════════════════

// docker-compose.yml (or Kubernetes)
/*
version: '3'
services:
  debezium:
    image: debezium/connect:2.4
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: atlas-cdc
      CONFIG_STORAGE_TOPIC: connect-configs
      OFFSET_STORAGE_TOPIC: connect-offsets
      STATUS_STORAGE_TOPIC: connect-status

  # Or use Debezium Server (without Kafka)
  debezium-server:
    image: debezium/server:2.4
    environment:
      DEBEZIUM_SOURCE_CONNECTOR_CLASS: io.debezium.connector.postgresql.PostgresConnector
      DEBEZIUM_SOURCE_DATABASE_HOSTNAME: postgres
      DEBEZIUM_SOURCE_DATABASE_PORT: 5432
      DEBEZIUM_SOURCE_DATABASE_USER: debezium
      DEBEZIUM_SOURCE_DATABASE_PASSWORD: secret
      DEBEZIUM_SOURCE_DATABASE_DBNAME: atlas
      DEBEZIUM_SINK_TYPE: http
      DEBEZIUM_SINK_HTTP_URL: http://sync-service:3000/cdc
*/

// Synchronization service
import { Effect, Queue, Stream } from 'effect';

interface CDCEvent {
  op: 'c' | 'u' | 'd' | 'r';  // create, update, delete, read (snapshot)
  source: {
    table: string;
    schema: string;
  };
  before: Record<string, unknown> | null;
  after: Record<string, unknown> | null;
  ts_ms: number;
}

class SyncService {
  private readonly opensearch: OpenSearchClient;
  private readonly qdrant: QdrantClient;
  private readonly embedder: EmbeddingService;

  async handleCDCEvent(event: CDCEvent): Promise<void> {
    const { op, source, after, before } = event;

    switch (source.table) {
      case 'canonical_works':
        await this.syncWork(op, after, before);
        break;
      case 'author_profiles':
        await this.syncAuthor(op, after, before);
        break;
      case 'institutions':
        await this.syncInstitution(op, after, before);
        break;
    }
  }

  private async syncWork(
    op: string,
    after: Record<string, unknown> | null,
    before: Record<string, unknown> | null
  ): Promise<void> {
    if (op === 'd') {
      // Delete from OpenSearch and Qdrant
      await Promise.all([
        this.opensearch.delete('atlas', before!.id as string),
        this.qdrant.delete('work_embeddings', [before!.id as string])
      ]);
      return;
    }

    if (!after) return;

    // Generate embedding if text changed
    const needsEmbedding = op === 'c' ||
      before?.title !== after.title ||
      before?.abstract !== after.abstract;

    let embedding: number[] | undefined;
    if (needsEmbedding && (after.title || after.abstract)) {
      embedding = await this.embedder.embed(
        `${after.title} ${after.abstract}`
      );
    }

    // Sync to OpenSearch
    await this.opensearch.index('atlas', {
      id: after.id,
      body: {
        _entity_type: 'work',
        title: after.title,
        abstract: after.abstract,
        year: after.year,
        doi: after.doi,
        citation_count: after.citation_count,
        // ... other fields
      }
    });

    // Sync embedding to Qdrant
    if (embedding) {
      await this.qdrant.upsert('work_embeddings', {
        points: [{
          id: after.id as string,
          vector: { specter: embedding },
          payload: {
            year: after.year,
            source: after.source
          }
        }]
      });
    }
  }

  // Similar methods for authors and institutions...
}

Complete Kubernetes Manifests

yaml

# ═══════════════════════════════════════════════════════════════════════════
# NAMESPACE
# ═══════════════════════════════════════════════════════════════════════════
apiVersion: v1
kind: Namespace
metadata:
  name: atlas-data
  labels:
    app.kubernetes.io/name: atlas

---
# ═══════════════════════════════════════════════════════════════════════════
# POSTGRESQL (CloudNativePG with extensions)
# ═══════════════════════════════════════════════════════════════════════════
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: atlas-postgres
  namespace: atlas-data
spec:
  instances: 3

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "512MB"
      effective_cache_size: "2GB"
      maintenance_work_mem: "256MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      work_mem: "32MB"
      min_wal_size: "1GB"
      max_wal_size: "4GB"

    # Enabled extensions
    shared_preload_libraries:
      - timescaledb
      - pg_stat_statements

  bootstrap:
    initdb:
      postInitSQL:
        - CREATE EXTENSION IF NOT EXISTS postgis
        - CREATE EXTENSION IF NOT EXISTS postgis_topology
        - CREATE EXTENSION IF NOT EXISTS timescaledb
        - CREATE EXTENSION IF NOT EXISTS vector
        - CREATE EXTENSION IF NOT EXISTS pg_trgm
        - CREATE EXTENSION IF NOT EXISTS btree_gin

  storage:
    size: 100Gi
    storageClass: ceph-block

  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "4"

  backup:
    barmanObjectStore:
      destinationPath: "s3://atlas-backups/postgres"
      endpointURL: "http://minio.minio:9000"
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
    retentionPolicy: "30d"

  monitoring:
    enablePodMonitor: true

---
# ═══════════════════════════════════════════════════════════════════════════
# OPENSEARCH (via Operator)
# ═══════════════════════════════════════════════════════════════════════════
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: atlas-opensearch
  namespace: atlas-data
spec:
  general:
    serviceName: atlas-opensearch
    version: 2.11.0
    httpPort: 9200
    pluginsList:
      - analysis-icu
      - analysis-kuromoji
      - analysis-smartcn
    drainDataNodes: true

  dashboards:
    enable: true
    replicas: 1
    resources:
      requests:
        memory: "512Mi"
        cpu: "200m"

  security:
    config:
      securityConfigSecret:
        name: opensearch-security-config
      adminCredentialsSecret:
        name: opensearch-admin-credentials

  nodePools:
    - component: masters
      replicas: 3
      roles:
        - master
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "1"
      persistence:
        pvc:
          storageClass: ceph-block
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi

    - component: data
      replicas: 3
      roles:
        - data
        - ingest
      resources:
        requests:
          memory: "4Gi"
          cpu: "1"
        limits:
          memory: "8Gi"
          cpu: "4"
      persistence:
        pvc:
          storageClass: ceph-block
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      jvm: "-Xmx4g -Xms4g"

---
# ═══════════════════════════════════════════════════════════════════════════
# QDRANT
# ═══════════════════════════════════════════════════════════════════════════
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: atlas-data
spec:
  serviceName: qdrant
  replicas: 1  # Single node to start, scale later
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.7.4
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
          env:
            - name: QDRANT__SERVICE__GRPC_PORT
              value: "6334"
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2"
          volumeMounts:
            - name: storage
              mountPath: /qdrant/storage
          livenessProbe:
            httpGet:
              path: /
              port: 6333
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 6333
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ceph-block
        resources:
          requests:
            storage: 50Gi

---
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: atlas-data
spec:
  selector:
    app: qdrant
  ports:
    - port: 6333
      targetPort: 6333
      name: http
    - port: 6334
      targetPort: 6334
      name: grpc

---
# ═══════════════════════════════════════════════════════════════════════════
# REDIS
# ═══════════════════════════════════════════════════════════════════════════
# Via Helm: helm install redis bitnami/redis -n atlas-data -f redis-values.yaml

# redis-values.yaml
# architecture: replication
# auth:
#   enabled: true
#   existingSecret: redis-password
# master:
#   persistence:
#     size: 8Gi
# replica:
#   replicaCount: 2
#   persistence:
#     size: 8Gi
# metrics:
#   enabled: true
#   serviceMonitor:
#     enabled: true

---
# ═══════════════════════════════════════════════════════════════════════════
# SYNC SERVICE (CDC Consumer)
# ═══════════════════════════════════════════════════════════════════════════
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sync-service
  namespace: atlas-data
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sync-service
  template:
    metadata:
      labels:
        app: sync-service
    spec:
      containers:
        - name: sync
          image: ghcr.io/univ-lehavre/atlas-sync:latest
          env:
            - name: POSTGRES_URL
              valueFrom:
                secretKeyRef:
                  name: atlas-secrets
                  key: postgres-url
            - name: OPENSEARCH_URL
              value: "http://atlas-opensearch:9200"
            - name: QDRANT_URL
              value: "http://qdrant:6333"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: atlas-secrets
                  key: redis-url
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"

User Interface Patterns

Faceted Multi-entity Search

typescript

// ═══════════════════════════════════════════════════════════════════════════
// SVELTE COMPONENT - MULTI-ENTITY SEARCH
// ═══════════════════════════════════════════════════════════════════════════

// src/lib/components/Search.svelte
<script lang="ts">
  import { createQuery } from '@tanstack/svelte-query';

  interface SearchState {
    query: string;
    entityTypes: ('work' | 'author' | 'institution')[];
    filters: {
      yearFrom?: number;
      yearTo?: number;
      sources?: string[];
      openAccess?: boolean;
      location?: { lat: number; lon: number; radiusKm: number };
    };
    sort: string;
    page: number;
  }

  let state: SearchState = $state({
    query: '',
    entityTypes: ['work', 'author', 'institution'],
    filters: {},
    sort: 'relevance',
    page: 1
  });

  const searchQuery = createQuery({
    queryKey: ['search', state],
    queryFn: () => fetch('/api/search', {
      method: 'POST',
      body: JSON.stringify(state)
    }).then(r => r.json()),
    enabled: state.query.length > 2
  });

  // Autocomplete
  let suggestions = $state<Suggestion[]>([]);

  const fetchSuggestions = async (prefix: string) => {
    if (prefix.length < 2) {
      suggestions = [];
      return;
    }
    const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(prefix)}`);
    suggestions = await res.json();
  };

  // Debounce
  let debounceTimer: number;
  const onInput = (e: Event) => {
    const value = (e.target as HTMLInputElement).value;
    state.query = value;

    clearTimeout(debounceTimer);
    debounceTimer = setTimeout(() => fetchSuggestions(value), 150);
  };
</script>

<div class="search-container">
  <!-- Search bar -->
  <div class="search-bar">
    <input
      type="text"
      value={state.query}
      oninput={onInput}
      placeholder="Search articles, authors, institutions..."
    />

    <!-- Autocomplete suggestions -->
    {#if suggestions.length > 0}
      <div class="suggestions">
        {#each suggestions as suggestion}
          <button onclick={() => { state.query = suggestion.text; suggestions = []; }}>
            <span class="type-badge">{suggestion.type}</span>
            <span class="text">{suggestion.text}</span>
          </button>
        {/each}
      </div>
    {/if}
  </div>

  <!-- Entity filters -->
  <div class="entity-filters">
    {#each ['work', 'author', 'institution'] as type}
      <label>
        <input
          type="checkbox"
          checked={state.entityTypes.includes(type)}
          onchange={(e) => {
            if (e.target.checked) {
              state.entityTypes = [...state.entityTypes, type];
            } else {
              state.entityTypes = state.entityTypes.filter(t => t !== type);
            }
          }}
        />
        {type === 'work' ? 'Publications' : type === 'author' ? 'Authors' : 'Institutions'}
      </label>
    {/each}
  </div>

  <!-- Results with facets -->
  <div class="results-layout">
    <!-- Facets (left) -->
    <aside class="facets">
      {#if $searchQuery.data?.facets}
        <FacetGroup
          title="Year"
          facet={$searchQuery.data.facets.by_year}
          onSelect={(year) => { state.filters.yearFrom = year; state.filters.yearTo = year; }}
        />
        <FacetGroup
          title="Source"
          facet={$searchQuery.data.facets.by_source}
          onSelect={(source) => { state.filters.sources = [source]; }}
        />
        <FacetGroup
          title="Type"
          facet={$searchQuery.data.facets.by_type}
          onSelect={(type) => { /* ... */ }}
        />
        <FacetGroup
          title="Citations"
          facet={$searchQuery.data.facets.citation_ranges}
          type="range"
        />
      {/if}
    </aside>

    <!-- Results (center) -->
    <main class="results">
      {#if $searchQuery.isLoading}
        <LoadingSpinner />
      {:else if $searchQuery.data}
        <!-- Tabs by entity type -->
        <Tabs>
          <Tab label="Publications ({$searchQuery.data.works.length})">
            {#each $searchQuery.data.works as work}
              <WorkCard {work} highlights={work._highlights} />
            {/each}
          </Tab>
          <Tab label="Authors ({$searchQuery.data.authors.length})">
            {#each $searchQuery.data.authors as author}
              <AuthorCard {author} />
            {/each}
          </Tab>
          <Tab label="Institutions ({$searchQuery.data.institutions.length})">
            {#each $searchQuery.data.institutions as institution}
              <InstitutionCard {institution} />
            {/each}
          </Tab>
        </Tabs>
      {/if}
    </main>
  </div>
</div>

Expert Map

typescript

// src/lib/components/ExpertMap.svelte
<script lang="ts">
  import { onMount } from 'svelte';
  import maplibregl from 'maplibre-gl';

  interface Props {
    topic: string;
    minHIndex?: number;
  }

  let { topic, minHIndex = 10 }: Props = $props();

  let map: maplibregl.Map;
  let mapContainer: HTMLDivElement;

  const expertsQuery = createQuery({
    queryKey: ['experts', topic, minHIndex],
    queryFn: () => fetch(`/api/experts?topic=${topic}&minHIndex=${minHIndex}`).then(r => r.json())
  });

  onMount(() => {
    map = new maplibregl.Map({
      container: mapContainer,
      style: 'https://basemaps.cartocdn.com/gl/positron-gl-style/style.json',
      center: [2.3522, 48.8566],
      zoom: 4
    });

    return () => map.remove();
  });

  // Update markers when data changes
  $effect(() => {
    if (!map || !$expertsQuery.data) return;

    // Clear existing markers
    document.querySelectorAll('.expert-marker').forEach(el => el.remove());

    // Add new markers
    for (const expert of $expertsQuery.data) {
      if (!expert.location) continue;

      const el = document.createElement('div');
      el.className = 'expert-marker';
      el.style.width = `${Math.min(20 + expert.hIndex, 50)}px`;
      el.style.height = `${Math.min(20 + expert.hIndex, 50)}px`;

      new maplibregl.Marker(el)
        .setLngLat([expert.location.lon, expert.location.lat])
        .setPopup(new maplibregl.Popup().setHTML(`
          <div class="expert-popup">
            <h3>${expert.displayName}</h3>
            <p>${expert.institution}</p>
            <p>h-index: ${expert.hIndex}</p>
            <p>${expert.worksOnTopic} publications on "${topic}"</p>
          </div>
        `))
        .addTo(map);
    }
  });
</script>

<div class="expert-map-container">
  <div class="controls">
    <input
      type="text"
      bind:value={topic}
      placeholder="Research topic..."
    />
    <input
      type="number"
      bind:value={minHIndex}
      min="0"
      max="100"
    />
  </div>

  <div bind:this={mapContainer} class="map"></div>

  <div class="legend">
    <p>Marker size proportional to h-index</p>
  </div>
</div>

<style>
  .map {
    width: 100%;
    height: 600px;
  }

  :global(.expert-marker) {
    background: rgba(59, 130, 246, 0.7);
    border-radius: 50%;
    cursor: pointer;
    transition: transform 0.2s;
  }

  :global(.expert-marker:hover) {
    transform: scale(1.2);
  }
</style>

Co-author Network Visualization

typescript

// src/lib/components/CoauthorNetwork.svelte
<script lang="ts">
  import { onMount } from 'svelte';
  import ForceGraph3D from '3d-force-graph';

  interface Props {
    authorId: string;
    depth?: number;
  }

  let { authorId, depth = 2 }: Props = $props();

  let container: HTMLDivElement;
  let graph: any;

  const networkQuery = createQuery({
    queryKey: ['coauthor-network', authorId, depth],
    queryFn: () => fetch(`/api/authors/${authorId}/network?depth=${depth}`).then(r => r.json())
  });

  onMount(() => {
    graph = ForceGraph3D()(container)
      .nodeLabel('name')
      .nodeColor(node => node.id === authorId ? '#ef4444' : '#3b82f6')
      .nodeVal(node => Math.sqrt(node.hIndex || 1))
      .linkWidth(link => Math.sqrt(link.coauthorships))
      .linkColor(() => 'rgba(100, 100, 100, 0.5)')
      .onNodeClick(node => {
        // Navigate to author profile
        window.location.href = `/authors/${node.id}`;
      });

    return () => graph._destructor();
  });

  $effect(() => {
    if (!graph || !$networkQuery.data) return;

    graph.graphData({
      nodes: $networkQuery.data.nodes.map(n => ({
        id: n.id,
        name: n.displayName,
        hIndex: n.hIndex,
        institution: n.institution
      })),
      links: $networkQuery.data.edges.map(e => ({
        source: e.from,
        target: e.to,
        coauthorships: e.coauthorships
      }))
    });
  });
</script>

<div class="network-container">
  <div class="controls">
    <label>
      Depth:
      <input type="range" min="1" max="3" bind:value={depth} />
      {depth}
    </label>
  </div>

  <div bind:this={container} class="graph"></div>

  <div class="stats">
    {#if $networkQuery.data}
      <p>{$networkQuery.data.nodes.length} authors</p>
      <p>{$networkQuery.data.edges.length} collaborations</p>
    {/if}
  </div>
</div>

<style>
  .graph {
    width: 100%;
    height: 700px;
    background: #1a1a2e;
  }
</style>

Temporal Trends

typescript

// src/lib/components/DomainTrends.svelte
<script lang="ts">
  import { Chart } from 'chart.js/auto';

  interface Props {
    domain: string;
    fromYear?: number;
    toYear?: number;
  }

  let { domain, fromYear = 2015, toYear = 2024 }: Props = $props();

  let canvas: HTMLCanvasElement;
  let chart: Chart;

  const trendsQuery = createQuery({
    queryKey: ['trends', domain, fromYear, toYear],
    queryFn: () => fetch(
      `/api/trends?domain=${domain}&from=${fromYear}&to=${toYear}`
    ).then(r => r.json())
  });

  $effect(() => {
    if (!canvas || !$trendsQuery.data) return;

    if (chart) chart.destroy();

    chart = new Chart(canvas, {
      type: 'line',
      data: {
        labels: $trendsQuery.data.map(d => d.year),
        datasets: [
          {
            label: 'Publications',
            data: $trendsQuery.data.map(d => d.publicationCount),
            borderColor: '#3b82f6',
            yAxisID: 'y'
          },
          {
            label: 'Average citations',
            data: $trendsQuery.data.map(d => d.avgCitations),
            borderColor: '#10b981',
            yAxisID: 'y1'
          }
        ]
      },
      options: {
        responsive: true,
        scales: {
          y: {
            type: 'linear',
            position: 'left',
            title: { display: true, text: 'Publications' }
          },
          y1: {
            type: 'linear',
            position: 'right',
            title: { display: true, text: 'Average citations' },
            grid: { drawOnChartArea: false }
          }
        }
      }
    });

    return () => chart?.destroy();
  });
</script>

<div class="trends-container">
  <div class="controls">
    <input type="text" bind:value={domain} placeholder="Domain..." />
    <input type="number" bind:value={fromYear} min="1990" max="2024" />
    <input type="number" bind:value={toYear} min="1990" max="2024" />
  </div>

  <canvas bind:this={canvas}></canvas>

  {#if $trendsQuery.data}
    <div class="highlights">
      <div class="stat">
        <span class="value">{$trendsQuery.data.reduce((a, b) => a + b.publicationCount, 0)}</span>
        <span class="label">Total publications</span>
      </div>
      <div class="stat">
        <span class="value">
          {(($trendsQuery.data[$trendsQuery.data.length - 1]?.publicationCount -
            $trendsQuery.data[0]?.publicationCount) /
            $trendsQuery.data[0]?.publicationCount * 100).toFixed(1)}%
        </span>
        <span class="label">Growth</span>
      </div>
    </div>
  {/if}
</div>

Summary and Recommendations

Final Recommended Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        ATLAS ARCHITECTURE - SUMMARY                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TIER 1 - ESSENTIAL (v1)                                                    │
│  ───────────────────────                                                     │
│  • PostgreSQL 16 + PostGIS + TimescaleDB + pgvector                         │
│    → Primary database, source of truth                                      │
│  • OpenSearch                                                                │
│    → Full-text, facets, autocomplete                                        │
│  • Redis                                                                     │
│    → Cache, sessions, jobs                                                   │
│                                                                              │
│  TIER 2 - ENRICHMENT (v2)                                                   │
│  ────────────────────────                                                    │
│  • Qdrant                                                                    │
│    → Semantic search, similarity (if pgvector insufficient)                 │
│  • Meilisearch                                                               │
│    → Ultra-fast autocomplete (if OpenSearch too slow)                       │
│                                                                              │
│  TIER 3 - ADVANCED (v3)                                                     │
│  ─────────────────────                                                       │
│  • ArangoDB or Neo4j                                                         │
│    → Co-author graph, citations (if complex traversals needed)              │
│  • ClickHouse                                                                │
│    → Heavy analytics, reporting (if TimescaleDB insufficient)               │
│                                                                              │
│  KEY DECISIONS                                                               │
│  ─────────────                                                               │
│                                                                              │
│  PostgreSQL as pivot:                                                        │
│  • Single database to operate initially                                     │
│  • Extensions cover 80% of needs                                            │
│  • pgvector for vectors (sufficient < 10M)                                  │
│  • PostGIS for geo                                                           │
│  • TimescaleDB for time-series                                              │
│                                                                              │
│  OpenSearch for search:                                                      │
│  • Advanced full-text, multilingual                                         │
│  • Facets and aggregations                                                   │
│  • Built-in vector search (k-NN)                                            │
│  • Open source (Apache 2.0)                                                  │
│                                                                              │
│  Federation at application level:                                           │
│  • GraphQL as unified interface                                             │
│  • Intelligent query planner                                                │
│  • CDC for synchronization                                                   │
│  • No Trino (too complex for v1)                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Resource Estimation

Component	v1 (basic)	v2 (production)	v3 (scale)
PostgreSQL	3x2Gi, 100Gi	3x4Gi, 200Gi	5x8Gi, 500Gi
OpenSearch	2x4Gi, 50Gi	3x8Gi, 200Gi	6x16Gi, 1Ti
Qdrant	-	1x2Gi, 50Gi	3x4Gi, 200Gi
Redis	1x512Mi	3x1Gi	3x2Gi
API/Workers	4x512Mi	8x1Gi	16x2Gi
Total	~20Gi RAM	~50Gi RAM	~150Gi RAM

Implementation Roadmap

Phase	Duration	Components	Features
v1.0	2 months	PG+OS+Redis	Basic search, verification
v1.5	1 month	+ pgvector	Semantic similarity
v2.0	2 months	+ Qdrant, Meili	Advanced search, autocomplete
v2.5	1 month	+ TimescaleDB	Trends, metrics
v3.0	3 months	+ ArangoDB	Co-author graph, traversals

Advanced Databases and Multi-source Search ​

Researcher Use Cases ​

Identified Needs ​

Translated Query Types ​

Multi-model Databases ​

ArangoDB ​

Overview ​

Model for Atlas ​

Advanced AQL Queries ​

ArangoDB Evaluation ​

SurrealDB ​

Overview ​

Specialized Databases ​

Vector Databases (Embeddings / Semantic Search) ​

Comparison ​

Qdrant for Atlas ​

pgvector (integrated alternative) ​

Time-series Databases ​

InfluxDB vs TimescaleDB vs QuestDB ​

TimescaleDB Schema for Metrics ​

Geospatial Databases ​

PostGIS vs H3 vs DuckDB Spatial ​

PostGIS Schema for Atlas ​

Search Engines ​

Elasticsearch vs OpenSearch vs Meilisearch ​

OpenSearch Architecture for Atlas ​

OpenSearch Multi-entity Mapping ​

Advanced OpenSearch Queries ​

Multi-database Federation Architecture ​

Problem Statement ​

Approach A: Query Federation (Application Layer) ​

Approach B: Trino (Data Virtualization) ​

Trino Configuration for Atlas ​

Approach C: GraphQL Federation ​

Recommended Multi-database Architecture ​

Overview ​

Data Synchronization (CDC) ​

Complete Kubernetes Manifests ​

User Interface Patterns ​

Faceted Multi-entity Search ​

Expert Map ​

Co-author Network Visualization ​

Temporal Trends ​

Summary and Recommendations ​

Final Recommended Architecture ​

Resource Estimation ​

Implementation Roadmap ​

Advanced Databases and Multi-source Search

Researcher Use Cases

Identified Needs

Translated Query Types

Multi-model Databases

ArangoDB

Overview

Model for Atlas

Advanced AQL Queries

ArangoDB Evaluation

SurrealDB

Overview

Specialized Databases

Vector Databases (Embeddings / Semantic Search)

Comparison

Qdrant for Atlas

pgvector (integrated alternative)

Time-series Databases

InfluxDB vs TimescaleDB vs QuestDB

TimescaleDB Schema for Metrics

Geospatial Databases

PostGIS vs H3 vs DuckDB Spatial

PostGIS Schema for Atlas

Search Engines

Elasticsearch vs OpenSearch vs Meilisearch

OpenSearch Architecture for Atlas

OpenSearch Multi-entity Mapping

Advanced OpenSearch Queries

Multi-database Federation Architecture

Problem Statement

Approach A: Query Federation (Application Layer)

Approach B: Trino (Data Virtualization)

Trino Configuration for Atlas

Approach C: GraphQL Federation

Recommended Multi-database Architecture

Overview

Data Synchronization (CDC)

Complete Kubernetes Manifests

User Interface Patterns

Faceted Multi-entity Search

Expert Map

Co-author Network Visualization

Temporal Trends

Summary and Recommendations

Final Recommended Architecture

Resource Estimation

Implementation Roadmap