ZERO TO MONOPOLY

Chapter 7

SDDI Architecture — Software-Defined Data Integration

The five cooperating subsystems behind Foundry's data integration layer. Together they solve the integration problem that MidWest's IT department estimated at 18 months — and that the FDE completed in 48 hours.

Overview: Google Maps for Enterprise Data

SDDI is Foundry's answer to the enterprise data integration problem. Traditional integration projects fail because they treat each data source as a unique puzzle requiring custom engineering. SDDI treats integration as a pattern recognition problem — and gets better at it with every deployment.

The five subsystems cooperate in sequence: Bundle Finder identifies what tables belong together, Key Finder figures out how to join them, Path Chooser selects the optimal integration chain, Column Namer translates cryptic field names into business language, and Feedback Logger ensures every decision improves the system for future deployments.

Bundle Finder: Workflow Pattern Recognition

What it does: Recognizes complete workflow patterns from a single table and suggests the full set of related tables needed.

Implementation:

  • Input: Single table metadata (name, module, column descriptions, sample data)
  • Process: Semantic analysis of table name and module context, pattern matching against workflow corpus (learned from thousands of deployments), confidence scoring based on historical co-occurrence patterns
  • Output: Suggested table bundle with confidence scores and rationale
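
The co-occurrence scoring step can be illustrated with a minimal sketch in Python. The corpus counts, threshold, and function name below are invented for illustration — the real workflow corpus is learned from thousands of deployments:

```python
from collections import Counter

# Toy co-occurrence corpus: how often each table appeared alongside the
# seed table QALS across past deployments (counts are invented).
CO_OCCURRENCE = Counter({
    ("QALS", "QAVE"): 940, ("QALS", "QMUR"): 910,
    ("QALS", "MARA"): 880, ("QALS", "LFA1"): 850,
    ("QALS", "VBAK"): 30,   # sales orders rarely co-occur with QM workflows
})
SEED_DEPLOYMENTS = 1000     # deployments in which QALS appeared

def suggest_bundle(seed, threshold=0.5):
    """Return candidate tables whose co-occurrence rate exceeds threshold."""
    scores = {
        other: count / SEED_DEPLOYMENTS
        for (table, other), count in CO_OCCURRENCE.items() if table == seed
    }
    return {t: s for t, s in scores.items() if s >= threshold}

bundle = suggest_bundle("QALS")
# QAVE, QMUR, MARA, LFA1 pass the threshold; VBAK (0.03) is filtered out
```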

MidWest Example:

Input: QALS table (SAP QM module)
Analysis: Table name derives from "Qualitätssicherung" (Quality Assurance), module is QM
Pattern match: "QualityManagement" workflow (94% confidence)
Suggested bundle: [QALS, QAVE, QMUR, MARA, LFA1]
Rationale: "This looks like quality traceability. You'll need usage decisions 
(QAVE), defect causes (QMUR), material master (MARA), and supplier data (LFA1)."

Key Finder: Join Detection and Validation

What it does: Figures out which columns (or combinations of columns) should be used to join tables together — both within a single system and across different systems.

Three-phase implementation:

  • Phase 1 — Dictionary Lookup: Maintains a learned dictionary of known join patterns (e.g., SAP standard joins). Checks column name similarity and data type compatibility. Proposes candidate joins with initial confidence scores.
  • Phase 2 — Sample Validation: Executes test joins on sample data (typically 10K rows). Measures match rate, validates cardinality expectations. Detects anomalies (e.g., many-to-many when expecting one-to-many).
  • Phase 3 — Confidence Scoring: Combines dictionary confidence + validation results. Adjusts based on historical success rates for similar patterns. Flags low-confidence joins for human review.
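
Phase 2 can be sketched in a few lines (the sample rows and function below are invented for illustration; a real validator runs against ~10K-row samples):

```python
# Sketch of Phase 2 sample validation: execute the candidate join on
# sample rows, measure the match rate, and infer cardinality.
def validate_join(left_rows, right_rows, key):
    right_keys = {r[key] for r in right_rows}
    matched = [row for row in left_rows if row[key] in right_keys]
    match_rate = len(matched) / len(left_rows)
    # 1:N is expected when each left key is unique but may repeat on the right
    left_unique = len({row[key] for row in left_rows}) == len(left_rows)
    cardinality = "1:N" if left_unique else "N:M"
    return match_rate, cardinality

# Invented sample data shaped like the QALS/QAVE join
qals = [{"QMNUM": "100"}, {"QMNUM": "101"}, {"QMNUM": "102"}, {"QMNUM": "103"}]
qave = [{"QMNUM": "100"}, {"QMNUM": "100"}, {"QMNUM": "101"}, {"QMNUM": "102"}]

rate, card = validate_join(qals, qave, "QMNUM")
# rate = 0.75 (notification 103 has no usage decision), card = "1:N"
```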

Intra-System Joins (SAP QM tables):

QALS ↔ QAVE join proposal:
- Dictionary match: QMNUM (quality notification number) - 95% confidence
- Sample validation: 94% match rate, 1:N cardinality
- Final confidence: 94%

QAVE ↔ QMUR join proposal:
- Dictionary match: (QMNUM, FENUM) composite key - 92% confidence
- Sample validation: 89% match rate, 1:N cardinality
- Final confidence: 89%

Cross-System Joins: The BC-2847 Problem

The hardest integration problem MidWest faced was a single brake component referenced three different ways across three different systems:

SAP:      MATNR = "BC-2847"
MES:      PART_ID = "BRK_COMP_2847"
Supplier: PART_NUM = "ACM-BC-2847-Rev3"

Before SDDI, MidWest's engineers maintained six Excel lookup tables manually to translate between these identifiers. Key Finder's fuzzy matching algorithm resolves this automatically:

1. Substring detection: Identify common numerical patterns ("2847")
2. Prefix/suffix analysis: Detect systematic variations ("BC-", "BRK_COMP_", "ACM-", "-Rev3")
3. String similarity: Levenshtein distance after normalization
4. Validation: Cross-reference with known-good records (batch traceability)
5. Confidence scoring: 89% (flagged for human validation due to fuzzy nature)
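
A stripped-down sketch of steps 1–3 (normalization, prefix/suffix stripping, and the shared-numeric-core check). A full implementation would add Levenshtein scoring and validation against known-good records; the identifier formats come from the MidWest example, everything else is illustrative:

```python
import re

def normalize(part_id):
    """Uppercase, strip revision suffixes, split into alphanumeric tokens."""
    s = re.sub(r"-?REV\d+$", "", part_id.upper())
    return re.findall(r"[A-Z]+|\d+", s)

def same_part(a, b):
    """Step 1-2 heuristic: identifiers must share a numeric core ("2847")."""
    digits_a = {t for t in normalize(a) if t.isdigit()}
    digits_b = {t for t in normalize(b) if t.isdigit()}
    return bool(digits_a & digits_b)

same_part("BC-2847", "BRK_COMP_2847")     # True
same_part("BC-2847", "ACM-BC-2847-Rev3")  # True  (Rev3 suffix stripped)
same_part("BC-2847", "ACM-TX-9931")       # False (no shared numeric core)
```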

Result: Automated the six Excel lookup tables engineers were maintaining manually.

Path Chooser: Minimal Integration Chain

What it does: Takes all possible joins from Key Finder and selects the simplest, correct path that connects the entire business process in the right direction.

Algorithm: Modified minimum spanning tree with directional constraints. Minimizes total join chain length while covering all required entities. Ensures process flow directionality (no cycles, matches business logic).
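
The chapter's algorithm is a modified minimum spanning tree; the confidence-maximizing idea can be illustrated with a simpler shortest-path formulation, where maximizing the product of edge confidences is equivalent to minimizing the sum of −log(confidence) weights. The graph and numbers below are invented; directionality is enforced by using a directed edge list:

```python
import heapq, math

EDGES = {  # source -> [(target, confidence), ...]; process flows downstream only
    "VIN": [("Production", 0.99)],
    "Production": [("Parts", 0.97), ("Batches", 0.80)],
    "Parts": [("Batches", 0.92)],
    "Batches": [("Supplier", 0.82)],
}

def best_path(start, goal):
    """Dijkstra over -log(confidence): the cheapest path has the highest product."""
    heap = [(0.0, start, [start])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return path, math.exp(-cost)  # recover the confidence product
        if node in seen:
            continue
        seen.add(node)
        for nxt, conf in EDGES.get(node, []):
            heapq.heappush(heap, (cost - math.log(conf), nxt, path + [nxt]))
    return None, 0.0

path, conf = best_path("VIN", "Supplier")
# Prefers the longer VIN -> Production -> Parts -> Batches -> Supplier chain
# (0.99 * 0.97 * 0.92 * 0.82) over the shorter but weaker Production -> Batches hop
```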

MidWest VIN-to-Supplier Traceability Path:

VIN (Warranty DB)
  → Production_Record (MES) [99% confidence]
  → Part_Instances (MES) [97% confidence]
  → Material_Master (SAP MARA via fuzzy MATNR) [89% confidence]
  → Quality_Notifications (SAP QALS) [94% confidence]
  → Usage_Decisions (SAP QAVE) [94% confidence]
  → Defect_Causes (SAP QMUR) [89% confidence]
  → Quality_Batches (SAP) [92% confidence]
  → Supplier_Process_Data (Portal) [82% confidence]

Overall path confidence: ~51% (product of edge confidences)
Validation: Tested on 100 known failure cases — 97% correct

This is the path that compressed MidWest's 4-hour brake recall investigation to a 30-second query. The quality engineer types "show me every vehicle affected by supplier LOT-2024-0847" and traverses this entire chain in under a second.

Column Namer: Semantic Tagging

What it does: Maps cryptic, system-specific column names to clear, standardized semantic labels that enable cross-system understanding.

Without semantic tagging, each dataset keeps SAP-style naming. Downstream users, ML models, and cross-system analytics cannot recognize that EBELN = purchase_order_number = po_id. With it, you get consistent names, can merge data across systems, and can auto-build ontology objects because the system knows which fields represent the same concept everywhere.

Feature extraction per column:

  • Name tokens: Split MATNR → [MAT, NR]; token overlap with known dictionary (MAT≈Material, NR≈Number)
  • Character pattern: Uppercase letters + numbers = likely identifier
  • Regex matching: ^\d{10}$ → 10-digit numeric ID pattern
  • Type + length: CHAR(10) typical of SAP document numbers
  • Table context: Appears in MARA → likely material master domain
  • Dictionary text: "Material Number" from SAP data dictionary (DDIC)
  • Value hints: Sample values starting with specific prefixes reinforce semantic meaning
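
Two of these features — dictionary-driven token splitting and the identifier regex — can be sketched as follows (the token dictionary and sample values are invented; SAP's actual data dictionary is far larger):

```python
import re

TOKEN_DICT = {"MAT": "Material", "NR": "Number", "WERK": "Plant", "CHARG": "Batch"}

def split_tokens(name):
    """Greedy longest-match segmentation against the token dictionary."""
    tokens, i = [], 0
    while i < len(name):
        for j in range(len(name), i, -1):
            if name[i:j] in TOKEN_DICT:
                tokens.append(name[i:j])
                i = j
                break
        else:
            tokens.append(name[i])  # unknown character falls through as-is
            i += 1
    return tokens

def extract_features(name, sample_values, declared_type):
    return {
        "tokens": split_tokens(name),
        "looks_like_id": all(re.fullmatch(r"\d{10}", v) for v in sample_values),
        "type": declared_type,
    }

feats = extract_features("MATNR", ["0000012345", "0000012346"], "CHAR(10)")
# feats["tokens"] == ["MAT", "NR"]; feats["looks_like_id"] is True
```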

Model: Small transformer (DistilBERT-based) or gradient-boosted tree combining text embeddings with numeric features. Trained on thousands of (column name → semantic label) pairs from existing integrations and SAP data dictionaries. Output: probability distribution over ~300 semantic labels with confidence scores.

Fallback Strategy: If ML confidence is below threshold, the system falls back to:

  1. Exact lookup in curated SAP field dictionary
  2. Rule-based suffix/prefix mapping (*NR → Number, *POS → Item)
  3. Cross-column consistency check (if table has EBELN and EBELP, tag as "PO Number / PO Item")
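
The first two fallback steps reduce to a short sketch (the dictionary entries and suffix rules below are invented examples, not the curated SAP dictionary; the cross-column consistency check is omitted):

```python
SAP_FIELD_DICT = {"EBELN": "PurchaseOrderNumber", "EBELP": "PurchaseOrderItem"}
SUFFIX_RULES = [("NR", "Number"), ("POS", "Item")]

def fallback_label(column):
    # 1. Exact lookup in the curated field dictionary
    if column in SAP_FIELD_DICT:
        return SAP_FIELD_DICT[column], "dictionary"
    # 2. Rule-based suffix mapping
    for suffix, label in SUFFIX_RULES:
        if column.endswith(suffix):
            return label, "suffix-rule"
    # 3. No match: leave untagged for human review
    return None, "unresolved"

fallback_label("EBELN")   # ("PurchaseOrderNumber", "dictionary")
fallback_label("LIFNR")   # ("Number", "suffix-rule")
fallback_label("ZZFOO")   # (None, "unresolved")
```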

MidWest translations:

MATNR   → "MaterialNumber"  (99.8% — standard SAP field)
PRUEFLOS → "InspectionLot"  (99.5% — QM domain)
  Why: Token match PRUEF→Prüfung (inspection), seen in 92% of QM deployments
WERK    → "Plant"           (99.9% — production location)
CHARG   → "BatchNumber"     (99.7% — lot traceability)
ZMANU_BATCH → "ManufacturingBatchId" (94% — custom field)
  Why: Context from supplier portal, pattern matches batch IDs in 78% of automotive

Validation & Scoring: After tagging, the system validates against known object schemas (e.g., Batch should have BatchNumber), detects duplicates (two columns mapped to same semantic = warning), and produces confidence scores 0–1 for every mapping.
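
The duplicate-mapping warning amounts to a small grouping check (the mappings below are invented; ZMATNO stands in for a hypothetical custom column):

```python
from collections import defaultdict

def find_duplicate_tags(mappings):
    """Flag semantic labels that two or more columns in a table map to."""
    by_label = defaultdict(list)
    for column, label in mappings.items():
        by_label[label].append(column)
    return {label: cols for label, cols in by_label.items() if len(cols) > 1}

mappings = {"MATNR": "MaterialNumber", "ZMATNO": "MaterialNumber", "WERK": "Plant"}
find_duplicate_tags(mappings)   # {"MaterialNumber": ["MATNR", "ZMATNO"]}
```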

Business impact:

Before: Filter PRUEFLOS where WERK = '1000'
After:  Filter InspectionLot where Plant = 'Denver'
Result: Consistent semantic identity across all systems — MATNR in MARA and MATNR
        in QALS refer to the same concept

Feedback Logger: Continuous Learning Engine

What it does: Captures every decision the FDE makes to continuously improve all other subsystems.

What gets logged:

  • Accepted joins: Pattern validated — add to global dictionary
  • Rejected joins: Pattern doesn't work — reduce confidence for similar cases
  • Modified joins: User correction teaches better approach
  • New patterns: Novel solutions become templates for future deployments
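
One way such logging could feed back into confidence is sketched below; the exponential-moving-average update rule is an assumption for illustration, not Foundry's actual learning mechanism:

```python
class FeedbackLogger:
    def __init__(self, alpha=0.2):
        self.alpha = alpha          # learning rate (assumed value)
        self.confidence = {}        # pattern -> learned confidence

    def record(self, pattern, accepted, prior=0.5):
        """Nudge a pattern's confidence toward 1.0 (accepted) or 0.0 (rejected)."""
        outcome = 1.0 if accepted else 0.0
        current = self.confidence.get(pattern, prior)
        self.confidence[pattern] = (1 - self.alpha) * current + self.alpha * outcome

log = FeedbackLogger()
log.record("QALS-QAVE:QMNUM", accepted=True, prior=0.89)
log.record("QALS-QAVE:QMNUM", accepted=True)
# confidence climbs above the 0.89 prior with each accepted join
```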

Learning scope:

  • Local (MidWest): Immediate reuse within the same deployment
  • Global (all customers): Patterns added to platform-wide models
  • Domain-specific (automotive): Industry-specific optimizations

Compound Learning in Action: MidWest Timeline

Day 1:    FDE accepts BC-2847 fuzzy match (89% → 94% confidence boost)
Day 4:    SDDI encounters transmission parts at a different plant
          Applies learned fuzzy match rules automatically — no FDE intervention needed
Month 2:  Next automotive customer deployment starts with MidWest's part number
          matching rules pre-configured. Similar integration: 3 days instead of 7.
After 10 automotive deployments:
          SDDI proposes automotive-specific joins at 96% confidence

The network effect compounds: every deployment teaches the platform. After thousands of integrations across industries, SDDI becomes increasingly intelligent at recognizing patterns and proposing correct solutions — which is why MidWest-to-MedTech template transfer achieved 95% SDDI reuse in Chapter 9.

This appendix extends Chapter 7 of Palantir: Zero to Monopoly.
