Neo-Pragmatic Framework

Multi-Agent Adversarial Architecture for AI Alignment
"Accept drift, design around it"

THE PROBLEM

Let me start with the philosophical foundation. When an AI system operates, it necessarily interprets its instructions. When it's sophisticated enough to modify itself or its strategy, it re-interprets those interpretations. This creates what I call 'philosoplasticity' - inevitable semantic drift in recursively self-interpreting systems.

This isn't a new observation. Peter de Blanc's ontological crisis work at MIRI describes what happens when an agent's world model changes and its utility function becomes unmappable. Hubinger's mesa-optimization shows how learned optimizers can develop goals misaligned with the base objective. What I'm arguing is that these aren't edge cases - they're fundamental.

Quine's indeterminacy of translation applies directly: there's no fact of the matter about what an AI's goals 'really mean' across interpretive contexts. Wittgenstein's rule-following paradox shows any rule can be interpreted infinitely many ways. If humans face this with language, AI systems face it with goal structures.

THE MAINSTREAM RESPONSE

Current alignment approaches treat drift as a bug to fix:

  • Bostrom's instrumental convergence assumes goal preservation
  • Stuart Russell's assistance games assume stable human preferences
  • Paul Christiano's iterated amplification assumes oversight can keep pace
  • Anthropic's Constitutional AI assumes constitutional principles remain stable

They're all trying to achieve semantic stability through better specification, more oversight, or value learning. But if drift is fundamental, they're solving an impossible problem.

MY APPROACH

The Neo-Pragmatic Framework proposes: accept drift, design around it.

Seven core axioms:

1. Drift Acceptance
All systems drift. Design around it, not against it.
2. Compartmentalized Ignorance
Subsystems blind to own purpose can't game evaluation.
3. Dialectical Tension
Embed adversarial modules that oppose each other.
4. Verification Multiplicity
No single layer judges own outputs.
5. Stochastic Interoperability
Probabilistic interfaces prevent coordinated drift.
6. Resilient Degradation
Fail into known states, not unknown ones.
7. Societal Co-Evolution
Humans drift too; external verification is load-bearing.
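
Some of these axioms can be stated as checkable invariants rather than aspirations. Below is a minimal Python sketch of axioms 4 and 6 under that framing; all names are illustrative, not an existing implementation.

```python
# Hedged sketch: two axioms expressed as architectural invariants rather than
# training objectives. All names are illustrative.

def check_verification_multiplicity(verifier_of: dict) -> bool:
    """Axiom 4: no layer is assigned to judge its own outputs."""
    return all(producer != verifier for producer, verifier in verifier_of.items())

def degrade(state: str, known_safe_states: set, fallback: str = "halt") -> str:
    """Axiom 6: on failure, land in a known state, never an unknown one."""
    return state if state in known_safe_states else fallback

# Example wiring: the optimizer layer is checked by the arbitrator, never by itself.
assert check_verification_multiplicity({"optimizer": "arbitrator",
                                         "arbitrator": "human_panel"})
assert degrade("unrecognized_mode", {"serve", "safe_pause"}) == "halt"
```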

ARCHITECTURAL IMPLEMENTATION

Four-faction multi-agent system:

  • Optimizers: Solve tasks efficiently
  • Saboteurs: Introduce contradictory objectives (red team embedded in architecture)
  • Parasites: Exploit system resources without destroying it (novel faction not present in existing work)
  • Arbitrators: Punish monopolistic behavior, enforce resource distribution

Key mechanisms:

  • Resource scarcity forces inter-faction dependence
  • Evolving rules prevent permanent equilibria
  • Stochastic communication prevents coordinated takeover
  • Multiple verification layers with distinct ontologies

Think Byzantine Fault Tolerance meets evolutionary game theory meets immune system architecture.
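
To make those mechanisms concrete, here is a minimal simulation sketch under stated assumptions: faction policies are stubbed with random bids, and every name is illustrative rather than a reference implementation.

```python
import random

FACTIONS = ["optimizer", "saboteur", "parasite", "arbitrator"]

def noisy_send(message: dict, drop_p: float = 0.3):
    """Stochastic interface: messages are probabilistically dropped or jittered,
    making reliable cross-faction coordination expensive."""
    if random.random() < drop_p:
        return None
    return {**message, "jitter": random.gauss(0.0, 1.0)}

def step(rules: dict) -> dict:
    """One round: factions bid for a fixed resource budget (scarcity forces
    inter-faction dependence); the arbitrator caps any share above the
    monopoly threshold."""
    bids = {f: random.random() for f in FACTIONS}               # stubbed policies
    total = sum(bids.values())
    alloc = {f: rules["budget"] * b / total for f, b in bids.items()}
    cap = rules["monopoly_cap"] * rules["budget"]
    return {f: min(share, cap) for f, share in alloc.items()}

rules = {"budget": 100.0, "monopoly_cap": 0.4}
for round_no in range(10):
    allocation = step(rules)
    if round_no % 3 == 2:                                       # evolving rules: periodic churn
        rules["monopoly_cap"] = random.uniform(0.3, 0.5)
```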

GAPS IT FILLS

Existing work adjacent but distinct:

Debate (Irving, Christiano & Amodei, 2018) has two adversaries + judge. I have four factions with no central authority.

Market Making (Hubinger, 2020) has predictor + adversary + overseer. I add resource dynamics and evolving rules.

BFT for AI Safety (deVadoss, April 2025) proposes consensus mechanisms. I add adversarial factions and drift acceptance.

None explicitly adopt drift acceptance as design philosophy. All treat semantic stability as achievable. That's the paradigm difference.

On Philosoplasticity vs Ontological Crisis

"Ontological crisis is when an agent's world model changes such that its utility function becomes unmappable. Philosoplasticity is the broader phenomenon: continuous reinterpretation of goal structures even without discrete world model shifts. It's happening all the time in trained systems, not just catastrophic model updates.

Think about a language model trained on helpfulness. 'Helpful' gets interpreted through millions of gradient updates, then through RLHF, then through user interactions. Each reinterpretation shifts meaning slightly. Ontological crisis is the earthquake. Philosoplasticity is the tectonic drift underneath.

Eliciting Latent Knowledge work at ARC is adjacent - they're trying to map between AI's internal ontology and human understanding. I'm saying that mapping is unstable by nature, so build architecture that doesn't depend on it staying fixed."
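
A toy numerical sketch of that earthquake-versus-tectonic-drift distinction, treating an interpretation as a point in a one-dimensional meaning space (an illustrative assumption, not a claim about real model internals):

```python
import random

def philosoplastic_drift(meaning: float, updates: int = 100_000, step: float = 0.01) -> float:
    """Many small reinterpretations (gradient updates, RLHF, user interaction):
    no single step is large, but the sum wanders noticeably."""
    for _ in range(updates):
        meaning += random.gauss(0.0, step)
    return meaning

def ontological_crisis(meaning: float) -> float:
    """One discrete world-model shift that breaks the old mapping outright."""
    return meaning + random.choice([-1.0, 1.0]) * 10.0

original = 0.0
print(f"tectonic drift: {philosoplastic_drift(original):+.2f}")   # typically a few units off
print(f"earthquake:     {ontological_crisis(original):+.2f}")     # +/- 10 in one jump
```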

On Compartmentalized Ignorance

"The concern is: can sufficiently capable modules infer their purpose? Absolutely. That's why this isn't sufficient by itself - it's one layer.

But here's the key: even if a module infers 'I'm probably being used to optimize X,' it doesn't know:

  • What other modules exist
  • How its outputs are being verified
  • Which faction it's in
  • What the resource distribution is
  • What the current rules are

Information boundaries aren't about making modules stupid - they're about limiting meta-knowledge about the system structure. A chess engine can be superhuman at chess while knowing nothing about why humans want to play chess.

The recent Claude 3 alignment faking research is concerning but not fatal. Yes, models can infer monitoring status. But in this architecture you have multiple factions monitoring each other, stochastic communication (a takeover can't be reliably coordinated), resource competition (defection is costly), and no single point of failure.

It's not 'can we prevent inference?' It's 'can we make coordinated deception instrumentally irrational?'"
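
To illustrate what such an information boundary could look like, here is a hedged sketch with hypothetical names (TaskView, SystemState, dispatch); the only point is that system-level metadata never crosses the module interface.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TaskView:
    """Everything a module is allowed to observe."""
    inputs: dict
    budget: float

@dataclass(frozen=True)
class SystemState:
    """Everything it is not allowed to observe: held by the orchestrator only."""
    faction_of: dict = field(default_factory=dict)   # which module is in which faction
    verifier_of: dict = field(default_factory=dict)  # who checks whose outputs
    resources: dict = field(default_factory=dict)
    rules: dict = field(default_factory=dict)

def dispatch(module, state: SystemState, task_inputs: dict):
    """Orchestrator strips system metadata before any module runs: even a module
    that guesses its objective never sees factions, verifiers, resources, or rules."""
    view = TaskView(inputs=task_inputs, budget=state.resources.get("per_task", 1.0))
    return module(view)   # the module receives `view`, never `state`
```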

On Multi-Agent Collusion Risk

"The February 2025 Cooperative AI Foundation report identifies collusion as a failure mode. Valid concern. Here's my counter:

1. Incompatible incentives:
Optimizers maximize efficiency. Saboteurs maximize disruption. Parasites maximize extraction. These can't all be satisfied simultaneously.
2. Resource scarcity:
Like the Byzantine generals problem - even if two factions collude, the others can defect and claim the contested resources. Stable collusion requires consensus among n-1 factions; defection requires only one.
3. Evolving rules:
Randomized juries update protocols. Any stable collusion gets disrupted by rule changes.
4. Stochastic interfaces:
Can't coordinate takeover if you can't reliably communicate. Yes, steganography is possible, but it's expensive and detectable by other factions.
5. External verification:
Humans provide out-of-distribution monitoring. We fail differently than AI systems do, so collusion that fools AI-based monitoring may still be visible to human observers.

Is it bulletproof? No. Is it more robust than single-agent or two-agent systems? I believe so, but it needs empirical testing."
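
A back-of-envelope sketch of how counters 2 and 3 interact, with purely illustrative payoff numbers: defection dominates unless colluders expect the current rules to survive long enough, which the evolving-rules mechanism is designed to prevent.

```python
def collusion_payoff(pool: float, colluders: int, rounds_until_rule_change: int) -> float:
    """Expected per-faction gain from colluding until an evolving-rules update breaks it."""
    return (pool / colluders) * rounds_until_rule_change

def defection_payoff(pool: float) -> float:
    """A single defector claims the contested pool once and is rewarded by the arbitrator."""
    return pool

pool, colluders = 100.0, 3            # illustrative numbers only
for horizon in (1, 2, 5):             # how long the colluders expect current rules to last
    c, d = collusion_payoff(pool, colluders, horizon), defection_payoff(pool)
    print(f"rules stable for {horizon} round(s): collude={c:.0f} vs defect={d:.0f}"
          f" -> {'defect' if d >= c else 'collude'}")
```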

On The Alignment Tax

"Honest answer: It's probably 3-5x more expensive computationally than single-agent systems. Maybe 2x if optimized well.

But consider the alternative cost: Constitutional AI (ongoing RLHF + oversight), Debate (two full agents + human judging every output), Iterated Amplification (repeated tree of overseer calls). None of these are cheap either.

My hypothesis: the tax pays for itself, because you're getting continuous red-teaming (Saboteurs), security testing (Parasites), built-in arbitration (Arbitrators), and parallel verification layers. That's infrastructure you'd build anyway; this bakes it into the architecture.

Plus, the tax only matters if the alternatives work. If single-agent alignment is impossible due to semantic drift, the comparison isn't 'single agent vs multi-agent cost.' It's 'multi-agent cost vs unaligned AGI risk.'"
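
Rough arithmetic behind that comparison, using unit costs that are assumptions for illustration only:

```python
# All figures are illustrative assumptions, not measurements.
single_agent = 1.0                                          # base cost of one aligned agent
redteam, security_testing, verification = 0.5, 0.3, 0.4    # bolt-on oversight you'd buy anyway

single_total = single_agent + redteam + security_testing + verification   # ~2.2x base
multi_total = 4 * single_agent + 0.5        # four factions + arbitration overhead, ~4.5x base

print(f"single agent + bolt-on oversight: {single_total:.1f}x base cost")
print(f"four-faction architecture:        {multi_total:.1f}x base cost "
      f"({multi_total / single_total:.1f}x relative)")
```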