Technical Deep Dive

The SingularVault Protocol — Technical Architecture & Research Brief

This document outlines the theoretical architecture, cryptographic approach, and protocol design for a global cross-provider deduplication layer. This is a research brief — not a product spec. We're looking for specialists who can validate, challenge, and refine these ideas.

01 — The Technical Problem

Cloud storage today operates on isolated silos. Each provider — AWS S3, Azure Blob Storage, Google Cloud Storage, and thousands of smaller providers — maintains completely independent storage infrastructure. When a user uploads a file to Provider A, and the same file (byte-for-byte identical) exists on Provider B, C, and D, no cross-provider awareness exists.

Within a single provider, deduplication is common. AWS, for example, is believed to apply block-level dedup internally within S3, though the details are not public. Enterprise storage systems (NetApp ONTAP, Dell EMC Data Domain, Veeam) have offered single-tenant dedup for years. But cross-provider deduplication does not exist at any meaningful scale.

The result: trillions of file-blocks stored redundantly across competing infrastructure, each copy consuming disk space, electricity for storage and retrieval, cooling water, and physical land. Conservative estimates suggest that cross-provider dedup could reduce global storage volume by 30–60%, depending on the file type distribution.

Current vs Proposed Architecture

CURRENT (isolated silos): Provider A holds files X, Y, Z plus A, B; Provider B holds X, Y, Z plus C, D; Provider C holds X, Y, Z plus E, F. Files X, Y, Z are stored 3× with no cross-provider awareness.

PROPOSED (shared layer): the SingularVault layer stores each of X, Y, Z, A, B, C, D, E, F exactly once, while Providers A, B, and C hold references only. Result: 9 unique files versus 15 stored copies, a 40% reduction.
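The arithmetic behind the diagram can be reproduced in a few lines (the file names are the placeholders from the figure):

```python
# Reproduce the figure's arithmetic: three providers with overlapping files.
provider_a = {"X", "Y", "Z", "A", "B"}
provider_b = {"X", "Y", "Z", "C", "D"}
provider_c = {"X", "Y", "Z", "E", "F"}

stored_copies = len(provider_a) + len(provider_b) + len(provider_c)  # 15
unique_files = len(provider_a | provider_b | provider_c)             # 9

reduction = 1 - unique_files / stored_copies
print(f"{stored_copies} copies -> {unique_files} unique = {reduction:.0%} reduction")
# prints "15 copies -> 9 unique = 40% reduction"
```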

02 — Proposed Architecture

SingularVault would function as a middleware protocol layer that sits between users/applications and cloud storage providers. It is not a replacement for S3 or Blob Storage — it's a coordination layer that enables cross-provider content awareness.

Content-Addressable Storage (CAS)

At the core, every file (or file-block) is identified by its cryptographic hash. This is the same principle behind Git, IPFS, and BitTorrent. The hash becomes the address — if two files produce the same hash, they are the same content.

// Simplified content addressing
file_content = read("photo.jpg")
hash = SHA-256(file_content)  // → "a3f2b8c9..."

// This hash IS the storage address
// Same content always → same hash → same address
// No duplicates, up to negligible hash-collision probability

if global_index.exists(hash):
    return reference(hash)  // Already stored — just link to it
else:
    store(hash, encrypted_content)
    global_index.register(hash)
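The pseudocode above can be exercised with a minimal in-memory sketch, where a plain dict stands in for the global index and encryption is omitted:

```python
import hashlib

# Minimal in-memory content-addressable store: the SHA-256 digest is
# the storage address, so identical content is stored exactly once.
global_index: dict[str, bytes] = {}

def put(content: bytes) -> str:
    address = hashlib.sha256(content).hexdigest()
    if address not in global_index:      # global existence check
        global_index[address] = content  # store once
    return address                       # reference handed to the caller

def get(address: str) -> bytes:
    return global_index[address]

ref1 = put(b"photo bytes")
ref2 = put(b"photo bytes")  # second upload of identical content
assert ref1 == ref2 and len(global_index) == 1  # deduplicated
```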

System Components

The architecture consists of four primary components:

HASH REGISTRY A distributed index mapping content hashes to storage locations. This is the “does this file already exist?” lookup service. Must be globally consistent, highly available, and fast. Could be implemented as a distributed hash table (DHT) similar to Kademlia, or a federated consensus network.

ENCRYPTION ENGINE Client-side encryption using convergent encryption (CE) or message-locked encryption (MLE). This allows deduplication of encrypted data — the key innovation that makes privacy-preserving dedup possible.

STORAGE BACKEND The actual cloud providers. They continue to store encrypted blocks, but now with awareness that a block may be shared across tenants/providers. Storage nodes need minimal changes — they just store blobs addressed by hash.

ACCESS CONTROL LAYER Manages who can access which files. Even though the underlying storage is deduplicated, access permissions remain per-user and per-organization. Uses capability-based access tokens.
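As one illustration of the access-control layer, a capability token could be an HMAC over (content hash, user id) under a key held only by the access-control service. This is a deliberate simplification: a real design would also carry expiry, scopes, and revocation. All names below (`AC_KEY`, `issue_capability`) are hypothetical:

```python
import hashlib, hmac

AC_KEY = b"demo-access-control-key"  # illustrative; held by the access-control layer

def issue_capability(content_hash: str, user_id: str) -> str:
    # Bind the token to both the content and the user
    msg = f"{content_hash}|{user_id}".encode()
    return hmac.new(AC_KEY, msg, hashlib.sha256).hexdigest()

def verify_capability(content_hash: str, user_id: str, token: str) -> bool:
    return hmac.compare_digest(issue_capability(content_hash, user_id), token)

tok = issue_capability("a3f2b8c9", "alice")
assert verify_capability("a3f2b8c9", "alice", tok)
assert not verify_capability("a3f2b8c9", "bob", tok)  # alice's token fails for bob
```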


03 — The Encryption-Deduplication Paradox

This is the hardest technical problem. Standard encryption (AES-256 with random IVs) produces unique ciphertext for every encryption operation — meaning two copies of the same file, encrypted by different users, produce completely different encrypted outputs. This defeats deduplication.

Convergent Encryption (CE)

The leading approach: derive the encryption key from the content itself.

// Convergent Encryption
content_key = Hash(file_content)    // Key derived FROM the content
ciphertext = Encrypt(content_key, file_content)

// Same content → same key → same ciphertext
// Deduplication works on ciphertext!

// To access: user stores their content_key securely
// The storage system never sees the key or plaintext

Known vulnerability: CE is susceptible to confirmation-of-a-file attacks. If an attacker knows (or can guess) the plaintext, they can compute the hash and confirm the file exists in the system. This is a real concern for low-entropy files (e.g., standard system files, common documents).
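The dedup-on-ciphertext property is easy to demonstrate. The sketch below derives the key from the content and uses a toy SHA-256 counter-mode keystream purely for illustration; it is not a vetted cipher, and a real deployment would use a standard AEAD such as AES-GCM:

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream (SHA-256 in counter mode) -- illustration only.
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(content: bytes) -> tuple[bytes, bytes]:
    content_key = hashlib.sha256(content).digest()  # key derived FROM content
    ct = bytes(a ^ b for a, b in zip(content, keystream(content_key, len(content))))
    return content_key, ct

def decrypt(content_key: bytes, ct: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(ct, keystream(content_key, len(ct))))

k1, ct1 = convergent_encrypt(b"same file, two users")
k2, ct2 = convergent_encrypt(b"same file, two users")
assert ct1 == ct2  # identical ciphertext from independent users -> dedupable
assert decrypt(k1, ct1) == b"same file, two users"
```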

Message-Locked Encryption (MLE)

A more formal framework proposed by Bellare, Keelveedhi, and Ristenpart (2013). MLE formalizes CE and introduces additional constructions that can mitigate some attacks, including randomized convergent encryption and server-aided MLE schemes.

HCE² — Hybrid Convergent Encryption

A potential approach for SingularVault would combine CE with per-user key wrapping:

// HCE²: Hybrid approach
content_hash = SHA-256(file_content)
content_key = HKDF(content_hash, salt="singularvault-v1")
ciphertext = AES-256-GCM(content_key, file_content)

// User wraps the content_key with their own key
wrapped_key = RSA-OAEP(user_public_key, content_key)

// Storage sees: ciphertext (dedupable) + wrapped_key (per-user)
// User needs their private key to unwrap → access content
// System proves uniqueness via content_hash without seeing plaintext

This allows deduplication at the ciphertext level while maintaining per-user access control. The tradeoff is increased key management complexity.
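The key-derivation step in the sketch above can be written against RFC 5869 using only the standard library; the RSA key-wrapping step is omitted here because it needs an external crypto library:

```python
import hashlib, hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes = b"", length: int = 32) -> bytes:
    # RFC 5869 extract-then-expand
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # HKDF-Extract
    okm, block = b"", b""
    for i in range(1, -(-length // 32) + 1):            # HKDF-Expand
        block = hmac.new(prk, block + info + bytes([i]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

content = b"example file content"
content_hash = hashlib.sha256(content).digest()
content_key = hkdf_sha256(content_hash, salt=b"singularvault-v1")

# Same content always yields the same key -- the property dedup relies on.
assert content_key == hkdf_sha256(hashlib.sha256(content).digest(), b"singularvault-v1")
```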


04 — Protocol Flow

Here's how a file upload would work through the SingularVault protocol:

Upload Protocol

  1. Client-Side Hashing

    The client computes SHA-256(file) locally. The file content never leaves the device at this stage. The hash is sent to the SingularVault registry as a “do you have this?” query.

  2. Global Existence Check

    The hash registry performs a lookup. If the hash exists → the file is already stored globally. Skip to step 5. If not → proceed to upload.

  3. Convergent Encryption

    Client derives encryption key from file content via HKDF(SHA-256(content)), encrypts the file, then wraps the content key with the user's public key.

  4. Storage & Registration

    Encrypted ciphertext is uploaded to the nearest storage node. The hash → location mapping is registered in the global index. Storage node stores the blob without any knowledge of its contents.

  5. Reference Creation

    Whether the file was newly uploaded or already existed, the user receives a reference (hash + wrapped key). This reference is all they need to retrieve the file later. Multiple users can hold references to the same underlying data.

Download Protocol

Retrieval is straightforward: the client presents their reference (hash + wrapped key), the system locates the ciphertext via the hash registry, downloads the encrypted blob, unwraps the content key using the user's private key, and decrypts locally. The storage system never handles plaintext.

Deletion Handling

This is a critical design decision. When User A deletes “their” file, but User B still has a reference to the same content — the underlying data must persist. The system needs reference counting: the physical data is only deleted when the last reference is removed. This introduces garbage collection complexity and requires careful handling to prevent data loss or orphaned blobs.
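A single-node version of reference counting is straightforward; the hard part is the distributed case, where a delete can race a concurrent upload. A minimal sketch of the single-node invariant:

```python
import hashlib

blobs: dict[str, bytes] = {}    # hash -> stored blob
refcount: dict[str, int] = {}   # hash -> number of live references

def add_reference(content: bytes) -> str:
    h = hashlib.sha256(content).hexdigest()
    if h not in blobs:
        blobs[h] = content
    refcount[h] = refcount.get(h, 0) + 1
    return h

def drop_reference(h: str) -> None:
    refcount[h] -= 1
    if refcount[h] == 0:        # last reference gone -> physical delete
        del refcount[h], blobs[h]

ref = add_reference(b"shared report")  # User A uploads
add_reference(b"shared report")        # User B uploads the same content
drop_reference(ref)                    # A deletes "their" file
assert ref in blobs                    # data persists: B still holds a reference
drop_reference(ref)                    # B deletes too
assert ref not in blobs                # now physically removed
```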


05 — Prior Art & Landscape

Several projects have explored aspects of this problem. Understanding where they succeeded and failed is critical for SingularVault's design.

System               | Approach                                            | Scale           | Cross-Provider | Status
IPFS                 | Content-addressed DHT, Merkle DAGs                  | Global          | Yes            | Active, limited adoption
Filecoin             | Incentivized IPFS storage with proof-of-replication | Global          | Yes            | Active, crypto volatility
NetApp ONTAP         | Inline/post-process dedup within array              | Single-tenant   | No             | Mature, enterprise
Dell EMC Data Domain | Variable-length dedup for backup                    | Single-tenant   | No             | Mature, enterprise
AWS S3 (internal)    | Block-level dedup within S3                         | Single-provider | No             | Opaque, internal
Sia / Storj          | Decentralized storage with erasure coding           | Global          | Partial        | Active, niche
SingularVault        | Cross-provider CAS + convergent encryption          | Global          | Yes (goal)     | Research phase

Key differentiator: IPFS and Filecoin are decentralized storage networks that aim to replace traditional cloud providers. SingularVault's approach is different — it aims to sit on top of existing providers as a coordination layer, requiring minimal changes to existing infrastructure. Think of it as a dedup protocol rather than a storage network.


06 — Open Challenges

These are the problems that need to be solved before SingularVault can move from theory to practice. This is exactly what we need specialists for.

Confirmation-of-File Attacks

Convergent encryption allows anyone who knows (or guesses) file content to verify its existence. For high-entropy files (photos, videos), this is low-risk. For low-entropy files (standard documents, configs), it’s a real concern. Possible mitigations: server-assisted hashing with blinding factors, and proof-of-ownership protocols before granting dedup references.
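One way to see the server-assisted idea: if the registry keys its index by a keyed hash of the content hash, an attacker holding only a guessed plaintext cannot confirm existence offline; every probe must go through (rate-limitable) server infrastructure. The toy sketch below leaks the raw hash to the server; a deployed DupLESS-style scheme would use an oblivious PRF so the server never sees it either. All names here are illustrative:

```python
import hashlib, hmac

SERVER_SECRET = b"demo-only-registry-secret"  # held only by the registry

def blinded_index_key(content_hash: bytes) -> bytes:
    # Index key depends on a server secret, defeating offline guessing
    return hmac.new(SERVER_SECRET, content_hash, hashlib.sha256).digest()

registry: set[bytes] = set()

def register(content_hash: bytes) -> None:
    registry.add(blinded_index_key(content_hash))

def exists(content_hash: bytes) -> bool:
    return blinded_index_key(content_hash) in registry

h = hashlib.sha256(b"common-system-file-v1.0").digest()
register(h)
assert exists(h)
assert not exists(hashlib.sha256(b"some other file").digest())
```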

Data Sovereignty & GDPR

If a file is stored once in a US data center but referenced by EU users, does that violate GDPR data residency requirements? The protocol may need geo-aware storage with regional canonical copies — which partially defeats the energy savings but may be legally required.

Hash Registry Scalability

The global hash index needs to hold billions (potentially trillions) of entries and serve lookups at very high throughput with low latency. A naive centralized database won’t work. DHT-based approaches (like Kademlia) scale but introduce latency and consistency challenges. Bloom filters could reduce lookup costs but introduce false positives.
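A Bloom filter in front of the registry can answer “definitely not stored” locally, so only possible hits pay for a full distributed lookup. A toy version, sized here with arbitrary parameters:

```python
import hashlib

class BloomFilter:
    # Toy Bloom filter: k hash positions over an m-bit array.
    # "No" answers are definitive; "maybe" answers need a real registry lookup.
    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(d[:4], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add(b"a3f2b8c9")
assert bf.might_contain(b"a3f2b8c9")  # no false negatives, ever
# might_contain on an unseen hash is *probably* False, but false positives can occur
```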

Provider Incentive Alignment

Cloud providers profit from storage volume. Reducing stored data reduces their revenue. SingularVault needs a business model where providers benefit — perhaps through reduced infrastructure costs outweighing reduced billing, or through a shared revenue model for the dedup layer itself.

Deletion & Garbage Collection

Reference-counted deletion across a distributed system is notoriously hard. Race conditions between simultaneous uploads and deletions could cause data loss. The system needs eventual consistency guarantees that never lose data, even at the cost of temporarily keeping orphaned blobs.

Block vs. File-Level Dedup

File-level dedup is simpler but misses opportunities (two files that differ by one byte are stored twice). Block-level dedup (like rsync’s rolling checksum) catches more redundancy but dramatically increases the hash registry size and lookup complexity.
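The cost difference comes from the rolling property: an rsync-style weak checksum over a sliding window can be updated in O(1) per byte instead of rehashing the whole window. A simplified version (mod 2¹⁶ components, as in rsync’s weak checksum):

```python
def weak_checksum(window: bytes) -> tuple[int, int]:
    # rsync-style weak checksum components, each mod 2**16
    n = len(window)
    a = sum(window) % 65536
    b = sum((n - i) * x for i, x in enumerate(window)) % 65536
    return a, b

def roll(a: int, b: int, n: int, out_byte: int, in_byte: int) -> tuple[int, int]:
    # O(1) update when the n-byte window slides one byte to the right
    a = (a - out_byte + in_byte) % 65536
    b = (b - n * out_byte + a) % 65536
    return a, b

data = b"block-level deduplication example data"
n = 8
a, b = weak_checksum(data[0:n])
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, n, data[i - 1], data[i + n - 1])
    assert (a, b) == weak_checksum(data[i:i + n])  # matches full recompute
```

When the weak checksum matches a registry entry, a strong hash (e.g. SHA-256) of the window confirms the block before a dedup reference is issued.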


07 — Research Questions

If you're a specialist considering joining this project, these are the open questions we most need help with:

1. What is the realistic global deduplication ratio? Across all cloud providers, what percentage of stored data is truly unique? Industry estimates range from 30–80% redundancy depending on data type. We need actual measurements or credible models.

2. Can convergent encryption be made resistant to confirmation attacks at scale? Server-aided approaches exist in literature but haven’t been deployed at global scale. What are the performance and trust tradeoffs?

3. What’s the minimal viable coordination surface? SingularVault doesn’t need every provider to adopt a full protocol. What’s the smallest change a provider needs to make to participate? Can we build this as a proxy/gateway layer that requires zero provider changes?

4. How do you handle versioning and mutability? CAS naturally handles immutable data. But real-world files change. How does the system handle file updates without losing the dedup benefits for the unchanged portions?

5. What’s the economic model that aligns all incentives? Users want cheaper storage. Providers want revenue. The planet needs less energy. Is there a model where all three can win?

6. Can zero-knowledge proofs enable dedup without any information leakage? ZKPs are computationally expensive today but improving rapidly. Could a ZKP-based protocol allow a client to prove “I have content matching hash X” without revealing anything about the content — not even the hash itself?