Technical Deep Dive
The SingularVault Protocol — Technical Architecture & Research Brief
This document outlines the theoretical architecture, cryptographic approach, and protocol design for a global cross-provider deduplication layer. This is a research brief — not a product spec. We're looking for specialists who can validate, challenge, and refine these ideas.
01 — The Technical Problem
Cloud storage today operates on isolated silos. Each provider — AWS S3, Azure Blob Storage, Google Cloud Storage, and thousands of smaller providers — maintains completely independent storage infrastructure. When a user uploads a file to Provider A, and the same file (byte-for-byte identical) exists on Provider B, C, and D, no cross-provider awareness exists.
Within a single provider, deduplication is common. AWS, for example, is widely believed to apply block-level dedup internally within S3, though the details are not public. Enterprise storage systems (NetApp ONTAP, Dell EMC, Veeam) have offered single-tenant dedup for years. But cross-provider deduplication does not exist at any meaningful scale.
The result: trillions of file-blocks stored redundantly across competing infrastructure, each copy consuming disk space, electricity for storage and retrieval, cooling water, and physical land. Conservative estimates suggest that cross-provider dedup could reduce global storage volume by 30–60%, depending on the file type distribution.
[Figure: Current vs. Proposed Architecture]
02 — Proposed Architecture
SingularVault would function as a middleware protocol layer that sits between users/applications and cloud storage providers. It is not a replacement for S3 or Blob Storage — it's a coordination layer that enables cross-provider content awareness.
Content-Addressable Storage (CAS)
At the core, every file (or file-block) is identified by its cryptographic hash. This is the same principle behind Git, IPFS, and BitTorrent. The hash becomes the address — if two files produce the same hash, they are the same content.
```
// Simplified content addressing
file_content = read("photo.jpg")
hash = SHA-256(file_content)        // → "a3f2b8c9..."

// This hash IS the storage address
// Same content always → same hash → same address
// No duplicates by mathematical guarantee
if global_index.exists(hash):
    return reference(hash)          // Already stored — just link to it
else:
    store(hash, encrypted_content)
    global_index.register(hash)
```
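The same flow can be sketched as runnable Python, using an in-memory dict as a stand-in for the global index (the `GlobalIndex` class and its methods are illustrative names, not part of any spec):

```python
import hashlib

class GlobalIndex:
    """Toy stand-in for the distributed hash registry (in-memory dict)."""
    def __init__(self):
        self._store = {}

    def exists(self, digest):
        return digest in self._store

    def register(self, digest, blob):
        self._store[digest] = blob

    def fetch(self, digest):
        return self._store[digest]

def put(index, content: bytes) -> str:
    """Store content by its SHA-256 address; skip the upload if already present."""
    digest = hashlib.sha256(content).hexdigest()
    if not index.exists(digest):
        index.register(digest, content)  # only the first copy is physically stored
    return digest                        # the hash IS the reference

index = GlobalIndex()
ref_a = put(index, b"same bytes")
ref_b = put(index, b"same bytes")        # deduplicated: no second copy stored
assert ref_a == ref_b
```

Note that the second `put` never touches storage: the existence check on the hash is the entire dedup mechanism.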
System Components
The architecture consists of four primary components:
HASH REGISTRY
A distributed index mapping content hashes to storage locations. This is the “does this file already exist?” lookup service. Must be globally consistent, highly available, and fast. Could be implemented as a distributed hash table (DHT) similar to Kademlia, or a federated consensus network.

ENCRYPTION ENGINE
Client-side encryption using convergent encryption (CE) or message-locked encryption (MLE). This allows deduplication of encrypted data — the key innovation that makes privacy-preserving dedup possible.

STORAGE BACKEND
The actual cloud providers. They continue to store encrypted blocks, but now with awareness that a block may be shared across tenants and providers. Storage nodes need minimal changes — they just store blobs addressed by hash.

ACCESS CONTROL LAYER
Manages who can access which files. Even though the underlying storage is deduplicated, access permissions remain per-user and per-organization. Uses capability-based access tokens.
03 — The Encryption-Deduplication Paradox
This is the hardest technical problem. Standard encryption (AES-256 with random IVs) produces unique ciphertext for every encryption operation — meaning two copies of the same file, encrypted by different users, produce completely different encrypted outputs. This defeats deduplication.
Convergent Encryption (CE)
The leading approach: derive the encryption key from the content itself.
```
// Convergent Encryption
content_key = Hash(file_content)     // Key derived FROM the content
ciphertext = Encrypt(content_key, file_content)

// Same content → same key → same ciphertext
// Deduplication works on ciphertext!

// To access: user stores their content_key securely
// The storage system never sees the key or plaintext
```
Known vulnerability: CE is susceptible to confirmation-of-file attacks. If an attacker knows (or can guess) the plaintext, they can compute the hash and confirm the file exists in the system. This is a real concern for low-entropy files (e.g., standard system files, common documents).
Message-Locked Encryption (MLE)
A more formal framework proposed by Bellare, Keelveedhi, and Ristenpart (2013). MLE formalizes CE and introduces additional constructions that can mitigate some attacks, including randomized convergent encryption and server-aided MLE schemes.
HCE² — Hybrid Convergent Encryption
A potential approach for SingularVault would combine CE with per-user key wrapping:
```
// HCE²: Hybrid approach
content_hash = SHA-256(file_content)
content_key  = HKDF(content_hash, salt="singularvault-v1")
ciphertext   = AES-256-GCM(content_key, file_content)

// User wraps the content_key with their own key
wrapped_key = RSA-OAEP(user_public_key, content_key)

// Storage sees: ciphertext (dedupable) + wrapped_key (per-user)
// User needs their private key to unwrap → access content
// System proves uniqueness via content_hash without seeing plaintext
```
This allows deduplication at the ciphertext level while maintaining per-user access control. The tradeoff is increased key management complexity.
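The shape of the hybrid scheme can be made concrete with standard-library stand-ins: an HMAC-based simplification of HKDF, an XOR pad in place of AES-256-GCM, and an XOR key wrap in place of RSA-OAEP. Every primitive here is a labeled toy; only the structure (shared ciphertext, per-user wrapped keys) mirrors the proposal:

```python
import hashlib
import hmac

def hkdf_like(content_hash: bytes, salt: bytes) -> bytes:
    """Simplified HKDF-extract stand-in (HMAC-SHA256); real HKDF adds an expand step."""
    return hmac.new(salt, content_hash, hashlib.sha256).digest()

def xor32(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

content = b"shared design doc"
content_hash = hashlib.sha256(content).digest()
content_key = hkdf_like(content_hash, salt=b"singularvault-v1")

# Toy stand-in for AES-256-GCM: XOR with a key-derived pad (illustration only)
pad = hashlib.sha256(content_key + b"pad").digest() * 4
ciphertext = bytes(c ^ p for c, p in zip(content, pad))

# Toy stand-in for RSA-OAEP wrapping: XOR content_key with a user-derived key
alice_secret = hashlib.sha256(b"alice-private").digest()
bob_secret = hashlib.sha256(b"bob-private").digest()
alice_wrapped = xor32(content_key, alice_secret)
bob_wrapped = xor32(content_key, bob_secret)

# The ciphertext is identical for both users (dedupable); the wrapped keys differ
assert alice_wrapped != bob_wrapped
assert xor32(alice_wrapped, alice_secret) == content_key  # Alice can unwrap
```

The key-management complexity mentioned above shows up even in this toy: every user holds one wrapped key per file, and losing the unwrapping secret makes the shared ciphertext unrecoverable for that user.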
04 — Protocol Flow
Here's how a file upload would work through the SingularVault protocol:
Upload Protocol
1. Client-Side Hashing
The client computes SHA-256(file) locally. The file content never leaves the device at this stage. The hash is sent to the SingularVault registry as a “do you have this?” query.

2. Global Existence Check
The hash registry performs a lookup. If the hash exists → the file is already stored globally. Skip to step 5. If not → proceed to upload.

3. Convergent Encryption
The client derives the encryption key from the file content via HKDF(SHA-256(content)), encrypts the file, then wraps the content key with the user's public key.

4. Storage & Registration
Encrypted ciphertext is uploaded to the nearest storage node. The hash → location mapping is registered in the global index. The storage node stores the blob without any knowledge of its contents.

5. Reference Creation
Whether the file was newly uploaded or already existed, the user receives a reference (hash + wrapped key). This reference is all they need to retrieve the file later. Multiple users can hold references to the same underlying data.
Download Protocol
Retrieval is straightforward: the client presents their reference (hash + wrapped key), the system locates the ciphertext via the hash registry, downloads the encrypted blob, unwraps the content key using the user's private key, and decrypts locally. The storage system never handles plaintext.
Deletion Handling
This is a critical design decision. When User A deletes “their” file, but User B still has a reference to the same content — the underlying data must persist. The system needs reference counting: the physical data is only deleted when the last reference is removed. This introduces garbage collection complexity and requires careful handling to prevent data loss or orphaned blobs.
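The reference-counting behavior described above can be sketched as a single-process toy; a real deployment would need distributed coordination for the same invariant (data is reclaimed only when the last reference disappears):

```python
class RefCountedStore:
    """Sketch of reference-counted deletion (single-node; no race handling)."""
    def __init__(self):
        self.blobs = {}  # hash -> data
        self.refs = {}   # hash -> set of owners holding a reference

    def put(self, digest, data, owner):
        self.blobs.setdefault(digest, data)
        self.refs.setdefault(digest, set()).add(owner)

    def delete(self, digest, owner):
        self.refs[digest].discard(owner)
        if not self.refs[digest]:    # last reference gone:
            del self.blobs[digest]   # physical data reclaimed
            del self.refs[digest]

store = RefCountedStore()
store.put("a3f2", b"blob", owner="alice")
store.put("a3f2", b"blob", owner="bob")
store.delete("a3f2", owner="alice")
assert "a3f2" in store.blobs       # Bob's reference keeps the data alive
store.delete("a3f2", owner="bob")
assert "a3f2" not in store.blobs   # reclaimed only after the last reference
```

The race the section warns about is visible even here: if a `put` and the final `delete` interleave on different nodes, the blob can vanish while a new reference to it is being registered, which is why the brief argues for erring toward temporarily orphaned blobs.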
05 — Prior Art & Landscape
Several projects have explored aspects of this problem. Understanding where they succeeded and failed is critical for SingularVault's design.
| System | Approach | Scale | Cross-Provider | Status |
|---|---|---|---|---|
| IPFS | Content-addressed DHT, Merkle DAGs | Global | Yes | Active, limited adoption |
| Filecoin | Incentivized IPFS storage with proof-of-replication | Global | Yes | Active, crypto volatility |
| NetApp ONTAP | Inline/post-process dedup within array | Single-tenant | No | Mature, enterprise |
| Dell EMC DataDomain | Variable-length dedup for backup | Single-tenant | No | Mature, enterprise |
| AWS S3 (internal) | Block-level dedup within S3 | Single-provider | No | Opaque, internal |
| Sia / Storj | Decentralized storage with erasure coding | Global | Partial | Active, niche |
| SingularVault | Cross-provider CAS + convergent encryption | Global | Yes (goal) | Research phase |
Key differentiator: IPFS and Filecoin are decentralized storage networks that aim to replace traditional cloud providers. SingularVault's approach is different — it aims to sit on top of existing providers as a coordination layer, requiring minimal changes to existing infrastructure. Think of it as a dedup protocol rather than a storage network.
06 — Open Challenges
These are the problems that need to be solved before SingularVault can move from theory to practice. This is exactly what we need specialists for.
Confirmation-of-File Attacks
Convergent encryption allows anyone who knows (or guesses) file content to verify its existence. For high-entropy files (photos, videos), this is low-risk. For low-entropy files (standard documents, configs), it’s a real concern. Solutions: server-assisted hashing with blinding factors, proof-of-ownership protocols before granting dedup references.
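A proof-of-ownership challenge of the kind mentioned above can be sketched simply: before granting a dedup reference, the server sends a fresh nonce and requires a hash over nonce-plus-content, which only a party holding the full file can compute. This is a minimal sketch of the idea, not any specific published protocol:

```python
import hashlib
import secrets

# Server side: holds the stored blob (in deployment, the ciphertext)
stored_content = b"full file bytes the server already holds"

def issue_challenge() -> bytes:
    return secrets.token_bytes(16)  # fresh random nonce per claim

def expected_response(content: bytes, nonce: bytes) -> str:
    return hashlib.sha256(nonce + content).hexdigest()

# A client claiming "I already have this file" must hash nonce || content
nonce = issue_challenge()
client_answer = expected_response(stored_content, nonce)  # honest client

# An attacker who only knows the file's public hash cannot answer:
public_hash = hashlib.sha256(stored_content).digest()
attacker_answer = hashlib.sha256(nonce + public_hash).hexdigest()
assert client_answer == expected_response(stored_content, nonce)
assert attacker_answer != client_answer
```

The fresh nonce is what prevents replay: knowing the content's hash, or a previous session's answer, is not enough to pass a new challenge.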
Data Sovereignty & GDPR
If a file is stored once in a US data center but referenced by EU users, does that violate GDPR data residency requirements? The protocol may need geo-aware storage with regional canonical copies — which partially defeats the energy savings but may be legally required.
Hash Registry Scalability
The global hash index must store billions (potentially trillions) of entries and serve lookups at very high throughput with low latency. A naive centralized database won’t work. DHT-based approaches (like Kademlia) scale but introduce latency and consistency challenges. Bloom filters could reduce lookup costs but introduce false positives.
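The Bloom-filter tradeoff is easy to show concretely: a "no" answer is authoritative (the expensive DHT lookup can be skipped entirely), while a "yes" answer must still be confirmed against the real index because of false positives. A minimal sketch:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fast negative lookups for the hash registry,
    with occasional false positives but never false negatives."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k independent bit positions from SHA-256
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add(b"a3f2b8c9")
assert bf.might_contain(b"a3f2b8c9")  # added items always answer "maybe yes"
# Unseen items almost always answer "no", and "no" is guaranteed correct,
# so the registry only pays the full lookup cost on (rare) "maybe" answers.
```

Sizing the filter (bits per entry, number of hash functions) against an acceptable false-positive rate is the practical knob; the sketch above uses arbitrary small values for readability.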
Provider Incentive Alignment
Cloud providers profit from storage volume. Reducing stored data reduces their revenue. SingularVault needs a business model where providers benefit — perhaps through reduced infrastructure costs outweighing reduced billing, or through a shared revenue model for the dedup layer itself.
Deletion & Garbage Collection
Reference-counted deletion across a distributed system is notoriously hard. Race conditions between simultaneous uploads and deletions could cause data loss. The system needs eventual consistency guarantees that never lose data, even at the cost of temporarily keeping orphaned blobs.
Block vs. File-Level Dedup
File-level dedup is simpler but misses opportunities (two files that differ by one byte are stored twice). Block-level dedup (like rsync’s rolling checksum) catches more redundancy but dramatically increases the hash registry size and lookup complexity.
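The advantage of rolling-checksum (content-defined) chunking over fixed blocks is that chunk boundaries depend only on a local window of bytes, so a one-byte insertion disturbs only nearby chunks instead of shifting every subsequent block. A simplified sketch of the idea (window size, mask, and minimum chunk size are arbitrary illustrative parameters, not rsync's actual algorithm):

```python
import hashlib
import random

def rolling_chunks(data: bytes, window=16, mask=0x3FF, min_size=64):
    """Cut a chunk wherever a rolling sum over the last `window` bytes
    matches a boundary pattern; boundaries realign after local edits."""
    chunks, start, rsum = [], 0, 0
    for i in range(len(data)):
        rsum += data[i]
        if i - start + 1 > window:
            rsum -= data[i - window]   # slide the window
        if i - start + 1 >= min_size and (rsum & mask) == mask:
            chunks.append(data[start:i + 1])
            start, rsum = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(7)
base = bytes(random.randrange(256) for _ in range(4096))
edited = base[:2000] + b"X" + base[2000:]          # one-byte insertion

h = lambda c: hashlib.sha256(c).hexdigest()
shared = set(map(h, rolling_chunks(base))) & set(map(h, rolling_chunks(edited)))
# Chunks away from the edit hash identically and still deduplicate,
# whereas fixed-size blocks after offset 2000 would all shift and miss.
assert b"".join(rolling_chunks(base)) == base      # chunks partition the data
```

The registry-size cost the section mentions is also visible here: every chunk, not every file, needs an entry in the global index.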
07 — Research Questions
If you're a specialist considering joining this project, these are the open questions we most need help with:
1. What is the realistic global deduplication ratio? Across all cloud providers, what percentage of stored data is truly unique? Industry estimates range from 30% to 80% redundancy depending on data type. We need actual measurements or credible models.
2. Can convergent encryption be made resistant to confirmation attacks at scale? Server-aided approaches exist in literature but haven’t been deployed at global scale. What are the performance and trust tradeoffs?
3. What’s the minimal viable coordination surface? SingularVault doesn’t need every provider to adopt a full protocol. What’s the smallest change a provider needs to make to participate? Can we build this as a proxy/gateway layer that requires zero provider changes?
4. How do you handle versioning and mutability? CAS naturally handles immutable data. But real-world files change. How does the system handle file updates without losing the dedup benefits for the unchanged portions?
5. What’s the economic model that aligns all incentives? Users want cheaper storage. Providers want revenue. The planet needs less energy. Is there a model where all three can win?
6. Can zero-knowledge proofs enable dedup without any information leakage? ZKPs are computationally expensive today but improving rapidly. Could a ZKP-based protocol allow a client to prove “I have content matching hash X” without revealing anything about the content — not even the hash itself?