How GNUnet File Share stores its data securely on other people's computers

It took me a while to understand how GNUnet File Share is able to encrypt and store data on a decentralized network while keeping the storing nodes unable to decrypt it. The answer is in the GNUnet whitepaper published in 2002. Like any technical documentation, you need some background knowledge to understand it, so I will try to explain it in a simple way.

GNET (the origin of GNUnet) whitepaper

Suppose we want to store a file on someone else's computer but keep the content secret. Then we ought to encrypt it. The simplest solution is to use GPG in symmetric key mode: gpg --symmetric $MYFILE. This works for personal files, but it runs into some problems for public file sharing:

The key needs to be shared with the recipient, but the data must not be decryptable by the host storing it
Not suitable for larger files
No deduplication
Not content addressable

GNUnet does this more cleverly. Instead of encrypting the file with a user-specified key, it encrypts the file with the hash of the file, H(FILE), and then publishes the result under the name H(H(FILE)). This achieves two things. First, the file is encrypted and the host storing it can't decrypt it, because the host only sees H(H(FILE)) and can't invert the hash to recover the key H(FILE). (If H(FILE) were both the name and the encryption key, it would be trivial to decrypt.) Second, it's content addressable: the uploader knows H(FILE) and thus can query for H(H(FILE)). This also allows verification of downloaded content: we know something is wrong if the hash does not match after decrypting the download. This solves problems 1 and 4 listed above.

┌────────────────────────────────────┐
│  Publish as double hash of itself  │
│               ┌──────────────────┐ │
│ Encrypted     │                  │ │
│    with       │   YOUR FILE      │ │
│    hash       │                  │ │
│ of itself     │                  │ │
│               └──────────────────┘ │
└────────────────────────────────────┘
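Here is a minimal sketch of the scheme above in Python, just to make the roles of H(FILE) and H(H(FILE)) concrete. Everything in it is my own stand-in and not GNUnet's code: h() uses SHA-256 instead of the hash GNUnet uses, toy_cipher() is a throwaway XOR keystream that is not secure, and the store dict is a hypothetical replacement for a remote host.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash (assumption): SHA-256, not the hash GNUnet actually uses.
    return hashlib.sha256(data).digest()

def toy_cipher(key: bytes, data: bytes) -> bytes:
    # Toy XOR keystream for illustration only. NOT secure, NOT GNUnet's cipher.
    # XOR is its own inverse, so the same function encrypts and decrypts.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def publish(content: bytes):
    key = h(content)   # H(FILE): known to the uploader, shared with recipients
    name = h(key)      # H(H(FILE)): the only thing the storing host sees
    return key, name, toy_cipher(key, content)

def retrieve(key: bytes, store: dict) -> bytes:
    ciphertext = store[h(key)]             # query by H(H(FILE))
    content = toy_cipher(key, ciphertext)
    if h(content) != key:                  # verify the download after decrypting
        raise ValueError("hash mismatch: corrupted or forged content")
    return content

store = {}  # hypothetical stand-in for the host's storage
key, name, ciphertext = publish(b"hello GNUnet")
store[name] = ciphertext
print(retrieve(key, store))
```

The host only ever handles name and ciphertext. Getting from name back to key would mean inverting the hash.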

To deal with large uploads, GNUnet does not upload the whole file at once. It splits the file into blocks of 1KB, and each block is encrypted with the method above. The block hashes are then gathered into a sort of Inode, I = H(B1)+H(B2)+H(B3)...CRC32(B1+B2+B3..). Inodes can be hierarchical: a parent I points to child Inodes, which finally point to data blocks. The CRC at the end is simply there: it turns out there are exactly 4 bytes left in a 1KB block after storing 51 RIPEMD-160 hashes (the hash GNUnet uses), since 51 × 20 + 4 = 1024. The Inode is encrypted with H(I) and published under H(H(I)), the same as the file encryption method. To download the file, we just retrieve I, decrypt it, and then retrieve the blocks. This solves problem 2 by chunking the file and spreading it over the network. It also deduplicates the file: if a block with the same hash is already on the network, the host doesn't have to make a second copy. Better yet, this deduplication works both across files and within a file, which takes care of problem 3.

 Each node is           ┌───────────────────────┐
 encrypted with         │     Inode             │
 hash of itself         │                       │
 and published          ├────────┐              │
 under its              │ hashes │ CRC32 chksum │
 double hash            └─┬───┬──┴──────────────┘
                          │   │
                     ┌────┘   └───────────────────┐
                     │                            │
           ┌─────────▼─────────────┐    ┌─────────▼─────────────┐
           │     Inode             │    │     Inode             │
           │                       │    │                       │
           ├────────┐              │    ├────────┐              │
           │ hashes │ CRC32 chksum │    │ hashes │ CRC32 chksum │
           └──┬───┬─┴──────────────┘    └────────┴──────────────┘
              │   │
     ┌────────┘   └─────────────┐
┌────▼───────────┐      ┌───────▼────────┐
│ Data encrypted │      │ Data encrypted │
│  with its      │      │  with its      │  .....      ....
│   own hash     │      │   own hash     │
└────────────────┘      └────────────────┘
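To make the block and Inode layout concrete, here is a rough single-level sketch in Python. The assumptions are mine: SHA-1 stands in for RIPEMD-160 only because it is also 20 bytes and always available in hashlib, the CRC is stored big-endian, encryption of the blocks is left out for brevity, and the store dict is a hypothetical stand-in for the network.

```python
import hashlib
import zlib

BLOCK_SIZE = 1024        # 1KB blocks
HASH_SIZE = 20           # size of a RIPEMD-160 digest in bytes
HASHES_PER_INODE = 51    # 51 * 20 + 4 == 1024, so a full Inode is itself one block

def h(data: bytes) -> bytes:
    # Stand-in (assumption): SHA-1 has the same 20-byte digest size as RIPEMD-160.
    return hashlib.sha1(data).digest()

def split_blocks(data: bytes) -> list:
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def make_inode(blocks: list) -> bytes:
    # I = H(B1) + H(B2) + ... + CRC32(B1 + B2 + ...)
    if len(blocks) > HASHES_PER_INODE:
        raise ValueError("too many blocks for one Inode, a parent Inode is needed")
    crc = zlib.crc32(b"".join(blocks))
    return b"".join(h(b) for b in blocks) + crc.to_bytes(4, "big")

def store_blocks(blocks: list, store: dict) -> None:
    # Blocks are named by their double hash. A block whose name already exists
    # is not stored again, which is the deduplication described above.
    # (Encrypting each block with H(B) is omitted here.)
    for b in blocks:
        store.setdefault(h(h(b)), b)

data = b"A" * 2048 + b"B" * 1024   # two identical blocks, one different block
blocks = split_blocks(data)
store = {}
store_blocks(blocks, store)
inode = make_inode(blocks)
print(len(blocks), "blocks,", len(store), "stored,", len(inode), "byte Inode")
```

A real upload would also encrypt the Inode with H(I), publish it under H(H(I)), and build parent Inodes once a file needs more than 51 hashes.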

For the search functionality, GNUnet uses a similar scheme with some tweaks. We can't just publish the root Inode as Enc(H(Q), I) for a query Q, as I does not hash to H(Q), which causes the validation scheme to fail. Instead, the encrypted Inode Enc(H(Q), I) is published under the triple hash H(H(H(Q))), and H(H(Q)) is sent to the storage node along with it. H(H(Q)) must then be provided to the retriever to prove its correctness. This, however, does not prevent malicious nodes from precomputing their own double and triple hashes for any common query Q and returning garbage. It just makes things harder.
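The keyword scheme can be sketched like this, as I understand it from the description above. Again the pieces are my own assumptions: SHA-256 as the hash, a toy XOR keystream instead of a real cipher, a dict instead of a storage node, and a proof check done as "hashing the proof must give the name that was queried".

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash (assumption): SHA-256.
    return hashlib.sha256(data).digest()

def toy_cipher(key: bytes, data: bytes) -> bytes:
    # Toy XOR keystream for illustration only, NOT secure. Same call decrypts.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def publish_keyword(keyword: str, root_inode: bytes, store: dict) -> None:
    k1 = h(keyword.encode())   # H(Q): encryption key, never leaves the uploader
    k2 = h(k1)                 # H(H(Q)): proof handed to the storage node
    k3 = h(k2)                 # H(H(H(Q))): public name of the entry
    store[k3] = (k2, toy_cipher(k1, root_inode))

def search(keyword: str, store: dict) -> bytes:
    k1 = h(keyword.encode())
    name = h(h(k1))
    proof, ciphertext = store[name]
    if h(proof) != name:       # the reply must carry a valid H(H(Q))
        raise ValueError("storage node failed to prove it holds this entry")
    return toy_cipher(k1, ciphertext)

store = {}  # hypothetical stand-in for a storage node
publish_keyword("cat pictures", b"root inode bytes", store)
print(search("cat pictures", store))
```

Note that nothing here ties the decrypted bytes to the keyword. Anyone who guesses Q can compute all three hashes and publish whatever they like under them, which is the weakness mentioned above.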

The paper suggests other applications can utilize the same data storage scheme. I'm not 100% sure but this storage scheme seems to be the predecessor of the datastore or peerstore subsystem.

Misc thoughts

Query security

The query scheme felt kinda weak when I first read it. After giving it more thought, I think it's very difficult to improve on, for no other reason than that we have to encrypt the entry with H(Q), thus losing the ability to verify the content, all the while it could store arbitrary references to any root Inode. Heck, even a public key scheme can't help. This is likely why IPFS does not come with a search system. The only way I can think of is to include more metadata in the query content. We publish a query as Enc(H(Q), H(H(I))+H(H(B1)+Nonce)+Nonce), where Nonce is just some random data and B1 is the 1st block of the file. Since supposedly no one besides the uploader knows H(B1), we can check the validity using H(H(B1)+Nonce). This is however still not solid, as an attacker can stumble upon the root Inode (as it is likely shared publicly) and then forge H(H(B1)+Nonce). Not to mention it is much less efficient, since it needs to traverse the first branch of the Inode tree.
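For what it's worth, the payload I have in mind would look roughly like the sketch below. This is only my own idea and not anything GNUnet does. The hash (SHA-256), the 16-byte nonce, and the function names are arbitrary choices for illustration, and the outer encryption with H(Q) is left out.

```python
import hashlib
import os

HASH_LEN = 32    # SHA-256 digest size (arbitrary choice for this sketch)
NONCE_LEN = 16   # arbitrary nonce size

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def make_payload(root_name: bytes, h_b1: bytes, nonce: bytes) -> bytes:
    # H(H(I)) + H(H(B1)+Nonce) + Nonce, where "+" is concatenation
    return root_name + h(h_b1 + nonce) + nonce

def check_payload(payload: bytes, h_b1: bytes) -> bool:
    # h_b1 is H(B1) as found by walking the first branch of the Inode tree.
    tag = payload[HASH_LEN:2 * HASH_LEN]
    nonce = payload[2 * HASH_LEN:]
    return h(h_b1 + nonce) == tag

inode = b"example root inode"    # hypothetical data
b1 = b"first block of the file"  # hypothetical data
payload = make_payload(h(h(inode)), h(b1), os.urandom(NONCE_LEN))
print(check_payload(payload, h(b1)))
```

The check only passes for whoever knows H(B1), which is exactly why it falls apart once an attacker gets hold of the root Inode.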

GNUnet FS vs IPFS confidentiality

The encryption by design of GNUnet is in stark contrast to IPFS, which does not encrypt your data at the storage level. You are responsible for encrypting it beforehand. To quote the IPFS documentation:

IPFS uses transport-encryption but not content encryption. This means that your data is secure when being sent from one IPFS node to another. However, anyone can download and view that data if they have the CID. The lack of content encryption is an intentional decision. Instead of forcing you to use a particular encryption protocol, you are free to choose whichever method is best for your project. This modular design keeps IPFS lightweight and free of vendor lock-in.

IPFS encryption and privacy documentation

I don't believe that is a reasonable justification. It's OK to keep the system clean and not touch encryption at all. Just admit it. Cryptography, AES, hashing, etc. are not vendor-specific tools. This sounds like BS to me.

Martin Chang

Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

martin \at clehaxze.tw
Matrix: @clehaxze:matrix.clehaxze.tw
Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df