How GNUnet File Share stores its data securely on other people's computers
It took me a while to understand how GNUnet File Share can encrypt and store data on the decentralized network while keeping the storing nodes unable to decrypt it. The answer is in the GNUnet whitepaper published in 2002. Like any technical documentation, it takes some background knowledge to understand, so I will try to explain it in a simple way.
Suppose we want to store a file on someone else's computer but keep its content secret. We ought to encrypt it. The simplest solution is to use GPG in symmetric-key mode:
gpg --symmetric $MYFILE
This works for personal files, but runs into several problems for public file sharing:
- The key needs to be shared with every recipient, yet the data must not be decryptable by the host storing it
- Not suitable for large files
- No deduplication
- Not content addressable
GNUnet does this more cleverly. Instead of encrypting the file with a user-specified key, it encrypts the file with its own hash H(FILE), then publishes the result under the name H(H(FILE)). This achieves two things. First, the file is encrypted and the host storing it can't decrypt it (if H(FILE) were both the name and the encryption key, decryption would be trivial, hence the double hash). Second, it's content addressable: the uploader knows H(FILE) and thus can query for H(H(FILE)). It also allows verification of downloaded content: we know something is wrong if the hash does not match after decrypting the download. This solves problems 1 and 4 listed above.
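To make the scheme concrete, here is a minimal Python sketch. This is an illustration, not GNUnet's actual code: SHA-256 stands in for RIPEMD-160, and a toy hash-counter keystream stands in for the real symmetric cipher GNUnet uses.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash; GNUnet actually uses RIPEMD-160.
    return hashlib.sha256(data).digest()

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # Toy symmetric cipher: XOR with a hash-counter keystream.
    # Encryption and decryption are the same operation.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def publish(content: bytes):
    key = h(content)                    # encryption key = H(FILE)
    name = h(key)                       # published name = H(H(FILE))
    return name, xor_cipher(key, content)

def retrieve(name: bytes, ciphertext: bytes, key: bytes) -> bytes:
    assert h(key) == name               # the name commits to the key
    plaintext = xor_cipher(key, ciphertext)
    assert h(plaintext) == key          # verify the downloaded content
    return plaintext
```

The storage node only ever sees `name` and the ciphertext; without H(FILE) it can neither decrypt the data nor tell what it is storing.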
┌────────────────────────────────────┐
│  Publish as double hash of itself  │
│            ┌──────────────────┐    │
│ Encrypted  │                  │    │
│ with       │    YOUR FILE     │    │
│ hash       │                  │    │
│ of itself  │                  │    │
│            └──────────────────┘    │
└────────────────────────────────────┘
To deal with large uploads, GNUnet doesn't upload the whole file at once. It splits the file into blocks of 1KB, encrypts each block with the method above, then gathers the list of block names into a sort of Inode:
I = H(B1)+H(B2)+H(B3)...+CRC32(B1+B2+B3...)
These Inodes can be hierarchical: a parent Inode points to child Inodes, which eventually point to data blocks. The CRC at the end is mostly there to fill space: after storing 51 RIPEMD-160 hashes (the hash GNUnet uses) in a 1KB block, exactly 4 bytes are left over. The Inode is then encrypted with H(I) and published under H(H(I)), the same method as for plain files. To download the file, we retrieve I, decrypt it, and then retrieve the blocks. This solves problem 2 by chunking the file and spreading it over the network. It also deduplicates: if a block with the same hash is already on the network, the host doesn't have to store a second copy. Better yet, this deduplication works both within a file and across files.
Each node is      ┌───────────────────────┐
encrypted with    │ Inode                 │
hash of itself    │                       │
and published     ├────────┐              │
under its         │ hashes │ CRC32 chksum │
double hash       └─┬───┬──┴──────────────┘
                    │   │
         ┌──────────┘   └──────────────────────┐
         │                                     │
┌────────▼──────────────┐             ┌────────▼──────────────┐
│ Inode                 │             │ Inode                 │
│                       │             │                       │
├────────┐              │             ├────────┐              │
│ hashes │ CRC32 chksum │             │ hashes │ CRC32 chksum │
└──┬───┬─┴──────────────┘             └────────┴──────────────┘
   │   │
   │   └────────────────────┐
┌──▼─────────────┐  ┌───────▼────────┐
│ Data encrypted │  │ Data encrypted │
│ with its       │  │ with its       │  .....  ....
│ own hash       │  │ own hash       │
└────────────────┘  └────────────────┘
For the search functionality, GNUnet uses a similar scheme with some tweaks. We can't just publish the root Inode as Enc(H(Q), I) for a query Q: I does not hash to H(Q), so the validation scheme above fails. Instead, the encrypted Inode Enc(H(Q), I) is published under the triple hash H(H(H(Q))). To retrieve it, the double hash H(H(Q)) is sent to the storage node, which hashes it once more and checks that the result matches the published name. This, however, does not prevent malicious nodes from precomputing their own double and triple hashes for any common query Q and returning garbage. It just makes things harder.
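Under my reading of the scheme, the keyword lookup can be sketched like this (hypothetical keyword, SHA-256 again standing in for GNUnet's actual hash):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

query = b"gnu taler"        # the search keyword Q
key = h(query)              # H(Q): decrypts the published Enc(H(Q), I)
storage_name = h(h(key))    # H(H(H(Q))): the name the node stores under

# The retriever sends the double hash H(H(Q)). The storage node hashes it
# once more; a match proves the request targets this entry without ever
# revealing Q or the decryption key H(Q) to the node.
request = h(key)            # H(H(Q))
assert h(request) == storage_name
```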
The paper suggests other applications can reuse the same data storage scheme. I'm not 100% sure, but this storage scheme looks like the predecessor of the modern datastore and peerstore subsystems.
The query scheme felt kinda weak when I first read it. After giving it more thought, I think it's very difficult to improve on, for no other reason than that we have to encrypt with H(Q), losing the ability to verify the content, while it can store arbitrary references to any root Inode. Heck, even a public-key scheme can't help. This is likely why IPFS does not come with a search system. The only improvement I can think of is to include more metadata in the query content: publish a query as Enc(H(Q), H(H(I))+H(H(B1)+Nonce)+Nonce), where Nonce is just some random data and B1 is the 1st block of the file. Since supposedly no one besides the uploader knows H(B1), we can check validity using H(H(B1)+Nonce). This is still not solid, though: an attacker can stumble upon the root Inode (as it is likely shared publicly) and then forge H(H(B1)+Nonce). Not to mention it's much less efficient, since verification requires traversing the first branch of the Inode tree.
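Here is my proposed tweak sketched out. The values are hypothetical placeholders, + is plain byte concatenation, and the outer Enc(H(Q), ...) layer is omitted since it is unchanged from the scheme above.

```python
import hashlib, os

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

inode = b"placeholder root Inode bytes"
b1 = b"placeholder first 1KB data block"

# Uploader: bundle the root pointer with a proof tied to B1's hash.
nonce = os.urandom(16)
payload = h(h(inode)) + h(h(b1) + nonce) + nonce   # then Enc(H(Q), payload)

# Retriever: after fetching the Inode and B1, recompute the proof.
root, proof, n = payload[:32], payload[32:64], payload[64:]
assert root == h(h(inode))
assert h(h(b1) + n) == proof
```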
GNUnet FS vs IPFS confidentiality
The encryption-by-design of GNUnet is in stark contrast to IPFS, which does not encrypt your data at the storage level; you are responsible for encrypting it beforehand. Quoting the IPFS documentation:
IPFS uses transport-encryption but not content encryption. This means that your data is secure when being sent from one IPFS node to another. However, anyone can download and view that data if they have the CID. The lack of content encryption is an intentional decision. Instead of forcing you to use a particular encryption protocol, you are free to choose whichever method is best for your project. This modular design keeps IPFS lightweight and free of vendor lock-in.
I don't believe that is a reasonable justification. It's fine to keep the system clean and not touch encryption at all; just admit it. Cryptography, AES, hashing, etc. are not vendor-specific tools. This sounds like BS to me.