Two cents on the mystery of double slashes in URLs
I never expected my initial post about Gemini crawlers would turn into a long conversation between crawler developers and server developers. I just saw Sean Conner's post, through cosmos, about double slashes in URLs.
I decided to see what happens on the web. I poked a few web sites with similar “double slash” requests and I got mixed results. Most of the sites just accepted them as is and served up a page. The only site that seemed to have issues with it was Hacker News, and I'm not sure what status it returned since it's difficult to obtain the status codes from browsers.
Well, how do current Gemini servers deal with it? Pretty much like existing web servers: most just treat multiple slashes as a single slash. I think my server is the outlier here. Now the question is: how pedantic do I want to be? Is “good enough” better than “perfect?”
This is just my 2 cents on the matter. RFC 3986 section 3.3 contains the ABNF grammar for the path component of a URI, which, as I read it, requires at least one character between slashes (1* means one or more repetitions of the following element). Below is the relevant part of the grammar.
path          = path-absolute   ; begins with "/" but not "//"
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
So, by my reading, any URL with // in its path is not valid and should be rejected according to the RFC. In practice, I find that most servers instead follow the POSIX standard for path resolution and collapse multiple slashes into a single slash. POSIX states:
A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner, although more than two leading slashes shall be treated as a single slash.
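To make the POSIX rule concrete, here is a rough sketch of slash collapsing. The function name collapse_slashes is made up for illustration. Keeping exactly two leading slashes intact is just one possible choice, since POSIX leaves that case implementation-defined. It is not taken from any particular server.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: collapse runs of '/' the way POSIX path resolution
// treats them. Exactly two leading slashes are preserved here because POSIX
// leaves that case implementation-defined; that is an assumption of this
// sketch, not a requirement.
std::string collapse_slashes(std::string_view path) {
    std::string out;
    out.reserve(path.size());

    // Count the leading slashes so the special cases can be handled first.
    std::size_t lead = 0;
    while (lead < path.size() && path[lead] == '/') ++lead;

    if (lead == 2) {
        out = "//";   // implementation-defined: kept as-is in this sketch
    } else if (lead > 0) {
        out = "/";    // one slash, or three and more, become a single slash
    }

    // Copy the rest, skipping any slash that directly follows another slash.
    bool prev_slash = false;
    for (std::size_t i = lead; i < path.size(); ++i) {
        const char c = path[i];
        if (c == '/' && prev_slash) continue;
        out.push_back(c);
        prev_slash = (c == '/');
    }
    return out;
}
```

With this sketch, /usr///lib and ///usr//lib both come out as /usr/lib, while //usr/lib is left alone.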
My server, Dremini, does the same thing, but it does not depend on POSIX to collapse the slashes. Instead, it is a side effect of how I (or, to be accurate, drogon, which I help maintain) resolve .. and try to detect directory traversal attacks before serving the file: rather than writing a URL-specific path resolver, it passes the path into C++'s std::filesystem library and normalizes it. That library is documented to ignore multiple slashes.
The path name has the following syntax:
directory-separators: the forward slash character / or ... If this character is repeated, it is treated as a single directory separator: /usr///////lib is the same as /usr/lib
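A minimal sketch of that behaviour, assuming lexically_normal() is the normalization step (the exact call drogon uses may differ). normalize_path is a hypothetical name used only for this example.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: normalize a path purely lexically, without touching
// the filesystem. Repeated separators are collapsed and "." / ".." are
// resolved where possible. Assumes lexically_normal() is the normalization
// in question; this is not confirmed to be what drogon calls.
std::string normalize_path(const std::string& raw) {
    return std::filesystem::path(raw).lexically_normal().generic_string();
}
```

On a POSIX system, normalize_path("/usr///////lib") gives /usr/lib, so the double slashes disappear before the server ever looks at the file.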
I bet this is what most servers are doing, and the reason is basically twofold. First, any attempt at detecting directory traversal attacks will normalize the path, so multiple slashes get collapsed into a single slash by the respective language's path parser. Second, even if the server doesn't do that, the OS does it when resolving the path.
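The traversal check described above could be sketched like this. It is an illustration under assumptions, not the code any real server uses: resolve_under_root is a made-up name, and root is assumed to be an absolute directory path without a trailing slash.

```cpp
#include <algorithm>
#include <filesystem>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: map a URL path onto a document root and refuse
// anything that escapes the root after normalization.
// Assumption: `root` is absolute and has no trailing slash.
std::optional<fs::path> resolve_under_root(const fs::path& root,
                                           const std::string& url_path) {
    // Drop the leading "/" so the URL path is appended to the root
    // instead of replacing it.
    const fs::path rel = fs::path(url_path).relative_path();

    // Normalizing resolves ".." and, as a side effect, collapses "//".
    const fs::path full = (root / rel).lexically_normal();
    const fs::path base = root.lexically_normal();

    // Every component of the root must prefix the resolved path.
    const auto [base_it, full_it] =
        std::mismatch(base.begin(), base.end(), full.begin(), full.end());
    if (base_it != base.end()) {
        return std::nullopt;  // escaped the document root
    }
    return full;
}
```

Note that the slash collapsing is never asked for explicitly; it just falls out of the normalization that the traversal check needs anyway.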
In short, I think you are totally good to reject any URLs with // in them. Most servers are not compliant with RFC 3986 in this regard. But since the mental model of a URL is basically protocol://hostname/path, it's also reasonable to make the path follow what POSIX does. And it's extra work to reject those URLs, without any security benefits.
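For a server that does choose to reject, the check could look roughly like the sketch below. path_has_double_slash is a hypothetical name, and the parsing is deliberately naive: it assumes an absolute URL of the form scheme://authority/path and ignores everything after ? or #.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: report whether the path component of an absolute URL
// contains an empty segment ("//"). Assumes the input looks like
// scheme://authority/path[?query][#fragment]; this is not a full URL parser.
bool path_has_double_slash(std::string_view url) {
    constexpr auto npos = std::string_view::npos;

    // Cut off the query and fragment so a "//" there is not counted.
    const std::size_t tail = url.find_first_of("?#");
    if (tail != npos) url = url.substr(0, tail);

    // Skip the "//" that separates the scheme from the authority.
    const std::size_t scheme_end = url.find("://");
    const std::size_t authority_start = (scheme_end == npos) ? 0 : scheme_end + 3;

    // The path starts at the first '/' after the authority.
    const std::size_t path_start = url.find('/', authority_start);
    if (path_start == npos) return false;  // no path at all

    return url.substr(path_start).find("//") != npos;
}
```

A server would call this on the request URL before doing any path normalization and answer with an error when it returns true.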
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict
I run TLGS, a major search engine on Gemini. Used by Buran by default.
- marty1885 \at protonmail.com
- Matrix: firstname.lastname@example.org
- Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df