AWK the ultimate swiss army knife for data

I want AWK to be more well known. It was treated as soem weird-ass tool that somehow works but no one knows why. It's all over the place on StackOverflow whenever we need to select a reagon of a file or something. I didn't apprishate this in the beginning. I was forced to learn sed and awk back in collage. Now I absolutely love awk. It's a standard tool shipped on every UNIX system (so, anywhere besides Windows). Being readily available and easy to use. It's a great tool to have in your toolbox if you work with any kind of data.

Dumb bulk text processing

I was writing some code to check the sanity of TLGS's database today. One of the things I need is a list of capsule root URLs. This is doable in SQL. But the query is too complicated for the task. Just querying for unique host and port combination. Then dump the URL to a file. Default ports shall be handled so I don't have to be annoyed by them. I want gemini:// not gemini:// For the SQL query, it's easy.

SELECT DISTINCT domain_name, port FROM pages;

Then CASE could be used to expand and handle default ports.

SELECT DISTINCT 'gemini://' || CASE WHEN port=1965 THEN domain_name
    WHEN port<>1965 THEN CONCAT_WS('', domain_name,':',port) END || '/' AS root_url FROM pages;

But heck that's long and ugly. Espically when I have to type it in the terminal. Welp, I need the data locally anyway. Just save it into a CSV and use AWK.

\copy (SELECT DISTINCT domain_name, port FROM pages) TO 'capsule.csv' WITH (FROMAT CSV);

Then a short one line AWK command is all I need.

> awk -F, '{print "gemini://" $1 ($2==1965 ? "" : ":" $2) "/" }' capsule.csv > urls.txt

To compare, Python is many lines.

import pandas as pd
df = pd.read_csv('capsule.csv'. names=['domain_name', 'port'])
df['root_url'] = 'gemini://' + df['domain_name'] + (df['port']==1965 ? '' : ':' + df['port'].astype(str)) + '/'
df['root_url'].to_csv('urls.txt', index=False)

Searching through file for results

Once, I helped a friend studing bio tech with their research. They had a bunch of CSV, TSV and raw gene match data. Their professor wanted them to parse and proces the stuff in C. Yes, that C. And on FreeBSD. The python version on there was old and pip isn't helpful. AWK to the rescue again. First task, A TSV of a gene to protien match was given. Several genes are submitted for the search. Column 0 is the gene name, column 7 is the match target. And the TSV was pre-sorted by matching rate. The professor wanted to know which gene best matches to which protien. A quick AWK script does the job.

> awk -F'\t' 'NR!=0 { if($1 in tbl == 0){print $0, $7; tbl[$1]=1} }' gene_match.tsv

Next, combing trough the raw output of a genome match. They submitted some samples of their own. Which are matched against known human genes. However, the matching service only provides a raw output. Tabulated TSV costs much extra. Sure! AWK can do that too. This time the field split won't be helpful. We'll just rely on AWK's awesome text processing power.

(script altered to not reveal which service was used)

> cat match_log | awk '/GNOMID=/ {gene_id=gensub(/GNOMID=(\w+)/, "\\1", "g");}\
    /ENSP/{print gene_id, gensub(/ENSP=(\w+) (.*)/, "\\1,\\2", "g")}'

Processing system information

One quick final one. I run fail2ban on my homelab to block bots knocking on SSH. They can't harm me. My kung-fu is good enough to not get pwned by that. But I do hate te feeling of knowing someone is on my doorsteps. After getting fail2ban running. I want to know which country is attacking me. I want some quick commands that I can string together to get the info I want. AWK again is the right tool.

> sudo iptables -L -n | # get all rules
    awk '$1 == "REJECT" && $5 == "" {print($4)}' | # filter out the banned IPs
    xargs -L 1 geoiplookup | # get the country info
    awk '{gsub(",", "", $4); print $4}' | # Get the country code
    sort | uniq -c | sort -n -r # count the occurance
    365 US
    157 IN
    145 SG
    103 BR
     86 DE
     86 CN
     82 RU
     82 HK

IMO, Python and R is great. But when it comes to more unstructured data or just something for the command line. AWK is the way to go. It's a great tool to have in your toolbox. AWK is everywhere, on every UNIX. Learn it.

Author's profile. Photo taken in VRChat by my friend Tast+
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

  • marty1885 \at
  • GPG: 76D1 193D 93E9 6444
  • Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df