PSA: Arch Linux ARM zsh 5.9-5 causes Oh My Zsh to hang on initialization
I upgraded my home server today and then, seemingly at random, found I couldn't SSH into it. The BMC console was no better: it let me type and accepted my password, but never showed me the prompt. I feared my system had been compromised somehow. Through pure guessing, I found that the issue was somewhere in zsh, and switching to bash allowed me to log in. From there, a series of debugging steps led me to the actual issue.
TL;DR
DO NOT upgrade to zsh-5.9-5-aarch64 if you are on Arch Linux ARM. It will hang Oh My Zsh, the very common shell configuration framework, on initialization. And no one knows what other bugs it might have.
The story
My server is a HoneyComb LX2K running Arch Linux ARM, located in my dad's house.
Around 4:30 PM local time, I tried to connect VSCode to my home server to get some work done. That failed with a timeout. Then I tried SSH. SSH ran and showed the last login information as it always does, but then hung. Hitting Ctrl-C did not drop me back into the default zsh (without any rc files). For security reasons, I have disabled SSH login as root and disabled the root password, so I was locked out of the only user that can do anything. My immediate thought was that my system had been compromised somehow. I tried the other route for logging in: the BMC console. The HoneyComb LX2K is really a high-end comms board rather than a proper server, so instead of IPMI it has OpenBMC on an MCU that doubles as a USB-to-serial adapter. Luckily, I had connected the USB side to another dev board for unrelated reasons. That login attempt... failed, with the same symptoms as SSH. I could type in the password, but it wouldn't show me the prompt.
My system might be compromised, I thought. While panicking, I saw a tiny nina% on the screen (nina is the hostname). Somehow one Ctrl-C had worked. I immediately grabbed the chance, ran sudo su into root, and started tmux to keep the only working console from locking up. That worked. So either the attacker isn't smart enough to ban my only user from sudo, or something else is wrong. Either way, I made up 2 new passwords and changed both root's and my user's password, just in case the attacker was still in the system.
With tmux, I could freely experiment without fear of losing the only working console. I ran su myuser (ofc myuser is not the actual username) and that hung with the same symptoms. I spawned a new tmux window and tried again and again: sudo -u myuser zsh, doas -u myuser zsh (yes, I have doas as a backup in case sudo breaks), sudo -u myuser bash, etc. All of them hung. It was frustrating, but I realized that PAM couldn't be the issue. Otherwise I wouldn't be able to sudo anywhere, including to root.
"Maybe it's zsh?" I thought. I ran chsh myuser to switch to bash, then su myuser. Boom, that worked! Interesting. I ran zsh, and that hung. So it's zsh? At least, given the evidence, it's very unlikely that I'm compromised.
Now tmux could be ditched: I tried SSH again and it logged into bash without any issues. Time to debug zsh. My first instinct was strace. That gave the following output, ending in a ppoll with a curious SIGCHLD right before it. Huh?
```
read(13, "zsh\n", 64) = 4
read(13, "", 64) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [WINCH], 8) = 0
close(13) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD WINCH], 8) = 0
kill(23338, 0) = 0
rt_sigprocmask(SIG_SETMASK, [INT], [CHLD WINCH], 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23338, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [INT CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [INT CHLD], ~[KILL STOP RTMIN RT_1], 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED|WCONTINUED, {ru_utime={tv_sec=0, tv_usec=939}, ru_stime={tv_sec=0, tv_usec=0}, ...}) = 23338
getrusage(RUSAGE_CHILDREN, {ru_utime={tv_sec=0, tv_usec=939}, ru_stime={tv_sec=0, tv_usec=0}, ...}) = 0
wait4(-1, 0xffffd0474c9c, WNOHANG|WSTOPPED|WCONTINUED, 0xffffd0474cb8) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[INT]}) = 0
ppoll(NULL, 0, NULL, NULL, 2
```
Right. Let me bisect .zshrc... Ohh, it hung in ~/.oh-my-zsh/on-init.sh, which is the first file Oh My Zsh sources. I tried removing every plugin and switching themes... nope. How about deleting the whole .oh-my-zsh and .zshrc? Please don't tell me zsh itself is broken... Right, it's not: I still got the basic prompt and could run commands. Reinstall oh-my-zsh? It hung right after the installation finished and dropped me into the new zsh shell.
Great... something in oh-my-zsh isn't working. Let's see what zsh -x (zsh's xtrace/debug mode) says. It hung somewhere in the same file: it printed a failure to load the zsh/datetime module and then hung on an fstat against the omz lock file ~/.oh-my-zsh/log/update.lock. But that file doesn't exist during the hang, and getting its status shouldn't hang anyway. Everything is just weird. By now it was around 8:10 PM and I had to get some dinner, though I really wanted to get this stupid little issue fixed.
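If you end up in the same spot, here is a minimal sketch of one way to reproduce a startup hang without wedging the terminal you are typing in: run the shell interactively with xtrace on, under GNU coreutils' timeout, and only look at the tail of the trace if the time limit was hit. The function name trace_startup, the 10-second default and the log path are made up for illustration. It assumes timeout is installed and that the shell accepts -x, -i and -c, which zsh, bash and dash all do.

```shell
#!/bin/sh
# Hypothetical helper: start a shell interactively with xtrace enabled, under a
# time limit, so a startup hang cannot lock up the terminal running the check.
# Assumes GNU coreutils' timeout, which exits with 124 when the limit is hit.
trace_startup() {
    sh_path=$1
    limit=${2:-10}
    log=${3:-/tmp/startup-trace.log}

    # -x traces each command to stderr (captured in $log), -i makes the shell
    # interactive so it reads its rc files, and -c exit makes a healthy shell
    # quit as soon as startup is done. stdin is /dev/null, not a terminal.
    timeout "$limit" "$sh_path" -x -i -c exit </dev/null >/dev/null 2>"$log"
    rc=$?

    if [ "$rc" -eq 124 ]; then
        # The last traced commands are the best hint of where startup stalled.
        echo "$sh_path hung during startup; last traced lines:"
        tail -n 20 "$log"
    fi
    return "$rc"
}
```

For the situation above this would be something like trace_startup zsh 10 /tmp/zsh-trace.log, run from a bash login. Since -i makes zsh interactive, it should still read ~/.zshrc and therefore load Oh My Zsh. Because stdin is /dev/null here, a hang that only happens on a real terminal might not reproduce this way.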
After some food and more fruitless staring at zsh logs, I started to question myself. Is it really oh-my-zsh? I quickly listed the pacman package cache and found two versions of zsh: 5.9-4-aarch64 and 5.9-5-aarch64. I don't even remember when I upgraded zsh. I prayed and downgraded zsh to the older version, hoping the breakage had come in with some recent upgrade. Apparently some deities did hear me. Running zsh no longer hung, and I switched everything back to normal.
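For the downgrade itself, here is a small sketch that lists which builds of a package are still in the pacman cache. It assumes pacman's default cache directory /var/cache/pacman/pkg and the usual name-pkgver-pkgrel-arch.pkg.tar.* file naming. The helper name cached_versions is hypothetical, not a pacman command.

```shell
#!/bin/sh
# Hypothetical helper: list the builds of a package still in pacman's cache.
# Assumes the default cache directory and the usual
# <pkgname>-<pkgver>-<pkgrel>-<arch>.pkg.tar.* file names.
cached_versions() {
    pkg=$1
    cache=${2:-/var/cache/pacman/pkg}

    # For pkg=zsh the pattern is "^zsh-[0-9]", which matches zsh-5.9-4-...
    # but not zsh-completions-...; detached *.sig signature files are
    # dropped, and sort -V (GNU coreutils) orders the rest by version.
    ls "$cache" | grep "^$pkg-[0-9]" | grep -v '\.sig$' | sort -V
}
```

On a machine in the state mine was in, cached_versions zsh should list both the 5.9-4 and the 5.9-5 files, and the older one can then be reinstalled as root with pacman -U and its full path. If you want the downgrade to stick, pacman.conf's IgnorePkg option (for example IgnorePkg = zsh in the [options] section) makes pacman -Syu skip zsh until you remove that line again.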
I finally got everything under control again around 8:45 PM. So yeah, if you are on Arch Linux ARM, don't upgrade to zsh-5.9-5-aarch64.
Also, how the heck did upstream not catch this!? Argh!
Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes I do AI research. Chronic VRChat addict
I run TLGS, a major search engine on Gemini. It is used by Buran by default.
- marty1885 \at protonmail.com
- Matrix: @clehaxze:matrix.clehaxze.tw
- Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df