So I’ve made it happen on another system of a similar vintage.
Steps to reproduce:
- Import https://github.com/phacility/arcanist into Phabricator.
- Ensure the key for a user that can access the local Phabricator repo is the default inside the running user’s
.ssh
directory.
- Run
cd /dev/shm/
rc=0; while [[ $rc -eq 0 ]]; do rm -rf arcanist/; date; git clone -n git@localhost:diffusion/ARCANIST/arcanist.git; rc=$?; date; done; date
Expected results:
Git cloning to loop forever.
Actual results:
After around 20 minutes - an hour the git clone hangs part way through:
[...]
Thu Jan 3 17:45:59 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
remote: Total 20244 (delta 12083), reused 20244 (delta 12083)
Receiving objects: 100% (20244/20244), 8.33 MiB | 23.97 MiB/s, done.
Resolving deltas: 100% (12083/12083), done.
Thu Jan 3 17:46:00 UTC 2019
Thu Jan 3 17:46:00 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
Receiving objects: 93% (18827/20244)
How reproducible is the problem?
This problem happens every time if you wait for long enough (although it seems to happen even more rapidly on some other heavily loaded systems).
Version information:
Phabricator 37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36
Ubuntu 18.04
OpenSSH_7.6p1 Ubuntu-4ubuntu0.1, OpenSSL 1.0.2n 7 Dec 2017
PHP 7.2.10-0ubuntu0.18.04.1
Additional information:
We are running Phabricator inside a container with an AWS RDS MySQL database. When the hang occurs the processes look like this:
[...]
root 722 0.0 0.0 18508 3488 pts/0 Ss+ 17:30 0:00 /bin/bash
root 18950 0.0 0.0 89812 4004 pts/0 Tl 17:46 0:00 git clone -n git@localhost:diffusion/ARCANIST/arcanist.git
root 18951 0.0 0.1 47544 8364 pts/0 T 17:46 0:00 /usr/bin/ssh git@localhost git-upload-pack 'diffusion/ARCANIST/arcanist.git'
root 18952 0.0 0.0 74656 6660 ? SNs 17:46 0:00 sshd: git [priv]
git 18955 0.0 0.0 74656 3424 ? SN 17:46 0:00 sshd: git@notty
git 18956 0.0 0.0 4628 788 ? SNs 17:46 0:00 sh -c '/opt/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'phabricatoruser' '--phabricator-ssh-key' '32'
git 18957 0.0 0.5 300976 44464 ? SN 17:46 0:00 php /opt/phabricator/bin/ssh-exec --phabricator-ssh-user phabricatoruser --phabricator-ssh-key 32
root 18965 0.0 0.0 19264 5948 pts/0 T 17:46 0:00 /usr/lib/git-core/git index-pack --stdin -v --fix-thin --keep=fetch-pack 18950 on 0ec65f5b6b82 --check-self-contained-and-connected
running strace
on pid 18957 (but “outside” the container where its pid is 24657) only shows this:
strace: Process 24657 attached
select(6, [5], [], [5], {tv_sec=0, tv_usec=717067}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192) = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192) = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192) = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192) = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
over and over.
The problem also happens when using PHP 5.6.39-1+ubuntu18.04.1+deb.sury.org+1 (cli). On this run looking at strace’s output shows this:
[...]
select(6, [5], [], [5], {tv_sec=0, tv_usec=432198}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
[...]
Digging about shows this:
# ls -l /proc/20979/fd
total 0
lr-x------ 1 user user 64 Jan 4 12:17 0 -> 'pipe:[37268584]'
l-wx------ 1 user user 64 Jan 4 12:17 1 -> 'pipe:[37268585]'
l-wx------ 1 user user 64 Jan 4 12:17 2 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan 4 12:17 3 -> 'socket:[37268592]'
lrwx------ 1 user user 64 Jan 4 12:17 4 -> 'socket:[37268596]'
lr-x------ 1 user user 64 Jan 4 12:17 5 -> 'pipe:[37268584]'
lrwx------ 1 user user 64 Jan 4 12:17 6 -> 'socket:[37268226]'
l-wx------ 1 user user 64 Jan 4 12:17 7 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan 4 12:17 8 -> 'socket:[37268598]'
lrwx------ 1 user user 64 Jan 4 12:17 9 -> 'socket:[37268600]'
and grepping lsof
for 37268584 (which was mentioned on descriptor 5 above) shows this:
# lsof | grep '37268584'
sshd 20977 user 8w FIFO 0,12 0t0 37268584 pipe
sh 20978 user 0r FIFO 0,12 0t0 37268584 pipe
php 20979 user 0r FIFO 0,12 0t0 37268584 pipe
php 20979 user 5r FIFO 0,12 0t0 37268584 pipe
So it looks like sshd
is the write end of the pipe.
On another run digging around shows the PHP callgraph as apparently being
DiffusionSSHWorkflow->execute() phabricator/src/applications/diffusion/ssh/DiffusionSSHWorkflow.php:188
DiffusionGitUploadPackSSHWorkflow->executeRepositoryOperations() phabricator/src/applications/diffusion/ssh/DiffusionGitUploadPackSSHWorkflow.php:83
DiffusionGitSSHWorkflow->waitForGitClient() phabricator/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php:44
PhutilChannel->waitForAny() libphutil/src/channel/PhutilChannel.php:100
PhutilChannel->waitForActivity() libphutil/src/channel/PhutilChannel.php:197
libphutil/src/channel/PhutilChannel.php:
128 public static function waitForActivity(
129 array $reads,
130 array $writes,
131 array $options = array()) {
[...]
194 $wait_sec = (int)$wait;
195 $wait_usec = 1000000 * ($wait - $wait_sec);
196
197 @stream_select($read, $write, $except, $wait_sec, $wait_usec);
198 }
Searching the web the only thing that sounds similar to this is the long fixed https://secure.phabricator.com/T9724 (“ssh-exec” sometimes hangs when git cloning larger repos)…