Sporadic git cloning hang over ssh


#1

Observed Behavior:
git clone over ssh sporadically hangs.

Expected Behavior:
git clone over ssh should not hang (unless the network is broken, etc.).

Phabricator Version:
37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36

SSH Version:
OpenSSH_7.6p1 Ubuntu-4ubuntu0.1, OpenSSL 1.0.2n 7 Dec 2017

Reproduction Steps:

rc=0
while [[ $rc -eq 0 ]]; do
  rm -rf libedit/; date
  git/bin/git clone -n git@remotehost:diffusion/LIBEDIT/libedit
  rc=$?
  date
done
date

#2

Your Phabricator install is very old, and your reproduction steps don’t reproduce for me.


#3

I agree that the version of Phabricator being used isn’t the freshest. Bearing that in mind, is anyone aware of something that has since been fixed that might explain this? Better yet, is there some sort of logging that could be enabled that might explain it? Normal logging from sshd doesn’t seem to offer any insight, and the problem happens even when the clone goes via localhost.
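
(For reference, client-side SSH debugging can be turned up without touching sshd, assuming git 2.3 or newer for GIT_SSH_COMMAND:

GIT_SSH_COMMAND='ssh -vvv' git clone -n git@localhost:diffusion/LIBEDIT/libedit

but that only shows the transport setup on the client, not what the server-side ssh-exec process is doing.)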


#4

We implement the Git protocol by just running git, so we can’t add logging to git index-pack.

Try cloning from the host using git clone file:///path/to/repo instead of SSH. No Phabricator code runs when you do this, so if the issue still reproduces, it is not a problem with Phabricator.
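
For example, a loop mirroring the reproduction steps above (the /var/repo/83 path is illustrative; substitute the repository’s actual on-disk storage path on your install):

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n file:///var/repo/83 libedit; rc=$?; date; done; date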

If the issue does not reproduce using file:///, we need reproduction steps that work for us in order to move forward. This does not reproduce for me, either.

See also https://secure.phabricator.com/T13111, perhaps.


#5

So I actually tried cloning inside the Docker container that Phabricator is running in:

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n git@localhost:diffusion/LIBEDIT/libedit; rc=$?; date; done; date

and the problem still occurs.

I set up an alternate sshd that didn’t call out to Phabricator:

# NOTE: You must have OpenSSHD 6.2 or newer; support for AuthorizedKeysCommand
# was added in this version.

# NOTE: Edit these to the correct values for your setup.

# You may need to tweak these options, but mostly they just turn off everything
# dangerous.

Port 2222
Protocol 2
PermitRootLogin yes
AllowAgentForwarding no
AllowTcpForwarding no
PrintMotd no
PrintLastLog no
PasswordAuthentication no
UseDNS no

PidFile /dev/shm/sshd-phabricator.pid
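
One way to run this alternate daemon in the foreground, with its log on stderr (the config path here is illustrative):

# /usr/sbin/sshd -f /etc/ssh/sshd_config.alternate -D -e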

(I also had to set up keys, etc.) Running

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n ssh://user@localhost:2222/var/repo/83 libedit; rc=$?; date; done; date

never manifested the issue.


#6

The repo in question (/var/repo/83) was manually repacked yesterday, and its actual contents haven’t changed in about two years, so I’d be surprised if this is a git prune issue…


#7

Sounds like this might be an issue with Phabricator then. We’re happy to look at this in more detail as soon as we have a working reproduction case.


#8

So I’ve made it happen on another system of a similar vintage.

Steps to reproduce:

  1. Import https://github.com/phacility/arcanist into Phabricator.
  2. Ensure that the default key in the running user’s .ssh directory belongs to a user who can access the local Phabricator repository.
  3. Run
cd /dev/shm/
rc=0; while [[ $rc -eq 0 ]]; do rm -rf arcanist/; date; git clone -n git@localhost:diffusion/ARCANIST/arcanist.git; rc=$?; date; done; date

Expected results:
Git cloning loops forever.

Actual results:
After around 20 minutes to an hour, the git clone hangs partway through:

[...]
Thu Jan  3 17:45:59 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
remote: Total 20244 (delta 12083), reused 20244 (delta 12083)
Receiving objects: 100% (20244/20244), 8.33 MiB | 23.97 MiB/s, done.
Resolving deltas: 100% (12083/12083), done.
Thu Jan  3 17:46:00 UTC 2019
Thu Jan  3 17:46:00 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
Receiving objects:  93% (18827/20244)

How reproducible is the problem?
This problem happens every time if you wait long enough (although it seems to happen even more rapidly on some other, heavily loaded systems).

Version information:
Phabricator 37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36
Ubuntu 18.04
OpenSSH_7.6p1 Ubuntu-4ubuntu0.1, OpenSSL 1.0.2n 7 Dec 2017
PHP 7.2.10-0ubuntu0.18.04.1

Additional information:
We are running Phabricator inside a container with an AWS RDS MySQL database. When the hang occurs the processes look like this:

[...]
root       722  0.0  0.0  18508  3488 pts/0    Ss+  17:30   0:00 /bin/bash
root     18950  0.0  0.0  89812  4004 pts/0    Tl   17:46   0:00 git clone -n git@localhost:diffusion/ARCANIST/arcanist.git
root     18951  0.0  0.1  47544  8364 pts/0    T    17:46   0:00 /usr/bin/ssh git@localhost git-upload-pack 'diffusion/ARCANIST/arcanist.git'
root     18952  0.0  0.0  74656  6660 ?        SNs  17:46   0:00 sshd: git [priv]
git      18955  0.0  0.0  74656  3424 ?        SN   17:46   0:00 sshd: git@notty
git      18956  0.0  0.0   4628   788 ?        SNs  17:46   0:00 sh -c '/opt/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'phabricatoruser' '--phabricator-ssh-key' '32'
git      18957  0.0  0.5 300976 44464 ?        SN   17:46   0:00 php /opt/phabricator/bin/ssh-exec --phabricator-ssh-user phabricatoruser --phabricator-ssh-key 32
root     18965  0.0  0.0  19264  5948 pts/0    T    17:46   0:00 /usr/lib/git-core/git index-pack --stdin -v --fix-thin --keep=fetch-pack 18950 on 0ec65f5b6b82 --check-self-contained-and-connected

Running strace on PID 18957 (from “outside” the container, where its PID is 24657) only shows this:

strace: Process 24657 attached
select(6, [5], [], [5], {tv_sec=0, tv_usec=717067}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)

over and over.
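
As an aside, one way to find the host PID for a process inside the container (assuming a 4.1+ kernel, which exposes NSpid in /proc) is to search the status files on the host; each NSpid line lists the host PID followed by the namespace PID:

# grep NSpid: /proc/[0-9]*/status | grep -w 18957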

The problem also happens when using PHP 5.6.39-1+ubuntu18.04.1+deb.sury.org+1 (cli). On this run, strace’s output shows this:

[...]
select(6, [5], [], [5], {tv_sec=0, tv_usec=432198}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
[...]

Digging about shows this:

# ls -l /proc/20979/fd
total 0
lr-x------ 1 user user 64 Jan  4 12:17 0 -> 'pipe:[37268584]'
l-wx------ 1 user user 64 Jan  4 12:17 1 -> 'pipe:[37268585]'
l-wx------ 1 user user 64 Jan  4 12:17 2 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan  4 12:17 3 -> 'socket:[37268592]'
lrwx------ 1 user user 64 Jan  4 12:17 4 -> 'socket:[37268596]'
lr-x------ 1 user user 64 Jan  4 12:17 5 -> 'pipe:[37268584]'
lrwx------ 1 user user 64 Jan  4 12:17 6 -> 'socket:[37268226]'
l-wx------ 1 user user 64 Jan  4 12:17 7 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan  4 12:17 8 -> 'socket:[37268598]'
lrwx------ 1 user user 64 Jan  4 12:17 9 -> 'socket:[37268600]'

and grepping lsof for 37268584 (the pipe inode behind descriptor 5 above) shows this:

# lsof | grep '37268584'
sshd      20977                 user    8w     FIFO               0,12       0t0   37268584 pipe
sh        20978                 user    0r     FIFO               0,12       0t0   37268584 pipe
php       20979                 user    0r     FIFO               0,12       0t0   37268584 pipe
php       20979                 user    5r     FIFO               0,12       0t0   37268584 pipe

So it looks like sshd holds the write end of the pipe.
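
For what it’s worth, the same holders of a pipe inode can be found without lsof by matching fd symlink targets under /proc directly:

# find /proc/[0-9]*/fd -lname 'pipe:\[37268584\]' 2>/dev/null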

On another run, digging around shows the PHP callgraph as apparently being:

DiffusionSSHWorkflow->execute() phabricator/src/applications/diffusion/ssh/DiffusionSSHWorkflow.php:188 
 DiffusionGitUploadPackSSHWorkflow->executeRepositoryOperations() phabricator/src/applications/diffusion/ssh/DiffusionGitUploadPackSSHWorkflow.php:83 
  DiffusionGitSSHWorkflow->waitForGitClient() phabricator/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php:44
   PhutilChannel->waitForAny() libphutil/src/channel/PhutilChannel.php:100 
    PhutilChannel->waitForActivity() libphutil/src/channel/PhutilChannel.php:197 

libphutil/src/channel/PhutilChannel.php:

128   public static function waitForActivity(
129     array $reads,
130     array $writes,
131     array $options = array()) {
[...]
194     $wait_sec = (int)$wait;
195     $wait_usec = 1000000 * ($wait - $wait_sec);
196 
197     @stream_select($read, $write, $except, $wait_sec, $wait_usec);
198   }
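
If anyone wants to double-check this, the native side of the hang can be confirmed by attaching gdb to the stuck PHP process and dumping the C stack (it should be parked in select(); mapping the frames back to PHP code generally needs the interpreter’s debug symbols):

# gdb -p 20979 -batch -ex 'bt'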

Searching the web, the only thing that sounds similar to this is the long-since-fixed https://secure.phabricator.com/T9724 (“ssh-exec” sometimes hangs when git cloning larger repos)…


#11

I have made this happen with a clean instance using:
37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36
I then upgraded that instance to
Phabricator c7656312ec4207b39fe3962fb350c9c33f17b4a3 (stable) Promote 2019 Week 1
Arcanist 97ddb9d5a1be282d6002a875a759266bb97b653f (stable) Promote 2018 Week 51
libphutil a537ba03c994eca87698cc4b95d4db4570edc665 (stable) Promote 2018 Week 50

and the problem continued to occur (on this occasion it took 14 minutes on a machine with 80 cores, an Intel SSD, and 256 GB of RAM).

@avivey, given that I reproduced the issue on the latest stable Phabricator, can you retry the reproduction steps from Sporadic git cloning hang over ssh (reply 8) for at least an hour and report your results?


#12

I let this run for about an hour and a half:

$ while [ true ]; do date; let "COUNT++"; echo $COUNT; rm -rf libphutil/; git clone ssh://secure@secure.phabricator.com/source/libphutil.git; done

It cloned 950 times without hanging:

...
Mon Jan  7 11:35:59 PST 2019
950
Cloning into 'libphutil'...
# Fetch received by "secure004.phacility.net", forwarding to cluster host.
# Acquiring read lock for repository "libphutil" on device "secure002.phacility.net"...
# Acquired read lock immediately.
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Cleared to fetch on cluster host "secure002.phacility.net".
remote: Enumerating objects: 15572, done.
remote: Counting objects: 100% (15572/15572), done.
remote: Compressing objects: 100% (6336/6336), done.
remote: Total 15572 (delta 10457), reused 13214 (delta 8898)
Receiving objects: 100% (15572/15572), 5.90 MiB | 9.90 MiB/s, done.
Resolving deltas: 100% (10457/10457), done.
...

#13

@epriestley OK, fair enough; let me see if I can reproduce it with libphutil.


#14

I’ve just tried with libphutil here and reproduced the hang:

# mkdir /dev/shm/gitclonehang
# cd /dev/shm/gitclonehang
# COUNT=0; START_DATE="$(date)"; while [ true ]; do date; let "COUNT++"; echo $COUNT; rm -rf libphutil/; git clone ssh://git@localhost/diffusion/LIBPHUTIL/libphutil.git; done
Tue Jan  8 06:03:45 UTC 2019
1
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
remote: Total 15572 (delta 9240), reused 15572 (delta 9240)
Receiving objects: 100% (15572/15572), 8.38 MiB | 20.44 MiB/s, done.
Resolving deltas: 100% (9240/9240), done.
Tue Jan  8 06:03:46 UTC 2019
2
[...]
36
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
remote: Total 15572 (delta 9240), reused 15572 (delta 9240)
Receiving objects: 100% (15572/15572), 8.38 MiB | 22.19 MiB/s, done.
Resolving deltas: 100% (9240/9240), done.
Tue Jan  8 06:04:29 UTC 2019
37
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
Receiving objects:  62% (9655/15572)
[hang]

This was with a freshly reset instance of “Phabricator c7656312ec4207b39fe3962fb350c9c33f17b4a3 (stable) Promote 2019 Week 1”. My Phabricator is running on an up-to-date Ubuntu 18.04.

Differences between your Phabricator instance and mine that I can see:

  • You’re using a clustered Phabricator whereas I’m not.
  • I’m cloning over localhost (to rule out networking issues), whereas it’s less clear whether you are doing the same.
  • Your libphutil repo seems to be smaller/better compressed than mine.
  • This particular Phabricator was likely less busy than yours (in this case no one but me was using this Phabricator instance).

#15

I’ve since made the issue happen in a Debian 9 container too. I’m now working on making the environment that demonstrates the issue easier to obtain…