Git-upload-pack pids not being removed, running out of pids

Reproduction Instructions
Complete steps which allow someone else who does not have access to your environment to reproduce the bug.

  1. Install the SmallStep step-ssh package, version 0.19.0-1 or later, on the host. (Smallstep SSH: Single Sign-On SSH With Zero Key Management)

  2. Have a user account set up for Git access over SSH.

  3. After installing SmallStep on your host, place a script at /usr/local/bin/step-ssh-ctl-ignore-git with these contents:

#!/usr/bin/env bash

# Skip step-ssh-ctl when the requesting user (PAM_RUSER) is "git".
if [ "$PAM_RUSER" != "git" ]; then
    /usr/bin/step-ssh-ctl "$@"
fi

  4. Modify the SmallStep section added to /etc/pam.d/sudo so that the bottom of the file reads:

# Added by Smallstep Step SSH.
session    optional   pam_exec.so /usr/local/bin/step-ssh-ctl-ignore-git session

This lets SmallStep keep managing the regular system sshd while ignoring SSH operations that go through the Phabricator sshd you run for git/hg.

  5. Perform git operations (push, fetch, clone) as normal.

You should begin to see entries like these in the process list:

git      23964  0.0  0.0   4640   784 ?        Ss   06:23   0:00 sh -c '/opt/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'FooBar' '--phabricator-ssh-ke
git      23965  0.0  0.1 319848 41768 ?        S    06:23   0:04 php /opt/phabricator/bin/ssh-exec --phabricator-ssh-user FooBar --phabricator-ssh-key 53
git      24057  0.0  0.0   4640   896 ?        S    06:23   0:00 sh -c '/usr/bin/sudo' -E -n -u 'daemon-user' -- git-upload-pack -- '/var/repo/142/'
root     24060  0.0  0.0 1144728 7912 ?        Sl   06:23   0:00 /usr/bin/sudo -E -n -u daemon-user -- git-upload-pack -- /var/repo/142/
daemon-+ 24174  0.0  0.0      0     0 ?        Z    06:23   0:00 [git-upload-pack] <defunct>

These stick around and are never reaped.

This has caused us to run out of pids twice in the past four months, requiring a reboot each time.
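
A zombie (state Z, shown as <defunct>) holds on to its PID until its parent waits on it, so counting defunct entries per parent PID shows which process is failing to reap. The following is a minimal diagnostic sketch and not part of the original report: zombies_by_parent is a hypothetical helper name, and it assumes the column layout produced by ps -eo pid,ppid,stat,comm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -eo pid,ppid,stat,comm` output on stdin and
# print "<parent pid> <zombie count>" lines, busiest parent first.
zombies_by_parent() {
    # Skip the header row; a STAT value starting with Z marks a zombie.
    awk 'NR > 1 && $3 ~ /^Z/ { count[$2]++ }
         END { for (p in count) print p, count[p] }' |
        sort -k2,2nr -k1,1n
}

# Live usage (uncomment on the affected host):
#   ps -eo pid,ppid,stat,comm | zombies_by_parent
```

Judging by the pstree output in this report, the parent would be expected to be the sudo process, but that is an inference rather than something confirmed here.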

You’ll see a process tree like this:

pstree 23964
sh───php───sh───sudo─┬─git-upload-pack
                     └─5*[{sudo}]
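
The wrapper script from step 3 hinges on a single comparison against PAM_RUSER, which pam_exec exports into the environment of the command it runs. As a minimal sketch for anyone reproducing this, the branch can be exercised without involving PAM at all. The helper name would_run_step_ssh_ctl is hypothetical and is not part of SmallStep or Phabricator; it only mirrors the test in the wrapper.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the wrapper's condition: returns 0 when the
# wrapper would call /usr/bin/step-ssh-ctl, 1 when it would skip it.
# Takes the user as $1, falling back to $PAM_RUSER when no argument is given.
would_run_step_ssh_ctl() {
    local ruser="${1-${PAM_RUSER-}}"
    [ "$ruser" != "git" ]
}

# Dry run over a few sample values ("alice" is an illustrative user name).
for user in git alice ""; do
    if would_run_step_ssh_ctl "$user"; then
        echo "PAM_RUSER='$user': would run step-ssh-ctl"
    else
        echo "PAM_RUSER='$user': would skip step-ssh-ctl"
    fi
done
```

Note that an unset or empty PAM_RUSER also falls through to step-ssh-ctl, since only the literal value git is skipped.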

Phabricator/Arcanist Version
arcanist 4b3baca999a4a229433c891cf69c2c4e2d634b89 (16 Oct 2020)

Host Info
Kernel: 5.10.2-x86_64-linode140
Distro: Ubuntu
Revision: 18.04LTS
PHP Version: PHP 7.2.24-0ubuntu0.18.04.7 (cli) (built: Oct 7 2020 15:24:25) ( NTS )

If reproducing this requires SmallStep SSH, I’m not going to look at this on a free support basis unless you can present a much more compelling case that this is rooted in some bug in Phabricator rather than some bug in SmallStep SSH: this smells a lot like a wild goose chase in search of a non-Phabricator environmental problem to me.

If reproducing this does not require SmallStep SSH, please provide reproduction steps that don’t mention SmallStep SSH.

I am broadly unable to reproduce this in the general case: the Phacility production cluster runs ~40 repository hosts, configured according to the documentation, some of which execute tens of thousands of repository operations per day. They don’t get rebooted very often:

% phage remote ssh --pools repo -- --exec 'uptime'
 PHAGE  Planning execution...
 PHAGE  Initiating execution...
[repo005.phacility.net]  16:38:06 up 1902 days, 22:49,  0 users,  load average: 1.23, 1.60, 1.38
[repo008.phacility.net]  16:38:06 up 1553 days, 23:28,  0 users,  load average: 0.83, 0.63, 0.54
[repo002.phacility.net]  16:38:06 up 1671 days,  3:51,  0 users,  load average: 0.56, 0.55, 0.51
[repo019.phacility.net]  16:38:06 up 1441 days,  5:22,  0 users,  load average: 0.64, 0.68, 0.72
[repo020.phacility.net]  16:38:06 up 1441 days,  5:22,  0 users,  load average: 1.00, 0.82, 0.74
[repo021.phacility.net]  16:38:06 up 1421 days,  4:00,  0 users,  load average: 0.61, 0.56, 0.52
[repo014.phacility.net]  16:38:06 up 1483 days, 23:18,  0 users,  load average: 1.99, 1.62, 1.57
[repo017.phacility.net]  16:38:06 up 1441 days,  5:23,  0 users,  load average: 0.40, 0.43, 0.47
[repo015.phacility.net]  16:38:06 up 930 days,  6:07,  0 users,  load average: 2.14, 1.93, 2.02

...

-<  DONE  >-----------------------------------------------------------------------

 COMPLETE  Everything went according to plan (in 1,968ms).

(I’m also not aware of other installs that need to regularly reboot Phabricator hosts, although I think most organizations have a different DevOps philosophy than I do and probably haven’t hit 5 years of uptime on many hosts.)

That’s completely fair, and I do suspect the cause lies with SmallStep rather than with Phabricator.

I figured I’d open a ticket/topic/report for completeness’ sake in case there was somehow a bug in Phabricator.