Sporadic git cloning hang over ssh


#1

Observed Behavior:
git clone over ssh periodically hangs.

Expected Behavior:
git clone over ssh should not hang (unless network is broken etc).

Phabricator Version:
37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36

SSH version
OpenSSH_7.6p1 Ubuntu-4ubuntu0.1, OpenSSL 1.0.2n 7 Dec 2017

Reproduction Steps:

rc=0
while [[ $rc -eq 0 ]]; do
  rm -rf libedit/; date
  git/bin/git clone -n git@remotehost:diffusion/LIBEDIT/libedit; rc=$?
  date
done
date

#2

Your Phabricator install is very old, and your reproduction steps don’t reproduce the issue for me.


#3

I agree that the version of Phabricator being used isn’t the freshest. Bearing that in mind, is anyone aware of something that has been fixed that might explain this? Better yet, is there some sort of logging that could be enabled that might explain it (normal logging from sshd doesn’t seem to offer any insight, and the problem happens even when the clone goes via localhost)?


#4

We implement the Git protocol by just running git, so we can’t add logging to git index-pack.

Try cloning from the host using git clone file:///path/to/repo instead of SSH. No Phabricator code runs when you do this, so if the issue reproduces this is not a problem with Phabricator.
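
For example, a loop along the lines of the reproduction steps above (keeping the placeholder path) would be:

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n file:///path/to/repo libedit; rc=$?; date; done; date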

If the issue does not reproduce using file:///, we need reproduction steps which work for us in order to move forward. This does not reproduce for me, either.

See also https://secure.phabricator.com/T13111, perhaps.


#5

So I actually tried cloning inside the Docker container that Phabricator is running in:

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n git@localhost:diffusion/LIBEDIT/libedit; rc=$?; date; done; date

and the problem still occurs.

I set up an alternate sshd that didn’t call out to Phabricator:

# NOTE: You must have OpenSSHD 6.2 or newer; support for AuthorizedKeysCommand
# was added in this version.

# NOTE: Edit these to the correct values for your setup.

# You may need to tweak these options, but mostly they just turn off everything
# dangerous.

Port 2222
Protocol 2
PermitRootLogin yes
AllowAgentForwarding no
AllowTcpForwarding no
PrintMotd no
PrintLastLog no
PasswordAuthentication no
UseDNS no

PidFile /dev/shm/sshd-phabricator.pid
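
That alternate daemon can then be started against this config with something like (the config path here is just an assumption):

/usr/sbin/sshd -f /etc/ssh/sshd_config.alternate -D -e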

(I had to set up keys, etc.) and running

rc=0; while [[ $rc -eq 0 ]]; do rm -rf libedit/; date; git clone -n ssh://user@localhost:2222/var/repo/83 libedit; rc=$?; date; done; date

never manifested the issue.


#6

The repo in question (/var/repo/83) was manually repacked yesterday and its actual contents haven’t changed in about two years, so I’d be surprised if this is a git prune issue…
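
(For reference, something along these lines, run inside the repository, can confirm that state; just a sketch:)

cd /var/repo/83
git count-objects -v    # expect few or no loose objects and a single pack after a full repack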


#7

Sounds like this might be an issue with Phabricator then. We’re happy to look at this in more detail as soon as we have a working reproduction case.


#8

So I’ve made it happen on another system of a similar vintage.

Steps to reproduce:

  1. Import https://github.com/phacility/arcanist into Phabricator.
  2. Ensure the default SSH key inside the running user’s .ssh directory belongs to a user that can access the local Phabricator repo.
  3. Run
cd /dev/shm/
rc=0; while [[ $rc -eq 0 ]]; do rm -rf arcanist/; date; git clone -n git@localhost:diffusion/ARCANIST/arcanist.git; rc=$?; date; done; date

Expected results:
Git cloning to loop forever.

Actual results:
After around 20 minutes to an hour, the git clone hangs partway through:

[...]
Thu Jan  3 17:45:59 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
remote: Total 20244 (delta 12083), reused 20244 (delta 12083)
Receiving objects: 100% (20244/20244), 8.33 MiB | 23.97 MiB/s, done.
Resolving deltas: 100% (12083/12083), done.
Thu Jan  3 17:46:00 UTC 2019
Thu Jan  3 17:46:00 UTC 2019
Cloning into 'arcanist'...
remote: Counting objects: 20244, done.
remote: Compressing objects: 100% (7843/7843), done.
Receiving objects:  93% (18827/20244)

How reproducible is the problem?
This problem happens every time if you wait for long enough (although it seems to happen even more rapidly on some other heavily loaded systems).

Version information:
Phabricator 37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36
Ubuntu 18.04
OpenSSH_7.6p1 Ubuntu-4ubuntu0.1, OpenSSL 1.0.2n 7 Dec 2017
PHP 7.2.10-0ubuntu0.18.04.1

Additional information:
We are running Phabricator inside a container with an AWS RDS MySQL database. When the hang occurs the processes look like this:

[...]
root       722  0.0  0.0  18508  3488 pts/0    Ss+  17:30   0:00 /bin/bash
root     18950  0.0  0.0  89812  4004 pts/0    Tl   17:46   0:00 git clone -n git@localhost:diffusion/ARCANIST/arcanist.git
root     18951  0.0  0.1  47544  8364 pts/0    T    17:46   0:00 /usr/bin/ssh git@localhost git-upload-pack 'diffusion/ARCANIST/arcanist.git'
root     18952  0.0  0.0  74656  6660 ?        SNs  17:46   0:00 sshd: git [priv]
git      18955  0.0  0.0  74656  3424 ?        SN   17:46   0:00 sshd: git@notty
git      18956  0.0  0.0   4628   788 ?        SNs  17:46   0:00 sh -c '/opt/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'phabricatoruser' '--phabricator-ssh-key' '32'
git      18957  0.0  0.5 300976 44464 ?        SN   17:46   0:00 php /opt/phabricator/bin/ssh-exec --phabricator-ssh-user phabricatoruser --phabricator-ssh-key 32
root     18965  0.0  0.0  19264  5948 pts/0    T    17:46   0:00 /usr/lib/git-core/git index-pack --stdin -v --fix-thin --keep=fetch-pack 18950 on 0ec65f5b6b82 --check-self-contained-and-connected

Running strace on PID 18957 (but from outside the container, where its PID is 24657) only shows this:

strace: Process 24657 attached
select(6, [5], [], [5], {tv_sec=0, tv_usec=717067}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x7f9290ee7000, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)

over and over.
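
For anyone repeating the strace step: one way to map an in-container PID to its host-side PID (assuming a kernel new enough to expose NSpid in /proc) is something like:

# on the host
grep -l -E 'NSpid:.*[[:space:]]18957$' /proc/[0-9]*/status    # prints e.g. /proc/24657/status
strace -p 24657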

The problem also happens when using PHP 5.6.39-1+ubuntu18.04.1+deb.sury.org+1 (cli). On this run, strace’s output shows this:

[...]
select(6, [5], [], [5], {tv_sec=0, tv_usec=432198}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
read(5, 0x55b098ede0c0, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
[...]

Digging about shows this:

# ls -l /proc/20979/fd
total 0
lr-x------ 1 user user 64 Jan  4 12:17 0 -> 'pipe:[37268584]'
l-wx------ 1 user user 64 Jan  4 12:17 1 -> 'pipe:[37268585]'
l-wx------ 1 user user 64 Jan  4 12:17 2 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan  4 12:17 3 -> 'socket:[37268592]'
lrwx------ 1 user user 64 Jan  4 12:17 4 -> 'socket:[37268596]'
lr-x------ 1 user user 64 Jan  4 12:17 5 -> 'pipe:[37268584]'
lrwx------ 1 user user 64 Jan  4 12:17 6 -> 'socket:[37268226]'
l-wx------ 1 user user 64 Jan  4 12:17 7 -> 'pipe:[37268586]'
lrwx------ 1 user user 64 Jan  4 12:17 8 -> 'socket:[37268598]'
lrwx------ 1 user user 64 Jan  4 12:17 9 -> 'socket:[37268600]'

and grepping lsof for 37268584 (the pipe that descriptor 5 above points at) shows this:

# lsof | grep '37268584'
sshd      20977                 user    8w     FIFO               0,12       0t0   37268584 pipe
sh        20978                 user    0r     FIFO               0,12       0t0   37268584 pipe
php       20979                 user    0r     FIFO               0,12       0t0   37268584 pipe
php       20979                 user    5r     FIFO               0,12       0t0   37268584 pipe

So it looks like sshd holds the write end of the pipe.

On another run, digging around shows the PHP call stack as apparently being:

DiffusionSSHWorkflow->execute() phabricator/src/applications/diffusion/ssh/DiffusionSSHWorkflow.php:188 
 DiffusionGitUploadPackSSHWorkflow->executeRepositoryOperations() phabricator/src/applications/diffusion/ssh/DiffusionGitUploadPackSSHWorkflow.php:83 
  DiffusionGitSSHWorkflow->waitForGitClient() phabricator/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php:44
   PhutilChannel->waitForAny() libphutil/src/channel/PhutilChannel.php:100 
    PhutilChannel->waitForActivity() libphutil/src/channel/PhutilChannel.php:197 

libphutil/src/channel/PhutilChannel.php:

128   public static function waitForActivity(
129     array $reads,
130     array $writes,
131     array $options = array()) {
[...]
194     $wait_sec = (int)$wait;
195     $wait_usec = 1000000 * ($wait - $wait_sec);
196 
197     @stream_select($read, $write, $except, $wait_sec, $wait_usec);
198   }

Searching the web, the only thing that sounds similar to this is the long-fixed https://secure.phabricator.com/T9724 (“ssh-exec” sometimes hangs when git cloning larger repos)…


#11

I have made this happen with a clean instance using:
37a40d82726b31d45446306b9306b9cd48b219b4 (stable) Promote 2018 Week 36
I then upgraded that instance to
Phabricator c7656312ec4207b39fe3962fb350c9c33f17b4a3 (stable) Promote 2019 Week 1
Arcanist 97ddb9d5a1be282d6002a875a759266bb97b653f (stable) Promote 2018 Week 51
libphutil a537ba03c994eca87698cc4b95d4db4570edc665 (stable) Promote 2018 Week 50

and the problem continued to occur (on this occasion it took 14 minutes on a machine with 80 cores, an Intel SSD and 256 GB of RAM).

@avivey, given that I reproduced the issue on the latest stable Phabricator, can you retry the latest reproduction steps (reply 8 above) for at least an hour and report your results?


#12

I let this run for about an hour and a half:

$ while [ true ]; do date; let "COUNT++"; echo $COUNT; rm -rf libphutil/; git clone ssh://secure@secure.phabricator.com/source/libphutil.git; done

It cloned 950 times without hanging:

...
Mon Jan  7 11:35:59 PST 2019
950
Cloning into 'libphutil'...
# Fetch received by "secure004.phacility.net", forwarding to cluster host.
# Acquiring read lock for repository "libphutil" on device "secure002.phacility.net"...
# Acquired read lock immediately.
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Cleared to fetch on cluster host "secure002.phacility.net".
remote: Enumerating objects: 15572, done.
remote: Counting objects: 100% (15572/15572), done.
remote: Compressing objects: 100% (6336/6336), done.
remote: Total 15572 (delta 10457), reused 13214 (delta 8898)
Receiving objects: 100% (15572/15572), 5.90 MiB | 9.90 MiB/s, done.
Resolving deltas: 100% (10457/10457), done.
...

#13

@epriestley OK fair enough - let me see if I can reproduce it with libphutil


#14

I’ve just tried with libphutil here and reproduced the hang:

# mkdir /dev/shm/gitclonehang
# cd /dev/shm/gitclonehang
# COUNT=0; START_DATE="$(date)"; while [ true ]; do date; let "COUNT++"; echo $COUNT; rm -rf libphutil/; git clone ssh://git@localhost/diffusion/LIBPHUTIL/libphutil.git; done
Tue Jan  8 06:03:45 UTC 2019
1
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
remote: Total 15572 (delta 9240), reused 15572 (delta 9240)
Receiving objects: 100% (15572/15572), 8.38 MiB | 20.44 MiB/s, done.
Resolving deltas: 100% (9240/9240), done.
Tue Jan  8 06:03:46 UTC 2019
2
[...]
36
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
remote: Total 15572 (delta 9240), reused 15572 (delta 9240)
Receiving objects: 100% (15572/15572), 8.38 MiB | 22.19 MiB/s, done.
Resolving deltas: 100% (9240/9240), done.
Tue Jan  8 06:04:29 UTC 2019
37
Cloning into 'libphutil'...
remote: Counting objects: 15572, done.
remote: Compressing objects: 100% (5994/5994), done.
Receiving objects:  62% (9655/15572)
[hang]

This was with a freshly reset instance of “Phabricator c7656312ec4207b39fe3962fb350c9c33f17b4a3 (stable) Promote 2019 Week 1”. My Phabricator is running on an up-to-date Ubuntu 18.04 install.

Differences between your Phabricator instance and mine that I can see:

  • You’re using a clustered Phabricator whereas I’m not.
  • I’m cloning over localhost (to rule out networking issues) whereas it’s less clear whether you are doing the same.
  • Your libphutil repo seems to be smaller/better compressed than mine.
  • This particular Phabricator was likely less busy than yours (in this case no one but me was using this Phabricator instance).

#15

I’ve since made the issue happen in a Debian 9 container too. I’m now working on making the environment that demonstrates the issue easier to obtain…


#16

I’ve made an Ubuntu 18.04 Docker container that reproduces the problem. Usage steps are as follows:

docker run --name mysql-phabricator -e MYSQL_ROOT_PASSWORD=githang -d mysql:8 mysqld --innodb-buffer-pool-size=256M --sql-mode="STRICT_ALL_TABLES"
docker build -t sitsofe/phabricator-clone-hang:latest https://github.com/sitsofe/phabricator-clone-hang.git
docker rm phabricator-clone-hang; docker run --privileged --init -it -e SERVER_FQDN=$(hostname -f) -p 80:80 --link mysql-phabricator --name phabricator-clone-hang sitsofe/phabricator-clone-hang /bin/bash

It can take anywhere from 5 minutes to 2 hours for the hang to reproduce itself in the container, but it always happens eventually. The commands above worked fine when using Docker on Ubuntu 18.04 with a traditional static fully qualified DNS name. When using Docker on macOS, trying to reach the machine’s zeroconf *.local address with Safari failed (accessing the same address with Chrome succeeded). Reproduction also took much longer when running on Docker for Mac (on a Mac laptop) than when using an Ubuntu 18.04 install of Docker on a Linux server.
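
Once it hangs, one way to confirm from the host is to look for the stuck processes inside the container (a sketch; it assumes procps is installed in the image):

docker exec phabricator-clone-hang ps auxww | grep -E 'git index-pack|ssh-exec'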

@epriestley, @avivey can you see if the latest steps allow the issue to reproduce within two hours for you?


#17

The problem appears to be related to trying to interpret the meaning of zero writes. From phutil_fwrite_nonblocking_stream() in https://github.com/phacility/libphutil/blob/9456d6683d3388fbfe23503ff9f676a9186952e4/src/utils/utils.php#L995:

 995 function phutil_fwrite_nonblocking_stream($stream, $bytes) {
 996   if (!strlen($bytes)) {
 997     return 0;
 998   }
 999 
1000   $result = @fwrite($stream, $bytes);
1001   if ($result !== 0) {
1002   // In cases where some bytes are written (`$result > 0`) or
1003     // an error occurs (`$result === false`), the behavior of fwrite() is
1004     // correct. We can return the value as-is.
1005     return $result;
1006   }
1007 
1008   // If we make it here, we performed a 0-length write. Try to distinguish
1009   // between EAGAIN and EPIPE. To do this, we're going to `stream_select()`
1010   // the stream, write to it again if PHP claims that it's writable, and
1011   // consider the pipe broken if the write fails.
1012 
1013   $read = array();
1014   $write = array($stream);
1015   $except = array();
1016 
1017   @stream_select($read, $write, $except, 0);
1018 
1019   if (!$write) {
1020     // The stream isn't writable, so we conclude that it probably really is
1021     // blocked and the underlying error was EAGAIN. Return 0 to indicate that
1022     // no data could be written yet.
1023     return 0;
1024   }
1025 
1026   // If we make it here, PHP **just** claimed that this stream is writable, so
1027   // perform a write. If the write also fails, conclude that these failures are
1028   // EPIPE or some other permanent failure.
1029   $result = @fwrite($stream, $bytes);
1030   if ($result !== 0) {
1031     // The write worked or failed explicitly. This value is fine to return.
1032     return $result;
1033   }
1034 
1035   // We performed a 0-length write, were told that the stream was writable, and
1036   // then immediately performed another 0-length write. Conclude that the pipe
1037   // is broken and return `false`.
1038   return false;
1039 }

Unfortunately, there are cases where we fall through to the final return false even though the underlying C call was indicating EAGAIN. This leads to callers closing the socket, which in turn leads to endless looping trying to write to a socket that is now closed, while the other side of the pipe is still trying to read. So far the following has worked around the issue:

1029   for ($try = 0; $try < 5; $try++) {
1030     $result = @fwrite($stream, $bytes);
1031     if ($result !== 0) {
1032       // The write worked or failed explicitly. This value is fine to return.
1033       return $result;
1034     }
1035     usleep(5000);
1036   }   

With that in place, you can additionally apply the following to waitForGitClient() and PhabricatorSSHPassthruCommand:

diff --git a/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php b/src/applications/diffus
ion/ssh/DiffusionGitSSHWorkflow.php
index 6bc56d7..876f2e5 100644
--- a/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php
+++ b/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php
@@ -34,15 +34,17 @@ abstract class DiffusionGitSSHWorkflow
 
   protected function waitForGitClient() {
     $io_channel = $this->getIOChannel();
+    //$io_channel->closeWriteChannel();
+    //$io_channel->closeReadChannel();
 
     // If we don't wait for the client to close the connection, `git` will
     // consider it an early abort and fail. Sit around until Git is comfortable
     // that it really received all the data.
-    while ($io_channel->isOpenForReading()) {
-      $io_channel->update();
-      $this->getErrorChannel()->flush();
-      PhutilChannel::waitForAny(array($io_channel));
-    }
+//    while ($io_channel->isOpenForReading()) {
+//      $io_channel->update();
+//      $this->getErrorChannel()->flush();
+//      PhutilChannel::waitForAny(array($io_channel));
+//    }
   }

+++ b/src/infrastructure/ssh/PhabricatorSSHPassthruCommand.php
@@ -172,6 +172,8 @@ final class PhabricatorSSHPassthruCommand extends Phobject {
 
       // If we have nothing left on stdin, close stdin on the subprocess.
       if (!$io_channel->isOpenForReading()) {
+        //$command_channel->update();
+        $command_channel->flush();
         $command_channel->closeWriteChannel();
       }

In brief testing with the Docker container, it never takes more than one additional spin around the retry loop for things to recover.


#18

Further investigation has shown that whenever multiple writes need to be done after the stream_select() call has indicated the stream is writable, the stream_select() call has actually failed, in a bizarre way that makes socket_last_error() return 0 (Success)!

The following is a shorter patch with most of the debugging (bar the piece that indicates stream_select() failed) commented out. The patch fixes the problem by simply telling the phutil_fwrite_nonblocking_stream() caller that a zero-length write was done so that they can try again themselves:

diff --git a/src/utils/utils.php b/src/utils/utils.php
index 4164e6f..29f79b8 100644
--- a/src/utils/utils.php
+++ b/src/utils/utils.php
@@ -1014,7 +1014,16 @@ function phutil_fwrite_nonblocking_stream($stream, $bytes) {
   $write = array($stream);
   $except = array();
 
-  @stream_select($read, $write, $except, 0);
+  $result = @stream_select($read, $write, $except, 0);
+  if ($result === false) {
+      // FIXME: Remove debugging
+      file_put_contents("/tmp/libphutil.txt", "Socket error: " . socket_strerror(socket_last_error()) . "\n", FILE_APPEND);
+      // We can't trust the result in $write and socket_last_error() can be 0
+      // ("Success")! This is usually transient so return 0 to indicate that no
+      // data could be written yet (if there's a real error a subsequent write
+      // will indicate it).
+      return 0;
+  }
 
   if (!$write) {
     // The stream isn't writable, so we conclude that it probably really is
@@ -1028,6 +1037,7 @@ function phutil_fwrite_nonblocking_stream($stream, $bytes) {
   // EPIPE or some other permanent failure.
   $result = @fwrite($stream, $bytes);
   if ($result !== 0) {
+    //file_put_contents("/tmp/libphutil.txt", "Retried write succeeded\n", FILE_APPEND);
     // The write worked or failed explicitly. This value is fine to return.
     return $result;
   }
@@ -1035,6 +1045,7 @@ function phutil_fwrite_nonblocking_stream($stream, $bytes) {
   // We performed a 0-length write, were told that the stream was writable, and
   // then immediately performed another 0-length write. Conclude that the pipe
   // is broken and return `false`.
+  file_put_contents("/tmp/libphutil.txt", "Broken stream!\n", FILE_APPEND);
   return false;
 }

So far, over 50,000 iterations across multiple different containers have completed without hitting the hang issue.


#19

So after a colleague’s review I’ve dug deeper, and the cause has been traced to stream_select() making an underlying select() call that returns EINTR. It’s rare, but it does happen… Further, although not documented, you MUST NOT use the $read, $write, $except arrays when stream_select() fails, because PHP won’t have set/changed them. Given all this, I’ve come up with the following diff:

commit 910c026248cd6ecc764fc56e24fc24b7b98eb7a3
Author: Sitsofe Wheeler <sitsofe@>
Date:   Fri Feb 1 11:33:02 2019 +0000

    Fix git cloning over SSH from Phabricator hanging issue
    
    A git clone over SSH of a repository hosted on Phabricator can occasionally
    hang forever. Inspection showed that the phabricator/bin/ssh-exec script was
    stuck in an infinite loop due to the write part of the socket being closed when
    there was still data to send.
    
    The closure was triggered by a call to phutil_fwrite_nonblocking_stream()
    supposedly failing in a permanent manner. Digging inside
    phutil_fwrite_nonblocking_stream() showed the $write array of a failed
    stream_select() was unreliable which caused a follow-up fwrite() to perform a
    zero-sized write (which is used to detect an EPIPE situation which is returned
    as permanent error). Finally, it was discovered the stream_select() call had
    failed due to the underlying select() call it made returning EINTR (which is a
    soft error indicating the operation can be retried).
    
    Prevent the infinite loop being triggered by attempting detection of the case
    when stream_select() triggers EINTR (or EAGAIN) in its select() call and if it
    does just tell the caller nothing was done (which will likely cause the caller
    to retry). For good measure, tell the caller a permanent failure happened if
    stream_select() fails in any other way.
    
    Upstream report:
    https://discourse.phabricator-community.org/t/sporadic-git-cloning-hang-over-ssh/2233

diff --git a/src/utils/utils.php b/src/utils/utils.php
index 4164e6f..2c18b93 100644
--- a/src/utils/utils.php
+++ b/src/utils/utils.php
@@ -1014,7 +1014,31 @@ function phutil_fwrite_nonblocking_stream($stream, $bytes) {
   $write = array($stream);
   $except = array();
 
-  @stream_select($read, $write, $except, 0);
+  // Clear last error without raising an error.
+  set_error_handler(function() { }, 0);
+  restore_error_handler();
+  $result = @stream_select($read, $write, $except, 0);
+  if ($result === false) {
+    // When stream_select() fails we can't trust $write. Since
+    // socket_last_error()/posix_get_last_error() just return 0, try scraping
+    // the errno out of the PHP warning.
+    $error = error_get_last()['message'];
+    $pattern = '/stream_select\(\): unable to select \[(\\d+)\]:/';
+    $errno = 0;
+    if (preg_match($pattern, $error, $matches)) {
+        $errno = (int)$matches[1];
+    }
+    $eintr = defined('SOCKET_EINTR') ? SOCKET_EINTR : 4;
+    $eagain = defined('SOCKET_EAGAIN') ? SOCKET_EAGAIN : 11;
+    if ($errno == $eintr || $errno == $eagain) {
+      // The underlying select() returned EINTR/EAGAIN. Return 0 to indicate
+      // that no data could be written yet so the caller tries again.
+      return 0;
+    } else {
+      // select() failed some other way so conclude it's a permanent failure.
+      return false;
+    }
+  }
 
   if (!$write) {
     // The stream isn't writable, so we conclude that it probably really is

With this patch applied we’ve been able to do over 9000 clones in the Docker image (mentioned in an earlier comment) without issue.
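
As an aside, the underlying EINTR behaviour can be seen in isolation with a small standalone sketch like the one below (it assumes the pcntl extension and PHP 7.1+ for pcntl_async_signals(); it is not part of the patch):

<?php
// Interrupt a blocking stream_select() with SIGALRM and watch it fail with EINTR.
pcntl_async_signals(true);
pcntl_signal(SIGALRM, function () { /* only here so select() gets interrupted */ });

$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
$read = array($pair[0]);   // nothing is ever written to $pair[1], so this blocks
$write = array();
$except = array();

pcntl_alarm(1);
$result = @stream_select($read, $write, $except, 5);
var_dump($result);                       // bool(false)
var_dump(error_get_last()['message']);   // "stream_select(): unable to select [4]: ..."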


#20

Thanks, see https://secure.phabricator.com/T13243 and https://secure.phabricator.com/D20083.


#21

So, a few follow-ups:

  • I’ve submitted a note/comment to http://php.net/manual/en/function.stream-select.php about not using the result in the arrays when stream_select() fails.
  • It looks like the upstream diff D20083 is essentially the approach of the second attempt above (comment 18). As stated there, it works in the sense that it has completed lots and lots of iterations successfully.
  • It is possible to construct contrived, non-real-world scenarios where fwrite() succeeds with a zero-sized write but select fails without EINTR:
<?php
$fp = fopen("/tmp/php_fifo", "w");
stream_set_blocking($fp , 0);
stream_filter_append($fp, 'zlib.deflate', STREAM_FILTER_WRITE, 1);
while (true) {
    $result = fwrite($fp, ""); // phutil_fwrite_nonblocking_stream() handles empty data
    if ($result !== 0) {
        echo "result = $result\n";
        continue;
    }

    echo "Fell through\n";
    $read = $except = array();
    $write = array($fp);
    $result = stream_select($read, $write, $except, null);
    if ($result === false) {
        echo "Select failed, " . socket_strerror(socket_last_error()) . "\n";
        echo "Assuming EINTR\n";
        sleep(120);
    } else {
        echo "Select succeeded\n";
    }
}
  • When playing around with mkfifo /tmp/php_fifo; sleep 5 < /tmp/php_fifo the above example can generate the following:
Fell through
PHP Warning:  stream_select(): cannot cast a filtered stream on this system in /home/sitsofew/stream_error_fail.php on line 16
PHP Warning:  stream_select(): No stream arrays were passed in /home/sitsofew/stream_error_fail.php on line 16
Select failed, Success
Assuming EINTR

For anyone wondering: yes, this was a tough issue to track down. I’m unfamiliar with PHP and spent huge amounts of time fumbling around (I may write up the procedure I wound up using, attaching gdb to a running php process; a rough sketch is below). In conclusion, my colleagues and I found that PHP has an API deficiency in how it fails to expose errno for lower-level call failures, but saying that here is just preaching to the choir.
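
A rough sketch of that gdb approach, for anyone curious (the .gdbinit path is an assumption and it needs a PHP build with debugging symbols):

gdb -p <pid-of-stuck-ssh-exec>
(gdb) source /usr/src/php/.gdbinit    # ships with php-src; defines the zbacktrace macro
(gdb) zbacktrace                      # prints the PHP-level call stack of the stuck process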