Possible race condition when landing to clustered hosted repositories?


#1

Observed Behavior:
arc land fails occasionally with an error message suggesting the repository state has changed in the background.

Here are the logs from our arc land attempt.

$ arc land --revision Dnnnnn --onto master
Landing current branch 'fix-the-things'.
 TARGET  Landing onto "master", selected by the --onto flag.
 REMOTE  Using remote "origin", the default remote under git.
 FETCH  Fetching origin/master...
# Fetch received by "web001.mydomain", forwarding to cluster host.
# Acquiring read lock for repository "rCORE" on device "repo002.mydomain"...
# Acquired read lock immediately.
# Device "repo002.mydomain" is already a cluster leader and does not need to be synchronized.
# Cleared to fetch on cluster host "repo002.mydomain".
These commits will be landed:

      - 98ba9f5 Merge branch 'master' into fix-the-things
      - 9a99276 fix
      - 2fe6bdc lint
      - 0b96fb7 Fix The Things

Landing revision 'Dnnnnn: Fix The Things'...
 BUILDS PASSED  Harbormaster builds for the active diff completed successfully.
 PUSHING  Pushing changes to "origin/master".
# Push received by "web001.mydomain", forwarding to cluster host.
# Acquiring write lock for repository "rCORE"...
# Acquired write lock immediately.
# Acquiring read lock for repository "rCORE" on device "repo002.mydomain"...
# Acquired read lock immediately.
# Device "repo002.mydomain" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "repo002.mydomain".
# Released cluster write lock.
To ssh://git@git.mydomain/diffusion/CORE/core.git
   ed3f135..27be34c  27be34c714dfaab080f3efa02c9b501a85f74481 -> master
 UPDATE  Local "master" tracks target remote "origin/master", checking out and pulling changes.
 PULL  Checking out and pulling "master".


FINISHED Exception 
Command failed with error #1!
COMMAND
git pull --

STDOUT
(empty)

STDERR
# Fetch received by "web001.mydomain", forwarding to cluster host.
# Acquiring read lock for repository "rCORE" on device "repo001.mydomain"...
# Acquired read lock immediately.
# Synchronizing this device ("repo001.mydomain") from cluster leader ("repo002.mydomain") before read.
# Synchronization of "repo001.mydomain" from leader "repo002.mydomain" failed: Command failed with error #1!
COMMAND
'/usr/bin/sudo' -E -n -u 'phd' -- git fetch --prune -- 'ssh://[repo2 ip address]:2223/diffusion/CORE/' '+refs/*:refs/*'

STDOUT
(empty)

STDERR
Warning: Permanently added '[repo2 ip address]:2223' (ECDSA) to the list of known hosts.
error: cannot lock ref 'refs/heads/master': is at 27be34c714dfaab080f3efa02c9b501a85f74481 but expected ed3f1353f352acc87a020c6294a434ad92a13d36
From ssh://[repo2 ip address]:2223/diffusion/CORE
 ! ed3f1353f3..27be34c714  master     -> master  (unable to update local ref)

CommandException: Command failed with error #1!

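If I’m reading this log correctly, the push was received on repo002, and the subsequent pull was routed to repo001, which then tried to synchronize from the leader. By the time that sync tried to update refs/heads/master on repo001, the ref was already at 27be34c (the pushed commit) rather than the ed3f135 it expected. So it looks like something else updated repo001 between the push and the read, and the sync didn’t notice. The fetch then failed because the hashes no longer matched.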
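To illustrate what I suspect is happening, here is a toy sketch in Python of a compare-and-swap ref update. It is only a model of my guess, not how git or Phabricator actually implement this: the `Ref` class and `simulate_race` function are made-up names, and the assumption that two synchronizations ran concurrently on repo001 is not confirmed by the logs.

```python
class Ref:
    """A toy ref store holding one commit hash for refs/heads/master."""

    def __init__(self, value):
        self.value = value

    def update(self, expected, new):
        # Compare-and-swap: refuse the update if the ref moved since it was read.
        if self.value != expected:
            raise RuntimeError(
                "cannot lock ref 'refs/heads/master': is at %s but expected %s"
                % (self.value, expected)
            )
        self.value = new


def simulate_race():
    replica = Ref("ed3f135")   # repo001 before the push (assumed)
    leader_head = "27be34c"    # repo002 after the push

    # Two hypothetical syncs both read the replica's old value before either writes.
    seen_by_a = replica.value
    seen_by_b = replica.value

    replica.update(seen_by_a, leader_head)       # first sync wins
    try:
        replica.update(seen_by_b, leader_head)   # second sync finds the ref moved
    except RuntimeError as err:
        return str(err)
    return None


print(simulate_race())
```

If that model is right, the second updater fails with the same shape of message we see in STDERR, even though the replica ends up at the correct commit.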

Expected Behavior:
This usually works; we only see this error message infrequently.

We are in the process of enabling repository automation, and expect to use the ‘land revision’ action in Differential in future.

Phabricator Version:

phabricator fba35975e7667e6d6dc00115e1c874a06923cd96 (Sun, Apr 8) (branched from f01c2e36948b0378eda726158ec86808cac2a9fd on phacility)
arcanist 6185c8911737f4dec0d0918cc72cd07a701475af (Sun, Apr 8) (branched from 73f5afd441109cb712282660c1eb01089b6297fa on phacility)
phutil 47cfa511ca6782df13c459fdd212606f091a44ec (Fri, Mar 16) (branched from 1ad42491e44a1866975b366ae552f1d47761e35b on phacility)

Reproduction Steps:
None that are reliable, I’m afraid; it only happens infrequently.


#2

We’ve now seen this in our Jenkins CI as well, during the initial checkout stage, so it’s not just land-related.

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress ssh://git@git.mydomain/diffusion/CORE/core.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: # Fetch received by "web001.mydomain", forwarding to cluster host.
# Acquiring read lock for repository "rCORE" on device "repo001.mydomain"...
# Acquired read lock immediately.
# Synchronizing this device ("repo001.mydomain") from cluster leader ("repo002.mydomain") before read.
# Synchronization of "repo001.mydomain" from leader "repo002.mydomain" failed: Command failed with error #1!
COMMAND
'/usr/bin/sudo' -E -n -u 'phd' -- git fetch --prune -- 'ssh://[ip]:2223/diffusion/CORE/' '+refs/*:refs/*'

STDOUT
(empty)

STDERR
Warning: Permanently added '[ip]:2223' (ECDSA) to the list of known hosts.
From ssh://[ip]:2223/diffusion/CORE
 - [deleted]               (none)     -> T14182-do-some-things
error: cannot lock ref 'refs/heads/master': is at 54300c9ef15a6c77dd47b6854b576ad38393ffed but expected b6cc42e19da812c6b57c2236259208499de76571
 ! b6cc42e19d..54300c9ef1  master     -> master  (unable to update local ref)

fatal: Could not read from remote repository.
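
Since the failure is intermittent, a retry on the client side might paper over it until the underlying cause is found. Below is a rough sketch in Python, not something taken from Jenkins or Phabricator: `fetch_with_retry` and its parameters are hypothetical, and it assumes the failure can always be recognized by the "cannot lock ref" text in stderr.

```python
import subprocess
import time


def fetch_with_retry(cmd, attempts=3, delay=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run a fetch command, retrying only on the 'cannot lock ref' failure.

    `run` and `sleep` are injectable so the logic can be exercised
    without touching a real repository.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Give up on unrelated errors, or once the retry budget is spent.
        if "cannot lock ref" not in result.stderr or attempt == attempts:
            raise RuntimeError("fetch failed: " + result.stderr.strip())
        sleep(delay)
```

It would be called with the same arguments Jenkins uses, for example `fetch_with_retry(["git", "fetch", "--tags", "--progress", url, refspec])`, where `url` and `refspec` are whatever the job already passes.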

#3

We’re still seeing this after updating to the latest stable. Are we the only ones this is happening to? Could something be misconfigured in our cluster setup?


#4

Hi, we’re still seeing this occasionally. Has anyone else ever seen this?