Disabling git diffs to deal with search bots and crawlers


We are currently noticing an increase in load on our server that appears to be related to a bunch of search engines crawling our Differential pages, among other things.

At a glance, the load seems to come mostly from Phabricator invoking many instances of git on the CLI to get a diff and run the Pygments highlighting pass. We were wondering if there is any way to disable this by default on page load, or perhaps via a GET URL parameter that we could redirect bots to based on their referrer.

In addition, I noticed that the actual diffs appear to be served with Cache-Control: no-store. Is there any advice or direction on putting a caching reverse proxy in front of the pages, possibly even forcing caching of at least the git diffs?

Any help would be appreciated. Thanks!

Differential pages should not invoke git.

Diffusion pages should be forbidden by robots.txt, so well-behaved crawlers should not access these pages.

You may be able to forbid access by poorly-behaved crawlers at the HTTP server level (for instance, by User-Agent or remote IP).

Phabricator also supports application-level rate limiting, but this is rarely used by third-party installs and not currently documented. https://secure.phabricator.com/D18703 has some context.

Thanks for the quick reply.

The git processes may just be coming from the phd daemons, so take this with a grain of salt. Based on the timings and a quick flurry of process listings, I was under the impression that the page was invoking the git CLI directly.

I found that our FastCGI settings were still at the low values from before our memory upgrade, so I raised them a bit to help. We will see how it goes.

We have currently changed our robots.txt to:

User-agent: *
Crawl-delay: 60
Disallow: /diffusion/
Disallow: /differential/
Disallow: /D*
Disallow: /r*
Disallow: /file/
Disallow: /login/
Disallow: /conduit/
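
To sanity-check which paths these rules cover, here is a minimal sketch of a matcher, assuming the `*` and `$` pattern semantics described in RFC 9309. The helper names `rule_to_regex` and `is_disallowed` are made up for illustration, and the sketch ignores Allow rules, per-agent groups, and Crawl-delay. Not every crawler honors wildcards, so treat the result as a best case.

```python
import re

# The Disallow rules from the robots.txt above.
DISALLOW = [
    "/diffusion/", "/differential/", "/D*", "/r*",
    "/file/", "/login/", "/conduit/",
]

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the path; everything else is treated as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, rules=DISALLOW):
    """Return True if any Disallow rule matches the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

for path in ["/D123", "/rXYZ", "/diffusion/X/", "/T42", "/maniphest/"]:
    print(path, is_disallowed(path))
```

Under these semantics a trailing `*` adds nothing, since rules are already prefix matches: `/D*` behaves like `/D`, and `/r*` covers every path that starts with `/r`.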

We upped the delay from 5 to 60 because one ISP was hitting us from 20 IPs in a /18. Today we also added the /D* and /differential/ rules. I guess we will see how well this holds until we can put it on SSDs!
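
For spotting that kind of pattern, here is a rough sketch that groups client addresses by their enclosing network using Python's standard `ipaddress` module. It assumes the client IPs have already been pulled out of the access log. The function name `count_by_network` and the sample addresses (documentation ranges) are illustrative only.

```python
import ipaddress
from collections import Counter

def count_by_network(client_ips, prefix_len=18):
    """Count requests per enclosing IPv4 network of the given prefix length."""
    counts = Counter()
    for ip in client_ips:
        # strict=False masks off the host bits, so any address maps to its network.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    # Busiest networks first.
    return counts.most_common()

# Made-up sample input, not real traffic.
sample = ["198.51.100.7", "198.51.100.9", "198.51.64.1", "203.0.113.5"]
for network, hits in count_by_network(sample):
    print(network, hits)
```

A network that dominates the output is a candidate for a block or rate limit at the HTTP server level.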

Edit: It seems that when I visit a /D#### page, the following commands show up in the process list:

17194  -  RJ      0:00.01 git diff -M -C --no-ext-diff --no-color --src-prefix=a/ --dst-prefix=b/ -U65535 53f20b940a1e520e131b8bb31cf0529ed4d30f9e^ 53f20b940a1e5
17197  -  RJ      0:00.02 git log -n 1 --format=%P 53f20b940a1e520e131b8bb31cf0529ed4d30f9e --

I can’t reproduce this and can’t think of an execution pathway which would cause it to occur.

Wait, are you saying that nothing should ever be invoking git log directly, or just not /D### pages? Could be onto something here in our setup.

/Dxxx pages should never directly invoke git commands. Git commands may be invoked by other pages, and by the daemons.

Perhaps you were right! I disabled the phd daemons and didn’t see much in the way of git log or git diff processes.

I guess it was just timing? What was odd was that when the daemons were running and I forcibly reloaded one of the /D pages, I got the same git log hash back to back. I wonder if something in our setup or private fork is inadvertently causing the daemons to run some task that blocks on a fork() to git?
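
One way to narrow this down might be to look at the parent of each git process rather than at timing. Below is a minimal sketch that parses the text output of `ps -axo pid,ppid,command` and maps every git command to its parent's command line. The function name `git_parents` and the sample process names are hypothetical. Short-lived git processes can exit before `ps` runs, so several samples may be needed.

```python
def git_parents(ps_text):
    """Map each git command line to the command line of its parent process.

    Expects the text produced by `ps -axo pid,ppid,command`.
    """
    procs = {}
    for line in ps_text.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, command = line.split(None, 2)
        procs[int(pid)] = (int(ppid), command)

    parents = {}
    for pid, (ppid, command) in procs.items():
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable == "git":
            parent = procs.get(ppid)
            parents[command] = parent[1] if parent else "<unknown>"
    return parents

# Made-up listing for illustration; real input could come from
# subprocess.run(["ps", "-axo", "pid,ppid,command"],
#                capture_output=True, text=True).stdout
sample = """\
  PID  PPID COMMAND
  900     1 example-daemon --run
 1200   900 git log -n 1 HEAD
 1300     1 example-web-worker
"""
print(git_parents(sample))
```

If the parent turns out to be a daemon process rather than a web worker, the git calls are not coming from the /D page request itself.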