Pygmentize causing high CPU indefinitely


#1

Observed Behavior:
On a diff with a .sh file (bash script) everytime the diff is loaded a bunch of pygmentize processes are started and peg the CPU indefinitely.

Expected Behavior:
Not sure where the parsing bug is, but there should definitely be a cap on how long and how resource-intensive pygmentize should run.

Phabricator Version:
52600a415159595526d48bf1d8de4887f7840fad (Fri, Jun 15) (branched from a7c681b54997b849c46054571fa2d0071c97fa32 on origin)

Reproduction Steps:
Upload a diff on a .sh file. I can supply the file privately if needed for debugging.


#2

FYI, if anyone else is seeing this, we implemented a (kludgy) crontab entry to get around this:
* * * * * ps axh -O etimes | awk ‘/[/]pygmentize/ {if ($2 >= 5) print $1}’ | xargs kill > /dev/null 2>&1


#3

There used to be issues with pygmentize in https://secure.phabricator.com/T5644, but the solutions seems to be mostly just not running it when not needed, and there hadn’t been any reports in a while. Possibly, pygmentize itself has improved over the years.

If you have a reproduction case, there might be a legitimate bug somewhere (in Phabricator or in Pygments).


#4

The reproduction for me is very simple:

  1. Modify a bash script file that has escaped backslashes in it (I think this is related). For example: echo "Some \\\\Funky\\\\text"
  2. Upload a diff of your changes
  3. Everytime you visit the diff, pygmentize is kicked off for the bash script and never returns, eventually pegging the cpu

#5

Does not reproduce.


#6

Ok, instead of guessing, I’ll actually upload what is causing this issue, as it is a bit stranger than that.

#!/bin/bash

# Write crontab file to cron.d
echo "Writing crontab to cron.d"

echo "
# crontab
# Daily
00  00  *   *   *   root php /var/www/html/cli.php --task=\\\\\\\\User\\\\\\\\Tasks\\\\\\\\Cli\\\\\\\\Db --action=daily >> /var/www/html/log.log 2>&1
" > /etc/cron.d/daily

Let me know if you can try that out. Some platform details:

  1. Ubuntu 14.04.5
  2. PHP 5.6.38
  3. Python 2.7.6
  4. Pygments 2.2.0

#7

Does not reproduce:

I’m on the same version of Pygments. My other versions differ (Python 2.7.10, PHP 7.1.16, Mac OS X) but I’d be surprised if those versions affected the syntax highlighting behavior.


#8

Try modifying the actual line with all the slashes so that gets the diff coloring? Besides that, the most relevant difference is perhaps the Python version, I’ll upgrade to yours and see if it changes anything.


#9

Pygments execution is not effected by diff coloring - it runs on the raw text.


#10

Ok, so I came up with an exact diff file that reproduces this everytime (I reduced it as much as I could):

# HG changeset patch
# Parent  664f354eda89f8467e6d431a435f157973efa748
update

diff --git a/test.sh b/test.sh
--- a/test.sh
+++ b/test.sh
@@ -5,17 +5,17 @@ echo "Writing crontab to cron.d"
 
 echo "
 # crontab
 # Daily Cleanup
 00  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --task=\\\\\\\\Tasksr\\\\\\\\Tasks\\\\\\\\Cli\\\\\\\\Db --action=checkIp >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Daily Clearing
 00  *  *   *   *   root php /var/www/html/rc/app/apps/cli.php --task=\\\\\\\\Tasksr\\\\\\\\Tasks\\\\\\\\Cli\\\\\\\\Db --action=inter >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Daily Cleanup
-00  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --task=\\\\\\\\Tasksr\\\\\\\\Tasks\\\\\\\\Cli\\\\\\\\Cli --action=checkIp --param1=/var/www/html/rc/app/jobs/old >> /var/www/html/rc/app/logs/cli.log 2>&1
+00  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --task=\\\\\\\\Tasksr\\\\\\\\Tasks\\\\\\\\Cli\\\\\\\\Cli --action=checkIp --param1=/var/www/html/rc/app/jobs/old --param2=old >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Daily Processing
 00  00 *   *   *   root php /var/www/html/rc/app/apps/cli.php --job=/var/www/html/rc/app/internal/sysjobs/jobs_sysjobsrc.job >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Single Processing
 01  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --job=/var/www/html/rc/app/internal/job/single >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Daily Processing
 03  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --job=/var/www/html/rc/app/internal/job/daily >> /var/www/html/rc/app/logs/cli.log 2>&1
 # Weekly Processing
 04  00  *   *   *   root php /var/www/html/rc/app/apps/cli.php --job=/var/www/html/rc/app/internal/job/weekly >> /var/www/html/rc/app/logs/cli.log 2>&1

#11

This has the potential for DOS attack, perhaps this should be hidden or something.


#12

Thanks. This appears to be a bug in Pygments. Here’s a simplified reproduction case where Pygments takes more than 4 minutes to highlight about 70 characters of text on my machine, with no Phabricator code involved:

epriestley@orbital ~/dev/libphutil $ cat 60slash.sh 
echo "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
epriestley@orbital ~/dev/libphutil $ cat 60slash.sh | time python /usr/local/bin/pygmentize -O encoding=utf-8 -O stripnl=False -f html -l bash
...
      312.67 real       311.36 user         0.55 sys

As more backslashes are added, Pygments takes longer and longer to produce output.

The Pygments Bash lexer includes this pattern in lexers/shell.py:

 (r'(?s)\$?"(\\\\|\\[0-7]+|\\.|[^"\\$])*"', String.Double),

This pattern has can backtrack explosively when given input in the form "\\\\... (an unterminated, double quoted string containing many backslashes). Here’s a simple reproduction case:

import re

pattern = r'"(\\\\|\\.)*"'
corpus = r'"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'
print re.compile(pattern).match(corpus)

This script takes a very long time to execute if run as python match.py. At each step, the script tries to find a way to match the input by examining each subpattern as an alternative. This leads to an evaluation cost on O(2^N) where N is the number of double backslashes in the input.

In some regular expression engines, this can be fixed by using a “once-only” subpattern to disable backtracking. However, as far as I can tell, Python doesn’t support this feature.

A slightly tricker way to fix this is to make sure the repeating pattern has only one possible interpretation for any input. If we rewrite the pattern like this:

r'(?s)\$?"(\\([0-7]+|[^0-7])|[^"\\$])*"'

…I believe the engine will accept and reject the same inputs, but will never backtrack. Making this change to shell.py locally:

-             (r'(?s)\$?"(\\\\|\\[0-7]+|\\.|[^"\\$])*"', String.Double),
+             (r'(?s)\$?"(\\([0-7]+|[^0-7])|[^"\\$])*"', String.Double),

…reduces runtime on 60slash.sh (above) from ~300 seconds to 0.05 seconds on my machine.

As a workaround, you can try making the same change on your system.

Another “workaround” is to limit the number of backslashes which appear in any single Bash string to around 40. This is obviously fairly ridiculous, but your reproduction script is also pretty ridiculous.

I’ll file this in our upstream and consider options for how we’ll move forward with it.

When filing bugs here, it’s incredibly important that you provide a reproduction case which actually lets us reproduce the issue, which is why the pinned thread spends five paragraphs reiterating over and over again that you need to give us a usable reproduction case. The original report five months ago didn’t have enough information to possibly trace the root cause to backtracking behavior in a lexer regex.


#13

Despite wanting it turned on and trying to turn Pygmentize on twice, I’ve currently got it disabled in my instance because of multiple cases where we saw high cpu caused by Pygmentize on our Phabricator box (especially during repo import)

I always put it down to possibly my mistake or bad setup, but now this has come up , I’m wondering if reading your description here if we might have been encountering this or something similar…

At the point you saw the high CPU was it easy to determine which source file it was stuck processing? its possible I might be able to find other failure scenarios.


#14

ps aux | grep pygmentize should give you the process list. You can’t necessarily see which file is processing, but you can see what type of processing is being done. One of the options include the lexer being used (in my case it would show ‘bash’)


closed #15