Thanks. This appears to be a bug in Pygments. Here’s a simplified reproduction case where Pygments takes more than 4 minutes to highlight about 70 characters of text on my machine, with no Phabricator code involved:
epriestley@orbital ~/dev/libphutil $ cat 60slash.sh
epriestley@orbital ~/dev/libphutil $ cat 60slash.sh | time python /usr/local/bin/pygmentize -O encoding=utf-8 -O stripnl=False -f html -l bash
312.67 real 311.36 user 0.55 sys
As more backslashes are added, Pygments takes longer and longer to produce output.
The Pygments Bash lexer includes this pattern in
This pattern has can backtrack explosively when given input in the form
"\\\\... (an unterminated, double quoted string containing many backslashes). Here’s a simple reproduction case:
pattern = r'"(\\\\|\\.)*"'
corpus = r'"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'
This script takes a very long time to execute if run as
python match.py. At each step, the script tries to find a way to match the input by examining each subpattern as an alternative. This leads to an evaluation cost on
O(2^N) where N is the number of double backslashes in the input.
In some regular expression engines, this can be fixed by using a “once-only” subpattern to disable backtracking. However, as far as I can tell, Python doesn’t support this feature.
A slightly tricker way to fix this is to make sure the repeating pattern has only one possible interpretation for any input. If we rewrite the pattern like this:
…I believe the engine will accept and reject the same inputs, but will never backtrack. Making this change to
- (r'(?s)\$?"(\\\\|\\[0-7]+|\\.|[^"\\$])*"', String.Double),
+ (r'(?s)\$?"(\\([0-7]+|[^0-7])|[^"\\$])*"', String.Double),
…reduces runtime on
60slash.sh (above) from ~300 seconds to 0.05 seconds on my machine.
As a workaround, you can try making the same change on your system.
Another “workaround” is to limit the number of backslashes which appear in any single Bash string to around 40. This is obviously fairly ridiculous, but your reproduction script is also pretty ridiculous.
I’ll file this in our upstream and consider options for how we’ll move forward with it.
When filing bugs here, it’s incredibly important that you provide a reproduction case which actually lets us reproduce the issue, which is why the pinned thread spends five paragraphs reiterating over and over again that you need to give us a usable reproduction case. The original report five months ago didn’t have enough information to possibly trace the root cause to backtracking behavior in a lexer regex.