Bad regex in prose diff logic

Had a daemon erroring out as it was trying to build a diff.

I tracked though the code and was able to isolate it to the following test case:

--- a/src/infrastructure/diff/prose/__tests__/PhutilProseDiffTestCase.php
+++ b/src/infrastructure/diff/prose/__tests__/PhutilProseDiffTestCase.php
@@ -30,6 +30,15 @@ final class PhutilProseDiffTestCase
       ),
       pht('Remove Paragraph'));

+    $this->assertProseParts(
+       "xxx",
+       "xxxyyy\n.zzz",
+       array(
+               "=xxx",
+               "+ yyy\n.zzz",
+               "= \n"
+       ),
+       pht('Amend paragraph and add paragraph starting with punctuation'));
 
     // Without smoothing, the alogorithm identifies that "shark" and "cat"
     // both contain the letter "a" and tries to express this as a very

It’s also reproducible on an admin.phacility test instance with the same content
#1 make a task with description “xxx”.
#2 edit description to “xxxyyyy\n.zzz”.
#3 click the “show details” of the “updated the task description” transaction

(or see https://test-x4ryazymuji7.phacility.com/T1)

The prose diff engine fails on level 1 (sentences) of the diff building as it’s dissecting the corpus. It eventually pass the fragment “xxxyyy\n.” into stitchPieces(). This then fails the regex there and we blow the stack when we try to access the matches. How it should be fixed I guess depends on how you’d want to interpret that diff. You could probably throw in a PCRE_DOTALL and be done with it, but maybe it would be better to say you don’t really want interior newlines at that point. It’s not clear to me what’s best.

Let me know if there’s any more information I can provide that would be useful


Version: HEAD of master (or whatever the test admin.phacility instances use).

Thanks for the very detailed report! I believe this is fixed by https://secure.phabricator.com/D21295.

In that change, I’ve just added /s (dotall) – and your test case.

It’s possible that parsing the text yyy\n. as two separate sentences (yyy\n and .) would produce better diffs than the current behavior, but we could make that change in the future – at least in the case of this particular diff, the diff wouldn’t change and seems correct as-is.