[Conduit] file.search doesn't accept names with accents

Reproduction Instructions

  • Add a file, via Conduit or directly in Phabricator, whose name contains an accent, such as Détail article.jpg.
  • Try to look it up with the Conduit file.search API method (from code or directly through the web form):

Without PHP urlencode/rawurlencode on the name => exception
With PHP urlencode/rawurlencode on the name => file not found

Phabricator/Arcanist Version
Output from Config > Version Information or arc version.

Phabricator version :

commit 64cc4fe9151547a8f872ea8c83b9a6cea22ce023
Author: epriestley <git@epriestley.com>
Date: Fri Feb 14 13:56:08 2020 -0800

Arcanist version

arcanist ee66b15bd4694813dff67524eee7fd11ac3ebd97 (17 Feb 2020)

Thanks, I filed this upstream as https://secure.phabricator.com/T13501.

Some ngram indexing functions had a bug which caused them to slice multibyte characters incorrectly. I’ve fixed this bug in https://secure.phabricator.com/D21128.
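As a rough illustration of this class of bug (this is not Phabricator's actual indexing code, and the helper names are made up for the example), the Python sketch below compares trigrams cut on byte offsets with trigrams cut on character offsets. Cutting on bytes can split the two-byte encoding of U+0301 and leave fragments that are not valid UTF-8.

```python
def byte_trigrams(text):
    """Slice the UTF-8 encoding into 3-byte windows, ignoring character boundaries."""
    raw = text.encode("utf-8")
    return [raw[i:i + 3] for i in range(len(raw) - 2)]


def char_trigrams(text):
    """Slice into 3-character windows, so no code point is ever split."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def is_valid_utf8(chunk):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Détail" in decomposed form: D, e, U+0301 COMBINING ACUTE ACCENT, t, a, i, l
name = "De\u0301tail"

# Windows that start or end in the middle of the two-byte U+0301 sequence
broken = [g for g in byte_trigrams(name) if not is_valid_utf8(g)]
```

Here `broken` holds the byte windows that cut through the combining accent, while every entry of `char_trigrams(name)` is a well-formed string.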

If you upgrade Phabricator and rebuild the index (with ./bin/search index --type PhabricatorFile --force) you should now be able to find these results in some cases. For example:

$ echo '{"constraints": {"name": "De\u0301tail"}}' | ./bin/conduit call --method file.search --input -
Reading input from stdin...
{
  "result": {
...
          "name": "De\u0301tail article.jpg",
...

Note that this is a search for D e <U+0301 COMBINING ACUTE ACCENT> t a i l, which works because this is exactly the same sequence stored in the database.

If you search for D e t a i l or D <U+00E9 LATIN SMALL LETTER E WITH ACUTE> t a i l, Phabricator will not find the file. This is undesirable: the preferred behavior is that Phabricator should find the file.
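To make the three spellings concrete, here is a small Python sketch using the standard unicodedata module. It only illustrates the Unicode side of the problem; the variable and function names are chosen for the example and say nothing about how Phabricator or MySQL compare strings internally.

```python
import unicodedata

decomposed = "De\u0301tail"   # D e <U+0301> t a i l, the sequence stored in the database
precomposed = "D\u00e9tail"   # D <U+00E9> t a i l
plain = "Detail"              # no accent at all


def strip_accents(text):
    """Decompose to NFD, then drop combining marks such as U+0301."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))


# The two accented spellings render the same but are different code point
# sequences, so a plain substring comparison treats them as different strings.
same_without_normalizing = decomposed == precomposed

# After normalizing both to one form (NFC here), they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)
```

Under these definitions `same_without_normalizing` is False and `same_after_nfc` is True, and `strip_accents` maps both accented spellings to the unaccented one.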

Fixing this seems complicated because MySQL’s LIKE operator does not appear to normalize characters, even in the utf8mb4_unicode_ci collation (which is supposedly “accent-insensitive”).

MySQL’s documentation possibly alludes to this, e.g.:

Some characters are not supported, and combining marks are not fully supported.
https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-unicode-sets.html

This can be fixed, but it means that we have to add a separate field to store the search-normalized version of the string and then only issue LIKE queries against that. Because of how many objects this affects, this is a large amount of work. This will also require installs to perform a reindex, which motivates making a real effort to get normalization (which is complex) correct.
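A minimal sketch of the separate-field idea, under loud assumptions: it uses an in-memory SQLite database in place of MySQL, and the table name, the name_search column, and the search_key function are all hypothetical. The normalization shown (NFKD, strip combining marks, casefold) is deliberately simplistic and is not a claim about what a correct normalization for Phabricator would be.

```python
import sqlite3
import unicodedata


def search_key(name):
    """Hypothetical search normalization: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def find_files(conn, query):
    """Normalize the query the same way, then LIKE only against the normalized column."""
    # Note: % and _ in the query are not escaped in this sketch.
    pattern = "%" + search_key(query) + "%"
    rows = conn.execute(
        "SELECT name FROM file WHERE name_search LIKE ?", (pattern,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (name TEXT, name_search TEXT)")

# The display name is stored untouched; the normalized copy is written alongside it.
stored = "De\u0301tail article.jpg"
conn.execute("INSERT INTO file VALUES (?, ?)", (stored, search_key(stored)))
```

With this layout, `find_files(conn, "Detail")`, `find_files(conn, "D\u00e9tail")`, and `find_files(conn, "De\u0301tail")` all return the stored file. The reindex mentioned above corresponds to filling name_search for every existing row.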

I plan to pursue these broader changes eventually, but they’re a large amount of work and not likely to happen in the short term given that relatively few users/queries are impacted.

Thanks

I will update after the lockdown; for the moment, all deployments are stopped :frowning: