ArcanistLintEngine does not take multi-byte characters into account


#1

Observed Behavior & Reproduction steps:
With the following file saved as test:

↓↓↓↓↓↓
abcdef
ghijkl

and the following code run from within ArcanistLintEngine:

list($line, $char) = $this->getLineAndCharFromOffset('test', 15);
echo("line=$line, char=$char");

we get the following output:

line=0, char=15

Expected Behavior:
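Each multi-byte character should be counted as one character, so offset 15 should resolve to line=2, char=1. Instead the offset is treated as a byte offset: assuming the file is UTF-8, each ↓ is three bytes, so the first line alone is 18 bytes long and byte 15 still falls on line 0.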
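
To make the discrepancy concrete, here is a minimal sketch in Python. It is not Arcanist's actual code: line_and_char is a hypothetical helper that maps an offset to a zero-based (line, char) pair by counting newlines, and the file is assumed to be UTF-8 encoded. Running the same logic over the raw bytes and over the decoded text gives the two different answers.

```python
# Hypothetical helper (not Arcanist's implementation): map an offset to a
# zero-based (line, char) pair by counting the newlines that precede it.
# Accepts either str (offset counts characters) or bytes (offset counts bytes).
def line_and_char(data, offset):
    newline = "\n" if isinstance(data, str) else b"\n"
    head = data[:offset]
    line = head.count(newline)
    last_newline = head.rfind(newline)  # -1 when the offset is on the first line
    return line, offset - (last_newline + 1)

text = "↓↓↓↓↓↓\nabcdef\nghijkl\n"
raw = text.encode("utf-8")  # assumption: the test file is UTF-8

print(line_and_char(raw, 15))   # offset interpreted as bytes
print(line_and_char(text, 15))  # offset interpreted as characters
```

Under these assumptions the byte interpretation gives (0, 15), matching the observed output, while the character interpretation gives (2, 1), matching the expected one.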

Phabricator Version:
This issue is present at commit d581c453b83c515f3acac963bbc117e8dd0d1ef4


#2

One way to resolve this would be to use mb_strlen instead of strlen, since mb_strlen counts each multi-byte character as one character.


#3

Thanks, that looks reasonable.
I think there are some use cases somewhere that actually depend on this being a byte offset rather than a character offset, so actually fixing it would require some work. Also, lint is very low on the upstream priority list right now.

I’ve filed a ticket upstream at https://secure.phabricator.com/T13139, but if you can convince your tool to output byte offsets instead of character offsets (or better, line+char info), you’ll reach your goal faster than by waiting for this to be resolved.
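
As a rough sketch of that workaround, assuming the tool has the file contents available as decoded text and that the file is UTF-8, the conversion could look like the following. The function names are hypothetical and not part of Arcanist or any particular linter, and the line and column here are zero-based, so adjust them if the consumer expects one-based values.

```python
# Hypothetical helpers for the tool side; names are illustrative only.

def char_to_byte_offset(text, char_offset, encoding="utf-8"):
    """Byte offset of the character at char_offset, for the given encoding."""
    # Encode everything before the character and count the resulting bytes.
    return len(text[:char_offset].encode(encoding))

def char_offset_to_line_char(text, char_offset):
    """Zero-based (line, column) of the character at char_offset."""
    head = text[:char_offset]
    line = head.count("\n")
    column = char_offset - (head.rfind("\n") + 1)
    return line, column

text = "↓↓↓↓↓↓\nabcdef\nghijkl\n"

print(char_to_byte_offset(text, 15))       # byte offset to report instead of 15
print(char_offset_to_line_char(text, 15))  # or report line and column directly
```

For the test file above, character offset 15 (the "h" on the third line) corresponds to byte offset 27, since the six arrows take 18 bytes rather than 6.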