| From | Sjur Nørstebø Moshagen <sjur.n.moshagen@uit.no> |
|---|---|
| Newsgroups | gnu.utils.bug |
| Subject | UTF-8 corruption bug with diff -y |
| Date | 2018-11-08 08:47 +0000 |
| Message-ID | <mailman.3647.1541686030.1284.bug-gnu-utils@gnu.org> (permalink) |
Hello
Using diff -y on UTF-8 encoded text files with long lines risks corrupting the output when the default width of 130 columns is used and a multibyte character happens to straddle that limit. diff then truncates the output line in the middle of the byte sequence, producing malformed UTF-8 text.
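To illustrate the failure mode (this is a sketch of byte-oriented truncation in general, not of diff's actual code), the Python snippet below cuts the UTF-8 encoding of a word from the attached input after a fixed number of bytes. The function names are hypothetical and chosen only for this example.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding of text after `limit` bytes, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_on_char_boundary(text: str, limit: int) -> bytes:
    """Cut after at most `limit` bytes, dropping a trailing incomplete character."""
    data = text.encode("utf-8")[:limit]
    # errors="ignore" discards the dangling lead byte left by the cut.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


word = '"iešguđet"'  # "š" and "đ" each take two bytes in UTF-8

cut = truncate_bytes(word, 4)             # ends on the first byte of "š"
safe = truncate_on_char_boundary(word, 4)

print(is_valid_utf8(cut))   # False: the last character is incomplete
print(is_valid_utf8(safe))  # True: the partial character was dropped
```

A truncation that is aware of character boundaries, like the second function, would give up at most one character of output but would never emit malformed text.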
To reproduce:
diff -y Input-text-1.txt Input-text-2.txt
The bug can be circumvented by setting the column width to an arbitrarily high number, as long as it is higher than the longest diff line produced:
diff -y -W 200 Input-text-1.txt Input-text-2.txt
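Rather than guessing a width, one could estimate it from the input. The Python sketch below is only a rough heuristic under stated assumptions: that each side of the side-by-side output gets about half the total width, that counting bytes instead of characters errs on the safe side, and that a margin of 16 columns (an arbitrary choice) covers the gutter. Tab expansion in the output is not accounted for, and the function names are hypothetical.

```python
def longest_line_bytes(text: str) -> int:
    """Length in UTF-8 bytes of the longest line in text (0 if text is empty)."""
    return max((len(line.encode("utf-8")) for line in text.splitlines()), default=0)


def suggested_width(left: str, right: str, margin: int = 16) -> int:
    """Heuristic value for -W: twice the longest input line plus an arbitrary margin."""
    longest = max(longest_line_bytes(left), longest_line_bytes(right))
    return 2 * longest + margin


if __name__ == "__main__":
    # File names taken from the commands above.
    with open("Input-text-1.txt", encoding="utf-8") as f1, \
         open("Input-text-2.txt", encoding="utf-8") as f2:
        print(suggested_width(f1.read(), f2.read()))
```

The printed number can then be passed as the argument to -W in place of 200.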
The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached. The text (excluding --------) is also reproduced below in case the attachments are removed during e-mail transfer.
Regards,
Sjur Moshagen
Input-text-1.txt:
--------
"<ja>"
"ja" CC
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Attr
"iešguhtet" Pron Indef Gen
"<lágan>"
"lága" N Sem/Dummytag Ess
"lága" N Sem/Dummytag Sg Loc South Err/Orth
"lágan" A Sem/Hum Attr
"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
"lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
"lágan" A Sem/Hum Sg Nom
"láhka" N Sem/Rule Sg Loc South Err/Orth
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------
Input-text-2.txt:
--------
"<ja>"
"ja" CC
"<iešguđet lágan>"
"iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------