| Path | csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail |
|---|---|
| From | Sjur Nørstebø Moshagen <sjur.n.moshagen@uit.no> |
| Newsgroups | gnu.utils.bug |
| Subject | UTF-8 corruption bug with diff -y |
| Date | Thu, 8 Nov 2018 08:47:57 +0000 |
| Lines | 83 |
| Approved | bug-gnu-utils@gnu.org |
| Message-ID | <mailman.3647.1541686030.1284.bug-gnu-utils@gnu.org> (permalink) |
| NNTP-Posting-Host | lists.gnu.org |
| Mime-Version | 1.0 |
| Content-Type | multipart/mixed; boundary="_005_9B01AE6B9CB0495AA5A160CCA7D583C2uitno_" |
| X-Trace | usenet.stanford.edu 1541686031 3852 208.118.235.17 (8 Nov 2018 14:07:11 GMT) |
| X-Complaints-To | action@cs.stanford.edu |
| To | "bug-gnu-utils@gnu.org" <bug-gnu-utils@gnu.org> |
| Envelope-to | bug-gnu-utils@gnu.org |
| X-detected-operating-system | by eggs.gnu.org: Windows 7 or 8 [fuzzy] |
| X-Received-From | 40.107.7.94 |
| X-Mailman-Approved-At | Thu, 08 Nov 2018 09:07:09 -0500 |
| X-Content-Filtered-By | Mailman/MimeDel 2.1.21 |
| X-BeenThere | bug-gnu-utils@gnu.org |
| X-Mailman-Version | 2.1.21 |
| Precedence | list |
| List-Id | Bug reports for the GNU utilities <bug-gnu-utils.gnu.org> |
| List-Unsubscribe | <https://lists.gnu.org/mailman/options/bug-gnu-utils>, <mailto:bug-gnu-utils-request@gnu.org?subject=unsubscribe> |
| List-Archive | <http://lists.gnu.org/archive/html/bug-gnu-utils/> |
| List-Post | <mailto:bug-gnu-utils@gnu.org> |
| List-Help | <mailto:bug-gnu-utils-request@gnu.org?subject=help> |
| List-Subscribe | <https://lists.gnu.org/mailman/listinfo/bug-gnu-utils>, <mailto:bug-gnu-utils-request@gnu.org?subject=subscribe> |
| Xref | csiph.com gnu.utils.bug:2252 |
Hello
Using diff -y on UTF-8 encoded text files with long lines risks producing corrupted output when run with the default width of 130 columns, if a multibyte character happens to fall on the border of that limit. The diff command truncates the output line in the middle of the byte sequence, producing malformed UTF-8 text.
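The effect described above can be illustrated with a small Python sketch. This is not diff's actual code; the helper names `truncate_bytes`, `truncate_chars` and `is_valid_utf8` are made up for illustration, and the sketch only assumes that the cut is made by byte count rather than by character.

```python
def truncate_bytes(line: str, width: int) -> bytes:
    """Cut the UTF-8 encoding of `line` after `width` bytes, ignoring character boundaries."""
    return line.encode("utf-8")[:width]


def truncate_chars(line: str, width: int) -> bytes:
    """Cut `line` after `width` characters, then encode; this never splits a character."""
    return line[:width].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# One line from Input-text-1.txt; 'á' is encoded as two bytes (0xC3 0xA1),
# which sit at byte offsets 2 and 3 of the encoded line.
sample = '"lágan" A Sem/Hum Attr'

# Cutting after 3 bytes keeps 0xC3 but drops 0xA1, so the result is malformed.
print(is_valid_utf8(truncate_bytes(sample, 3)))

# Cutting after 3 characters keeps the whole 'á', so the result stays valid.
print(is_valid_utf8(truncate_chars(sample, 3)))
```

A character-aware cut like `truncate_chars` is only one possible remedy; it counts characters, not display columns, so it is a simplification.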
To reproduce:
diff -y Input-text-1.txt Input-text-2.txt
The bug can be circumvented by setting the column width to an arbitrarily high number, as long as it is higher than the longest diff line produced:
diff -y -W 200 Input-text-1.txt Input-text-2.txt
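As a rough way to choose a value for -W instead of guessing, the following Python sketch estimates a width from the input lines. It rests on assumptions not stated in this report: that each side of the side-by-side output gets roughly half of the total width, and that a fixed margin covers the gutter between the columns. The function name `suggest_width` and the `margin` and `tabsize` defaults are hypothetical.

```python
def suggest_width(lines_a, lines_b, margin=16, tabsize=8):
    """Estimate a total width for `diff -y -W` that leaves no input line truncated.

    Assumption: each side gets about half the total width, so the estimate is
    twice the longest line (tabs expanded, counted in characters) plus a margin.
    """
    longest = max(
        (len(line.rstrip("\n").expandtabs(tabsize)) for line in [*lines_a, *lines_b]),
        default=0,
    )
    return 2 * longest + margin


# Example with two lines taken from the attached inputs.
left = ['"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc']
right = ['"<iešguđet lágan>"']
print(suggest_width(left, right))
```

The result counts characters rather than terminal columns, so it is only an estimate; a generous margin is cheaper than a truncated line.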
The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached. The text (excluding --------) is also reproduced below in case the attachments are removed during e-mail transfer.
Regards,
Sjur Moshagen
Input-text-1.txt:
--------
"<ja>"
"ja" CC
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Attr
"iešguhtet" Pron Indef Gen
"<lágan>"
"lága" N Sem/Dummytag Ess
"lága" N Sem/Dummytag Sg Loc South Err/Orth
"lágan" A Sem/Hum Attr
"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
"lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
"lágan" A Sem/Hum Sg Nom
"láhka" N Sem/Rule Sg Loc South Err/Orth
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------
Input-text-2.txt:
--------
"<ja>"
"ja" CC
"<iešguđet lágan>"
"iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------