From: Bent C Dalager
Newsgroups: comp.lang.java.programmer
Subject: Re: ascii char 26
Date: Sun, 11 Sep 2011 23:18:51 +0000 (UTC)
Organization: Norwegian university of science and technology
References: <16f8836c-27b9-483b-a71f-61d7d6cfd188@i2g2000yqm.googlegroups.com>

On 2011-09-11, bob wrote:
> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

Unicode has multiple different hyphens and hyphen-like characters. The
traditional ASCII hyphen is the Unicode "hyphen-minus" which encodes
to 0x2d in utf-8.

http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen               U+2010
non-breaking hyphen  U+2011
figure dash          U+2012
en dash              U+2013
minus sign           U+2212
roman uncia sign     U+10191

If hyphens are of particular interest to you it may be a better
approach to replace non-ASCII-supported hyphens from the above list
with "hyphen-minus", before you transcode to ASCII.

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

Cheers,
Bent D
-- 
Bent Dalager - bcd@pvv.org - http://www.pvv.org/~bcd
                                    powered by emacs
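
For illustration, here is a minimal sketch of the "replace before you transcode" approach described above. The class and method names (HyphenNormalizer, toAsciiHyphens, toAsciiBytes) are hypothetical and not part of any library; the code point table is just the list from this post, and the sketch assumes Java 8 or later for String.codePoints().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a JDK or third-party API.
final class HyphenNormalizer {
    // Hyphen-like code points taken from the list in the post.
    private static final int[] HYPHEN_LIKE = {
        0x2010,  // hyphen
        0x2011,  // non-breaking hyphen
        0x2012,  // figure dash
        0x2013,  // en dash
        0x2212,  // minus sign
        0x10191  // roman uncia sign (outside the BMP, needs a surrogate pair)
    };

    private HyphenNormalizer() {}

    private static boolean isHyphenLike(int codePoint) {
        for (int candidate : HYPHEN_LIKE) {
            if (candidate == codePoint) {
                return true;
            }
        }
        return false;
    }

    // Replaces every listed hyphen-like character with hyphen-minus (U+002D).
    // Iterates by code point so that U+10191 is handled as one character.
    static String toAsciiHyphens(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp ->
            out.appendCodePoint(isHyphenLike(cp) ? '-' : cp));
        return out.toString();
    }

    // Normalizes hyphens first, then transcodes to ASCII. Any remaining
    // non-ASCII character is still replaced by the charset's replacement byte.
    static byte[] toAsciiBytes(String input) {
        return toAsciiHyphens(input).getBytes(StandardCharsets.US_ASCII);
    }
}
```

Calling HyphenNormalizer.toAsciiBytes(text) instead of transcoding text directly would keep the listed dashes as 0x2d in the output. It only covers the characters in the table, so it is not the general "convert to ASCII-supported variants" function wished for above.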