From: Joshua Cranmer
Newsgroups: comp.lang.java.programmer
Subject: Re: ascii char 26
Date: Sun, 11 Sep 2011 16:52:41 -0500

On 9/11/2011 4:33 PM, bob wrote:
> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

The US-ASCII encoder can only encode characters in the range 0-127,
i.e., the characters that are present in ASCII. Any other character is
replaced with a substitution character; in this case, it looks like the
charset has chosen ^Z (char 26, the ASCII SUB control character) as its
"I don't know what this character is" character. (I would have guessed
'?' instead, but I suppose they decided to go with the less commonly
used variant.)

My guess is that your input contains a minus sign, an em dash, or
perhaps an en dash (there may be others). These look very much like a
hyphen but have different Unicode code points, all of them outside the
ASCII range.

-- 
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
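
For what it's worth, the substitution behaviour described above can be
sketched roughly as follows, assuming Java 7's java.nio.charset API. The
class name AsciiSubstitution and the helper encodeAscii are made up for
illustration, and configuring the encoder with ^Z by hand is only a guess
at how such output could arise, not a claim about what the original
poster's code actually does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not taken from the original thread.
class AsciiSubstitution {

    // Encode text as US-ASCII, writing `replacement` for every character
    // that has no ASCII representation (anything above 127).
    static byte[] encodeAscii(String text, byte replacement) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected: both error actions are set to REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "a\u2013b"; // 'a', EN DASH (U+2013), 'b'

        // Encoder explicitly configured to substitute ^Z (0x1A, decimal 26).
        System.out.println(Arrays.toString(encodeAscii(text, (byte) 0x1A)));
        // prints [97, 26, 98]

        // String.getBytes uses the charset's default replacement, which is '?'.
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.US_ASCII)));
        // prints [97, 63, 98]
    }
}
```

A real ASCII hyphen-minus (U+002D) passes through both paths unchanged, so
seeing 26 or 63 in the output is a sign that the input character was not
a plain hyphen.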