Re: ascii char 26

From bob <bob@coolgroups.com>
Newsgroups comp.lang.java.programmer
Subject Re: ascii char 26
Date 2011-09-11 19:12 -0700
Organization http://groups.google.com
Message-ID <63554bdb-dab4-43e7-b809-5128fd831f3c@m38g2000vbn.googlegroups.com> (permalink)
References <16f8836c-27b9-483b-a71f-61d7d6cfd188@i2g2000yqm.googlegroups.com> <j4jakd$dfl$1@dont-email.me>


You're right.  I messed up, and it was the em dash.  It turned into byte 26
after going through 'b = html.getBytes("US-ASCII");'

Here's the new code:

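	// needs: import java.io.UnsupportedEncodingException;
	public static String convertToAscii(String html) {
		// replace() matches literally; replaceAll() would parse its first argument as a regex
		html = html.replace('\u2019', '\'');  // right single quotation mark
		html = html.replace('\u201D', '"');   // right double quotation mark
		html = html.replace('\u201C', '"');   // left double quotation mark

		// mdash
		html = html.replace('\u2014', '-');

		try {
			// Round-trip through US-ASCII and return the result, so anything
			// still outside ASCII shows up as the encoder's substitute
			// (byte 26 in this thread) instead of passing through untouched.
			byte[] b = html.getBytes("US-ASCII");
			return new String(b, "US-ASCII");
		} catch (UnsupportedEncodingException e) {
			// US-ASCII is a charset every Java platform must support
			throw new RuntimeException(e);
		}
	}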

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.
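
For anyone else stuck without java.text.Normalizer, here is a rough
single-pass sketch that needs only java.lang.  The names AsciiFolder and
toAscii are made up for illustration, the extra mappings (left single
quote, en dash, minus sign) go beyond the four characters handled above,
and turning everything else above 127 into '?' is an assumption about
what is acceptable, not something anyone in the thread asked for.

```java
final class AsciiFolder {
    // Hypothetical helper: folds a few typographic characters to ASCII and
    // replaces every other char above 127 with '?'. A character outside the
    // BMP is two chars in a Java String, so it comes out as "??".
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\u2018':          // left single quotation mark
                case '\u2019':          // right single quotation mark
                    out.append('\'');
                    break;
                case '\u201C':          // left double quotation mark
                case '\u201D':          // right double quotation mark
                    out.append('"');
                    break;
                case '\u2013':          // en dash
                case '\u2014':          // em dash
                case '\u2212':          // minus sign
                    out.append('-');
                    break;
                default:
                    out.append(c < 128 ? c : '?');
            }
        }
        return out.toString();
    }
}
```

Calling AsciiFolder.toAscii(html) would stand in for the chain of replace
calls plus the getBytes round trip, and the substitute character no longer
depends on which platform's US-ASCII encoder is in use.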



On Sep 11, 4:52 pm, Joshua Cranmer <Pidgeo...@verizon.invalid> wrote:
> On 9/11/2011 4:33 PM, bob wrote:
>
> > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
>
> The US-ASCII encoder only properly encodes characters in the range of
> 0-127, i.e., the characters that are present in ASCII. Any other
> character is replaced with some sort of substitution character; in this
> case, it looks like the charset has chosen to use ^Z as the "I don't
> know what this character is" character (I would have guessed '?'
> instead, but I suppose they decided to go with the less-commonly used
> variant).
>
> My guess is your input is using one of the characters like the minus
> sign, em dash, or perhaps an en dash instead (there may be others),
> which are visually close in appearance to a hyphen but do not share the
> same Unicode codepoint.
>
> --
> Beware of bugs in the above code; I have only proved it correct, not
> tried it. -- Donald E. Knuth
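
A small sketch for tracking down which character the US-ASCII encoder
cannot represent, along the lines of the quoted explanation.  It asks the
encoder directly instead of inspecting the substituted byte, since that
byte evidently differs between platforms.  AsciiCheck and firstNonAscii
are hypothetical names; Charset.forName and CharsetEncoder.canEncode are
standard java.nio.charset calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class AsciiCheck {
    // Returns the index of the first char that US-ASCII cannot encode,
    // or -1 if the whole string is plain ASCII.
    static int firstNonAscii(String s) {
        CharsetEncoder encoder = Charset.forName("US-ASCII").newEncoder();
        for (int i = 0; i < s.length(); i++) {
            if (!encoder.canEncode(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Printing Integer.toHexString(s.charAt(i)) for the returned index shows the
actual code point, which is how an em dash (2014) can be told apart from
an en dash (2013) or a minus sign (2212).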



Thread

ascii char 26 bob <bob@coolgroups.com> - 2011-09-11 14:33 -0700
  Re: ascii char 26 Arne Vajhøj <arne@vajhoej.dk> - 2011-09-11 17:48 -0400
  Re: ascii char 26 Joshua Cranmer <Pidgeot18@verizon.invalid> - 2011-09-11 16:52 -0500
    Re: ascii char 26 Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-09-11 18:28 -0400
    Re: ascii char 26 bob <bob@coolgroups.com> - 2011-09-11 19:12 -0700
      Re: ascii char 26 Joshua Cranmer <Pidgeot18@verizon.invalid> - 2011-09-11 21:25 -0500
        Re: ascii char 26 bob <bob@coolgroups.com> - 2011-09-12 01:30 -0700
  Re: ascii char 26 Roedy Green <see_website@mindprod.com.invalid> - 2011-09-11 15:25 -0700
  Re: ascii char 26 Bent C Dalager <bcd@pvv.ntnu.no> - 2011-09-11 23:18 +0000
    Re: ascii char 26 Joshua Cranmer <Pidgeot18@verizon.invalid> - 2011-09-11 18:37 -0500
    Re: ascii char 26 Retahiv Oopsiscame <roopsisc@gmail.com> - 2011-09-11 16:53 -0700
      Re: ascii char 26 Roedy Green <see_website@mindprod.com.invalid> - 2011-09-14 11:55 -0700
