Groups > comp.lang.java.programmer > #15933 > unrolled thread

number of bytes for each (uni)code point while using utf-8 as encoding ...

Started by	lbrt chx _ gemale
First post	2012-07-11 00:08 +0000
Last post	2012-07-11 14:05 -0700
Articles	4 — 4 participants

  number of bytes for each (uni)code point while using utf-8 as encoding ... lbrt chx _ gemale - 2012-07-11 00:08 +0000
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... rossum <rossum48@coldmail.com> - 2012-07-11 16:09 +0100
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Robert Klemme <shortcutter@googlemail.com> - 2012-07-11 22:03 +0200
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Lew <lewbloch@gmail.com> - 2012-07-11 14:05 -0700

#15933 — number of bytes for each (uni)code point while using utf-8 as encoding ...

From	lbrt chx _ gemale
Date	2012-07-11 00:08 +0000
Subject	number of bytes for each (uni)code point while using utf-8 as encoding ...
Message-ID	<1341965282.664308@nntp.aceinnovative.com>

~ 
 I obviously, and I would say -very clearly-, meant that a -file's encoding- is either incorrectly set by its authors or corrupted in transit. (I never said anything about failing disks ...)
~ 
 Sometimes we technical people sound like lawyers/politicians trying to correct people's minds and/or trying to prove something to oneself.
~ 
 What I asked is an entirely technical question, namely: how to get the length of the sequence of bytes defining a code point.
~ 
 lbrtchx

#15940

From	rossum <rossum48@coldmail.com>
Date	2012-07-11 16:09 +0100
Message-ID	<0t4rv7d9lokdbm0287lf7h76u41a0qunvu@4ax.com>
In reply to	#15933

On 11 Jul 2012 00:08:02 GMT, lbrt chx _ gemale wrote:

> how to get the length of the sequence of bytes defining a code point
Use a lookup table.

Start Code Point   End Code Point   Num Bytes  
----------------   --------------   ---------
     U+0000           U+007F            1
     U+0080           U+07FF            2
     U+0800           U+FFFF            3
     U+10000          U+1FFFFF          4
     U+200000         U+3FFFFFF         5
     U+4000000        U+7FFFFFFF        6
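
The table above matches the original UTF-8 definition. RFC 3629 later restricted UTF-8 to U+0000..U+10FFFF, so valid UTF-8 today never uses the 5- and 6-byte forms, and the 4-byte row effectively ends at U+10FFFF. A minimal Java sketch of the lookup under that restriction follows. The class name `Utf8Length` and both method names are hypothetical, chosen only for illustration. The second method covers the other reading of the question: given the first byte of an encoded sequence, how long is the sequence?

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the JDK.
final class Utf8Length {

    // Number of bytes UTF-8 needs for one code point (RFC 3629 range only).
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        // Surrogates are not scalar values and have no valid UTF-8 encoding.
        if (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE) {
            throw new IllegalArgumentException("surrogate code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    // Sequence length announced by a lead byte, or -1 if the byte
    // cannot start a valid sequence (continuation byte, 0xC0, 0xC1, 0xF5..0xFF).
    static int sequenceLength(byte leadByte) {
        int b = leadByte & 0xFF; // treat the byte as unsigned
        if (b <= 0x7F) return 1;
        if (b >= 0xC2 && b <= 0xDF) return 2;
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF4) return 4;
        return -1;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            // Cross-check against the JDK's own UTF-8 encoder.
            byte[] encoded = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X: table %d, encoder %d, lead byte %d%n",
                    cp, utf8Length(cp), encoded.length, sequenceLength(encoded[0]));
        }
    }
}
```

The `main` method only compares the table against what `String.getBytes(StandardCharsets.UTF_8)` produces for a few sample code points. Note that `sequenceLength` checks the lead byte alone; it does not validate the continuation bytes that follow.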


rossum

#15944

From	Robert Klemme <shortcutter@googlemail.com>
Date	2012-07-11 22:03 +0200
Message-ID	<a664fsFnrhU1@mid.individual.net>
In reply to	#15933

On 11.07.2012 02:08, lbrt chx _ gemale wrote:
> What I
> asked is an entirely technical question, namely: how to get the
> length of the sequence of bytes defining a code point ~ lbrtchx

Would you also disclose why you need that information, or rather what you
want to do with it?  I don't see the use case.

And please try to keep the thread together - it's quite tedious to 
follow a discussion spread across a number of threads.  Thank you!

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

#15946

From	Lew <lewbloch@gmail.com>
Date	2012-07-11 14:05 -0700
Message-ID	<6d3b2b50-0404-40fa-b611-7cf242b51c4f@googlegroups.com>
In reply to	#15933

On Tuesday, July 10, 2012 5:08:02 PM UTC-7, (unknown) wrote:
> ~ 
>  I obviously and I would say -very clearly- meant a -file's encoding- is either incorrectly set by authors or is corrupted in transit. (I never said anything about failing disks ...)

And those two cases were answered in your other thread. 
Obviously, and very clearly.

Drop your attitude.

Um, please.

> ~ 
>  Sometimes we technical people sound like lawyers/politicians trying to correct people's minds and/or trying to prove something to oneself

Is that what you're doing?

> ~ 
>  What I asked is an entirely technical question, namely: how to get the length of the sequence of bytes defining a code point

And what was answered was a set of entirely technical responses, 
namely how to get the length of the sequence of bytes 
defining a code point.

What is your problem?

-- 
Lew
