Groups > comp.lang.java.help > #1846 > unrolled thread
| Started by | Young <ycp101@gmail.com> |
|---|---|
| First post | 2012-06-13 02:23 +0000 |
| Last post | 2012-06-13 10:24 +0100 |
| Articles | 15 — 8 participants |
Actual width of unicode characters. Young <ycp101@gmail.com> - 2012-06-13 02:23 +0000
Re: Actual width of unicode characters. Roedy Green <see_website@mindprod.com.invalid> - 2012-06-12 20:42 -0700
Re: Actual width of unicode characters. markspace <-@.> - 2012-06-12 20:45 -0700
Re: Actual width of unicode characters. markspace <-@.> - 2012-06-13 00:58 -0700
Re: Actual width of unicode characters. Young <ycp101@gmail.com> - 2012-06-13 08:30 +0000
Re: Actual width of unicode characters. markspace <-@.> - 2012-06-13 08:45 -0700
Re: Actual width of unicode characters. Lew <lewbloch@gmail.com> - 2012-06-13 14:14 -0700
Re: Actual width of unicode characters. Steven Simpson <ss@domain.invalid> - 2012-06-14 08:06 +0100
Re: Actual width of unicode characters. "mayeul.marguet" <mayeul.marguet@free.fr> - 2012-06-14 10:26 +0200
Re: Actual width of unicode characters. "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-06-14 15:32 +0200
Re: Actual width of unicode characters. "mayeul.marguet" <mayeul.marguet@free.fr> - 2012-06-14 16:47 +0200
Re: Actual width of unicode characters. "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-06-16 20:39 +0200
Re: Actual width of unicode characters. Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-06-13 00:00 -0400
Re: Actual width of unicode characters. markspace <-@.> - 2012-06-13 00:24 -0700
Re: Actual width of unicode characters. Steven Simpson <ss@domain.invalid> - 2012-06-13 10:24 +0100
| From | Young <ycp101@gmail.com> |
|---|---|
| Date | 2012-06-13 02:23 +0000 |
| Subject | Actual width of unicode characters. |
| Message-ID | <jr8tid$64v$1@tnews.hananet.net> |
I am trying to write a class that draws tables in console mode. A Korean
character takes two columns of width for one character. When I count the
length of a string that mixes numbers and Korean characters, I cannot get
the correct width. As a result, I cannot align the vertical lines properly.
So, I have written a new method, countWidth().
private boolean isEnglish(char s) {
    if (s < 256) {
        return true;
    } else {
        return false;
    }
}
However, this works properly only when the string variables contain just
English and Korean. How would I know how many character cells a string
takes up in the console?
Thank you.
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-06-12 20:42 -0700 |
| Message-ID | <c03gt7h0dff7nc303f0d28j18pvpim7nn3@4ax.com> |
| In reply to | #1846 |
On Wed, 13 Jun 2012 02:23:09 +0000 (UTC), Young <ycp101@gmail.com>
wrote, quoted or indirectly quoted someone who said :
>When I count length of
>a string having numbers and Korean Characters together, I cannot get
>correct width.
see http://mindprod.com/jgloss/stringwidth.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
Controlling complexity is the essence of computer programming.
~ Brian W. Kernighan 1942-01-01 .
| From | markspace <-@.> |
|---|---|
| Date | 2012-06-12 20:45 -0700 |
| Message-ID | <jr92cs$9m1$1@dont-email.me> |
| In reply to | #1846 |
On 6/12/2012 7:23 PM, Young wrote:
> I am trying to write a class that draws tables in console mode. Korean
> character takes two space width for one character.
I think "width" here does not mean what you think.
> When I count length of
> a string having numbers and Korean Characters together, I cannot get
> correct width.
String korean = ...
int count = korean.codePointCount( 0, korean.length()-1 );
> In result, I cannot align vertical line right.
Er, what? Are you looking for this?
<http://docs.oracle.com/javase/tutorial/2d/text/measuringtext.html>
> So, I have wrote a new method, countWidth().
Use "codePointCount()" above.
> However, this would work properly when I have English and Korean in
> string variables. How would I know how many character spaces in console?
Use "codePointCount()" above.
A short, self-contained, compilable example would help here:
http://sscce.org/
| From | markspace <-@.> |
|---|---|
| Date | 2012-06-13 00:58 -0700 |
| Message-ID | <jr9h8n$ark$1@dont-email.me> |
| In reply to | #1848 |
On 6/12/2012 8:45 PM, markspace wrote:
> String korean = ...
> int count = korean.codePointCount( 0, korean.length()-1 );
> A short, self-contained, compilable example would help here:
> http://sscce.org/
An SSCCE helps a lot! I was wrong about the -1 to length(), it should
just be length(). The following seems to work for me. If you have some
different idea, please write a code example so we can tell what you mean.
package quicktest;

import javax.swing.JFrame;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

/**
 *
 * @author Brenden
 */
public class Cjk {

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {

            public void run() {
                createGui();
            }
        });
    }

    private static void createGui() {
        JFrame frame = new JFrame();

        int[] korChars = new int[10];
        for (int i = 0; i < korChars.length; i++) {
            korChars[i] = 0xAC00+i;
        }
        String kor = "test: "+new String( korChars, 0, korChars.length );
        JTextArea ta = new JTextArea( kor );
        frame.add(ta);

        ta.append("\nCount: "+ kor.codePointCount(0, kor.length() ));

        frame.pack();
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setLocationRelativeTo(null);
        frame.setVisible(true);
    }
}
| From | Young <ycp101@gmail.com> |
|---|---|
| Date | 2012-06-13 08:30 +0000 |
| Message-ID | <jr9j3p$qg0$1@tnews.hananet.net> |
| In reply to | #1851 |
On Wed, 13 Jun 2012 00:58:46 -0700, markspace wrote:
> On 6/12/2012 8:45 PM, markspace wrote:
>> String korean = ...
>> int count = korean.codePointCount( 0, korean.length()-1 );
>
>> A short, self-contained, compilable example would help here:
>> http://sscce.org/
>
>
> An SSCCE helps a lot! I was wrong about the -1 to length(), it should
> just be length(). The following seems to work for me. If you have some
> different idea, please write a code example so we can tell what you
> mean.
>
>
>
> package quicktest;
>
> import javax.swing.JFrame;
> import javax.swing.JTextArea;
> import javax.swing.SwingUtilities;
>
> /**
> *
> * @author Brenden
> */
> public class Cjk {
>
> public static void main(String[] args) {
> SwingUtilities.invokeLater(new Runnable() {
>
> public void run() {
> createGui();
> }
> });
> }
>
> private static void createGui() {
> JFrame frame = new JFrame();
>
> int[] korChars = new int[10];
> for (int i = 0; i < korChars.length; i++) {
> korChars[i] = 0xAC00+i;
> }
> String kor = "test: "+new String( korChars, 0, korChars.length );
> JTextArea ta = new JTextArea( kor );
> frame.add(ta);
>
> ta.append("\nCount: "+ kor.codePointCount(0, kor.length() ));
>
> frame.pack();
> frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
> frame.setLocationRelativeTo(null);
> frame.setVisible(true);
> }
> }
Thank you for trying, but I don't understand why I should use the
codePointCount() method. The length() method gives the same result. I want
to know how many literal spaces the string occupies in the console. Since a
Korean letter takes two literal spaces in console mode, the result should
be 26 instead of 16. Does anyone know what I am missing? Thanks.
| From | markspace <-@.> |
|---|---|
| Date | 2012-06-13 08:45 -0700 |
| Message-ID | <jracjl$en8$1@dont-email.me> |
| In reply to | #1852 |
On 6/13/2012 1:30 AM, Young wrote:
> Thank you for the tries, I don't understand why I should use
> codePointCount() method. The length() method gives same result.
Because if you are using the full Unicode plane, the results will not be
correct otherwise. Is that hard to understand? Are you *sure* that you
have no characters outside the BMP?
> I want to
> know how many literal spaces in console. Since Korean letter takes two
> literal spaces in console mode. The result would be 26 instead of 16.
> Anyone knows what I am missing?
Besides Korean, how many other double width characters do you have in
your character set?
(I think I see now what you mean by "double width" characters on your
console.)
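To make the BMP point concrete, here is a minimal sketch of how length() and codePointCount() diverge. The class and method names are made up for illustration, and U+20000 is just one convenient example of a character outside the BMP. For the two Hangul syllables both methods report 2; for the single supplementary character length() reports 2 and codePointCount() reports 1. Neither number is a count of console columns.

```java
final class LengthVsCodePoints {

    /** Number of UTF-16 code units (Java chars) in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points; the end index is length(), not length() - 1. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Two Hangul syllables, both inside the BMP: one char each.
        String hangul = "\uAC00\uAC01";
        // U+20000, a CJK ideograph outside the BMP, stored as a surrogate pair.
        String supplementary = new String(Character.toChars(0x20000));

        System.out.println(charCount(hangul) + " chars, "
                + codePoints(hangul) + " code points");
        System.out.println(charCount(supplementary) + " chars, "
                + codePoints(supplementary) + " code points");
    }
}
```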
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-06-13 14:14 -0700 |
| Message-ID | <ac90eba4-9efe-4c35-9f2b-2590326f1fe5@googlegroups.com> |
| In reply to | #1852 |
Young wrote:
> markspace wrote:
>> markspace wrote:
>>> String korean = ...
>>> int count = korean.codePointCount( 0, korean.length()-1 );
>>
>>> A short, self-contained, compilable example would help here:
>>> http://sscce.org/
>>
>>
>> An SSCCE helps a lot! I was wrong about the -1 to length(), it should
>> just be length(). The following seems to work for me. If you have some
>> different idea, please write a code example so we can tell what you
>> mean.
>>
>>
>>
>> package quicktest;
>>
>> import javax.swing.JFrame;
>> import javax.swing.JTextArea;
>> import javax.swing.SwingUtilities;
>>
>> /**
>> *
>> * @author Brenden
>> */
>> public class Cjk {
>>
>> public static void main(String[] args) {
>> SwingUtilities.invokeLater(new Runnable() {
>>
>> public void run() {
>> createGui();
>> }
>> });
>> }
>>
>> private static void createGui() {
>> JFrame frame = new JFrame();
>>
>> int[] korChars = new int[10];
>> for (int i = 0; i < korChars.length; i++) {
>> korChars[i] = 0xAC00+i;
>> }
>> String kor = "test: "+new String( korChars, 0, korChars.length );
>> JTextArea ta = new JTextArea( kor );
>> frame.add(ta);
>>
>> ta.append("\nCount: "+ kor.codePointCount(0, kor.length() ));
>>
>> frame.pack();
>> frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
>> frame.setLocationRelativeTo(null);
>> frame.setVisible(true);
>> }
>> }
>
> Thank you for the tries, I don't understand why I should use
> codePointCount() method. The length() method gives same result. I want to
Not in general it doesn't.
Read the Javadocs for the two methods and you'll see why.
> know how many literal spaces in console. Since Korean letter takes two
> literal spaces in console mode. The result would be 26 instead of 16.
> Anyone knows what I am missing? Thanks
They've told you what you're missing. You're welcome.
--
Lew
| From | Steven Simpson <ss@domain.invalid> |
|---|---|
| Date | 2012-06-14 08:06 +0100 |
| Message-ID | <75foa9-b46.ln1@s.simpson148.btinternet.com> |
| In reply to | #1858 |
On 13/06/12 22:14, Lew wrote:
> Young wrote:
>> Thank you for the tries, I don't understand why I should use
>> codePointCount() method. The length() method gives same result. I
>> want to
> Not in general it doesn't.
>
> Read the Javadocs for the two methods and you'll see why.
I've just read it, and not seen any surprises - it doesn't seem to have
anything to do with the OP's problem, counting spaces occupied by a
character when displayed on a console. Whether a code point takes up two
chars inside a program is unrelated to whether it takes up two display
positions on a console. Am I missing something?
--
ss at comp dot lancs dot ac dot uk
| From | "mayeul.marguet" <mayeul.marguet@free.fr> |
|---|---|
| Date | 2012-06-14 10:26 +0200 |
| Message-ID | <4fd99db2$0$6118$426a74cc@news.free.fr> |
| In reply to | #1861 |
On 14/06/2012 09:06, Steven Simpson wrote:
> On 13/06/12 22:14, Lew wrote:
>> Young wrote:
>>> Thank you for the tries, I don't understand why I should use
>>> codePointCount() method. The length() method gives same result. I
>>> want to
>> Not in general it doesn't.
>>
>> Read the Javadocs for the two methods and you'll see why.
>
> I've just read it, and not seen any surprises - it doesn't seem to have
> anything to do with the OP's problem, counting spaces occupied by a
> character when displayed on a console. Whether a code point takes up two
> chars inside a program is unrelated to whether it takes up two display
> positions on a console. Am I missing something?
From the start, what the OP calls a 'width' is actually the number of
bytes used to represent the character.
Korean characters might be big and large, but not to the point that
they'd be twice as large as a monospace roman character. Even when using
strange fonts where that would happen, they wouldn't be /exactly/ twice
as large, and therefore trying to maintain alignment would be futile.
Some encodings for Korean characters use two bytes for Korean characters
and one byte for ASCII characters.
Yet, I do not see what codePointCount() has to do with the problem. To
the best of my knowledge no modern Korean characters are outside the
BMP, and even if that were the case I doubt very much that the console
the OP is writing to would use a UTF-16 encoding.
I would rely more on String.getBytes("charset-name") where charset-name
would be replaced by the actual charset used by the OP's console.
Possibly "euc-kr".
--
Mayeul
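The getBytes() idea can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, "EUC-KR" is a guess at the console charset (as in the post), and that charset is not guaranteed to be present in every Java runtime, so the example checks for it first. The byte count equals a column count only where the encoding happens to use two bytes exactly for the double-width characters.

```java
import java.nio.charset.Charset;

final class EncodedByteCount {

    /** Number of bytes the string occupies when encoded in the named charset. */
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "ab" followed by three common Hangul syllables (U+AC00, U+B098, U+B2E4).
        String text = "ab\uAC00\uB098\uB2E4";

        if (Charset.isSupported("EUC-KR")) {
            // One byte per ASCII letter, two bytes per Hangul syllable.
            System.out.println(byteLength(text, "EUC-KR"));
        } else {
            System.out.println("EUC-KR is not available in this runtime");
        }
    }
}
```

One caveat on this design: String.getBytes(Charset) silently substitutes a replacement byte sequence for characters the charset cannot represent, so text containing such characters would be miscounted.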
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-06-14 15:32 +0200 |
| Message-ID | <slrnjtjpvk.8bu.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #1862 |
On 2012-06-14 08:26, mayeul.marguet <mayeul.marguet@free.fr> wrote:
> On 14/06/2012 09:06, Steven Simpson wrote:
>> On 13/06/12 22:14, Lew wrote:
>>> Young wrote:
>>>> Thank you for the tries, I don't understand why I should use
>>>> codePointCount() method. The length() method gives same result. I
>>>> want to
>>> Not in general it doesn't.
>>>
>>> Read the Javadocs for the two methods and you'll see why.
>>
>> I've just read it, and not seen any surprises - it doesn't seem to have
>> anything to do with the OP's problem, counting spaces occupied by a
>> character when displayed on a console. Whether a code point takes up two
>> chars inside a program is unrelated to whether it takes up two display
>> positions on a console. Am I missing something?
Right. Counting Java chars is very wrong. Counting code points is less
wrong, but still wrong, since not every code point takes the same amount
of screen space: If we assume a text terminal, a code point may take up
0, 1 or 2 positions. You'll have to loop over the code points and add up
the width of each code point. (A method which does this probably already
exists, but it isn't codePointCount())
> From the start, what the OP calls a 'width' is actually the number of
> bytes used to represent the character.
> Korean characters might be big and large, but not to the point that
> they'd be twice as large as a monospace roman character. Even when using
> strange fonts where that would happen, they wouldn't be /exactly/ twice
> as large, and therefore trying to maintain alignment would be futile.
If the OP is trying to align them on a text terminal: No it wouldn't be
futile. Text terminals have a fixed character grid, and wide Asian
characters occupy 2 character cells. This is what the Unicode wide,
narrow, fullwidth and halfwidth properties are about (Somebody already
posted a link to the relevant specs).
Just start a text terminal (xterm, gnome-terminal, konsole, or whatever)
and look at some text with Asian characters.
> Some encodings for korean characters use two bytes for korean characters
> and one byte for ASCII characters.
Yes, but that's irrelevant for the OPs problem (although in some Asian
encodings the two-byte characters are exactly those which also occupy
two positions on the screen, so converting to such an encoding and
counting the number of bytes would yield the right answer).
        hp
--
   _  | Peter J. Holzer    | Deprecating human carelessness and
|_|_) | Sysadmin WSR       | ignorance has no successful track record.
| |   | hjp@hjp.at         |
__/   | http://www.hjp.at/ |  -- Bill Code on asrg@irtf.org
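The "loop over the code points and add up the width of each" approach might look roughly like the sketch below. Everything in it is an assumption for illustration: the names ConsoleWidth, cellWidth, isWide and displayWidth are made up, and the wide ranges are a hand-picked approximation rather than the complete Unicode East Asian Width data, so a real program should take the property from the Unicode files or a library instead.

```java
final class ConsoleWidth {

    /** Approximate number of terminal cells for one code point: 0, 1 or 2. */
    static int cellWidth(int codePoint) {
        int type = Character.getType(codePoint);
        // Combining marks and format characters add no column of their own.
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT) {
            return 0;
        }
        return isWide(codePoint) ? 2 : 1;
    }

    /** Hand-picked ranges treated as double width (an approximation). */
    static boolean isWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)                    // Hangul Jamo initial consonants
            || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)    // CJK radicals through Yi
            || (cp >= 0xAC00 && cp <= 0xD7A3)                    // Hangul Syllables
            || (cp >= 0xF900 && cp <= 0xFAFF)                    // CJK Compatibility Ideographs
            || (cp >= 0xFE30 && cp <= 0xFE6F)                    // CJK Compatibility Forms
            || (cp >= 0xFF00 && cp <= 0xFF60)                    // Fullwidth Forms
            || (cp >= 0xFFE0 && cp <= 0xFFE6)                    // Fullwidth signs
            || (cp >= 0x20000 && cp <= 0x3FFFD);                 // supplementary ideographs
    }

    /** Sums the cell widths of all code points in the string. */
    static int displayWidth(String s) {
        int width = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            width += cellWidth(cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return width;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("test: ");
        for (int i = 0; i < 10; i++) {
            sb.appendCodePoint(0xAC00 + i); // ten Hangul syllables
        }
        System.out.println(displayWidth(sb.toString()));
    }
}
```

For the test string used earlier in the thread ("test: " plus ten Hangul syllables) this yields 26, the figure the OP expected, where length() yields 16.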
| From | "mayeul.marguet" <mayeul.marguet@free.fr> |
|---|---|
| Date | 2012-06-14 16:47 +0200 |
| Message-ID | <4fd9f71a$0$1992$426a74cc@news.free.fr> |
| In reply to | #1863 |
On 14/06/2012 15:32, Peter J. Holzer wrote:
> If the OP is trying to align them on a text terminal: No it wouldn't be
> futile. Text terminals have a fixed character grid, and wide Asian
> characters occupy 2 character cells. This is what the Unicode wide,
> narrow, fullwidth and halfwidth properties are about (Somebody already
> posted a link to the relevant specs).
>
> Just start a text terminal (xterm, gnome-terminal, konsole, or whatever)
> and look at some text with Asian characters.
I'll have to trust you on that for now, but that makes sense. I'll
verify later with an up-to-date system.
Then that would mean that the OP meant exactly what he said and pretty
much everything I said and was understood here, was wrong.
I was still right, though, in implying that codePointCount() is
pointless. First because, as you point out, counting chars is wrong.
Second because, in the context of this problem, everything is in the BMP
and codePointCount() will make no difference with length().
I guess it's a simple matter of telling fullwidth & non-fullwidth
characters apart, then counting them, fullwidth counting for two.
I am not knowledgeable enough with korean language to find out how to
tell them apart, maybe the list of characters is a known unicode range?
--
Mayeul
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-06-16 20:39 +0200 |
| Message-ID | <slrnjtpkmu.ue3.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #1864 |
On 2012-06-14 14:47, mayeul.marguet <mayeul.marguet@free.fr> wrote:
> On 14/06/2012 15:32, Peter J. Holzer wrote:
>> If the OP is trying to align them on a text terminal: No it wouldn't be
>> futile. Text terminals have a fixed character grid, and wide Asian
>> characters occupy 2 character cells. This is what the Unicode wide,
>> narrow, fullwidth and halfwidth properties are about (Somebody already
>> posted a link to the relevant specs).
>>
>> Just start a text terminal (xterm, gnome-terminal, konsole, or whatever)
>> and look at some text with Asian characters.
>
> I'll have to trust you on that for now, but that makes sense. I'll
> verify later with an up-to-date system.
There is a screenshot on wikipedia:
http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
> Then that would mean that the OP meant exactly what he said and pretty
> much everything I said and was understood here, was wrong.
> I was still right, though, in implying that codePointCount() is
> pointless. First because, as you point out, counting chars is wrong.
> Second because, in the context of this problem, everything is in the BMP
> and codePointCount() will make no difference with length().
It would still be stupid to assume that all characters are from the BMP.
While it is very tempting to assume that (since all the "important"
characters are in the BMP) it is bound to fail sooner or later. Doing it
right is only marginally more complicated.
> I guess it's a simple matter of telling fullwidth & non-fullwidth
> characters apart, then counting them, fullwidth counting for two.
> I am not knowledgeable enough with korean language to find out how to
> tell them apart, maybe the list of characters is a known unicode range?
For Korean, probably yes. But then somebody enters a Chinese name ...
(A Korean would probably think of including the main CJK block. But
would they think of the four extension blocks (3 of which are not in the
BMP)?)
Don't invent a narrow specialized method when a generic method already
exists. In this case the property is already defined for every Unicode
character, you just have to use it. (The Java SE doesn't seem to provide
a way to get at this information but a few minutes of googling turned up
icu4j, which seems to provide it:
http://userguide.icu-project.org/strings/properties).
        hp
--
   _  | Peter J. Holzer    | Deprecating human carelessness and
|_|_) | Sysadmin WSR       | ignorance has no successful track record.
| |   | hjp@hjp.at         |
__/   | http://www.hjp.at/ |  -- Bill Code on asrg@irtf.org
| From | Joshua Cranmer <Pidgeot18@verizon.invalid> |
|---|---|
| Date | 2012-06-13 00:00 -0400 |
| Message-ID | <jr939r$dci$1@dont-email.me> |
| In reply to | #1846 |
On 6/12/2012 10:23 PM, Young wrote:
> I am trying to write a class that draws tables in console mode. Korean
> character takes two space width for one character. When I count length of
> a string having numbers and Korean Characters together, I cannot get
> correct width. In result, I cannot align vertical line right.
> So, I have wrote a new method, countWidth().
>
> private boolean isEnglish(Char s) {
> if (s < 256) {
> return true;
> } else {
> return false;
> }
>
> However, this would work properly when I have English and Korean in
> string variables. How would I know how many character spaces in console?
> Thank you.
Unicode characters cannot be neatly divided into "one space" and "two
space" characters. The string represented as "a`" (substitute the combining
diacritic instead of a backtick) is two characters long but generally
occupies as much space as an `a'. Since the rendering of a character
depends on its context, you can only accurately measure the size of text
by computing the size of the entire text. Even then, you still need to
know the font being used: in particular, letters like `l' or `m' have
extremely variable widths depending on the font.
However, since you said you're using console mode, I presume you intend
to mean that you're using a font that you know to be "fixed-width." In
that case, you could probably get away with dividing the range of
Unicode text into diacritics (which take up no extra space), halfwidth
(taking the space normally associated with the Latin alphabet) or
fullwidth (taking up two characters).
<http://docs.oracle.com/javase/6/docs/api/java/awt/im/InputSubset.html>
gives several subsets which are probably sufficient for most input that
you'd care about, but I leave the actual code to do the mapping up to you.
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
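The combining-diacritic case can be checked with a few lines of standard Java. This is a minimal sketch with hypothetical names (CombiningMarkDemo, isZeroWidthMark, countBaseCodePoints); it assumes a fixed-width console where a combining mark is drawn over the preceding character and takes no column of its own.

```java
import java.text.Normalizer;

final class CombiningMarkDemo {

    /** True for code points that attach to the preceding character. */
    static boolean isZeroWidthMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Counts code points, skipping combining marks. */
    static int countBaseCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isZeroWidthMark(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Letter a followed by U+0300 COMBINING GRAVE ACCENT.
        String decomposed = "a\u0300";
        // NFC composes the pair into the single code point U+00E0.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(decomposed.length());              // chars in the decomposed form
        System.out.println(countBaseCodePoints(decomposed));  // code points that take a column
        System.out.println(composed.length());                // chars after composition
    }
}
```

Normalizing to NFC first reduces the number of combining marks in typical text, but not every base-plus-mark sequence has a precomposed form, so the mark check is still needed.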
| From | markspace <-@.> |
|---|---|
| Date | 2012-06-13 00:24 -0700 |
| Message-ID | <jr9f6k$k7$1@dont-email.me> |
| In reply to | #1846 |
On 6/12/2012 7:23 PM, Young wrote:
> However, this would work properly when I have English and Korean in
> string variables. How would I know how many character spaces in console?
> Thank you.
Something in here might be useful as well:
<http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp>
| From | Steven Simpson <ss@domain.invalid> |
|---|---|
| Date | 2012-06-13 10:24 +0100 |
| Message-ID | <uq2ma9-n64.ln1@s.simpson148.btinternet.com> |
| In reply to | #1846 |
On 13/06/12 03:23, Young wrote:
> I am trying to write a class that draws tables in console mode. Korean
> character takes two space width for one character. When I count length of
> a string having numbers and Korean Characters together, I cannot get
> correct width. In result, I cannot align vertical line right.
Can you build something from files here?:
<http://www.unicode.org/Public/UNIDATA/>
...especially EastAsianWidth.txt and its associated documentation?:
<http://unicode.org/reports/tr11/>
--
ss at comp dot lancs dot ac dot uk
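Building a lookup from EastAsianWidth.txt could be sketched like this. The class name and its methods are hypothetical, and the sample lines in main are hand-written in the file's "range ; class # comment" layout rather than copied from the real file. Two simplifications are assumed: code points not listed are treated as neutral (N), and the ambiguous class (A) is counted as one column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EastAsianWidthTable {

    private static final class Range {
        final int first;
        final int last;
        final String widthClass;

        Range(int first, int last, String widthClass) {
            this.first = first;
            this.last = last;
            this.widthClass = widthClass;
        }
    }

    private final List<Range> ranges = new ArrayList<>();

    /** Parses lines such as "AC00..D7A3;W # comment" or "00A1;A". */
    static EastAsianWidthTable parse(Iterable<String> lines) {
        EastAsianWidthTable table = new EastAsianWidthTable();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue; // comment-only or blank line
            }
            String[] fields = data.split(";");
            if (fields.length < 2) {
                continue; // not a data line
            }
            String[] bounds = fields[0].trim().split("\\.\\.");
            int first = Integer.parseInt(bounds[0], 16);
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : first;
            table.ranges.add(new Range(first, last, fields[1].trim()));
        }
        return table;
    }

    /** Width class of a code point; unlisted code points default to "N" here. */
    String widthClass(int codePoint) {
        for (Range r : ranges) {
            if (codePoint >= r.first && codePoint <= r.last) {
                return r.widthClass;
            }
        }
        return "N";
    }

    /** Two columns for wide (W) and fullwidth (F), one otherwise. */
    int columns(int codePoint) {
        String w = widthClass(codePoint);
        return (w.equals("W") || w.equals("F")) ? 2 : 1;
    }

    public static void main(String[] args) {
        // Illustrative lines in the file's format, not the real data.
        List<String> sample = Arrays.asList(
                "# sample data",
                "0041..005A;Na # Latin capital letters",
                "AC00..D7A3 ; W # Hangul syllables",
                "FF21..FF3A;F # Fullwidth Latin capital letters");
        EastAsianWidthTable table = parse(sample);
        System.out.println(table.columns(0xAC00));
        System.out.println(table.columns('A'));
    }
}
```

A linear scan keeps the sketch short; with the full file loaded, a sorted array with binary search would be a more reasonable choice.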