From: stevegdula@yahoo.com
Newsgroups: comp.lang.basic.visual.misc
Subject: Re: How to handle LARGE UTF-8 file
Date: Thu, 8 Mar 2012 17:51:43 -0800 (PST)

Farnsworth,

The byte order in your first reply actually seems to match my sample data.

ASCII(254) UTF-8 Two Byte Representation:
1100 0011 1011 1110  &HC3BE

I haven't yet digested the detailed UTF-8 Wiki explanation, and hopefully I won't have to unless I end up needing to write my own UTF-8 record decoder. I am hoping to merely strip out the Byte Order Mark (BOM) &HEFBBBF, inspect for end of record &H0D0A (one line = one record), and pass the result to the aforementioned API call.
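A minimal sketch of that plan, written in Python rather than VB purely for illustration: drop a leading BOM (EF BB BF), split the byte stream on CR LF, and decode one record at a time so a large file never has to sit in memory whole. The function name `iter_utf8_records` and the chunk size are hypothetical, and the decode step merely stands in for whatever API call is actually used.

```python
import io

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark (&HEFBBBF)
EOL = b"\r\n"          # end-of-record marker (&H0D0A)

def iter_utf8_records(stream, chunk_size=64 * 1024):
    """Yield decoded records from a binary stream, one per CR LF-terminated line.

    Hypothetical helper: assumes one line = one record, as in the post.
    """
    # Look at the first three bytes; keep them only if they are not a BOM.
    head = stream.read(len(BOM))
    buffer = b"" if head == BOM else head

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last CR LF is a complete record; the tail
        # (possibly a partial record or half a CR LF) waits for more data.
        *records, buffer = buffer.split(EOL)
        for record in records:
            yield record.decode("utf-8")

    # A final record with no trailing CR LF.
    if buffer:
        yield buffer.decode("utf-8")

# Usage with an in-memory stream; a real file would be opened with open(path, "rb").
sample = BOM + "\u00fe one\r\nsecond\r\n".encode("utf-8")
for line in iter_utf8_records(io.BytesIO(sample), chunk_size=4):
    print(line)
```

Decoding only complete records sidesteps the case where a multi-byte character straddles two chunks, since the split happens on CR LF bytes, which never occur inside a multi-byte UTF-8 sequence.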
Thanks,
~Steve

On Thursday, March 8, 2012 3:05:30 PM UTC-6, Farnsworth wrote:
> Farnsworth wrote:
> > Besides what others suggested, check this link to see how the
> > characters are encoded:
> >
> > http://en.wikipedia.org/wiki/Utf-8#Description
> >
> > So ASCII 254(1111 1110) =
> >
> > Byte 1: 110 00011 = &HC3
> > Byte 2: 10 111110 = &HBE
>
> I made a mistake in the byte order, so it should be the other way around:
>
> Byte 1: 110 11110 = &HDE
> Byte 2: 10 000111 = &H87
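The bit layout from the first quoted reply can be checked mechanically. The sketch below, in Python for illustration only, packs a code point into the two-byte pattern 110xxxxx 10xxxxxx and compares the result with the built-in encoder; the helper name `encode_two_byte` is hypothetical.

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers U+0080..U+07FF only")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])

encoded = encode_two_byte(254)
print(encoded.hex().upper())               # C3BE
print(encoded == chr(254).encode("utf-8"))  # True
```

For 254 (1111 1110) the upper bits 00011 go into the lead byte and the lower bits 111110 into the trail byte, giving &HC3 &HBE, which agrees with the sample data mentioned at the top of this post.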