Re: How to handle LARGE UTF-8 file

From	Schmidt <sss@online.de>
Newsgroups	comp.lang.basic.visual.misc
Subject	Re: How to handle LARGE UTF-8 file
Date	2012-03-09 07:32 +0100
Organization	Aioe.org NNTP Server
Message-ID	<jjc86g$e24$1@speranza.aioe.org>
References	<29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5> <jjb6ma$4nq$1@speranza.aioe.org> <jjb6uu$5h2$1@speranza.aioe.org> <17156310.66.1331257903071.JavaMail.geo-discussion-forums@vbkc1>

On 09.03.2012 02:51, stevegdula@yahoo.com wrote:

> I am hoping to merely strip out the Byte Order Mark (BOM) &HEFBBBF,
> inspect for end of record &H0D0A (one line = one record),
> and pass that to the aforementioned API call.

That's the right approach.

In a refined, speed-optimized version you could even read
entire groups of records in 32kB-chunks.

So what I'd do is search for a helper-class which can
read 7GB-files (one which uses Currency-types for the
file-position pointer, since a VB Long is only 32 bits wide -
there are some of them floating around on the Web).
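
To illustrate why Currency comes up here: it is a 64-bit integer scaled by 10000, so it can carry a 64-bit file offset into the Win32 API. The following is only a minimal sketch, assuming you already hold a file handle from CreateFile. SeekAbs is a made-up name and is not taken from any particular helper-class.

```vb
Private Declare Function SetFilePointerEx Lib "kernel32" ( _
    ByVal hFile As Long, ByVal liDistanceToMove As Currency, _
    ByRef lpNewFilePointer As Currency, ByVal dwMoveMethod As Long) As Long

Private Const FILE_BEGIN As Long = 0

' Moves the file pointer of hFile to the absolute byte offset BytePos.
' BytePos holds the offset as a whole number (e.g. 7516192768@).
' Dividing by 10000 compensates for the implicit Currency scaling,
' so that the raw 64 bits passed to the API equal the byte offset.
Public Function SeekAbs(ByVal hFile As Long, ByVal BytePos As Currency) As Boolean
    Dim NewPos As Currency
    SeekAbs = SetFilePointerEx(hFile, BytePos / 10000@, NewPos, FILE_BEGIN) <> 0
End Function
```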

Read 32kB or 64kB ByteArray-chunks from the file
(skipping the UTF8-BOM in the first chunk you read, of course).
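
The BOM check on the first chunk could look roughly like this. It is a sketch which assumes a zero-based Byte array, and PayloadStart is just an illustrative name.

```vb
' Returns the index of the first payload byte in the very first chunk:
' 3 if the chunk starts with the UTF-8 BOM (EF BB BF), otherwise 0.
Public Function PayloadStart(Buf() As Byte) As Long
    If UBound(Buf) >= 2 Then
        If Buf(0) = &HEF And Buf(1) = &HBB And Buf(2) = &HBF Then
            PayloadStart = 3
        End If
    End If
End Function
```

Only call this for the chunk read from file position 0; later chunks have no BOM.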

The first action on a yet *undecoded* ByteArray-chunk would
be to loop backwards until you find the vbLF-character,
to determine the end of the last "fully contained"
record within the current chunk.

Set your absolute 64Bit file-position variable according
to this last record's vbLF-position (the next chunk starts
one byte behind it), so that you know from where to read
the next file-chunk.
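
Scanning the raw bytes for the value 10 is safe in UTF-8, because all bytes of a multi-byte sequence are >= 128, so a byte of 10 is always a real line feed. A minimal sketch of the backward scan (LastLfIndex is a hypothetical name):

```vb
' Scans the undecoded chunk backwards for the last line feed (byte value 10).
' Returns its index, or -1 if the chunk contains no LF at all
' (i.e. a single record which is longer than the chunk).
Public Function LastLfIndex(Buf() As Byte) As Long
    Dim i As Long
    For i = UBound(Buf) To LBound(Buf) Step -1
        If Buf(i) = 10 Then
            LastLfIndex = i
            Exit Function
        End If
    Next i
    LastLfIndex = -1
End Function
```

Assuming ChunkPos is the Currency variable which holds the file offset of Buf(0), the next read position would then be ChunkPos + LastLfIndex(Buf) + 1. Both variable names are only assumptions for this example.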

Shorten the ByteArray of the current chunk via ReDim Preserve,
to exclude this last found vbCr+vbLF marker (and everything
behind it) from the end of the chunk.
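
The shortening could be sketched as follows, assuming a zero-based dynamic Byte array and an LfIndex which came out of the backward search for the last vbLF. The routine name is made up for illustration.

```vb
' Cuts Buf so that it ends just before the final record terminator.
' LfIndex is the index of the last LF (byte value 10) within Buf.
Public Sub TrimAtLastTerminator(Buf() As Byte, ByVal LfIndex As Long)
    Dim NewUBound As Long
    NewUBound = LfIndex - 1                     ' drop the LF and all that follows
    If NewUBound >= 0 Then
        ' nested If, since VB's And does not short-circuit
        If Buf(NewUBound) = 13 Then NewUBound = NewUBound - 1   ' drop the CR too
    End If
    ' If nothing is left (chunk held only an empty record), Buf stays untouched
    ' and the caller has to handle that case.
    If NewUBound >= 0 Then ReDim Preserve Buf(0 To NewUBound)
End Sub
```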

Then decode the entire (shortened) ByteArray from UTF8 to
a normal VB-WideString (BSTR) using the MultiByteToWideChar-API.
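
A possible wrapper around that API is sketched below. The Declare matches the documented kernel32 signature with the pointers passed as Long; Utf8ToString and its Start/Count parameters are my own naming, so adapt them as needed.

```vb
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

Private Const CP_UTF8 As Long = 65001

' Decodes Count bytes of UTF-8, starting at Buf(Start), into a VB-String.
' Returns an empty String if there is nothing to decode or the call fails.
Public Function Utf8ToString(Buf() As Byte, ByVal Start As Long, _
                             ByVal Count As Long) As String
    Dim nChars As Long, s As String
    If Count <= 0 Then Exit Function
    ' First call: ask how many 16Bit-characters the result will need.
    nChars = MultiByteToWideChar(CP_UTF8, 0, VarPtr(Buf(Start)), Count, 0, 0)
    If nChars = 0 Then Exit Function
    s = Space$(nChars)
    ' Second call: decode directly into the String-buffer.
    MultiByteToWideChar CP_UTF8, 0, VarPtr(Buf(Start)), Count, StrPtr(s), nChars
    Utf8ToString = s
End Function
```

For the first chunk you would pass Start = 3 to skip the BOM, for all later chunks Start = 0.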

Then do a normal VB-Split call on the decoded String,
using vbCrLf as the delimiter.

The resulting String-Array now contains properly decoded
16Bit wide Unicode-String-Records which you can loop over
from index 0 to UBound(StrArray) to do your record-processing.
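
In code, the split and the loop amount to something like this small sketch. ProcessRecords is a placeholder name, and the Debug.Print line only stands in for your real per-record work.

```vb
' Splits an already decoded chunk into records and visits each one.
Public Sub ProcessRecords(ByVal Decoded As String)
    Dim Records() As String, i As Long
    Records = Split(Decoded, vbCrLf)
    ' For an empty String, UBound is -1 and the loop body never runs.
    For i = 0 To UBound(Records)
        Debug.Print Len(Records(i))   ' replace with the actual record handling
    Next i
End Sub
```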

Keep in mind that the VB-Strings in this array now contain
real Unicode - and *not* ANSI - e.g. the Euro-sign (if it occurs)
would be present in these Strings as 16Bit AscW-value 8364 (&H20AC).

So I would take that fact into account in your conversion
routines - and not attempt any additional ANSI-conversion on
the already nicely converted 16Bit-Unicode-WStrings which this
record-StringArray now contains.

Olaf
