| From | Schmidt <sss@online.de> |
|---|---|
| Newsgroups | comp.lang.basic.visual.misc |
| Subject | Re: How to handle LARGE UTF-8 file |
| Date | 2012-03-09 07:32 +0100 |
| Organization | Aioe.org NNTP Server |
| Message-ID | <jjc86g$e24$1@speranza.aioe.org> |
| References | <29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5> <jjb6ma$4nq$1@speranza.aioe.org> <jjb6uu$5h2$1@speranza.aioe.org> <17156310.66.1331257903071.JavaMail.geo-discussion-forums@vbkc1> |
On 09.03.2012 02:51, stevegdula@yahoo.com wrote:

> I am hoping to merely strip out the Byte Order Mark (BOM) &HEFBBBF,
> inspect for end of record &H0D0A (one line = one record),
> and pass that to the aforementioned API call.

That's the right approach. In a refined, speed-optimized version you could even read entire groups of records in 32kB chunks.

So what I'd do is search for a helper class which can read 7GB files (one which uses the Currency type for the file-position pointer - there are some of them floating around on the Web).

Read 32kB or 64kB ByteArray chunks from the file (skipping the UTF-8 BOM on the first chunk read, of course).

The first action on a still *undecoded* ByteArray chunk would be to loop backwards until you find the vbLf character, to determine the end of the last "fully contained" record within the current chunk.

Adjust your absolute 64-bit file-position variable to this last record's vbLf position, so that you know where to read the next file chunk from (it starts at the byte right after that vbLf).

Shorten the ByteArray of the current chunk with ReDim Preserve, so that it no longer contains this last-found vbCr+vbLf marker or anything after it.

Then decode the entire (shortened) ByteArray from UTF-8 to a normal VB wide string (BSTR) using the MultiByteToWideChar API.

Then do a normal VB Split on the decoded String using vbCrLf.

The resulting String array now contains properly decoded 16-bit Unicode string records, which you can loop over from index 0 to UBound(StrArray) to do your record processing.

Keep in mind that the VB strings in this array now contain real Unicode - and *not* ANSI - e.g. the Euro sign (when contained) would be present in these Strings as the 16-bit AscW value 8364 (&H20AC).

So I would try to deal with that fact in your conversion routines - and not attempt any additional ANSI conversion on the already nicely converted 16-bit Unicode strings this record array now contains.

Olaf
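
The chunking steps described above can be sketched as follows. This is only an illustration of the logic in Python, not VB6 code and not the helper class mentioned in the post; the function name `iter_records` and its parameters are made up for the example. It assumes that every record fits within one chunk and that a read shorter than the chunk size means the end of the file has been reached.

```python
import io

UTF8_BOM = b"\xef\xbb\xbf"  # the &HEFBBBF byte order mark


def iter_records(stream, chunk_size=64 * 1024):
    """Yield decoded records from a seekable binary stream that holds
    UTF-8 text with CRLF-terminated lines (hypothetical helper)."""
    pos = 0        # absolute file position; Python ints are not limited to 32 bits
    first = True
    while True:
        stream.seek(pos)
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Skip the BOM, which can only occur in the very first chunk.
        start = len(UTF8_BOM) if first and chunk.startswith(UTF8_BOM) else 0
        first = False
        # Search backwards for the last LF in the still undecoded bytes.
        last_lf = chunk.rfind(b"\n")
        if last_lf == -1:
            if len(chunk) == chunk_size:
                raise ValueError("record longer than chunk_size")
            # Short chunk without LF: a final record with no terminator.
            if len(chunk) > start:
                yield chunk[start:].decode("utf-8")
            break
        # Cut the chunk before the last LF and drop the CR in front of it.
        body = chunk[start:last_lf]
        if body.endswith(b"\r"):
            body = body[:-1]
        # Decode the whole shortened chunk once, then split into records.
        for record in body.decode("utf-8").split("\r\n"):
            yield record
        # The next chunk starts at the byte right after the last LF.
        pos += last_lf + 1


if __name__ == "__main__":
    sample = UTF8_BOM + "a\u20ac\r\nbb\r\nccc\r\n".encode("utf-8")
    for rec in iter_records(io.BytesIO(sample), chunk_size=10):
        print(repr(rec))
```

Cutting at the LF byte before decoding is safe because the byte value &H0A never occurs inside a multi-byte UTF-8 sequence, so a chunk that ends at a record boundary cannot split a character.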