Groups > comp.lang.basic.visual.misc > #887 > unrolled thread

How to handle LARGE UTF-8 file

Started by	stevegdula@yahoo.com
First post	2012-03-08 08:05 -0800
Last post	2012-03-14 08:54 -0700
Articles	11 — 5 participants

Back to article view | Back to comp.lang.basic.visual.misc

  How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 08:05 -0800
    Re: How to handle LARGE UTF-8 file Deanna Earley <dee.earley@icode.co.uk> - 2012-03-08 16:55 +0000
      Re: How to handle LARGE UTF-8 file "Bob Butler" <bob_butler@cox.invalid> - 2012-03-08 10:13 -0800
        Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 10:49 -0800
    Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:00 -0500
      Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:05 -0500
        Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 17:51 -0800
          Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 23:32 -0500
          Re: How to handle LARGE UTF-8 file Schmidt <sss@online.de> - 2012-03-09 07:32 +0100
            Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-09 13:40 -0500
              Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-14 08:54 -0700

#887 — How to handle LARGE UTF-8 file

From	stevegdula@yahoo.com
Date	2012-03-08 08:05 -0800
Subject	How to handle LARGE UTF-8 file
Message-ID	<29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5>

Hi folks,

I recently had a large text file approaching 7GB in size dropped on me. The contents of which are supposed to be delimited text field data from a database. It's prohibitive size will not let me open it in a robust text editor so I've just sampled the first 32K out of it via opening it as a Binary file with 'Get & Put'. This at least allowed me to see what I was dealing with.

The entity who provided the data has shut down all responsibility for the data so I cannot optionally ask for the data in another format.

The little 32K subset of text turned out to be Encoded UTF-8 text with the EF BB BF header and is comprised of some 166 fields of delimited data. At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text.

I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF8 Encoded it seems to insist that it is loading an XML document and errors out during load. My hope was to export out only the fields we need in ANSI format.

Because the UTF8 format is not double-byte unicode all of the time (best I can tell from my research) I cannot simply step thru the data and consistently ignore the 'extra' byte.

I experimented with 'StrConv' with no success for getting ANSI text out of sampled pieces of text.

My goal is to step thru this text file and export out some more manageable 2GB ANSI segments or some such approach.

Can anyone offer any suggestions on how I can achieve my goal?

TIA !

~Steve

[toc] | [next] | [standalone]

#888

From	Deanna Earley <dee.earley@icode.co.uk>
Date	2012-03-08 16:55 +0000
Message-ID	<jjao9g$sis$1@speranza.aioe.org>
In reply to	#887

On 08/03/2012 16:05, stevegdula@yahoo.com wrote:
> Hi folks,
>
> I recently had a large text file approaching 7GB in size dropped on
> me.  The contents of which are supposed to be delimited text field
> data from a database.  It's prohibitive size will not let me open it
> in a robust text editor so I've just sampled the first 32K out of it
> via opening it as a Binary file with 'Get&  Put'.  This at least
> allowed me to see what I was dealing with.
>
> The little 32K subset of text turned out to be Encoded UTF-8 text
> with the EF BB BF header and is comprised of some 166 fields of
> delimited data.  At least some subset of this data will eventually
> need to be loaded into an older legal database which only supports
> ANSI text.

While the data may be UTF-8 format, will it actually contain any "non 
ascii" text?
UTF-8 and ASCII are identical for the first 128 code points.

You can check this be reading chunks (into a byte array) and scanning 
for values > 127.

-- 
Deanna Earley (dee.earley@icode.co.uk)
i-Catcher Development Team
http://www.icode.co.uk/icatcher/

iCode Systems

(Replies direct to my email address will be ignored.
Please reply to the group.)

[toc] | [prev] | [next] | [standalone]

#889

From	"Bob Butler" <bob_butler@cox.invalid>
Date	2012-03-08 10:13 -0800
Message-ID	<jjat0a$j6n$1@dont-email.me>
In reply to	#888

"Deanna Earley" <dee.earley@icode.co.uk> wrote in message 
news:jjao9g$sis$1@speranza.aioe.org...
> On 08/03/2012 16:05, stevegdula@yahoo.com wrote:
>> Hi folks,
>>
>> I recently had a large text file approaching 7GB in size dropped on
>> me.  The contents of which are supposed to be delimited text field
>> data from a database.  It's prohibitive size will not let me open it
>> in a robust text editor so I've just sampled the first 32K out of it
>> via opening it as a Binary file with 'Get&  Put'.  This at least
>> allowed me to see what I was dealing with.
>>
>> The little 32K subset of text turned out to be Encoded UTF-8 text
>> with the EF BB BF header and is comprised of some 166 fields of
>> delimited data.  At least some subset of this data will eventually
>> need to be loaded into an older legal database which only supports
>> ANSI text.
>
> While the data may be UTF-8 format, will it actually contain any "non 
> ascii" text?
> UTF-8 and ASCII are identical for the first 128 code points.
>
> You can check this be reading chunks (into a byte array) and scanning for 
> values > 127.

If it does have any special characters you should be able to leverage the 
WideCharToMultiByte API call to convert from UTF8 to Unicode and then figure 
out what to do with the special characters for inserting into the database.

[toc] | [prev] | [next] | [standalone]

#890

From	stevegdula@yahoo.com
Date	2012-03-08 10:49 -0800
Message-ID	<20959473.1240.1331232566638.JavaMail.geo-discussion-forums@vbze11>
In reply to	#889

>-- Deanna,
>While the data may be UTF-8 format, will it actually contain 
>any "non ascii" text?  UTF-8 and ASCII are identical for the
>first 128 code points.

Unfortunately this is part of the conumdrum.  This is a delimited text export where the source used ASCII(20) to delimit the fields and ASCII(254) to qualify the field content.

>-- Bob,
>If it does have any special characters you should be able to leverage
>the WideCharToMultiByte API call to convert from UTF8 to Unicode and
>then figure out what to do with the special characters for inserting
>into the database. 

Thanks all !  I will look into using the 'WideCharToMultiByte' api call to see if I can parse this data into ANSI text.

~Steve

[toc] | [prev] | [next] | [standalone]

#891

From	"Farnsworth" <nospam@nospam.com>
Date	2012-03-08 16:00 -0500
Message-ID	<jjb6ma$4nq$1@speranza.aioe.org>
In reply to	#887

Besides what others suggested, check this link to see how the characters are 
encoded:

http://en.wikipedia.org/wiki/Utf-8#Description

So ASCII 254(1111 1110) =

Byte 1: 110 00011 = &HC3
Byte 2: 10 111110 = &HBE

[toc] | [prev] | [next] | [standalone]

#892

From	"Farnsworth" <nospam@nospam.com>
Date	2012-03-08 16:05 -0500
Message-ID	<jjb6uu$5h2$1@speranza.aioe.org>
In reply to	#891

Farnsworth wrote:
> Besides what others suggested, check this link to see how the
> characters are encoded:
>
> http://en.wikipedia.org/wiki/Utf-8#Description
>
> So ASCII 254(1111 1110) =
>
> Byte 1: 110 00011 = &HC3
> Byte 2: 10 111110 = &HBE

I made a mistake in the byte order, so it should be the other way around:

Byte 1: 110 11110 = &HDE
Byte 2: 10 000111 = &H87

[toc] | [prev] | [next] | [standalone]

#893

From	stevegdula@yahoo.com
Date	2012-03-08 17:51 -0800
Message-ID	<17156310.66.1331257903071.JavaMail.geo-discussion-forums@vbkc1>
In reply to	#892

Farnsworth,

Your first reply, byte order actually seems to match my sample data.

ASCII(254)
UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE

I haven't currently digested the detailed UTF-8 Wiki explanation yet and I hopefully won't have to unless I end up needing to write my own UTF-8 record decoder.

I am hoping to merely strip out the Byte Order Mark(BOM) &HEFBBBF,inspect for end of record &H0D0A (one line = one record), and pass that to the afore mentioned API call.

Thanks,

~Steve

On Thursday, March 8, 2012 3:05:30 PM UTC-6, Farnsworth wrote:
> Farnsworth wrote:
> > Besides what others suggested, check this link to see how the
> > characters are encoded:
> >
> > http://en.wikipedia.org/wiki/Utf-8#Description
> >
> > So ASCII 254(1111 1110) =
> >
> > Byte 1: 110 00011 = &HC3
> > Byte 2: 10 111110 = &HBE
> 
> I made a mistake in the byte order, so it should be the other way around:
> 
> Byte 1: 110 11110 = &HDE
> Byte 2: 10 000111 = &H87

[toc] | [prev] | [next] | [standalone]

#894

From	"Farnsworth" <nospam@nospam.com>
Date	2012-03-08 23:32 -0500
Message-ID	<jjc14v$v0i$1@speranza.aioe.org>
In reply to	#893

stevegdula@yahoo.com wrote:
> Farnsworth,
>
> Your first reply, byte order actually seems to match my sample data.
>
> ASCII(254)
> UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE
>
> I haven't currently digested the detailed UTF-8 Wiki explanation yet
> and I hopefully won't have to unless I end up needing to write my own
> UTF-8 record decoder.
>
> I am hoping to merely strip out the Byte Order Mark(BOM)
> &HEFBBBF,inspect for end of record &H0D0A (one line = one record),
> and pass that to the afore mentioned API call.

If you look at the list at Wiki article, you notice each of the extra bytes 
is always >= 128, so you can read a large chunk, 1MB+, and you would know if 
you need to read few extra bytes or not if the last byte is >=128.

As for CR LF, InStrB can be used for byte arrays. Example:

Debug.Print InStrB(arr, vbCrLf)

Finally, check ParseCSV01 routine at this page to parse the lines:

http://www.xbeat.net/vbspeed/c_ParseCSV.php

[toc] | [prev] | [next] | [standalone]

#895

From	Schmidt <sss@online.de>
Date	2012-03-09 07:32 +0100
Message-ID	<jjc86g$e24$1@speranza.aioe.org>
In reply to	#893

Am 09.03.2012 02:51, schrieb stevegdula@yahoo.com:

> I am hoping to merely strip out the Byte Order Mark(BOM)&HEFBBBF,
 > inspect for end of record&H0D0A (one line = one record),
 > and pass that to the afore mentioned API call.

That's the right approach.

In a refined, speedoptimized version you could read
even entire record-groups in 32kB-chunks.

So what I'd do is, search for a helper-class which can
read 7GB-Files (which is using currency-Types for the
FilePositionPointer - there's some of them floating
around in the Web).

Read 32kB or 64kB ByteArray-chunks from the file
(skipping the UTF8-BOM on the first read chunk of curse).

First action on a yet *undecoded* ByteArray-Chunk would
be, to loop backwards until you find the vbLF-character,
to determine the ending of the last "fully contained"
record" within the current chunk.

Adapt your absolute 64Bit FilePointer-Position-Variable
to this last records vbLF-Position, so that you know
from where to read the next FileChunk.

Shorten the ByteArray of the current chunk, to exclude
this last found vbCr+vbLF marker at the end of the chunk
from the ByteArray per Redim Preserve.

Then decode the entire (shortened) ByteArray from UTF8 to
a normal VB-WideString (BStr) using multibytetowidechar-API.

Then do a normal VB-Split-Command on the decoded String
using vbCrLf.

The resulting String-Array now contains properly decoded
16Bit wide Unicode-String-Records which you can loop over
from Indexes 0 to Ubound(StrArray) to do your Record-processing.

Keep in mind, that the VB-strings in this array now contain
real Unicode - and *not* ANSI - e.g. the Euro-Sign (when contained)
would be present in these Strings as 16Bit AscW-Value 8364 (&H20AC).

So I would try to deal in your conversion routines with that
fact - and not attempt any additional ANSI-conversion from
these already nicely converted 16Bit-Unicode-WStrings, this
Records-StringArray now contains.

Olaf

[toc] | [prev] | [next] | [standalone]

#896

From	"Farnsworth" <nospam@nospam.com>
Date	2012-03-09 13:40 -0500
Message-ID	<jjdirb$5ga$1@speranza.aioe.org>
In reply to	#895

Schmidt wrote:
> So what I'd do is, search for a helper-class which can
> read 7GB-Files (which is using currency-Types for the
> FilePositionPointer - there's some of them floating
> around in the Web).

I forgot that VB won't go beyond 2 GB in files open For Binary mode. One 
such class I used in the past is CFile.cls in this VB6 source library:

http://sourceforge.net/projects/codebox/

It has most API functions for File I/O, and easy to use methods. It's 
limited to 32-Bit though, but you can easily add extra parameters for the 
higher 32-Bit.

[toc] | [prev] | [next] | [standalone]

#955

From	stevegdula@yahoo.com
Date	2012-03-14 08:54 -0700
Message-ID	<23509326.306.1331740445575.JavaMail.geo-discussion-forums@vbhb20>
In reply to	#896

Thanks everyone who offered guidance.  I was finally able to overcome all the obstacles necessary to deal with this VERY cumbersome blob of data.  Here is a synopsis of the factors:

The source provided a 7GB UTF-8 encoded text file. The file was structured as one record per line (136 fields), data delimited with Ascii(20) and qualified with Ascii(254).  I eventually needed to get a field/record subset of this data into an ANSI ONLY legal database.

The first step is obtaining a Helper Class which could deal the with the beast of a large file.  I was able to obtain this very nice "HugeBinaryFile.cls" from here: http://www.vbforums.com/showthread.php?t=531321
It uses the 64 bit VB 'Currency' datatype as the file pointer.

The following functions were also crucial to the 'byteArray to string' related actions (you could just as easily retained the Unicode if you needed to preserve any potential double byte character sets.)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Public Function ByteArrayToString(bytArray() As Byte) As String

    Dim i As Long
    Dim sAns As String
    
    sAns = StrConv(bytArray, vbUnicode)
    i = InStr(sAns, Chr(0))
    If i > 0 Then
        ByteArrayToString = Left(sAns, i - 1)
    Else
        ByteArrayToString = sAns
    End If
    
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
'Declare in your module
Public Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As String, _
ByVal cchMultiByte As Long, ByVal lpWideCharStr As String, _
ByVal cchWideChar As Long) As Long

Public Const CP_UTF8 = 65001
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Function UTF8ToANSI(ByRef instring As String) As String
Dim i As Long
Dim L As Long
Dim temp As String

L = Len(instring)
temp = String(L * 2, 0)
i = MultiByteToWideChar(CP_UTF8, 0, instring & Chr(0), -1, temp, L)
If i > 0 Then
  UTF8ToANSI = StrConv(Left(temp, (i - 1) * 2), vbFromUnicode)
  i = InStr(UTF8ToANSI, Chr(0))
  If i Then UTF8ToANSI = Left(UTF8ToANSI, i - 1)
Else
  UTF8ToANSI = instring
End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This main body of code reads buffer sampled "chunks" from the big source file. It ignores the first 3 bytes of UTF-8 BOM, It then searches for the end of record CRLF chars within the body of the sampled chunks; if the end of rec chars are not detected then the code doubles the size of the buffer and re-reads the bytes from the current record pointer position ~ in this way the buffer only gets as big as it needs to be.  Currently, there is no protection to prevent the buffer from growing overly large, but as this is a one time deal it can be added in for other projects.

Any detected records are then converted to ANSI text and exported to a 1GB text file segement. The code will continue exporting out incremented 1GB ansi text files until the source file has been completely analyzed.

Of special note, I was not able to get the 'InstrB' to work as suggested directly with arrays.  Further searching seemed to indicate that it would need to be converted to string first and then run (which defeats the purpose for me when I needed to maintain an accurate byte pointer of the source file where the CRLF's occur). Converting sampled byteArray to/from string changes the number of bytes.  I ended up simply inspecting the sampled byte array within a For Each loop to detect the CRLF's to maintain relative file pointer info.

I have commented out the references to form Label objects which I used for my own feedback.  I do have my own 'ParsingEngine module' to implement but that is outside the scope of this post.  The resulting data can now be easily imported and auto-parsed into table(s) by something like MS Access 2010. From there querying the data is straight forward.

Once again - thanks everyone for your useful direction!

~Steve
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Private Sub cmdUseBigFileClass_Click()

'== <Huge File Class> Related Variables =========
Dim hbfFile As HugeBinaryFile
Dim bytBuf() As Byte
Dim lngBufferSize As Long

lngBufferSize = 4096    'Starting Buffer Size 4KB
ReDim bytBuf(1 To lngBufferSize)
lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")

Dim strExportPath As String     'Exported ANSI File Location
Dim intFragNum As Integer       'Exported ANSI File Fragment #
Dim intExportFileNum As Integer
Dim strTempUniString As String
Dim strTempAnsiString As String
Dim curABSNextRecPointer As Currency '( Rec Ptr in Abs Byte Position)
Dim lngRelEORPos As Long 'Relative End Of Rec Position(current chunk)
Dim strCurRecLine As String
Dim strHeaderRec As String
Dim curRecCtr As Currency
Dim vntTemp As Variant, vntLast As Variant, lngRelCtr As Long

intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum

Set hbfFile = New HugeBinaryFile
hbfFile.OpenFile strSourcePath   'Defined Elsewhere via file browse object
'lblBytes.Caption = Format(hbfFile.FileLen, "##,###,###,###,##0")
DoEvents

curABSNextRecPointer = 3 'Initialize @ First Byte beyond UTF-8 BOM (0~2)

Do Until (curABSNextRecPointer >= hbfFile.FileLen - 2) Or (hbfFile.EOF)
     
     'Get Into Next Rec Pointer Position
     hbfFile.SeekAbsolute curABSNextRecPointer
     hbfFile.ReadBytes bytBuf 'Read the next Multi-K Byte sample
     
     'Manually Inspect the byte Array for next  [Chr(13)+Chr(10)] Rec Terminator
     'This does job of "InstrB" directly with Byte Array
     vntLast = 0
     lngRelEORPos = 0
     lngRelCtr = 0
     For Each vntTemp In bytBuf   ' Iterate through each element.
        lngRelCtr = lngRelCtr + 1 'Used For Relative File Pointer Location
         If vntTemp = 13 Then vntLast = vntTemp 'CR Found
         If vntTemp = 10 Then 'Line Feed Found
            If vntLast = 13 Then
                'We found the Rec Terminator!
                lngRelEORPos = lngRelCtr - 1 'Store Relative Position
                Exit For
            End If
         End If
     Next
     
     If lngRelEORPos > 0 Then
     
        'Next Record Resides at this ABS Byte position.
        curABSNextRecPointer = curABSNextRecPointer + lngRelEORPos + 1
        
        'Truncate the Byte Array to exclude the CRLF part
        'of the current record data
        ReDim Preserve bytBuf(1 To lngRelEORPos - 1)
        
        strTempUniString = ByteArrayToString(bytBuf)
        strTempAnsiString = UTF8ToANSI(strTempUniString)
        ReDim bytBuf(1 To lngBufferSize) 'Return to set buffer size
        strCurRecLine = strTempAnsiString
        curRecCtr = curRecCtr + 1
        If curRecCtr = 1 Then
          'Store Header Record for all exported file 1GB fragments
          strHeaderRec = strCurRecLine 
          Print #intExportFileNum, strHeaderRec
        Else
            'Export Current ANSI Record
            Print #intExportFileNum, strCurRecLine
        End If
       If LOF(intExportFileNum) >= (2 ^ 30) Then
          'The Currently Exported ANSI file is at 1GB, Close it
          'and start a new file fragment.
          Close intExportFileNum
          intFragNum = intFragNum + 1
          intExportFileNum = FreeFile()
          Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
          Print #intExportFileNum, strHeaderRec
       End If
    Else
        'Warning, End of Record NOT detected within current
        'byte array sample size!
        
        'Double the current buffer size & Re-Try record read
        lngBufferSize = lngBufferSize * 2
        ReDim bytBuf(1 To lngBufferSize) 'Destructive Allocation
        'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
    End If
   
    'If curRecCtr Mod 100 = 0 Then
      'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
      'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
      'DoEvents
    'End If
    
Loop

Close intExportFileNum
hbfFile.CloseFile
Set hbfFile = Nothing

'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")

End Sub

[toc] | [prev] | [standalone]

csiph-web

How to handle LARGE UTF-8 file

Contents

#887 — How to handle LARGE UTF-8 file

#888

#889

#890

#891

#892

#893

#894

#895

#896

#955