From: stevegdula@yahoo.com
Newsgroups: comp.lang.basic.visual.misc
Subject: Re: How to handle LARGE UTF-8 file
Date: Wed, 14 Mar 2012 08:54:05 -0700 (PDT)

Thanks to everyone who offered guidance. I was finally able to overcome
all the obstacles necessary to deal with this VERY cumbersome blob of
data. Here is a synopsis of the factors:

The source provided a 7GB UTF-8 encoded text file. The file was
structured as one record per line (136 fields), with data delimited by
Ascii(20) and qualified with Ascii(254). I eventually needed to get a
field/record subset of this data into an ANSI ONLY legal database.

The first step is obtaining a helper class which can deal with the
beast of a large file. I was able to obtain this very nice
"HugeBinaryFile.cls" from here:

http://www.vbforums.com/showthread.php?t=531321

It uses the 64-bit VB 'Currency' datatype as the file pointer.
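
For readers who have not seen the trick: a Currency is stored as a 64-bit integer scaled by 10,000, which is why it can stand in for a 64-bit file offset in VB6, where a Long tops out at 2,147,483,647 (just under 2GB). The sketch below only illustrates that idea; it is not the code inside HugeBinaryFile.cls, and the names OffsetToLongs, lngLow and lngHigh are hypothetical.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

'Split a byte offset held in a Currency into the low/high 32-bit halves
'that 64-bit Win32 file APIs expect. (Hypothetical helper.)
Public Sub OffsetToLongs(ByVal curOffset As Currency, _
                         ByRef lngLow As Long, ByRef lngHigh As Long)
    Dim curScaled As Currency
    Dim lngParts(0 To 1) As Long

    'Currency stores value * 10000 as a 64-bit integer, so dividing by
    '10000 makes the stored bit pattern equal the plain byte offset.
    curScaled = curOffset / 10000@
    CopyMemory lngParts(0), curScaled, 8
    lngLow = lngParts(0)
    lngHigh = lngParts(1)
End Sub

Public Sub DemoOffset()
    Dim lngLow As Long, lngHigh As Long
    OffsetToLongs 7516192768@, lngLow, lngHigh   '7 * 2^30 bytes
    Debug.Print Hex$(lngHigh), Hex$(lngLow)      'high = 1, low = C0000000
End Sub
```

The division by 10000 is the only non-obvious step: without it the 8 bytes copied out would be the offset multiplied by 10,000.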

The following functions were also crucial to the 'byteArray to string'
related actions (you could just as easily retain the Unicode if you
needed to preserve any potential double-byte character sets).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Public Function ByteArrayToString(bytArray() As Byte) As String
    Dim i As Long
    Dim sAns As String

    'Widen each byte to one character, then cut at the first null
    sAns = StrConv(bytArray, vbUnicode)
    i = InStr(sAns, Chr(0))
    If i > 0 Then
        ByteArrayToString = Left(sAns, i - 1)
    Else
        ByteArrayToString = sAns
    End If

End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
'Declare in your module
Public Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long, _
    ByVal lpWideCharStr As String, ByVal cchWideChar As Long) As Long
Public Const CP_UTF8 = 65001
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Function UTF8ToANSI(ByRef instring As String) As String
    Dim i As Long
    Dim L As Long
    Dim temp As String
    L = Len(instring)
    temp = String(L * 2, 0)
    i = MultiByteToWideChar(CP_UTF8, 0, instring & Chr(0), -1, temp, L)
    If i > 0 Then
        UTF8ToANSI = StrConv(Left(temp, (i - 1) * 2), vbFromUnicode)
        i = InStr(UTF8ToANSI, Chr(0))
        If i Then UTF8ToANSI = Left(UTF8ToANSI, i - 1)
    Else
        'Conversion failed: return the input unchanged
        UTF8ToANSI = instring
    End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This main body of code reads buffer-sampled "chunks" from the big source
file. It skips the first 3 bytes (the UTF-8 BOM), then searches for the
end-of-record CRLF chars within each sampled chunk. If the end-of-record
chars are not detected, the code doubles the size of the buffer and
re-reads the bytes from the current record pointer position; in this
way the buffer only gets as big as it needs to be.
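
The scan step can be pulled out into a small function, which makes the "found / not found, so double the buffer" decision easier to follow. This is a sketch using an indexed loop rather than the For Each loop in the routine at the end of this post; FindCRLF is a hypothetical name, and it returns the position of the LF, whereas the posted routine stores the position of the CR.

```vb
Option Explicit

'Return the 1-based position of the LF in the first adjacent CR+LF pair
'found in bytBuf, or 0 if the buffer holds no complete terminator.
'(Hypothetical helper; works for any lower bound.)
Public Function FindCRLF(bytBuf() As Byte) As Long
    Dim i As Long
    For i = LBound(bytBuf) + 1 To UBound(bytBuf)
        If bytBuf(i) = 10 Then
            If bytBuf(i - 1) = 13 Then
                FindCRLF = i - LBound(bytBuf) + 1
                Exit Function
            End If
        End If
    Next i
End Function

Public Sub DemoFindCRLF()
    Dim bytBuf() As Byte
    bytBuf = StrConv("AB" & vbCrLf & "CD", vbFromUnicode)   '0-based array
    Debug.Print FindCRLF(bytBuf)   'LF is the 4th byte
End Sub
```

A return value of 0 is the caller's cue to double the buffer and re-read from the same file position.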
Currently, there is no protection to prevent the buffer from growing
overly large, but as this is a one-time deal it can be added in for
other projects.

Any detected records are then converted to ANSI text and exported to a
1GB text file segment. The code will continue exporting incremented 1GB
ANSI text files until the source file has been completely analyzed.

Of special note, I was not able to get 'InstrB' to work directly with
arrays as suggested. Further searching seemed to indicate that the array
would need to be converted to a string first and then searched (which
defeats the purpose for me, since I needed to maintain an accurate byte
pointer into the source file where the CRLFs occur). Converting the
sampled byte array to/from string changes the number of bytes. I ended
up simply inspecting the sampled byte array within a For Each loop to
detect the CRLFs and maintain relative file pointer info.

I have commented out the references to form Label objects which I used
for my own feedback. I do have my own 'ParsingEngine module' to
implement, but that is outside the scope of this post. The resulting
data can now be easily imported and auto-parsed into table(s) by
something like MS Access 2010. From there, querying the data is
straightforward.

Once again - thanks everyone for your useful direction!

~Steve

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Private Sub cmdUseBigFileClass_Click()

'== Related Variables =========
Dim hbfFile As HugeBinaryFile
Dim bytBuf() As Byte
Dim lngBufferSize As Long
lngBufferSize = 4096 'Starting Buffer Size 4KB
ReDim bytBuf(1 To lngBufferSize)
'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")

Dim strExportPath As String 'Exported ANSI File Location
Dim intFragNum As Integer 'Exported ANSI File Fragment #
Dim intExportFileNum As Integer
Dim strTempUniString As String
Dim strTempAnsiString As String
Dim curABSNextRecPointer As Currency '( Rec Ptr in Abs Byte Position)
Dim lngRelEORPos As Long 'Relative End Of Rec Position(current chunk)
Dim strCurRecLine As String
Dim strHeaderRec As String
Dim curRecCtr As Currency
Dim vntTemp As Variant, vntLast As Variant, lngRelCtr As Long

'NOTE: strExportPath is never assigned in this listing; set it before the Open
intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum

Set hbfFile = New HugeBinaryFile
hbfFile.OpenFile strSourcePath 'Defined Elsewhere via file browse object
'lblBytes.Caption = Format(hbfFile.FileLen, "##,###,###,###,##0")
DoEvents

curABSNextRecPointer = 3 'Initialize @ First Byte beyond UTF-8 BOM (0~2)

Do Until (curABSNextRecPointer >= hbfFile.FileLen - 2) Or (hbfFile.EOF)

    'Get Into Next Rec Pointer Position
    hbfFile.SeekAbsolute curABSNextRecPointer
    hbfFile.ReadBytes bytBuf 'Read the next Multi-K Byte sample

    'Manually Inspect the byte Array for next [Chr(13)+Chr(10)] Rec Terminator
    'This does job of "InstrB" directly with Byte Array
    vntLast = 0
    lngRelEORPos = 0
    lngRelCtr = 0
    For Each vntTemp In bytBuf ' Iterate through each element.
        lngRelCtr = lngRelCtr + 1 'Used For Relative File Pointer Location
        If vntTemp = 13 Then vntLast = vntTemp 'CR Found
        If vntTemp = 10 Then 'Line Feed Found
            If vntLast = 13 Then 'We found the Rec Terminator!
                lngRelEORPos = lngRelCtr - 1 'Store Relative Position (of the CR)
                Exit For
            End If
        End If
        vntLast = vntTemp 'Remember previous byte so only an adjacent CR+LF matches
    Next

    If lngRelEORPos > 0 Then

        'Next Record Resides at this ABS Byte position.
        curABSNextRecPointer = curABSNextRecPointer + lngRelEORPos + 1

        'Truncate the Byte Array to exclude the CRLF part
        'of the current record data
        'NOTE: an empty record (CRLF at the very start) would make this ReDim fail
        ReDim Preserve bytBuf(1 To lngRelEORPos - 1)

        strTempUniString = ByteArrayToString(bytBuf)
        strTempAnsiString = UTF8ToANSI(strTempUniString)
        ReDim bytBuf(1 To lngBufferSize) 'Return to set buffer size
        strCurRecLine = strTempAnsiString
        curRecCtr = curRecCtr + 1
        If curRecCtr = 1 Then
            'Store Header Record for all exported file 1GB fragments
            strHeaderRec = strCurRecLine
            Print #intExportFileNum, strHeaderRec
        Else
            'Export Current ANSI Record
            Print #intExportFileNum, strCurRecLine
        End If
        If LOF(intExportFileNum) >= (2 ^ 30) Then
            'The Currently Exported ANSI file is at 1GB, Close it
            'and start a new file fragment.
            Close intExportFileNum
            intFragNum = intFragNum + 1
            intExportFileNum = FreeFile()
            Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
            Print #intExportFileNum, strHeaderRec
        End If
    Else
        'Warning, End of Record NOT detected within current
        'byte array sample size!

        'Double the current buffer size & Re-Try record read
        lngBufferSize = lngBufferSize * 2
        ReDim bytBuf(1 To lngBufferSize) 'Destructive Allocation
        'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
    End If

    'If curRecCtr Mod 100 = 0 Then
    'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
    'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
    'DoEvents
    'End If

Loop

Close intExportFileNum
hbfFile.CloseFile
Set hbfFile = Nothing
'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
End Sub
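
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

A possible alternative to the ByteArrayToString/UTF8ToANSI pair above is to hand the byte array straight to MultiByteToWideChar, so the UTF-8 bytes never pass through a String on the way in. This is a sketch, not what the posted routine does: it assumes the API is re-declared with pointer (Long) parameters instead of As String, and Utf8BytesToString is a hypothetical name.

```vb
Option Explicit

'Same API as above, but declared with raw pointers (an assumption of
'this sketch) so VB performs no implicit ANSI conversion on the buffers.
Private Declare Function MultiByteToWideChar Lib "kernel32" _
    (ByVal CodePage As Long, ByVal dwFlags As Long, _
     ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
     ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

Private Const CP_UTF8 As Long = 65001

'Decode the first lngCount UTF-8 bytes of bytBuf into a VB String.
'Returns an empty string if lngCount <= 0 or the conversion fails.
Public Function Utf8BytesToString(bytBuf() As Byte, _
                                  ByVal lngCount As Long) As String
    Dim lngChars As Long
    Dim lngSrc As Long
    Dim strOut As String

    If lngCount <= 0 Then Exit Function
    lngSrc = VarPtr(bytBuf(LBound(bytBuf)))

    'First call: ask how many UTF-16 characters are needed.
    lngChars = MultiByteToWideChar(CP_UTF8, 0, lngSrc, lngCount, 0, 0)
    If lngChars = 0 Then Exit Function

    'Second call: decode straight into the string's own buffer.
    strOut = String$(lngChars, vbNullChar)
    MultiByteToWideChar CP_UTF8, 0, lngSrc, lngCount, StrPtr(strOut), lngChars
    Utf8BytesToString = strOut
End Function

Public Sub DemoUtf8()
    Dim bytBuf(1 To 3) As Byte
    bytBuf(1) = &H41: bytBuf(2) = &HC3: bytBuf(3) = &HA9   '"A" + e-acute
    Debug.Print Len(Utf8BytesToString(bytBuf, 3))          '2 characters
End Sub
```

Because the byte count is passed explicitly, the array would not need to be truncated with ReDim Preserve first. Print # then writes the resulting string in the system ANSI code page, so characters outside that code page will not survive the export.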