Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.basic.visual.misc > #887 > unrolled thread
| Started by | stevegdula@yahoo.com |
|---|---|
| First post | 2012-03-08 08:05 -0800 |
| Last post | 2012-03-14 08:54 -0700 |
| Articles | 11 — 5 participants |
Back to article view | Back to comp.lang.basic.visual.misc
How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 08:05 -0800
Re: How to handle LARGE UTF-8 file Deanna Earley <dee.earley@icode.co.uk> - 2012-03-08 16:55 +0000
Re: How to handle LARGE UTF-8 file "Bob Butler" <bob_butler@cox.invalid> - 2012-03-08 10:13 -0800
Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 10:49 -0800
Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:00 -0500
Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:05 -0500
Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 17:51 -0800
Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 23:32 -0500
Re: How to handle LARGE UTF-8 file Schmidt <sss@online.de> - 2012-03-09 07:32 +0100
Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-09 13:40 -0500
Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-14 08:54 -0700
| From | stevegdula@yahoo.com |
|---|---|
| Date | 2012-03-08 08:05 -0800 |
| Subject | How to handle LARGE UTF-8 file |
| Message-ID | <29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5> |
Hi folks, I recently had a large text file approaching 7GB in size dropped on me. The contents of which are supposed to be delimited text field data from a database. It's prohibitive size will not let me open it in a robust text editor so I've just sampled the first 32K out of it via opening it as a Binary file with 'Get & Put'. This at least allowed me to see what I was dealing with. The entity who provided the data has shut down all responsibility for the data so I cannot optionally ask for the data in another format. The little 32K subset of text turned out to be Encoded UTF-8 text with the EF BB BF header and is comprised of some 166 fields of delimited data. At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text. I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF8 Encoded it seems to insist that it is loading an XML document and errors out during load. My hope was to export out only the fields we need in ANSI format. Because the UTF8 format is not double-byte unicode all of the time (best I can tell from my research) I cannot simply step thru the data and consistently ignore the 'extra' byte. I experimented with 'StrConv' with no success for getting ANSI text out of sampled pieces of text. My goal is to step thru this text file and export out some more manageable 2GB ANSI segments or some such approach. Can anyone offer any suggestions on how I can achieve my goal? TIA ! ~Steve
[toc] | [next] | [standalone]
| From | Deanna Earley <dee.earley@icode.co.uk> |
|---|---|
| Date | 2012-03-08 16:55 +0000 |
| Message-ID | <jjao9g$sis$1@speranza.aioe.org> |
| In reply to | #887 |
On 08/03/2012 16:05, stevegdula@yahoo.com wrote: > Hi folks, > > I recently had a large text file approaching 7GB in size dropped on > me. The contents of which are supposed to be delimited text field > data from a database. It's prohibitive size will not let me open it > in a robust text editor so I've just sampled the first 32K out of it > via opening it as a Binary file with 'Get& Put'. This at least > allowed me to see what I was dealing with. > > The little 32K subset of text turned out to be Encoded UTF-8 text > with the EF BB BF header and is comprised of some 166 fields of > delimited data. At least some subset of this data will eventually > need to be loaded into an older legal database which only supports > ANSI text. While the data may be UTF-8 format, will it actually contain any "non ascii" text? UTF-8 and ASCII are identical for the first 128 code points. You can check this be reading chunks (into a byte array) and scanning for values > 127. -- Deanna Earley (dee.earley@icode.co.uk) i-Catcher Development Team http://www.icode.co.uk/icatcher/ iCode Systems (Replies direct to my email address will be ignored. Please reply to the group.)
[toc] | [prev] | [next] | [standalone]
| From | "Bob Butler" <bob_butler@cox.invalid> |
|---|---|
| Date | 2012-03-08 10:13 -0800 |
| Message-ID | <jjat0a$j6n$1@dont-email.me> |
| In reply to | #888 |
"Deanna Earley" <dee.earley@icode.co.uk> wrote in message news:jjao9g$sis$1@speranza.aioe.org... > On 08/03/2012 16:05, stevegdula@yahoo.com wrote: >> Hi folks, >> >> I recently had a large text file approaching 7GB in size dropped on >> me. The contents of which are supposed to be delimited text field >> data from a database. It's prohibitive size will not let me open it >> in a robust text editor so I've just sampled the first 32K out of it >> via opening it as a Binary file with 'Get& Put'. This at least >> allowed me to see what I was dealing with. >> >> The little 32K subset of text turned out to be Encoded UTF-8 text >> with the EF BB BF header and is comprised of some 166 fields of >> delimited data. At least some subset of this data will eventually >> need to be loaded into an older legal database which only supports >> ANSI text. > > While the data may be UTF-8 format, will it actually contain any "non > ascii" text? > UTF-8 and ASCII are identical for the first 128 code points. > > You can check this be reading chunks (into a byte array) and scanning for > values > 127. If it does have any special characters you should be able to leverage the WideCharToMultiByte API call to convert from UTF8 to Unicode and then figure out what to do with the special characters for inserting into the database.
[toc] | [prev] | [next] | [standalone]
| From | stevegdula@yahoo.com |
|---|---|
| Date | 2012-03-08 10:49 -0800 |
| Message-ID | <20959473.1240.1331232566638.JavaMail.geo-discussion-forums@vbze11> |
| In reply to | #889 |
>-- Deanna, >While the data may be UTF-8 format, will it actually contain >any "non ascii" text? UTF-8 and ASCII are identical for the >first 128 code points. Unfortunately this is part of the conumdrum. This is a delimited text export where the source used ASCII(20) to delimit the fields and ASCII(254) to qualify the field content. >-- Bob, >If it does have any special characters you should be able to leverage >the WideCharToMultiByte API call to convert from UTF8 to Unicode and >then figure out what to do with the special characters for inserting >into the database. Thanks all ! I will look into using the 'WideCharToMultiByte' api call to see if I can parse this data into ANSI text. ~Steve
[toc] | [prev] | [next] | [standalone]
| From | "Farnsworth" <nospam@nospam.com> |
|---|---|
| Date | 2012-03-08 16:00 -0500 |
| Message-ID | <jjb6ma$4nq$1@speranza.aioe.org> |
| In reply to | #887 |
Besides what others suggested, check this link to see how the characters are encoded: http://en.wikipedia.org/wiki/Utf-8#Description So ASCII 254(1111 1110) = Byte 1: 110 00011 = &HC3 Byte 2: 10 111110 = &HBE
[toc] | [prev] | [next] | [standalone]
| From | "Farnsworth" <nospam@nospam.com> |
|---|---|
| Date | 2012-03-08 16:05 -0500 |
| Message-ID | <jjb6uu$5h2$1@speranza.aioe.org> |
| In reply to | #891 |
Farnsworth wrote: > Besides what others suggested, check this link to see how the > characters are encoded: > > http://en.wikipedia.org/wiki/Utf-8#Description > > So ASCII 254(1111 1110) = > > Byte 1: 110 00011 = &HC3 > Byte 2: 10 111110 = &HBE I made a mistake in the byte order, so it should be the other way around: Byte 1: 110 11110 = &HDE Byte 2: 10 000111 = &H87
[toc] | [prev] | [next] | [standalone]
| From | stevegdula@yahoo.com |
|---|---|
| Date | 2012-03-08 17:51 -0800 |
| Message-ID | <17156310.66.1331257903071.JavaMail.geo-discussion-forums@vbkc1> |
| In reply to | #892 |
Farnsworth, Your first reply, byte order actually seems to match my sample data. ASCII(254) UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE I haven't currently digested the detailed UTF-8 Wiki explanation yet and I hopefully won't have to unless I end up needing to write my own UTF-8 record decoder. I am hoping to merely strip out the Byte Order Mark(BOM) &HEFBBBF,inspect for end of record &H0D0A (one line = one record), and pass that to the afore mentioned API call. Thanks, ~Steve On Thursday, March 8, 2012 3:05:30 PM UTC-6, Farnsworth wrote: > Farnsworth wrote: > > Besides what others suggested, check this link to see how the > > characters are encoded: > > > > http://en.wikipedia.org/wiki/Utf-8#Description > > > > So ASCII 254(1111 1110) = > > > > Byte 1: 110 00011 = &HC3 > > Byte 2: 10 111110 = &HBE > > I made a mistake in the byte order, so it should be the other way around: > > Byte 1: 110 11110 = &HDE > Byte 2: 10 000111 = &H87
[toc] | [prev] | [next] | [standalone]
| From | "Farnsworth" <nospam@nospam.com> |
|---|---|
| Date | 2012-03-08 23:32 -0500 |
| Message-ID | <jjc14v$v0i$1@speranza.aioe.org> |
| In reply to | #893 |
stevegdula@yahoo.com wrote: > Farnsworth, > > Your first reply, byte order actually seems to match my sample data. > > ASCII(254) > UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE > > I haven't currently digested the detailed UTF-8 Wiki explanation yet > and I hopefully won't have to unless I end up needing to write my own > UTF-8 record decoder. > > I am hoping to merely strip out the Byte Order Mark(BOM) > &HEFBBBF,inspect for end of record &H0D0A (one line = one record), > and pass that to the afore mentioned API call. If you look at the list at Wiki article, you notice each of the extra bytes is always >= 128, so you can read a large chunk, 1MB+, and you would know if you need to read few extra bytes or not if the last byte is >=128. As for CR LF, InStrB can be used for byte arrays. Example: Debug.Print InStrB(arr, vbCrLf) Finally, check ParseCSV01 routine at this page to parse the lines: http://www.xbeat.net/vbspeed/c_ParseCSV.php
[toc] | [prev] | [next] | [standalone]
| From | Schmidt <sss@online.de> |
|---|---|
| Date | 2012-03-09 07:32 +0100 |
| Message-ID | <jjc86g$e24$1@speranza.aioe.org> |
| In reply to | #893 |
Am 09.03.2012 02:51, schrieb stevegdula@yahoo.com: > I am hoping to merely strip out the Byte Order Mark(BOM)&HEFBBBF, > inspect for end of record&H0D0A (one line = one record), > and pass that to the afore mentioned API call. That's the right approach. In a refined, speedoptimized version you could read even entire record-groups in 32kB-chunks. So what I'd do is, search for a helper-class which can read 7GB-Files (which is using currency-Types for the FilePositionPointer - there's some of them floating around in the Web). Read 32kB or 64kB ByteArray-chunks from the file (skipping the UTF8-BOM on the first read chunk of curse). First action on a yet *undecoded* ByteArray-Chunk would be, to loop backwards until you find the vbLF-character, to determine the ending of the last "fully contained" record" within the current chunk. Adapt your absolute 64Bit FilePointer-Position-Variable to this last records vbLF-Position, so that you know from where to read the next FileChunk. Shorten the ByteArray of the current chunk, to exclude this last found vbCr+vbLF marker at the end of the chunk from the ByteArray per Redim Preserve. Then decode the entire (shortened) ByteArray from UTF8 to a normal VB-WideString (BStr) using multibytetowidechar-API. Then do a normal VB-Split-Command on the decoded String using vbCrLf. The resulting String-Array now contains properly decoded 16Bit wide Unicode-String-Records which you can loop over from Indexes 0 to Ubound(StrArray) to do your Record-processing. Keep in mind, that the VB-strings in this array now contain real Unicode - and *not* ANSI - e.g. the Euro-Sign (when contained) would be present in these Strings as 16Bit AscW-Value 8364 (&H20AC). So I would try to deal in your conversion routines with that fact - and not attempt any additional ANSI-conversion from these already nicely converted 16Bit-Unicode-WStrings, this Records-StringArray now contains. Olaf
[toc] | [prev] | [next] | [standalone]
| From | "Farnsworth" <nospam@nospam.com> |
|---|---|
| Date | 2012-03-09 13:40 -0500 |
| Message-ID | <jjdirb$5ga$1@speranza.aioe.org> |
| In reply to | #895 |
Schmidt wrote: > So what I'd do is, search for a helper-class which can > read 7GB-Files (which is using currency-Types for the > FilePositionPointer - there's some of them floating > around in the Web). I forgot that VB won't go beyond 2 GB in files open For Binary mode. One such class I used in the past is CFile.cls in this VB6 source library: http://sourceforge.net/projects/codebox/ It has most API functions for File I/O, and easy to use methods. It's limited to 32-Bit though, but you can easily add extra parameters for the higher 32-Bit.
[toc] | [prev] | [next] | [standalone]
| From | stevegdula@yahoo.com |
|---|---|
| Date | 2012-03-14 08:54 -0700 |
| Message-ID | <23509326.306.1331740445575.JavaMail.geo-discussion-forums@vbhb20> |
| In reply to | #896 |
Thanks everyone who offered guidance. I was finally able to overcome all the obstacles necessary to deal with this VERY cumbersome blob of data. Here is a synopsis of the factors:
The source provided a 7GB UTF-8 encoded text file. The file was structured as one record per line (136 fields), data delimited with Ascii(20) and qualified with Ascii(254). I eventually needed to get a field/record subset of this data into an ANSI ONLY legal database.
The first step is obtaining a Helper Class which could deal the with the beast of a large file. I was able to obtain this very nice "HugeBinaryFile.cls" from here: http://www.vbforums.com/showthread.php?t=531321
It uses the 64 bit VB 'Currency' datatype as the file pointer.
The following functions were also crucial to the 'byteArray to string' related actions (you could just as easily retained the Unicode if you needed to preserve any potential double byte character sets.)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Public Function ByteArrayToString(bytArray() As Byte) As String
Dim i As Long
Dim sAns As String
sAns = StrConv(bytArray, vbUnicode)
i = InStr(sAns, Chr(0))
If i > 0 Then
ByteArrayToString = Left(sAns, i - 1)
Else
ByteArrayToString = sAns
End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
'Declare in your module
Public Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As String, _
ByVal cchMultiByte As Long, ByVal lpWideCharStr As String, _
ByVal cchWideChar As Long) As Long
Public Const CP_UTF8 = 65001
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Function UTF8ToANSI(ByRef instring As String) As String
Dim i As Long
Dim L As Long
Dim temp As String
L = Len(instring)
temp = String(L * 2, 0)
i = MultiByteToWideChar(CP_UTF8, 0, instring & Chr(0), -1, temp, L)
If i > 0 Then
UTF8ToANSI = StrConv(Left(temp, (i - 1) * 2), vbFromUnicode)
i = InStr(UTF8ToANSI, Chr(0))
If i Then UTF8ToANSI = Left(UTF8ToANSI, i - 1)
Else
UTF8ToANSI = instring
End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
This main body of code reads buffer sampled "chunks" from the big source file. It ignores the first 3 bytes of UTF-8 BOM, It then searches for the end of record CRLF chars within the body of the sampled chunks; if the end of rec chars are not detected then the code doubles the size of the buffer and re-reads the bytes from the current record pointer position ~ in this way the buffer only gets as big as it needs to be. Currently, there is no protection to prevent the buffer from growing overly large, but as this is a one time deal it can be added in for other projects.
Any detected records are then converted to ANSI text and exported to a 1GB text file segement. The code will continue exporting out incremented 1GB ansi text files until the source file has been completely analyzed.
Of special note, I was not able to get the 'InstrB' to work as suggested directly with arrays. Further searching seemed to indicate that it would need to be converted to string first and then run (which defeats the purpose for me when I needed to maintain an accurate byte pointer of the source file where the CRLF's occur). Converting sampled byteArray to/from string changes the number of bytes. I ended up simply inspecting the sampled byte array within a For Each loop to detect the CRLF's to maintain relative file pointer info.
I have commented out the references to form Label objects which I used for my own feedback. I do have my own 'ParsingEngine module' to implement but that is outside the scope of this post. The resulting data can now be easily imported and auto-parsed into table(s) by something like MS Access 2010. From there querying the data is straight forward.
Once again - thanks everyone for your useful direction!
~Steve
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Private Sub cmdUseBigFileClass_Click()
'== <Huge File Class> Related Variables =========
Dim hbfFile As HugeBinaryFile
Dim bytBuf() As Byte
Dim lngBufferSize As Long
lngBufferSize = 4096 'Starting Buffer Size 4KB
ReDim bytBuf(1 To lngBufferSize)
lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
Dim strExportPath As String 'Exported ANSI File Location
Dim intFragNum As Integer 'Exported ANSI File Fragment #
Dim intExportFileNum As Integer
Dim strTempUniString As String
Dim strTempAnsiString As String
Dim curABSNextRecPointer As Currency '( Rec Ptr in Abs Byte Position)
Dim lngRelEORPos As Long 'Relative End Of Rec Position(current chunk)
Dim strCurRecLine As String
Dim strHeaderRec As String
Dim curRecCtr As Currency
Dim vntTemp As Variant, vntLast As Variant, lngRelCtr As Long
intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
Set hbfFile = New HugeBinaryFile
hbfFile.OpenFile strSourcePath 'Defined Elsewhere via file browse object
'lblBytes.Caption = Format(hbfFile.FileLen, "##,###,###,###,##0")
DoEvents
curABSNextRecPointer = 3 'Initialize @ First Byte beyond UTF-8 BOM (0~2)
Do Until (curABSNextRecPointer >= hbfFile.FileLen - 2) Or (hbfFile.EOF)
'Get Into Next Rec Pointer Position
hbfFile.SeekAbsolute curABSNextRecPointer
hbfFile.ReadBytes bytBuf 'Read the next Multi-K Byte sample
'Manually Inspect the byte Array for next [Chr(13)+Chr(10)] Rec Terminator
'This does job of "InstrB" directly with Byte Array
vntLast = 0
lngRelEORPos = 0
lngRelCtr = 0
For Each vntTemp In bytBuf ' Iterate through each element.
lngRelCtr = lngRelCtr + 1 'Used For Relative File Pointer Location
If vntTemp = 13 Then vntLast = vntTemp 'CR Found
If vntTemp = 10 Then 'Line Feed Found
If vntLast = 13 Then
'We found the Rec Terminator!
lngRelEORPos = lngRelCtr - 1 'Store Relative Position
Exit For
End If
End If
Next
If lngRelEORPos > 0 Then
'Next Record Resides at this ABS Byte position.
curABSNextRecPointer = curABSNextRecPointer + lngRelEORPos + 1
'Truncate the Byte Array to exclude the CRLF part
'of the current record data
ReDim Preserve bytBuf(1 To lngRelEORPos - 1)
strTempUniString = ByteArrayToString(bytBuf)
strTempAnsiString = UTF8ToANSI(strTempUniString)
ReDim bytBuf(1 To lngBufferSize) 'Return to set buffer size
strCurRecLine = strTempAnsiString
curRecCtr = curRecCtr + 1
If curRecCtr = 1 Then
'Store Header Record for all exported file 1GB fragments
strHeaderRec = strCurRecLine
Print #intExportFileNum, strHeaderRec
Else
'Export Current ANSI Record
Print #intExportFileNum, strCurRecLine
End If
If LOF(intExportFileNum) >= (2 ^ 30) Then
'The Currently Exported ANSI file is at 1GB, Close it
'and start a new file fragment.
Close intExportFileNum
intFragNum = intFragNum + 1
intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
Print #intExportFileNum, strHeaderRec
End If
Else
'Warning, End of Record NOT detected within current
'byte array sample size!
'Double the current buffer size & Re-Try record read
lngBufferSize = lngBufferSize * 2
ReDim bytBuf(1 To lngBufferSize) 'Destructive Allocation
'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
End If
'If curRecCtr Mod 100 = 0 Then
'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
'DoEvents
'End If
Loop
Close intExportFileNum
hbfFile.CloseFile
Set hbfFile = Nothing
'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
End Sub
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.basic.visual.misc
csiph-web