From: stevegdula@yahoo.com
Newsgroups: comp.lang.basic.visual.misc
Subject: How to handle LARGE UTF-8 file
Date: Thu, 8 Mar 2012 08:05:04 -0800 (PST)
Organization: http://groups.google.com
Lines: 40
Message-ID: <29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5>

Hi folks,

I recently had a large text file, approaching 7GB in size, dropped on me.
Its contents are supposed to be delimited text field data from a
database. Its prohibitive size will not let me open it even in a robust
text editor, so I've just sampled the first 32K of it by opening it as a
Binary file with 'Get & Put'. This at least allowed me to see what I was
dealing with.

The entity who provided the data has disclaimed all further
responsibility for it, so I cannot ask for the data in another format.

The little 32K subset of text turned out to be UTF-8 encoded text with
the EF BB BF header (the byte order mark), and it comprises some 166
fields of delimited data. At least some subset of this data will
eventually need to be loaded into an older legal database which only
supports ANSI text.
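
To make the sampling step concrete, here is a minimal sketch of reading
the first 32K and checking for the EF BB BF header. It is written in
Python rather than VB purely for illustration; the function name
sniff_head and the path argument are hypothetical.

```python
import codecs

def sniff_head(path, size=32 * 1024):
    """Read the first `size` bytes and report whether they start with the UTF-8 BOM."""
    with open(path, "rb") as f:          # binary mode: no decoding, no newline translation
        head = f.read(size)              # only the sample is held in memory
    has_bom = head.startswith(codecs.BOM_UTF8)  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
    return has_bom, head
```

Because only the first 32K is read, this works the same on a 7GB file as
on a small one.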

I've tried loading the entire thing into an Office 2010 Access database,
but because the text is UTF-8 encoded, Access seems to insist that it is
loading an XML document and errors out during the load. My hope was to
export only the fields we need in ANSI format.
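
The kind of export I was hoping for could be sketched as below, in
Python for illustration only. Everything specific here is an assumption:
the "|" delimiter, the field positions in `keep`, standard CSV-style
quoting, and Windows-1252 ("cp1252") standing in for "ANSI". The real
delimiter and quoting rules of the file are not known.

```python
import csv

def export_fields(src_path, dst_path, keep, delimiter="|", ansi="cp1252"):
    """Copy only the columns listed in `keep` (0-based) from a UTF-8 file to an ANSI file."""
    # "utf-8-sig" strips the EF BB BF header if present; rows are streamed one at a time.
    with open(src_path, encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding=ansi, errors="replace", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        for row in reader:
            # Short rows are padded with empty strings rather than raising IndexError.
            writer.writerow([row[i] if i < len(row) else "" for i in keep])
```

With errors="replace", any character that has no ANSI equivalent is
written as "?" instead of stopping the export.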

Because UTF-8 is not double-byte Unicode all of the time (as best I can
tell from my research, a character can take anywhere from one to four
bytes), I cannot simply step through the data and consistently ignore
the 'extra' byte.
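
To illustrate why there is no fixed 'extra' byte to skip, here is a
small sketch (Python, for illustration) that derives the length of a
UTF-8 sequence from its first byte. The function name
utf8_sequence_length is hypothetical.

```python
def utf8_sequence_length(lead):
    """Return how many bytes a UTF-8 character occupies, given its first byte (0-255)."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                      # e.g. most accented Latin letters
    if 0xE0 <= lead <= 0xEF:
        return 3                      # e.g. the euro sign, most CJK
    if 0xF0 <= lead <= 0xF4:
        return 4                      # characters beyond the Basic Multilingual Plane
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never appear in valid UTF-8.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)

lengths = [utf8_sequence_length(ch.encode("utf-8")[0]) for ch in "Aé€"]
# lengths == [1, 2, 3]
```

A chunk boundary can therefore land in the middle of a character, which
is why a byte-by-byte skip cannot work.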

I experimented with 'StrConv' but had no success getting ANSI text out
of sampled pieces of the text.

My goal is to step through this text file and export some more
manageable 2GB ANSI segments, or some such approach.
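
The approach I have in mind could look roughly like this sketch, again
in Python for illustration and not a tested solution for the real file.
Assumptions: "ANSI" means Windows-1252 ("cp1252"), records end in a line
feed, unmappable characters may be replaced with "?", and the output
naming (out_prefix plus a three-digit number) is hypothetical.

```python
import codecs

def utf8_to_ansi_segments(src_path, out_prefix, max_bytes=2 * 1024**3,
                          chunk_size=1024 * 1024, ansi="cp1252"):
    """Stream a UTF-8 file into numbered ANSI files, splitting only at line ends."""
    # The incremental decoder keeps a partial multi-byte character until the
    # next chunk arrives; "utf-8-sig" also drops the leading EF BB BF header.
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    paths, out, written, tail = [], None, 0, ""

    def write_line(line):
        nonlocal out, written
        data = line.encode(ansi, errors="replace")   # unmappable characters become "?"
        if out is None or (written and written + len(data) > max_bytes):
            if out:
                out.close()
            path = "%s%03d.txt" % (out_prefix, len(paths) + 1)
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(data)
        written += len(data)

    try:
        with open(src_path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                text = tail + decoder.decode(chunk, final=not chunk)
                parts = text.split("\n")
                tail = parts.pop()            # incomplete last line, carried over
                for part in parts:
                    write_line(part + "\n")
                if not chunk:                 # end of file reached
                    if tail:
                        write_line(tail)
                    break
    finally:
        if out:
            out.close()
    return paths
```

Only about one chunk plus one partial line is in memory at a time, so
the 7GB size should not matter, and no record is cut in half at a
segment boundary.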

Can anyone offer any suggestions on how I can achieve my goal?

TIA !

~Steve