
How to handle LARGE UTF-8 file

From	stevegdula@yahoo.com
Newsgroups	comp.lang.basic.visual.misc
Subject	How to handle LARGE UTF-8 file
Date	Thu, 8 Mar 2012 08:05:04 -0800 (PST)


Hi folks,

I recently had a large text file, approaching 7GB in size, dropped on me. Its contents are supposed to be delimited text field data from a database. Its prohibitive size will not let me open it even in a robust text editor, so I've just sampled the first 32K of it by opening it as a Binary file with 'Get & Put'. This at least allowed me to see what I was dealing with.

The entity that provided the data has disclaimed all responsibility for it, so I cannot ask for the data in another format.

The little 32K subset of text turned out to be UTF-8 encoded text with the EF BB BF header (the UTF-8 byte order mark) and consists of some 166 fields of delimited data. At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text.
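For illustration only, here is a minimal sketch of that kind of sampling check, written in Python rather than VB purely for brevity. The function names `read_sample` and `has_utf8_bom` are hypothetical, and the 32K sample size simply mirrors the figure mentioned above.

```python
import codecs


def read_sample(path, size=32 * 1024):
    """Read only the first `size` bytes so the huge file is never loaded whole."""
    with open(path, "rb") as f:
        return f.read(size)


def has_utf8_bom(sample):
    """Return True if the bytes start with the UTF-8 byte order mark EF BB BF."""
    return sample.startswith(codecs.BOM_UTF8)
```

A sample that passes this check is very likely UTF-8, although the BOM alone does not guarantee that the rest of the file is well formed.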

I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF-8 encoded it seems to insist that it is loading an XML document and errors out during load. My hope was to export only the fields we need in ANSI format.

Because UTF-8 is a variable-width encoding rather than double-byte Unicode all of the time (best I can tell from my research), I cannot simply step thru the data and consistently ignore the 'extra' byte.
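A small sketch of why skipping every other byte cannot work, again in Python for illustration only: each character takes between one and four bytes in UTF-8, depending on the character. The helper name `utf8_widths` is hypothetical.

```python
def utf8_widths(text):
    """Return the number of bytes each character occupies when encoded as UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]


# "A" is plain ASCII, "\u00e9" is e-acute, "\u20ac" is the euro sign.
widths = utf8_widths("A\u00e9\u20ac")
print(widths)  # [1, 2, 3]
```

Plain ASCII stays at one byte per character, so a mostly-English file will look single-byte until the first accented or special character appears.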

I experimented with 'StrConv' but had no success getting ANSI text out of sampled pieces of the file.

My goal is to step thru this text file and export it as more manageable ANSI segments of about 2GB each, or something along those lines.
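One way the stated goal could be approached is sketched below in Python (not VB, and not anyone's confirmed solution). It assumes that "ANSI" means Windows-1252 (`cp1252`), that records end with line breaks, and that characters with no ANSI equivalent may be replaced with `?`. The function name `split_utf8_to_ansi` and the output naming scheme are hypothetical.

```python
def split_utf8_to_ansi(src_path, out_prefix, max_bytes=2 * 1024**3 - 1,
                       ansi="cp1252"):
    """Stream a UTF-8 file line by line into ANSI segments of at most max_bytes.

    Segments are only cut at line boundaries, so a single line longer than
    max_bytes still ends up whole in its own (oversized) segment.
    Returns the list of segment paths that were written.
    """
    paths = []
    out = None
    written = 0
    # "utf-8-sig" skips a leading EF BB BF if present; newline="" keeps the
    # original line endings (e.g. CR LF) untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        try:
            for line in src:
                # Assumption: unmappable characters become "?" in the output.
                data = line.encode(ansi, errors="replace")
                if out is None or written + len(data) > max_bytes:
                    if out is not None:
                        out.close()
                    path = f"{out_prefix}{len(paths) + 1:03d}.txt"
                    out = open(path, "wb")
                    paths.append(path)
                    written = 0
                out.write(data)
                written += len(data)
        finally:
            if out is not None:
                out.close()
    return paths
```

Because the file is read one line at a time, memory use stays small regardless of the 7GB total. If losing unmappable characters silently is not acceptable, `errors="strict"` would raise an error at the first one instead.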

Can anyone offer any suggestions on how I can achieve my goal?

TIA !

~Steve


