


How to handle LARGE UTF-8 file

Path csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.glorb.com!npeer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
From stevegdula@yahoo.com
Newsgroups comp.lang.basic.visual.misc
Subject How to handle LARGE UTF-8 file
Date Thu, 8 Mar 2012 08:05:04 -0800 (PST)
Organization http://groups.google.com
Lines 40
Message-ID <29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5>
NNTP-Posting-Host 4.28.51.130
Mime-Version 1.0
Content-Type text/plain; charset=ISO-8859-1
Content-Transfer-Encoding quoted-printable
X-Trace posting.google.com 1331222704 19471 127.0.0.1 (8 Mar 2012 16:05:04 GMT)
X-Complaints-To groups-abuse@google.com
NNTP-Posting-Date Thu, 8 Mar 2012 16:05:04 +0000 (UTC)
Complaints-To groups-abuse@google.com
Injection-Info glegroupsg2000goo.googlegroups.com; posting-host=4.28.51.130; posting-account=6DX0cgkAAAAoDsfrvrkw7olQC-OfHI_P
User-Agent G2/1.0
X-Received-Bytes 2554
Xref csiph.com comp.lang.basic.visual.misc:887



Hi folks,

I recently had a large text file, approaching 7GB in size, dropped on me.  Its contents are supposed to be delimited text field data from a database.  Its prohibitive size will not let me open it even in a robust text editor, so I've just sampled the first 32K of it by opening it as a Binary file with 'Get & Put'.  This at least allowed me to see what I was dealing with.

The entity who provided the data has disclaimed all responsibility for it, so I cannot ask for the data in another format.

The little 32K subset of text turned out to be UTF-8 encoded text with the EF BB BF byte-order mark (BOM) at the start, and it comprises some 166 fields of delimited data.  At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text.
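
For anyone wanting to repeat the sampling step outside VB, here is a minimal sketch in Python. The function name `read_sample` and the 32K default are my own choices for illustration; the only facts taken from above are the sample size and the EF BB BF signature.

```python
import codecs


def read_sample(path, sample_size=32 * 1024):
    """Read the first sample_size bytes of a file in binary mode.

    Returns (has_bom, head), where has_bom is True when the sample
    starts with the UTF-8 byte-order mark EF BB BF.
    """
    with open(path, "rb") as f:
        head = f.read(sample_size)
    return head.startswith(codecs.BOM_UTF8), head
```

Only the requested bytes are read, so the 7GB size of the file should not matter for this check.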

I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF-8 encoded, Access seems to insist that it is loading an XML document and errors out during the load.  My hope was to export only the fields we need in ANSI format.
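
The field export I was hoping for could be sketched roughly like this in Python, bypassing Access entirely. Several things here are assumptions and not facts about the file: the delimiter is taken to be `|`, "ANSI" is taken to mean Windows-1252 (`cp1252`), and `export_fields` and the `keep` list of zero-based field positions are hypothetical names.

```python
import csv


def export_fields(src_path, dst_path, keep, delimiter="|", ansi="cp1252"):
    """Copy only the field positions listed in `keep` to an ANSI file.

    Assumes the source is UTF-8 (a leading BOM is skipped) and that
    `delimiter` matches the real file, which has to be checked first.
    """
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding=ansi, errors="replace", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        for row in reader:
            # Short rows get an empty string instead of raising IndexError.
            writer.writerow([row[i] if i < len(row) else "" for i in keep])
```

Rows are read and written one at a time, so memory use stays small. Characters that have no Windows-1252 equivalent come out as `?` because of `errors="replace"`.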

Because UTF-8 is a variable-width encoding rather than fixed double-byte Unicode (best I can tell from my research), I cannot simply step through the data and consistently ignore the 'extra' byte.
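
To illustrate why skipping a fixed 'extra' byte cannot work, here is a small Python sketch of how the first byte of each UTF-8 sequence determines its length. The function name `utf8_sequence_length` is hypothetical; the byte ranges follow the UTF-8 definition.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte uses."""
    if lead_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


# 'A' is 1 byte, 'é' is 2 bytes, and the euro sign is 3 bytes in UTF-8.
for ch in ("A", "\u00e9", "\u20ac"):
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), utf8_sequence_length(encoded[0]))
```

So a character can occupy anywhere from one to four bytes, and the data has to be decoded properly, not thinned out byte by byte.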

I experimented with 'StrConv' but had no success getting ANSI text out of sampled pieces of text.

My goal is to step through this text file and export more manageable ANSI segments of 2GB or so, or some such approach.
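
As a rough sketch of what I mean, written in Python and not VB: stream the file line by line, convert each line to ANSI, and start a new output file whenever the size limit would be exceeded. The assumptions are mine: "ANSI" is taken to mean Windows-1252 (`cp1252`), records are assumed to end at line breaks, and `split_utf8_to_ansi` and the output naming scheme are hypothetical.

```python
def split_utf8_to_ansi(src_path, out_prefix, max_bytes=2 * 1024**3, ansi="cp1252"):
    """Stream a UTF-8 file into numbered ANSI segment files.

    Segments are cut only at line boundaries, so each one stays at or
    under max_bytes unless a single line is longer than that.
    Returns the list of segment paths written.
    """
    outputs = []
    out = None
    written = 0
    try:
        # utf-8-sig skips a leading EF BB BF; newline="" keeps line endings as-is.
        with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
            for line in src:
                # Characters with no ANSI equivalent become '?'.
                data = line.encode(ansi, errors="replace")
                if out is None or written + len(data) > max_bytes:
                    if out is not None:
                        out.close()
                    path = f"{out_prefix}_{len(outputs) + 1:03d}.txt"
                    out = open(path, "wb")
                    outputs.append(path)
                    written = 0
                out.write(data)
                written += len(data)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Only one line is held in memory at a time. If any field contains an embedded line break, a record could end up split across two segments, so that would need checking against the real data.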

Can anyone offer any suggestions on how I can achieve my goal?

TIA !

~Steve




Thread

How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 08:05 -0800
  Re: How to handle LARGE UTF-8 file Deanna Earley <dee.earley@icode.co.uk> - 2012-03-08 16:55 +0000
    Re: How to handle LARGE UTF-8 file "Bob Butler" <bob_butler@cox.invalid> - 2012-03-08 10:13 -0800
      Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 10:49 -0800
  Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:00 -0500
    Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 16:05 -0500
      Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-08 17:51 -0800
        Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-08 23:32 -0500
        Re: How to handle LARGE UTF-8 file Schmidt <sss@online.de> - 2012-03-09 07:32 +0100
          Re: How to handle LARGE UTF-8 file "Farnsworth" <nospam@nospam.com> - 2012-03-09 13:40 -0500
            Re: How to handle LARGE UTF-8 file stevegdula@yahoo.com - 2012-03-14 08:54 -0700
