From: stevegdula@yahoo.com
Newsgroups: comp.lang.basic.visual.misc
Subject: How to handle LARGE UTF-8 file
Date: Thu, 8 Mar 2012 08:05:04 -0800 (PST)

Hi folks,

I recently had a large text file approaching 7GB in size dropped on me. The contents are supposed to be delimited text field data from a database. Its prohibitive size will not let me open it in a robust text editor, so I've just sampled the first 32K of it by opening it as a Binary file with 'Get & Put'. This at least allowed me to see what I was dealing with.

The entity who provided the data has disclaimed all responsibility for it, so I cannot ask for the data in another format.

The little 32K subset of text turned out to be UTF-8 encoded text with the EF BB BF header (the byte order mark) and comprises some 166 fields of delimited data. At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text.
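For illustration, the sampling step described above might look like the following. This is only a sketch in Python (the post itself used VB's Binary file access); the function name `sniff_head` and the 32K default are assumptions chosen to mirror the description, not anything taken from the original program.

```python
import codecs


def sniff_head(path, sample_size=32 * 1024):
    """Read the first sample_size bytes and report whether they start with the UTF-8 BOM.

    Hypothetical helper: returns (has_bom, decoded_sample_text).
    """
    with open(path, "rb") as f:
        head = f.read(sample_size)

    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    has_bom = head.startswith(codecs.BOM_UTF8)

    # Decode leniently: the sample may end in the middle of a multi-byte
    # character, which would otherwise raise UnicodeDecodeError.
    encoding = "utf-8-sig" if has_bom else "utf-8"
    text = head.decode(encoding, errors="replace")
    return has_bom, text
```

Printing the first line of the returned text is usually enough to spot the field delimiter and count the fields.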
I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF-8 encoded it seems to insist that it is loading an XML document and errors out during load. My hope was to export only the fields we need in ANSI format.

Because UTF-8 is not double-byte Unicode all of the time (as best I can tell from my research), I cannot simply step through the data and consistently ignore the 'extra' byte.

I experimented with 'StrConv' with no success at getting ANSI text out of sampled pieces of text.

My goal is to step through this text file and export some more manageable 2GB ANSI segments, or some such approach.

Can anyone offer any suggestions on how I can achieve my goal?

TIA!

~Steve
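One possible shape for the goal stated above, sketched in Python rather than VB: read the file in fixed-size binary chunks, feed them through an incremental UTF-8 decoder (so a character of one to four bytes that straddles a chunk boundary is handled correctly), and write the result out as ANSI in size-limited segments that break only at line ends. Several things here are assumptions rather than facts from the post: "ANSI" is taken to mean Windows-1252 (`cp1252`), each record is assumed to sit on one line, characters with no ANSI equivalent are replaced with `?`, and the function name, output naming scheme, and default sizes are all hypothetical.

```python
import codecs


def utf8_to_ansi_segments(src_path, out_prefix, segment_bytes=2 * 1024**3,
                          chunk_bytes=1024 * 1024, ansi="cp1252"):
    """Stream a UTF-8 file into numbered ANSI files, splitting only at line ends.

    Returns the list of output paths. Assumes one record per line.
    """
    # "utf-8-sig" silently drops a leading EF BB BF byte order mark.
    decoder = codecs.getincrementaldecoder("utf-8-sig")(errors="strict")
    paths = []
    out = None
    written = 0
    pending = ""  # trailing partial line carried over between chunks

    def emit(line):
        nonlocal out, written
        # Characters that do not exist in the ANSI code page become "?".
        data = line.encode(ansi, errors="replace")
        # Start a new segment if this line would push the current one over the limit.
        if out is None or (written and written + len(data) > segment_bytes):
            if out is not None:
                out.close()
            paths.append(f"{out_prefix}_{len(paths) + 1:03d}.txt")
            out = open(paths[-1], "wb")
            written = 0
        out.write(data)
        written += len(data)

    try:
        with open(src_path, "rb") as src:
            while True:
                chunk = src.read(chunk_bytes)
                # final=True on the empty last read flushes the decoder and
                # raises if the file ends in the middle of a character.
                text = pending + decoder.decode(chunk, final=not chunk)
                lines = text.split("\n")
                pending = lines.pop()  # incomplete last line, or ""
                for line in lines:
                    emit(line + "\n")
                if not chunk:
                    break
        if pending:
            emit(pending)  # last line had no trailing newline
    finally:
        if out is not None:
            out.close()
    return paths
```

Because the files are opened in binary mode, existing CR/LF line endings pass through unchanged. Selecting only the needed fields out of the 166 would be a further step on each line before it is written, and depends on the delimiter and quoting rules of the actual data.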