


Re: How to handle LARGE UTF-8 file

From Deanna Earley <dee.earley@icode.co.uk>
Newsgroups comp.lang.basic.visual.misc
Subject Re: How to handle LARGE UTF-8 file
Date 2012-03-08 16:55 +0000
Organization Aioe.org NNTP Server
Message-ID <jjao9g$sis$1@speranza.aioe.org> (permalink)
References <29897294.1014.1331222704653.JavaMail.geo-discussion-forums@vblb5>



On 08/03/2012 16:05, stevegdula@yahoo.com wrote:
> Hi folks,
>
> I recently had a large text file approaching 7GB in size dropped on
> me.  The contents of which are supposed to be delimited text field
> data from a database.  Its prohibitive size will not let me open it
> in a robust text editor so I've just sampled the first 32K out of it
> via opening it as a Binary file with 'Get & Put'.  This at least
> allowed me to see what I was dealing with.
>
> The little 32K subset of text turned out to be Encoded UTF-8 text
> with the EF BB BF header and is comprised of some 166 fields of
> delimited data.  At least some subset of this data will eventually
> need to be loaded into an older legal database which only supports
> ANSI text.

While the data may be in UTF-8 format, will it actually contain any
non-ASCII text?
UTF-8 and ASCII are identical for the first 128 code points, so if the
file has no such text it is already valid ANSI (apart from the BOM).

You can check this by reading chunks (into a byte array) and scanning
for values > 127.
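
A minimal sketch of that chunked scan, written in Python purely for illustration (the same loop should port to VB using Binary-mode Get into a byte array). The function name, the 1 MiB chunk size and the BOM handling are my own assumptions, not anything from the original question.

```python
import io

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary, assumed chunk size

def find_first_non_ascii(stream, chunk_size=CHUNK_SIZE):
    """Return the offset of the first byte > 127, or None if there is none.

    A leading EF BB BF (UTF-8 BOM) is skipped, since it is a header and
    not data. Assumes chunk_size >= 3 so the BOM fits in the first chunk.
    """
    offset = 0          # file offset of the start of the current chunk
    first_chunk = True
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # reached end of file without finding a high byte
        start = 0
        if first_chunk and chunk.startswith(b"\xef\xbb\xbf"):
            start = 3    # skip the BOM
        first_chunk = False
        view = chunk[start:]
        if not view.isascii():
            # Locate the exact position only when the chunk fails the fast check.
            i = next(k for k, b in enumerate(view) if b > 127)
            return offset + start + i
        offset += len(chunk)
```

To run it against a real file, open the file in binary mode (for example `open(path, "rb")`, where `path` is whatever your file is called) and pass the file object in. Only one chunk is held in memory at a time, so the 7GB size should not matter beyond the time it takes to read. A result of None would mean everything after the BOM is plain ASCII.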

-- 
Deanna Earley (dee.earley@icode.co.uk)
i-Catcher Development Team
http://www.icode.co.uk/icatcher/

iCode Systems

(Replies direct to my email address will be ignored.
Please reply to the group.)



