Path: csiph.com!goblin3!goblin1!goblin.stu.neva.ru!uio.no!news.tele.dk!news.tele.dk!small.news.tele.dk!newsgate.cistron.nl!newsgate.news.xs4all.nl!nzpost1.xs4all.net!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CAFHq_S4dQxkQoHhP0hQfNvZ0KVtz2F-PFqbePOyYUGFXCLqipg@mail.gmail.com>
References: <CAFHq_S4dQxkQoHhP0hQfNvZ0KVtz2F-PFqbePOyYUGFXCLqipg@mail.gmail.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Wed, 23 Sep 2015 10:07:36 -0600
Subject: Re: Readlines returns non ASCII character
To: Python <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.106.1443024499.28679.python-list@python.org>
Lines: 21
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:97042

On Wed, Sep 23, 2015 at 6:47 AM, SANKAR . <shankarphy@gmail.com> wrote:
> Hi all,
>
> I am not a expert programmer but I have to extract information from a large
> file.
>  I used  codecs.open(..) with UTF16 encoding to read this file. It could
> read all the lines in the file but returns with the non Ascii characters.
> Below are 5 sample lines. How do I avoid having this non Ascii items. Is
> there a better way to read this?

I suspect that what you want is not "non-ASCII" but just to read the
file without all the mojibake, which is likely an indication that
you're using the wrong encoding.

Do you know that UTF-16 is actually the encoding of the file?

Based on the spaces that appear between adjacent characters, I would
guess that this is probably in a 32-bit encoding, perhaps UTF-32. On
the other hand, the repeated 0x00ff 0x00fe 0x00ff are very curious; I
don't see how that could be valid UTF-32. Are you sure that this is a
text file and not some proprietary binary data format?
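
One way to check the encoding is to look at the raw bytes at the start of
the file and see whether they match a known byte order mark. The following
is only a rough sketch: the names sniff_bom and peek_head are made up for
illustration, and a file with no BOM at all will simply give None, which
tells you nothing either way.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def peek_head(path, count=16):
    """Read the first `count` raw bytes of a file and guess its encoding from the BOM."""
    with open(path, "rb") as f:  # binary mode, so nothing is decoded yet
        head = f.read(count)
    return head, sniff_bom(head)
```

Calling peek_head("yourfile.txt") (the file name is a placeholder) and
printing the returned bytes with repr() shows exactly what is at the start
of the file. If those bytes don't look like any BOM and don't look like
text either, that would support the idea that this is a binary format.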