
Groups > comp.lang.python > #68689 > unrolled thread

csv read _csv.Error: line contains NULL byte

Started by: chip9munk@gmail.com
First post: 2014-03-21 06:29 -0700
Last post: 2014-03-21 15:15 +0000
Articles: 6 (3 participants)

#68689 — csv read _csv.Error: line contains NULL byte

From: chip9munk@gmail.com
Date: 2014-03-21 06:29 -0700
Subject: csv read _csv.Error: line contains NULL byte
Message-ID: <22aeefa3-cf82-457c-ab85-6f0366ff7b4e@googlegroups.com>

Hi all!

I am reading from a huge CSV file (> 20 GB), so I have to read it line by line:

for i, row in enumerate(input_reader):
      #  and I do something on each row

Everything works fine until I get to a row with some strange symbols "0I`00�^";
at that point I get an error: _csv.Error: line contains NULL byte

How can I skip such a row and keep going, or "decipher" it in some way?

I have tried :
csvFile = open(input_file_path, 'rb')
csvFile = open(input_file_path, 'rU')
csvFile = open(input_file_path, 'r')

and nothing works.

if I do:

try:
    for i, row in enumerate(input_reader):
      #  and I do something on each row
except Exception:
    sys.exc_clear() 

I simply stop at that line. I would like to skip it and move on.

Please help!

Best,

Chip Munk



#68690

From: Tim Golden <mail@timgolden.me.uk>
Date: 2014-03-21 13:39 +0000
Message-ID: <mailman.8354.1395409181.18130.python-list@python.org>
In reply to: #68689

On 21/03/2014 13:29, chip9munk@gmail.com wrote:
> Hi all!
> 
> I am reading from a huge csv file (> 20 Gb), so I have to read line by line:
> 
> for i, row in enumerate(input_reader):
>       #  and I do something on each row
> 
> Everything works fine until i get to a row with some strange symbols "0I`00�^"
> at that point I get an error: _csv.Error: line contains NULL byte
> 
> How can i skip such row and continue going, or "decipher" it in some way?

Well you have several options:

Without disturbing your existing code too much, you could wrap the
input_reader in a generator which skips malformed lines. That would look
something like this:

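import csv

def unfussy_reader(reader):
    while True:
        try:
            yield next(reader)
        except StopIteration:
            # Input exhausted. Since Python 3.7 (PEP 479) a StopIteration
            # escaping a generator becomes RuntimeError, so return explicitly.
            return
        except csv.Error:
            # log the problem or whatever
            continue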


If you know what to do with the malformed data, you could strip it out and
carry on. Whatever works best for you.
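
A minimal sketch of the strip-it-out option, assuming the offending
characters are NUL bytes that can safely be discarded. The helper name
strip_nul and the in-memory sample are made up for illustration; with a
real file you would pass the open file object instead of the StringIO:

```python
import csv
import io


def strip_nul(lines):
    """Yield each input line with NUL characters removed (hypothetical helper)."""
    for line in lines:
        yield line.replace("\x00", "")


# In-memory stand-in for the real file; the second line contains a NUL.
sample = io.StringIO("abc,def\nghi\x00,klm\n123,456\n", newline="")

# csv.reader accepts any iterable of strings, so the filter slots in
# between the file and the reader and still works one line at a time.
rows = list(csv.reader(strip_nul(sample)))
print(rows)
```

Because the filter is itself a generator, nothing is loaded into memory
beyond the current line, which matters for a file of this size.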

Alternatively you could wrap the standard reader in a class of your own and
do something equivalent to the above in its __next__ method.
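
A rough sketch of the class-based variant. csv.reader is a factory
function rather than a class you can inherit from, so this wraps a reader
instead of subclassing it. The class name UnfussyReader and the skipped
counter are assumptions for illustration, and strict=True is used only to
provoke a csv.Error from the sample data:

```python
import csv


class UnfussyReader:
    """Iterator that wraps a csv reader and skips rows raising csv.Error."""

    def __init__(self, lines, **reader_kwargs):
        self._reader = csv.reader(lines, **reader_kwargs)
        self.skipped = 0  # number of rows dropped so far

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                # StopIteration from the wrapped reader propagates and
                # ends the iteration in the usual way.
                return next(self._reader)
            except csv.Error:
                self.skipped += 1


# The second line is malformed under strict parsing (text after a closing quote).
lines = ['a,b\n', 'c,"d"e\n', 'f,g\n']
reader = UnfussyReader(lines, strict=True)
rows = list(reader)
print(rows, reader.skipped)
```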

TJG



#68696

From: chip9munk@gmail.com
Date: 2014-03-21 07:46 -0700
Message-ID: <fefcec40-3bd9-4a94-9ae8-4f214fce2302@googlegroups.com>
In reply to: #68690

On Friday, March 21, 2014 2:39:37 PM UTC+1, Tim Golden wrote:

> Without disturbing your existing code too much, you could wrap the
> 
> input_reader in a generator which skips malformed lines. That would look
> 
> something like this:
> 
> 
> 
> def unfussy_reader(reader):
> 
>     while True:
> 
>         try:
> 
>             yield next(reader)
> 
>         except csv.Error:
> 
>             # log the problem or whatever
> 
>             continue


I am sorry I do not understand how to get to each row in this way.

Please could you explain also this:
If I define this function, 
how do I change my for loop to get each row?

Thanks!



#68697

From: chip9munk@gmail.com
Date: 2014-03-21 07:59 -0700
Message-ID: <c66fbee9-d585-4d0d-98be-e925f2cfef5f@googlegroups.com>
In reply to: #68696

Ok, I have figured it out:

for i, row in enumerate(unfussy_reader(input_reader)):
      #  and I do something on each row

Sorry, it is my first "face to face" with generators!

Thank you very much!

Best,
Chip Munk



#68698

From: Tim Golden <mail@timgolden.me.uk>
Date: 2014-03-21 14:59 +0000
Message-ID: <mailman.8361.1395414009.18130.python-list@python.org>
In reply to: #68696

On 21/03/2014 14:46, chip9munk@gmail.com wrote:
> I am sorry I do not understand how to get to each row in this way.
> 
> Please could you explain also this:
> If I define this function, 
> how do I change my for loop to get each row?

Does this help?

<code>
#!python3
import csv

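def unfussy_reader(csv_reader):
    while True:
        try:
            yield next(csv_reader)
        except StopIteration:
            # Input exhausted. Since Python 3.7 (PEP 479) a StopIteration
            # escaping a generator becomes RuntimeError, so return explicitly.
            return
        except csv.Error:
            # log the problem or whatever
            print("Problem with some row")
            continue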

if __name__ == '__main__':
    #
    # Generate malformed csv file for
    # demonstration purposes
    #
    with open("temp.csv", "w") as fout:
        fout.write("abc,def\nghi\x00,klm\n123,456")

    #
    # Open the malformed file for reading, fire up a
    # conventional CSV reader over it, wrap that reader
    # in our "unfussy" generator and enumerate over that
    # generator.
    #
    with open("temp.csv") as fin:
        reader = unfussy_reader(csv.reader(fin))
        for n, row in enumerate(reader):
            print(n, "=>", row)


</code>
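
For the "log the problem" step, one possibility is to record where each
bad row was found. This is only a sketch: the extra skipped parameter is
an assumption added for illustration, strict=True is there just to
provoke a csv.Error from the sample data, and line_num is the reader
attribute that counts the lines read from the source so far:

```python
import csv


def unfussy_reader(csv_reader, skipped):
    """Yield good rows; append (line number, message) for each bad one."""
    while True:
        try:
            yield next(csv_reader)
        except StopIteration:
            return  # input exhausted
        except csv.Error as exc:
            # line_num counts source lines consumed so far, so it points
            # at (or just past) the line that triggered the error.
            skipped.append((csv_reader.line_num, str(exc)))


# The second line is malformed under strict parsing (text after a closing quote).
lines = ['a,b\n', 'c,"d"e\n', 'f,g\n']
skipped = []
rows = list(unfussy_reader(csv.reader(lines, strict=True), skipped))
print(rows)
print(skipped)
```

With a 20 GB input, writing these entries to a log file rather than
keeping them in a list would probably be the safer choice.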


TJG



#68699

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2014-03-21 15:15 +0000
Message-ID: <mailman.8362.1395414925.18130.python-list@python.org>
In reply to: #68696

On 21/03/2014 14:46, chip9munk@gmail.com wrote:
> On Friday, March 21, 2014 2:39:37 PM UTC+1, Tim Golden wrote:
>
>> Without disturbing your existing code too much, you could wrap the
>>
>> input_reader in a generator which skips malformed lines. That would look
>>
>> something like this:
>>
>>
>>
>> def unfussy_reader(reader):
>>
>>      while True:
>>
>>          try:
>>
>>              yield next(reader)
>>
>>          except csv.Error:
>>
>>              # log the problem or whatever
>>
>>              continue
>
>
> I am sorry I do not understand how to get to each row in this way.
>
> Please could you explain also this:
> If I define this function,
> how do I change my for loop to get each row?
>
> Thanks!
>

I'm pleased to see that you have answers.  In return would you either 
use the mailing list 
https://mail.python.org/mailman/listinfo/python-list or read and action 
this https://wiki.python.org/moin/GoogleGroupsPython to prevent us 
seeing double line spacing and single line paragraphs, thanks.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



