Date: Fri, 04 Jul 2014 15:24:31 +0200
From: "F.R."
To: python-list@python.org
Subject: Re: fixing an horrific formatted csv file.
Newsgroups: comp.lang.python

On 07/04/2014 12:28 PM, flebber wrote:
> On Friday, 4 July 2014 14:12:15 UTC+10, flebber wrote:
>> I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
>>
>> What is the best way to do this?
>>
>> In my file I had this trainer name on line 44:
>>
>> "Michael, Wayne & John Hawkes"
>>
>> and this horse name on line 95:
>>
>> Inz'n'out
>>
>> This throws off my capturing the correct item 9. How do I protect against this?
>>
>> Here is the current code.
>>
>> import re
>> from sys import argv
>> SCRIPT, FILENAME = argv
>>
>>
>> def out_file_name(file_name):
>>     """take an input file and keep the name with appended _clean"""
>>     file_parts = file_name.split(".",)
>>     output_file = file_parts[0] + '_clean.' + file_parts[1]
>>     return output_file
>>
>>
>> def race_table(text_file):
>>     """utility to reorganise poorly made csv entry"""
>>     input_table = [[item.strip(' "') for item in record.split(',')]
>>                    for record in text_file.splitlines()]
>>     # At this point look at input_table to find the record indices
>>     output_table = []
>>     for record in input_table:
>>         if record[0] == 'Meeting':
>>             meeting = record[3]
>>         elif record[0] == 'Race':
>>             date = record[13]
>>             race = record[1]
>>         elif record[0] == 'Horse':
>>             number = record[1]
>>             name = record[2]
>>             results = record[9]
>>             res_split = re.split('[- ]', results)
>>             starts = res_split[0]
>>             wins = res_split[1]
>>             seconds = res_split[2]
>>             thirds = res_split[3]
>>             prizemoney = res_split[4]
>>             trainer = record[4]
>>             location = record[5]
>>             print(name, wins, seconds)
>>             output_table.append((meeting, date, race, number, name,
>>                                  starts, wins, seconds, thirds, prizemoney,
>>                                  trainer, location))
>>     return output_table
>>
>> MY_FILE = out_file_name(FILENAME)
>>
>> # with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>> #     for line in race_table(f_in.readline()):
>> #         new_row = line
>> with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>>     CONTENT = f_in.read()
>>     # print(content)
>>     FILE_CONTENTS = race_table(CONTENT)
>>     # print new_name
>>     f_out.write(str(FILE_CONTENTS))
>>
>>
>> if __name__ == '__main__':
>>     pass
> So I found this on Stack Overflow:
>
> In [2]: import string
>
> In [3]: identity = string.maketrans("", "")
>
> In [4]: x = ['+5556', '-1539', '-99', '+1500']
>
> In [5]: x = [s.translate(identity, "+-") for s in x]
>
> In [6]: x
> Out[6]: ['5556', '1539', '99', '1500']
>
> but it fails on my file, I believe because mine is a list of lists. Is there an easy way to iterate over the sublists without flattening?
>
> Current code.
>
> input_table = [[item.strip(' "') for item in record.split(',')]
>                for record in text_file.splitlines()]
> # At this point look at input_table to find the record indices
> identity = string.maketrans("", "")
> print(input_table)
> input_table = [s.translate(identity, ",'") for s
>                in input_table]
>
> Sayth

Take Gregory's advice and use the csv module. Don't reinvent a csv parser. My "csv" splitter was the simplest approach possible, which I tend to use with undocumented formats, tweaking for unexpected features as they come along.

Frederic
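
To illustrate the csv module suggestion, here is a minimal sketch for Python 3. The sample rows and their column layout are made up for illustration (the real file is not shown in this thread), and parse_rows is a hypothetical helper name. It shows that a comma inside double quotes stays within one field and that an apostrophe needs no special handling, so no characters have to be stripped before splitting.

```python
import csv
import io

# Hypothetical sample rows; the column layout is an assumption,
# not the layout of the actual race file.
SAMPLE = '''Horse,1,Inz'n'out,x,"Michael, Wayne & John Hawkes",Rosehill
Horse,2,Plain Name,x,Some Trainer,Randwick
'''


def parse_rows(text):
    """Parse CSV text into a list of rows, honouring quoted fields."""
    # csv.reader keeps a comma inside double quotes as part of the field;
    # an apostrophe is treated as an ordinary character.
    return list(csv.reader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
for row in rows:
    # Horse name and trainer name each arrive as a single item.
    print(row[2], '|', row[4])
```

With a real file, the same reader can be given the open file object directly. If the file has spaces after the delimiters before an opening quote, passing skipinitialspace=True to csv.reader may be needed for the quoting to be recognised.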