From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: Read and count
Date: Thu, 10 Mar 2016 10:33:09 +0100
References: <2095750566.7009618.1457559033672.JavaMail.yahoo.ref@mail.yahoo.com>

Jussi Piitulainen wrote:

> Val Krem writes:
>
>> Hi all,
>>
>> I am a new learner about python (moving from R to python) and trying
>> read and count the number of observation by year for each city.
>>
>>
>> The data set look like
>> city year x
>>
>> XC1 2001 10
>> XC1 2001 20
>> XC1 2002 20
>> XC1 2002 10
>> XC1 2002 10
>>
>> Yv2 2001 10
>> Yv2 2002 20
>> Yv2 2002 20
>> Yv2 2002 10
>> Yv2 2002 10
>>
>> out put will be
>>
>> city
>> xc1 2001 2
>> xc1 2002 3
>> yv1 2001 1
>> yv2 2002 3
>>
>>
>> Below is my starting code
>> count=0
>> fo=open("dat", "r+")
>> str = fo.read();
>> print "Read String is : ", str
>>
>> fo.close()
>
> Below's some of the basics that you want to study. Also look up the csv
> module in Python's standard library. You will want to learn these things
> even if you end up using some sort of third-party data-frame library (I
> don't know those but they exist).

With pandas:

$ cat sample.txt
city year x
XC1 2001 10
XC1 2001 20
XC1 2002 20
XC1 2002 10
XC1 2002 10
Yv2 2001 10
Yv2 2002 20
Yv2 2002 20
Yv2 2002 10
Yv2 2002 10
$ python3
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> table = pandas.read_csv("sample.txt", delimiter=r"\s+")
>>> table
  city  year   x
0  XC1  2001  10
1  XC1  2001  20
2  XC1  2002  20
3  XC1  2002  10
4  XC1  2002  10
5  Yv2  2001  10
6  Yv2  2002  20
7  Yv2  2002  20
8  Yv2  2002  10
9  Yv2  2002  10

[10 rows x 3 columns]
>>> table.groupby(["city", "year"])["x"].count()
city  year
XC1   2001    2
      2002    3
Yv2   2001    1
      2002    4
dtype: int64

> from collections import Counter
>
> # collections.Counter is a special dictionary type for just this
> counts = Counter()
>
> # with statement ensures closing the file
> with open("dat") as fo:
>     # file object provides lines
>     next(fo) # skip header line
>     for line in fo:
>         # test requires non-empty string, but lines
>         # contain at least newline character so ok
>         if line.isspace(): continue
>         # .split() at whitespace, omits empty fields
>         city, year, x = line.split()
>         # collections.Counter has default 0,
>         # key is a tuple (city, year), parentheses omitted here
>         counts[city, year] += 1
>
> print("city")
> for city, year in sorted(counts): # iterate over keys
>     print(city.lower(), year, counts[city, year], sep = "\t")
>
> # Alternatively:
> # for cy, n in sorted(counts.items()):
> #     city, year = cy
> #     print(city.lower(), year, n, sep = "\t")
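
For completeness, here is a rough sketch of the csv-module route that Jussi mentions. It is only one way to do it. It assumes the fields are separated by spaces rather than tabs, and it embeds the sample data through io.StringIO so that it runs standalone; with a real file you would pass the open file object instead. The helper name count_by_city_year is made up for this illustration.

```python
import csv
import io
from collections import Counter

# Sample data copied from the thread; stands in for the real file.
SAMPLE = """\
city year x
XC1 2001 10
XC1 2001 20
XC1 2002 20
XC1 2002 10
XC1 2002 10
Yv2 2001 10
Yv2 2002 20
Yv2 2002 20
Yv2 2002 10
Yv2 2002 10
"""


def count_by_city_year(lines):
    """Count rows per (city, year); `lines` is any iterable of text lines.

    Hypothetical helper, assuming space-separated columns with a header
    row that names the columns "city" and "year".
    """
    # skipinitialspace=True tolerates runs of spaces between fields;
    # DictReader uses the first row as column names and skips blank lines.
    reader = csv.DictReader(lines, delimiter=" ", skipinitialspace=True)
    counts = Counter()
    for row in reader:
        counts[row["city"], row["year"]] += 1
    return counts


counts = count_by_city_year(io.StringIO(SAMPLE))
for (city, year), n in sorted(counts.items()):
    print(city.lower(), year, n, sep="\t")
```

Compared with line.split(), the DictReader version refers to columns by name, so it keeps working if the column order changes. It would not cope with tab-separated input, though; line.split() or pandas with delimiter=r"\s+" is more forgiving there.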