Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Tue, 3 Jun 2014 20:11:54 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: Chris Angelico <rosuav@gmail.com>
Subject: Re: Unicode and Python - how often do you index strings?
In-Reply-To: <CAPTjJmr4iHdaCy61w2rz-oL6FcarRzzTeEU44Fxn2Z=gS0fh-Q@mail.gmail.com>
References: <CAPTjJmr4iHdaCy61w2rz-oL6FcarRzzTeEU44Fxn2Z=gS0fh-Q@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: "python-list@python.org" <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.10658.1401844358.18130.python-list@python.org>
Lines: 69
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:72566

On 2014-06-04 10:39, Chris Angelico wrote:
> A current discussion regarding Python's Unicode support centres (or
> centers, depending on how close you are to the cent[er]{2} of the
> universe) around one critical question: Is string indexing common?
> 
> Python strings can be indexed with integers to produce characters
> (strings of length 1). They can also be iterated over from beginning
> to end. Lots of operations can be built on either one of those two
> primitives; the question is, how much can NOT be implemented
> efficiently over iteration, and MUST use indexing? Theories are
> great, but solid use-cases are better - ideally, examples from
> actual production code (actual code optional).

Many of my string-indexing uses revolve around a sliding window which
can be done with itertools[1], though I often just roll it as
something like

  n = 3
  for i in range(1 + len(s) - n):
    do_something(s[i:i+n])

So that could be supplanted by the SO iterator linked below.
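
For illustration, an iteration-only window might look like the sketch
below. The name sliding_window is hypothetical, and this is not
necessarily the exact recipe from the linked answer; it assumes
collections.deque with maxlen plus itertools.islice, and it yields
tuples of characters rather than substrings.

```python
from collections import deque
from itertools import islice


def sliding_window(iterable, n):
    """Yield successive overlapping windows of length n as tuples.

    Hypothetical helper: consumes the input by iteration only,
    never by integer index.
    """
    it = iter(iterable)
    # Prime the window with the first n items (fewer if input is short).
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for item in it:
        # maxlen=n drops the oldest item automatically on append.
        window.append(item)
        yield tuple(window)


s = "abcde"
n = 3
windows = ["".join(w) for w in sliding_window(s, n)]
# The index-based loop from the post, collected for comparison.
indexed = [s[i:i + n] for i in range(1 + len(s) - n)]
print(windows)
```

Both approaches produce the same windows here; the iterator version
also works on inputs that only support iteration, at the cost of
re-joining each tuple into a string.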

The other big use case I have from production code involves a
column-offset delimited file where the headers have a row of
dashes under them delimiting the field widths, so it looks
something like

  EmpID     Name                Cost Center
  --------- ------------------- -----------------------------
  314159    Longstocking, Pippi RJ45
  265358    Davis, Miles        JA22
  979328    Bell, Alexander     RJ15

I then take row 2 and use it to make a mapping of header-name to a
slice-object for slicing the subsequent strings:

  import re
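  r = re.compile('-+')  # a sequence of 1+ dashes
  f = open("data.txt")  # file() only exists in Python 2; open() works in both
  headers = next(f)     # row 1: the column names
  lines = next(f)       # row 2: the dashes marking each field's width
  # map each upper-cased header name to the slice covering its column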
  header_map = dict((
      headers[i.start():i.end()].strip().upper(),
      slice(i.start(), i.end())
      )
    for i in r.finditer(lines)
    )
  for row in f:
    print("EmpID = %s" % row[header_map["EMPID"]].strip())
    print("Name = %s" % row[header_map["NAME"]].strip())
    # ...

which I presume uses string indexing under the hood.
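
Subscripting a str with a slice object is the same operation as the
s[a:b] syntax, so row[header_map["NAME"]] is ordinary string slicing.
A self-contained sketch of the same idea follows, assuming the sample
data above held in an io.StringIO instead of data.txt; the function
name parse_fixed_width and the SAMPLE constant are hypothetical.

```python
import io
import re

# Sample data copied from the post; stands in for data.txt.
SAMPLE = (
    "EmpID     Name                Cost Center\n"
    "--------- ------------------- -----------------------------\n"
    "314159    Longstocking, Pippi RJ45\n"
    "265358    Davis, Miles        JA22\n"
    "979328    Bell, Alexander     RJ15\n"
)


def parse_fixed_width(f):
    """Yield one dict per data row, keyed by upper-cased header name."""
    headers = next(f)  # row 1: column names
    rule = next(f)     # row 2: runs of dashes marking the field widths
    columns = {
        headers[m.start():m.end()].strip().upper(): slice(m.start(), m.end())
        for m in re.finditer(r"-+", rule)
    }
    for row in f:
        # row[sl] with a slice object is equivalent to row[start:stop].
        yield {name: row[sl].strip() for name, sl in columns.items()}


records = list(parse_fixed_width(io.StringIO(SAMPLE)))
for rec in records:
    print("EmpID = %s, Name = %s" % (rec["EMPID"], rec["NAME"]))
```

Slices that run past the end of a short row are clipped rather than
raising, which is why the last column can be narrower in the data
than in the dashed rule.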

Perhaps there's a better way of doing that, but it's what I currently
use to process these large-ish files (the largest max out at 10-20MB each).

There might be other use-cases I've done, but those two leap to mind.

-tkc


[1]
http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python