Re: Finding duplicate file names and modifying them based on elements of the path

From "Larry.Martell@gmail.com" <larry.martell@gmail.com>
Newsgroups comp.lang.python
Subject Re: Finding duplicate file names and modifying them based on elements of the path
Date 2012-07-19 18:01 -0700
Organization http://groups.google.com
Message-ID <2862aea5-9d5c-4979-8ca2-0bb01f9db32c@m3g2000vbl.googlegroups.com>
References <b2f1993c-8872-44ed-9e69-0895e4059532@mi5g2000pbc.googlegroups.com> <7xipdkwuqd.fsf@ruckus.brouhaha.com> <14831ee0-fd74-4906-852c-764ba2d8b1d5@h20g2000yqe.googlegroups.com> <7xipdjilko.fsf@ruckus.brouhaha.com>

On Jul 19, 1:43 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
>   from itertools import groupby
>
>   for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
>       gs = list(group)  # convert iterator to a list
>       n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state.  It bumps a pointer and maybe consumes some input
> every time you advance it.  In a situation like the above, you've got
> some nested iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that once you do a list(group) you have consumed the group
iterator, so a second pass over group yields nothing. This screwed me
up for a while, and seems very counter-intuitive.
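
A minimal sketch of that behavior, assuming some made-up (directory,
basename) pairs already sorted by basename; the names paths, key, counts
and second_pass are illustrative and not from the thread. Each group that
groupby yields is a one-shot iterator, so a second list(group) comes back
empty, and keeping the first list in a variable sidesteps the problem:

```python
from itertools import groupby

# Hypothetical data: (directory, basename) pairs, already sorted by basename.
paths = [("a", "x.txt"), ("b", "x.txt"), ("c", "y.txt")]

def key(p):
    return p[1]  # group on the basename

counts = {}
second_pass = {}
for name, group in groupby(paths, key):
    gs = list(group)                 # first pass consumes the group iterator
    counts[name] = len(gs)           # work with the saved list from here on
    second_pass[name] = list(group)  # the iterator is already exhausted

print(counts)       # {'x.txt': 2, 'y.txt': 1}
print(second_pass)  # {'x.txt': [], 'y.txt': []}
```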

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
>      n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.
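
For what it's worth, the counting idiom quoted above can be tried on a
small one-shot iterator. This is only a sketch with made-up values (xs
and leftover are illustrative names); it shows that the count comes out
right and that nothing is left in the iterator afterwards:

```python
xs = iter(["a", "b", "c"])  # hypothetical one-shot iterator
n = sum(1 for _ in xs)      # counts the elements by consuming xs
leftover = list(xs)         # xs is exhausted, so nothing remains

print(n, leftover)  # 3 []
```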
