Groups > comp.lang.python > #25588 > unrolled thread

Finding duplicate file names and modifying them based on elements of the path

Started by	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
First post	2012-07-18 15:20 -0700
Last post	2012-07-20 16:45 +0100
Articles	19 — 7 participants


  Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-18 15:20 -0700
    Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-18 15:49 -0700
      Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 12:00 -0700
        Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-19 12:43 -0700
          Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 18:01 -0700
            Re: Finding duplicate file names and modifying them based on elements of the path Peter Otten <__peter__@web.de> - 2012-07-20 09:35 +0200
            Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-20 00:51 -0700
            Re: Finding duplicate file names and modifying them based on elements of the path Paul Rudin <paul.nospam@rudin.co.uk> - 2012-07-20 09:37 +0100
      Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 11:52 -0700
        Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-19 12:56 -0700
          Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 17:58 -0700
    Re: Finding duplicate file names and modifying them based on elements of the path Simon Cropper <simoncropper@fossworkflowguides.com> - 2012-07-19 10:36 +1000
      Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 11:54 -0700
        RE: Finding duplicate file names and modifying them based on elements of the path "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-07-19 19:02 +0000
          Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 12:06 -0700
            Re: Finding duplicate file names and modifying them based on elements of the path MRAB <python@mrabarnett.plus.com> - 2012-07-19 22:32 +0100
              Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 18:01 -0700
                Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 20:07 -0700
                  Re: Finding duplicate file names and modifying them based on elements of the path MRAB <python@mrabarnett.plus.com> - 2012-07-20 16:45 +0100

#25588 — Finding duplicate file names and modifying them based on elements of the path

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-18 15:20 -0700
Subject	Finding duplicate file names and modifying them based on elements of the path
Message-ID	<b2f1993c-8872-44ed-9e69-0895e4059532@mi5g2000pbc.googlegroups.com>

I have an interesting problem I'm trying to solve. I have a solution
almost working, but it's super ugly, and I know there has to be a
better, cleaner way to do it.

I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are
duplicates, and for each one that is a dup, prepend dir4 to the
filename as long as the dir4/file pair is unique. If there are
multiple dir4/files in the list, then I also need to add a sequence
number based on the sorted value of dir5 (which is a date in ddMONyy
format).

For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3

My solution involves multiple maps and multiple iterations through the
data. How would you folks do this?


#25590

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-18 15:49 -0700
Message-ID	<7xipdkwuqd.fsf@ruckus.brouhaha.com>
In reply to	#25588

"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it. ...
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?

You could post your code and ask for suggestions how to improve it.
There are a lot of not-so-natural constraints in that problem, so it
stands to reason that the code will be a bit messy.  The whole
specification seems like an antipattern though.  You should just give a
sensible encoding for the filename regardless of whether other fields
are duplicated or not.  You also don't seem to address the case where
basename, dir4, and dir5 are all duplicated.

The approach I'd take for the spec as you wrote it is:

1. Sort the list on the (basename, dir4, dir5) triple, saving original
   location (numeric index) of each item  
2. Use itertools.groupby to group together duplicate basenames.
3. Within the groups, use groupby again to gather duplicate dir4's,
4. Within -those- groups, group by dir5 and assign sequence numbers in
   groups where there's more than one file
5. Unsort to get the rewritten items back into the original order.

Actual code is left as an exercise.
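
For readers who want a starting point, the five steps above could be
sketched roughly as follows. This is only one possible reading of the
spec, not code posted in the thread: split_fields, sort_key and
rename_duplicates are hypothetical names, the field positions assume the
/dir0/.../dir6/file layout from the original post, and each group is
turned into a list so that its size is known.

```python
from itertools import groupby
from time import strptime


def split_fields(path):
    # Hypothetical helper: (basename, dir4, dir5) for a path of the form
    # /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file.
    parts = path.split("/")
    return parts[-1], parts[5], parts[6]


def sort_key(item):
    # Sort on (basename, dir4, date). The ddMONyy date is parsed so that
    # it orders chronologically rather than alphabetically.
    index, path = item
    name, dir4, dir5 = split_fields(path)
    return name, dir4, strptime(dir5, "%d%b%y")[:3]


def rename_duplicates(paths):
    result = list(paths)
    # Step 1: sort, remembering each item's original position.
    indexed = sorted(enumerate(paths), key=sort_key)
    # Step 2: group by basename; list() makes the group size available.
    for name, by_name in groupby(indexed, lambda it: split_fields(it[1])[0]):
        by_name = list(by_name)
        if len(by_name) == 1:
            continue  # unique basename: leave it alone
        # Step 3: within one basename, group by dir4.
        for dir4, by_dir4 in groupby(by_name, lambda it: split_fields(it[1])[1]):
            by_dir4 = list(by_dir4)
            # Step 4: add a sequence number only if dir4/file is not unique.
            for seq, (index, path) in enumerate(by_dir4):
                if len(by_dir4) > 1:
                    new_name = "{0}_{1:02d}_{2}".format(dir4, seq, name)
                else:
                    new_name = "{0}_{1}".format(dir4, name)
                # Step 5: writing back by index restores the original order.
                result[index] = path[: -len(name)] + new_name
    return result
```

Entries whose basename, dir4 and dir5 are all identical simply get
consecutive sequence numbers here, which is a guess at behaviour the
spec leaves open.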


#25642

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 12:00 -0700
Message-ID	<14831ee0-fd74-4906-852c-764ba2d8b1d5@h20g2000yqe.googlegroups.com>
In reply to	#25590

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having 1 issue - how do you
know how many items are in each group? Without knowing that I have to
either make 2 passes through the data, or else work on the previous
item (when I'm in an iteration after the first then I know I have
dups). But that very quickly gets crazy with trying to keep the
previous values.


#25647

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-19 12:43 -0700
Message-ID	<7xipdjilko.fsf@ruckus.brouhaha.com>
In reply to	#25642

"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:
> Thanks for the reply Paul. I had not heard of itertools. It sounds
> like just what I need for this. But I am having 1 issue - how do you
> know how many items are in each group?

Simplest is:

  for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
     gs = list(group)  # convert iterator to a list
     n = len(gs)       # this is the number of elements

there is some theoretical inelegance in that it requires each group to
fit in memory, but you weren't really going to have billions of files
with the same basename.
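
A small self-contained illustration of that pattern, using made-up rows
of (basename, dir4, dir5) that are already sorted by basename (the rows
and the name sizes are assumptions for the example only):

```python
from itertools import groupby

# Hypothetical sample rows of (basename, dir4, dir5), sorted by basename.
rows = [
    ("file1", "abcd", "08Jan12"),
    ("file1", "xyz", "08Jan12"),
    ("file2", "abcd", "08Jan12"),
]

sizes = {}
for name, group in groupby(rows, key=lambda row: row[0]):
    members = list(group)       # materialise the group first
    sizes[name] = len(members)  # the size is now known up front
```

After the loop, members can be walked as often as needed, which the
group iterator itself would not allow.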

If you're not used to iterators and itertools, note there are some
subtleties to using groupby to iterate over files, because an iterator
actually has state.  It bumps a pointer and maybe consumes some input
every time you advance it.  In a situation like the above, you've got
some nested iterators (the groupby iterator generating groups, and the
individual group iterators that come out of the groupby) that wrap the
same file handle, so bad confusion can result if you advance both
iterators without being careful (one can consume file input that you
thought would go to another).

This isn't as bad as it sounds once you get used to it, but it can be
a source of frustration at first.  

BTW, if you just want to count the elements of an iterator (while
consuming it),

     n = sum(1 for x in xs)

counts the elements of xs without having to expand it into an in-memory
list.
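
To make the "while consuming it" part concrete, here is a tiny sketch
with made-up data (words, counts and leftovers are names invented for
the example). Once a group has been counted this way, reading it again
yields nothing:

```python
from itertools import groupby

words = ["apple", "avocado", "banana"]  # hypothetical data, sorted by first letter

counts = {}
leftovers = {}
for letter, group in groupby(words, key=lambda w: w[0]):
    counts[letter] = sum(1 for _ in group)  # walks the group to its end
    leftovers[letter] = list(group)         # nothing is left to read
```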

Itertools really makes Python feel a lot more expressive and clean,
despite little kinks like the above.


#25666

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 18:01 -0700
Message-ID	<2862aea5-9d5c-4979-8ca2-0bb01f9db32c@m3g2000vbl.googlegroups.com>
In reply to	#25647

On Jul 19, 1:43 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
>   for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
>      gs = list(group)  # convert iterator to a list
>      n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state.  It bumps a pointer and maybe consumes some input
> every time you advance it.  In a situation like the above, you've got
> some nested iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that if you do a list(group) you have consumed the list. This
screwed me up for a while, and seems very counter-intuitive.

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
>      n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.


#25675

From	Peter Otten <__peter__@web.de>
Date	2012-07-20 09:35 +0200
Message-ID	<mailman.2336.1342769705.4697.python-list@python.org>
In reply to	#25666

Larry.Martell@gmail.com wrote:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Many itertools functions work that way. It allows you to iterate over the 
items even if there is more data than fits into memory. 
If you need to keep all items and are sure that your computer can cope with 
them at once you can always throw in a

group = list(group)


#25676

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-20 00:51 -0700
Message-ID	<7xk3xyswee.fsf@ruckus.brouhaha.com>
In reply to	#25666

"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:
> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Yes, that is correct, you have to carefully watch where the stuff in the
iterators is getting consumed, including when there are nested
iterators.  That's what I was mentioning earlier--it got me confused at
first, but I use that style all the time now and it is pretty natural.


#25683

From	Paul Rudin <paul.nospam@rudin.co.uk>
Date	2012-07-20 09:37 +0100
Message-ID	<87r4s696ax.fsf@no-fixed-abode.cable.virginmedia.net>
In reply to	#25666

"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

You've consumed the *group* which is an iterator, in order to construct
a list from its elements. Sorry if this is excessively nit-picking, but
it generally helps to keep these things very clear in your own mind.


#25646

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 11:52 -0700
Message-ID	<8d08181d-316d-4fdc-8f03-1d9b90929d89@e7g2000yqh.googlegroups.com>
In reply to	#25590

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell
    # if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but
        # how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group,
                                                  lambda x: x[2]):
            pass


You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?

Thanks!
-larry
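
The two-pass idea mentioned above could look something like this sketch.
It uses collections.Counter (available from Python 2.7) rather than
groupby; the sample rows and the names name_counts, pair_counts,
needs_tool and needs_sequence are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows of (basename, tool, file_date), shaped like reports_list.
reports_list = [
    ("file1", "abcd", "08Jan12"),
    ("file1", "xyz", "08Jan12"),
    ("file2", "abcd", "08Jan12"),
    ("file3", "qwer", "07Jan12"),
    ("file3", "qwer", "09Jan12"),
]

# Pass 1: count each basename and each (basename, tool) pair.
name_counts = Counter(row[0] for row in reports_list)
pair_counts = Counter(row[:2] for row in reports_list)

# Pass 2: the counts are known before any row is visited.
needs_tool = [row for row in reports_list if name_counts[row[0]] > 1]
needs_sequence = [row for row in reports_list if pair_counts[row[:2]] > 1]
```

With the counts in hand there is no need to remember the previous row
or to special-case the last one after the loop ends.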


#25648

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-19 12:56 -0700
Message-ID	<7xipdjbk3z.fsf@ruckus.brouhaha.com>
In reply to	#25646

"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:
> You can't do a len on the iterator that is returned from groupby, and
> I've tried to do something with imap or      defaultdict, but I'm not
> getting anywhere. I guess I can just make 2 passes through the data,
> the first time getting counts. Or am I missing something about how
> groupby works?

I posted another reply to your other message, which reached me earlier.
If you're still stuck, post again, though I probably won't be able to
reply til tomorrow or the next day.


#25664

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 17:58 -0700
Message-ID	<51ef259b-f7fd-49ae-975e-cb09d65fdd5d@cu1g2000vbb.googlegroups.com>
In reply to	#25648

On Jul 19, 1:56 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > You can't do a len on the iterator that is returned from groupby, and
> > I've tried to do something with imap or      defaultdict, but I'm not
> > getting anywhere. I guess I can just make 2 passes through the data,
> > the first time getting counts. Or am I missing something about how
> > groupby works?
>
> I posted another reply to your other message, which reached me earlier.
> If you're still stuck, post again, though I probably won't be able to
> reply til tomorrow or the next day.

I really appreciate the offer, but I'm going to go with MRAB's
solution. It works, and I understand it ;-)


#25594

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2012-07-19 10:36 +1000
Message-ID	<mailman.2281.1342658195.4697.python-list@python.org>
In reply to	#25588

On 19/07/12 08:20, Larry.Martell@gmail.com wrote:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it.
>
> I have a list of path names that have this form:
>
> /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> I need to find all the file names (basenames) in the list that are
> duplicates, and for each one that is a dup, prepend dir4 to the
> filename as long as the dir4/file pair is unique. If there are
> multiple dir4/files in the list, then I also need to add a sequence
> number based on the sorted value of dir5 (which is a date in ddMONyy
> format).
>
> For example, if my list contains:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> Then I want to end up with:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?
>

Hi Larry,

I am making the assumption that you intend to collapse the directory 
tree and store each file in the same directory, otherwise I can't think 
of why you need to do this.

If this is the case, then I would...

1. import all the files into an array
2. parse path to extract fourth-level directory name and base name.
3. reiterate through the array
    3.1 check if base filename exists in recipient directory
    3.2 if not, copy to recipient directory
    3.3 if present, append the directory path then save
    3.4 create log of success or failure

Personally, I would not have some files with abcd_file1 and others as 
file2 because if it is important enough to store a file in a separate 
directory you should also note where file2 came from as well. When 
looking at your results at a later date you are going to have to open 
file2 (which I presume must record where it relates to) to figure out 
where it came from. If it is in the name it is easier to review.

In short, consistency is the name of the game; if you are going to do it 
for some then do it for all; and finally it will be easier for others 
later to work out what you have done.

-- 
Cheers Simon

    Simon Cropper - Open Content Creator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages           http://www.fossworkflowguides.com/gis
    bash / Python    http://www.fossworkflowguides.com/scripting


#25641

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 11:54 -0700
Message-ID	<aad5b32d-b0b4-47c1-b134-4827ec0472b2@t20g2000yqn.googlegroups.com>
In reply to	#25594

On Jul 18, 6:36 pm, Simon Cropper
<simoncrop...@fossworkflowguides.com> wrote:
> On 19/07/12 08:20, Larry.Mart...@gmail.com wrote:
>
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it.
>
> > I have a list of path names that have this form:
>
> > /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> > I need to find all the file names (basenames) in the list that are
> > duplicates, and for each one that is a dup, prepend dir4 to the
> > filename as long as the dir4/file pair is unique. If there are
> > multiple dir4/files in the list, then I also need to add a sequence
> > number based on the sorted value of dir5 (which is a date in ddMONyy
> > format).
>
> > For example, if my list contains:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> > Then I want to end up with:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> Hi Larry,
>
> I am making the assumption that you intend to collapse the directory
> tree and store each file in the same directory, otherwise I can't think
> of why you need to do this.

Hi Simon, thanks for the reply. It's not quite this - what I am doing
is creating a zip file with relative path names, and if there are
duplicate files the parts of the path that are not being carried over
need to get prepended to the file names to make them unique.
>
> If this is the case, then I would...
>
> 1. import all the files into an array
> 2. parse path to extract forth level directory name and base name.
> 3. reiterate through the array
>     3.1 check if base filename exists in recipient directory
>     3.2 if not, copy to recipient directory
>     3.3 if present, append the directory path then save
>     3.4 create log of success or failure
>
> Personally, I would not have some files with abcd_file1 and others as
> file2 because if it is important enough to store a file in a separate
> directory you should also note where file2 came from as well. When
> looking at your results at a later date you are going to have to open
> file2 (which I presume must record where it relates to) to figure out
> where it came from. If it is in the name it is easier to review.
>
> In short, consistency is the name of the game; if you are going to do it
> for some then do it for all; and finally it will be easier for others
> later to work out what you have done.

Yeah, I know, but this is for a client, and this is what they want.


#25643

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-07-19 19:02 +0000
Message-ID	<mailman.2313.1342724574.4697.python-list@python.org>
In reply to	#25641

> > I am making the assumption that you intend to collapse the directory
> > tree and store each file in the same directory, otherwise I can't think
> > of why you need to do this.
> 
> Hi Simon, thanks for the reply. It's not quite this - what I am doing
> is creating a zip file with relative path names, and if there are
> duplicate files the parts of the path that are not be carried over
> need to get prepended to the file names to make then unique,

Depending on the file system of the client, you can hit file name
length limits. I would think it would be better to just create
the full structure in the zip.

Just something to keep in mind, especially if you see funky behavior.
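
As a rough sketch of what keeping the structure inside the archive might
look like: the example below writes to an in-memory zip so it can be run
as-is. The payload is made up, and starting the member name at dir4 is
an assumption about which part of the path would be kept.

```python
import io
import zipfile

# Hypothetical path and payload; a real archive would read the file from disk.
path = "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3"

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Assumption: everything from dir4 downwards is kept as the member name.
    member = "/".join(path.split("/")[5:])
    archive.writestr(member, b"example contents")

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
```

Because the directories stay in the member name, two files called file3
under different dir4/dir5 directories would not collide.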

Ramit



#25644

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 12:06 -0700
Message-ID	<3a201907-c6dd-4c1f-b921-6f508d0af6e8@r3g2000yqh.googlegroups.com>
In reply to	#25643

On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > > I am making the assumption that you intend to collapse the directory
> > > tree and store each file in the same directory, otherwise I can't think
> > > of why you need to do this.
>
> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > is creating a zip file with relative path names, and if there are
> > duplicate files the parts of the path that are not be carried over
> > need to get prepended to the file names to make then unique,
>
> Depending on the file system of the client, you can hit file name
> length limits. I would think it would be better to just create
> the full structure in the zip.
>
> Just something to keep in mind, especially if you see funky behavior.

Thanks, but it's not what the client wants.


#25653

From	MRAB <python@mrabarnett.plus.com>
Date	2012-07-19 22:32 +0100
Message-ID	<mailman.2319.1342733568.4697.python-list@python.org>
In reply to	#25644

On 19/07/2012 20:06, Larry.Martell@gmail.com wrote:
> On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
>> > > I am making the assumption that you intend to collapse the directory
>> > > tree and store each file in the same directory, otherwise I can't think
>> > > of why you need to do this.
>>
>> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
>> > is creating a zip file with relative path names, and if there are
>> > duplicate files the parts of the path that are not be carried over
>> > need to get prepended to the file names to make then unique,
>>
>> Depending on the file system of the client, you can hit file name
>> length limits. I would think it would be better to just create
>> the full structure in the zip.
>>
>> Just something to keep in mind, especially if you see funky behavior.
>
> Thanks, but it's not what the client wants.
>
Here's another solution, not using itertools:

from collections import defaultdict
from os.path import basename, dirname
from time import strftime, strptime

# Starting with the original paths

paths = [
     "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
     "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
     "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
     "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
     "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

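def make_dir5_key(path):
     # Parse the ddMONyy directory name and return a yymmdd string; the
     # numeric month (%m) sorts chronologically, unlike the name (%b).
     date = strptime(path.split("/")[6], "%d%b%y")
     return strftime("%y%m%d", date)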

# Collect the paths into a dict keyed by the basename

files = defaultdict(list)
for path in paths:
     files[basename(path)].append(path)

# Process a list of paths if there's more than one entry

renaming = []

for name, entries in files.items():
     if len(entries) > 1:
         # Collect the paths in each subgroup into a dict keyed by dir4

         subgroup = defaultdict(list)
         for path in entries:
             subgroup[path.split("/")[5]].append(path)

         for dir4, subentries in subgroup.items():
             # Sort the subentries by dir5 (date)
             subentries.sort(key=make_dir5_key)

             if len(subentries) > 1:
                 for index, path in enumerate(subentries):
                     new_name = "{}_{:02}_{}".format(dir4, index, name)
                     renaming.append(
                         (path, "{}/{}".format(dirname(path), new_name)))
             else:
                 path = subentries[0]
                 new_name = "{}_{}".format(dir4, name)
                 renaming.append(
                     (path, "{}/{}".format(dirname(path), new_name)))
     # A basename that occurs only once needs no renaming.

for old_path, new_path in renaming:
     print("Rename {!r} to {!r}".format(old_path, new_path))
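
The loop above only reports the renames. One way to get back a list in
the original order, as asked for in the first post, might be a plain
dictionary lookup. The sketch below is self-contained, so paths and
renaming here are small hypothetical stand-ins for the variables of the
same name above; new_names and result are invented for the example.

```python
# Hypothetical inputs shaped like `paths` and `renaming` in the code above.
paths = [
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
]
renaming = [
    (paths[2], "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1"),
    (paths[0], "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1"),
]

# Look up each original path; paths without an entry keep their name.
new_names = dict(renaming)
result = [new_names.get(path, path) for path in paths]
```

This assumes every full path in the list is distinct; identical paths
would share a single dictionary entry.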


#25665

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 18:01 -0700
Message-ID	<b63eea74-291b-4f9a-80dc-15130ebe5ce5@x21g2000vbc.googlegroups.com>
In reply to	#25653

On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 19/07/2012 20:06, Larry.Mart...@gmail.com wrote:
>
> > On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> >> > > I am making the assumption that you intend to collapse the directory
> >> > > tree and store each file in the same directory, otherwise I can't think
> >> > > of why you need to do this.
>
> >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> >> > is creating a zip file with relative path names, and if there are
> >> > duplicate files the parts of the path that are not be carried over
> >> > need to get prepended to the file names to make then unique,
>
> >> Depending on the file system of the client, you can hit file name
> >> length limits. I would think it would be better to just create
> >> the full structure in the zip.
>
> >> Just something to keep in mind, especially if you see funky behavior.
>
> > Thanks, but it's not what the client wants.
>
> Here's another solution, not using itertools:
>
> from collections import defaultdict
> from os.path import basename, dirname
> from time import strftime, strptime
>
> # Starting with the original paths
>
> paths = [
>      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
>      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> ]
>
> def make_dir5_key(path):
>      date = strptime(path.split("/")[6], "%d%b%y")
>      return strftime("%y%b%d", date)
>
> # Collect the paths into a dict keyed by the basename
>
> files = defaultdict(list)
> for path in paths:
>      files[basename(path)].append(path)
>
> # Process a list of paths if there's more than one entry
>
> renaming = []
>
> for name, entries in files.items():
>      if len(entries) > 1:
>          # Collect the paths in each subgroup into a dict keyed by dir4
>
>          subgroup = defaultdict(list)
>          for path in entries:
>              subgroup[path.split("/")[5]].append(path)
>
>          for dir4, subentries in subgroup.items():
>              # Sort the subentries by dir5 (date)
>              subentries.sort(key=make_dir5_key)
>
>              if len(subentries) > 1:
>                  for index, path in enumerate(subentries):
>                      renaming.append((path,
>                          "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
>              else:
>                  path = subentries[0]
>                  renaming.append((path,
>                      "{}/{}_{}".format(dirname(path), dir4, name)))
>      else:
>          path = entries[0]
>
> for old_path, new_path in renaming:
>      print("Rename {!r} to {!r}".format(old_path, new_path))

Thanks a million MRAB. I really like this solution. It's very
understandable and it works! I had never seen .format before. I had to
add the index of the positional args to them to make it work.
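
A note on the positional indexes: auto-numbered {} fields are only accepted from Python 2.7 and 3.1 onward, so on an older interpreter each field needs an explicit index. The Python version in use is not stated in the thread, so that is an assumption. A minimal sketch of the indexed form, with sample values taken from the example paths:

```python
from os.path import dirname

# Sample values borrowed from the example paths in the quoted code
path = "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1"
dir4, index, name = "abcd", 0, "file1"

# Explicitly indexed fields; the format spec (:02) follows the index
indexed = "{0}/{1}_{2:02}_{3}".format(dirname(path), dir4, index, name)
single = "{0}/{1}_{2}".format(dirname(path), dir4, name)

print(indexed)
print(single)
```

On versions that accept both styles, the indexed and auto-numbered forms produce the same string.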

#25669

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-07-19 20:07 -0700
Message-ID	<68fe8c14-bb90-4505-8732-801154f508ac@tu6g2000pbc.googlegroups.com>
In reply to	#25665

On Jul 19, 7:01 pm, "Larry.Mart...@gmail.com"
<larry.mart...@gmail.com> wrote:
> On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
>
> > On 19/07/2012 20:06, Larry.Mart...@gmail.com wrote:
>
> > > On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > >> > > I am making the assumption that you intend to collapse the directory
> > >> > > tree and store each file in the same directory, otherwise I can't think
> > >> > > of why you need to do this.
>
> > >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > >> > is creating a zip file with relative path names, and if there are
> > >> > duplicate files the parts of the path that are not being carried over
> > >> > need to get prepended to the file names to make them unique,
>
> > >> Depending on the file system of the client, you can hit file name
> > >> length limits. I would think it would be better to just create
> > >> the full structure in the zip.
>
> > >> Just something to keep in mind, especially if you see funky behavior.
>
> > > Thanks, but it's not what the client wants.
>
> > Here's another solution, not using itertools:
>
> > from collections import defaultdict
> > from os.path import basename, dirname
> > from time import strftime, strptime
>
> > # Starting with the original paths
>
> > paths = [
> >      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
> >      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> > ]
>
> > def make_dir5_key(path):
> >      date = strptime(path.split("/")[6], "%d%b%y")
> >      return strftime("%y%b%d", date)
>
> > # Collect the paths into a dict keyed by the basename
>
> > files = defaultdict(list)
> > for path in paths:
> >      files[basename(path)].append(path)
>
> > # Process a list of paths if there's more than one entry
>
> > renaming = []
>
> > for name, entries in files.items():
> >      if len(entries) > 1:
> >          # Collect the paths in each subgroup into a dict keyed by dir4
>
> >          subgroup = defaultdict(list)
> >          for path in entries:
> >              subgroup[path.split("/")[5]].append(path)
>
> >          for dir4, subentries in subgroup.items():
> >              # Sort the subentries by dir5 (date)
> >              subentries.sort(key=make_dir5_key)
>
> >              if len(subentries) > 1:
> >                  for index, path in enumerate(subentries):
> >                      renaming.append((path,
> >                          "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
> >              else:
> >                  path = subentries[0]
> >                  renaming.append((path,
> >                      "{}/{}_{}".format(dirname(path), dir4, name)))
> >      else:
> >          path = entries[0]
>
> > for old_path, new_path in renaming:
> >      print("Rename {!r} to {!r}".format(old_path, new_path))
>
> Thanks a million MRAB. I really like this solution. It's very
> understandable and it works! I had never seen .format before. I had to
> add the index of the positional args to them to make it work.

Also, in make_dir5_key the format specifier for strftime should be
%y%m%d so they sort properly.
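
The reason is that %b yields the abbreviated month name, which sorts alphabetically rather than chronologically. The corrected key, as a sketch: the 09Feb12 path is a hypothetical addition to make the difference visible, and parsing %b assumes an English locale.

```python
from time import strftime, strptime


def make_dir5_key(path):
    # dir5 is the seventh "/"-separated component, e.g. "08Jan12"
    date = strptime(path.split("/")[6], "%d%b%y")
    # Numeric month, so the string key sorts in chronological order
    return strftime("%y%m%d", date)


paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Feb12/dir6/file3",  # hypothetical path
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

ordered = sorted(paths, key=make_dir5_key)
for path in ordered:
    print(make_dir5_key(path), path)
```

With the original "%y%b%d" key the February path would sort first, because "12Feb09" compares lower than "12Jan07".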

#25695

From	MRAB <python@mrabarnett.plus.com>
Date	2012-07-20 16:45 +0100
Message-ID	<mailman.2347.1342799113.4697.python-list@python.org>
In reply to	#25669

On 20/07/2012 04:07, Larry.Martell@gmail.com wrote:
[snip]
> Also, in make_dir5_key the format specifier for strftime should be
> %y%m%d so they sort properly.
>
Correct. I realised that only some time later, after I'd turned off my
computer for the night. :-(
