Groups > comp.lang.python > #25588 > unrolled thread
| Started by | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| First post | 2012-07-18 15:20 -0700 |
| Last post | 2012-07-20 16:45 +0100 |
| Articles | 19 — 7 participants |
Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-18 15:20 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-18 15:49 -0700
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 12:00 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-19 12:43 -0700
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 18:01 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Peter Otten <__peter__@web.de> - 2012-07-20 09:35 +0200
Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-20 00:51 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Paul Rudin <paul.nospam@rudin.co.uk> - 2012-07-20 09:37 +0100
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 11:52 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Paul Rubin <no.email@nospam.invalid> - 2012-07-19 12:56 -0700
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 17:58 -0700
Re: Finding duplicate file names and modifying them based on elements of the path Simon Cropper <simoncropper@fossworkflowguides.com> - 2012-07-19 10:36 +1000
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 11:54 -0700
RE: Finding duplicate file names and modifying them based on elements of the path "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-07-19 19:02 +0000
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 12:06 -0700
Re: Finding duplicate file names and modifying them based on elements of the path MRAB <python@mrabarnett.plus.com> - 2012-07-19 22:32 +0100
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 18:01 -0700
Re: Finding duplicate file names and modifying them based on elements of the path "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-07-19 20:07 -0700
Re: Finding duplicate file names and modifying them based on elements of the path MRAB <python@mrabarnett.plus.com> - 2012-07-20 16:45 +0100
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-18 15:20 -0700 |
| Subject | Finding duplicate file names and modifying them based on elements of the path |
| Message-ID | <b2f1993c-8872-44ed-9e69-0895e4059532@mi5g2000pbc.googlegroups.com> |
I have an interesting problem I'm trying to solve. I have a solution
almost working, but it's super ugly, and know there has to be a
better, cleaner way to do it.

I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are
duplicates, and for each one that is a dup, prepend dir4 to the
filename as long as the dir4/file pair is unique. If there are
multiple dir4/files in the list, then I also need to add a sequence
number based on the sorted value of dir5 (which is a date in ddMONyy
format).

For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3

My solution involves multiple maps and multiple iterations through the
data. How would you folks do this?
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-18 15:49 -0700 |
| Message-ID | <7xipdkwuqd.fsf@ruckus.brouhaha.com> |
| In reply to | #25588 |
"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes: > I have an interesting problem I'm trying to solve. I have a solution > almost working, but it's super ugly, and know there has to be a > better, cleaner way to do it. ... > > My solution involves multiple maps and multiple iterations through the > data. How would you folks do this? You could post your code and ask for suggestions how to improve it. There are a lot of not-so-natural constraints in that problem, so it stands to reason that the code will be a bit messy. The whole specification seems like an antipattern though. You should just give a sensible encoding for the filename regardless of whether other fields are duplicated or not. You also don't seem to address the case where basename, dir4, and dir5 are all duplicated. The approach I'd take for the spec as you wrote it is: 1. Sort the list on the (basename, dir4, dir5) triple, saving original location (numeric index) of each item 2. Use itertools.groupby to group together duplicate basenames. 3. Within the groups, use groupby again to gather duplicate dir4's, 4. Within -those- groups, group by dir5 and assign sequence numbers in groups where there's more than one file 5. Unsort to get the rewritten items back into the original order. Actual code is left as an exercise.
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 12:00 -0700 |
| Message-ID | <14831ee0-fd74-4906-852c-764ba2d8b1d5@h20g2000yqe.googlegroups.com> |
| In reply to | #25590 |
On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy. The whole
> specification seems like an antipattern though. You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not. You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
> location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
> groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having 1 issue - how do you
know how many items are in each group? Without knowing that I have to
either make 2 passes through the data, or else work on the previous
item (when I'm in an iteration after the first then I know I have
dups). But that very quickly gets crazy with trying to keep the
previous values.
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-19 12:43 -0700 |
| Message-ID | <7xipdjilko.fsf@ruckus.brouhaha.com> |
| In reply to | #25642 |
"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes:
> Thanks for the reply Paul. I had not heard of itertools. It sounds
> like just what I need for this. But I am having 1 issue - how do you
> know how many items are in each group?
Simplest is:
for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
    gs = list(group) # convert iterator to a list
    n = len(gs) # this is the number of elements
There is some theoretical inelegance in that it requires each group to
fit in memory, but you weren't really going to have billions of files
with the same basename.
If you're not used to iterators and itertools, note there are some
subtleties to using groupby to iterate over files, because an iterator
actually has state. It bumps a pointer and maybe consumes some input
every time you advance it. In a situation like the above, you've got
some nested iterators (the groupby iterator generating groups, and the
individual group iterators that come out of the groupby) that wrap the
same file handle, so bad confusion can result if you advance both
iterators without being careful (one can consume file input that you
thought would go to another).
This isn't as bad as it sounds once you get used to it, but it can be
a source of frustration at first.
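A small sketch of the pitfall described above, using an in-memory list rather than a file handle. The data and the helper `first_letter` are made up for illustration. Group iterators that are stored without being consumed share state with the groupby object, so once the outer iterator has moved on, an earlier group yields nothing.

```python
from itertools import groupby

data = ["a1", "a2", "b1", "b2"]  # hypothetical, already sorted by first letter


def first_letter(s):
    return s[0]


# Store the group iterators and read them only afterwards: the outer
# iterator has already advanced, so the stored group for "a" is empty.
stored = [(key, group) for key, group in groupby(data, first_letter)]
late_sizes = [len(list(group)) for key, group in stored]

# Consume each group while it is still the current one: sizes are correct.
sizes = [len(list(group)) for key, group in groupby(data, first_letter)]
```

Here `late_sizes[0]` is 0 while `sizes` holds the real group sizes, which is why converting each group to a list inside the loop is the usual habit.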
BTW, if you just want to count the elements of an iterator (while
consuming it),
n = sum(1 for x in xs)
counts the elements of xs without having to expand it into an in-memory
list.
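To make the "while consuming it" caveat concrete, here is a short sketch. The helper name `count_items` and the sample generator are assumptions for illustration only.

```python
def count_items(iterable):
    """Count elements without building a list; exhausts an iterator."""
    return sum(1 for _ in iterable)


evens = (n for n in range(10) if n % 2 == 0)  # a one-shot generator
n = count_items(evens)   # walks through every element
leftover = list(evens)   # nothing is left after counting
```

After the count, `leftover` is empty, so this approach only fits cases where the elements themselves are no longer needed.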
Itertools really makes Python feel a lot more expressive and clean,
despite little kinks like the above.
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 18:01 -0700 |
| Message-ID | <2862aea5-9d5c-4979-8ca2-0bb01f9db32c@m3g2000vbl.googlegroups.com> |
| In reply to | #25647 |
On Jul 19, 1:43 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
> for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
>     gs = list(group) # convert iterator to a list
>     n = len(gs) # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state. It bumps a pointer and maybe consumes some input
> every time you advance it. In a situation like the above, you've got
> some nexted iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that if you do a list(group) you have consumed the list. This
screwed me up for a while, and seems very counter-intuitive.

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
> n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-07-20 09:35 +0200 |
| Message-ID | <mailman.2336.1342769705.4697.python-list@python.org> |
| In reply to | #25666 |
Larry.Martell@gmail.com wrote:
> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Many itertools functions work that way. It allows you to iterate over the
items even if there is more data than fits into memory. If you need to keep
all items and are sure that your computer can cope with them at once you can
always throw in a

group = list(group)
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-20 00:51 -0700 |
| Message-ID | <7xk3xyswee.fsf@ruckus.brouhaha.com> |
| In reply to | #25666 |
"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes: > It seems that if you do a list(group) you have consumed the list. This > screwed me up for a while, and seems very counter-intuitive. Yes, that is correct, you have to carefully watch where the stuff in the iterators is getting consumed, including when there are nested iterators. That's what I was mentioning earlier--it got me confused at first, but I use that style all the time now and it is pretty natural.
| From | Paul Rudin <paul.nospam@rudin.co.uk> |
|---|---|
| Date | 2012-07-20 09:37 +0100 |
| Message-ID | <87r4s696ax.fsf@no-fixed-abode.cable.virginmedia.net> |
| In reply to | #25666 |
"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes: > It seems that if you do a list(group) you have consumed the list. This > screwed me up for a while, and seems very counter-intuitive. You've consumed the *group* which is an iterator, in order to construct a list from its elements. Sorry if this is excessively nit-picking, but it generally helps to keep these things very clear in your own mind.
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 11:52 -0700 |
| Message-ID | <8d08181d-316d-4fdc-8f03-1d9b90929d89@e7g2000yqh.googlegroups.com> |
| In reply to | #25590 |
On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy. The whole
> specification seems like an antipattern though. You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not. You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
> location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
> groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.
Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.
reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):
for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell
    # if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but
        # how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda x: x[2]):
            pass
You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?
Thanks!
-larry
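One way to answer the "how can I tell if it is unique?" question in the loop above is to turn each group into a list before descending into it. The sketch below is only an illustration: the rows in `reports_list` are hypothetical `(basename, tool, file_date)` tuples assumed to be sorted already, and the `decisions` list with its action labels is invented to make the outcome visible.

```python
from itertools import groupby

# Hypothetical rows of (basename, tool, file_date), assumed pre-sorted.
reports_list = [
    ("file1", "abcd", "08Jan12"),
    ("file1", "xyz", "08Jan12"),
    ("file2", "abcd", "08Jan12"),
    ("file3", "qwer", "07Jan12"),
    ("file3", "qwer", "09Jan12"),
]

decisions = []
for file, file_group in groupby(reports_list, key=lambda x: x[0]):
    file_rows = list(file_group)  # materialize so len() works
    if len(file_rows) == 1:
        decisions.append((file_rows[0], "keep"))
        continue
    # Group the list, not the iterator, so nothing is consumed twice.
    for tool, tool_group in groupby(file_rows, key=lambda x: x[1]):
        tool_rows = list(tool_group)
        if len(tool_rows) == 1:
            action = "prefix tool"
        else:
            action = "prefix tool and sequence"
        for row in tool_rows:
            decisions.append((row, action))
```

This keeps a single pass over the sorted data; the only cost is holding one basename group in memory at a time.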
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-19 12:56 -0700 |
| Message-ID | <7xipdjbk3z.fsf@ruckus.brouhaha.com> |
| In reply to | #25646 |
"Larry.Martell@gmail.com" <larry.martell@gmail.com> writes: > You can't do a len on the iterator that is returned from groupby, and > I've tried to do something with imap or defaultdict, but I'm not > getting anywhere. I guess I can just make 2 passes through the data, > the first time getting counts. Or am I missing something about how > groupby works? I posted another reply to your other message, which reached me earlier. If you're still stuck, post again, though I probably won't be able to reply til tomorrow or the next day.
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 17:58 -0700 |
| Message-ID | <51ef259b-f7fd-49ae-975e-cb09d65fdd5d@cu1g2000vbb.googlegroups.com> |
| In reply to | #25648 |
On Jul 19, 1:56 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > You can't do a len on the iterator that is returned from groupby, and
> > I've tried to do something with imap or defaultdict, but I'm not
> > getting anywhere. I guess I can just make 2 passes through the data,
> > the first time getting counts. Or am I missing something about how
> > groupby works?
>
> I posted another reply to your other message, which reached me earlier.
> If you're still stuck, post again, though I probably won't be able to
> reply til tomorrow or the next day.

I really appreciate the offer, but I'm going to go with MRAB's
solution. It works, and I understand it ;-)
| From | Simon Cropper <simoncropper@fossworkflowguides.com> |
|---|---|
| Date | 2012-07-19 10:36 +1000 |
| Message-ID | <mailman.2281.1342658195.4697.python-list@python.org> |
| In reply to | #25588 |
On 19/07/12 08:20, Larry.Martell@gmail.com wrote:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it.
>
> I have a list of path names that have this form:
>
> /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> I need to find all the file names (basenames) in the list that are
> duplicates, and for each one that is a dup, prepend dir4 to the
> filename as long as the dir4/file pair is unique. If there are
> multiple dir4/files in the list, then I also need to add a sequence
> number based on the sorted value of dir5 (which is a date in ddMONyy
> format).
>
> For example, if my list contains:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> Then I want to end up with:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?
>
Hi Larry,
I am making the assumption that you intend to collapse the directory
tree and store each file in the same directory, otherwise I can't think
of why you need to do this.
If this is the case, then I would...
1. import all the files into an array
2. parse path to extract fourth-level directory name and base name.
3. reiterate through the array
3.1 check if base filename exists in recipient directory
3.2 if not, copy to recipient directory
3.3 if present, append the directory path then save
3.4 create log of success or failure
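A rough sketch of steps 1 to 3.3, under stated assumptions: instead of copying files it only plans the target names, it tracks names already used with a set rather than checking a real recipient directory, and the function name `plan_flat_names` is invented for illustration. Note that this first-come naming differs from what the original post asks for, since the first file with a given basename keeps its plain name.

```python
def plan_flat_names(paths):
    """Map each source path to a file name in one flat target directory."""
    taken = set()  # names already used in the (imaginary) recipient directory
    plan = {}
    for path in paths:
        fields = path.split("/")
        name = fields[-1]
        if name in taken:
            # Step 3.3: the name is present, so add the fourth-level directory.
            name = "{}_{}".format(fields[5], name)
        taken.add(name)
        plan[path] = name
    return plan


paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]
plan = plan_flat_names(paths)
```

A second collision on the prefixed name (same dir4 and basename under different dates) is not resolved here; that is the case the sequence numbers in the original post are meant to cover.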
Personally, I would not have some files with abcd_file1 and others as
file2 because if it is important enough to store a file in a separate
directory you should also note where file2 came from as well. When
looking at your results at a later date you are going to have to open
file2 (which I presume must record where it relates to) to figure out
where it came from. If it is in the name it is easier to review.
In short, consistency is the name of the game; if you are going to do it
for some then do it for all; and finally it will be easier for others
later to work out what you have done.
--
Cheers Simon
Simon Cropper - Open Content Creator
Free and Open Source Software Workflow Guides
------------------------------------------------------------
Introduction http://www.fossworkflowguides.com
GIS Packages http://www.fossworkflowguides.com/gis
bash / Python http://www.fossworkflowguides.com/scripting
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 11:54 -0700 |
| Message-ID | <aad5b32d-b0b4-47c1-b134-4827ec0472b2@t20g2000yqn.googlegroups.com> |
| In reply to | #25594 |
On Jul 18, 6:36 pm, Simon Cropper <simoncrop...@fossworkflowguides.com> wrote:
> On 19/07/12 08:20, Larry.Mart...@gmail.com wrote:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it.
>
> > I have a list of path names that have this form:
>
> > /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> > I need to find all the file names (basenames) in the list that are
> > duplicates, and for each one that is a dup, prepend dir4 to the
> > filename as long as the dir4/file pair is unique. If there are
> > multiple dir4/files in the list, then I also need to add a sequence
> > number based on the sorted value of dir5 (which is a date in ddMONyy
> > format).
>
> > For example, if my list contains:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> > Then I want to end up with:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> Hi Larry,
>
> I am making the assumption that you intend to collapse the directory
> tree and store each file in the same directory, otherwise I can't think
> of why you need to do this.

Hi Simon, thanks for the reply. It's not quite this - what I am doing
is creating a zip file with relative path names, and if there are
duplicate files the parts of the path that are not being carried over
need to get prepended to the file names to make them unique.

> If this is the case, then I would...
>
> 1. import all the files into an array
> 2. parse path to extract forth level directory name and base name.
> 3. reiterate through the array
> 3.1 check if base filename exists in recipient directory
> 3.2 if not, copy to recipient directory
> 3.3 if present, append the directory path then save
> 3.4 create log of success or failure
>
> Personally, I would not have some files with abcd_file1 and others as
> file2 because if it is important enough to store a file in a separate
> directory you should also note where file2 came from as well. When
> looking at your results at a later date you are going to have to open
> file2 (which I presume must record where it relates to) to figure out
> where it came from. If it is in the name it is easier to review.
>
> In short, consistency is the name of the game; if you are going to do it
> for some then do it for all; and finally it will be easier for others
> later to work out what you have done.

Yeah, I know, but this is for a client, and this is what they want.
| From | "Prasad, Ramit" <ramit.prasad@jpmorgan.com> |
|---|---|
| Date | 2012-07-19 19:02 +0000 |
| Message-ID | <mailman.2313.1342724574.4697.python-list@python.org> |
| In reply to | #25641 |
> > I am making the assumption that you intend to collapse the directory
> > tree and store each file in the same directory, otherwise I can't think
> > of why you need to do this.
>
> Hi Simon, thanks for the reply. It's not quite this - what I am doing
> is creating a zip file with relative path names, and if there are
> duplicate files the parts of the path that are not be carried over
> need to get prepended to the file names to make then unique,

Depending on the file system of the client, you can hit file name
length limits. I would think it would be better to just create
the full structure in the zip.

Just something to keep in mind, especially if you see funky behavior.

Ramit
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 12:06 -0700 |
| Message-ID | <3a201907-c6dd-4c1f-b921-6f508d0af6e8@r3g2000yqh.googlegroups.com> |
| In reply to | #25643 |
On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > > I am making the assumption that you intend to collapse the directory
> > > tree and store each file in the same directory, otherwise I can't think
> > > of why you need to do this.
>
> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > is creating a zip file with relative path names, and if there are
> > duplicate files the parts of the path that are not be carried over
> > need to get prepended to the file names to make then unique,
>
> Depending on the file system of the client, you can hit file name
> length limits. I would think it would be better to just create
> the full structure in the zip.
>
> Just something to keep in mind, especially if you see funky behavior.

Thanks, but it's not what the client wants.
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-07-19 22:32 +0100 |
| Message-ID | <mailman.2319.1342733568.4697.python-list@python.org> |
| In reply to | #25644 |
On 19/07/2012 20:06, Larry.Martell@gmail.com wrote:
> On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
>> > > I am making the assumption that you intend to collapse the directory
>> > > tree and store each file in the same directory, otherwise I can't think
>> > > of why you need to do this.
>>
>> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
>> > is creating a zip file with relative path names, and if there are
>> > duplicate files the parts of the path that are not be carried over
>> > need to get prepended to the file names to make then unique,
>>
>> Depending on the file system of the client, you can hit file name
>> length limits. I would think it would be better to just create
>> the full structure in the zip.
>>
>> Just something to keep in mind, especially if you see funky behavior.
>
> Thanks, but it's not what the client wants.
>
Here's another solution, not using itertools:
from collections import defaultdict
from os.path import basename, dirname
from time import strftime, strptime

# Starting with the original paths

paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

def make_dir5_key(path):
    # Parse dir5 (ddMONyy) and return a key that sorts chronologically.
    # The month must be numeric (%m); a %b key would sort alphabetically.
    date = strptime(path.split("/")[6], "%d%b%y")
    return strftime("%y%m%d", date)

# Collect the paths into a dict keyed by the basename

files = defaultdict(list)
for path in paths:
    files[basename(path)].append(path)

# Process a list of paths if there's more than one entry

renaming = []

for name, entries in files.items():
    if len(entries) > 1:
        # Collect the paths in each subgroup into a dict keyed by dir4

        subgroup = defaultdict(list)
        for path in entries:
            subgroup[path.split("/")[5]].append(path)

        for dir4, subentries in subgroup.items():
            # Sort the subentries by dir5 (date)
            subentries.sort(key=make_dir5_key)

            if len(subentries) > 1:
                for index, path in enumerate(subentries):
                    renaming.append((path,
                        "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
            else:
                path = subentries[0]
                renaming.append((path, "{}/{}_{}".format(dirname(path),
                    dir4, name)))
    else:
        # Unique basename: nothing to rename
        path = entries[0]

for old_path, new_path in renaming:
    print("Rename {!r} to {!r}".format(old_path, new_path))
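One thing to watch when sorting ddMONyy directory names: a string key that contains the month abbreviation compares month names alphabetically, so Apr would sort ahead of Feb and Jan. A minimal sketch of a chronological key, assuming the path layout from the original post and an English locale for `%b`; the helper name `dir5_date` and the sample paths are made up for illustration.

```python
from datetime import datetime


def dir5_date(path):
    """Parse the dir5 component (ddMONyy) into a datetime for sorting."""
    return datetime.strptime(path.split("/")[6], "%d%b%y")


sample = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/qwer/03Apr12/dir6/file3",
    "/dir0/dir1/dir2/dir3/qwer/21Feb12/dir6/file3",
]
ordered = sorted(sample, key=dir5_date)
ordered_dates = [p.split("/")[6] for p in ordered]
```

Using datetime objects as keys avoids any dependence on how the date is re-formatted, at the cost of parsing each path once per sort.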
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 18:01 -0700 |
| Message-ID | <b63eea74-291b-4f9a-80dc-15130ebe5ce5@x21g2000vbc.googlegroups.com> |
| In reply to | #25653 |
On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 19/07/2012 20:06, Larry.Mart...@gmail.com wrote:
>
> > On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> >> > > I am making the assumption that you intend to collapse the directory
> >> > > tree and store each file in the same directory, otherwise I can't think
> >> > > of why you need to do this.
>
> >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> >> > is creating a zip file with relative path names, and if there are
> >> > duplicate files the parts of the path that are not be carried over
> >> > need to get prepended to the file names to make then unique,
>
> >> Depending on the file system of the client, you can hit file name
> >> length limits. I would think it would be better to just create
> >> the full structure in the zip.
>
> >> Just something to keep in mind, especially if you see funky behavior.
>
> > Thanks, but it's not what the client wants.
>
> Here's another solution, not using itertools:
>
> from collections import defaultdict
> from os.path import basename, dirname
> from time import strftime, strptime
>
> # Starting with the original paths
>
> paths = [
> "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
> "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
> "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
> "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
> "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> ]
>
> def make_dir5_key(path):
>     date = strptime(path.split("/")[6], "%d%b%y")
>     return strftime("%y%b%d", date)
>
> # Collect the paths into a dict keyed by the basename
>
> files = defaultdict(list)
> for path in paths:
>     files[basename(path)].append(path)
>
> # Process a list of paths if there's more than one entry
>
> renaming = []
>
> for name, entries in files.items():
>     if len(entries) > 1:
>         # Collect the paths in each subgroup into a dict keyed by dir4
>
>         subgroup = defaultdict(list)
>         for path in entries:
>             subgroup[path.split("/")[5]].append(path)
>
>         for dir4, subentries in subgroup.items():
>             # Sort the subentries by dir5 (date)
>             subentries.sort(key=make_dir5_key)
>
>             if len(subentries) > 1:
>                 for index, path in enumerate(subentries):
>                     renaming.append((path,
>                         "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
>             else:
>                 path = subentries[0]
>                 renaming.append((path, "{}/{}_{}".format(dirname(path),
>                     dir4, name)))
>     else:
>         path = entries[0]
>
> for old_path, new_path in renaming:
>     print("Rename {!r} to {!r}".format(old_path, new_path))
Thanks a million MRAB. I really like this solution. It's very
understandable and it works! I had never seen .format before. I had to
add the index of the positional args to them to make it work.
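
For anyone hitting the same thing: auto-numbered "{}" fields only work from Python 2.7 and 3.1 onward, so on 2.6 each field needs an explicit index. A minimal sketch of what that change presumably looks like follows. That the interpreter in use was 2.6 is an assumption, and the sample values are taken from one of the paths in the quoted post.

```python
from os.path import basename, dirname

path = "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3"
dir4 = path.split("/")[5]   # "qwer"
index = 0                   # position within the dir4 subgroup (sample value)
name = basename(path)       # "file3"

# Explicit field indices ({0}, {1}, ...) work on Python 2.6 and on later versions.
new_path = "{0}/{1}_{2:02}_{3}".format(dirname(path), dir4, index, name)
print("Rename {0!r} to {1!r}".format(path, new_path))
```

The format specs (":02") and conversions ("!r") are unchanged; only the index in front of them is added.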
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-07-19 20:07 -0700 |
| Message-ID | <68fe8c14-bb90-4505-8732-801154f508ac@tu6g2000pbc.googlegroups.com> |
| In reply to | #25665 |
On Jul 19, 7:01 pm, "Larry.Mart...@gmail.com"
<larry.mart...@gmail.com> wrote:
> Thanks a million MRAB. I really like this solution. It's very
> understandable and it works! I had never seen .format before. I had to
> add the index of the positional args to them to make it work.
Also, in make_dir5_key the format specifier for strftime should be
%y%m%d so they sort properly.
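
A short sketch of why this matters: with %b the key contains the month name, so keys compare alphabetically by month, while %m gives a zero-padded month number that compares chronologically. The sample paths in the quoted code are all in January, so the problem does not show up there. The February path below is made up for illustration and is not from the original data.

```python
from time import strftime, strptime

def make_dir5_key(path):
    # dir5 is the date directory, e.g. "07Jan12" (day, month abbreviation, 2-digit year)
    date = strptime(path.split("/")[6], "%d%b%y")
    # Numeric year, month, day: string order matches date order (within one century)
    return strftime("%y%m%d", date)

paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Feb12/dir6/file3",  # hypothetical entry
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

for path in sorted(paths, key=make_dir5_key):
    print(make_dir5_key(path), path)
```

With the %b version the keys would be "12Feb09" and "12Jan07", which puts February ahead of January.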
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-07-20 16:45 +0100 |
| Message-ID | <mailman.2347.1342799113.4697.python-list@python.org> |
| In reply to | #25669 |
On 20/07/2012 04:07, Larry.Martell@gmail.com wrote:
[snip]
> Also, in make_dir5_key the format specifier for strftime should be
> %y%m%d so they sort properly.
>
Correct. I realised that only some time later, after I'd turned off my
computer for the night. :-(