Groups > comp.lang.python > #64544 > unrolled thread
| Started by | Larry Martell <larry.martell@gmail.com> |
|---|---|
| First post | 2014-01-22 17:58 -0700 |
| Last post | 2014-01-23 14:58 +0000 |
| Articles | 11 — 7 participants |
Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 17:58 -0700
Re: Case insensitive exists()? Roy Smith <roy@panix.com> - 2014-01-22 20:08 -0500
Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 18:18 -0700
Re: Case insensitive exists()? Roy Smith <roy@panix.com> - 2014-01-22 20:27 -0500
Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 21:24 -0700
Re: Case insensitive exists()? Chris Angelico <rosuav@gmail.com> - 2014-01-23 15:29 +1100
Re: Case insensitive exists()? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-01-23 15:43 +0000
Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-23 12:02 -0700
Re: Case insensitive exists()? Dan Sommers <dan@tombstonezero.net> - 2014-01-23 07:51 +0000
Re: Case insensitive exists()? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-23 12:06 +0000
Re: Case insensitive exists()? Grant Edwards <invalid@invalid.invalid> - 2014-01-23 14:58 +0000
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2014-01-22 17:58 -0700 |
| Subject | Case insensitive exists()? |
| Message-ID | <mailman.5853.1390438708.18130.python-list@python.org> |
I have the need to check for a file's existence against a string, but I need to do it case-insensitively. I cannot efficiently get the name of every file in the dir and compare each with my string using lower(), as I have 100's of strings to check for, each in a different dir, and each dir can have 100's of files in it. Does anyone know of an efficient way to do this? There's no switch for os.path that makes exists() check case-insensitively, is there?
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-01-22 20:08 -0500 |
| Message-ID | <roy-7A7FAE.20082622012014@news.panix.com> |
| In reply to | #64544 |
In article <mailman.5853.1390438708.18130.python-list@python.org>,
Larry Martell <larry.martell@gmail.com> wrote:

> I have the need to check for a file's existence against a string, but I
> need to do it case-insensitively. I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(),
> as I have 100's of strings to check for, each in a different dir, and
> each dir can have 100's of files in it.

I'm not quite sure what you're asking. Do you need to match the
filename, or find the string in the contents of the file? I'm going to
assume you're asking the former.

One way or another, you need to iterate over all the directories and get
all the filenames in each. The time to do that is going to totally
swamp any processing you do in terms of converting to lower case and
comparing to some set of strings.

I would put all my strings into a set, then use os.walk() to traverse the
directories and for each path os.walk() returns, do "path.lower() in
strings".
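
The set-plus-os.walk() idea could be sketched roughly as follows. This is only an illustration, assuming the strings are bare file names rather than full paths (os.walk() yields directory paths and file names separately). The function name find_matches, the sample file names and the temporary directory are all made up for the example.

```python
import os
import tempfile


def find_matches(root, wanted_names):
    """Yield paths under root whose file name matches a wanted name, ignoring case."""
    # Lowercase the targets once so each lookup is a single set membership test.
    wanted = {name.lower() for name in wanted_names}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower() in wanted:
                yield os.path.join(dirpath, filename)


# Demo on a throwaway tree (hypothetical names).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for relative in ("Abc.jpg", os.path.join("sub", "other.txt")):
    with open(os.path.join(root, relative), "w"):
        pass

matches = sorted(find_matches(root, ["ABC.JPG", "missing.png"]))
print(matches)
```

The directory traversal still dominates the cost; the set only keeps the per-file comparison cheap.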
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2014-01-22 18:18 -0700 |
| Message-ID | <mailman.5855.1390439920.18130.python-list@python.org> |
| In reply to | #64545 |
On Wed, Jan 22, 2014 at 6:08 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.5853.1390438708.18130.python-list@python.org>,
> Larry Martell <larry.martell@gmail.com> wrote:
>
>> I have the need to check for a file's existence against a string, but I
>> need to do it case-insensitively. I cannot efficiently get the name of
>> every file in the dir and compare each with my string using lower(),
>> as I have 100's of strings to check for, each in a different dir, and
>> each dir can have 100's of files in it.
>
> I'm not quite sure what you're asking. Do you need to match the
> filename, or find the string in the contents of the file? I'm going to
> assume you're asking the former.

Yes, match the file names. E.g. if my match string is "ABC" and
there's a file named "Abc" then it would be a match.

> One way or another, you need to iterate over all the directories and get
> all the filenames in each. The time to do that is going to totally
> swamp any processing you do in terms of converting to lower case and
> comparing to some set of strings.
>
> I would put all my strings into a set, then use os.walk() to traverse the
> directories and for each path os.walk() returns, do "path.lower() in
> strings".

The issue is that I run a database query and get back rows, each with
a file path (each in a different dir). And I have to check to see if
that file exists. Each is a separate search with no correlation to the
others. I have the full path, so I guess I'll have to do dir name on
it, then a listdir then compare each item with .lower with my string
.lower. It's just that the dirs have 100's and 100's of files so I'm
really worried about efficiency.
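
The dirname, listdir, compare-with-lower() plan described here might look something like the sketch below. It assumes only the final path component is matched case-insensitively and that the directory part is already correct. The name find_icase and the temporary directory are hypothetical, chosen just for the example.

```python
import os
import tempfile


def find_icase(path):
    """Return the on-disk path matching `path` with a case-insensitive basename, or None."""
    directory, wanted = os.path.split(path)
    try:
        names = os.listdir(directory or ".")
    except OSError:
        # Directory missing or unreadable: treat it as "not found".
        return None
    wanted = wanted.lower()
    for name in names:
        if name.lower() == wanted:
            return os.path.join(directory, name)
    return None


# Demo in a throwaway directory (hypothetical names).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "Abc"), "w"):
    pass

found = find_icase(os.path.join(workdir, "ABC"))
missing = find_icase(os.path.join(workdir, "xyz"))
print(found, missing)
```

Returning the real on-disk name, rather than just True or False, lets the caller open the file afterwards without a second search.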
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-01-22 20:27 -0500 |
| Message-ID | <roy-1771A1.20270322012014@news.panix.com> |
| In reply to | #64547 |
In article <mailman.5855.1390439920.18130.python-list@python.org>,
Larry Martell <larry.martell@gmail.com> wrote:

> The issue is that I run a database query and get back rows, each with
> a file path (each in a different dir). And I have to check to see if
> that file exists. Each is a separate search with no correlation to the
> others. I have the full path, so I guess I'll have to do dir name on
> it, then a listdir then compare each item with .lower with my string
> .lower. It's just that the dirs have 100's and 100's of files so I'm
> really worried about efficiency.

Oh, my, this is a much more complicated problem than you originally
described.

Is the whole path case-insensitive, or just the last component? In
other words, if the search string is "/foo/bar/my_file_name", do all of
these paths match?

/FOO/BAR/MY_FILE_NAME
/foo/bar/my_file_name
/FoO/bAr/My_FiLe_NaMe

Can you give some more background as to *why* you're doing this?
Usually, if a system considers filenames to be case-insensitive, that's
something that's handled by the operating system itself.
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2014-01-22 21:24 -0700 |
| Message-ID | <mailman.5867.1390451096.18130.python-list@python.org> |
| In reply to | #64548 |
On Wed, Jan 22, 2014 at 6:27 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.5855.1390439920.18130.python-list@python.org>,
> Larry Martell <larry.martell@gmail.com> wrote:
>
>> The issue is that I run a database query and get back rows, each with
>> a file path (each in a different dir). And I have to check to see if
>> that file exists. Each is a separate search with no correlation to the
>> others. I have the full path, so I guess I'll have to do dir name on
>> it, then a listdir then compare each item with .lower with my string
>> .lower. It's just that the dirs have 100's and 100's of files so I'm
>> really worried about efficiency.
>
> Oh, my, this is a much more complicated problem than you originally
> described.

I try not to bother folks with simple problems ;-)

> Is the whole path case-insensitive, or just the last component? In
> other words, if the search string is "/foo/bar/my_file_name", do all of
> these paths match?
>
> /FOO/BAR/MY_FILE_NAME
> /foo/bar/my_file_name
> /FoO/bAr/My_FiLe_NaMe

Just the file name (the basename).

> Can you give some more background as to *why* you're doing this?
> Usually, if a system considers filenames to be case-insensitive, that's
> something that's handled by the operating system itself.

I can't say why it's happening. This is a big complicated system with
lots of parts. There's some program that ftp's image files from an
electron microscope and stores them on the file system with crazy names
like:

2O_TOPO_1_2O_2UM_FOV_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

And something (perhaps the same program, perhaps a different one)
records this in a database. In some cases the name recorded in the db
has different case in some characters than how it was stored on the
file system, e.g.:

2O_TOPO_1_2O_2UM_Fov_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

These only differ in "FOV" vs. "Fov" but that is just one example.

I am writing something that is part of a django app, that based on
some web entry from the user, I run a query, get back a list of files
and have to go receive them and serve them up back to the browser. My
script is all done and seemed to be working, then today I was informed
it was not serving up all the images. Debugging revealed that it was
this case issue - I was matching with exists(). As I've said, coding a
solution is easy, but I fear it will be too slow. Speed is important
in web apps - users have high expectations. Guess I'll just have to
try it and see.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-23 15:29 +1100 |
| Message-ID | <mailman.5869.1390451367.18130.python-list@python.org> |
| In reply to | #64548 |
On Thu, Jan 23, 2014 at 3:24 PM, Larry Martell <larry.martell@gmail.com> wrote:
> I am writing something that is part of a django app, that based on
> some web entry from the user, I run a query, get back a list of files
> and have to go receive them and serve them up back to the browser. My
> script is all done and seemed to be working, then today I was informed
> it was not serving up all the images. Debugging revealed that it was
> this case issue - I was matching with exists(). As I've said, coding a
> solution is easy, but I fear it will be too slow. Speed is important
> in web apps - users have high expectations. Guess I'll just have to
> try it and see.

Would it be a problem to rename all the files? Then you could simply
lower() the input name and it'll be correct.

ChrisA
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2014-01-23 15:43 +0000 |
| Message-ID | <mailman.5897.1390491833.18130.python-list@python.org> |
| In reply to | #64548 |
On Wed, Jan 22, 2014 at 09:24:54PM -0700, Larry Martell wrote:
>
> I am writing something that is part of a django app, that based on
> some web entry from the user, I run a query, get back a list of files
> and have to go receive them and serve them up back to the browser. My
> script is all done and seemed to be working, then today I was informed
> it was not serving up all the images. Debugging revealed that it was
> this case issue - I was matching with exists(). As I've said, coding a
> solution is easy, but I fear it will be too slow. Speed is important
> in web apps - users have high expectations. Guess I'll just have to
> try it and see.
How long does it actually take to serve an HTTP request? I would expect it to
be orders of magnitude slower than calling os.listdir on a directory
containing hundreds of files.
Here on my Linux system there are 2000+ files in /usr/bin. Calling os.listdir
takes 1.5 milliseconds (warm cache):
$ python -m timeit -s 'import os' 'os.listdir("/usr/bin")'
1000 loops, best of 3: 1.42 msec per loop
Converting those to upper case takes a further .5 milliseconds:
$ python -m timeit -s 'import os' 'map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 1.98 msec per loop
Checking a string against that list takes .05 milliseconds:
$ python -m timeit -s 'import os' \
'"WHICH" in map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 2.03 msec per loop
Oscar
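
The commands above look like Python 2, where map() returns a list. On Python 3 map() is lazy, so the second command would mostly time the creation of an iterator rather than the upper-casing. A rough Python 3 equivalent using the timeit module is sketched below. The function names are made up, the fallback to the current directory is an assumption for systems without /usr/bin, and the numbers will differ from machine to machine.

```python
import os
import timeit

# Oscar timed /usr/bin; fall back to the current directory where it does not exist.
DIRECTORY = "/usr/bin" if os.path.isdir("/usr/bin") else "."


def list_names():
    return os.listdir(DIRECTORY)


def list_upper():
    # A list comprehension forces the conversion, unlike a bare map() on Python 3.
    return [name.upper() for name in os.listdir(DIRECTORY)]


def contains_which():
    return "WHICH" in [name.upper() for name in os.listdir(DIRECTORY)]


timings = {}
for func in (list_names, list_upper, contains_which):
    # Best of 3 runs of 50 calls each, reported per call like the timeit CLI.
    best = min(timeit.repeat(func, repeat=3, number=50)) / 50
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.3f} msec per call")
```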
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2014-01-23 12:02 -0700 |
| Message-ID | <mailman.5904.1390503746.18130.python-list@python.org> |
| In reply to | #64548 |
On Wed, Jan 22, 2014 at 9:29 PM, Chris Angelico <rosuav@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 3:24 PM, Larry Martell <larry.martell@gmail.com> wrote:
>> I am writing something that is part of a django app, that based on
>> some web entry from the user, I run a query, get back a list of files
>> and have to go receive them and serve them up back to the browser. My
>> script is all done and seemed to be working, then today I was informed
>> it was not serving up all the images. Debugging revealed that it was
>> this case issue - I was matching with exists(). As I've said, coding a
>> solution is easy, but I fear it will be too slow. Speed is important
>> in web apps - users have high expectations. Guess I'll just have to
>> try it and see.
>
> Would it be a problem to rename all the files? Then you could simply
> lower() the input name and it'll be correct.

So it turned out that in the django model definition for this object
there was code that was doing some character mapping that was causing
this. That code was added to 'fix' another problem, but the mapping
strings were not qualified enough and it was doing some unintended
mapping. Changing those strings to be more specific fixed my problem.

Thanks to all for the replies.
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2014-01-23 07:51 +0000 |
| Message-ID | <lbqhm2$7sf$1@dont-email.me> |
| In reply to | #64547 |
On Wed, 22 Jan 2014 18:18:32 -0700, Larry Martell wrote:

> The issue is that I run a database query and get back rows, each with
> a file path (each in a different dir). And I have to check to see if
> that file exists. Each is a separate search with no correlation to the
> others. I have the full path, so I guess I'll have to do dir name on
> it, then a listdir then compare each item with .lower with my string
> .lower. It's just that the dirs have 100's and 100's of files so I'm
> really worried about efficiency.

Okay, so it's not Python, and I have the benefit of having read all of
the other answers, but what about calling "locate" [0] with the "-i"
flag? or writing some sort of Python/ctypes wrapper around the part of
locate that searches the database?

Dan

[0] http://savannah.gnu.org/projects/findutils/
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-23 12:06 +0000 |
| Message-ID | <52e105d6$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64544 |
On Wed, 22 Jan 2014 17:58:19 -0700, Larry Martell wrote:

> I have the need to check for a file's existence against a string, but I
> need to do it case-insensitively.

Reading on, I see that your database assumes case-insensitive file
names, while your file system is case-sensitive.

Suggestions:

(1) Move the files onto a case-insensitive file system. Samba, I
believe, can duplicate the case-insensitive behaviour of NTFS even on
ext3 or ext4 file systems. (To be pedantic, NTFS can also optionally be
case-sensitive, although that is rarely used.) So if you stick the files
on a samba file share set to case-insensitivity, samba will behave the
way you want. (Although os.path.exists won't, you'll have to use
ntpath.exists instead.)

(2) Normalize the database and the files. Do a one-off run through the
files on disk, lowercasing the file names, followed by a one-off run
through the database, doing the same. (Watch out for ambiguous names
like "Foo" and "FOO".) Then you just need to ensure new files are always
named in lowercase.

Also, keep in mind that just because os.path.exists reports a file
exists *right now*, doesn't mean it will still exist a millisecond later
when you go to use it. Consider avoiding os.path.exists altogether, and
just trying to open the file. (Although I see you still have the problem
that you don't know *which* directory the file will be found in.)

> I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(), as
> I have 100's of strings to check for, each in a different dir, and each
> dir can have 100's of files in it. Does anyone know of an efficient way
> to do this? There's no switch for os.path that makes exists() check
> case-insensitively is there?

Try ntpath.exists, although I'm not certain it will do what you want
since it probably assumes the file system is case-insensitive.

It really sounds like you have a hard problem to solve here. I strongly
recommend that you change the problem, by renaming the files, or at
least moving them into a consistent location, rather than have to
repeatedly search multiple directories.

Good luck!

--
Steven
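
Before a one-off lowercasing run like the one in suggestion (2), the ambiguous names could be found first. The sketch below is only an illustration: lowercase_collisions is a made-up name, and the sample listing is hypothetical (in practice it would come from os.listdir() on each directory).

```python
from collections import defaultdict


def lowercase_collisions(names):
    """Map each lowercased name to the original names that would collide on it."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the lowercased names claimed by more than one original.
    return {lower: sorted(originals)
            for lower, originals in groups.items() if len(originals) > 1}


# Hypothetical listing; in practice this would come from os.listdir().
names = ["Foo", "FOO", "bar.jpg", "Baz.JPG"]
collisions = lowercase_collisions(names)
print(collisions)
```

Any directory that reports collisions would need a manual decision before renaming, since lowercasing both names would make one file overwrite the other.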
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2014-01-23 14:58 +0000 |
| Message-ID | <lbramg$6qv$1@reader1.panix.com> |
| In reply to | #64544 |
On 2014-01-23, Larry Martell <larry.martell@gmail.com> wrote:
> I have the need to check for a file's existence against a string, but I
> need to do it case-insensitively. I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(),
> as I have 100's of strings to check for, each in a different dir, and
> each dir can have 100's of files in it. Does anyone know of an
> efficient way to do this? There's no switch for os.path that makes
> exists() check case-insensitively is there?
If you're on Unix, you could use os.popen() to run a find command
using -iname.
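
A rough sketch of that idea follows, swapping os.popen() for subprocess.run() so that the name is passed as an argument rather than through a shell. It assumes a find that supports -iname and -maxdepth (GNU and BSD find do). The function name find_iname and the temporary directory are made up for the example.

```python
import os
import subprocess
import tempfile


def find_iname(directory, name):
    """Return paths directly inside `directory` whose name matches `name`, ignoring case."""
    result = subprocess.run(
        # -maxdepth 1 keeps find from descending into subdirectories.
        ["find", directory, "-maxdepth", "1", "-iname", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Demo in a throwaway directory (hypothetical names).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "Abc.jpg"), "w"):
    pass

hits = find_iname(workdir, "ABC.JPG")
print(hits)
```

Note that -iname takes a shell pattern, so characters such as *, ? and [ in the name act as wildcards, and starting a process per lookup is likely to cost more than an os.listdir() call.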
--
Grant Edwards   grant.b.edwards at gmail.com
Yow! I'm DESPONDENT ... I hope there's something DEEP-FRIED under this miniature DOMED STADIUM ...