

Groups > comp.lang.python > #64544 > unrolled thread

Case insensitive exists()?

Started by: Larry Martell <larry.martell@gmail.com>
First post: 2014-01-22 17:58 -0700
Last post: 2014-01-23 14:58 +0000
Articles 11 — 7 participants



Contents

  Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 17:58 -0700
    Re: Case insensitive exists()? Roy Smith <roy@panix.com> - 2014-01-22 20:08 -0500
      Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 18:18 -0700
        Re: Case insensitive exists()? Roy Smith <roy@panix.com> - 2014-01-22 20:27 -0500
          Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-22 21:24 -0700
          Re: Case insensitive exists()? Chris Angelico <rosuav@gmail.com> - 2014-01-23 15:29 +1100
          Re: Case insensitive exists()? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-01-23 15:43 +0000
          Re: Case insensitive exists()? Larry Martell <larry.martell@gmail.com> - 2014-01-23 12:02 -0700
        Re: Case insensitive exists()? Dan Sommers <dan@tombstonezero.net> - 2014-01-23 07:51 +0000
    Re: Case insensitive exists()? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-23 12:06 +0000
    Re: Case insensitive exists()? Grant Edwards <invalid@invalid.invalid> - 2014-01-23 14:58 +0000

#64544 — Case insensitive exists()?

From: Larry Martell <larry.martell@gmail.com>
Date: 2014-01-22 17:58 -0700
Subject: Case insensitive exists()?
Message-ID: <mailman.5853.1390438708.18130.python-list@python.org>
I need to check for a file's existence against a string, but I need
to do it case-insensitively. I cannot efficiently get the name of
every file in the dir and compare each with my string using lower(),
as I have hundreds of strings to check for, each in a different dir,
and each dir can have hundreds of files in it. Does anyone know of an
efficient way to do this? There's no switch for os.path that makes
exists() check case-insensitively, is there?



#64545

From: Roy Smith <roy@panix.com>
Date: 2014-01-22 20:08 -0500
Message-ID: <roy-7A7FAE.20082622012014@news.panix.com>
In reply to: #64544
In article <mailman.5853.1390438708.18130.python-list@python.org>,
 Larry Martell <larry.martell@gmail.com> wrote:

> I have the need to check for a files existence against a string, but I
> need to do case-insensitively. I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(),
> as I have 100's of strings to check for, each in a different dir, and
> each dir can have 100's of files in it.

I'm not quite sure what you're asking.  Do you need to match the 
filename, or find the string in the contents of the file?  I'm going to 
assume you're asking the former.

One way or another, you need to iterate over all the directories and get 
all the filenames in each.  The time to do that is going to totally 
swamp any processing you do in terms of converting to lower case and 
comparing to some set of strings.

I would put all my strings into a set, then use os.walk() to traverse
the directories and, for each path os.walk() returns, do "path.lower()
in strings".
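
A minimal sketch of the set-plus-os.walk() idea described above. The
function name find_wanted and its parameters are made up for
illustration, and it assumes the wanted strings are bare file names, so
it compares each file's basename rather than the whole path:

```python
import os


def find_wanted(root, wanted_names):
    """Return paths under root whose file name matches a wanted name, ignoring case."""
    # Lowercase the wanted names once so each lookup is a cheap set test.
    wanted = {name.lower() for name in wanted_names}
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower() in wanted:
                matches.append(os.path.join(dirpath, filename))
    return matches
```

The set makes the per-file test constant time, so the directory
traversal itself is the part that costs anything.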



#64547

From: Larry Martell <larry.martell@gmail.com>
Date: 2014-01-22 18:18 -0700
Message-ID: <mailman.5855.1390439920.18130.python-list@python.org>
In reply to: #64545
On Wed, Jan 22, 2014 at 6:08 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.5853.1390438708.18130.python-list@python.org>,
>  Larry Martell <larry.martell@gmail.com> wrote:
>
>> I have the need to check for a files existence against a string, but I
>> need to do case-insensitively. I cannot efficiently get the name of
>> every file in the dir and compare each with my string using lower(),
>> as I have 100's of strings to check for, each in a different dir, and
>> each dir can have 100's of files in it.
>
> I'm not quite sure what you're asking.  Do you need to match the
> filename, or find the string in the contents of the file?  I'm going to
> assume you're asking the former.

Yes, match the file names, e.g. if my match string is "ABC" and
there's a file named "Abc" then it would be a match.

> One way or another, you need to iterate over all the directories and get
> all the filenames in each.  The time to do that is going to totally
> swamp any processing you do in terms of converting to lower case and
> comparing to some set of strings.
>
> I would put all my strings into a set, then use os.walk() traverse the
> directories and for each path os.walk() returns, do "path.lower() in
> strings".

The issue is that I run a database query and get back rows, each with
a file path (each in a different dir). And I have to check to see if
that file exists. Each is a separate search with no correlation to the
others. I have the full path, so I guess I'll have to call dirname on
it, then listdir, then compare each item's .lower() with my string's
.lower(). It's just that the dirs have hundreds and hundreds of files,
so I'm really worried about efficiency.
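
The dirname/listdir/lower() plan described above could be sketched
roughly as follows. The name find_ignoring_case is hypothetical, and
trying the exact spelling first is an added assumption meant to keep
the common case down to a single existence check:

```python
import os


def find_ignoring_case(path):
    """Return an on-disk path whose basename matches path ignoring case, or None."""
    # Fast path: the name recorded in the database has the right case.
    if os.path.exists(path):
        return path
    directory, wanted = os.path.split(path)
    wanted = wanted.lower()
    try:
        entries = os.listdir(directory or ".")
    except OSError:
        # The directory is missing or unreadable, so nothing can match.
        return None
    for entry in entries:
        if entry.lower() == wanted:
            return os.path.join(directory, entry)
    return None
```

If a directory holds two names that differ only in case, this returns
whichever one listdir happens to yield first.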



#64548

From: Roy Smith <roy@panix.com>
Date: 2014-01-22 20:27 -0500
Message-ID: <roy-1771A1.20270322012014@news.panix.com>
In reply to: #64547
In article <mailman.5855.1390439920.18130.python-list@python.org>,
 Larry Martell <larry.martell@gmail.com> wrote:

> The issue is that I run a database query and get back rows, each with
> a file path (each in a different dir). And I have to check to see if
> that file exists. Each is a separate search with no correlation to the
> others. I have the full path, so I guess I'll have to do dir name on
> it, then a listdir then compare each item with .lower with my string
> .lower. It's just that the dirs have 100's and 100's of files so I'm
> really worried about efficiency.

Oh, my, this is a much more complicated problem than you originally 
described.

Is the whole path case-insensitive, or just the last component?  In 
other words, if the search string is "/foo/bar/my_file_name", do all of 
these paths match?

/FOO/BAR/MY_FILE_NAME
/foo/bar/my_file_name
/FoO/bAr/My_FiLe_NaMe

Can you give some more background as to *why* you're doing this?  
Usually, if a system considers filenames to be case-insensitive, that's 
something that's handled by the operating system itself.



#64564

From: Larry Martell <larry.martell@gmail.com>
Date: 2014-01-22 21:24 -0700
Message-ID: <mailman.5867.1390451096.18130.python-list@python.org>
In reply to: #64548
On Wed, Jan 22, 2014 at 6:27 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.5855.1390439920.18130.python-list@python.org>,
>  Larry Martell <larry.martell@gmail.com> wrote:
>
>> The issue is that I run a database query and get back rows, each with
>> a file path (each in a different dir). And I have to check to see if
>> that file exists. Each is a separate search with no correlation to the
>> others. I have the full path, so I guess I'll have to do dir name on
>> it, then a listdir then compare each item with .lower with my string
>> .lower. It's just that the dirs have 100's and 100's of files so I'm
>> really worried about efficiency.
>
> Oh, my, this is a much more complicated problem than you originally
> described.

I try not to bother folks with simple problems ;-)

> Is the whole path case-insensitive, or just the last component?  In
> other words, if the search string is "/foo/bar/my_file_name", do all of
> these paths match?
>
> /FOO/BAR/MY_FILE_NAME
> /foo/bar/my_file_name
> /FoO/bAr/My_FiLe_NaMe

Just the file name (the basename).

> Can you give some more background as to *why* you're doing this?
> Usually, if a system considers filenames to be case-insensitive, that's
> something that's handled by the operating system itself.

I can't say why it's happening. This is a big complicated system with
lots of parts. There's some program that ftp's image files from an
electron microscope and stores them on the file system with crazy
names like:

2O_TOPO_1_2O_2UM_FOV_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

And something (perhaps the same program, perhaps a different one)
records this in a database. In some cases the name recorded in the db
differs in the case of some characters from how it was stored on the
file system, e.g.:

2O_TOPO_1_2O_2UM_Fov_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

These only differ in "FOV" vs. "Fov" but that is just one example.

I am writing something that is part of a Django app: based on some web
entry from the user, I run a query, get back a list of files, and have
to go retrieve them and serve them back to the browser. My script was
all done and seemed to be working, then today I was informed it was
not serving up all the images. Debugging revealed that it was this
case issue - I was matching with exists(). As I've said, coding a
solution is easy, but I fear it will be too slow. Speed is important
in web apps - users have high expectations. Guess I'll just have to
try it and see.



#64566

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-23 15:29 +1100
Message-ID: <mailman.5869.1390451367.18130.python-list@python.org>
In reply to: #64548
On Thu, Jan 23, 2014 at 3:24 PM, Larry Martell <larry.martell@gmail.com> wrote:
> I am writing something that is part of a django app, that based on
> some web entry from the user, I run a query, get back a list of files
> and have to go receive them and serve them up back to the browser. My
> script is all done and seem to be working, then today I was informed
> it was not serving up all the images. Debugging revealed that it was
> this case issue - I was matching with exists(). As I've said, coding a
> solution is easy, but I fear it will be too slow. Speed is important
> in web apps - users have high expectations. Guess I'll just have to
> try it and see.

Would it be a problem to rename all the files? Then you could simply
lower() the input name and it'll be correct.

ChrisA



#64620

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2014-01-23 15:43 +0000
Message-ID: <mailman.5897.1390491833.18130.python-list@python.org>
In reply to: #64548
On Wed, Jan 22, 2014 at 09:24:54PM -0700, Larry Martell wrote:
> 
> I am writing something that is part of a django app, that based on
> some web entry from the user, I run a query, get back a list of files
> and have to go receive them and serve them up back to the browser. My
> script is all done and seem to be working, then today I was informed
> it was not serving up all the images. Debugging revealed that it was
> this case issue - I was matching with exists(). As I've said, coding a
> solution is easy, but I fear it will be too slow. Speed is important
> in web apps - users have high expectations. Guess I'll just have to
> try it and see.

How long does it actually take to serve an HTTP request? I would expect it
to be orders of magnitude slower than calling os.listdir on a directory
containing hundreds of files.

Here on my Linux system there are 2000+ files in /usr/bin. Calling os.listdir
takes 1.5 milliseconds (warm cache):

$ python -m  timeit -s 'import os' 'os.listdir("/usr/bin")'
1000 loops, best of 3: 1.42 msec per loop

Converting those to upper case takes a further .5 milliseconds:

$ python -m  timeit -s 'import os' 'map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 1.98 msec per loop

Checking a string against that list takes .05 milliseconds:

$ python -m  timeit -s 'import os' \
  '"WHICH" in map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 2.03 msec per loop
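
The commands above look like they were run under Python 2, where map()
returns a list. Under Python 3, map() is lazy, so the second command
would build the map object without upper-casing anything. A rough
Python 3 equivalent using the timeit module might look like this (the
function name time_listing and the step labels are made up for
illustration):

```python
import os
import timeit


def time_listing(directory, needle, number=100):
    """Return average milliseconds per call for three steps of the lookup."""
    steps = {
        "listdir": lambda: os.listdir(directory),
        # A list comprehension forces every name to be upper-cased.
        "listdir + upper": lambda: [n.upper() for n in os.listdir(directory)],
        "membership test": lambda: needle.upper()
        in [n.upper() for n in os.listdir(directory)],
    }
    # timeit returns total seconds for `number` calls; convert to msec per call.
    return {
        label: timeit.timeit(func, number=number) / number * 1000
        for label, func in steps.items()
    }
```

Calling time_listing("/usr/bin", "which") reproduces the three
measurements in one go. The actual figures depend on the machine and on
whether the directory is already in the file system cache.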


Oscar



#64628

From: Larry Martell <larry.martell@gmail.com>
Date: 2014-01-23 12:02 -0700
Message-ID: <mailman.5904.1390503746.18130.python-list@python.org>
In reply to: #64548
On Wed, Jan 22, 2014 at 9:29 PM, Chris Angelico <rosuav@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 3:24 PM, Larry Martell <larry.martell@gmail.com> wrote:
>> I am writing something that is part of a django app, that based on
>> some web entry from the user, I run a query, get back a list of files
>> and have to go receive them and serve them up back to the browser. My
>> script is all done and seem to be working, then today I was informed
>> it was not serving up all the images. Debugging revealed that it was
>> this case issue - I was matching with exists(). As I've said, coding a
>> solution is easy, but I fear it will be too slow. Speed is important
>> in web apps - users have high expectations. Guess I'll just have to
>> try it and see.
>
> Would it be a problem to rename all the files? Then you could simply
> lower() the input name and it'll be correct.

So it turned out that in the django model definition for this object
there was code that was doing some character mapping that was causing
this. That code was added to 'fix' another problem, but the mapping
strings were not qualified enough and it was doing some unintended
mapping. Changing those strings to be more specific fixed my problem.

Thanks to all for the replies.



#64582

From: Dan Sommers <dan@tombstonezero.net>
Date: 2014-01-23 07:51 +0000
Message-ID: <lbqhm2$7sf$1@dont-email.me>
In reply to: #64547
On Wed, 22 Jan 2014 18:18:32 -0700, Larry Martell wrote:

> The issue is that I run a database query and get back rows, each with
> a file path (each in a different dir). And I have to check to see if
> that file exists. Each is a separate search with no correlation to the
> others. I have the full path, so I guess I'll have to do dir name on
> it, then a listdir then compare each item with .lower with my string
> .lower. It's just that the dirs have 100's and 100's of files so I'm
> really worried about efficiency.

Okay, so it's not Python, and I have the benefit of having read all of
the other answers, but what about calling "locate" [0] with the "-i"
flag, or writing some sort of Python/ctypes wrapper around the part of
locate that searches the database?

Dan

[0] http://savannah.gnu.org/projects/findutils/



#64597

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-23 12:06 +0000
Message-ID: <52e105d6$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to: #64544
On Wed, 22 Jan 2014 17:58:19 -0700, Larry Martell wrote:

> I have the need to check for a files existence against a string, but I
> need to do case-insensitively. 

Reading on, I see that your database assumes case-insensitive file names, 
while your file system is case-sensitive.

Suggestions:

(1) Move the files onto a case-insensitive file system. Samba, I believe, 
can duplicate the case-insensitive behaviour of NTFS even on ext3 or ext4 
file systems. (To be pedantic, NTFS can also optionally be 
case-sensitive, although that is rarely used.) So if you stick the files 
on a samba file share set to case-insensitivity, samba will behave the 
way you want. (Although os.path.exists won't, you'll have to use 
ntpath.exists instead.)

(2) Normalize the database and the files. Do a one-off run through the 
files on disk, lowercasing the file names, followed by a one-off run 
through the database, doing the same. (Watch out for ambiguous names like 
"Foo" and "FOO".) Then you just need to ensure new files are always named 
in lowercase.
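
A hedged sketch of the file-renaming half of suggestion (2), with the
"Foo" versus "FOO" ambiguity checked before anything is touched. Both
function names are hypothetical. The planning step works on a plain
list of names, so it can be tried out without renaming anything:

```python
import os
from collections import defaultdict


def plan_lowercase_renames(names):
    """Return (renames, collisions) for lowercasing the given file names."""
    by_lower = defaultdict(list)
    for name in names:
        by_lower[name.lower()].append(name)
    renames = []
    collisions = {}
    for lowered, originals in by_lower.items():
        if len(originals) > 1:
            # Several names would collapse into one: leave for a human to resolve.
            collisions[lowered] = sorted(originals)
        elif originals[0] != lowered:
            renames.append((originals[0], lowered))
    return sorted(renames), collisions


def apply_renames(directory, renames):
    """Rename each (old, new) pair inside directory."""
    for old, new in renames:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Typical use would be to call plan_lowercase_renames(os.listdir(directory)),
inspect the collisions, and only then pass the renames to
apply_renames. Note that os.listdir also returns subdirectory names, so
filter those out first if only files should be renamed. The matching
database update is not shown.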


Also, keep in mind that just because os.path.exists reports a file exists 
*right now*, doesn't mean it will still exist a millisecond later when 
you go to use it. Consider avoiding os.path.exists altogether, and just 
trying to open the file. (Although I see you still have the problem that 
you don't know *which* directory the file will be found in.)
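
As a small illustration of "just try to open the file" instead of
checking first, something like this avoids the gap between the check
and the use. The helper name read_if_present is made up, and
FileNotFoundError assumes Python 3.3 or later:

```python
def read_if_present(path):
    """Return the file's bytes, or None if the file does not exist."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except FileNotFoundError:
        # The file is missing (or vanished before we opened it).
        return None
```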

> I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(), as
> I have 100's of strings to check for, each in a different dir, and each
> dir can have 100's of files in it. Does anyone know of an efficient way
> to do this? There's no switch for os.path that makes exists() check
> case-insensitively is there?

Try ntpath.exists, although I'm not certain it will do what you want 
since it probably assumes the file system is case-insensitive.

It really sounds like you have a hard problem to solve here. I strongly 
recommend that you change the problem, by renaming the files, or at least 
moving them into a consistent location, rather than have to repeatedly 
search multiple directories. Good luck!

-- 
Steven



#64613

From: Grant Edwards <invalid@invalid.invalid>
Date: 2014-01-23 14:58 +0000
Message-ID: <lbramg$6qv$1@reader1.panix.com>
In reply to: #64544
On 2014-01-23, Larry Martell <larry.martell@gmail.com> wrote:

> I have the need to check for a files existence against a string, but I
> need to do case-insensitively. I cannot efficiently get the name of
> every file in the dir and compare each with my string using lower(),
> as I have 100's of strings to check for, each in a different dir, and
> each dir can have 100's of files in it. Does anyone know of an
> efficient way to do this? There's no switch for os.path that makes
> exists() check case-insensitively is there?

If you're on Unix, you could use os.popen() to run a find command
using -iname.
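
A rough sketch of the find approach, using subprocess.run rather than
os.popen() (a substitution, not what was suggested above). The function
name find_iname is hypothetical, and the -maxdepth and -iname options
are assumed to be available, as they are in GNU and BSD find:

```python
import subprocess


def find_iname(directory, filename):
    """Return files directly inside directory whose name matches filename, ignoring case."""
    # Passing a list (no shell) keeps odd characters in names from being
    # interpreted by a shell. check=True raises CalledProcessError if find
    # fails, for example when the directory does not exist.
    result = subprocess.run(
        ["find", directory, "-maxdepth", "1", "-type", "f", "-iname", filename],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Two caveats: find treats the -iname argument as a glob pattern, so
names containing *, ? or [ would need escaping, and this starts a new
process for every lookup.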

-- 
Grant Edwards               grant.b.edwards        Yow! I'm DESPONDENT ... I
                                  at               hope there's something
                              gmail.com            DEEP-FRIED under this
                                                   miniature DOMED STADIUM ...




