Possible bug with stability of mimetypes.guess_* function output

Started by: Johannes Bauer <dfnsonfsduifb@gmx.de>
First post: 2014-02-07 19:06 +0100
Last post: 2014-02-08 02:59 -0800
Articles: 11 (4 participants)



Contents

  Possible bug with stability of mimetypes.guess_* function output Johannes Bauer <dfnsonfsduifb@gmx.de> - 2014-02-07 19:06 +0100
    Re: Possible bug with stability of mimetypes.guess_* function output Asaf Las <roegltd@gmail.com> - 2014-02-07 11:09 -0800
      Re: Possible bug with stability of mimetypes.guess_* function output Asaf Las <roegltd@gmail.com> - 2014-02-07 11:17 -0800
        Re: Possible bug with stability of mimetypes.guess_* function output Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-07 19:28 +0000
      Re: Possible bug with stability of mimetypes.guess_* function output Johannes Bauer <dfnsonfsduifb@gmx.de> - 2014-02-07 20:39 +0100
      Re: Possible bug with stability of mimetypes.guess_* function output Peter Otten <__peter__@web.de> - 2014-02-07 20:40 +0100
        Re: Possible bug with stability of mimetypes.guess_* function output Asaf Las <roegltd@gmail.com> - 2014-02-07 12:25 -0800
          Re: Possible bug with stability of mimetypes.guess_* function output Peter Otten <__peter__@web.de> - 2014-02-08 08:51 +0100
            Re: Possible bug with stability of mimetypes.guess_* function output Asaf Las <roegltd@gmail.com> - 2014-02-08 00:24 -0800
              Re: Possible bug with stability of mimetypes.guess_* function output Peter Otten <__peter__@web.de> - 2014-02-08 09:39 +0100
                Re: Possible bug with stability of mimetypes.guess_* function output Asaf Las <roegltd@gmail.com> - 2014-02-08 02:59 -0800

#65603 — Possible bug with stability of mimetypes.guess_* function output

From: Johannes Bauer <dfnsonfsduifb@gmx.de>
Date: 2014-02-07 19:06 +0100
Subject: Possible bug with stability of mimetypes.guess_* function output
Message-ID: <ld37bb$7ji$1@news.albasani.net>

Hi group,

I'm using Python 3.3.2+ (default, Oct  9 2013, 14:50:09) [GCC 4.8.1] on
linux and have found what is very peculiar behavior at best and a bug at
worst. It regards the mimetypes module and in particular the
guess_all_extensions and guess_extension functions.

I've found that these do not return stable output. When running the
following commands, it returns one of:

$ python3 -c 'import mimetypes;
print(mimetypes.guess_all_extensions("text/html"),
mimetypes.guess_extension("text/html"))'
['.htm', '.html', '.shtml'] .htm

$ python3 -c 'import mimetypes;
print(mimetypes.guess_all_extensions("text/html"),
mimetypes.guess_extension("text/html"))'
['.html', '.htm', '.shtml'] .html

So guess_extension(x) seems to always return guess_all_extensions(x)[0].

Curiously, "shtml" is never the first element. The other two are mixed
with a probability of around 50% which leads me to believe they're
internally managed as a set and are therefore affected by the
(relatively new) nondeterministic hashing function initialization.

I don't know if stable output is guaranteed for these functions, but it
sure would be nice. Messes up a whole bunch of things otherwise :-/

Please let me know if this is a bug or expected behavior.
Best regards,
Johannes

-- 
>> Where EXACTLY did you predict the quake again?
> At least not publicly!
Ah, the latest and to this day most ingenious stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
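
The observation that guess_extension(x) matches guess_all_extensions(x)[0] can be checked inside a single process. The following is only a minimal sketch; it confirms the behaviour for one MIME type on the interpreter that runs it and says nothing about what the documentation guarantees.

```python
import mimetypes

mime_type = "text/html"

# All candidate extensions known for this type, in whatever order the
# module happens to return them in this process.
all_exts = mimetypes.guess_all_extensions(mime_type)

# The single "best guess" for the same type.
first = mimetypes.guess_extension(mime_type)

print(all_exts, first)

# True if the single guess is simply the first entry of the full list.
same = bool(all_exts) and first == all_exts[0]
print("guess_extension == guess_all_extensions[0]:", same)
```

Within one process the two calls agree; the instability described in the post only shows up when comparing separate interpreter runs.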



#65604

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-07 11:09 -0800
Message-ID: <03a2c4c8-313f-4382-8be9-5163d8bf644c@googlegroups.com>
In reply to: #65603

On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
> Hi group,
> 
> I'm using Python 3.3.2+ (default, Oct  9 2013, 14:50:09) [GCC 4.8.1] on
> linux and have found what is very peculiar behavior at best and a bug at
> worst. It regards the mimetypes module and in particular the
> guess_all_extensions and guess_extension functions.
> 
> I've found that these do not return stable output. When running the
> following commands, it returns one of:
> 
> $ python3 -c 'import mimetypes;
> print(mimetypes.guess_all_extensions("text/html"),
> mimetypes.guess_extension("text/html"))'
> ['.htm', '.html', '.shtml'] .htm
> 
> $ python3 -c 'import mimetypes;
> print(mimetypes.guess_all_extensions("text/html"),
> mimetypes.guess_extension("text/html"))'
> ['.html', '.htm', '.shtml'] .html
> 
> So guess_extension(x) seems to always return guess_all_extensions(x)[0].
> 
> Curiously, "shtml" is never the first element. The other two are mixed
> with a probability of around 50% which leads me to believe they're
> internally managed as a set and are therefore affected by the
> (relatively new) nondeterministic hashing function initialization.
> 
> 
> I don't know if stable output is guaranteed for these functions, but it
> sure would be nice. Messes up a whole bunch of things otherwise :-/
> 
> Please let me know if this is a bug or expected behavior.
> 
> Best regards,
> 
> Johannes

They are kept in a dictionary. Same for v3.3.3 as well.

You could try querying with the sequence below:

import mimetypes
mimetypes.init()
mimetypes.guess_extension("text/html")

I got only 'htm' for 5 consecutive attempts.

/Asaf



#65605

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-07 11:17 -0800
Message-ID: <01c8e74a-5451-40a8-958a-c58c86a9f77f@googlegroups.com>
In reply to: #65604

Btw, I only saw this after my own post:
the example usage calls mimetypes.init()
before calling the module functions.



#65606

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2014-02-07 19:28 +0000
Message-ID: <mailman.6496.1391801309.18130.python-list@python.org>
In reply to: #65605

On 07/02/2014 19:17, Asaf Las wrote:
> Btw, I only saw this after my own post:
> the example usage calls mimetypes.init()
> before calling the module functions.
>

 From http://docs.python.org/3/library/mimetypes.html#module-mimetypes 
third paragraph "The functions described below provide the primary 
interface for this module. If the module has not been initialized, they 
will call init() if they rely on the information init() sets up."  Draw 
your own conclusions :)
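
The quoted paragraph can be checked with the documented module flag mimetypes.inited. This is a small sketch, assuming nothing else in the process has touched the module beforehand (in which case the first value printed may already be True).

```python
import mimetypes

# Whether the module-level database was already set up before our lookup.
# (Another import in the same process may have initialized it earlier.)
before = mimetypes.inited

# No explicit mimetypes.init() here: the lookup initializes on demand.
guessed = mimetypes.guess_extension("text/html")

after = mimetypes.inited

print("inited before lookup:", before)
print("guessed extension:   ", guessed)
print("inited after lookup: ", after)
```

So an explicit init() call before a guess_* function should not change which tables are consulted; it only makes the initialization visible.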

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#65607

From: Johannes Bauer <dfnsonfsduifb@gmx.de>
Date: 2014-02-07 20:39 +0100
Message-ID: <ld3cpd$i42$1@news.albasani.net>
In reply to: #65604

On 07.02.2014 20:09, Asaf Las wrote:

> You could try querying with the sequence below:
> 
> import mimetypes
> mimetypes.init()
> mimetypes.guess_extension("text/html")
> 
> I got only 'htm' for 5 consecutive attempts.

Doesn't change anything. With this:

#!/usr/bin/python3
import mimetypes
mimetypes.init()
print(mimetypes.guess_extension("application/msword"))

And a call like this:

$ for i in `seq 100`; do ./x.py ; done | sort | uniq -c

I get

     35 .doc
     24 .dot
     41 .wiz

Regards,
Johannes
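
The shell loop above can also be written in Python so that the hash seed is set explicitly for each run. This is a rough sketch: the helper name sample_guesses and the choice of seeds are illustrative only, and on newer Python versions the counts may well collapse to a single extension.

```python
import collections
import os
import subprocess
import sys

# The one-liner each child interpreter runs (same lookup as x.py above).
SNIPPET = (
    "import mimetypes; "
    'print(mimetypes.guess_extension("application/msword"))'
)


def sample_guesses(seeds):
    """Run SNIPPET once per hash seed and count the printed extensions."""
    counts = collections.Counter()
    for seed in seeds:
        # A fixed PYTHONHASHSEED makes each individual run reproducible.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        result = subprocess.run(
            [sys.executable, "-c", SNIPPET],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        counts[result.stdout.strip()] += 1
    return counts


if __name__ == "__main__":
    for ext, n in sorted(sample_guesses(range(10)).items()):
        print(f"{n:7d} {ext}")
```

Pinning the seed per run makes it possible to reproduce one specific outcome, which random seeding in the shell loop does not allow.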



#65608

From: Peter Otten <__peter__@web.de>
Date: 2014-02-07 20:40 +0100
Message-ID: <mailman.6497.1391802017.18130.python-list@python.org>
In reply to: #65604

Asaf Las wrote:

> On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
>> Hi group,
>> 
>> I'm using Python 3.3.2+ (default, Oct  9 2013, 14:50:09) [GCC 4.8.1] on
>> linux and have found what is very peculiar behavior at best and a bug at
>> worst. It regards the mimetypes module and in particular the
>> guess_all_extensions and guess_extension functions.
>> 
>> I've found that these do not return stable output. When running the
>> following commands, it returns one of:
>> 
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.htm', '.html', '.shtml'] .htm
>> 
>> $ python3 -c 'import mimetypes;
>> print(mimetypes.guess_all_extensions("text/html"),
>> mimetypes.guess_extension("text/html"))'
>> ['.html', '.htm', '.shtml'] .html
>> 
>> So guess_extension(x) seems to always return guess_all_extensions(x)[0].
>> 
>> Curiously, "shtml" is never the first element. The other two are mixed
>> with a probability of around 50% which leads me to believe they're
>> internally managed as a set and are therefore affected by the
>> (relatively new) nondeterministic hashing function initialization.
>> 
>> 
>> I don't know if stable output is guaranteed for these functions, but it
>> sure would be nice. Messes up a whole bunch of things otherwise :-/
>> 
>> Please let me know if this is a bug or expected behavior.
>> 
>> Best regards,
>> 
>> Johannes
> 
> They are kept in a dictionary. Same for v3.3.3 as well.
> 
> You could try querying with the sequence below:
> 
> import mimetypes
> mimetypes.init()
> mimetypes.guess_extension("text/html")
> 
> I got only 'htm' for 5 consecutive attempts.

As Johannes mentioned, this depends on the hash seed:

$ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.html
$ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.htm
$ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
.shtml

You never see ".shtml" as the guessed extension because it is not in the 
original mimetypes.types_map dict, but instead programmatically read from a 
file like /etc/mime.types and then added to a list of extensions.

Johannes, 
I'd like the guessed extension to be consistent, too, but even if that is 
rejected the current behaviour should be documented. 

Please file a bug report.



#65609

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-07 12:25 -0800
Message-ID: <c26b109c-0247-4c99-80c3-dccfa3d7ab06@googlegroups.com>
In reply to: #65608

On Friday, February 7, 2014 9:40:06 PM UTC+2, Peter Otten wrote:
> As Johannes mentioned, this depends on the hash seed:
> $ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
> .html
> $ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
> .htm
> $ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
> .shtml
> 
> You never see ".shtml" as the guessed extension because it is not in the 
> original mimetypes.types_map dict, but instead programmatically read from a 
> file like /etc/mime.types and then added to a list of extensions.
> 
As mimetypes.py lists a bunch of candidate files, repeatability can only
be achieved at the level of a particular machine:

"/etc/mime.types",
"/etc/httpd/mime.types",               
"/etc/httpd/conf/mime.types",          
"/etc/apache/mime.types",              
"/etc/apache2/mime.types",             
"/usr/local/etc/httpd/conf/mime.types",
"/usr/local/lib/netscape/mime.types",
"/usr/local/etc/httpd/conf/mime.types",
"/usr/local/etc/mime.types"



#65638

From: Peter Otten <__peter__@web.de>
Date: 2014-02-08 08:51 +0100
Message-ID: <mailman.6517.1391845924.18130.python-list@python.org>
In reply to: #65609

Asaf Las wrote:

> On Friday, February 7, 2014 9:40:06 PM UTC+2, Peter Otten wrote:

>> You never see ".shtml" as the guessed extension because it is not in the
>> original mimetypes.types_map dict, but instead programmatically read from
>> a file like /etc/mime.types and then added to a list of extensions.

> As mimetypes.py lists a bunch of candidate files, repeatability can only
> be achieved at the level of a particular machine.

At least the mimetypes already defined in the module could easily produce 
the same guessed extension consistently.
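
To see which of the system files discussed above can contribute entries on a particular machine, the documented list mimetypes.knownfiles can be filtered for paths that exist. A minimal sketch; the output naturally differs from host to host.

```python
import mimetypes
import os

# mimetypes.knownfiles lists the common locations where the module
# looks for mime.types-style files; only some exist on a given host.
present = [path for path in mimetypes.knownfiles if os.path.isfile(path)]

for path in present:
    print(path)

print(f"{len(present)} of {len(mimetypes.knownfiles)} candidate files exist here")
```

If the list comes back empty, only the tables defined inside the module itself are in play on that machine.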



#65644

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-08 00:24 -0800
Message-ID: <e5cc422f-e324-4ad3-9f24-cf8c462ddf15@googlegroups.com>
In reply to: #65638

On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
> 
> At least the mimetypes already defined in the module could easily produce 
> the same guessed extension consistently.

IMHO one workaround for the OP could be to supply his own map file to
init() and thus ensure an unambiguous mapping across every platform and
distribution. I guess some libraries already do that. Or write a wrapper
that processes the guess_all_extensions() results to eliminate ambiguity
as far as needed. That is in case the bug report is rejected.



#65650

From: Peter Otten <__peter__@web.de>
Date: 2014-02-08 09:39 +0100
Message-ID: <mailman.6526.1391848759.18130.python-list@python.org>
In reply to: #65644

Asaf Las wrote:

> On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
>> 
>> At least the mimetypes already defined in the module could easily produce
>> the same guessed extension consistently.
> 
> IMHO one workaround for the OP could be to supply his own map file to
> init() and thus ensure an unambiguous mapping across every platform and
> distribution. I guess some libraries already do that. Or write a wrapper
> that processes the guess_all_extensions() results to eliminate ambiguity
> as far as needed. That is in case the bug report is rejected.

You also have to set mimetypes.types_map and mimetypes.common_types to an 
empty dict (or an OrderedDict).
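
A variation on this idea, sketched here as an assumption rather than as what was proposed above: instead of clearing the module-level dicts, build a private mimetypes.MimeTypes instance, empty its documented types_map and types_map_inv tables, and register only the mappings you want. The helper name build_private_db is made up for illustration.

```python
import mimetypes


def build_private_db(mapping):
    """Return a MimeTypes instance that knows only the given pairs.

    mapping is an iterable of (mime_type, extension) tuples; the first
    extension registered for a type becomes its guessed extension.
    """
    db = mimetypes.MimeTypes()
    # Drop everything the instance inherited so only explicit entries remain.
    # Each attribute is a (non-strict, strict) pair of dicts.
    db.types_map = ({}, {})
    db.types_map_inv = ({}, {})
    for mime_type, ext in mapping:
        db.add_type(mime_type, ext)
    return db


if __name__ == "__main__":
    db = build_private_db([
        ("text/html", ".html"),
        ("application/msword", ".doc"),
    ])
    print(db.guess_extension("text/html"))
    print(db.guess_extension("application/msword"))
```

This leaves the global database untouched, so other code in the same process that relies on the module-level functions is not affected.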



#65665

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-08 02:59 -0800
Message-ID: <3ca96b6c-ff2d-4fe2-8492-e0a8ff961ede@googlegroups.com>
In reply to: #65650

On Saturday, February 8, 2014 10:39:06 AM UTC+2, Peter Otten wrote:
> Asaf Las wrote:
> > On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
> >> At least the mimetypes already defined in the module could easily produce
> >> the same guessed extension consistently.
> > IMHO one workaround for the OP could be to supply his own map file to
> > init() and thus ensure an unambiguous mapping across every platform and
> > distribution. I guess some libraries already do that. Or write a wrapper
> > that processes the guess_all_extensions() results to eliminate ambiguity
> > as far as needed. That is in case the bug report is rejected.
> 
> You also have to set mimetypes.types_map and mimetypes.common_types to an 
> empty dict (or an OrderedDict).

Hmmm, yes. Then the quickest workaround is to get the list of all guesses,
sort it, and use the one at index 0.
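
That workaround could look roughly like the sketch below. The wrapper name stable_extension is hypothetical, and picking the alphabetically first candidate trades "most common extension" for "same answer on every run".

```python
import mimetypes


def stable_extension(mime_type, strict=True):
    """Return the alphabetically first extension for mime_type, or None.

    Sorting removes any dependence on the order in which the module
    happens to return its candidates.
    """
    candidates = mimetypes.guess_all_extensions(mime_type, strict=strict)
    if not candidates:
        return None
    return sorted(candidates)[0]


if __name__ == "__main__":
    print(stable_extension("text/html"))
    print(stable_extension("application/msword"))
```

The result is reproducible on one machine; across machines it can still differ if their mime.types files contribute different candidates.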
