

Groups > comp.lang.python > #44923 > unrolled thread

Re: Making safe file names

Started by: Andrew Berg <bahamutzero8825@gmail.com>
First post: 2013-05-07 19:51 -0500
Last post: 2013-05-07 23:49 -0500
Articles 14 — 8 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 19:51 -0500
    Re: Making safe file names Neil Hodgson <nhodgson@iinet.net.au> - 2013-05-08 11:28 +1000
      Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-07 21:45 -0400
        Re: Making safe file names Roy Smith <roy@panix.com> - 2013-05-07 22:21 -0400
      Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 21:20 -0500
      Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 21:06 -0500
      Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-08 00:10 -0400
      Re: Making safe file names albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-05-28 13:44 +0000
        Re: Making safe file names Chris Angelico <rosuav@gmail.com> - 2013-05-28 23:53 +1000
        Re: Making safe file names Grant Edwards <invalid@invalid.invalid> - 2013-05-28 16:03 +0000
    Re: Making safe file names Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-05-08 03:40 +0000
      Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-08 00:13 -0400
        Re: Making safe file names Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-05-08 04:47 +0000
      Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 23:49 -0500

#44923 — Re: Making safe file names

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2013-05-07 19:51 -0500
Subject: Re: Making safe file names
Message-ID: <mailman.1430.1367974288.3114.python-list@python.org>
On 2013.05.07 19:14, Dave Angel wrote:
> You also need to decide how to handle Unicode characters, since they're 
> different for different OS.  In Windows on NTFS, filenames are in 
> Unicode, while on Unix, filenames are bytes.  So on one of those, you 
> will be encoding/decoding if your code is to be mostly portable.
Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my
program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings
will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, but inside UTF-16/32 (UTF-16
is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings).
> Don't forget that ls and rm may not use the same encoding you're using. 
> So you may not consider it adequate to make the names legal, but you 
> may also want them to be easily typeable in the shell.
I don't understand. I have no intention of changing Unicode characters.


This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
not allowed on some file systems:
 \ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
settings and encodings aside, these 11 characters will need to be escaped.
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
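
A minimal sketch of the escaping described above, assuming a percent-style scheme (the thread does not say which scheme is used). The names `UNSAFE`, `escape_name` and `unescape_name` are hypothetical. The escape character itself is also escaped so that the mapping stays reversible.

```python
# Hypothetical helpers: the %XX scheme is an assumption, not the poster's method.
# The 11 characters listed in the post: \ / : * ? " < > | @ and NUL.
UNSAFE = set('\\/:*?"<>|@\0')
ESCAPE = "%"


def escape_name(name):
    """Replace each unsafe character (and the escape character) with %XX."""
    out = []
    for ch in name:
        if ch in UNSAFE or ch == ESCAPE:
            # All escaped characters are ASCII, so two hex digits suffice.
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def unescape_name(escaped):
    """Invert escape_name; raises ValueError on a malformed escape."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == ESCAPE:
            out.append(chr(int(escaped[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return "".join(out)
```

With this scheme `escape_name("AC/DC")` gives `"AC%2FDC"`, and non-ASCII characters pass through unchanged. It does not address reserved device names or case-insensitive file systems.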



#44928

From: Neil Hodgson <nhodgson@iinet.net.au>
Date: 2013-05-08 11:28 +1000
Message-ID: <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au>
In reply to: #44923
Andrew Berg:

> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
>   \ / : * ? " < > | @ and the NUL character
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be escaped.

    There's also the Windows device name hole. There may be trouble with 
artists named 'COM4', 'CLOCK$', 'Con', or similar.

http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29

    Neil



#44931

From: Dave Angel <davea@davea.name>
Date: 2013-05-07 21:45 -0400
Message-ID: <mailman.1435.1367977523.3114.python-list@python.org>
In reply to: #44928
On 05/07/2013 09:28 PM, Neil Hodgson wrote:
> Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily
>> accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>>   \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
>> and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be
>> escaped.
>
>     There's also the Windows device name hole. There may be trouble with
> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>

In MSDOS 2, there was a switch that would tell the OS to ignore such 
names unless they were prefixed by \DEV.  But like the switchar switch, 
it was largely ignored by the ignorant, and probably doesn't exist in 
current versions of M$OS

> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
>
>     Neil

While we're looking for trouble, there's also case insensitivity. 
Unclear if the user cares, but tom and TOM are the same file in most 
configurations of NT.

-- 
DaveA



#44934

From: Roy Smith <roy@panix.com>
Date: 2013-05-07 22:21 -0400
Message-ID: <roy-9FFDA1.22215307052013@news.panix.com>
In reply to: #44931
In article <mailman.1435.1367977523.3114.python-list@python.org>,
 Dave Angel <davea@davea.name> wrote:

> While we're looking for trouble, there's also case insensitivity. 
> Unclear if the user cares, but tom and TOM are the same file in most 
> configurations of NT.

OSX, too.



#44932

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2013-05-07 21:20 -0500
Message-ID: <mailman.1437.1367979624.3114.python-list@python.org>
In reply to: #44928
On 2013.05.07 20:45, Dave Angel wrote:
> While we're looking for trouble, there's also case insensitivity. 
> Unclear if the user cares, but tom and TOM are the same file in most 
> configurations of NT.
Artist names on Last.fm cannot differ only in case. This does remind me to make sure to update the case of the artist name as necessary,
though. For example, if Sam becomes SAM again (I have seen Last.fm change the case for artist names), I need to make sure that I don't end
up with two file names differing only in case.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
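
The "two file names differing only in case" check could be sketched as follows. The function name `find_case_collisions` is hypothetical, and `str.casefold()` is only an approximation of the comparison rules that NTFS or HFS+ actually apply, so treat this as a conservative pre-check rather than a guarantee.

```python
def find_case_collisions(names):
    """Return (first_seen, later) pairs of names that differ only in case."""
    seen = {}        # casefolded name -> first spelling encountered
    collisions = []
    for name in names:
        key = name.casefold()
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return collisions
```

For `["Sam", "SAM", "Tom"]` this reports the single pair `("Sam", "SAM")`, which is the situation to resolve by renaming the existing cache file.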



#44935

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2013-05-07 21:06 -0500
Message-ID: <mailman.1436.1367979189.3114.python-list@python.org>
In reply to: #44928
On 2013.05.07 20:28, Neil Hodgson wrote:
> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1



#44939

From: Dave Angel <davea@davea.name>
Date: 2013-05-08 00:10 -0400
Message-ID: <mailman.1440.1367986228.3114.python-list@python.org>
In reply to: #44928
On 05/07/2013 10:06 PM, Andrew Berg wrote:
> On 2013.05.07 20:28, Neil Hodgson wrote:
>> http://support.microsoft.com/kb/74496
>> http://en.wikipedia.org/wiki/Nul_%28band%29
> I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.
>

Won't help.  NUL.txt is just as reserved as NUL is.  Extensions are 
ignored in this particular piece of historical nonsense.


-- 
DaveA
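
A rough sketch of a check for the device-name problem, including the point that an extension does not help. The name list is an assumption based on the commonly documented Windows reserved names plus `CLOCK$` from this thread (which may not be reserved on current Windows), and `is_reserved_windows_name` is a hypothetical helper.

```python
# Assumed list of reserved device names; not taken from an authoritative API.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL", "CLOCK$"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)


def is_reserved_windows_name(filename):
    """True if the part before the first dot is a reserved device name."""
    # Extensions are ignored: NUL.txt is treated the same as NUL.
    stem = filename.split(".", 1)[0]
    # Comparison is case-insensitive, and trailing spaces are dropped.
    return stem.rstrip(" ").upper() in RESERVED
```

A name that matches would need a prefix or some other transformation of the stem before it is used as a file name; appending an extension alone is not enough.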



#46285

From: albert@spenarnc.xs4all.nl (Albert van der Horst)
Date: 2013-05-28 13:44 +0000
Message-ID: <51a4b4aa$0$6063$e4fe514c@dreader36.news.xs4all.nl>
In reply to: #44928
In article <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au>,
Neil Hodgson  <nhodgson@iinet.net.au> wrote:
>Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily
>> accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>>   \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
>> and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be escaped.
>
>    There's also the Windows device name hole. There may be trouble with
>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>
>http://support.microsoft.com/kb/74496

That applies to MS-DOS names. God forbid that this still holds on more modern
Microsoft operating systems?

>http://en.wikipedia.org/wiki/Nul_%28band%29
>
>    Neil
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst



#46288

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-05-28 23:53 +1000
Message-ID: <mailman.2297.1369749186.3114.python-list@python.org>
In reply to: #46285
On Tue, May 28, 2013 at 11:44 PM, Albert van der Horst
<albert@spenarnc.xs4all.nl> wrote:
> In article <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au>,
> Neil Hodgson  <nhodgson@iinet.net.au> wrote:
>>    There's also the Windows device name hole. There may be trouble with
>>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on more modern
> Microsoft operating systems?

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("com1","w").write("Test\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'com1'
>>> open("con","w").write("Test\n")
Test
5
>>>

ChrisA



#46304

From: Grant Edwards <invalid@invalid.invalid>
Date: 2013-05-28 16:03 +0000
Message-ID: <ko2kg3$9oh$4@reader1.panix.com>
In reply to: #46285
On 2013-05-28, Albert van der Horst <albert@spenarnc.xs4all.nl> wrote:

>> There's also the Windows device name hole. There may be trouble with
>> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on
> more modern Microsoft operating systems?

There are no more modern Microsoft operating systems.  Only more
recent ones.  There are still lots of reserved filenames in recent
versions of Windows.

-- 
Grant Edwards               grant.b.edwards        Yow! I've got an IDEA!!
                                  at               Why don't I STARE at you
                              gmail.com            so HARD, you forget your
                                                   SOCIAL SECURITY NUMBER!!



#44938

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-05-08 03:40 +0000
Message-ID: <5189c941$0$11094$c3e8da3@news.astraweb.com>
In reply to: #44923
On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:

> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS.  In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes.  So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my program
> will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings will
> generally be ignored. I haven't yet decided how I will handle artist
> names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is
> just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).

There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire 
Unicode range, unlike other encodings like Latin-1 or ASCII.

Well, that is to say, there may be characters that are not (yet) handled 
at all by Unicode, but there are no known legacy encodings that support 
such characters.

To a first approximation, Unicode covers the entire set of characters in 
human use, and for those which it does not, there is always the private 
use area. So for example, if you wish to record the Artist Formerly Known 
As "The Artist Formerly Known As Prince" as Love Symbol, you could pick 
an arbitrary private use code point, declare that for your application 
that code point means Love Symbol, and use that code point as the artist 
name. You could even come up with a custom font that includes a rendition 
of that character glyph.

However, there are byte combinations which are not valid UTF-8, which is 
a different story. If you're receiving bytes from (say) a file name, they 
may not necessarily make up a valid UTF-8 string. But this is not an 
issue if you are receiving data from something guaranteed to be valid 
UTF-8.


>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want them to be easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

Of course you do. You even talk below about Unicode characters like * 
and ? not being allowed on NTFS systems.

Perhaps you are thinking that there are a bunch of characters over here 
called "plain text ASCII characters", and a *different* bunch of 
characters with funny accents and stuff called "Unicode characters". If 
so, then you are labouring under a misapprehension, and you should start 
off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html


then come back with any questions.


> This is not a Unicode issue since (modern) file systems will happily
> accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
>  \ / : * ? " < > | @ and the NUL character

These are all Unicode characters too. Unicode is a subset of ASCII, so 
anything which is ASCII is also Unicode.


> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
> and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be
> escaped.

If you have an artist with control characters in their name, like newline 
or carriage return or NUL, I think it is fair to just drop the control 
characters and then give the artist a thorough thrashing with a halibut.

Does your mapping really need to be guaranteed reversible? If you have an 
artist called "JoeBlow", and another artist called "Joe\0Blow", and a 
third called "Joe\nBlow", does it *really* matter if your application 
conflates them?


-- 
Steven
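
The distinction drawn in this post between "characters outside UTF-8" (there are none) and "bytes that are not valid UTF-8" (there are plenty) can be illustrated with a small sketch. The helper name `decodes_as_utf8` and the sample byte strings are illustrative assumptions.

```python
def decodes_as_utf8(raw):
    """True if the byte string is a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Characters from across the Unicode range all encode to UTF-8,
# including a private use code point such as U+E000.
private_use = "\ue000".encode("utf-8")         # three bytes
highest = "\U0010ffff".encode("utf-8")         # four bytes

# The reverse is not guaranteed: these bytes are "Bjork" with an
# o-umlaut in UTF-8 and in Latin-1 respectively.
utf8_bytes = b"Bj\xc3\xb6rk"
latin1_bytes = b"Bj\xf6rk"
```

Here `decodes_as_utf8(utf8_bytes)` is true and `decodes_as_utf8(latin1_bytes)` is false, which is the case to watch for when names arrive as raw bytes from a file system rather than from a source guaranteed to be valid UTF-8.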



#44940

From: Dave Angel <davea@davea.name>
Date: 2013-05-08 00:13 -0400
Message-ID: <mailman.1441.1367986414.3114.python-list@python.org>
In reply to: #44938
On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>
>    <SNIP>
>
> These are all Unicode characters too. Unicode is a subset of ASCII, so
> anything which is ASCII is also Unicode.
>
>

Typo.  You meant  Unicode is a superset of ASCII.


-- 
DaveA



#44941

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-05-08 04:47 +0000
Message-ID: <5189d8dc$0$11094$c3e8da3@news.astraweb.com>
In reply to: #44940
On Wed, 08 May 2013 00:13:20 -0400, Dave Angel wrote:

> On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>>
>>    <SNIP>
>>
>> These are all Unicode characters too. Unicode is a subset of ASCII, so
>> anything which is ASCII is also Unicode.
>>
>>
>>
> Typo.  You meant  Unicode is a superset of ASCII.

Damn. Yes, you're right. I was thinking superset, but my fingers typed 
subset.

Thanks for the correction.


-- 
Steven



#44942

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2013-05-07 23:49 -0500
Message-ID: <mailman.1442.1367992489.3114.python-list@python.org>
In reply to: #44938
On 2013.05.07 22:40, Steven D'Aprano wrote:
> There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire 
> Unicode range, unlike other encodings like Latin-1 or ASCII.
You are correct. I'm not sure what I was thinking.

>> I don't understand. I have no intention of changing Unicode characters.
> 
> Of course you do. You even talk below about Unicode characters like * 
> and ? not being allowed on NTFS systems.
I worded that incorrectly. What I meant, of course, is that I intend to preserve as many characters as possible and have no need to stay
within ASCII.

> If you have an artist with control characters in their name, like newline 
> or carriage return or NUL, I think it is fair to just drop the control 
> characters and then give the artist a thorough thrashing with a halibut.
While the thrashing with a halibut may be warranted (though I personally would use a rubber chicken), conflicts are problematic.

> Does your mapping really need to be guaranteed reversible? If you have an 
> artist called "JoeBlow", and another artist called "Joe\0Blow", and a 
> third called "Joe\nBlow", does it *really* matter if your application 
> conflates them?
Yes and yes. Some artists like to be real cute with their names and make witch house artist names look tame in comparison, and some may
choose to use names similar to some very popular artists. I've also seen people scrobble fake artists with names that look like real artist
names (using things like a non-breaking space instead of a regular space) with different artist pictures in order to confuse and troll
people. If I could remember the user profiles with this, I'd link them. Last.fm is a silly place.
As I said before though, I don't think control characters are even allowed in artist names (likely for technical reasons).
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
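
To see why look-alike names must stay distinct, a short sketch using the standard `unicodedata` module can make the hidden difference visible. The helper names `describe` and `has_control_chars` and the sample artist names are illustrative only.

```python
import unicodedata


def describe(name):
    """List the Unicode name of each character (code point if unnamed)."""
    return [unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in name]


def has_control_chars(name):
    """True if the name contains a control character (category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


real = "Daft Punk"          # regular space, U+0020
fake = "Daft\u00a0Punk"     # non-breaking space, U+00A0
```

The two strings render almost identically but compare unequal, and `describe(fake)[4]` is `"NO-BREAK SPACE"` where `describe(real)[4]` is `"SPACE"`. Dropping or normalizing such characters would conflate the two artists, which is the conflict the post wants to avoid.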


