Groups > comp.lang.python > #44923 > unrolled thread
| Started by | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| First post | 2013-05-07 19:51 -0500 |
| Last post | 2013-05-07 23:49 -0500 |
| Articles | 14 — 8 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 19:51 -0500
Re: Making safe file names Neil Hodgson <nhodgson@iinet.net.au> - 2013-05-08 11:28 +1000
Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-07 21:45 -0400
Re: Making safe file names Roy Smith <roy@panix.com> - 2013-05-07 22:21 -0400
Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 21:20 -0500
Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 21:06 -0500
Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-08 00:10 -0400
Re: Making safe file names albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-05-28 13:44 +0000
Re: Making safe file names Chris Angelico <rosuav@gmail.com> - 2013-05-28 23:53 +1000
Re: Making safe file names Grant Edwards <invalid@invalid.invalid> - 2013-05-28 16:03 +0000
Re: Making safe file names Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-05-08 03:40 +0000
Re: Making safe file names Dave Angel <davea@davea.name> - 2013-05-08 00:13 -0400
Re: Making safe file names Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-05-08 04:47 +0000
Re: Making safe file names Andrew Berg <bahamutzero8825@gmail.com> - 2013-05-07 23:49 -0500
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-05-07 19:51 -0500 |
| Subject | Re: Making safe file names |
| Message-ID | <mailman.1430.1367974288.3114.python-list@python.org> |
On 2013.05.07 19:14, Dave Angel wrote:
> You also need to decide how to handle Unicode characters, since they're
> different for different OS. In Windows on NTFS, filenames are in
> Unicode, while on Unix, filenames are bytes. So on one of those, you
> will be encoding/decoding if your code is to be mostly portable.
Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings).
> Don't forget that ls and rm may not use the same encoding you're using.
> So you may not consider it adequate to make the names legal, but you
> may also want they easily typeable in the shell.
I don't understand. I have no intention of changing Unicode characters. This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems:
\ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped.
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
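One way to escape the 11 characters listed above while keeping the mapping reversible is a percent-style encoding. The sketch below is only an illustration: the `%XX` scheme and the names `UNSAFE`, `escape_name` and `unescape_name` are assumptions made for this example and do not appear anywhere in the thread.

```python
# Hypothetical helpers; the %XX scheme is an assumption, not the poster's code.
UNSAFE = set('\\/:*?"<>|@\0')   # the 11 characters listed in the post
ESCAPE_CHAR = '%'                # escaped as well, so decoding stays unambiguous


def escape_name(name: str) -> str:
    """Replace each unsafe character (and '%') with %XX, XX being its hex code."""
    return ''.join(
        '%{:02X}'.format(ord(ch)) if ch in UNSAFE or ch == ESCAPE_CHAR else ch
        for ch in name
    )


def unescape_name(escaped: str) -> str:
    """Invert escape_name(); expects input that escape_name() produced."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == ESCAPE_CHAR:
            # The two characters after '%' are the hex code of the original.
            out.append(chr(int(escaped[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return ''.join(out)
```

Because the escape character is escaped too, two different artist names never map to the same file name, and every other character, ASCII or not, passes through unchanged. Reserved device names and case-insensitive file systems are separate problems that this sketch does not address.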
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2013-05-08 11:28 +1000 |
| Message-ID | <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au> |
| In reply to | #44923 |
Andrew Berg:
> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be escaped.
There's also the Windows device name hole. There may be trouble with
artists named 'COM4', 'CLOCK$', 'Con', or similar.
http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29
Neil
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-07 21:45 -0400 |
| Message-ID | <mailman.1435.1367977523.3114.python-list@python.org> |
| In reply to | #44928 |
On 05/07/2013 09:28 PM, Neil Hodgson wrote:
> Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily
>> accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>> \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
>> and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be
>> escaped.
>
> There's also the Windows device name hole. There may be trouble with
> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>
In MSDOS 2, there was a switch that would tell the OS to ignore such names unless they were prefixed by \DEV. But like the switchar switch, it was largely ignored by the ignorant, and probably doesn't exist in current versions of M$OS
> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
>
> Neil
While we're looking for trouble, there's also case insensitivity. Unclear if the user cares, but tom and TOM are the same file in most configurations of NT.
--
DaveA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-05-07 22:21 -0400 |
| Message-ID | <roy-9FFDA1.22215307052013@news.panix.com> |
| In reply to | #44931 |
In article <mailman.1435.1367977523.3114.python-list@python.org>, Dave Angel <davea@davea.name> wrote:
> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but tom and TOM are the same file in most
> configurations of NT.
OSX, too.
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-05-07 21:20 -0500 |
| Message-ID | <mailman.1437.1367979624.3114.python-list@python.org> |
| In reply to | #44928 |
On 2013.05.07 20:45, Dave Angel wrote:
> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but tom and TOM are the same file in most
> configurations of NT.
Artist names on Last.fm cannot differ only in case. This does remind me to make sure to update the case of the artist name as necessary, though. For example, if Sam becomes SAM again (I have seen Last.fm change the case for artist names), I need to make sure that I don't end up with two file names differing only in case.
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
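A rough way to spot names that would clash on a case-insensitive file system is to group them by a case-folded key. This is a sketch under assumptions: `case_collisions` is a hypothetical helper, and `str.casefold()` only approximates the folding rules that NTFS or the OS X file system actually apply.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only in case; return the groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        # casefold() is Python's Unicode-aware, more aggressive form of lower().
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For the case described above, `case_collisions(['Sam', 'SAM', 'Tom'])` puts `'Sam'` and `'SAM'` in the same group, which signals that the existing cache file should be renamed instead of a second one being created.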
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-05-07 21:06 -0500 |
| Message-ID | <mailman.1436.1367979189.3114.python-list@python.org> |
| In reply to | #44928 |
On 2013.05.07 20:28, Neil Hodgson wrote:
> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-08 00:10 -0400 |
| Message-ID | <mailman.1440.1367986228.3114.python-list@python.org> |
| In reply to | #44928 |
On 05/07/2013 10:06 PM, Andrew Berg wrote:
> On 2013.05.07 20:28, Neil Hodgson wrote:
>> http://support.microsoft.com/kb/74496
>> http://en.wikipedia.org/wiki/Nul_%28band%29
> I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.
>
Won't help. NUL.txt is just as reserved as NUL is. Extensions are ignored in this particular piece of historical nonsense.
--
DaveA
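The point that the extension does not help can be turned into a small check. The sketch below is illustrative only: `RESERVED` and `is_reserved_name` are hypothetical names, and the list (the classic DOS device names plus the CLOCK$ mentioned earlier in the thread) is an assumption that may not match every Windows version.

```python
# Assumed list: classic DOS device names plus CLOCK$ from earlier in the thread.
RESERVED = (
    {'CON', 'PRN', 'AUX', 'NUL', 'CLOCK$'}
    | {'COM%d' % n for n in range(1, 10)}
    | {'LPT%d' % n for n in range(1, 10)}
)


def is_reserved_name(filename: str) -> bool:
    """True if the part before the first dot is a device name, ignoring case."""
    stem = filename.split('.', 1)[0]
    # Trailing spaces are stripped too, on the assumption that Windows
    # disregards them when matching device names.
    return stem.rstrip(' ').upper() in RESERVED
```

With this check, `is_reserved_name('NUL.txt')` and `is_reserved_name('con')` are both true, while a name such as `'Nullsoft'` is not flagged. A name that is flagged could then go through whatever escaping scheme the program already uses for illegal characters.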
| From | albert@spenarnc.xs4all.nl (Albert van der Horst) |
|---|---|
| Date | 2013-05-28 13:44 +0000 |
| Message-ID | <51a4b4aa$0$6063$e4fe514c@dreader36.news.xs4all.nl> |
| In reply to | #44928 |
In article <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au>, Neil Hodgson <nhodgson@iinet.net.au> wrote:
>Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>> \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be escaped.
>
> There's also the Windows device name hole. There may be trouble with
>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>
>http://support.microsoft.com/kb/74496
That applies to MS-DOS names. God forbid that this still holds on more modern Microsoft operating systems?
>http://en.wikipedia.org/wiki/Nul_%28band%29
>
> Neil
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-05-28 23:53 +1000 |
| Message-ID | <mailman.2297.1369749186.3114.python-list@python.org> |
| In reply to | #46285 |
On Tue, May 28, 2013 at 11:44 PM, Albert van der Horst
<albert@spenarnc.xs4all.nl> wrote:
> In article <Lvydneajg7LXNhTMnZ2dnUVZ_rKdnZ2d@westnet.com.au>,
> Neil Hodgson <nhodgson@iinet.net.au> wrote:
>> There's also the Windows device name hole. There may be trouble with
>>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on more modern
> Microsoft operating systems?
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("com1","w").write("Test\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'com1'
>>> open("con","w").write("Test\n")
Test
5
>>>
ChrisA
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2013-05-28 16:03 +0000 |
| Message-ID | <ko2kg3$9oh$4@reader1.panix.com> |
| In reply to | #46285 |
On 2013-05-28, Albert van der Horst <albert@spenarnc.xs4all.nl> wrote:
>> There's also the Windows device name hole. There may be trouble with
>> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on
> more modern Microsoft operating systems?
There are no more modern Microsoft operating systems. Only more
recent ones. There are still lots of reserved filenames in recent
versions of Windows.
--
Grant Edwards (grant.b.edwards at gmail.com)
Yow! I've got an IDEA!! Why don't I STARE at you so HARD, you forget your SOCIAL SECURITY NUMBER!!
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-05-08 03:40 +0000 |
| Message-ID | <5189c941$0$11094$c3e8da3@news.astraweb.com> |
| In reply to | #44923 |
On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:
> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS. In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes. So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my program
> will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings will
> generally be ignored. I haven't yet decided how I will handle artist
> names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is
> just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).
There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire Unicode range, unlike other encodings like Latin-1 or ASCII.
Well, that is to say, there may be characters that are not (yet) handled at all by Unicode, but there are no known legacy encodings that support such characters. To a first approximation, Unicode covers the entire set of characters in human use, and for those which it does not, there is always the private use area.
So for example, if you wish to record the Artist Formerly Known As "The Artist Formerly Known As Prince" as Love Symbol, you could pick an arbitrary private use code point, declare that for your application that code point means Love Symbol, and use that code point as the artist name. You could even come up with a custom font that includes a rendition of that character glyph.
However, there are byte combinations which are not valid UTF-8, which is a different story. If you're receiving bytes from (say) a file name, they may not necessarily make up a valid UTF-8 string. But this is not an issue if you are receiving data from something guaranteed to be valid UTF-8.
>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want they easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.
Of course you do. You even talk below about Unicode characters like * and ? not being allowed on NTFS systems.
Perhaps you are thinking that there are a bunch of characters over here called "plain text ASCII characters", and a *different* bunch of characters with funny accents and stuff called "Unicode characters". If so, then you are labouring under a misapprehension, and you should start off by reading this:
http://www.joelonsoftware.com/articles/Unicode.html
then come back with any questions.
> This is not a Unicode issue since (modern) file systems will happily
> accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character
These are all Unicode characters too. Unicode is a subset of ASCII, so anything which is ASCII is also Unicode.
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
> and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be
> escaped.
If you have an artist with control characters in their name, like newline or carriage return or NUL, I think it is fair to just drop the control characters and then give the artist a thorough thrashing with a halibut.
Does your mapping really need to be guaranteed reversible? If you have an artist called "JoeBlow", and another artist called "Joe\0Blow", and a third called "Joe\nBlow", does it *really* matter if your application conflates them?
--
Steven
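The two UTF-8 points made in this post, that every code point can be encoded but not every byte string can be decoded, are easy to check in Python 3. This is a minimal sketch: the private use code point U+E000 is an arbitrary choice for illustration, and `is_valid_utf8` is a hypothetical helper name.

```python
# Any code point can be encoded as UTF-8, including ones outside the BMP.
clef = '\U0001D11E'                  # MUSICAL SYMBOL G CLEF
encoded = clef.encode('utf-8')       # four bytes
assert encoded.decode('utf-8') == clef

# A private use code point standing in for an application-defined symbol.
LOVE_SYMBOL = '\uE000'               # arbitrary; meaningful only to this program
assert LOVE_SYMBOL.encode('utf-8').decode('utf-8') == LOVE_SYMBOL


# The reverse does not hold: not every byte string is valid UTF-8.
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True
```

Here `is_valid_utf8(b'caf\xc3\xa9')` is true, while the Latin-1 bytes `b'caf\xe9'` are rejected, which is the situation that can arise when file names come from a system using a legacy encoding.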
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-08 00:13 -0400 |
| Message-ID | <mailman.1441.1367986414.3114.python-list@python.org> |
| In reply to | #44938 |
On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>
> <SNIP>
>
> These are all Unicode characters too. Unicode is a subset of ASCII, so
> anything which is ASCII is also Unicode.
>
Typo. You meant Unicode is a superset of ASCII.
--
DaveA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-05-08 04:47 +0000 |
| Message-ID | <5189d8dc$0$11094$c3e8da3@news.astraweb.com> |
| In reply to | #44940 |
On Wed, 08 May 2013 00:13:20 -0400, Dave Angel wrote:
> On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>>
>> <SNIP>
>>
>> These are all Unicode characters too. Unicode is a subset of ASCII, so
>> anything which is ASCII is also Unicode.
>>
> Typo. You meant Unicode is a superset of ASCII.
Damn. Yes, you're right. I was thinking superset, but my fingers typed subset. Thanks for the correction.
--
Steven
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2013-05-07 23:49 -0500 |
| Message-ID | <mailman.1442.1367992489.3114.python-list@python.org> |
| In reply to | #44938 |
On 2013.05.07 22:40, Steven D'Aprano wrote:
> There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire
> Unicode range, unlike other encodings like Latin-1 or ASCII.
You are correct. I'm not sure what I was thinking.
>> I don't understand. I have no intention of changing Unicode characters.
>
> Of course you do. You even talk below about Unicode characters like *
> and ? not being allowed on NTFS systems.
I worded that incorrectly. What I meant, of course, is that I intend to preserve as many characters as possible and have no need to stay within ASCII.
> If you have an artist with control characters in their name, like newline
> or carriage return or NUL, I think it is fair to just drop the control
> characters and then give the artist a thorough thrashing with a halibut.
While the thrashing with a halibut may be warranted (though I personally would use a rubber chicken), conflicts are problematic.
> Does your mapping really need to be guaranteed reversible? If you have an
> artist called "JoeBlow", and another artist called "Joe\0Blow", and a
> third called "Joe\nBlow", does it *really* matter if your application
> conflates them?
Yes and yes. Some artists like to be real cute with their names and make witch house artist names look tame in comparison, and some may choose to use names similar to some very popular artists. I've also seen people scrobble fake artists with names that look like real artist names (using things like a non-breaking space instead of a regular space) with different artist pictures in order to confuse and troll people. If I could remember the user profiles with this, I'd link them. Last.fm is a silly place.
As I said before though, I don't think control characters are even allowed in artist names (likely for technical reasons).
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
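The conflation problem discussed here can be shown with a few lines of Python. This is only an illustration of why a lossy cleanup is not reversible: `drop_controls` is a hypothetical helper, and the artist names are the made-up examples from the thread.

```python
import unicodedata


def drop_controls(name: str) -> str:
    """Lossy cleanup: remove every character in Unicode category Cc (control)."""
    return ''.join(ch for ch in name if unicodedata.category(ch) != 'Cc')


names = ['JoeBlow', 'Joe\0Blow', 'Joe\nBlow']
cleaned = {drop_controls(n) for n in names}   # all three collapse to one name

# Compatibility normalization conflates look-alikes in a similar way.
nbsp_name = 'Joe\u00a0Blow'                   # non-breaking space
plain_name = 'Joe Blow'                       # regular space
folded_equal = unicodedata.normalize('NFKC', nbsp_name) == plain_name
```

After this runs, `cleaned` holds a single name and `folded_equal` is true, so either transformation would merge artists that the service treats as distinct. An escaping scheme that maps each problem character to a unique sequence avoids that.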