| Date | 2012-09-25 23:04 +0100 |
|---|---|
| From | Matthew Phillips <spam2011m@yahoo.co.uk> |
| Newsgroups | comp.sys.acorn.misc |
| Subject | Re: Local browsing |
| Message-ID | <18f4f1d452.Matthew@sinenomine.freeserve.co.uk> |
| References | <na.9e67bb52c9.a806e0tennant@orpheusmail.co.uk> <785386c952.graham@durain.demon.co.uk> <na.f8edb552c9.a806e0tennant@orpheusmail.co.uk> <fa5519d452.Matthew@sinenomine.freeserve.co.uk> <na.fad50152d4.a806e0tennant@orpheusmail.co.uk> |
In message <na.fad50152d4.a806e0tennant@orpheusmail.co.uk>
on 25 Sep 2012 Tennant Stuart wrote:

> In article <fa5519d452.Matthew@sinenomine.freeserve.co.uk>,
> Matthew Phillips <spam2011m@yahoo.co.uk> wrote:
>
> > Browsers vary as to how much they handle conversion of characters which
> > should not be present in links. Did you inform NetSurf which character
> > set was in use in the HTML file by using the correct meta tag attributes
> > in the head of the file, or by serving it from a web server which gave
> > the character set in the HTTP response?
>
> > And if you're expecting it to cope with the bullet, when the default
> > character set for HTML documents is ISO Latin 1, which does not have a
> > bullet character, think again! That's certainly going to be a low
> > priority for the developers.
>
> For heaven's sake Matthew, this is the third time I've had to explain that
> it doesn't matter what the characters are or how they should or should not
> be displayed in a document - all that matters is that the file is fetched.

I realise that that is what you think. I have been trying to explain another point of view, informed by actual knowledge of the standards. You have asserted higher up in the thread that NetSurf is not following the standards, but I would like to explain where you are not following them.

> A particular byte value can represent a bullet to one user, a certain
> punctuation mark to another, or even a new letter in some new language
> yet IT SIMPLY DOES NOT MATTER - the bytes in the link match the bytes
> in the file name, so fetch the file without changing any bytes.

Yes, it might seem that would be the nicest solution. But you have to bear in mind that these characters just are not allowed to appear in URLs in the first place.

I can see you arguing that NetSurf should behave specially for "file:" URLs and just not worry about illegal characters, but if you have not encoded them as you want them, NetSurf does try to handle the illegal characters. Where it perhaps falls down is in unencoding them at a later stage in the process.

Any characters above 127 (whether in the ISO Latin 1 character set, the Acorn Latin 1 extensions, or in Unicode) have to be encoded with a percent sign. See http://en.wikipedia.org/wiki/Uniform_resource_locator which includes a link to RFC 3986 where you can read the full details. For example, lower-case e-acute, character 233 (or E9 hex), can be represented as %E9 in a URL. The bullet, which is not part of ISO Latin 1, is trickier, as the standards mandate that it should be translated into Unicode and represented in UTF-8, with each byte being percent-encoded. In fact, since 2005 and the publication of RFC 3986, new URI schemes have had to do this across the board, so e-acute would be %C3%A9. The bullet would be %E2%80%A2 if encoded correctly in Unicode. (The fact that NetSurf is not recognising what the bullet character is, is a complication.)

Bear with me, I am getting to the crux of the matter.

So, URLs may not contain any characters other than plain upper or lower case roman alphabet with no accents, digits, and various punctuation characters, all of which can be found on a standard UK keyboard. Any characters beyond those, and some of the punctuation, such as the pound sign, have to be encoded in all circumstances. OK, those are the rules for URLs.

What you did not make clear initially was that you were trying file:/// links appearing in an HTML file, rather than just browsing the hard drive with NetSurf. So as you (or some script or utility) are creating the HTML file, we now have to turn to the rules about what may appear in HTML. The URL is placed in the href attribute in the HTML <a> tag.
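
To make the percent-encodings quoted above concrete, here is a minimal Python sketch. It is only an illustration and has nothing to do with NetSurf's own code; the helper name `percent_encode` is made up for the example, and it relies on the standard library's `urllib.parse.quote`.

```python
from urllib.parse import quote

E_ACUTE = "\u00e9"   # lower-case e-acute, character 233 (E9 hex) in ISO Latin 1
BULLET = "\u2022"    # Unicode bullet, which has no ISO Latin 1 code point


def percent_encode(text, encoding):
    """Percent-encode every byte of `text` after converting it to `encoding`.

    Hypothetical helper for illustration: safe="" means that no
    punctuation is exempted from encoding.
    """
    return quote(text, safe="", encoding=encoding)


latin1_e = percent_encode(E_ACUTE, "latin-1")    # single byte E9
utf8_e = percent_encode(E_ACUTE, "utf-8")        # two bytes C3 A9
utf8_bullet = percent_encode(BULLET, "utf-8")    # three bytes E2 80 A2

# The bullet cannot be expressed in ISO Latin 1 at all.
try:
    percent_encode(BULLET, "latin-1")
    bullet_fits_latin1 = True
except UnicodeEncodeError:
    bullet_fits_latin1 = False

print(latin1_e, utf8_e, utf8_bullet, bullet_fits_latin1)
```

The three encoded strings correspond to the %E9, %C3%A9 and %E2%80%A2 forms given above, and the failed Latin 1 conversion is why the bullet has to go via UTF-8.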

According to the HTML standards, the following process should take place:

1. Take a validly structured URL with any characters which have to be percent encoded already encoded.

2. Encode the URL further to ensure any characters which are special to HTML are encoded. This mainly means converting & to &amp; as the other awkward characters have already been encoded out of the way with percentages.

3. Place in your HTML file.

So, the point is, having an HTML file in which an href attribute contains unencoded characters with values above 127 is undoubtedly NOT complying with the HTML standards. That's not to say you won't meet such HTML in the wild, of course.

Different browsers behave differently when they meet HTML which is faulty, like the examples you have given. Some browsers may well take the disallowed characters and quietly percent-encode them for you before submitting the URL to the web server (or in this case, looking for the file on the disc). Others may drop the characters completely. What NetSurf does is take any out-of-range characters and encode them in a UTF-8 representation. If you have the following HTML:

  <a href="file:///RAM::RamDisc0/$/ResumX">Link 2</a><br>

(where the X is actually an e-acute, in the file) then NetSurf will generate the URL

  file:///RAM::RamDisc0/$/Resum%C3%A9

I guess it does this rather than

  file:///RAM::RamDisc0/$/Resum%E9

because most of the web servers out there are expecting UTF-8 these days, and it uses the same strategy for coping with http links which will be sent off to a web server for retrieval. There is an argument which says that %E9 is unambiguous, while %C3%A9 could mean capital-A-tilde followed by copyright-sign, but hey, you've fed it illegal HTML so it's making the best of a bad job.
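
The three-step process for putting a URL into an href attribute can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of any particular tool: `make_link` is a hypothetical helper, the default 8-bit encoding is chosen only to match the RISC OS case discussed here, and the `example.com` URL is invented purely to show the & to &amp; conversion.

```python
from html import escape
from urllib.parse import quote


def make_link(url_prefix, leaf_name, label, encoding="latin-1"):
    """Build an <a> element for a file whose leaf name may contain
    characters above 127. Hypothetical helper for illustration only."""
    # Step 1: percent-encode everything in the leaf name that is not
    # allowed to appear raw in a URL.
    url = url_prefix + quote(leaf_name, safe="", encoding=encoding)
    # Step 2: escape characters which are special to HTML (mainly &).
    href = escape(url, quote=True)
    # Step 3: place the result in the HTML.
    return '<a href="{}">{}</a>'.format(href, escape(label))


link = make_link("file:///RAM::RamDisc0/$/", "Resum\u00e9", "Link 2")
print(link)

# Step 2 on its own, using an invented URL that contains an ampersand.
escaped_query = escape("http://example.com/?a=1&b=2", quote=True)
print(escaped_query)
```

With the defaults assumed here the first call produces a link ending in Resum%E9; passing encoding="utf-8" would give Resum%C3%A9 instead.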

Unfortunately for you, the bit of the browser which interprets file: URLs is expecting the characters in the old-fashioned 8-bit encoding, so it does actually look for a file called ResumXY where X is an upper-case-A-tilde and Y is the copyright symbol. You can prove this by creating such a file and you will see it displayed in the browser.

This behaviour of the file: handler ties in with the way NetSurf builds browsable views of directories on disc, so that makes sense. The unfortunate aspect is the way it's out of step with its own default behaviour for illegally-formed URLs found in HTML.

We might perhaps argue, therefore, that for file: URLs NetSurf should encode the illegal characters in a generic 8-bit rather than UTF-8 format. But remember that NetSurf is doing you a favour handling these illegal characters at all. They just should not be present. You can avoid the problem entirely by making your source code say:

  <a href="file:///RAM::RamDisc0/$/Resum%E9">Link 2</a><br>

I realise it would be nice for you if NetSurf behaved differently in this respect, and you might try asking nicely to see if the developers will accommodate this need. But remember that the directory browsing is probably rather RISC OS specific (most of the other operating systems NetSurf targets have supported characters outwith the 8-bit range for a long time now). And please don't bang on about how the browser ought to behave when you are feeding it duff HTML in the first place.

You have not said where the HTML file containing the problematic file:/// links has come from. Are you creating it yourself or is a utility creating it? The quickest solution for you is to get those characters percent-encoded, and do please consider withdrawing or amending the NetSurf bug report. I see you have added a comment referring them to this thread (though not as a hyperlink). I would be happy for you to paste relevant parts of this posting into the comments to explain better what your issue is.
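
The mismatch described in this post, where the URL is generated in UTF-8 but the file: handler reads the bytes as an 8-bit encoding, can be imitated in a few lines of Python. This is a sketch of the effect only, using ISO Latin 1 as a stand-in for the 8-bit alphabet; it is not NetSurf's code and the variable names are invented.

```python
from urllib.parse import quote, unquote

leaf_name = "Resum\u00e9"   # the real file name, ending in e-acute

# What gets generated for a raw e-acute found in an href: UTF-8,
# percent-encoded.
generated = quote(leaf_name, safe="", encoding="utf-8")

# What an 8-bit file: handler would then go looking for.
looked_up = unquote(generated, encoding="latin-1")

# The workaround: percent-encode the 8-bit value in the HTML source,
# so that decoding as 8-bit gives back the real name.
workaround = unquote("Resum%E9", encoding="latin-1")

print(generated)
print(looked_up == leaf_name)    # the lookup misses the real file
print(workaround == leaf_name)   # the workaround finds it
```

The name looked up in the first case ends in capital-A-tilde followed by the copyright sign, which is the ResumXY file mentioned above.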

At present your issue report does not make it clear that you have placed a link in an HTML file and that the link has raw top-bit-set characters in it. That's crucial to understanding what your problem is.

I hope that helps. I am not going to explain any further!

-- 
Matthew Phillips
Durham