
Re: Local browsing

Date 2012-09-25 23:04 +0100
From Matthew Phillips <spam2011m@yahoo.co.uk>
Newsgroups comp.sys.acorn.misc
Subject Re: Local browsing
Message-ID <18f4f1d452.Matthew@sinenomine.freeserve.co.uk>
References <na.9e67bb52c9.a806e0tennant@orpheusmail.co.uk> <785386c952.graham@durain.demon.co.uk> <na.f8edb552c9.a806e0tennant@orpheusmail.co.uk> <fa5519d452.Matthew@sinenomine.freeserve.co.uk> <na.fad50152d4.a806e0tennant@orpheusmail.co.uk>



In message <na.fad50152d4.a806e0tennant@orpheusmail.co.uk>
 on 25 Sep 2012 Tennant Stuart  wrote:

> In article <fa5519d452.Matthew@sinenomine.freeserve.co.uk>,
> Matthew Phillips <spam2011m@yahoo.co.uk> wrote:
> 
> > Browsers vary as to how much they handle conversion of characters which
> > should not be present in links. Did you inform NetSurf which character
> > set was in use in the HTML file by using the correct meta tag attributes
> > in the head of the file, or by serving it from a web server which gave
> > the character set in the HTTP response?
> 
> > And if you're expecting it to cope with the bullet, when the default
> > character set for HTML documents is ISO Latin 1, which does not have a
> > bullet character, think again! That's certainly going to be a low
> > priority for the developers.
> 
> For heaven's sake Matthew, this is the third time I've had to explain that
> it doesn't matter what the characters are or how they should or should not
> be displayed in a document - all that matters is that the file is fetched.

I realise that that is what you think.  I have been trying to explain another
point of view, informed by actual knowledge of the standards.  You have
asserted higher up in the thread that NetSurf is not following the standards,
but I would like to explain where you are not following them.

> A particular byte value can represent a bullet to one user, a certain
> punctuation mark to another, or even a new letter in some new language
> yet IT SIMPLY DOES NOT MATTER - the bytes in the link match the bytes
> in the file name, so fetch the file without changing any bytes.

Yes, it might seem that would be the nicest solution.  But you have to bear
in mind that these characters are simply not allowed to appear in URLs in
the first place.  I can see you arguing that NetSurf should treat "file:"
URLs specially and just not worry about illegal characters.  As it is, when
you have not percent-encoded them yourself, NetSurf does try to handle the
illegal characters.  Where it perhaps falls down is in decoding them again
at a later stage in the process.

Any characters above 127 (whether in the ISO Latin 1 character set, the Acorn
Latin 1 extensions, or in Unicode) have to be encoded with a percent sign. 
See http://en.wikipedia.org/wiki/Uniform_resource_locator which includes a
link to RFC 3986 where you can read the full details.

For example, lower-case e-acute, character 233 (or E9 hex) can be represented
as %E9 in a URL.  The bullet, which is not part of ISO Latin 1, is trickier,
as the standards mandate that it should be translated into Unicode and
represented in UTF-8, with each byte being percent-encoded.  In fact, since
the publication of RFC 3986 in 2005, new URI schemes have had to do this
across the board, so e-acute would be %C3%A9.  The bullet would be %E2%80%A2
if encoded correctly in Unicode.  (The fact that NetSurf is not recognising
what the bullet character is, is a complication.)
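
The encodings above can be checked with a short Python sketch.  It uses
only the standard library's urllib.parse.quote; the variable names are
invented for illustration, and "latin-1" stands in for the old 8-bit
encoding.

```python
from urllib.parse import quote

e_acute = "\u00e9"   # lower-case e-acute, character 233 (E9 hex)
bullet = "\u2022"    # the bullet, which ISO Latin 1 does not contain

# Old-style 8-bit encoding: one byte, one percent escape.
legacy_e_acute = quote(e_acute, encoding="latin-1")

# UTF-8 encoding (the default for quote): each byte gets its own escape.
utf8_e_acute = quote(e_acute)
utf8_bullet = quote(bullet)

print(legacy_e_acute)  # %E9
print(utf8_e_acute)    # %C3%A9
print(utf8_bullet)     # %E2%80%A2
```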

Bear with me, I am getting to the crux of the matter.

So, URLs may not contain any characters other than plain upper or lower case
roman alphabet with no accents, digits, and various punctuation characters,
all of which can be found on a standard UK keyboard.  Any characters beyond
those, and some of the punctuation, such as the pound sign, have to be
encoded in all circumstances.

OK, those are the rules for URLs.

What you did not make clear initially was that you were trying to follow
file:/// links in an HTML file, rather than just browsing the hard drive with
NetSurf.  So as you (or some script or utility) are creating the HTML file,
we now have to turn to the rules about what may appear in HTML.

The URL is placed in the href attribute in the HTML <a> tag.  According to
the HTML standards, the following process should take place:

1. Take a validly structured URL with any characters which have to be percent
encoded already encoded.

2. Encode the URL further to ensure any characters which are special to HTML
are encoded.  This mainly means converting & to &amp; as the other awkward
characters have already been encoded out of the way with percentages.

3. Place in your HTML file.
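
The three steps above might be sketched in Python as follows.  This is an
illustration only: make_link and the example.com address are made up for
the purpose, and the percent-encoding uses UTF-8, as the post-2005 rules
expect.

```python
from html import escape
from urllib.parse import quote

def make_link(url_prefix, raw_name, label):
    """Build an <a> tag following the three steps (hypothetical helper)."""
    # Step 1: percent-encode the part that may hold disallowed characters.
    url = url_prefix + quote(raw_name, safe="")
    # Step 2: escape characters special to HTML, mainly & becoming &amp;.
    href = escape(url, quote=True)
    # Step 3: place the result in the HTML.
    return '<a href="{}">{}</a>'.format(href, escape(label))

link = make_link("http://example.com/find?a=1&q=", "Resum\u00e9", "Link 2")
print(link)
# <a href="http://example.com/find?a=1&amp;q=Resum%C3%A9">Link 2</a>
```

Note that the ampersand is only escaped in step 2, after the URL itself is
already valid.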

So, the point is, having an HTML file in which an href attribute contains
unencoded characters with values above 127 is undoubtedly NOT complying with
the HTML standards.  That's not to say you won't meet such HTML in the wild,
of course.

Different browsers behave differently when they meet HTML which is faulty,
like the examples you have given.  Some browsers may well take the disallowed
characters and quietly percent-encode them for you before submitting the URL
to the web server (or in this case, looking for the file on the disc). 
Others may drop the characters completely.  What NetSurf does is take any
out-of-range characters and encode them in a UTF-8 representation.

If you have the following HTML:

<a href="file:///RAM::RamDisc0/$/ResumX">Link 2</a><br>

(where the X is actually an e-acute, in the file) then NetSurf will generate
the URL file:///RAM::RamDisc0/$/Resum%C3%A9

I guess it does this rather than file:///RAM::RamDisc0/$/Resum%E9 because
most of the web servers out there are expecting UTF-8 these days, and it
uses the same strategy for coping with http links which will be sent off to
a web server for retrieval.  There is an argument which says that %E9 is
unambiguous, while %C3%A9 could mean capital-A-tilde followed by
copyright-sign, but hey, you've fed it illegal HTML so it's making the best
of a bad job.

Unfortunately for you, the bit of the browser which interprets file: URLs is
expecting the characters in the old-fashioned 8-bit encoding, so it does
actually look for a file called ResumXY where X is an upper-case-A-tilde and
Y is the copyright symbol.  You can prove this by creating such a file and
you will see it displayed in the browser.  This behaviour of the file:
handler ties in with the way NetSurf builds browsable views of directories on
disc, so that makes sense.  The unfortunate aspect is the way it's out of
step with its own default behaviour for illegally-formed URLs found in HTML.
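
The mismatch can be modelled in a few lines of Python.  This is not
NetSurf's code, just a sketch of the two stages as described above, with
"latin-1" assumed as a stand-in for the 8-bit encoding the file: handler
expects.

```python
from urllib.parse import quote, unquote

name_in_html = "Resum\u00e9"    # raw e-acute in the href

# Stage 1: the illegal character is percent-encoded as UTF-8.
generated = quote(name_in_html)

# Stage 2: the file: handler decodes the escapes as single 8-bit characters.
looked_up = unquote(generated, encoding="latin-1")

print(generated)                  # Resum%C3%A9
print(looked_up)                  # Resum followed by A-tilde and copyright
print(looked_up == name_in_html)  # False: the wrong file name is sought
```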

We might perhaps argue, therefore, that for file: URLs NetSurf should encode
the illegal characters in a generic 8-bit rather than UTF-8 format.  But
remember that NetSurf is doing you a favour handling these illegal characters
at all.  They just should not be present.  You can avoid the problem entirely
by making your source code say:

<a href="file:///RAM::RamDisc0/$/Resum%E9">Link 2</a><br>
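
If the HTML is produced by a script, a fix along these lines could be
applied before the file is written.  It is a rough sketch under several
assumptions: encode_hrefs is a made-up name, href values are taken to be
double-quoted, and "latin-1" again stands in for the 8-bit encoding.  A
character outside that encoding, such as the bullet, makes it raise
UnicodeEncodeError rather than guess.

```python
import re
from urllib.parse import quote

# Matches href="..." with a double-quoted value (an assumption of this sketch).
HREF = re.compile(r'(href=")([^"]*)(")')

# Punctuation left untouched; "%" is included so that escapes which are
# already present are not encoded a second time.
SAFE = "/:$%?&=#;+,@!*'()~-._"

def encode_hrefs(html_text, charset="latin-1"):
    """Percent-encode raw top-bit-set characters inside href attributes."""
    def fix(match):
        url = quote(match.group(2), safe=SAFE, encoding=charset)
        return match.group(1) + url + match.group(3)
    return HREF.sub(fix, html_text)

before = '<a href="file:///RAM::RamDisc0/$/Resum\u00e9">Link 2</a><br>'
after = encode_hrefs(before)
print(after)
# <a href="file:///RAM::RamDisc0/$/Resum%E9">Link 2</a><br>
```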

I realise it would be nice for you if NetSurf behaved differently in this
respect, and you might try asking nicely to see if the developers will
accommodate this need.  But remember that the directory browsing is probably
rather RISC OS specific (most of the other operating systems NetSurf
targets have supported characters outwith the 8-bit range for a long time
now).  And please don't bang on about how the browser ought to behave when
you are feeding it duff HTML in the first place.

You have not said where the HTML file containing the problematic file:///
links has come from.  Are you creating it yourself or is a utility creating
it?

The quickest solution for you is to get those characters percent-encoded, and
do please consider withdrawing or amending the NetSurf bug report.  I see you
have added a comment referring them to this thread (though not as a
hyperlink).  I would be happy for you to paste relevant parts of this posting
into the comments to explain better what your issue is.

At present your issue report does not make it clear that you have placed a
link in an HTML file and that the link has raw top-bit-set characters in it. 
That's crucial to understanding what your problem is.

I hope that helps.  I am not going to explain any further!

-- 
Matthew Phillips
Durham


