| From | markspace <markspace@nospam.nospam> |
|---|---|
| Newsgroups | comp.lang.java.help |
| Subject | Re: File Gotchas |
| Date | 2013-03-17 11:17 -0700 |
| Organization | A noiseless patient Spider |
| Message-ID | <ki5198$84o$1@dont-email.me> (permalink) |
| References | <t9cak8lc0kuljdsgges7rkpl98hrj5bt8b@4ax.com> <18nv51lbot90m.5vhsbw89ogpt$.dlg@40tude.net> <ggubk8tunlh9h2e040dvja8mmfqvg819a1@4ax.com> |
On 3/17/2013 10:20 AM, Roedy Green wrote:
> On Sun, 17 Mar 2013 13:56:20 +0100, Joerg Meier <joergmmeier@arcor.de>
> wrote, quoted or indirectly quoted someone who said :
>
>> What website ? Now websites are involved ? Not really sure whats going on
>> here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
>> E:/ or E:/mindprod/jgloss ?
>
> That is the point. Your local file system has no idea that
> E:/mindprod represents the root of your local mirror of a website, and
> neither do your browsers. If they did, you could have links in the
> local mirror of the form href="/jgloss/jgloss.html" to refer to
> E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
> website mirror. You must use relative addresses, e.g.
> href="../jgloss/jgloss.html". My examples mainly come up when you try
> navigating the local files of a website mirror with the file system.
>
> For a remote website, the browser does know the root. I have not
> experimented to see if /-type links work there.
I was going to sort of defend you but now you're just being silly.
Check out the documentation for the wget Unix utility. There are some
hints there.
What I think you are missing is:
1. You have to maintain your own root if you're
parsing/browsing/scraping a website. You have to remember that you
fetched a document from http://www.mindprod.com/stuffs, for example, and
all your paths are relative to that. I haven't actually looked at HTML
semantics in a while, so you might have to also remove the path from
that root and just use the protocol + host part. The URL class in Java
does this for you.
2. Once you have the root, you have to look at the start of the path
from the HTML document and determine if you just append, or if you have
to use just the hostname, based on the leading characters of the
path ("." or "/"). I'm quite certain the HTML RFCs spell this out
explicitly. Expecting the Java File class to implement these special
semantics for you just isn't going to work. It's "naive," or
something, alright.
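
To make the two points above concrete, here is a minimal sketch of resolving links against the address a document was fetched from. It uses java.net.URI.resolve rather than the URL class mentioned above (URL's two-argument constructor does a similar job). The class name LinkResolver, the helper resolve, and the page.html document name are made up for illustration; only the host and the jgloss paths come from this thread.

```java
import java.net.URI;

public class LinkResolver {

    // Resolve an href found in a document against the address that
    // document was fetched from. A leading "/" replaces the whole path
    // (only scheme + host survive); anything else is applied relative
    // to the document's directory.
    static String resolve(String documentUrl, String href) {
        return URI.create(documentUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical document address; "page.html" is an assumption.
        String doc = "http://www.mindprod.com/stuffs/page.html";

        System.out.println(resolve(doc, "/jgloss/jgloss.html"));
        System.out.println(resolve(doc, "../jgloss/jgloss.html"));
        System.out.println(resolve(doc, "other.html"));
    }
}
```

Both the "/"-style link and the "../"-style link end up at the same address here, because the remembered document address supplies the root. A java.io.File has no such context, which is why it cannot do this for you.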