


Re: File Gotchas

From markspace <markspace@nospam.nospam>
Newsgroups comp.lang.java.help
Subject Re: File Gotchas
Date 2013-03-17 11:17 -0700
Organization A noiseless patient Spider
Message-ID <ki5198$84o$1@dont-email.me>
References <t9cak8lc0kuljdsgges7rkpl98hrj5bt8b@4ax.com> <18nv51lbot90m.5vhsbw89ogpt$.dlg@40tude.net> <ggubk8tunlh9h2e040dvja8mmfqvg819a1@4ax.com>



On 3/17/2013 10:20 AM, Roedy Green wrote:
> On Sun, 17 Mar 2013 13:56:20 +0100, Joerg Meier <joergmmeier@arcor.de>
> wrote, quoted or indirectly quoted someone who said :
>
>> What website ? Now websites are involved ? Not really sure whats going on
>> here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
>> E:/ or E:/mindprod/jgloss ?
>
> That is the point.  Your local file system has no idea that
> E:/mindprod represents the root of your local mirror of a website, and
> neither do your browsers.  If they did, you could have links in the
> local mirror of the form href="/jgloss/jgloss.html" to refer to
> E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
> website mirror.  You must use relative addresses, e.g.
> href="../jgloss/jgloss.html".  My examples mainly come up when you try
> navigating the local files of a website mirror with the file system.
>
> For a remote website, the browser does know the root.  I have not
> experimented to see if /-type links work there.

I was going to sort of defend you but now you're just being silly.
Check out the documentation for the wget unix utility.  There are some
hints there.

What I think you are missing is:

1.  You have to maintain your own root if you're
parsing/browsing/scraping a website.  You have to remember that you
fetched a document from http://www.mindprod.com/stuffs, for example, and
all your paths are relative to that.  I haven't actually looked at HTML
semantics in a while, so you might have to also remove the path from
that root and just use the protocol + host part.  The URL class in Java
does this for you.

2.  Once you have the root, you have to look at the start of the path
from the HTML document and determine if you just append, or if you have
to use just the hostname, based on the leading characters of the
path ("." or "/").  I'm quite certain the RFCs spell this out
explicitly (relative reference resolution is in RFC 3986, section 5).
Expecting the Java File class to implement these special semantics for
you just isn't going to work.  It's "naive," or something, alright.
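
To make the two points above concrete, here is a minimal sketch of
resolving links against a remembered base.  It uses java.net.URI.resolve
rather than the URL class mentioned above; that swap is my choice, and
the URL constructor that takes a context URL behaves similarly.  The
class name LinkResolver, the method name, and the index.html and
other.html file names are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper: resolves an href from an HTML page against the
// URL the page was fetched from, instead of treating it as a file path.
final class LinkResolver {

    // base is the address of the fetched document; href is the raw link.
    static String resolve(String base, String href) {
        // URI.resolve applies the standard relative-reference rules:
        // a leading "/" replaces the whole path and keeps scheme + host,
        // anything else is merged with the base path, and "." / ".."
        // segments are normalized away.
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed example page under the /stuffs directory.
        String base = "http://www.mindprod.com/stuffs/index.html";

        // Root-relative link: only the protocol + host part of base is used.
        System.out.println(resolve(base, "/jgloss/jgloss.html"));

        // Document-relative link with "..": climbs out of /stuffs first.
        System.out.println(resolve(base, "../jgloss/jgloss.html"));

        // Plain relative link: appended to the directory of the base.
        System.out.println(resolve(base, "other.html"));
    }
}
```

The first two calls should both come out as
http://www.mindprod.com/jgloss/jgloss.html, and the third as
http://www.mindprod.com/stuffs/other.html.  Nothing here touches
java.io.File, which is the point: the root is something you carry
around yourself, not something the file system knows.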






Thread

File Gotchas Roedy Green <see_website@mindprod.com.invalid> - 2013-03-16 20:16 -0700
  Re: File Gotchas Joerg Meier <joergmmeier@arcor.de> - 2013-03-17 13:56 +0100
    Re: File Gotchas Roedy Green <see_website@mindprod.com.invalid> - 2013-03-17 10:20 -0700
      Re: File Gotchas markspace <markspace@nospam.nospam> - 2013-03-17 11:17 -0700
      Re: File Gotchas Steven Simpson <ss@domain.invalid> - 2013-03-17 19:38 +0000
    Re: File Gotchas Lew <lewbloch@gmail.com> - 2013-03-17 11:58 -0700
