File Gotchas

Newsgroup: comp.lang.java.help
Started by: Roedy Green <see_website@mindprod.com.invalid>
First post: 2013-03-16 20:16 -0700
Last post: 2013-03-17 11:58 -0700
Articles: 6 (5 participants)



Contents

  File Gotchas Roedy Green <see_website@mindprod.com.invalid> - 2013-03-16 20:16 -0700
    Re: File Gotchas Joerg Meier <joergmmeier@arcor.de> - 2013-03-17 13:56 +0100
      Re: File Gotchas Roedy Green <see_website@mindprod.com.invalid> - 2013-03-17 10:20 -0700
        Re: File Gotchas markspace <markspace@nospam.nospam> - 2013-03-17 11:17 -0700
        Re: File Gotchas Steven Simpson <ss@domain.invalid> - 2013-03-17 19:38 +0000
      Re: File Gotchas Lew <lewbloch@gmail.com> - 2013-03-17 11:58 -0700

#2597 — File Gotchas

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2013-03-16 20:16 -0700
Subject: File Gotchas
Message-ID: <t9cak8lc0kuljdsgges7rkpl98hrj5bt8b@4ax.com>

/*
 * [TestFileCombine.java]
 *
 * Summary: combining two filenames with java.io.File
 *
 * Copyright: (c) 2013 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2013-03-16 initial version
 */
package com.mindprod.example;

import com.mindprod.common11.Misc;

import java.io.File;
import java.io.IOException;

import static java.lang.System.out;

/**
 * combining two filenames with java.io.File
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2013-03-16 initial version
 * @since 2013-03-16
 */
public final class TestFileCombine
    {

    /**
     * Experiment with various ways of combining file names
     *
     * @param args not used
     *
     * @throws java.io.IOException on I/O failure.
     */
    public static void main( String[] args ) throws IOException
        {

        // File is not suitable for resolving relative or absolute
        // offsets from a base filename.
        File root = new File( "E:/mindprod" );
        File o1 = new File( root, "index.html" );
        out.println( Misc.getCanOrAbsPath( o1 ) );
        // prints: E:/mindprod/index.html  (actually with backslashes)

        File o2 = new File( root, "/index.html" );
        out.println( Misc.getCanOrAbsPath( o2 ) );
        // prints: E:/mindprod/index.html

        File base = new File( "E:/mindprod/jgloss/encoding" );
        File o3 = new File( base, "pad.html" );
        out.println( Misc.getCanOrAbsPath( o3 ) );
        // prints: E:\mindprod\jgloss\encoding\pad.html

        File o4 = new File( base, "../pad.html" );
        out.println( Misc.getCanOrAbsPath( o4 ) );
        // prints: E:\mindprod\jgloss\pad.html 

        File o5 = new File( base, "/jgloss/pad.html" );
        out.println( Misc.getCanOrAbsPath( o5 ) );
        // prints: E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
        // You might have naively hoped for: E:/mindprod/jgloss/pad.html
        // However, File has no idea that / on your website refers to E:/mindprod.

        File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html" );
        File o6 = new File( base2, "pad.html" );
        out.println( Misc.getCanOrAbsPath( o6 ) );
        // prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html (ouch)
        // You might have hoped for: E:\mindprod\jgloss\encoding\pad.html

        File o7 = new File( base2, "../pad.html" );
        out.println( Misc.getCanOrAbsPath( o7 ) );
        // prints:  E:\mindprod\jgloss\encoding\pad.html (ouch)
        // You might have hoped for: E:\mindprod\jgloss\pad.html

        File o8 = new File( base2, "/jgloss/pad.html" );
        out.println( Misc.getCanOrAbsPath( o8 ) );
        // prints: E:\mindprod\jgloss\encoding\utf8.html\jgloss\pad.html (ouch)
        // You might have naively hoped for: E:/mindprod/jgloss/pad.html
        // However, File has no idea that / on your website refers to E:/mindprod.
        }

    }
-- 
Roedy Green Canadian Mind Products http://mindprod.com
The computer programmer is a creator of universes for which he alone
is the lawgiver. No playwright, no stage director, no emperor, however
powerful, has ever exercised such absolute authority to arrange a stage
or a field of battle and to command such unswervingly dutiful actors or
troops. 
 ~ Joseph Weizenbaum (born: 1923-01-08 died: 2008-03-05 at age: 85)



#2598

From: Joerg Meier <joergmmeier@arcor.de>
Date: 2013-03-17 13:56 +0100
Message-ID: <18nv51lbot90m.5vhsbw89ogpt$.dlg@40tude.net>
In reply to: #2597

On Sat, 16 Mar 2013 20:16:03 -0700, Roedy Green wrote:

>         File o5 = new File( base, "/jgloss/pad.html" );
>         out.println( Misc.getCanOrAbsPath( o5 ) );
>         // prints:E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
>         // You might have naively hoped for:
> E:/mindprod/jgloss/pad.html
>         // However, File has no idea that / on your website refers to
> E:/mindprod.

What website? Now websites are involved? I'm not really sure what's going on
here. Why would I hope that / randomly refers to E:/mindprod? Why not to
E:/ or E:/mindprod/jgloss?

Leaving out the part about a website that I don't understand, why would you
assume that Java would randomly pick the parts of the filename you were
thinking of? I can see no indication why it would be that specific part
other than "If I wish really hard, maybe it will come true". At most, I
would have expected that a leading / would be interpreted as the drive's
root, as it works under Linux.

>         File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html"
> );
>         File o6 = new File( base2, "pad.html" );
>         out.println( Misc.getCanOrAbsPath( o6 ) );
>         // prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html
> (ouch)
>         // You might have hoped for:
> E:\mindprod\jgloss\encoding\pad.html

That would be a defect that I would immediately file a bug report for. It
would mean that it would be impossible to access folders/directories that
have a period in their name. Why you would hope that those would randomly
be cut off for no reason is beyond me.

>         File o7 = new File( base2, "../pad.html" );
>         out.println( Misc.getCanOrAbsPath( o7 ) );
>         // prints:  E:\mindprod\jgloss\encoding\pad.html (ouch)
>         // You might have hoped for: E:\mindprod\jgloss\pad.html

Again: behaviour like that would be a bug with regard to directories
with a period in their name. I'm not sure why that would be desirable.

>         File o8 = new File( base2, "/jgloss/pad.html" );
>         out.println( Misc.getCanOrAbsPath( o8 ) );
>         // prints:
> E:\mindprod\jgloss\encoding\utf8.html\jgloss\pad.html (ouch)
>         // You might have naively hoped for:
> E:/mindprod/jgloss/pad.html
>         // However, File has no idea that / on your website refers to
> E:/mindprod.

Same response as above: what website? Why would / refer to that particular
piece of the path?

Kind regards,
		Joerg

-- 
I don't read my email, so replies sent by email will unfortunately go
unread.



#2599

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2013-03-17 10:20 -0700
Message-ID: <ggubk8tunlh9h2e040dvja8mmfqvg819a1@4ax.com>
In reply to: #2598

On Sun, 17 Mar 2013 13:56:20 +0100, Joerg Meier <joergmmeier@arcor.de>
wrote, quoted or indirectly quoted someone who said :

>What website ? Now websites are involved ? Not really sure whats going on
>here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
>E:/ or E:/mindprod/jgloss ?
 
That is the point.  Your local file system has no idea that
E:/mindprod represents the root of your local mirror of a website, and
neither do your browsers.  If they did, you could have links in the
local mirror of the form href="/jgloss/jgloss.html" to refer to
E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
website mirror.  You must use relative addresses, e.g.
href="../jgloss/jgloss.html".  My examples mainly come up when you try
navigating the local files of a website mirror with the file system.

For a remote website, the browser does know the root.  I have not
experimented to see if /-type links work there.
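
For what it is worth, a site-root-relative link can be mapped onto the mirror by hand once the program is told where the mirror root is. What follows is only a minimal sketch, assuming java.nio.file (JDK 1.7+). The class name MirrorPaths, the method toLocal and its parameters are made up for illustration, and query strings, fragments, percent-encoding and "//host/..." links are deliberately ignored.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps href values onto files in a local site mirror. */
final class MirrorPaths {
    private MirrorPaths() {}

    /**
     * @param mirrorRoot  directory that plays the role of "/" on the website
     * @param documentDir directory of the document that contains the link
     * @param href        link text, either "/..." or a relative reference
     */
    static Path toLocal(Path mirrorRoot, Path documentDir, String href) {
        if (href.startsWith("/")) {
            // Site-root-relative: drop the leading slash, then resolve
            // against the mirror root instead of the drive root.
            return mirrorRoot.resolve(href.substring(1)).normalize();
        }
        // Document-relative: resolve against the document's directory;
        // normalize() collapses any "../" segments.
        return documentDir.resolve(href).normalize();
    }

    public static void main(String[] args) {
        Path root = Paths.get("E:/mindprod");
        Path dir = Paths.get("E:/mindprod/jgloss/encoding");
        // The separator in the output depends on the platform.
        System.out.println(toLocal(root, dir, "/jgloss/pad.html"));
        System.out.println(toLocal(root, dir, "../pad.html"));
        System.out.println(toLocal(root, dir, "pad.html"));
    }
}
```

The point of passing mirrorRoot explicitly is that neither File nor Path can guess it. The first two calls should both end up at jgloss/pad.html under the mirror root, the third at jgloss/encoding/pad.html.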



#2601

From: markspace <markspace@nospam.nospam>
Date: 2013-03-17 11:17 -0700
Message-ID: <ki5198$84o$1@dont-email.me>
In reply to: #2599

On 3/17/2013 10:20 AM, Roedy Green wrote:
> On Sun, 17 Mar 2013 13:56:20 +0100, Joerg Meier <joergmmeier@arcor.de>
> wrote, quoted or indirectly quoted someone who said :
>
>> What website ? Now websites are involved ? Not really sure whats going on
>> here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
>> E:/ or E:/mindprod/jgloss ?
>
> That is the point.  Your local file system has no idea that
> E:/mindprod represents the root of your local mirror of a website, and
> neither do your browsers.  If they did, you could have links in the
> local mirror of the form href="/jgloss/jgloss.html" to refer to
> E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
> website mirror.  You must use relative addresses, e.g.
> href="../jgloss/jgloss.html".  My examples mainly come up when you try
> navigating the local files of a website mirror with the file system.
>
> For a remote website, the browser does know the root.  I have not
> experimented to see if /-type links work there.

I was going to sort of defend you but now you're just being silly. 
Check out the documentation for the wget Unix utility.  There are some 
hints there.

What I think you are missing is:

1.  You have to maintain your own root if you're 
parsing/browsing/scraping a website.  You have to remember that you 
fetched a document from http://www.mindprod.com/stuffs, for example, and 
all your paths are relative to that.  I haven't actually looked at HTML 
semantics in a while, so you might have to also remove the path from 
that root and just use the protocol + host part.  The URL class in Java 
does this for you.

2.  Once you have the root, you have to look at the start of the path 
from the HTML document and determine whether you just append, or whether 
you have to use just the hostname, based on the leading characters of the 
path ("." or "/").  I'm quite certain the HTML RFCs spell this out 
explicitly.  Expecting the Java File class to implement these special 
semantics for you just isn't going to work.  It's "naive," or 
something, all right.
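
The two steps above can be sketched with java.net.URI, whose resolve method applies the relative-reference rules (this sketch uses URI rather than the URL class mentioned above). It assumes the document was fetched from http://mindprod.com/jgloss/encoding/utf8.html, an address chosen only to match the earlier File examples. HrefResolver and resolveHref are hypothetical names.

```java
import java.net.URI;

/** Hypothetical helper: resolves href values against a document's URI. */
final class HrefResolver {
    private HrefResolver() {}

    /** Resolves an href against the URI of the document that contains it. */
    static URI resolveHref(URI document, String href) {
        // URI.resolve copes with plain names, "../" and a leading "/".
        return document.resolve(href);
    }

    public static void main(String[] args) {
        // Assumed address of the page that holds the links.
        URI doc = URI.create("http://mindprod.com/jgloss/encoding/utf8.html");

        System.out.println(resolveHref(doc, "pad.html"));
        // prints: http://mindprod.com/jgloss/encoding/pad.html

        System.out.println(resolveHref(doc, "../pad.html"));
        // prints: http://mindprod.com/jgloss/pad.html

        System.out.println(resolveHref(doc, "/jgloss/pad.html"));
        // prints: http://mindprod.com/jgloss/pad.html
    }
}
```

These are the results the original post "hoped for" from File. The difference is that a URI treats the last path segment of the base as a document, while File treats the whole parent pathname as a directory.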






#2603

From: Steven Simpson <ss@domain.invalid>
Date: 2013-03-17 19:38 +0000
Message-ID: <ulih1a-ge3.ln1@s.simpson148.btinternet.com>
In reply to: #2599

On 17/03/13 17:20, Roedy Green wrote:
> Your local file system has no idea that
> E:/mindprod represents the root of your local mirror of a website, and
> neither do your browsers.  If they did, you could have links in the
> local mirror of the form href="/jgloss/jgloss.html" to refer to
> E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
> website mirror.  You must use relative addresses, e.g.
> href="../jgloss/jgloss.html".  My examples mainly come up when you try
> navigating the local files of a website mirror with the file system.
>
> For a remote website, the browser does know the root.  I have not
> experimented to see if /-type links work there.

I gather you're trying to write some off-line site-checking program, 
where you have a local copy of your site, which you FTP to the server, 
and the program needs to interpret links (among other things).

java.io.File does not capture distinctions between files and 
directories, but java.net.URI does distinguish between URIs with and 
without terminating slashes.  I suggest you do as much work as possible 
with URIs - identify each document you're handling by its URI; parse 
href values as URIs and resolve against the document's - and only 
convert to File when you need to access the disc.  Here's a barely 
tested class that might help with that:

import java.net.URI;
import java.io.File;

/**
  * Maps URIs within a site to local files.
  */
class FileMapping {
     final URI site;
     final URI copy;
     final String index;

     /**
      * Create a file mapping.
      *
      * @param site the base URI of the site; anything after the last
      * slash is ignored
      *
      * @param copy the directory of the local copy of the site
      *
      * @param index the default filename to use to map directory-like
      * URIs
      */
     public FileMapping(String site, String copy, String index) {
         this(URI.create(site), new File(copy), index);
     }

     /**
      * Create a file mapping using a default leafname.
      *
      * @param site the base URI of the site; anything after the last
      * slash is ignored
      *
      * @param copy the directory of the local copy of the site
      */
     public FileMapping(String site, String copy) {
         this(URI.create(site), new File(copy));
     }

     /**
      * Create a file mapping using a default leafname.
      *
      * @param site the base URI of the site; anything after the last
      * slash is ignored
      *
      * @param copy the directory of the local copy of the site
      */
     public FileMapping(URI site, File copy) {
         this(site, copy, "index.html");
     }

     /**
      * Create a file mapping.
      *
      * @param site the base URI of the site; anything after the last
      * slash is ignored
      *
      * @param copy the directory of the local copy of the site
      *
      * @param index the default filename to use to map directory-like
      * URIs
      */
     public FileMapping(URI site, File copy, String index) {
         /* We must have a slash-terminated base URI for relativize to
          * work. */
         this.site = site.resolve("./");

         /* We must add a dummy element so that we can ensure a
          * trailing slash. */
         this.copy = new File(copy, "dummy").toURI().resolve("./");

         this.index = index;
     }

     /**
      * Map the URI to a file.
      *
      * @param addr the URI to be mapped
      *
      * @return the file that the URI maps to, or null if it is
      * external
      */
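     public File map(URI addr) {
         /* Express the address relative to the site base; relativize
          * returns its argument unchanged if it lies outside the
          * site, in which case it is still absolute. */
         URI rel = site.relativize(addr);
         if (rel.isAbsolute()) return null;

         /* A URI that is unchanged by resolving "./" against it names
          * a directory, so map it to the index file inside it. */
         if (rel.resolve("./").equals(rel))
             rel = rel.resolve(index);

         /* Re-base onto the local copy and convert to a File. */
         rel = copy.resolve(rel);
         return new File(rel);
     }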

     private static void test(FileMapping mapping, String addrText) {
         URI addr = URI.create(addrText);
         File file = mapping.map(addr);
         System.out.printf("%s -> %s%n", addr, file);
     }

     public static void main(String[] args) throws Exception {
         FileMapping mapping =
             new FileMapping("http://mindprod.com/", "/var/site");
         test(mapping, "http://www.example.com/");
         test(mapping, "http://mindprod.com/jgloss/pad.html");
         test(mapping, "http://mindprod.com/jgloss/encoding/pad.html");
     }
}



-- 
ss at comp dot lancs dot ac dot uk



#2602

From: Lew <lewbloch@gmail.com>
Date: 2013-03-17 11:58 -0700
Message-ID: <e0db3a97-2973-403a-84a5-04c88c96184e@googlegroups.com>
In reply to: #2598

Joerg Meier wrote:
> Roedy Green wrote:
>>         File o5 = new File( base, "/jgloss/pad.html" );
>>         out.println( Misc.getCanOrAbsPath( o5 ) );
>>         // prints:E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
>>         // You might have naively hoped for:
>> E:/mindprod/jgloss/pad.html
>>         // However, File has no idea that / on your website refers to
>> E:/mindprod.

'File' is meant to assist with file-system navigation, not web navigation.

It is not an abstraction of a file system, either. It is "[a]n abstract representation of 
file and directory pathnames."

It only models the names. From that point of view, all the behavior you observed
is consistent with expectation.
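
A small sketch of what "only models the names" means in practice: getParent and getName are computed from the pathname string alone, whether or not anything exists on disk. PathnameOnlyDemo and describe are hypothetical names, and the directory is simply assumed not to exist wherever this is run.

```java
import java.io.File;

/** Hypothetical demo: File answers name questions without touching the disk. */
final class PathnameOnlyDemo {
    private PathnameOnlyDemo() {}

    /** Splits a pathname into "parent | name" using string handling only. */
    static String describe(File f) {
        return f.getParent() + " | " + f.getName();
    }

    public static void main(String[] args) {
        // Assumed not to exist in the working directory.
        File f = new File("no-such-directory", "child.html");

        System.out.println(describe(f));
        // prints: no-such-directory | child.html

        // Only a call such as exists() actually asks the operating system.
        System.out.println("exists: " + f.exists());
    }
}
```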

> What website ? Now websites are involved ? Not really sure whats going on
> here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
> E:/ or E:/mindprod/jgloss ?

In the case of 'File', you are not even promised that it refers to "/".

You are promised that it represents the pathname "/", the resource for which is 
out of its scope.

> Leaving out the part about a website that I don't understand, why would you
> assume that Java randomly would pick the parts of the filename you were
> thinking of ? I can see no indication why it would be that specific part
> other than "If I wish really hard, maybe it will come true". At most, I
> would have expected that a leading / would be interpreted as the drives
> root, as it works under Linux.

Which is actually more than it does. All it represents is the pathname "/". 

To put it another way, 'File' is not responsible for how the pathname is 
interpreted.

If that is the drive root, that's up to the OS service to which 'File' passes 
the pathname.

>>         File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html"
>> );
>>         File o6 = new File( base2, "pad.html" );
>>         out.println( Misc.getCanOrAbsPath( o6 ) );
>>         // prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html
>> (ouch)

                                     new File( base2, "pad.html" );
E:/mindprod/jgloss/encoding/utf8.html/pad.html 
E:\mindprod\jgloss\encoding\utf8.html\pad.html

Why "ouch"?

> >         // You might have hoped for:
> 
> > E:\mindprod\jgloss\encoding\pad.html

That would violate the documented behavior of the constructor:
"Creates a new File instance from a parent pathname string and a child pathname string."

> That would be a defect that I would immediately file a bug report for. It
> would mean that it would be impossible to access folders/directories that

It is not the job of 'File' to access any resource. Its job is only to manage pathnames 
and the interaction of those pathnames with host services.

> have a period in their name. Why you would hope that those would randomly
> be cut off for no reason is beyond me.

And it would violate the contract.
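
If the parent pathname really names a document and a sibling of it is wanted, the caller has to say so by asking for the document's own parent first. A minimal sketch, with Siblings and sibling as hypothetical names:

```java
import java.io.File;

/** Hypothetical helper: builds the pathname of a file next to a document. */
final class Siblings {
    private Siblings() {}

    /** Returns the pathname of {@code name} in the same directory as {@code document}. */
    static File sibling(File document, String name) {
        // getParentFile() strips the last name from the pathname; if there
        // is no parent it returns null, and new File(null, name) is just name.
        return new File(document.getParentFile(), name);
    }

    public static void main(String[] args) {
        File base2 = new File("E:/mindprod/jgloss/encoding/utf8.html");
        // Yields .../jgloss/encoding/pad.html, using the platform's separator.
        System.out.println(sibling(base2, "pad.html"));
    }
}
```

This keeps the constructor's contract intact: the first argument is always treated as a directory, and it is the caller who decides that utf8.html is not one.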

> ... [snip] ...
>
> Same response as above: what website ? Why would / refer to that particular
> piece of the path ?

In point of fact, the shortcut of thinking that "/" refers to anything is a mismatch 
to what 'File' actually does. 'File' manages the name and its communication to the OS.

The OS decides what it matches.

With that in mind, the logic of 'File''s documented behavior and Joerg's incredulity 
that expectations would diverge therefrom are perfectly explicable.

-- 
Lew


