offsets in a FileChannel ...

Started by	qwertmonkey@syberianoutpost.ru
First post	2013-02-23 14:11 +0000
Last post	2013-02-26 08:05 -0800
Articles	5 — 3 participants

  offsets in a FileChannel ... qwertmonkey@syberianoutpost.ru - 2013-02-23 14:11 +0000
    Re: offsets in a FileChannel ... Robert Klemme <shortcutter@googlemail.com> - 2013-02-23 15:39 +0100
      Re: offsets in a FileChannel ... Roedy Green <see_website@mindprod.com.invalid> - 2013-02-25 04:09 -0800
        Re: offsets in a FileChannel ... Robert Klemme <shortcutter@googlemail.com> - 2013-02-25 21:50 +0100
          Re: offsets in a FileChannel ... Roedy Green <see_website@mindprod.com.invalid> - 2013-02-26 08:05 -0800

#22465 — offsets in a FileChannel ...

From	qwertmonkey@syberianoutpost.ru
Date	2013-02-23 14:11 +0000
Subject	offsets in a FileChannel ...
Message-ID	<kgain5$4pl$1@speranza.aioe.org>

 What is missing in this code snippet to get the offsets in the underlying
FileChannel on which the MappedByteBuffer and then the CharBuffer are built?
~ 
 CharBuffer.position() gives you the position alright, but how about wanting
to get the actual offset of certain characters in the actual data feed exposed
through the FileInputStream?
~ 
     char c;
     long lPsx;
     FIS = new FileInputStream(IFl);
     FileChannel FlChnl = FIS.getChannel();
     MappedByteBuffer MptbChnlBfr = FlChnl.map(FileChannel.MapMode.READ_ONLY,
                                               0, FlChnl.size());
     CharBuffer cBfrUTF8 = ChrStDkdr.decode(MptbChnlBfr);
// __ 
     while(cBfrUTF8.hasRemaining()){
      c = cBfrUTF8.get();
      lPsx = cBfrUTF8.position();
      System.err.println("// __ |" + lPsx + "|" + c + "|" + (int)c + "|");
     }
// __ 
     FlChnl.close();
     FIS.close();
~ 
 Or do you know of any other way to basically do the same thing?
~ 
 thanks,
 lbrtchx
 comp.lang.java.programmer:offsets in a FileChannel ...

#22466

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-02-23 15:39 +0100
Message-ID	<aos2kgFqnonU1@mid.individual.net>
In reply to	#22465

On 23.02.2013 15:11, qwertmonkey@syberianoutpost.ru wrote:
>   What is missing in this code snippet to get the offsets in the underlying
> FileChannel on which the MappedByteBuffer and then the CharBuffer are built?
> ~
>   CharBuffer.position() gives you the position alright, but how about wanting
> to get the actual offset of certain characters in the actual data feed exposed
> through the FileInputStream?
> ~
>       char c;
>       long lPsx;
>       FIS = new FileInputStream(IFl);
>       FileChannel FlChnl = FIS.getChannel();
>       MappedByteBuffer MptbChnlBfr = FlChnl.map(FileChannel.MapMode.READ_ONLY,
> 0, FlChnl.size());
>       CharBuffer cBfrUTF8 = ChrStDkdr.decode(MptbChnlBfr);
> // __
>       while(cBfrUTF8.hasRemaining()){
>        c = cBfrUTF8.get();
>        lPsx = cBfrUTF8.position();
>        System.err.println("// __ |" + lPsx + "|" + c + "|" + (int)c + "|");
>       }
> // __
>       FlChnl.close();
>       FIS.close();
> ~
>   Or do you know of any other way to basically do the same thing?

UTF8 is not an encoding with a fixed width.  You would have to create 
more complex code if you want to align char position and byte position. 
  Basically you need to read the file from the beginning and observe the 
width of every char as it is being decoded.  You could of course apply 
heuristics if you have more knowledge about the file but I guess that 
soon gets messy.
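
A minimal sketch of that idea, assuming the input is well-formed UTF-8 with no byte order mark (malformed bytes are replaced during decoding, which would throw the byte count off). The class and method names are hypothetical. It walks the decoded text one code point at a time and adds up the number of bytes UTF-8 needs for each one:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Offsets {

    /** Number of bytes UTF-8 uses to encode one Unicode code point. */
    static int utf8Width(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /**
     * For each char index of the decoded text, returns the byte offset at which
     * the code point containing that char starts in the UTF-8 input.
     * Assumes well-formed UTF-8 without a BOM.
     */
    static long[] byteOffsets(ByteBuffer utf8) {
        // duplicate() leaves the caller's position and limit untouched
        CharBuffer chars = StandardCharsets.UTF_8.decode(utf8.duplicate());
        long[] offsets = new long[chars.remaining()];
        long bytePos = 0;
        int i = 0;
        while (i < offsets.length) {
            int cp = Character.codePointAt(chars, i);
            int n = Character.charCount(cp); // 2 for a surrogate pair, else 1
            for (int k = 0; k < n; k++) {
                offsets[i + k] = bytePos;
            }
            bytePos += utf8Width(cp);
            i += n;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // 'a', U+00E9, U+20AC, U+1F600 (a surrogate pair), 'z'
        String text = "a\u00e9\u20ac\uD83D\uDE00z";
        ByteBuffer in = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.toString(byteOffsets(in))); // [0, 1, 3, 6, 6, 10]
    }
}
```

A MappedByteBuffer can be passed in directly since it is a ByteBuffer. The offsets array costs 8 bytes per char, so for large files it may be preferable to keep only the running byte position inside the loop.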

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [next] | [standalone]

#22502

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2013-02-25 04:09 -0800
Message-ID	<2rkmi8hjbqisimmd7bcsalbqjs6f8l9ngp@4ax.com>
In reply to	#22466

On Sat, 23 Feb 2013 15:39:08 +0100, Robert Klemme
<shortcutter@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

>UTF8 is not an encoding with a fixed width.  

You could use UTF-16.  Then you could interconvert 8-bit byte and char
offsets with a simple shift.

You could build a table of interesting byte offsets when you construct
the stream.  

You could embed binary counts in bytes/chars at the head of phrases.
You build and take the stream apart with ByteArrayStreams.
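
A rough sketch of what such length-prefixed phrases could look like, assuming a 4-byte big-endian byte count in front of each UTF-8 phrase. The framing and the names are illustrative assumptions, not a format described in the post:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixed {

    /** Writes each phrase as a 4-byte length followed by its UTF-8 bytes. */
    static byte[] write(List<String> phrases) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String phrase : phrases) {
                byte[] utf8 = phrase.getBytes(StandardCharsets.UTF_8);
                out.writeInt(utf8.length); // byte count, not char count
                out.write(utf8);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the phrases back by following the length prefixes. */
    static List<String> read(byte[] data) throws IOException {
        List<String> phrases = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                byte[] utf8 = new byte[in.readInt()];
                in.readFully(utf8);
                phrases.add(new String(utf8, StandardCharsets.UTF_8));
            }
        }
        return phrases;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write(List.of("h\u00e9", "ok"));
        System.out.println(data.length);        // 13
        System.out.println(read(data).size());  // 2
    }
}
```

The reader can skip a phrase without decoding it, because the prefix gives its size in bytes.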
-- 
Roedy Green Canadian Mind Products http://mindprod.com
One thing I love about having a website, is that when I complain about
something, I only have to do it once. It saves me endless hours of 
grumbling.

#22513

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-02-25 21:50 +0100
Message-ID	<ap214gF52mbU1@mid.individual.net>
In reply to	#22502

On 25.02.2013 13:09, Roedy Green wrote:
> On Sat, 23 Feb 2013 15:39:08 +0100, Robert Klemme
> <shortcutter@googlemail.com> wrote, quoted or indirectly quoted
> someone who said :
>
>> UTF8 is not an encoding with a fixed width.
>
> You could use UTF-16.  Then you could interconvert 8-bit byte and char
> offsets with a simple shift.

I don't.  And he doesn't either, since UTF-16 isn't a fixed-width encoding.
http://www.unicode.org/faq/utf_bom.html#gen6
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G28070
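
A small sketch of the distinction in Java terms, assuming UTF-16BE data without a byte order mark (names are hypothetical). A Java char is one UTF-16 code unit, so a char index does map to a byte offset by a shift, but a code point outside the Basic Multilingual Plane takes two chars (four bytes), so the width per character is not fixed:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Widths {

    /** Byte offset of the char at charIndex in BOM-less UTF-16 data. */
    static long byteOffset(int charIndex) {
        return (long) charIndex << 1; // two bytes per UTF-16 code unit
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair), 'z'
        String s = "a\uD83D\uDE00z";
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // this charset writes no BOM

        System.out.println(s.length());                      // 4 chars (code units)
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(utf16.length);                    // 8 bytes

        // The char at index 3 ('z') sits at byte offset 6.
        int at = (int) byteOffset(3);
        char z = (char) (((utf16[at] & 0xFF) << 8) | (utf16[at + 1] & 0xFF));
        System.out.println(z == s.charAt(3));                // true
    }
}
```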

> You could build a table of interesting byte offsets when you construct
> the stream.

So you would augment the file with an index file.  This is certainly not 
a general solution as you do not always have the option to transport 
that extra data with the file.  Plus, aligning offsets while writing 
might prove as difficult as when reading (e.g. because of buffering).

> You could embed binary counts in bytes/chars at the head of phrases.
> You build and take the stream apart with ByteArrayStreams.

That's no longer a text document.

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

#22535

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2013-02-26 08:05 -0800
Message-ID	<g3npi89s302lhvv68dmtvdirlqh5jcsjk5@4ax.com>
In reply to	#22513

On Mon, 25 Feb 2013 21:50:18 +0100, Robert Klemme
<shortcutter@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

>So you would augment the file with an index file.  This is certainly not 
>a general solution as you do not always have the option to transport 
>that extra data with the file.  

In one application I wrote, on load I compose a temporary RAF from
sequential files with an in-RAM ArrayList of offsets of where records
start.  It is a primitive form of hermit crab.  
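
A hedged sketch of such an offset table, assuming newline-terminated UTF-8 records (the record format and all names here are assumptions; the post does not describe them). One pass over the file collects the byte offset of each record start, and a RandomAccessFile then seeks straight to a record:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecordIndex {

    /** Scans the file once and records the byte offset where each line starts. */
    static List<Long> indexLines(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;
            boolean atStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atStart) {
                    starts.add(pos);
                    atStart = false;
                }
                // Safe in UTF-8: the byte 0x0A never occurs inside a multi-byte sequence.
                if (b == '\n') {
                    atStart = true;
                }
                pos++;
            }
        }
        return starts;
    }

    /** Seeks to record i and decodes it, dropping a trailing newline if present. */
    static String readRecord(Path file, List<Long> starts, int i) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = starts.get(i);
            long end = (i + 1 < starts.size()) ? starts.get(i + 1) : raf.length();
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buf);
            int len = buf.length;
            if (len > 0 && buf[len - 1] == '\n') {
                len--;
            }
            return new String(buf, 0, len, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".txt");
        Files.write(file, "a\u00e9\nbb\nc".getBytes(StandardCharsets.UTF_8));
        List<Long> starts = indexLines(file);
        System.out.println(starts);                      // [0, 4, 7]
        System.out.println(readRecord(file, starts, 1)); // bb
        Files.delete(file);
    }
}
```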

Now that I have RAM and address space to burn, I could put the whole
thing in RAM.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
One thing I love about having a website, is that when I complain about
something, I only have to do it once. It saves me endless hours of 
grumbling.
