large xml file...

Started by: boris <boris@localhost.domain>
First post: 2011-08-22 20:05 -0400
Last post: 2011-08-26 08:47 +0300
Articles: 12 (7 participants)



Contents

  large xml file... boris <boris@localhost.domain> - 2011-08-22 20:05 -0400
    Re: large xml file... Ian Shef <invalid@avoiding.spam> - 2011-08-23 00:43 +0000
      Re: large xml file... boris <boris@localhost.domain> - 2011-08-22 20:53 -0400
        Re: large xml file... boris <boris@localhost.domain> - 2011-08-22 20:55 -0400
          Re: large xml file... Ian Shef <invalid@avoiding.spam> - 2011-08-23 19:48 +0000
    Re: large xml file... Arne Vajhøj <arne@vajhoej.dk> - 2011-08-22 21:59 -0400
      Re: large xml file... boris <boris@localhost.localdomain> - 2011-08-24 14:40 -0400
        Re: large xml file... Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> - 2011-08-24 18:59 +0000
        Re: large xml file... Arne Vajhøj <arne@vajhoej.dk> - 2011-08-24 19:10 -0400
          Re: large xml file... Stanimir Stamenkov <s7an10@netscape.net> - 2011-08-25 07:57 +0300
            Re: large xml file... RedGrittyBrick <RedGrittyBrick@spamweary.invalid> - 2011-08-25 10:39 +0100
              Re: large xml file... Stanimir Stamenkov <s7an10@netscape.net> - 2011-08-26 08:47 +0300

#7302 — large xml file...

From: boris <boris@localhost.domain>
Date: 2011-08-22 20:05 -0400
Subject: large xml file...
Message-ID: <j2uqp4$n8h$1@speranza.aioe.org>
Hi all,
I need to process a large XML file and dump some documents to a different
file based on the content of some elements.

Let's say I need to check the content of <text3> and dump the whole <doc> to
a different file:

<doc>
	<text1>
	<text2>
	<text3>  ... etc

</doc>

I'm trying to do this using SAX. Are there any examples of how to do this?
Is SAX OK for this task?
Thanks.

	



#7303

From: Ian Shef <invalid@avoiding.spam>
Date: 2011-08-23 00:43 +0000
Message-ID: <Xns9F49B4434CADvaj4088ianshef@138.125.254.103>
In reply to: #7302
boris <boris@localhost.domain> wrote in news:j2uqp4$n8h$1@speranza.aioe.org:

> hi all,
> I need to process large xml file and dump some documents to a different 
> file based on content of some elements.
> 
> let's say I need to check content of <text3> and dump the whole <doc> to 
> a different file:
> 
> <doc>
>      <text1>
>      <text2>
>      <text3>  ... etc
> 
> </doc>
> 
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?
> thanks.
> 
>      
> 

What you are asking is unclear to me.  
Do you mean that <text3> will determine whether you dump the whole <doc> to 
another file?
Do you mean that <text3> will determine what file the whole <doc> will be 
dumped to?
Or do you mean that the whole <doc> will be dumped to some other file, and 
while you are at it, <text3> will also be checked and reported in some way?

Can you read the "large xml file" twice?
Can you put the whole "large xml file" (or at least the part preceding
<text3>) into memory?
Can you copy the "large xml file" to another file while it is being 
processed?

Sorry about the questions, but I need clarification.  I have used SAX and 
may be able to provide enlightenment.  SAX has its uses, but is not so good 
when 'memory' is involved unless _you_ provide the memory.  SAX appears to 
excel when processing can take place in a single pass with very little 
looking backwards.  Consequently, it does not use as much memory as some 
other methods.






#7304

From: boris <boris@localhost.domain>
Date: 2011-08-22 20:53 -0400
Message-ID: <j2utk3$t1q$1@speranza.aioe.org>
In reply to: #7303
On 08/22/2011 08:43 PM, Ian Shef wrote:

> Do you mean that <text3> will determine whether you dump the whole <doc> to
> another file?
yes


> Can you read the "large xml file" twice?
I would like to read it once.

> Can you put the whole "large xml file" (or at least the part preceding
> <text3>) into memory?
no.



#7305

From: boris <boris@localhost.domain>
Date: 2011-08-22 20:55 -0400
Message-ID: <j2utnu$t1q$2@speranza.aioe.org>
In reply to: #7304
> On 08/22/2011 08:43 PM, Ian Shef wrote:
>> Can you put the whole "large xml file" (or at least the part preceding
>> <text3>) into memory?
> no.

No, I can load the whole file. 1 doc is not a problem...





#7317

From: Ian Shef <invalid@avoiding.spam>
Date: 2011-08-23 19:48 +0000
Message-ID: <Xns9F4A8259B4392vaj4088ianshef@138.125.254.103>
In reply to: #7305
boris <boris@localhost.domain> wrote in
news:j2utnu$t1q$2@speranza.aioe.org: 

>> On 08/22/2011 08:43 PM, Ian Shef wrote:
>>> Can you put the whole "large xml file" (or at least the part
>>> preceding <text3>) into memory?
>> no.
>
> No, I can load the whole file. 1 doc is not a problem...

As you are processing, you can save the XML yourself (e.g. as a List of 
String_s).

Based on the result of evaluating <text3>, you can choose to:

Open an output file, copy the List of String_s to the output file, and copy 
any succeeding XML to the output file, or discard the List and discontinue 
processing.

Alternatively, you can save the XML to a file as you process it.  When you 
evaluate <text3>, you can choose to continue saving to the file, or delete 
the file and discontinue processing.
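
A minimal sketch of the first option (buffer one <doc>, then decide), assuming plain SAX with no namespaces. The class name DocFilter, the filter method, and the "text3 contains a given substring" rule are illustrative assumptions, not anything from the original posts. It collects the kept documents in a List for brevity; with a really large input you would parse from an InputStream and write each kept <doc> to the output file as soon as its end tag arrives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps each <doc> whose <text3> contains `wanted`.
public class DocFilter extends DefaultHandler {
    private final String wanted;
    private final List<String> kept = new ArrayList<>();
    private StringBuilder doc;    // non-null only while inside a <doc>
    private StringBuilder text3;  // non-null only while inside a <text3>
    private boolean keep;

    public DocFilter(String wanted) { this.wanted = wanted; }

    // SAX delivers decoded characters, so markup characters must be re-escaped.
    private static String escape(String s, boolean inAttribute) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '&') out.append("&amp;");
            else if (c == '<') out.append("&lt;");
            else if (c == '>') out.append("&gt;");
            else if (c == '"' && inAttribute) out.append("&quot;");
            else out.append(c);
        }
        return out.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("doc")) { doc = new StringBuilder(); keep = false; }
        if (doc == null) return;  // outside any <doc>: nothing to buffer
        doc.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            doc.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i), true)).append('"');
        }
        doc.append('>');
        if (qName.equals("text3")) text3 = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc == null) return;
        String chunk = new String(ch, start, length);
        doc.append(escape(chunk, false));
        if (text3 != null) text3.append(chunk);  // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (doc == null) return;
        doc.append("</").append(qName).append('>');
        if (qName.equals("text3") && text3 != null) {
            keep = keep || text3.toString().contains(wanted);
            text3 = null;
        } else if (qName.equals("doc")) {
            if (keep) kept.add(doc.toString());
            doc = null;  // drop the buffer either way
        }
    }

    public static List<String> filter(String xml, String wanted) {
        DocFilter handler = new DocFilter(wanted);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.kept;
    }
}
```

Only one <doc> is held in memory at a time, which matches the "1 doc is not a problem" constraint. Comments, processing instructions and CDATA boundaries are not reproduced by this sketch.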






#7306

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2011-08-22 21:59 -0400
Message-ID: <4e5309a2$0$303$14726298@news.sunsite.dk>
In reply to: #7302
On 8/22/2011 8:05 PM, boris wrote:
> I need to process large xml file and dump some documents to a different
> file based on content of some elements.
>
> let's say I need to check content of <text3> and dump the whole <doc> to
> a different file:
>
> <doc>
> <text1>
> <text2>
> <text3> ... etc
>
> </doc>
>
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?

SAX or StAX seem like the most obvious choices given the context.

Any textbook SAX example should lead you to working code.

I can post some code, but I doubt that it will show anything
that various books and tutorials do not.

Arne



#7347

From: boris <boris@localhost.localdomain>
Date: 2011-08-24 14:40 -0400
Message-ID: <j33gfr$u9l$1@speranza.aioe.org>
In reply to: #7306
On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
> On 8/22/2011 8:05 PM, boris wrote:
>> I need to process large xml file and dump some documents to a different
>> file based on content of some elements.
>>
>> let's say I need to check content of <text3> and dump the whole <doc> to
>> a different file:
>>
>> <doc>
>> <text1>
>> <text2>
>> <text3> ... etc
>>
>> </doc>
>>
>> I'm trying to do this using sax. Are there any examples how to do this?
>> Is using sax ok for this task?
>
> SAX or StAX seem like the most obvious choices given the context.
>
> Any textbook SAX example should lead you to working code.
>
> I can post some code, but I doubt that it will show anything
> that various books and tutorials do not.
>
> Arne
>
>
I tried to accumulate the whole XML (<doc>...</doc>) as a string using 
SAX, but in this case all special characters are processed by the parser
and arrive as plain characters, not as "predefined entities" like &quot;

Using StAX, I get correct XML if I print events right away, but if I 
store them in a collection and print them later, I don't get the same result.






#7348

From: Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at>
Date: 2011-08-24 18:59 +0000
Message-ID: <slrnj5aigv.6gl.avl@gamma.logic.tuwien.ac.at>
In reply to: #7347
boris <boris@localhost.localdomain> wrote:
> Using StAX, I get correct XML if I print events right away, but if I 
> store them in a collection and print them later, I don't get the same result.

That sounds more like a bug in your code for "storing" and "printing later"
than a problem with stax itself. ;)
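
For comparison, a sketch of "store, then print later" using the StAX event API, where buffering is expected to work because each XMLEvent is a self-contained object. Whether this matches the code that misbehaved is unknown; one possible cause (a guess only) would be storing the cursor-style XMLStreamReader, whose state changes on every next(). The class name StaxDocFilter and the substring rule on <text3> are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies every <doc> whose <text3> contains `wanted`.
public class StaxDocFilter {

    public static String filter(String xml, String wanted) {
        StringWriter sink = new StringWriter();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer =
                    XMLOutputFactory.newInstance().createXMLEventWriter(sink);
            List<XMLEvent> buffer = null;  // events of the current <doc>, if any
            StringBuilder text3 = null;    // text of the current <text3>, if any
            boolean keep = false;

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (isStart(event, "doc")) { buffer = new ArrayList<>(); keep = false; }
                if (buffer == null) continue;  // outside any <doc>
                buffer.add(event);

                if (isStart(event, "text3")) {
                    text3 = new StringBuilder();
                } else if (event.isCharacters() && text3 != null) {
                    text3.append(event.asCharacters().getData());
                } else if (isEnd(event, "text3") && text3 != null) {
                    keep = keep || text3.toString().contains(wanted);
                    text3 = null;
                } else if (isEnd(event, "doc")) {
                    if (keep) {
                        // Replay the stored events; the writer does the escaping.
                        for (XMLEvent buffered : buffer) writer.add(buffered);
                    }
                    buffer = null;
                }
            }
            writer.flush();
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    private static boolean isStart(XMLEvent e, String name) {
        return e.isStartElement() && e.asStartElement().getName().getLocalPart().equals(name);
    }

    private static boolean isEnd(XMLEvent e, String name) {
        return e.isEndElement() && e.asEndElement().getName().getLocalPart().equals(name);
    }
}
```

The sketch writes the kept <doc> elements back to back without a wrapping root element; a real output file would normally add one. The exact escaping chosen by the writer (for example whether " is written literally in text) depends on the StAX implementation.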



#7356

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2011-08-24 19:10 -0400
Message-ID: <4e5584ec$0$304$14726298@news.sunsite.dk>
In reply to: #7347
On 8/24/2011 2:40 PM, boris wrote:
> On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
>> On 8/22/2011 8:05 PM, boris wrote:
>>> I need to process large xml file and dump some documents to a different
>>> file based on content of some elements.
>>>
>>> let's say I need to check content of <text3> and dump the whole <doc> to
>>> a different file:
>>>
>>> <doc>
>>> <text1>
>>> <text2>
>>> <text3> ... etc
>>>
>>> </doc>
>>>
>>> I'm trying to do this using sax. Are there any examples how to do this?
>>> Is using sax ok for this task?
>>
>> SAX or StAX seem like the most obvious choices given the context.
>>
>> Any textbook SAX example should lead you to working code.
>>
>> I can post some code, but I doubt that it will show anything
>> that various books and tutorials do not.

> I tried to accumulate the whole XML (<doc>...</doc>) as a string using SAX,
> but in this case all special characters are processed by the parser
> and arrive as plain characters, not as "predefined entities" like &quot;
>
> Using StAX, I get correct XML if I print events right away, but if I
> store them in a collection and print them later, I don't get the same
> result.

Any correct XML parser should convert the XML &quot; to a " in
a Java String.

Any correct XML formatter/serializer should convert it back again
when generating new XML.

Arne
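
A small sketch of that round trip using the JDK's DOM parser and an identity Transformer. The class and method names are made up for illustration. In text content, XML only requires & and < to be escaped, so the check below relies on &amp; coming back and makes no claim about how a given serializer writes the quote character.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: parse XML into a DOM tree and serialize it again.
public class RoundTrip {

    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static String serialize(Document doc) {
        try {
            StringWriter out = new StringWriter();
            // A Transformer created without a stylesheet copies its input unchanged.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo>&quot;a &amp; b&quot;</foo>");
        // The Java String holds the decoded characters: "a & b" with the quotes.
        System.out.println(doc.getDocumentElement().getTextContent());
        // The serializer escapes the ampersand again on the way out.
        System.out.println(serialize(doc));
    }
}
```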



#7362

From: Stanimir Stamenkov <s7an10@netscape.net>
Date: 2011-08-25 07:57 +0300
Message-ID: <j34kni$tg5$1@dont-email.me>
In reply to: #7356
Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:

> Any correct XML parser should convert the XML &quot; to a " in
> a Java String.
>
> Any correct XML formatter/serializer should convert it back again
> when generating new XML.

I think any sane XML serializer should not output " as &quot; in 
text content.

-- 
Stanimir



#7368

From: RedGrittyBrick <RedGrittyBrick@spamweary.invalid>
Date: 2011-08-25 10:39 +0100
Message-ID: <4e56183b$0$2937$fa0fcedb@news.zen.co.uk>
In reply to: #7362
On 25/08/2011 05:57, Stanimir Stamenkov wrote:
> Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
>
>> Any correct XML parser should convert the XML &quot; to a " in
>> a Java String.
>>
>> Any correct XML formatter/serializer should convert it back again
>> when generating new XML.
>
> I think any sane XML serializer should not output " as &quot; in text
> content.
>

If you use an XML parser to read '<foo delimiter="&quot;">...' you will 
get a structure with an attribute with a value of '"'.

If you serialise that structure back to XML again, I would hope to get 
'<foo delimiter="&quot;">...' again. Am I wrong?

-- 
RGB



#7396

From: Stanimir Stamenkov <s7an10@netscape.net>
Date: 2011-08-26 08:47 +0300
Message-ID: <j37c1a$pl7$1@dont-email.me>
In reply to: #7368
Thu, 25 Aug 2011 10:39:17 +0100, /RedGrittyBrick/:
> On 25/08/2011 05:57, Stanimir Stamenkov wrote:
>> Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
>>
>>> Any correct XML parser should convert the XML &quot; to a " in
>>> a Java String.
>>>
>>> Any correct XML formatter/serializer should convert it back again
>>> when generating new XML.
>>
>> I think any sane XML serializer should not output " as &quot; in text
>> content.
>
> If you use an XML parser to read '<foo delimiter="&quot;">...' you
> will get a structure with an attribute with a value of '"'.
>
> If you serialise that structure back to XML again, I would hope to
> get '<foo delimiter="&quot;">...' again. Am I wrong?

The serializer may choose (or be configured) to output:

<foo delimiter='"'>...

But my point was text content, not attribute values:

<foo>&quot;</foo>

and then:

<foo>"</foo>

-- 
Stanimir
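
A quick way to check that the escaped and literal forms above mean the same thing to a parser, sketched with the JDK's DOM parser. QuoteForms and readBack are illustrative names only.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares what a parser sees for two spellings of a quote.
public class QuoteForms {

    // Returns {value of the "delimiter" attribute, text content} of the root element.
    public static String[] readBack(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new String[] { root.getAttribute("delimiter"), root.getTextContent() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] escaped = readBack("<foo delimiter=\"&quot;\">&quot;</foo>");
        String[] literal = readBack("<foo delimiter='\"'>\"</foo>");
        // Both spellings decode to a single double-quote character in each position.
        System.out.println(Arrays.equals(escaped, literal)); // prints "true"
    }
}
```

Since the parsed values are identical, a serializer is free to pick either spelling; a byte-for-byte copy of the original markup is only possible if the text is copied without going through a parser and serializer.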
