| Started by | boris <boris@localhost.domain> |
|---|---|
| First post | 2011-08-22 20:05 -0400 |
| Last post | 2011-08-26 08:47 +0300 |
| Articles | 12 — 7 participants |
large xml file... boris <boris@localhost.domain> - 2011-08-22 20:05 -0400
Re: large xml file... Ian Shef <invalid@avoiding.spam> - 2011-08-23 00:43 +0000
Re: large xml file... boris <boris@localhost.domain> - 2011-08-22 20:53 -0400
Re: large xml file... boris <boris@localhost.domain> - 2011-08-22 20:55 -0400
Re: large xml file... Ian Shef <invalid@avoiding.spam> - 2011-08-23 19:48 +0000
Re: large xml file... Arne Vajhøj <arne@vajhoej.dk> - 2011-08-22 21:59 -0400
Re: large xml file... boris <boris@localhost.localdomain> - 2011-08-24 14:40 -0400
Re: large xml file... Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> - 2011-08-24 18:59 +0000
Re: large xml file... Arne Vajhøj <arne@vajhoej.dk> - 2011-08-24 19:10 -0400
Re: large xml file... Stanimir Stamenkov <s7an10@netscape.net> - 2011-08-25 07:57 +0300
Re: large xml file... RedGrittyBrick <RedGrittyBrick@spamweary.invalid> - 2011-08-25 10:39 +0100
Re: large xml file... Stanimir Stamenkov <s7an10@netscape.net> - 2011-08-26 08:47 +0300
| From | boris <boris@localhost.domain> |
|---|---|
| Date | 2011-08-22 20:05 -0400 |
| Subject | large xml file... |
| Message-ID | <j2uqp4$n8h$1@speranza.aioe.org> |
hi all,
I need to process large xml file and dump some documents to a different file based on content of some elements.

let's say I need to check content of <text3> and dump the whole <doc> to a different file:

<doc>
  <text1>
  <text2>
  <text3> ... etc
</doc>

I'm trying to do this using sax. Are there any examples how to do this? Is using sax ok for this task?
thanks.
| From | Ian Shef <invalid@avoiding.spam> |
|---|---|
| Date | 2011-08-23 00:43 +0000 |
| Message-ID | <Xns9F49B4434CADvaj4088ianshef@138.125.254.103> |
| In reply to | #7302 |
boris <boris@localhost.domain> wrote in news:j2uqp4$n8h$1@speranza.aioe.org:

> hi all,
> I need to process large xml file and dump some documents to a different
> file based on content of some elements.
>
> let's say I need to check content of <text3> and dump the whole <doc> to
> a different file:
>
> <doc>
> <text1>
> <text2>
> <text3> ... etc
>
> </doc>
>
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?
> thanks.

What you are asking is unclear to me.
Do you mean that <text3> will determine whether you dump the whole <doc> to another file?
Do you mean that <text3> will determine what file the whole <doc> will be dumped to?
Or do you mean that the whole <doc> will be dumped to some other file, and while you are at it, <text3> will also be checked and reported in some way?

Can you read the "large xml file" twice?
Can you put the whole "large xml file" (or at least the part preceding <text3>) into memory?
Can you copy the "large xml file" to another file while it is being processed?

Sorry about the questions, but I need clarification. I have used SAX and may be able to provide enlightenment. SAX has its uses, but is not so good when 'memory' is involved unless _you_ provide the memory. SAX appears to excel when processing can take place in a single pass with very little looking backwards. Consequently, it does not use as much memory as some other methods.
| From | boris <boris@localhost.domain> |
|---|---|
| Date | 2011-08-22 20:53 -0400 |
| Message-ID | <j2utk3$t1q$1@speranza.aioe.org> |
| In reply to | #7303 |
On 08/22/2011 08:43 PM, Ian Shef wrote:
> boris <boris@localhost.domain> wrote in news:j2uqp4$n8h$1@speranza.aioe.org:
>
>> hi all,
>> I need to process large xml file and dump some documents to a different
>> file based on content of some elements.
>>
>> let's say I need to check content of <text3> and dump the whole <doc> to
>> a different file:
>>
>> <doc>
>> <text1>
>> <text2>
>> <text3> ... etc
>>
>> </doc>
>>
>> I'm trying to do this using sax. Are there any examples how to do this?
>> Is using sax ok for this task?
>> thanks.
>
> What you are asking is unclear to me.
> Do you mean that <text3> will determine whether you dump the whole <doc> to
> another file?
> Do you mean that <text3> will determine what file the whole <doc> will be
> dumped to?
> Or do you mean that the whole <doc> will be dumped to some other file, and
> while you are at it, <text3> will also be checked and reported in some way?
>
> Can you read the "large xml file" twice?
> Can you put the whole "large xml file" (or at least the part preceding
> <text3>) into memory?
> Can you copy the "large xml file" to another file while it is being
> processed?
>
> Sorry about the questions, but I need clarification. I have used SAX and
> may be able to provide enlightenment. SAX has its uses, but is not so good
> when 'memory' is involved unless _you_ provide the memory. SAX appears to
> excel when processing can take place in a single pass with very little
> looking backwards. Consequently, it does not use as much memory as some
> other methods.

> Do you mean that <text3> will determine whether you dump the whole <doc> to
> another file?

yes

> Can you read the "large xml file" twice?

I would like to read it once.

> Can you put the whole "large xml file" (or at least the part preceding
> <text3>) into memory?

no.
| From | boris <boris@localhost.domain> |
|---|---|
| Date | 2011-08-22 20:55 -0400 |
| Message-ID | <j2utnu$t1q$2@speranza.aioe.org> |
| In reply to | #7304 |
> On 08/22/2011 08:43 PM, Ian Shef wrote:
> > Can you put the whole "large xml file" (or at least the part preceding
> > <text3>) into memory?
> no.

No, I can load the whole file. 1 doc is not a problem...
| From | Ian Shef <invalid@avoiding.spam> |
|---|---|
| Date | 2011-08-23 19:48 +0000 |
| Message-ID | <Xns9F4A8259B4392vaj4088ianshef@138.125.254.103> |
| In reply to | #7305 |
boris <boris@localhost.domain> wrote in news:j2utnu$t1q$2@speranza.aioe.org:

>> On 08/22/2011 08:43 PM, Ian Shef wrote:
>>> Can you put the whole "large xml file" (or at least the part
>>> preceding <text3>) into memory?
>> no.
>
> No, I can load the whole file. 1 doc is not a problem...

As you are processing, you can save the XML yourself (e.g. as a List of String_s). Based on the result of evaluating <text3>, you can choose to:

Open an output file, copy the List of String_s to the output file, and copy any succeeding XML to the output file,
or
discard the List and discontinue processing.

Alternatively, you can save the XML to a file as you process it. When you evaluate <text3>, you can choose to continue saving to the file, or delete the file and discontinue processing.
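The buffering idea described above could look roughly like the following with SAX. This is only a minimal sketch under assumptions: the class name DocFilter, the parameter `wanted` (the <text3> value that selects a <doc>), and the choice to collect matching documents in a List are all hypothetical and not taken from the thread. Because SAX hands over decoded characters, the handler has to re-escape text and attribute values itself when it rebuilds the markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: buffers one <doc> at a time and keeps it only
// if the text of its <text3> child equals the wanted value.
class DocFilter extends DefaultHandler {
    private final String wanted;                 // assumed selection criterion
    private final List<String> kept = new ArrayList<>();
    private StringBuilder doc;                   // current <doc>, null outside one
    private StringBuilder text3;                 // current <text3> text, null outside one
    private boolean matched;

    DocFilter(String wanted) { this.wanted = wanted; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("doc")) { doc = new StringBuilder(); matched = false; }
        if (doc == null) return;                 // not inside a <doc>: nothing to buffer
        doc.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            doc.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escapeAttr(atts.getValue(i))).append('"');
        }
        doc.append('>');
        if (qName.equals("text3")) text3 = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc == null) return;
        String s = new String(ch, start, length);
        doc.append(escapeText(s));               // re-escape what the parser decoded
        if (text3 != null) text3.append(s);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (doc == null) return;
        doc.append("</").append(qName).append('>');
        if (qName.equals("text3") && text3 != null) {
            matched |= text3.toString().trim().equals(wanted);
            text3 = null;
        } else if (qName.equals("doc")) {
            if (matched) kept.add(doc.toString());
            doc = null;                          // discard the buffer either way
        }
    }

    // Text content only needs &, < and > escaped.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Double-quoted attribute values additionally need " escaped.
    static String escapeAttr(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    // Parses the given XML and returns the re-serialized matching <doc> elements.
    static List<String> filter(String xml, String wanted) {
        DocFilter handler = new DocFilter(wanted);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.kept;
    }
}
```

For a genuinely large input, the matching documents would more sensibly be written to an output Writer as each </doc> is reached instead of being collected in memory; only one <doc> is buffered at a time either way.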
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-08-22 21:59 -0400 |
| Message-ID | <4e5309a2$0$303$14726298@news.sunsite.dk> |
| In reply to | #7302 |
On 8/22/2011 8:05 PM, boris wrote:
> I need to process large xml file and dump some documents to a different
> file based on content of some elements.
>
> let's say I need to check content of <text3> and dump the whole <doc> to
> a different file:
>
> <doc>
> <text1>
> <text2>
> <text3> ... etc
>
> </doc>
>
> I'm trying to do this using sax. Are there any examples how to do this?
> Is using sax ok for this task?

SAX or StAX seem the most obvious choices given the context.

Any textbook SAX example should lead you to working code.

I can post some code, but I doubt that it will show anything various books and tutorials do not.

Arne
| From | boris <boris@localhost.localdomain> |
|---|---|
| Date | 2011-08-24 14:40 -0400 |
| Message-ID | <j33gfr$u9l$1@speranza.aioe.org> |
| In reply to | #7306 |
On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
> On 8/22/2011 8:05 PM, boris wrote:
>> I need to process large xml file and dump some documents to a different
>> file based on content of some elements.
>>
>> let's say I need to check content of <text3> and dump the whole <doc> to
>> a different file:
>>
>> <doc>
>> <text1>
>> <text2>
>> <text3> ... etc
>>
>> </doc>
>>
>> I'm trying to do this using sax. Are there any examples how to do this?
>> Is using sax ok for this task?
>
> SAX or StAX seem the most obvious choices given the context.
>
> Any textbook SAX example should lead you to working code.
>
> I can post some code, but I doubt that it will show anything
> various books and tutorials do not.
>
> Arne

I tried to accumulate the whole xml (<doc>...</doc>) as string using sax, but in this case all special characters are processed by parser and are just characters and not "predefined entities" like &quot;

Using stax, I get correct xml if I print events right away, but if I store them in a collection and print them later, I don't get the same result.
| From | Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> |
|---|---|
| Date | 2011-08-24 18:59 +0000 |
| Message-ID | <slrnj5aigv.6gl.avl@gamma.logic.tuwien.ac.at> |
| In reply to | #7347 |
boris <boris@localhost.localdomain> wrote:
> Using stax, I get correct xml, if I print events right away, but if I
> store them in collection and print them later, I don't get the same result.

That sounds more like a bug in your code for "storing" and "printing later" than a problem with stax itself. ;)
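For comparison, here is a minimal sketch of "store the events, print them later" using the StAX event API (XMLEventReader and XMLEventWriter from javax.xml.stream). It is not the code that misbehaved, which was never posted; the class name StaxDocFilter and the parameter `wanted` are hypothetical. The events of one <doc> are held in a List and replayed through an XMLEventWriter only when <text3> matches, so escaping is left to the writer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: buffers the events of each <doc> and replays
// them to the output only if its <text3> text equals the wanted value.
class StaxDocFilter {
    static String filter(String xml, String wanted) {
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(out);

            List<XMLEvent> buffer = null;        // events of the current <doc>
            StringBuilder text3 = new StringBuilder();
            boolean inText3 = false;
            boolean matched = false;

            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();

                if (e.isStartElement()) {
                    String name = e.asStartElement().getName().getLocalPart();
                    if (name.equals("doc")) { buffer = new ArrayList<>(); matched = false; }
                    if (name.equals("text3")) { inText3 = true; text3.setLength(0); }
                }

                if (buffer != null) buffer.add(e);   // store the event object itself

                // Text may arrive in several Characters events, so accumulate it.
                if (inText3 && e.isCharacters()) text3.append(e.asCharacters().getData());

                if (e.isEndElement()) {
                    String name = e.asEndElement().getName().getLocalPart();
                    if (name.equals("text3")) {
                        inText3 = false;
                        matched |= text3.toString().trim().equals(wanted);
                    } else if (name.equals("doc") && buffer != null) {
                        if (matched) {
                            for (XMLEvent b : buffer) writer.add(b);  // replay later
                        }
                        buffer = null;
                    }
                }
            }
            writer.flush();
            writer.close();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return out.toString();
    }
}
```

One design point worth noting: the list holds XMLEvent objects, not Strings taken from the events, so characters such as < or & are escaped again when the writer serializes them.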
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-08-24 19:10 -0400 |
| Message-ID | <4e5584ec$0$304$14726298@news.sunsite.dk> |
| In reply to | #7347 |
On 8/24/2011 2:40 PM, boris wrote:
> On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
>> On 8/22/2011 8:05 PM, boris wrote:
>>> I need to process large xml file and dump some documents to a different
>>> file based on content of some elements.
>>>
>>> let's say I need to check content of <text3> and dump the whole <doc> to
>>> a different file:
>>>
>>> <doc>
>>> <text1>
>>> <text2>
>>> <text3> ... etc
>>>
>>> </doc>
>>>
>>> I'm trying to do this using sax. Are there any examples how to do this?
>>> Is using sax ok for this task?
>>
>> SAX or StAX seem the most obvious choices given the context.
>>
>> Any textbook SAX example should lead you to working code.
>>
>> I can post some code, but I doubt that it will show anything
>> various books and tutorials do not.

> I tried to accumulate the whole xml (<doc>...</doc>) as string using sax,
> but in this case all special characters are processed by parser
> and are just characters and not "predefined entities" like &quot;
>
> Using stax, I get correct xml, if I print events right away, but if I
> store them in collection and print them later, I don't get the same
> result.

Any correct XML parser should convert the XML &quot; to a " in a Java String.

Any correct XML formatter/serializer should convert it back again when generating new XML.

Arne
| From | Stanimir Stamenkov <s7an10@netscape.net> |
|---|---|
| Date | 2011-08-25 07:57 +0300 |
| Message-ID | <j34kni$tg5$1@dont-email.me> |
| In reply to | #7356 |
Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:

> Any correct XML parser should convert the XML &quot; to a " in
> a Java String.
>
> Any correct XML formatter/serializer should convert it back again
> when generating new XML.

I think any sane XML serializer should not output " as &quot; in text content.

-- 
Stanimir
| From | RedGrittyBrick <RedGrittyBrick@spamweary.invalid> |
|---|---|
| Date | 2011-08-25 10:39 +0100 |
| Message-ID | <4e56183b$0$2937$fa0fcedb@news.zen.co.uk> |
| In reply to | #7362 |
On 25/08/2011 05:57, Stanimir Stamenkov wrote:
> Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
>
>> Any correct XML parser should convert the XML &quot; to a " in
>> a Java String.
>>
>> Any correct XML formatter/serializer should convert it back again
>> when generating new XML.
>
> I think any sane XML serializer should not output " as &quot; in text
> content.

If you use an XML parser to read '<foo delimiter="&quot;">...' you will get a structure with an attribute with a value of '"'.

If you serialise that structure back to XML again, I would hope to get '<foo delimiter="&quot;">...' again. Am I wrong?

-- 
RGB
| From | Stanimir Stamenkov <s7an10@netscape.net> |
|---|---|
| Date | 2011-08-26 08:47 +0300 |
| Message-ID | <j37c1a$pl7$1@dont-email.me> |
| In reply to | #7368 |
Thu, 25 Aug 2011 10:39:17 +0100, /RedGrittyBrick/:
> On 25/08/2011 05:57, Stanimir Stamenkov wrote:
>> Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
>>
>>> Any correct XML parser should convert the XML &quot; to a " in
>>> a Java String.
>>>
>>> Any correct XML formatter/serializer should convert it back again
>>> when generating new XML.
>>
>> I think any sane XML serializer should not output " as &quot; in text
>> content.
>
> If you use an XML parser to read '<foo delimiter="&quot;">...' you
> will get a structure with an attribute with a value of '"'.
>
> If you serialise that structure back to XML again, I would hope to
> get '<foo delimiter="&quot;">...' again. Am I wrong?

The serializer may choose (or be configured) to output:

<foo delimiter='"'>...

But my point was text content, not attribute values:

<foo>&quot;</foo>

and then:

<foo>"</foo>

-- 
Stanimir
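The point running through the last few posts is that the escaped and unescaped spellings describe the same data, so a round trip may change the markup without changing the values. A small sketch with the StAX cursor API (XMLStreamWriter and XMLStreamReader) can check that. The class name QuoteRoundTrip and its two helper methods are hypothetical, and which spelling a particular serializer emits is deliberately not assumed here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for comparing values before and after a round trip.
class QuoteRoundTrip {

    // Serializes <foo delimiter="VALUE">VALUE</foo>, leaving escaping to the writer.
    static String write(String value) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("foo");
            w.writeAttribute("delimiter", value);
            w.writeCharacters(value);
            w.writeEndElement();
            w.flush();
            w.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    // Returns { attribute value, text content } as the parser reports them.
    static String[] read(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String attr = null;
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    attr = r.getAttributeValue(null, "delimiter");
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(r.getText());    // text may arrive in pieces
                }
            }
            return new String[] { attr, text.toString() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Reading `<foo delimiter="&quot;">&quot;</foo>` and `<foo delimiter='"'>"</foo>` through `read` should give the same pair of values, and `read(write("\""))` should give back the quote character in both positions, whichever form the writer chose.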