| Newsgroups | comp.lang.python |
|---|---|
| Date | 2014-01-28 17:52 -0800 |
| References | <9ec53bc0-f2da-46f4-ad58-2c9a75653dbf@googlegroups.com> <7500190f-18a6-42b2-a77a-982672ce1644@googlegroups.com> <mailman.6083.1390949255.18130.python-list@python.org> |
| Message-ID | <d93ef7fe-3d74-400a-95f5-c7d781cb6ca1@googlegroups.com> (permalink) |
| Subject | Re: Wikipedia XML Dump |
| From | Rustom Mody <rustompmody@gmail.com> |
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,
> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?
> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts??

Just a combo of what Burak and Skip said: I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML to something (more) digestible to nltk.
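For illustration, here is a minimal sketch of what such a thin event-driven layer might look like. It is not code from the thread. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml so that it runs without extra installs; `lxml.etree.iterparse` offers a similar interface. The helper names `iter_pages` and `local_name`, the tiny `SAMPLE` document, and the assumed layout (a root element containing `<page>` elements, each with a `<title>` and a `<text>` somewhere inside) are assumptions for the example, not a verified description of the actual dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source, page_tag="page", title_tag="title", text_tag="text"):
    """Yield (title, text) pairs one page at a time from an XML stream.

    Hypothetical helper: the tag names are assumptions about the dump layout.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == page_tag:
            title = text = ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == title_tag:
                    title = child.text or ""
                elif name == text_tag:
                    text = child.text or ""
            yield title, text
            # Drop pages already processed so memory use stays roughly flat.
            root.clear()


# Tiny stand-in for the real dump (assumed structure, made-up content).
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><revision><text>First article.</text></revision></page>
  <page><title>Beta</title><revision><text>Second article.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

Each yielded `text` could then be written out to smaller files or handed to nltk one article at a time, instead of loading the whole document. For the real file, `source` would be a path or an open binary file object rather than the in-memory sample.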
Wikipedia XML Dump kevingloveruk@gmail.com - 2014-01-28 03:45 -0800
Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 09:11 -0800
Re: Wikipedia XML Dump Skip Montanaro <skip@pobox.com> - 2014-01-28 12:15 -0600
Re: Wikipedia XML Dump Kevin Glover <kevingloveruk@gmail.com> - 2014-01-28 14:31 -0800
Re: Wikipedia XML Dump Burak Arslan <burak.arslan@arskom.com.tr> - 2014-01-29 00:47 +0200
Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 17:52 -0800
Re: Wikipedia XML Dump alex23 <wuwei23@gmail.com> - 2014-01-29 11:39 +1000