
Re: Wikipedia XML Dump

Newsgroups comp.lang.python
Date 2014-01-28 17:52 -0800
References <9ec53bc0-f2da-46f4-ad58-2c9a75653dbf@googlegroups.com> <7500190f-18a6-42b2-a77a-982672ce1644@googlegroups.com> <mailman.6083.1390949255.18130.python-list@python.org>
Message-ID <d93ef7fe-3d74-400a-95f5-c7d781cb6ca1@googlegroups.com> (permalink)
Subject Re: Wikipedia XML Dump
From Rustom Mody <rustompmody@gmail.com>



On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,

> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts? Just a combination of what Burak and Skip said:
I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML
to something (more) digestible to nltk.
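
To make the "thin veneer" idea concrete, here is a rough sketch of what the event-driven pass might look like. Assumptions, all mine: it uses the standard library's xml.etree.ElementTree.iterparse so it runs without extra dependencies (lxml.etree.iterparse offers a largely compatible interface); the element names page, title and text and the namespace in SAMPLE are illustrative stand-ins for the dump's real layout; iter_pages and local_name are hypothetical helper names.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the dump. The namespace and element layout are
# assumptions for illustration; check them against the real file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Alpha</title><revision><text>First article.</text></revision></page>
  <page><title>Beta</title><revision><text>Second article.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) pairs one page at a time from a file-like object."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it
    # so finished pages can be dropped from the tree.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, ""
            for node in elem.iter():
                name = local_name(node.tag)
                if name == "title":
                    title = node.text
                elif name == "text":
                    text = node.text or ""
            yield title, text
            # Discard everything parsed so far so memory use stays flat
            # regardless of the size of the input.
            root.clear()


for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For the real dump you would pass an open file (in binary mode) instead of io.BytesIO, and hand each text on to whatever nltk processing you have in mind. The point of clearing the root after every page is that only one page is ever held in memory at a time.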



