Groups > comp.lang.python > #64905 > unrolled thread

Wikipedia XML Dump

Started by	kevingloveruk@gmail.com
First post	2014-01-28 03:45 -0800
Last post	2014-01-29 11:39 +1000
Articles	7 — 6 participants

Back to article view | Back to comp.lang.python

  Wikipedia XML Dump kevingloveruk@gmail.com - 2014-01-28 03:45 -0800
    Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 09:11 -0800
      Re: Wikipedia XML Dump Skip Montanaro <skip@pobox.com> - 2014-01-28 12:15 -0600
    Re: Wikipedia XML Dump Kevin Glover <kevingloveruk@gmail.com> - 2014-01-28 14:31 -0800
      Re: Wikipedia XML Dump Burak Arslan <burak.arslan@arskom.com.tr> - 2014-01-29 00:47 +0200
        Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 17:52 -0800
    Re: Wikipedia XML Dump alex23 <wuwei23@gmail.com> - 2014-01-29 11:39 +1000

#64905 — Wikipedia XML Dump

From	kevingloveruk@gmail.com
Date	2014-01-28 03:45 -0800
Subject	Wikipedia XML Dump
Message-ID	<9ec53bc0-f2da-46f4-ad58-2c9a75653dbf@googlegroups.com>

Hi

I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover

[toc] | [next] | [standalone]

#64914

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-28 09:11 -0800
Message-ID	<03d8894a-417c-4445-aeb5-f0b1003ca5eb@googlegroups.com>
In reply to	#64905

On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
> Hi

> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

> Is what I am trying to do actually feasible?

Cant really visualize what youve got...
When you 'download' wikipedia what do you get?
One 40GB file?
A zillion files?
Some other database format?

Another point:
sax is painful to use compared to full lxml (dom)
But then sax is the only choice when files cross a certain size
Thats why the above question

Also you may want to explore nltk

[toc] | [prev] | [next] | [standalone]

#64915

From	Skip Montanaro <skip@pobox.com>
Date	2014-01-28 12:15 -0600
Message-ID	<mailman.6077.1390932918.18130.python-list@python.org>
In reply to	#64914

> Another point:
> sax is painful to use compared to full lxml (dom)
> But then sax is the only choice when files cross a certain size
> Thats why the above question

No matter what the choice of XML parser, I suspect you'll want to
convert it to some other form for processing.

Skip

[toc] | [prev] | [next] | [standalone]

#64923

From	Kevin Glover <kevingloveruk@gmail.com>
Date	2014-01-28 14:31 -0800
Message-ID	<7500190f-18a6-42b2-a77a-982672ce1644@googlegroups.com>
In reply to	#64905

Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

Kevin

[toc] | [prev] | [next] | [standalone]

#64924

From	Burak Arslan <burak.arslan@arskom.com.tr>
Date	2014-01-29 00:47 +0200
Message-ID	<mailman.6083.1390949255.18130.python-list@python.org>
In reply to	#64923

hi,

On 01/29/14 00:31, Kevin Glover wrote:
> Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?
>
>

in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
be your only option.

hth,
burak

[toc] | [prev] | [next] | [standalone]

#64927

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-28 17:52 -0800
Message-ID	<d93ef7fe-3d74-400a-95f5-c7d781cb6ca1@googlegroups.com>
In reply to	#64924

On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,

> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts?? Just a combo of what Burak and Skip said:
I'd explore a thin veneer of even-driven lxml to get from 40 GB monolithic xml
to something (more) digestible to nltk

[toc] | [prev] | [next] | [standalone]

#64926

From	alex23 <wuwei23@gmail.com>
Date	2014-01-29 11:39 +1000
Message-ID	<lc9m56$5nl$1@dont-email.me>
In reply to	#64905

On 28/01/2014 9:45 PM, kevingloveruk@gmail.com wrote:
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ every time you need to do a search, 
you should get better performance using an XML database which will allow 
you to do queries directly on the xml data.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients

[toc] | [prev] | [standalone]

csiph-web

Wikipedia XML Dump

Contents

#64905 — Wikipedia XML Dump

#64914

#64915

#64923

#64924

#64927

#64926