Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #64905 > unrolled thread

Wikipedia XML Dump

Started bykevingloveruk@gmail.com
First post2014-01-28 03:45 -0800
Last post2014-01-29 11:39 +1000
Articles 7 — 6 participants

Back to article view | Back to comp.lang.python


Contents

  Wikipedia XML Dump kevingloveruk@gmail.com - 2014-01-28 03:45 -0800
    Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 09:11 -0800
      Re: Wikipedia XML Dump Skip Montanaro <skip@pobox.com> - 2014-01-28 12:15 -0600
    Re: Wikipedia XML Dump Kevin Glover <kevingloveruk@gmail.com> - 2014-01-28 14:31 -0800
      Re: Wikipedia XML Dump Burak Arslan <burak.arslan@arskom.com.tr> - 2014-01-29 00:47 +0200
        Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 17:52 -0800
    Re: Wikipedia XML Dump alex23 <wuwei23@gmail.com> - 2014-01-29 11:39 +1000

#64905 — Wikipedia XML Dump

Fromkevingloveruk@gmail.com
Date2014-01-28 03:45 -0800
SubjectWikipedia XML Dump
Message-ID<9ec53bc0-f2da-46f4-ad58-2c9a75653dbf@googlegroups.com>
Hi

I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover

[toc] | [next] | [standalone]


#64914

FromRustom Mody <rustompmody@gmail.com>
Date2014-01-28 09:11 -0800
Message-ID<03d8894a-417c-4445-aeb5-f0b1003ca5eb@googlegroups.com>
In reply to#64905
On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
> Hi

> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

> Is what I am trying to do actually feasible?

Cant really visualize what youve got...
When you 'download' wikipedia what do you get?
One 40GB file?
A zillion files?
Some other database format?

Another point:
sax is painful to use compared to full lxml (dom)
But then sax is the only choice when files cross a certain size
Thats why the above question

Also you may want to explore nltk

[toc] | [prev] | [next] | [standalone]


#64915

FromSkip Montanaro <skip@pobox.com>
Date2014-01-28 12:15 -0600
Message-ID<mailman.6077.1390932918.18130.python-list@python.org>
In reply to#64914
> Another point:
> sax is painful to use compared to full lxml (dom)
> But then sax is the only choice when files cross a certain size
> Thats why the above question

No matter what the choice of XML parser, I suspect you'll want to
convert it to some other form for processing.

Skip

[toc] | [prev] | [next] | [standalone]


#64923

FromKevin Glover <kevingloveruk@gmail.com>
Date2014-01-28 14:31 -0800
Message-ID<7500190f-18a6-42b2-a77a-982672ce1644@googlegroups.com>
In reply to#64905
Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

Kevin

[toc] | [prev] | [next] | [standalone]


#64924

FromBurak Arslan <burak.arslan@arskom.com.tr>
Date2014-01-29 00:47 +0200
Message-ID<mailman.6083.1390949255.18130.python-list@python.org>
In reply to#64923
hi,

On 01/29/14 00:31, Kevin Glover wrote:
> Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?
>
>

in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
be your only option.

hth,
burak

[toc] | [prev] | [next] | [standalone]


#64927

FromRustom Mody <rustompmody@gmail.com>
Date2014-01-28 17:52 -0800
Message-ID<d93ef7fe-3d74-400a-95f5-c7d781cb6ca1@googlegroups.com>
In reply to#64924
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,

> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts?? Just a combo of what Burak and Skip said:
I'd explore a thin veneer of even-driven lxml to get from 40 GB monolithic xml
to something (more) digestible to nltk

[toc] | [prev] | [next] | [standalone]


#64926

Fromalex23 <wuwei23@gmail.com>
Date2014-01-29 11:39 +1000
Message-ID<lc9m56$5nl$1@dont-email.me>
In reply to#64905
On 28/01/2014 9:45 PM, kevingloveruk@gmail.com wrote:
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ every time you need to do a search, 
you should get better performance using an XML database which will allow 
you to do queries directly on the xml data.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web