Groups > comp.lang.python > #64905 > unrolled thread
| Started by | kevingloveruk@gmail.com |
|---|---|
| First post | 2014-01-28 03:45 -0800 |
| Last post | 2014-01-29 11:39 +1000 |
| Articles | 7 — 6 participants |
Wikipedia XML Dump kevingloveruk@gmail.com - 2014-01-28 03:45 -0800
Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 09:11 -0800
Re: Wikipedia XML Dump Skip Montanaro <skip@pobox.com> - 2014-01-28 12:15 -0600
Re: Wikipedia XML Dump Kevin Glover <kevingloveruk@gmail.com> - 2014-01-28 14:31 -0800
Re: Wikipedia XML Dump Burak Arslan <burak.arslan@arskom.com.tr> - 2014-01-29 00:47 +0200
Re: Wikipedia XML Dump Rustom Mody <rustompmody@gmail.com> - 2014-01-28 17:52 -0800
Re: Wikipedia XML Dump alex23 <wuwei23@gmail.com> - 2014-01-29 11:39 +1000
| From | kevingloveruk@gmail.com |
|---|---|
| Date | 2014-01-28 03:45 -0800 |
| Subject | Wikipedia XML Dump |
| Message-ID | <9ec53bc0-f2da-46f4-ad58-2c9a75653dbf@googlegroups.com> |
Hi

I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-28 09:11 -0800 |
| Message-ID | <03d8894a-417c-4445-aeb5-f0b1003ca5eb@googlegroups.com> |
| In reply to | #64905 |
On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
> Hi
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
> Is what I am trying to do actually feasible?

I can't really visualize what you've got. When you 'download' Wikipedia, what do you get? One 40GB file? A zillion files? Some other database format?

Another point: SAX is painful to use compared to full lxml (DOM), but SAX is the only choice when files cross a certain size. That's why I asked the question above.

Also, you may want to explore nltk.
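
For readers who want to see what the SAX route looks like, here is a minimal sketch using only the standard library's `xml.sax`. It assumes the article body sits in `<text>` elements, as in MediaWiki export dumps. The `PhraseCounter` class, the `count_phrases` helper and the tiny `SAMPLE` document are hypothetical names made up for this illustration, not anything posted in the thread.

```python
import io
import xml.sax


class PhraseCounter(xml.sax.ContentHandler):
    """Count case-insensitive occurrences of phrases inside <text> elements."""

    def __init__(self, phrases):
        super().__init__()
        self.counts = {phrase: 0 for phrase in phrases}
        self._in_text = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "text":
            self._in_text = True
            self._chunks = []

    def characters(self, content):
        # SAX may deliver one element's text in several pieces, so buffer it.
        if self._in_text:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "text":
            body = "".join(self._chunks).lower()
            for phrase in self.counts:
                # Plain substring count: "of the dog" also matches "of the dogma".
                self.counts[phrase] += body.count(phrase.lower())
            self._in_text = False
            self._chunks = []


def count_phrases(source, phrases):
    """Parse `source` (a filename or binary file object) and return the counts."""
    handler = PhraseCounter(phrases)
    xml.sax.parse(source, handler)
    return handler.counts


# Tiny stand-in for the real dump; the real file would be passed by filename.
SAMPLE = b"""<mediawiki>
  <page><title>Dog</title><revision><text>The tail of the dog. A dog's life.</text></revision></page>
  <page><title>Cat</title><revision><text>Fear of the dog is common; of the Dog again.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(count_phrases(io.BytesIO(SAMPLE), ["of the dog", "dog's"]))
```

Only one article's text is held in memory at a time, which is why this style of parsing can cope with a single very large file. Every search still means another full pass over the dump.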
| From | Skip Montanaro <skip@pobox.com> |
|---|---|
| Date | 2014-01-28 12:15 -0600 |
| Message-ID | <mailman.6077.1390932918.18130.python-list@python.org> |
| In reply to | #64914 |
> Another point:
> sax is painful to use compared to full lxml (dom)
> But then sax is the only choice when files cross a certain size
> Thats why the above question

No matter what the choice of XML parser, I suspect you'll want to convert it to some other form for processing.

Skip
| From | Kevin Glover <kevingloveruk@gmail.com> |
|---|---|
| Date | 2014-01-28 14:31 -0800 |
| Message-ID | <7500190f-18a6-42b2-a77a-982672ce1644@googlegroups.com> |
| In reply to | #64905 |
Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

Kevin
| From | Burak Arslan <burak.arslan@arskom.com.tr> |
|---|---|
| Date | 2014-01-29 00:47 +0200 |
| Message-ID | <mailman.6083.1390949255.18130.python-list@python.org> |
| In reply to | #64923 |
hi,

On 01/29/14 00:31, Kevin Glover wrote:
> Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

In that case, http://lxml.de/tutorial.html#event-driven-parsing seems to be your only option.

hth,
burak
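
As a rough illustration of event-driven parsing, the sketch below uses `iterparse` from the standard library's `xml.etree.ElementTree`. `lxml.etree` offers an `iterparse` with a similar interface, but the stdlib version is used here so the example runs without extra installs. The `iter_pages` and `local_name` helpers, the sample document and its namespace URI are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, discarding pages once handled."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            root.clear()  # drop finished pages so memory use stays bounded


# Illustrative stand-in for the dump; the namespace URI is only an example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Dog</title><revision><text>The tail of the dog.</text></revision></page>
  <page><title>Cat</title><revision><text>Cats ignore the dog.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(page_text))
```

The important detail is clearing the root after each page. Without it, the parser quietly builds the whole tree in memory, which is exactly what a 43 GB input cannot afford.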
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-01-28 17:52 -0800 |
| Message-ID | <d93ef7fe-3d74-400a-95f5-c7d781cb6ca1@googlegroups.com> |
| In reply to | #64924 |
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,
> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?
> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts?? Just a combo of what Burak and Skip said: I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML to something (more) digestible to nltk.
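
One way to read the "thin veneer" idea is sketched below: stream the dump once, write each page as a line of JSON, then run phrase counts against that simpler file. This is only an assumption about what such a conversion could look like. The JSON Lines format, the `dump_to_jsonl` and `count_phrase` helpers and the sample data are all hypothetical, and the standard library's `iterparse` stands in for lxml.

```python
import io
import json
import re
import xml.etree.ElementTree as ET


def dump_to_jsonl(xml_source, out):
    """Write one JSON object per <page> to the text stream `out`."""
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # keep the root so finished pages can be dropped
    written, title, text = 0, "", ""
    for event, elem in context:
        if event != "end":
            continue
        name = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            out.write(json.dumps({"title": title, "text": text}) + "\n")
            written += 1
            title, text = "", ""
            root.clear()
    return written


def count_phrase(lines, phrase):
    """Count whole-phrase, case-insensitive matches across JSON Lines records."""
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)
    return sum(len(pattern.findall(json.loads(line)["text"])) for line in lines)


# Illustrative stand-in for the dump; the namespace URI is only an example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Dog</title><revision><text>The tail of the dog. A dog's life.</text></revision></page>
  <page><title>Dogma</title><revision><text>A study of the dogma, not of the Dog.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    buffer = io.StringIO()
    dump_to_jsonl(io.BytesIO(SAMPLE), buffer)
    records = buffer.getvalue().splitlines()
    for phrase in ("of the dog", "dog's"):
        print(phrase, count_phrase(records, phrase))
```

The lookaround assertions keep "of the dog" from matching inside "of the dogma", which a plain substring count would not do. The intermediate file is also the natural place to hand the text to a tool such as nltk.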
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2014-01-29 11:39 +1000 |
| Message-ID | <lc9m56$5nl$1@dont-email.me> |
| In reply to | #64905 |
On 28/01/2014 9:45 PM, kevingloveruk@gmail.com wrote:
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ every time you need to do a search, you should get better performance using an XML database which will allow you to do queries directly on the xml data.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients