Date: Thu, 11 Aug 2011 08:40:30 +0200
From: przemolicc@poczta.fm
To: python-list@python.org
Subject: Re: String concatenation - which is the fastest way ?
References: <20110810111754.GD5045@host.pgf.com.pl> <CAPTjJmrF0GcVs0onfoCHAbMe38b5iLXgFX1R7G_RXcKxjPH5wQ@mail.gmail.com> <20110810133146.GE5045@host.pgf.com.pl> <j1ub3q$65c$1@dough.gmane.org>
MIME-Version: 1.0
In-Reply-To: <j1ub3q$65c$1@dough.gmane.org>
User-Agent: Mutt/1.5.18 (2008-05-17)
Precedence: list
Reply-To: przemolicc@poczta.fm
Newsgroups: comp.lang.python
Message-ID: <mailman.2152.1313044834.1164.python-list@python.org>

On Wed, Aug 10, 2011 at 06:20:10PM +0200, Stefan Behnel wrote:
> przemolicc@poczta.fm, 10.08.2011 15:31:
>> On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
>>> On Wed, Aug 10, 2011 at 12:17 PM,<przemolicc@poczta.fm>  wrote:
>>>> I'd like to write a python (2.6/2.7) script which connects to database, fetches
>>>> hundreds of thousands of rows, concat them (basically: create XML)
>>>> and then put the result into another table. Do I have any choice
>>>> regarding string concatenation in Python from the performance point of view ?
>>>> Since the number of rows is big I'd like to use the fastest possible library
>>>> (if there is any choice). Can you recommend me something ?
>>>
>>> First off, I have no idea why you would want to create an XML dump of
>>> hundreds of thousands of rows, only to store it in another table.
>>> However, if that is your intention, list joining is about as efficient
>>> as you're going to get in Python:
>>>
>>> lst=["asdf","qwer","zxcv"] # feel free to add 399,997 more list entries
>>> xml="<foo>"+"</foo><foo>".join(lst)+"</foo>"
>>>
>>> This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>' which
>>> may or may not be what you're after.
>>
>> since this process (XML building) is running now inside database (using native SQL commands)
>> and is one-thread task it is quite slow. What I wanted to do is to spawn several python subprocesses in parallel which
>> will concat subset of the whole table (and then merge all of them at the end).
>> Basically:
>> - fetch all rows from the database (up to 1 million): what is recommended data type ?
>> - spawn X python processes each one:
>>      - concat its own subset
>> - merge the result from all the subprocesses
>>
>> This task is running on a server which has many but slow cores and I am trying to divide this task
>> into many subtasks.
>
> Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL)
> come with built-in Python support.

The data are in Oracle, so I have to use cx_Oracle.

> You still didn't provide enough information to make me understand why you
> need XML in between one database and another (or the same?), but if you
> go that route, you can just read data through multiple connections in
> multiple threads (or processes), have each build up one (or more) XML
> entries, and then push those into a queue. Then another thread (or more
> than one) can read from that queue and write the XML items into a file
> (or another database) as they come in.
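
To check that I understand the queue layout described above, here is a
minimal sketch of it. Everything in it is an assumption made for
illustration: the names producer and run_pipeline are hypothetical,
in-memory lists stand in for rows fetched over separate database
connections, <foo> is the placeholder element from earlier in the thread,
and the writer only collects fragments in a dict where a real one would
write to a file or a table. Each fragment carries its chunk index, because
items leave the queue in completion order, not in SELECT order.

```python
import threading
from xml.sax.saxutils import escape

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.6/2.7

DONE = None  # sentinel telling the writer that no more items will arrive


def producer(index, rows, out_queue):
    """Build the XML text for one chunk and push it with its position."""
    fragment = "".join("<foo>%s</foo>" % escape(r) for r in rows)
    out_queue.put((index, fragment))


def run_pipeline(chunks):
    """Run one producer thread per chunk and a single writer thread."""
    out_queue = queue.Queue()
    collected = {}

    def writer():
        # Stand-in for "write the item to a file or another database".
        while True:
            item = out_queue.get()
            if item is DONE:
                break
            index, fragment = item
            collected[index] = fragment

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    producers = [
        threading.Thread(target=producer, args=(i, chunk, out_queue))
        for i, chunk in enumerate(chunks)
    ]
    for t in producers:
        t.start()
    for t in producers:
        t.join()

    out_queue.put(DONE)   # all producers finished, let the writer stop
    writer_thread.join()

    # Reassemble in the original chunk order, not in arrival order.
    return "".join(collected[i] for i in range(len(chunks)))
```

With threads the string building itself probably would not run in parallel
in CPython because of the GIL, so this only shows the shape of the data
flow, not a speed-up.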

I am not a database developer, so I don't want to change the whole data
flow between applications in my company. Another process reads this XML
from a particular Oracle table, so I have to put the final XML there.

> If your data has a considerable size, I wouldn't use string concatenation
> or joining at all (note that it requires 2x the memory during
> concatenation), but rather write it into a file, or even just process the
> data on the fly, i.e. write it back into the target table right away.
> Reading a file back in after the fact is much more resource friendly than
> keeping huge amounts of data in memory. And disk speed is usually not a
> problem when streaming data from disk into a database.

This server has 256 GB of RAM, so memory is not a problem.
Also, the SELECT that fetches the data is sorted. That is why I have to
carefully divide the work into subtasks and then merge the results in the
correct order.
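
Roughly what I have in mind, as a sketch only: the rows are assumed to be
already fetched into a list in SELECT order, each row is assumed to be a
plain string, <foo> is the placeholder element from earlier in the thread,
and the function names are made up for this example. The rows are cut into
contiguous slices, each slice is concatenated in its own process, and the
pieces are joined in slice order.

```python
from multiprocessing import Pool
from xml.sax.saxutils import escape


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous slices, keeping order."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def build_fragment(chunk):
    """Build the XML text for one contiguous slice (runs in a worker)."""
    return "".join("<foo>%s</foo>" % escape(value) for value in chunk)


def build_xml_parallel(rows, workers=4):
    """Concatenate all rows into one XML string using several processes."""
    chunks = split_into_chunks(rows, workers)
    pool = Pool(processes=workers)
    try:
        # Pool.map returns results in the order of its input, so the
        # fragments come back in the same order as the sorted rows.
        fragments = pool.map(build_fragment, chunks)
    finally:
        pool.close()
        pool.join()
    return "".join(fragments)
```

The call to build_xml_parallel(rows, workers=N) should sit under an
if __name__ == "__main__": guard so that worker processes can import the
module safely. Whether this is actually faster than a single join depends
on how much time goes into sending the rows to the workers, which I have
not measured.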

Regards
Przemyslaw Bak (przemol)