Date: Thu, 11 Aug 2011 08:40:30 +0200
From: przemolicc@poczta.fm
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: String concatenation - which is the fastest way ?

On Wed, Aug 10, 2011 at 06:20:10PM +0200, Stefan Behnel wrote:
> przemolicc@poczta.fm, 10.08.2011 15:31:
>> On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
>>> On Wed, Aug 10, 2011 at 12:17 PM, wrote:
>>>> I'd like to write a python (2.6/2.7) script which connects to database, fetches
>>>> hundreds of thousands of rows, concat them (basically: create XML)
>>>> and then put the result into another table. Do I have any choice
>>>> regarding string concatenation in Python from the performance point of view ?
>>>> Since the number of rows is big I'd like to use the fastest possible library
>>>> (if there is any choice). Can you recommend me something ?
>>>
>>> First off, I have no idea why you would want to create an XML dump of
>>> hundreds of thousands of rows, only to store it in another table.
>>> However, if that is your intention, list joining is about as efficient
>>> as you're going to get in Python:
>>>
>>> lst=["asdf","qwer","zxcv"] # feel free to add 399,997 more list entries
>>> xml=""+"".join(lst)+""
>>>
>>> This sets xml to 'asdfqwerzxcv' which
>>> may or may not be what you're after.
>>
>> since this process (XML building) is running now inside database (using native SQL commands)
>> and is one-thread task it is quite slow. What I wanted to do is to spawn several python subprocesses in parallel which
>> will concat subset of the whole table (and then merge all of them at the end).
>> Basically:
>> - fetch all rows from the database (up to 1 million): what is recommended data type ?
>> - spawn X python processes each one:
>>     - concat its own subset
>> - merge the result from all the subprocesses
>>
>> This task is running on a server which has many but slow cores and I am trying to divide this task
>> into many subtasks.
>
> Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL)
> come with built-in Python support.

The data are in Oracle, so I have to use cx_Oracle.

> You still didn't provide enough information to make me understand why you
> need XML in between one database and another (or the same?), but if you
> go that route, you can just read data through multiple connections in
> multiple threads (or processes), have each build up one (or more) XML
> entries, and then push those into a queue. Then another thread (or more
> than one) can read from that queue and write the XML items into a file
> (or another database) as they come in.

I am not a database developer, so I don't want to change the whole data flow between the applications in my company.
Another process reads this XML from a particular Oracle table, so I have to put the final XML there.

> If your data has a considerable size, I wouldn't use string concatenation
> or joining at all (note that it requires 2x the memory during
> concatenation), but rather write it into a file, or even just process the
> data on the fly, i.e. write it back into the target table right away.
> Reading a file back in after the fact is much more resource friendly than
> keeping huge amounts of data in memory. And disk speed is usually not a
> problem when streaming data from disk into a database.

This server has 256 GB of RAM, so memory is not a problem.
Also, the select which fetches the data is sorted. That is why I have to
divide the work into subtasks carefully and then merge the results in the correct order.

Regards
Przemyslaw Bak (przemol)
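
For illustration, the plan described above (split the sorted rows into
contiguous subsets, concatenate each subset in its own process, merge the
pieces in order) might look roughly like the sketch below. It is only a
sketch under assumptions: the function names, the <row> and <rows> element
names and the idea that each row is a single value are all hypothetical,
and fetching the rows through cx_Oracle is left out entirely.

```python
from multiprocessing import Pool
from xml.sax.saxutils import escape


def split_chunks(rows, n):
    """Split rows into at most n contiguous slices, preserving order."""
    n = max(1, n)
    size = max(1, -(-len(rows) // n))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def concat_chunk(chunk):
    """Build the XML fragment for one slice (runs in a worker process)."""
    # <row> is a hypothetical element name; escape() handles &, < and >.
    return "".join("<row>%s</row>" % escape(str(value)) for value in chunk)


def build_xml(rows, workers=4):
    """Concatenate rows into one XML string, keeping the original order."""
    chunks = split_chunks(rows, workers)
    if workers <= 1:
        # Serial fallback: same result, no extra processes.
        parts = [concat_chunk(chunk) for chunk in chunks]
    else:
        pool = Pool(processes=workers)
        try:
            # Pool.map returns results in the order of its input, so the
            # sort order of the SELECT survives the merge step.
            parts = pool.map(concat_chunk, chunks)
        finally:
            pool.close()
            pool.join()
    # <rows> is likewise a hypothetical root element.
    return "<rows>" + "".join(parts) + "</rows>"
```

A call such as build_xml(rows, workers=8) should sit under an
if __name__ == "__main__": guard on platforms that start worker processes
by re-importing the script. Because the slices are contiguous and the
results come back in input order, no separate re-sorting step is needed,
but each slice is pickled and sent to a worker, so whether this beats a
single "".join() would have to be measured on the real data.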