Groups > comp.lang.python > #42289 > unrolled thread

Re: How to find bad row with db api executemany()?

Started by	Dennis Lee Bieber <wlfraed@ix.netcom.com>
First post	2013-03-29 18:25 -0400
Last post	2013-03-30 06:38 -0500
Articles	20 on this page of 24 — 5 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 18:25 -0400
    Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 20:41 -0400
      Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 11:57 +1100
        Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 21:19 -0400
          Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 13:05 +1100
          Re: How to find bad row with db api executemany()? Tim Chase <python.list@tim.thechases.com> - 2013-03-29 22:17 -0500
          Re: How to find bad row with db api executemany()? (PS) Tim Chase <python.list@tim.thechases.com> - 2013-03-29 22:38 -0500
          Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 23:38 -0400
      Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 22:44 -0400
        Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 13:49 +1100
          Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:09 -0400
            Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 14:14 +1100
              Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:36 -0400
                Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 14:57 +1100
                  Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-30 00:10 -0400
                    Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 15:21 +1100
                      Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-30 10:19 -0400
          Re: How to find bad row with db api executemany()? rusi <rustompmody@gmail.com> - 2013-03-29 20:13 -0700
            Re: How to find bad row with db api executemany()? rusi <rustompmody@gmail.com> - 2013-03-29 20:15 -0700
            Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:40 -0400
        Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 23:53 -0400
        Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-30 00:19 -0400
        Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 15:24 +1100
        Re: How to find bad row with db api executemany()? Tim Chase <python.list@tim.thechases.com> - 2013-03-30 06:38 -0500

Page 1 of 2 [1] 2 Next page →

#42289 — Re: How to find bad row with db api executemany()?

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-03-29 18:25 -0400
Subject	Re: How to find bad row with db api executemany()?
Message-ID	<mailman.3971.1364595940.2939.python-list@python.org>

On Fri, 29 Mar 2013 14:53:30 -0400, Dave Angel <davea@davea.name>
declaimed the following in gmane.comp.python.general:

> On 03/29/2013 10:48 AM, Roy Smith wrote:
> > I'm inserting a gazillion rows into a MySQL database using MySQLdb and cursor.executemany() for efficiency.  Every once in a while, I get a row which violates some kind of database constraint and raises Error.
> >
> > I can catch the exception, but don't see any way to tell which row caused the problem.  Is this information obtainable, short of retrying each row one by one?
> >
> 
> I don't know the direct answer, or even if there is one (way to get 
> MySQL to tell you which one failed), but ...
> 
> Assuming that executeMany is much cheaper than a million calls to 
> executeOne (or whatever).

	If using MySQLdb, there isn't all that much difference... MySQLdb is
still compatible with MySQL v4 (and maybe even v3), and since those
versions don't have "prepared statements", .executemany() essentially
turns into something that creates a newline delimited "list" of
"identical" (but for argument substitution) statements and submits that
to MySQL.

	Just look at the source code (cursors.py in MySQLdb package).

	Hmmm... That gives a rally point. If it IS submitting one massive
string containing all the data copies, could the failure be coming from
submitting something too big for the client/server communication
channel?
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [next] | [standalone]

#42295

From	Roy Smith <roy@panix.com>
Date	2013-03-29 20:41 -0400
Message-ID	<roy-A61512.20410329032013@news.panix.com>
In reply to	#42289

In article <mailman.3971.1364595940.2939.python-list@python.org>,
 Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:

> If using MySQLdb, there isn't all that much difference... MySQLdb is
> still compatible with MySQL v4 (and maybe even v3), and since those
> versions don't have "prepared statements", .executemany() essentially
> turns into something that creates a newline delimited "list" of
> "identical" (but for argument substitution) statements and submits that
> to MySQL.

Shockingly, that does appear to be the case.  I had thought during my 
initial testing that I was seeing far greater throughput, but as I got 
more into the project and started doing some side-by-side comparisons, 
it the differences went away.

We're sucking in a pretty huge amount of data.  The source document is a 
7 GB gzipped XML file.  I'm not sure how big it is uncompressed (we use 
gzip.GzipFile to uncompress on the fly) but I'm guessing something like 
a 30x compression ratio so 200 GB?  The last time we imported the whole 
set, it ran for 21 days!

It turns out, the problems we were seeing were all inserts into a new 
table we added.   Apparently, the default charset is latin-1 and we 
didn't notice that when we altered the schema!  Once I noticed that all 
the other tables were utf-8 and changed this one to be that, the 
problems went away.

Sadly, I ended up resorting to a truly ugly hack to diagnose the 
problem.  I catch the exception and parse the text message.  Yuck.

    try:
        self.executemany(self.sql_statement, self.sql_params)
    except MySQLdb.Error as ex:
        code, message = ex.args
        m = re.search(r".* at row (\d+)$", message)
        if m:
            i = int(m.group(1)) - 1  # Python is 0-index, SQL, 1-index

The other truly horrible part of the project was when I decided it was 
bad for my import script to have too much schema knowledge hard-wired 
in.  So, I decided to use SQLAlchemy to introspect the database and 
discover the column names, types, and defaults.  It turns out, if an 
integer column has a default (say, 0), the introspected data comes back 
with the default as the string, '0'.  WTF???

Does Postgress's Python adapter handle executemany() in a sane way?  
We're not wedded to MySQL in any way.  We use it for exactly this one 
process.  We get these XML dumps from a supplier's SQL-Server database.  
We stage the data in MySQL, then export what we need into MongoDB.  We 
could easily swap out the MySQL staging for Postgress if that worked 
better.

Hmmm, we do take advantage of REPLACE INTO, which I think is a 
non-standard MySQL addition.  Not sure if Postgress supports that.

[toc] | [prev] | [next] | [standalone]

#42296

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 11:57 +1100
Message-ID	<mailman.3977.1364605026.2939.python-list@python.org>
In reply to	#42295

On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3971.1364595940.2939.python-list@python.org>,
>  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>> If using MySQLdb, there isn't all that much difference... MySQLdb is
>> still compatible with MySQL v4 (and maybe even v3), and since those
>> versions don't have "prepared statements", .executemany() essentially
>> turns into something that creates a newline delimited "list" of
>> "identical" (but for argument substitution) statements and submits that
>> to MySQL.
>
> Shockingly, that does appear to be the case.  I had thought during my
> initial testing that I was seeing far greater throughput, but as I got
> more into the project and started doing some side-by-side comparisons,
> it the differences went away.

How much are you doing per transaction? The two extremes (everything
in one transaction, or each line in its own transaction) are probably
the worst for performance. See what happens if you pepper the code
with 'begin' and 'commit' statements (maybe every thousand or ten
thousand rows) to see if performance improves.

ChrisA

[toc] | [prev] | [next] | [standalone]

#42299

From	Roy Smith <roy@panix.com>
Date	2013-03-29 21:19 -0400
Message-ID	<roy-AE1A29.21192229032013@news.panix.com>
In reply to	#42296

In article <mailman.3977.1364605026.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote:
> > In article <mailman.3971.1364595940.2939.python-list@python.org>,
> >  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
> >
> >> If using MySQLdb, there isn't all that much difference... MySQLdb is
> >> still compatible with MySQL v4 (and maybe even v3), and since those
> >> versions don't have "prepared statements", .executemany() essentially
> >> turns into something that creates a newline delimited "list" of
> >> "identical" (but for argument substitution) statements and submits that
> >> to MySQL.
> >
> > Shockingly, that does appear to be the case.  I had thought during my
> > initial testing that I was seeing far greater throughput, but as I got
> > more into the project and started doing some side-by-side comparisons,
> > it the differences went away.
> 
> How much are you doing per transaction? The two extremes (everything
> in one transaction, or each line in its own transaction) are probably
> the worst for performance. See what happens if you pepper the code
> with 'begin' and 'commit' statements (maybe every thousand or ten
> thousand rows) to see if performance improves.
> 
> ChrisA

We're doing it all in one transaction, on purpose.  We start with an 
initial dump, then get updates about once a day.  We want to make sure 
that the updates either complete without errors, or back out cleanly.  
If we ever had a partial daily update, the result would be a mess.

Hmmm, on the other hand, I could probably try doing the initial dump the 
way you describe.  If it fails, we can just delete the whole thing and 
start again.

[toc] | [prev] | [next] | [standalone]

#42300

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 13:05 +1100
Message-ID	<mailman.3979.1364609118.2939.python-list@python.org>
In reply to	#42299

On Sat, Mar 30, 2013 at 12:19 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3977.1364605026.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote:
>> > In article <mailman.3971.1364595940.2939.python-list@python.org>,
>> >  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>> >
>> >> If using MySQLdb, there isn't all that much difference... MySQLdb is
>> >> still compatible with MySQL v4 (and maybe even v3), and since those
>> >> versions don't have "prepared statements", .executemany() essentially
>> >> turns into something that creates a newline delimited "list" of
>> >> "identical" (but for argument substitution) statements and submits that
>> >> to MySQL.
>> >
>> > Shockingly, that does appear to be the case.  I had thought during my
>> > initial testing that I was seeing far greater throughput, but as I got
>> > more into the project and started doing some side-by-side comparisons,
>> > it the differences went away.
>>
>> How much are you doing per transaction? The two extremes (everything
>> in one transaction, or each line in its own transaction) are probably
>> the worst for performance. See what happens if you pepper the code
>> with 'begin' and 'commit' statements (maybe every thousand or ten
>> thousand rows) to see if performance improves.
>>
>> ChrisA
>
> We're doing it all in one transaction, on purpose.  We start with an
> initial dump, then get updates about once a day.  We want to make sure
> that the updates either complete without errors, or back out cleanly.
> If we ever had a partial daily update, the result would be a mess.
>
> Hmmm, on the other hand, I could probably try doing the initial dump the
> way you describe.  If it fails, we can just delete the whole thing and
> start again.

One transaction for the lot isn't nearly as bad as one transaction per
row, but it can consume a lot of memory on the server - or at least,
that's what I found last time I worked with MySQL. (PostgreSQL works
completely differently, and I'd strongly recommend doing it all as one
transaction if you switch.) It's not guaranteed to help, but if it
won't hurt to try, there's a chance you'll gain some performance.

ChrisA

[toc] | [prev] | [next] | [standalone]

#42308

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-03-29 22:17 -0500
Message-ID	<mailman.3982.1364613376.2939.python-list@python.org>
In reply to	#42299

On 2013-03-29 21:19, Roy Smith wrote:
> We're doing it all in one transaction, on purpose.  We start with
> an initial dump, then get updates about once a day.  We want to
> make sure that the updates either complete without errors, or back
> out cleanly. If we ever had a partial daily update, the result
> would be a mess.

Having had to do some similarly-sized bulk data loads (in my case,
MSSqlServer at $JOB) couple other ideas occur to me:

1) I believe MySQL has a side-loading function (I'd have to go
digging for it; a quick google suggests a "LOAD DATA INFILE"
statement[1]) that allows you to load data from an external file such
as an XML or CSV file

2) Load into a temp table in testable batches, then do some sort of
batch insert into your main table.  Again, a quick google suggest the
"INSERT ... SELECT" syntax[2]

-tkc

[1]
http://dev.mysql.com/doc/refman/5.1/en/load-data.html

[2]
http://dev.mysql.com/doc/refman/5.0/en/insert-select.html

[toc] | [prev] | [next] | [standalone]

#42309 — Re: How to find bad row with db api executemany()? (PS)

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-03-29 22:38 -0500
Subject	Re: How to find bad row with db api executemany()? (PS)
Message-ID	<mailman.3983.1364614588.2939.python-list@python.org>
In reply to	#42299

On 2013-03-29 22:17, Tim Chase wrote:
> 2) Load into a temp table in testable batches, then do some sort of
> batch insert into your main table.  Again, a quick google suggest
> the "INSERT ... SELECT" syntax[2]

It looks like there's a corresponding "REPLACE INTO ... SELECT"
syntax[1], as you mention doing a REPLACE INTO rather than a straight
INSERT

-tkc

[1]
http://dev.mysql.com/doc/refman/5.0/en/replace.html

[toc] | [prev] | [next] | [standalone]

#42311

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-03-29 23:38 -0400
Message-ID	<mailman.3984.1364614714.2939.python-list@python.org>
In reply to	#42299

On Fri, 29 Mar 2013 21:19:22 -0400, Roy Smith <roy@panix.com> declaimed
the following in gmane.comp.python.general:

> In article <mailman.3977.1364605026.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> 
> > 
> > How much are you doing per transaction? The two extremes (everything
> > in one transaction, or each line in its own transaction) are probably
> > the worst for performance. See what happens if you pepper the code
> > with 'begin' and 'commit' statements (maybe every thousand or ten
> > thousand rows) to see if performance improves.
> > 
> > ChrisA
> 
> We're doing it all in one transaction, on purpose.  We start with an 
> initial dump, then get updates about once a day.  We want to make sure 
> that the updates either complete without errors, or back out cleanly.  
> If we ever had a partial daily update, the result would be a mess.
>
	As I recall, DB-API compliance should have started a transaction
when you submit a statement, so explicit "begin" should not be required.

	My suggestion would be to first try just breaking the .executemany()
into chunks rather than everything at once. Submit maybe 500-1000
records at a time via .executemany(), checking for errors -- if a block
errors you can still rollback() the entire transaction AND you have a
smaller set of data to examine. Then at the end, if no errors, do the
commit()
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [prev] | [next] | [standalone]

#42301

From	Roy Smith <roy@panix.com>
Date	2013-03-29 22:44 -0400
Message-ID	<roy-1FF146.22445329032013@news.panix.com>
In reply to	#42295

In article <roy-A61512.20410329032013@news.panix.com>,
 Roy Smith <roy@panix.com> wrote:

> In article <mailman.3971.1364595940.2939.python-list@python.org>,
>  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
> 
> > If using MySQLdb, there isn't all that much difference... MySQLdb is
> > still compatible with MySQL v4 (and maybe even v3), and since those
> > versions don't have "prepared statements", .executemany() essentially
> > turns into something that creates a newline delimited "list" of
> > "identical" (but for argument substitution) statements and submits that
> > to MySQL.
> 
> Shockingly, that does appear to be the case.  I had thought during my 
> initial testing that I was seeing far greater throughput, but as I got 
> more into the project and started doing some side-by-side comparisons, 
> it the differences went away.

OMG, this is amazing.

http://stackoverflow.com/questions/3945642/

It turns out, the MySQLdb executemany() runs a regex over your SQL and 
picks one of two algorithms depending on whether it matches or not.

restr = (r"\svalues\s*"
        r"(\(((?<!\\)'[^\)]*?\)[^\)]*(?<!\\)?'"
        r"|[^\(\)]|"
        r"(?:\([^\)]*\))"
        r")+\))")

Leaving aside the obvious line-noise aspects, the operative problem here 
is that it only looks for "values" (in lower case).

I've lost my initial test script which convinced me that executemany() 
would be a win; I'm assuming I used lower case for that.  Our production 
code uses "VALUES".

The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4 
seconds.  When I switch to "values", I'm getting more like 1000 rows in 
100 ms!

A truly breathtaking bug.

[toc] | [prev] | [next] | [standalone]

#42302

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 13:49 +1100
Message-ID	<mailman.3980.1364611804.2939.python-list@python.org>
In reply to	#42301

On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <roy@panix.com> wrote:
> The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
> seconds.  When I switch to "values", I'm getting more like 1000 rows in
> 100 ms!
>
> A truly breathtaking bug.

*facepalm*

Doubly facepalm because a regex could easily have tested for mixed case.

Especially facepalm because there's some way to do this that's faster
than straight INSERT statements, and it's not clearly documented as
"hey, guys, if you want to dump loads of data in, use COPY instead"
(it might be that, I don't know, but usually COPY isn't directly
transliterable with INSERT).

I agree. Breathtaking.

ChrisA

[toc] | [prev] | [next] | [standalone]

#42304

From	Roy Smith <roy@panix.com>
Date	2013-03-29 23:09 -0400
Message-ID	<roy-8D4A5A.23093729032013@news.panix.com>
In reply to	#42302

In article <mailman.3980.1364611804.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> Especially facepalm because there's some way to do this that's faster
> than straight INSERT statements, and it's not clearly documented as
> "hey, guys, if you want to dump loads of data in, use COPY instead"
> (it might be that, I don't know, but usually COPY isn't directly
> transliterable with INSERT).

We're actually using REPLACE INTO.  For the initial data load, we could 
just as well do INSERT, but we need the REPLACE functionality as we roll 
in the daily incremental updates.

This also explains why, even after provisioning our RDS instance for 
2000 IOPS (that's AWS-speak for "we paid extra to get more disk 
bandwidth"), we didn't see any performance improvement!

[toc] | [prev] | [next] | [standalone]

#42306

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 14:14 +1100
Message-ID	<mailman.3981.1364613280.2939.python-list@python.org>
In reply to	#42304

On Sat, Mar 30, 2013 at 2:09 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3980.1364611804.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> Especially facepalm because there's some way to do this that's faster
>> than straight INSERT statements, and it's not clearly documented as
>> "hey, guys, if you want to dump loads of data in, use COPY instead"
>> (it might be that, I don't know, but usually COPY isn't directly
>> transliterable with INSERT).
>
> We're actually using REPLACE INTO.  For the initial data load, we could
> just as well do INSERT, but we need the REPLACE functionality as we roll
> in the daily incremental updates.
>
> This also explains why, even after provisioning our RDS instance for
> 2000 IOPS (that's AWS-speak for "we paid extra to get more disk
> bandwidth"), we didn't see any performance improvement!

Hmm. I heard around the forums that Amazon weren't that great at disk
bandwidth anyway, and that provisioning IO was often a waste of money.
But we never did all that much much research on Amazon I/O
performance; shortly after doing some basic benchmarking, we decided
that the cloud was a poor fit for our system model, and went looking
at dedicated servers with their own RAID storage right there on the
bus.

ChrisA

[toc] | [prev] | [next] | [standalone]

#42310

From	Roy Smith <roy@panix.com>
Date	2013-03-29 23:36 -0400
Message-ID	<roy-211B1A.23365429032013@news.panix.com>
In reply to	#42306

In article <mailman.3981.1364613280.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> Hmm. I heard around the forums that Amazon weren't that great at disk
> bandwidth anyway, and that provisioning IO was often a waste of money.

Au, contraire.  I guess it all depends on what you're doing.  If you're 
CPU bound, increasing your I/O bandwidth won't help.  But, at least on 
our database (MongoDB) servers, we saw a huge performance boost when we 
started going for provisioned IO.

> But we never did all that much much research on Amazon I/O
> performance; shortly after doing some basic benchmarking, we decided
> that the cloud was a poor fit for our system model, and went looking
> at dedicated servers with their own RAID storage right there on the
> bus.

As far as I can tell, from a raw price/performance basis, they're pretty 
expensive.  But, from a convenience standpoint, it's hard to beat.

Case in point: We've been thinking about SSD as our next performance 
step-up.  One day, we just spun up some big honking machine, configured 
it with 2 TB of SSD, and played around for a while.  Wicked fast.  Then 
we shut it down.  That experiment probably cost us $10 or so, and we 
were able to run it on the spur of the moment.

Another example was last summer when we had a huge traffic spike because 
of a new product release.  Caught us by surprise how much new traffic it 
would generate.  Our site was in total meltdown.  We were able to spin 
up 10 new servers in an afternoon.  If we had to go out and buy 
hardware, have it shipped to us, figure out where we had rack space, 
power, network capacity, cooling, etc, we'd have been out of business 
before we got back on the air.

Yet another example.  We just (as in, while I've been typing this) had 
one of our servers go down.  Looks like the underlying hardware the VM 
was running on croaked, because when the instance came back up, it had a 
new IP address.  The whole event was over in a couple of minutes, with 
only minor disruption to the service.  And, presumably, there's some 
piece of hardware somewhere in Virginia that needs repairing, but that's 
not our problem.

The really big boys (Google, Facebook) run their own data centers.  But, 
some surprisingly large operations run out of AWS.  Netflix, for 
example.  The convenience and flexibility is worth a lot.

[toc] | [prev] | [next] | [standalone]

#42314

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 14:57 +1100
Message-ID	<mailman.3986.1364615879.2939.python-list@python.org>
In reply to	#42310

On Sat, Mar 30, 2013 at 2:36 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3981.1364613280.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> Hmm. I heard around the forums that Amazon weren't that great at disk
>> bandwidth anyway, and that provisioning IO was often a waste of money.
>
> Au, contraire.  I guess it all depends on what you're doing.  If you're
> CPU bound, increasing your I/O bandwidth won't help.  But, at least on
> our database (MongoDB) servers, we saw a huge performance boost when we
> started going for provisioned IO.
>
> As far as I can tell, from a raw price/performance basis, they're pretty
> expensive.  But, from a convenience standpoint, it's hard to beat.

Yeah, I'm not saying you won't see performance go up - just that it's
going to be expensive compared to what you could do with a dedicated
server. The flexibility costs.

> Case in point: We've been thinking about SSD as our next performance
> step-up.  One day, we just spun up some big honking machine, configured
> it with 2 TB of SSD, and played around for a while.  Wicked fast.  Then
> we shut it down.  That experiment probably cost us $10 or so, and we
> were able to run it on the spur of the moment.

That is one thing the cloud is *awesome* for. "Hmm, I wonder......"
*half an hour later* "Now I know." *bill comes in* "That was cheap."

> Another example was last summer when we had a huge traffic spike because
> of a new product release.  Caught us by surprise how much new traffic it
> would generate.  Our site was in total meltdown.  We were able to spin
> up 10 new servers in an afternoon.  If we had to go out and buy
> hardware, have it shipped to us, figure out where we had rack space,
> power, network capacity, cooling, etc, we'd have been out of business
> before we got back on the air.

Yep. We're looking rather at server rental, from big honking data
centers (by the way, I never managed to figure this out - at what size
do things begin to honk? Larger than a bread box?), where hopefully
they would be able to deploy us more servers within a matter of hours.
Not as good as cloud (where you can have more servers in minutes), but
you described "in an afternoon" as your success story, so I'm guessing
most of the delay was an administrative one - the decision to actually
go ahead and spin those servers up. Unless you're automating the whole
thing (which is possible with the Amazon cloud, but not every client
will do it), that's always going to cost you.

> Yet another example.  We just (as in, while I've been typing this) had
> one of our servers go down.  Looks like the underlying hardware the VM
> was running on croaked, because when the instance came back up, it had a
> new IP address.  The whole event was over in a couple of minutes, with
> only minor disruption to the service.  And, presumably, there's some
> piece of hardware somewhere in Virginia that needs repairing, but that's
> not our problem.

Yeah, that's also a concern. And that works beautifully as long as
you're okay with that. It'll also happen more often with AWS than with
dedicated hardware, because there are more components to fail. That's
part of what wouldn't have fitted our server layout; we have a layer
on top of all that to manage monitoring (and, incidentally, we set up
a dozen home-grade laptops as a "server cluster" to test all those
systems, and knew within minutes of one of the laptops deciding that
it had crunched its last bits), and we really want stable IP addresses
for DNS and such. Cloud isn't bad, it was just a bad fit for us.

Side point: You mentioned SSDs. Are you aware of the fundamental risks
associated with them? Only a handful of SSD models are actually
trustworthy for databasing. Sure, they're fast, but can you afford
data corruption in the event of a power outage? Most SSDs will
cheerfully lie about fsync() and thus violate transactional integrity
(ACID compliance etc).

ChrisA

[toc] | [prev] | [next] | [standalone]

#42315

From	Roy Smith <roy@panix.com>
Date	2013-03-30 00:10 -0400
Message-ID	<roy-0B9881.00100930032013@news.panix.com>
In reply to	#42314

In article <mailman.3986.1364615879.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> Side point: You mentioned SSDs. Are you aware of the fundamental risks
> associated with them? Only a handful of SSD models are actually
> trustworthy for databasing.

We haven't decided if we're going that route yet, but if we do, we will 
probably use do RAID SSD for added reliability.  We also have all our 
database servers in failover clusters, so we get added reliability that 
way too.

But, we have some runway left with more conventional technologies, so we 
don't need to decide for a while.  Ultimately, however, as reliability 
goes up and cost comes down, it's hard to imagine the SSD isn't going to 
be a big part of our lives at some point.

[toc] | [prev] | [next] | [standalone]

#42317

From	Chris Angelico <rosuav@gmail.com>
Date	2013-03-30 15:21 +1100
Message-ID	<mailman.3988.1364617294.2939.python-list@python.org>
In reply to	#42315

On Sat, Mar 30, 2013 at 3:10 PM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3986.1364615879.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> Side point: You mentioned SSDs. Are you aware of the fundamental risks
>> associated with them? Only a handful of SSD models are actually
>> trustworthy for databasing.
>
> We haven't decided if we're going that route yet, but if we do, we will
> probably use do RAID SSD for added reliability.  We also have all our
> database servers in failover clusters, so we get added reliability that
> way too.

But will you know if you have corruption? Normally, transactional
integrity means:

1) If a transaction, from begin to commit, is not completely applied,
then it is completely not-applied; and
2) If a transaction that was affected by this one has been applied,
then so has this one.

SSDs that lie about fsync (and some hard disks lie too, as do some
operating systems and some file system drivers, but - under Linux at
least - it's possible to guarantee the OS and FS parts) can violate
both halves. Anything might have been written, anything might have
been missed. I did some tests with PostgreSQL on an SSD, and the
results were seriously scary.

> But, we have some runway left with more conventional technologies, so we
> don't need to decide for a while.  Ultimately, however, as reliability
> goes up and cost comes down, it's hard to imagine the SSD isn't going to
> be a big part of our lives at some point.

Yes. I hope that by then, the manufacturers will realize that TPS
isn't the only thing that matters. I'm sure SSDs will mature to the
point where we can trust all brands equally (or at least, most brands
- maybe it'll be like "server SSDs" and "desktop SSDs"?), but until
then, there aren't many options.

ChrisA

[toc] | [prev] | [next] | [standalone]

#42336

From	Roy Smith <roy@panix.com>
Date	2013-03-30 10:19 -0400
Message-ID	<roy-8B23B4.10192430032013@news.panix.com>
In reply to	#42317

In article <mailman.3988.1364617294.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> SSDs that lie about fsync (and some hard disks lie too, as do some
> operating systems and some file system drivers, but - under Linux at
> least - it's possible to guarantee the OS and FS parts) can violate
> both halves. Anything might have been written, anything might have
> been missed. I did some tests with PostgreSQL on an SSD, and the
> results were seriously scary.

What you say is true, but then again, it's true of other things too.  
We're running in a completely virtual environment.  Our "real" disk 
storage is EBS.  Somewhere out on the network there's something which 
pretends to be a disk.  Does "written" really mean "the bits have been 
committed to a piece of physical spinning iron oxide"?  I have no clue.

But, this conversation is getting very far away from Python.

[toc] | [prev] | [next] | [standalone]

#42305

From	rusi <rustompmody@gmail.com>
Date	2013-03-29 20:13 -0700
Message-ID	<a6438dda-cc5f-4898-98c8-a0e0d754418f@f5g2000pbs.googlegroups.com>
In reply to	#42302

On Mar 30, 7:49 am, Chris Angelico <ros...@gmail.com> wrote:
> On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <r...@panix.com> wrote:
> > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
> > seconds.  When I switch to "values", I'm getting more like 1000 rows in
> > 100 ms!
>
> > A truly breathtaking bug.
>
> *facepalm*
>
> Doubly facepalm because a regex could easily have tested for mixed case.
>
> Especially facepalm because there's some way to do this that's faster
> than straight INSERT statements, and it's not clearly documented as
> "hey, guys, if you want to dump loads of data in, use COPY instead"
> (it might be that, I don't know, but usually COPY isn't directly
> transliterable with INSERT).
>
> I agree. Breathtaking.
>
> ChrisA

I recently heard this:
A phone company needed to send out bulk-smses to its customers. It was
of the order of millions.
A (noob?) python programmer was assigned the task and used django with
whatever is the django orm.
It took of the order of weeks to send out the smses.
A python expert was called in.  He threw out the python and redid it
in SQL.
It was done in minutes.

[toc] | [prev] | [next] | [standalone]

#42307

From	rusi <rustompmody@gmail.com>
Date	2013-03-29 20:15 -0700
Message-ID	<1568eb91-0f1f-44f8-b3ea-17747026694a@u5g2000pbs.googlegroups.com>
In reply to	#42305

On Mar 30, 8:13 am, rusi <rustompm...@gmail.com> wrote:

> It took of the order of weeks to send out the smses.

'Week' I think is more accurate.

[toc] | [prev] | [next] | [standalone]

#42312

From	Roy Smith <roy@panix.com>
Date	2013-03-29 23:40 -0400
Message-ID	<roy-56955C.23403629032013@news.panix.com>
In reply to	#42305

In article 
<a6438dda-cc5f-4898-98c8-a0e0d754418f@f5g2000pbs.googlegroups.com>,
 rusi <rustompmody@gmail.com> wrote:

> I recently heard this:
> A phone company needed to send out bulk-smses to its customers. It was
> of the order of millions.
> A (noob?) python programmer was assigned the task and used django with
> whatever is the django orm.
> It took of the order of weeks to send out the smses.
> A python expert was called in.  He threw out the python and redid it
> in SQL.
> It was done in minutes.

I'm not surprised.  It almost certainly wasn't the python that was the 
problem.  More than likely, he was doing some horribly inefficient 
database operations.

Certainly, in our case, performance is all about the database.  We 
mostly can't even measure the time we spend running Python code.

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Re: How to find bad row with db api executemany()?

Contents

#42289 — Re: How to find bad row with db api executemany()?

#42295

#42296

#42299

#42300

#42308

#42309 — Re: How to find bad row with db api executemany()? (PS)

#42311

#42301

#42302

#42304

#42306

#42310

#42314

#42315

#42317

#42336

#42305

#42307

#42312