Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #42289 > unrolled thread
| Started by | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| First post | 2013-03-29 18:25 -0400 |
| Last post | 2013-03-30 06:38 -0500 |
| Articles | 20 on this page of 24 — 5 participants |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 18:25 -0400
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 20:41 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 11:57 +1100
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 21:19 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 13:05 +1100
Re: How to find bad row with db api executemany()? Tim Chase <python.list@tim.thechases.com> - 2013-03-29 22:17 -0500
Re: How to find bad row with db api executemany()? (PS) Tim Chase <python.list@tim.thechases.com> - 2013-03-29 22:38 -0500
Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 23:38 -0400
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 22:44 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 13:49 +1100
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:09 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 14:14 +1100
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:36 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 14:57 +1100
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-30 00:10 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 15:21 +1100
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-30 10:19 -0400
Re: How to find bad row with db api executemany()? rusi <rustompmody@gmail.com> - 2013-03-29 20:13 -0700
Re: How to find bad row with db api executemany()? rusi <rustompmody@gmail.com> - 2013-03-29 20:15 -0700
Re: How to find bad row with db api executemany()? Roy Smith <roy@panix.com> - 2013-03-29 23:40 -0400
Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-29 23:53 -0400
Re: How to find bad row with db api executemany()? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-03-30 00:19 -0400
Re: How to find bad row with db api executemany()? Chris Angelico <rosuav@gmail.com> - 2013-03-30 15:24 +1100
Re: How to find bad row with db api executemany()? Tim Chase <python.list@tim.thechases.com> - 2013-03-30 06:38 -0500
Page 1 of 2 [1] 2 Next page →
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-03-29 18:25 -0400 |
| Subject | Re: How to find bad row with db api executemany()? |
| Message-ID | <mailman.3971.1364595940.2939.python-list@python.org> |
On Fri, 29 Mar 2013 14:53:30 -0400, Dave Angel <davea@davea.name>
declaimed the following in gmane.comp.python.general:
> On 03/29/2013 10:48 AM, Roy Smith wrote:
> > I'm inserting a gazillion rows into a MySQL database using MySQLdb and cursor.executemany() for efficiency. Every once in a while, I get a row which violates some kind of database constraint and raises Error.
> >
> > I can catch the exception, but don't see any way to tell which row caused the problem. Is this information obtainable, short of retrying each row one by one?
> >
>
> I don't know the direct answer, or even if there is one (way to get
> MySQL to tell you which one failed), but ...
>
> Assuming that executeMany is much cheaper than a million calls to
> executeOne (or whatever).
If using MySQLdb, there isn't all that much difference... MySQLdb is
still compatible with MySQL v4 (and maybe even v3), and since those
versions don't have "prepared statements", .executemany() essentially
turns into something that creates a newline delimited "list" of
"identical" (but for argument substitution) statements and submits that
to MySQL.
Just look at the source code (cursors.py in MySQLdb package).
Hmmm... That gives a rally point. If it IS submitting one massive
string containing all the data copies, could the failure be coming from
submitting something too big for the client/server communication
channel?
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
[toc] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 20:41 -0400 |
| Message-ID | <roy-A61512.20410329032013@news.panix.com> |
| In reply to | #42289 |
In article <mailman.3971.1364595940.2939.python-list@python.org>,
Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
> If using MySQLdb, there isn't all that much difference... MySQLdb is
> still compatible with MySQL v4 (and maybe even v3), and since those
> versions don't have "prepared statements", .executemany() essentially
> turns into something that creates a newline delimited "list" of
> "identical" (but for argument substitution) statements and submits that
> to MySQL.
Shockingly, that does appear to be the case. I had thought during my
initial testing that I was seeing far greater throughput, but as I got
more into the project and started doing some side-by-side comparisons,
it the differences went away.
We're sucking in a pretty huge amount of data. The source document is a
7 GB gzipped XML file. I'm not sure how big it is uncompressed (we use
gzip.GzipFile to uncompress on the fly) but I'm guessing something like
a 30x compression ratio so 200 GB? The last time we imported the whole
set, it ran for 21 days!
It turns out, the problems we were seeing were all inserts into a new
table we added. Apparently, the default charset is latin-1 and we
didn't notice that when we altered the schema! Once I noticed that all
the other tables were utf-8 and changed this one to be that, the
problems went away.
Sadly, I ended up resorting to a truly ugly hack to diagnose the
problem. I catch the exception and parse the text message. Yuck.
try:
self.executemany(self.sql_statement, self.sql_params)
except MySQLdb.Error as ex:
code, message = ex.args
m = re.search(r".* at row (\d+)$", message)
if m:
i = int(m.group(1)) - 1 # Python is 0-index, SQL, 1-index
The other truly horrible part of the project was when I decided it was
bad for my import script to have too much schema knowledge hard-wired
in. So, I decided to use SQLAlchemy to introspect the database and
discover the column names, types, and defaults. It turns out, if an
integer column has a default (say, 0), the introspected data comes back
with the default as the string, '0'. WTF???
Does Postgress's Python adapter handle executemany() in a sane way?
We're not wedded to MySQL in any way. We use it for exactly this one
process. We get these XML dumps from a supplier's SQL-Server database.
We stage the data in MySQL, then export what we need into MongoDB. We
could easily swap out the MySQL staging for Postgress if that worked
better.
Hmmm, we do take advantage of REPLACE INTO, which I think is a
non-standard MySQL addition. Not sure if Postgress supports that.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 11:57 +1100 |
| Message-ID | <mailman.3977.1364605026.2939.python-list@python.org> |
| In reply to | #42295 |
On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote: > In article <mailman.3971.1364595940.2939.python-list@python.org>, > Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote: > >> If using MySQLdb, there isn't all that much difference... MySQLdb is >> still compatible with MySQL v4 (and maybe even v3), and since those >> versions don't have "prepared statements", .executemany() essentially >> turns into something that creates a newline delimited "list" of >> "identical" (but for argument substitution) statements and submits that >> to MySQL. > > Shockingly, that does appear to be the case. I had thought during my > initial testing that I was seeing far greater throughput, but as I got > more into the project and started doing some side-by-side comparisons, > it the differences went away. How much are you doing per transaction? The two extremes (everything in one transaction, or each line in its own transaction) are probably the worst for performance. See what happens if you pepper the code with 'begin' and 'commit' statements (maybe every thousand or ten thousand rows) to see if performance improves. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 21:19 -0400 |
| Message-ID | <roy-AE1A29.21192229032013@news.panix.com> |
| In reply to | #42296 |
In article <mailman.3977.1364605026.2939.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote: > On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote: > > In article <mailman.3971.1364595940.2939.python-list@python.org>, > > Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote: > > > >> If using MySQLdb, there isn't all that much difference... MySQLdb is > >> still compatible with MySQL v4 (and maybe even v3), and since those > >> versions don't have "prepared statements", .executemany() essentially > >> turns into something that creates a newline delimited "list" of > >> "identical" (but for argument substitution) statements and submits that > >> to MySQL. > > > > Shockingly, that does appear to be the case. I had thought during my > > initial testing that I was seeing far greater throughput, but as I got > > more into the project and started doing some side-by-side comparisons, > > it the differences went away. > > How much are you doing per transaction? The two extremes (everything > in one transaction, or each line in its own transaction) are probably > the worst for performance. See what happens if you pepper the code > with 'begin' and 'commit' statements (maybe every thousand or ten > thousand rows) to see if performance improves. > > ChrisA We're doing it all in one transaction, on purpose. We start with an initial dump, then get updates about once a day. We want to make sure that the updates either complete without errors, or back out cleanly. If we ever had a partial daily update, the result would be a mess. Hmmm, on the other hand, I could probably try doing the initial dump the way you describe. If it fails, we can just delete the whole thing and start again.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 13:05 +1100 |
| Message-ID | <mailman.3979.1364609118.2939.python-list@python.org> |
| In reply to | #42299 |
On Sat, Mar 30, 2013 at 12:19 PM, Roy Smith <roy@panix.com> wrote: > In article <mailman.3977.1364605026.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <roy@panix.com> wrote: >> > In article <mailman.3971.1364595940.2939.python-list@python.org>, >> > Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote: >> > >> >> If using MySQLdb, there isn't all that much difference... MySQLdb is >> >> still compatible with MySQL v4 (and maybe even v3), and since those >> >> versions don't have "prepared statements", .executemany() essentially >> >> turns into something that creates a newline delimited "list" of >> >> "identical" (but for argument substitution) statements and submits that >> >> to MySQL. >> > >> > Shockingly, that does appear to be the case. I had thought during my >> > initial testing that I was seeing far greater throughput, but as I got >> > more into the project and started doing some side-by-side comparisons, >> > it the differences went away. >> >> How much are you doing per transaction? The two extremes (everything >> in one transaction, or each line in its own transaction) are probably >> the worst for performance. See what happens if you pepper the code >> with 'begin' and 'commit' statements (maybe every thousand or ten >> thousand rows) to see if performance improves. >> >> ChrisA > > We're doing it all in one transaction, on purpose. We start with an > initial dump, then get updates about once a day. We want to make sure > that the updates either complete without errors, or back out cleanly. > If we ever had a partial daily update, the result would be a mess. > > Hmmm, on the other hand, I could probably try doing the initial dump the > way you describe. If it fails, we can just delete the whole thing and > start again. One transaction for the lot isn't nearly as bad as one transaction per row, but it can consume a lot of memory on the server - or at least, that's what I found last time I worked with MySQL. (PostgreSQL works completely differently, and I'd strongly recommend doing it all as one transaction if you switch.) It's not guaranteed to help, but if it won't hurt to try, there's a chance you'll gain some performance. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-03-29 22:17 -0500 |
| Message-ID | <mailman.3982.1364613376.2939.python-list@python.org> |
| In reply to | #42299 |
On 2013-03-29 21:19, Roy Smith wrote: > We're doing it all in one transaction, on purpose. We start with > an initial dump, then get updates about once a day. We want to > make sure that the updates either complete without errors, or back > out cleanly. If we ever had a partial daily update, the result > would be a mess. Having had to do some similarly-sized bulk data loads (in my case, MSSqlServer at $JOB) couple other ideas occur to me: 1) I believe MySQL has a side-loading function (I'd have to go digging for it; a quick google suggests a "LOAD DATA INFILE" statement[1]) that allows you to load data from an external file such as an XML or CSV file 2) Load into a temp table in testable batches, then do some sort of batch insert into your main table. Again, a quick google suggest the "INSERT ... SELECT" syntax[2] -tkc [1] http://dev.mysql.com/doc/refman/5.1/en/load-data.html [2] http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-03-29 22:38 -0500 |
| Subject | Re: How to find bad row with db api executemany()? (PS) |
| Message-ID | <mailman.3983.1364614588.2939.python-list@python.org> |
| In reply to | #42299 |
On 2013-03-29 22:17, Tim Chase wrote: > 2) Load into a temp table in testable batches, then do some sort of > batch insert into your main table. Again, a quick google suggest > the "INSERT ... SELECT" syntax[2] It looks like there's a corresponding "REPLACE INTO ... SELECT" syntax[1], as you mention doing a REPLACE INTO rather than a straight INSERT -tkc [1] http://dev.mysql.com/doc/refman/5.0/en/replace.html
[toc] | [prev] | [next] | [standalone]
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-03-29 23:38 -0400 |
| Message-ID | <mailman.3984.1364614714.2939.python-list@python.org> |
| In reply to | #42299 |
On Fri, 29 Mar 2013 21:19:22 -0400, Roy Smith <roy@panix.com> declaimed
the following in gmane.comp.python.general:
> In article <mailman.3977.1364605026.2939.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
> >
> > How much are you doing per transaction? The two extremes (everything
> > in one transaction, or each line in its own transaction) are probably
> > the worst for performance. See what happens if you pepper the code
> > with 'begin' and 'commit' statements (maybe every thousand or ten
> > thousand rows) to see if performance improves.
> >
> > ChrisA
>
> We're doing it all in one transaction, on purpose. We start with an
> initial dump, then get updates about once a day. We want to make sure
> that the updates either complete without errors, or back out cleanly.
> If we ever had a partial daily update, the result would be a mess.
>
As I recall, DB-API compliance should have started a transaction
when you submit a statement, so explicit "begin" should not be required.
My suggestion would be to first try just breaking the .executemany()
into chunks rather than everything at once. Submit maybe 500-1000
records at a time via .executemany(), checking for errors -- if a block
errors you can still rollback() the entire transaction AND you have a
smaller set of data to examine. Then at the end, if no errors, do the
commit()
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 22:44 -0400 |
| Message-ID | <roy-1FF146.22445329032013@news.panix.com> |
| In reply to | #42295 |
In article <roy-A61512.20410329032013@news.panix.com>,
Roy Smith <roy@panix.com> wrote:
> In article <mailman.3971.1364595940.2939.python-list@python.org>,
> Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
> > If using MySQLdb, there isn't all that much difference... MySQLdb is
> > still compatible with MySQL v4 (and maybe even v3), and since those
> > versions don't have "prepared statements", .executemany() essentially
> > turns into something that creates a newline delimited "list" of
> > "identical" (but for argument substitution) statements and submits that
> > to MySQL.
>
> Shockingly, that does appear to be the case. I had thought during my
> initial testing that I was seeing far greater throughput, but as I got
> more into the project and started doing some side-by-side comparisons,
> it the differences went away.
OMG, this is amazing.
http://stackoverflow.com/questions/3945642/
It turns out, the MySQLdb executemany() runs a regex over your SQL and
picks one of two algorithms depending on whether it matches or not.
restr = (r"\svalues\s*"
r"(\(((?<!\\)'[^\)]*?\)[^\)]*(?<!\\)?'"
r"|[^\(\)]|"
r"(?:\([^\)]*\))"
r")+\))")
Leaving aside the obvious line-noise aspects, the operative problem here
is that it only looks for "values" (in lower case).
I've lost my initial test script which convinced me that executemany()
would be a win; I'm assuming I used lower case for that. Our production
code uses "VALUES".
The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
seconds. When I switch to "values", I'm getting more like 1000 rows in
100 ms!
A truly breathtaking bug.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 13:49 +1100 |
| Message-ID | <mailman.3980.1364611804.2939.python-list@python.org> |
| In reply to | #42301 |
On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <roy@panix.com> wrote: > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4 > seconds. When I switch to "values", I'm getting more like 1000 rows in > 100 ms! > > A truly breathtaking bug. *facepalm* Doubly facepalm because a regex could easily have tested for mixed case. Especially facepalm because there's some way to do this that's faster than straight INSERT statements, and it's not clearly documented as "hey, guys, if you want to dump loads of data in, use COPY instead" (it might be that, I don't know, but usually COPY isn't directly transliterable with INSERT). I agree. Breathtaking. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 23:09 -0400 |
| Message-ID | <roy-8D4A5A.23093729032013@news.panix.com> |
| In reply to | #42302 |
In article <mailman.3980.1364611804.2939.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote: > Especially facepalm because there's some way to do this that's faster > than straight INSERT statements, and it's not clearly documented as > "hey, guys, if you want to dump loads of data in, use COPY instead" > (it might be that, I don't know, but usually COPY isn't directly > transliterable with INSERT). We're actually using REPLACE INTO. For the initial data load, we could just as well do INSERT, but we need the REPLACE functionality as we roll in the daily incremental updates. This also explains why, even after provisioning our RDS instance for 2000 IOPS (that's AWS-speak for "we paid extra to get more disk bandwidth"), we didn't see any performance improvement!
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 14:14 +1100 |
| Message-ID | <mailman.3981.1364613280.2939.python-list@python.org> |
| In reply to | #42304 |
On Sat, Mar 30, 2013 at 2:09 PM, Roy Smith <roy@panix.com> wrote: > In article <mailman.3980.1364611804.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> Especially facepalm because there's some way to do this that's faster >> than straight INSERT statements, and it's not clearly documented as >> "hey, guys, if you want to dump loads of data in, use COPY instead" >> (it might be that, I don't know, but usually COPY isn't directly >> transliterable with INSERT). > > We're actually using REPLACE INTO. For the initial data load, we could > just as well do INSERT, but we need the REPLACE functionality as we roll > in the daily incremental updates. > > This also explains why, even after provisioning our RDS instance for > 2000 IOPS (that's AWS-speak for "we paid extra to get more disk > bandwidth"), we didn't see any performance improvement! Hmm. I heard around the forums that Amazon weren't that great at disk bandwidth anyway, and that provisioning IO was often a waste of money. But we never did all that much much research on Amazon I/O performance; shortly after doing some basic benchmarking, we decided that the cloud was a poor fit for our system model, and went looking at dedicated servers with their own RAID storage right there on the bus. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 23:36 -0400 |
| Message-ID | <roy-211B1A.23365429032013@news.panix.com> |
| In reply to | #42306 |
In article <mailman.3981.1364613280.2939.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote: > Hmm. I heard around the forums that Amazon weren't that great at disk > bandwidth anyway, and that provisioning IO was often a waste of money. Au, contraire. I guess it all depends on what you're doing. If you're CPU bound, increasing your I/O bandwidth won't help. But, at least on our database (MongoDB) servers, we saw a huge performance boost when we started going for provisioned IO. > But we never did all that much much research on Amazon I/O > performance; shortly after doing some basic benchmarking, we decided > that the cloud was a poor fit for our system model, and went looking > at dedicated servers with their own RAID storage right there on the > bus. As far as I can tell, from a raw price/performance basis, they're pretty expensive. But, from a convenience standpoint, it's hard to beat. Case in point: We've been thinking about SSD as our next performance step-up. One day, we just spun up some big honking machine, configured it with 2 TB of SSD, and played around for a while. Wicked fast. Then we shut it down. That experiment probably cost us $10 or so, and we were able to run it on the spur of the moment. Another example was last summer when we had a huge traffic spike because of a new product release. Caught us by surprise how much new traffic it would generate. Our site was in total meltdown. We were able to spin up 10 new servers in an afternoon. If we had to go out and buy hardware, have it shipped to us, figure out where we had rack space, power, network capacity, cooling, etc, we'd have been out of business before we got back on the air. Yet another example. We just (as in, while I've been typing this) had one of our servers go down. Looks like the underlying hardware the VM was running on croaked, because when the instance came back up, it had a new IP address. The whole event was over in a couple of minutes, with only minor disruption to the service. And, presumably, there's some piece of hardware somewhere in Virginia that needs repairing, but that's not our problem. The really big boys (Google, Facebook) run their own data centers. But, some surprisingly large operations run out of AWS. Netflix, for example. The convenience and flexibility is worth a lot.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 14:57 +1100 |
| Message-ID | <mailman.3986.1364615879.2939.python-list@python.org> |
| In reply to | #42310 |
On Sat, Mar 30, 2013 at 2:36 PM, Roy Smith <roy@panix.com> wrote: > In article <mailman.3981.1364613280.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> Hmm. I heard around the forums that Amazon weren't that great at disk >> bandwidth anyway, and that provisioning IO was often a waste of money. > > Au, contraire. I guess it all depends on what you're doing. If you're > CPU bound, increasing your I/O bandwidth won't help. But, at least on > our database (MongoDB) servers, we saw a huge performance boost when we > started going for provisioned IO. > > As far as I can tell, from a raw price/performance basis, they're pretty > expensive. But, from a convenience standpoint, it's hard to beat. Yeah, I'm not saying you won't see performance go up - just that it's going to be expensive compared to what you could do with a dedicated server. The flexibility costs. > Case in point: We've been thinking about SSD as our next performance > step-up. One day, we just spun up some big honking machine, configured > it with 2 TB of SSD, and played around for a while. Wicked fast. Then > we shut it down. That experiment probably cost us $10 or so, and we > were able to run it on the spur of the moment. That is one thing the cloud is *awesome* for. "Hmm, I wonder......" *half an hour later* "Now I know." *bill comes in* "That was cheap." > Another example was last summer when we had a huge traffic spike because > of a new product release. Caught us by surprise how much new traffic it > would generate. Our site was in total meltdown. We were able to spin > up 10 new servers in an afternoon. If we had to go out and buy > hardware, have it shipped to us, figure out where we had rack space, > power, network capacity, cooling, etc, we'd have been out of business > before we got back on the air. Yep. We're looking rather at server rental, from big honking data centers (by the way, I never managed to figure this out - at what size do things begin to honk? Larger than a bread box?), where hopefully they would be able to deploy us more servers within a matter of hours. Not as good as cloud (where you can have more servers in minutes), but you described "in an afternoon" as your success story, so I'm guessing most of the delay was an administrative one - the decision to actually go ahead and spin those servers up. Unless you're automating the whole thing (which is possible with the Amazon cloud, but not every client will do it), that's always going to cost you. > Yet another example. We just (as in, while I've been typing this) had > one of our servers go down. Looks like the underlying hardware the VM > was running on croaked, because when the instance came back up, it had a > new IP address. The whole event was over in a couple of minutes, with > only minor disruption to the service. And, presumably, there's some > piece of hardware somewhere in Virginia that needs repairing, but that's > not our problem. Yeah, that's also a concern. And that works beautifully as long as you're okay with that. It'll also happen more often with AWS than with dedicated hardware, because there are more components to fail. That's part of what wouldn't have fitted our server layout; we have a layer on top of all that to manage monitoring (and, incidentally, we set up a dozen home-grade laptops as a "server cluster" to test all those systems, and knew within minutes of one of the laptops deciding that it had crunched its last bits), and we really want stable IP addresses for DNS and such. Cloud isn't bad, it was just a bad fit for us. Side point: You mentioned SSDs. Are you aware of the fundamental risks associated with them? Only a handful of SSD models are actually trustworthy for databasing. Sure, they're fast, but can you afford data corruption in the event of a power outage? Most SSDs will cheerfully lie about fsync() and thus violate transactional integrity (ACID compliance etc). ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-30 00:10 -0400 |
| Message-ID | <roy-0B9881.00100930032013@news.panix.com> |
| In reply to | #42314 |
In article <mailman.3986.1364615879.2939.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote: > Side point: You mentioned SSDs. Are you aware of the fundamental risks > associated with them? Only a handful of SSD models are actually > trustworthy for databasing. We haven't decided if we're going that route yet, but if we do, we will probably use do RAID SSD for added reliability. We also have all our database servers in failover clusters, so we get added reliability that way too. But, we have some runway left with more conventional technologies, so we don't need to decide for a while. Ultimately, however, as reliability goes up and cost comes down, it's hard to imagine the SSD isn't going to be a big part of our lives at some point.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-03-30 15:21 +1100 |
| Message-ID | <mailman.3988.1364617294.2939.python-list@python.org> |
| In reply to | #42315 |
On Sat, Mar 30, 2013 at 3:10 PM, Roy Smith <roy@panix.com> wrote: > In article <mailman.3986.1364615879.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> Side point: You mentioned SSDs. Are you aware of the fundamental risks >> associated with them? Only a handful of SSD models are actually >> trustworthy for databasing. > > We haven't decided if we're going that route yet, but if we do, we will > probably use do RAID SSD for added reliability. We also have all our > database servers in failover clusters, so we get added reliability that > way too. But will you know if you have corruption? Normally, transactional integrity means: 1) If a transaction, from begin to commit, is not completely applied, then it is completely not-applied; and 2) If a transaction that was affected by this one has been applied, then so has this one. SSDs that lie about fsync (and some hard disks lie too, as do some operating systems and some file system drivers, but - under Linux at least - it's possible to guarantee the OS and FS parts) can violate both halves. Anything might have been written, anything might have been missed. I did some tests with PostgreSQL on an SSD, and the results were seriously scary. > But, we have some runway left with more conventional technologies, so we > don't need to decide for a while. Ultimately, however, as reliability > goes up and cost comes down, it's hard to imagine the SSD isn't going to > be a big part of our lives at some point. Yes. I hope that by then, the manufacturers will realize that TPS isn't the only thing that matters. I'm sure SSDs will mature to the point where we can trust all brands equally (or at least, most brands - maybe it'll be like "server SSDs" and "desktop SSDs"?), but until then, there aren't many options. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-30 10:19 -0400 |
| Message-ID | <roy-8B23B4.10192430032013@news.panix.com> |
| In reply to | #42317 |
In article <mailman.3988.1364617294.2939.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote: > SSDs that lie about fsync (and some hard disks lie too, as do some > operating systems and some file system drivers, but - under Linux at > least - it's possible to guarantee the OS and FS parts) can violate > both halves. Anything might have been written, anything might have > been missed. I did some tests with PostgreSQL on an SSD, and the > results were seriously scary. What you say is true, but then again, it's true of other things too. We're running in a completely virtual environment. Our "real" disk storage is EBS. Somewhere out on the network there's something which pretends to be a disk. Does "written" really mean "the bits have been committed to a piece of physical spinning iron oxide"? I have no clue. But, this conversation is getting very far away from Python.
[toc] | [prev] | [next] | [standalone]
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2013-03-29 20:13 -0700 |
| Message-ID | <a6438dda-cc5f-4898-98c8-a0e0d754418f@f5g2000pbs.googlegroups.com> |
| In reply to | #42302 |
On Mar 30, 7:49 am, Chris Angelico <ros...@gmail.com> wrote: > On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <r...@panix.com> wrote: > > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4 > > seconds. When I switch to "values", I'm getting more like 1000 rows in > > 100 ms! > > > A truly breathtaking bug. > > *facepalm* > > Doubly facepalm because a regex could easily have tested for mixed case. > > Especially facepalm because there's some way to do this that's faster > than straight INSERT statements, and it's not clearly documented as > "hey, guys, if you want to dump loads of data in, use COPY instead" > (it might be that, I don't know, but usually COPY isn't directly > transliterable with INSERT). > > I agree. Breathtaking. > > ChrisA I recently heard this: A phone company needed to send out bulk-smses to its customers. It was of the order of millions. A (noob?) python programmer was assigned the task and used django with whatever is the django orm. It took of the order of weeks to send out the smses. A python expert was called in. He threw out the python and redid it in SQL. It was done in minutes.
[toc] | [prev] | [next] | [standalone]
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2013-03-29 20:15 -0700 |
| Message-ID | <1568eb91-0f1f-44f8-b3ea-17747026694a@u5g2000pbs.googlegroups.com> |
| In reply to | #42305 |
On Mar 30, 8:13 am, rusi <rustompm...@gmail.com> wrote: > It took of the order of weeks to send out the smses. 'Week' I think is more accurate.
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-03-29 23:40 -0400 |
| Message-ID | <roy-56955C.23403629032013@news.panix.com> |
| In reply to | #42305 |
In article <a6438dda-cc5f-4898-98c8-a0e0d754418f@f5g2000pbs.googlegroups.com>, rusi <rustompmody@gmail.com> wrote: > I recently heard this: > A phone company needed to send out bulk-smses to its customers. It was > of the order of millions. > A (noob?) python programmer was assigned the task and used django with > whatever is the django orm. > It took of the order of weeks to send out the smses. > A python expert was called in. He threw out the python and redid it > in SQL. > It was done in minutes. I'm not surprised. It almost certainly wasn't the python that was the problem. More than likely, he was doing some horribly inefficient database operations. Certainly, in our case, performance is all about the database. We mostly can't even measure the time we spend running Python code.
[toc] | [prev] | [next] | [standalone]
Page 1 of 2 [1] 2 Next page →
Back to top | Article view | comp.lang.python
csiph-web