Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.albasani.net!feeder.erje.net!newsfeed.utanet.at!newscore.univie.ac.at!aconews-feed.univie.ac.at!aconews.univie.ac.at!not-for-mail From: "Laurenz Albe" Newsgroups: comp.databases.postgresql References: <1308640710.210659@proxy.dienste.wien.at><1308738638.232318@proxy.dienste.wien.at> <1308911383.819034@proxy.dienste.wien.at> Subject: Re: Is PostgreSQL good? Date: Fri, 1 Jul 2011 10:13:12 +0200 X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 6.00.2900.5931 X-RFC2646: Format=Flowed; Response X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6109 Organization: dienste.wien.at ISP Message-ID: <1309508018.14415@proxy.dienste.wien.at> X-Cache: nntpcache 2.3.3 (see http://www.nntpcache.org/) Lines: 167 NNTP-Posting-Host: 141.203.254.23 X-Trace: 1309508023 aconews.univie.ac.at 34944 141.203.254.23 X-Complaints-To: abuse@univie.ac.at Xref: x330-a1.tempe.blueboxinc.net comp.databases.postgresql:174 Don Y wrote: > [apologies for not having the technical depth to understand these > issues better -- I just use PostgreSQL as a "means to an end"... > I could use any similar "means" but this seems to work well enough > for me] No worries, but don't take it as rudeness if I give you links for some of these question instead of answering everything in detail. > Can someone please, *gently*, explain *quantitatively* the differences > that these issues make in performance? Quantifying performance is only possible in a narrowly defined setting (in my opinion). > Are hints on a par with the 'register' keyword in C? I.e., > something you *hope* gives the compiler an insight into your > algorithm -- but, that the compiler is free to *ignore*? > Are there consequences to using a "hint" (e.g., using 'register' > places constraints on what you can do)? Roughly yes. Except that (at least in Oracle) a hint will only be ignored if it is impossible to use (you cannot force an index that cannot be used for this query). > Is the "hint" concept essentially a workaround to compensate for > capabilities *missing* in the optimizer? I.e., as C compilers > get smarter, they deliberately *choose* to ignore the 'register' > keyword because they actually have *more* insight than the > programmer! Yes, it is a workaround for optimizer deficiencies. The problem here is that optimizing SQL statements is quite difficult because SQL is a declarative language, i.e. you tell the system what you want to get and not how to get it. The optimizer will have to figure out an execution path. > Are there ways (without using a "hint") to coerce the database > to perform in the way your hint would *hope* to have it behave? > (e.g., can you rewrite the SQL to cause the behavior you seek?) Sometimes, yes. Often a simplified statement will be easier to optimize. If you know the optimizer well, you can also trick it into not considering certain optimizations. The optimizer uses "statistics" which are data about the number and distribution of data in table columns. A lot of times the optimizer will perform better if you collect more data. You can also set database parameters that influence the optimizer. > What sort of *quantitative* difference can this make (example?) > that can't, otherwise, be attained? The effect of choosing a better plan can improve things *a lot*. If you fetch 10 million rows, only to discard them later because you join with an empty result set, that's *much* worse than not fetching anything at all. >> About vacuum: >> ============= >> >> Of course vacuum is a pain, but every database system has to pay a similar >> price somewhere. Either you reduce concurrency (DB2), or (if you use some kind >> of MVCC) you can move dead rows somewhere else (Oracle) or leave them where >> they are and clean up later. > > Is this just the difference between automatic garbage collection > and "manual" garbage collection (again, relying on parallels to > other "languages")? No. DB2 does not create garbage, because there is only one version of each row at any given time. To get that, a writer has to wait until a reader is done (i.e., the reader will take a "read lock" on the row he or she is reading). Oracle and PostgreSQL both create garbage, because they both keep old versions of rows around to avoid readers blocking writers. Oracle collects the garbage automatically *while* it executes SQL statements. PostgreSQL collects garbage at some later time, either manually by an administrator running the VACUUM command, or automatically if the autovacuum daemon is turned on. > Does the parallel maintain throughout operation or does it fall > apart at the extremes? E.g., if garbage collection must be > manually initiated (in a language), then failing to do so > can eventually cause you to run out of heap, etc. (which > causes the application to "fail"). Is the same true when > it comes to vacuuming -- or, does performance just suffer? > (i.e., will you ever "fill the disk" with dead rows if you > fail to initiate a VACUUM operation?) All of this is only relevant if you have autovacuum turned off (autovacuum is the recommended way these days). Without autovacuum and if VACUUM is never run, two things will happen: 1) Your tables and indexes will grow on every insert and update, and deletes and updates will create garbage that cannot be reused. That can fill the disk and drastically increase the duration of table and index scans. 2) You will at some point run out of transaction IDs, and the system will shutdown to prevent data loss. See http://www.postgresql.org/docs/current/static/routine-vacuuming.html for details. >> ROWID: >> ====== >> >> You make the point that because of VACUUM there is nothing compared to >> Oracle's ROWID. But do you really need that? Shouldn't a row be referenced >> by its primary key? > > The last question seems to be my understanding of how things should > work. (?) (actually, one of the harder ideas for me to get accustomed > to in this field :< ) Right. It was a rethorical question. To digress, I always find a library a good analogy for a database. You have a lot of books there, each has a title and an author. A primary key is a) an index (like an index in a library) and b) a rule that says that there may only be one book for every possible index entry. Without a primary key, one author could have more than one book with the same title, and you won't be able to tell them apart (even though the books themselves may be different). > How does a ROWID differ from using an OID? (which I presume is > not a good thing) The ROWID is the physical location of the row, basically the offset in the file. You can sort of just "lseek(3)" there and get it. This is the fastest way to access a row. An OID is a number from a system counter that *can* also be used as a primary key. To some extent, it's just a more old-fashioned way of what can be done better with a sequence these days. If used as a primary key, it is no faster than other primary keys. See the final paragraphs of http://www.postgresql.org/docs/current/static/ddl-system-columns.html for details. >> Fixed size of WAL segments: >> =========================== >> >> Maybe I am dense, but what is the problem there? > > I don't even know what they *are*! ;-) WAL stands for "write ahead log" and is a transaction log. It is a way to provide durability for committed transactions in case of a system failure. See http://www.postgresql.org/docs/9.0/static/continuous-archiving.html for details. Yours, Laurenz Albe