Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.albasani.net!feeder.erje.net!newsfeed.utanet.at!newscore.univie.ac.at!aconews-feed.univie.ac.at!aconews.univie.ac.at!not-for-mail
From: "Laurenz Albe" <invite@spam.to.invalid>
Newsgroups: comp.databases.postgresql
References: <ito056$ido$1@bruford.hrz.tu-chemnitz.de><1308640710.210659@proxy.dienste.wien.at><pan.2011.06.21.13.38.10@gmail.com><1308738638.232318@proxy.dienste.wien.at> <pan.2011.06.22.12.02.22@gmail.com> <1308911383.819034@proxy.dienste.wien.at> <iuhvrv$5j7$1@speranza.aioe.org>
Subject: Re: Is PostgreSQL good?
Date: Fri, 1 Jul 2011 10:13:12 +0200
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Response
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6109
Organization: dienste.wien.at ISP
Message-ID: <1309508018.14415@proxy.dienste.wien.at>
X-Cache: nntpcache 2.3.3 (see http://www.nntpcache.org/)
Lines: 167
NNTP-Posting-Host: 141.203.254.23
X-Trace: 1309508023 aconews.univie.ac.at 34944 141.203.254.23
X-Complaints-To: abuse@univie.ac.at
Xref: x330-a1.tempe.blueboxinc.net comp.databases.postgresql:174

Don Y wrote:
> [apologies for not having the technical depth to understand these
> issues better -- I just use PostgreSQL as a "means to an end"...
> I could use any similar "means" but this seems to work well enough
> for me]

No worries, but don't take it as rudeness if I give you links for
some of these question instead of answering everything in detail.

> Can someone please, *gently*, explain *quantitatively* the differences
> that these issues make in performance?

Quantifying performance is only possible in a narrowly defined setting
(in my opinion).

> Are hints on a par with the 'register' keyword in C?  I.e.,
> something you *hope* gives the compiler an insight into your
> algorithm -- but, that the compiler is free to *ignore*?
> Are there consequences to using a "hint" (e.g., using 'register'
> places constraints on what you can do)?

Roughly yes. Except that (at least in Oracle) a hint will only
be ignored if it is impossible to use (you cannot force an index
that cannot be used for this query).

> Is the "hint" concept essentially a workaround to compensate for
> capabilities *missing* in the optimizer?  I.e., as C compilers
> get smarter, they deliberately *choose* to ignore the 'register'
> keyword because they actually have *more* insight than the
> programmer!

Yes, it is a workaround for optimizer deficiencies.
The problem here is that optimizing SQL statements is quite difficult
because SQL is a declarative language, i.e. you tell the system what
you want to get and not how to get it. The optimizer will have to
figure out an execution path.

> Are there ways (without using a "hint") to coerce the database
> to perform in the way your hint would *hope* to have it behave?
> (e.g., can you rewrite the SQL to cause the behavior you seek?)

Sometimes, yes. Often a simplified statement will be easier
to optimize. If you know the optimizer well, you can also trick
it into not considering certain optimizations.
The optimizer uses "statistics" which are data about the number
and distribution of data in table columns. A lot of times the
optimizer will perform better if you collect more data.
You can also set database parameters that influence the optimizer.

> What sort of *quantitative* difference can this make (example?)
> that can't, otherwise, be attained?

The effect of choosing a better plan can improve things *a lot*.
If you fetch 10 million rows, only to discard them later because
you join with an empty result set, that's *much* worse than not
fetching anything at all.

>> About vacuum:
>> =============
>>
>> Of course vacuum is a pain, but every database system has to pay a similar
>> price somewhere. Either you reduce concurrency (DB2), or (if you use some kind
>> of MVCC) you can move dead rows somewhere else (Oracle) or leave them where
>> they are and clean up later.
>
> Is this just the difference between automatic garbage collection
> and "manual" garbage collection (again, relying on parallels to
> other "languages")?

No. DB2 does not create garbage, because there is only one version
of each row at any given time. To get that, a writer has to wait
until a reader is done (i.e., the reader will take a "read lock"
on the row he or she is reading).

Oracle and PostgreSQL both create garbage, because they both keep
old versions of rows around to avoid readers blocking writers.
Oracle collects the garbage automatically *while* it executes
SQL statements.

PostgreSQL collects garbage at some later time, either manually
by an administrator running the VACUUM command, or automatically
if the autovacuum daemon is turned on.

> Does the parallel maintain throughout operation or does it fall
> apart at the extremes?  E.g., if garbage collection must be
> manually initiated (in a language), then failing to do so
> can eventually cause you to run out of heap, etc. (which
> causes the application to "fail").  Is the same true when
> it comes to vacuuming -- or, does performance just suffer?
> (i.e., will you ever "fill the disk" with dead rows if you
> fail to initiate a VACUUM operation?)

All of this is only relevant if you have autovacuum turned off
(autovacuum is the recommended way these days).

Without autovacuum and if VACUUM is never run, two things will
happen:
1) Your tables and indexes will grow on every insert and update,
   and deletes and updates will create garbage that cannot be
   reused. That can fill the disk and drastically increase the
   duration of table and index scans.
2) You will at some point run out of transaction IDs, and the
   system will shutdown to prevent data loss.

See
http://www.postgresql.org/docs/current/static/routine-vacuuming.html
for details.

>> ROWID:
>> ======
>>
>> You make the point that because of VACUUM there is nothing compared to
>> Oracle's ROWID. But do you really need that? Shouldn't a row be referenced
>> by its primary key?
>
> The last question seems to be my understanding of how things should
> work. (?)  (actually, one of the harder ideas for me to get accustomed
> to in this field  :< )

Right. It was a rethorical question.

To digress, I always find a library a good analogy for a database.
You have a lot of books there, each has a title and an author.

A primary key is a) an index (like an index in a library) and
b) a rule that says that there may only be one book for every
possible index entry.

Without a primary key, one author could have more than one book with
the same title, and you won't be able to tell them apart (even though
the books themselves may be different).

> How does a ROWID differ from using an OID?  (which I presume is
> not a good thing)

The ROWID is the physical location of the row, basically the offset
in the file. You can sort of just "lseek(3)" there and get it. This
is the fastest way to access a row.

An OID is a number from a system counter that *can* also be used
as a primary key. To some extent, it's just a more old-fashioned
way of what can be done better with a sequence these days.
If used as a primary key, it is no faster than other primary keys.

See the final paragraphs of
http://www.postgresql.org/docs/current/static/ddl-system-columns.html
for details.

>> Fixed size of WAL segments:
>> ===========================
>>
>> Maybe I am dense, but what is the problem there?
>
> <grin>  I don't even know what they *are*!  ;-)

WAL stands for "write ahead log" and is a transaction log.
It is a way to provide durability for committed transactions in case
of a system failure.

See
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
for details.

Yours,
Laurenz Albe