Groups > comp.databases.mysql > #1389 > unrolled thread

MySQL vs Postgres for a "data warehouse", 5-10 TB

Started by	Ignoramus5263 <ignoramus5263@NOSPAM.5263.invalid>
First post	2011-09-10 00:14 -0500
Last post	2011-09-23 13:01 +0200
Articles	13 — 11 participants

Back to article view | Back to comp.databases.mysql

  MySQL vs Postgres for a "data warehouse", 5-10 TB Ignoramus5263 <ignoramus5263@NOSPAM.5263.invalid> - 2011-09-10 00:14 -0500
    Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Michael Vilain <vilain@NOspamcop.net> - 2011-09-09 23:30 -0700
      Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Luuk <Luuk@invalid.lan> - 2011-09-10 10:13 +0200
      Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid> - 2011-09-10 09:34 -0500
        Re: MySQL vs Postgres for a "data warehouse", 5-10 TB John Levine <johnl@iecc.com> - 2011-09-10 19:47 +0000
          Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid> - 2011-09-10 16:02 -0500
            Re: MySQL vs Postgres for a "data warehouse", 5-10 TB The Natural Philosopher <tnp@invalid.invalid> - 2011-09-11 15:40 +0100
            Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Axel Schwenke <axel.schwenke@gmx.de> - 2011-09-11 18:11 +0200
              Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Ignoramus21330 <ignoramus21330@NOSPAM.21330.invalid> - 2011-09-11 12:07 -0500
        Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Norman Peelman <npeelmandog@cfl.rr.com> - 2011-09-10 19:06 -0400
    Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Matthew Woodcraft <mattheww@chiark.greenend.org.uk> - 2011-09-11 10:24 +0100
      Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Ignoramus21330 <ignoramus21330@NOSPAM.21330.invalid> - 2011-09-11 07:35 -0500
    Re: MySQL vs Postgres for a "data warehouse", 5-10 TB Bodo <bodo@nomail.org.invalid> - 2011-09-23 13:01 +0200

#1389 — MySQL vs Postgres for a "data warehouse", 5-10 TB

From	Ignoramus5263 <ignoramus5263@NOSPAM.5263.invalid>
Date	2011-09-10 00:14 -0500
Subject	MySQL vs Postgres for a "data warehouse", 5-10 TB
Message-ID	<H9idndjaqY3Xb_fTnZ2dnUVZ_sGdnZ2d@giganews.com>

First of all, this is not an attempt to create a flamewar of any
kind. I have been a MySQL user for years, including owning a few
multi-gigabyte databases for my websites, and using it to host
algebra.com (about 12 GB database). 

I have had my ups and downs with MySQL. The ups were ease of use and
decent performance for small databases such as algebra.com. The downs
were things like twenty hour REPAIR TABLE operations on a 35 GB
table, etc.

Right now I have a personal (one user) project to create a 5-10
Terabyte data warehouse. The largest table will consume the most space
and will take, perhaps, 200,000,000 rows. 

I want to use it to obtain valuable business intelligence and to make
money.

I expect it to grow, never shrink, and to be accessed via batch
queries. I do not care for batch queries to be super fast, for example
an hour per query would be just fine. 

However, while an hour is fine, two weeks per query is NOT fine.

I have a server with about 18 TB of storage and 48 GB of RAM, and 12
CPU cores. 

My initial plan was to use MySQL, InnoDB, and deal with problems as
they arise. Perhaps, say, I would implement my own joining
procedures. 

After reading some disparaging stuff about InnoDB performance on large
datasets, however, I am getting cold feet. I have a general feeling
that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
with either MySQL and Postgres.

I do not know much about Postgres, but I am very eager to learn and
see if I can use it for my purposes more effectively than MySQL.

I cannot shell out $47,000 per CPU for Oracle for this project. 

To be more specific, the batch queries that I would do, I hope, 
would either use small JOINS of a small dataset to a large dataset, or
just SELECTS from one big table.

So... Can Postgres support a 5-10 TB database with the use pattern
stated above?

i

[toc] | [next] | [standalone]

#1390

From	Michael Vilain <vilain@NOspamcop.net>
Date	2011-09-09 23:30 -0700
Message-ID	<vilain-F1171C.23304409092011@news.individual.net>
In reply to	#1389

In article <H9idndjaqY3Xb_fTnZ2dnUVZ_sGdnZ2d@giganews.com>,
 Ignoramus5263 <ignoramus5263@NOSPAM.5263.invalid> wrote:

> First of all, this is not an attempt to create a flamewar of any
> kind. I have been a MySQL user for years, including owning a few
> multi-gigabyte databases for my websites, and using it to host
> algebra.com (about 12 GB database). 
> 
> I have had my ups and downs with MySQL. The ups were ease of use and
> decent performance for small databases such as algebra.com. The downs
> were things like twenty hour REPAIR TABLE operations on a 35 GB
> table, etc.
> 
> Right now I have a personal (one user) project to create a 5-10
> Terabyte data warehouse. The largest table will consume the most space
> and will take, perhaps, 200,000,000 rows. 
> 
> I want to use it to obtain valuable business intelligence and to make
> money.
> 
> I expect it to grow, never shrink, and to be accessed via batch
> queries. I do not care for batch queries to be super fast, for example
> an hour per query would be just fine. 
> 
> However, while an hour is fine, two weeks per query is NOT fine.
> 
> I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> CPU cores. 
> 
> My initial plan was to use MySQL, InnoDB, and deal with problems as
> they arise. Perhaps, say, I would implement my own joining
> procedures. 
> 
> After reading some disparaging stuff about InnoDB performance on large
> datasets, however, I am getting cold feet. I have a general feeling
> that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
> with either MySQL and Postgres.
> 
> I do not know much about Postgres, but I am very eager to learn and
> see if I can use it for my purposes more effectively than MySQL.
> 
> I cannot shell out $47,000 per CPU for Oracle for this project. 
> 
> To be more specific, the batch queries that I would do, I hope, 
> would either use small JOINS of a small dataset to a large dataset, or
> just SELECTS from one big table.
> 
> So... Can Postgres support a 5-10 TB database with the use pattern
> stated above?
> 
> i

This is why database consultants get paid big bucks.  Answering 
questions like "what DB product should I buy?" and "how should I lay out 
my database system?" are all bread and butter for these guys.  And you 
want that sort of analysis for free?  Huh.

Basically, unless you're a high power database person who knows the ins 
and outs of each vendor's products, how to optimize performance (that's 
$150K/year easy), and can design and administer such systems, I think 
you're going to have do the research own your own.  

Back in my day of system performance monitoring, a customer would ask me 
how their workload would run on a specific hardware configuration or why 
it's running poorly and should they buy the larger machine.  Poorly 
written applications will still perform poorly on a bigger machine, just 
faster.  Spend time designing something well with growth, scale, 
archival, and _yes Virginia, fricken backups_ and you'll save yourself 
some time and stress.  Developers never think of "how do I back this 
up?", they just ask the datacenter to do it.  After they've gone live.

I have no idea how well you've designed your database or your experience 
in such matters.  You sound like a hardware guy from what you've 
described.  If this is just personal project, build it using MySQL, run 
some benchmarks, rebuild it with the same data on Postgres and run the 
same benchmarks.  If they're the same, add more data and see what 
happens?  Keep adding more data until you see a difference.  I've only 
used MySQL without transactions, so I can't tell you how well it will 
perform compared to Postgres which I think had them from day 1.  Some 
DBMS' have other ways of indexing tables beyond b-trees (DEC's RdB had a 
hashing method that allowed for very fast reads on certain types of 
tables).  I don't know if MySQL or Postgres offer that level of 
optimization.  

I just backup the stuff.

-- 
DeeDee, don't press that button!  DeeDee!  NO!  Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]

[toc] | [prev] | [next] | [standalone]

#1391

From	Luuk <Luuk@invalid.lan>
Date	2011-09-10 10:13 +0200
Message-ID	<4e6b1c20$0$2527$e4fe514c@news2.news.xs4all.nl>
In reply to	#1390

On 10-09-2011 08:30, Michael Vilain wrote:
> Basically, unless you're a high power database person who knows the ins 
> and outs of each vendor's products, how to optimize performance (that's 
> $150K/year easy), and can design and administer such systems, I think 
> you're going to have do the research *own* your own.  

nice typo ;)


-- 
Luuk

[toc] | [prev] | [next] | [standalone]

#1392

From	Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid>
Date	2011-09-10 09:34 -0500
Message-ID	<M9OdnSps5JUM6PbTnZ2dnUVZ_gGdnZ2d@giganews.com>
In reply to	#1390

On 2011-09-10, Michael Vilain <vilain@NOspamcop.net> wrote:
> This is why database consultants get paid big bucks.  Answering 
> questions like "what DB product should I buy?" and "how should I lay out 
> my database system?" are all bread and butter for these guys.  And you 
> want that sort of analysis for free?  Huh.

I was hoping that someone in these two groups has faced similar issues
before and would be willing to share their relevant experiences. 

> Basically, unless you're a high power database person who knows the ins 
> and outs of each vendor's products, how to optimize performance (that's 
> $150K/year easy), and can design and administer such systems, I think 
> you're going to have do the research own your own.  

Well, this is what I am trying to do here, do a research.

> Back in my day of system performance monitoring, a customer would
> ask me how their workload would run on a specific hardware
> configuration or why it's running poorly and should they buy the
> larger machine.  Poorly written applications will still perform
> poorly on a bigger machine, just faster.  Spend time designing
> something well with growth, scale, archival, and _yes Virginia,
> fricken backups_ and you'll save yourself some time and stress.
> Developers never think of "how do I back this up?", they just ask
> the datacenter to do it.  After they've gone live.

Well, for better or worse, I have already designed a backup system for
this warehouse. 

But there are issues of large datasets that just do not matter in
small datasets. I liken it to having a pet that happens to be an
elephant. It is not the same as having a mouse, cat or dog for a pet
-- a pet elephant comes with its own set of challenges. 

For example, I remember the horrors and upsets of REPAIR TABLE
statements that took almost a day. That was with a MyISAM table around
40 gigs (I do not remember the size exactly).

This means that I cannot use MyISAM, period, end of story, for
something 100 times the size of the above. 

And that is why I am asking for people with relevant experience to
share what it is. 

> I have no idea how well you've designed your database or your experience 
> in such matters.  You sound like a hardware guy from what you've 
> described. 

I am actually a computer programmer. 

> If this is just personal project, build it using MySQL, run some
> benchmarks, rebuild it with the same data on Postgres and run the
> same benchmarks.  If they're the same, add more data and see what
> happens?  Keep adding more data until you see a difference.  I've
> only used MySQL without transactions, so I can't tell you how well
> it will perform compared to Postgres which I think had them from day
> 1.  Some DBMS' have other ways of indexing tables beyond b-trees
> (DEC's RdB had a hashing method that allowed for very fast reads on
> certain types of tables).  I don't know if MySQL or Postgres offer
> that level of optimization.
>
> I just backup the stuff.

Doing it in both mysql and postgres, is not such a bad idea.

I have backups relatively well covered, as my paranoia about backups
has no bounds. 

i

[toc] | [prev] | [next] | [standalone]

#1393

From	John Levine <johnl@iecc.com>
Date	2011-09-10 19:47 +0000
Message-ID	<j4get5$1jc2$1@gal.iecc.com>
In reply to	#1392

>For example, I remember the horrors and upsets of REPAIR TABLE
>statements that took almost a day. That was with a MyISAM table around
>40 gigs (I do not remember the size exactly).

Yeah, they do.  I'm most of the way through repairing a 57G table which
takes about a day and a half.

>This means that I cannot use MyISAM, period, end of story, for
>something 100 times the size of the above. 

Hmmn.  This suggests that you might want to reread the manual.

A few moments spent looking at the MySQL manual reveals partitioned
tables and MERGE tables, both of which are reasonable ways to divide
one logical table into several physical ones.  If your data just
grows, but the oldest stuff never changes, merged tables may be
appropriate, since you can make the older tables packed and read-only.
Either way, the individual tables can be much smaller and more
tractable.

If your data naturally falls into slices (by numeric value in 5.1,
number or string in 5.5), partitioned tables may work, particularly if
you can construct your queries to take advantage of partition pruning
in your queries.

R's,
John

[toc] | [prev] | [next] | [standalone]

#1394

From	Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid>
Date	2011-09-10 16:02 -0500
Message-ID	<65Wdna3o9egcTfbTnZ2dnUVZ_uWdnZ2d@giganews.com>
In reply to	#1393

On 2011-09-10, John Levine <johnl@iecc.com> wrote:
>>For example, I remember the horrors and upsets of REPAIR TABLE
>>statements that took almost a day. That was with a MyISAM table around
>>40 gigs (I do not remember the size exactly).
>
> Yeah, they do.  I'm most of the way through repairing a 57G table which
> takes about a day and a half.

Then you know!

>>This means that I cannot use MyISAM, period, end of story, for
>>something 100 times the size of the above. 
>
> Hmmn.  This suggests that you might want to reread the manual.
>
> A few moments spent looking at the MySQL manual reveals partitioned
> tables and MERGE tables, both of which are reasonable ways to divide
> one logical table into several physical ones.  If your data just
> grows, but the oldest stuff never changes, merged tables may be
> appropriate, since you can make the older tables packed and read-only.
> Either way, the individual tables can be much smaller and more
> tractable.

At first, I really liked your idea of a MERGE. It does fit my dataset
very well. I had NO idea that it was possible. I thought that I can,
essentially, split my data by month. 

Then I realized that I perform my updates for the more recent items by
means of INSERT... ON DUPLICATE KEY UPDATE, and it would not work with
MERGE tables.

``Because the MERGE engine cannot enforce uniqueness over the set of
underlying tables''

> If your data naturally falls into slices (by numeric value in 5.1,
> number or string in 5.5), partitioned tables may work, particularly if
> you can construct your queries to take advantage of partition pruning
> in your queries.

Maybe I can operate it by slicing data by, say, last digit of the
numeric ID. (I get the numeric ID of my items with the items, they
already have a natural unique key)

The more I think about the problem, the more I have a bad feeling
about it. 

i

[toc] | [prev] | [next] | [standalone]

#1400

From	The Natural Philosopher <tnp@invalid.invalid>
Date	2011-09-11 15:40 +0100
Message-ID	<j4ih9n$rta$2@news.albasani.net>
In reply to	#1394

Ignoramus13208 wrote:
> On 2011-09-10, John Levine <johnl@iecc.com> wrote:
>>> For example, I remember the horrors and upsets of REPAIR TABLE
>>> statements that took almost a day. That was with a MyISAM table around
>>> 40 gigs (I do not remember the size exactly).
>> Yeah, they do.  I'm most of the way through repairing a 57G table which
>> takes about a day and a half.
> 
> Then you know!
> 
>>> This means that I cannot use MyISAM, period, end of story, for
>>> something 100 times the size of the above. 
>> Hmmn.  This suggests that you might want to reread the manual.
>>
>> A few moments spent looking at the MySQL manual reveals partitioned
>> tables and MERGE tables, both of which are reasonable ways to divide
>> one logical table into several physical ones.  If your data just
>> grows, but the oldest stuff never changes, merged tables may be
>> appropriate, since you can make the older tables packed and read-only.
>> Either way, the individual tables can be much smaller and more
>> tractable.
> 
> At first, I really liked your idea of a MERGE. It does fit my dataset
> very well. I had NO idea that it was possible. I thought that I can,
> essentially, split my data by month. 
> 
> Then I realized that I perform my updates for the more recent items by
> means of INSERT... ON DUPLICATE KEY UPDATE, and it would not work with
> MERGE tables.
> 

so do it a different way.

[toc] | [prev] | [next] | [standalone]

#1401

From	Axel Schwenke <axel.schwenke@gmx.de>
Date	2011-09-11 18:11 +0200
Message-ID	<v63vj8-su2.ln1@xl.homelinux.org>
In reply to	#1394

Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid> wrote:
>>
>> A few moments spent looking at the MySQL manual reveals partitioned
>> tables and MERGE tables
>
> At first, I really liked your idea of a MERGE. It does fit my dataset
> very well.

I doubt that. MERGE is the much older and quite crappy sister of
partitioned tables. And since you claimed partitions would not fit,
MERGE won't too.

Also the MERGE engine is notoriously unstable. I expect to see
support for MERGE tables being dropped soon.

XL

[toc] | [prev] | [next] | [standalone]

#1402

From	Ignoramus21330 <ignoramus21330@NOSPAM.21330.invalid>
Date	2011-09-11 12:07 -0500
Message-ID	<ROKdnQFVdL14d_HTnZ2dnUVZ_sydnZ2d@giganews.com>
In reply to	#1401

On 2011-09-11, Axel Schwenke <axel.schwenke@gmx.de> wrote:
> Ignoramus13208 <ignoramus13208@NOSPAM.13208.invalid> wrote:
>>>
>>> A few moments spent looking at the MySQL manual reveals partitioned
>>> tables and MERGE tables
>>
>> At first, I really liked your idea of a MERGE. It does fit my dataset
>> very well.
>
> I doubt that. MERGE is the much older and quite crappy sister of
> partitioned tables. And since you claimed partitions would not fit,
> MERGE won't too.
>
> Also the MERGE engine is notoriously unstable. I expect to see
> support for MERGE tables being dropped soon.

OK, thanks for letting me know... I will not even consider...

i

[toc] | [prev] | [next] | [standalone]

#1396

From	Norman Peelman <npeelmandog@cfl.rr.com>
Date	2011-09-10 19:06 -0400
Message-ID	<j4gqi3$epu$1@dont-email.me>
In reply to	#1392

On 09/10/2011 10:34 AM, Ignoramus13208 wrote:
> On 2011-09-10, Michael Vilain<vilain@NOspamcop.net>  wrote:
>> This is why database consultants get paid big bucks.  Answering
>> questions like "what DB product should I buy?" and "how should I lay out
>> my database system?" are all bread and butter for these guys.  And you
>> want that sort of analysis for free?  Huh.
>
> I was hoping that someone in these two groups has faced similar issues
> before and would be willing to share their relevant experiences.
>
>> Basically, unless you're a high power database person who knows the ins
>> and outs of each vendor's products, how to optimize performance (that's
>> $150K/year easy), and can design and administer such systems, I think
>> you're going to have do the research own your own.
>
> Well, this is what I am trying to do here, do a research.
>
>> Back in my day of system performance monitoring, a customer would
>> ask me how their workload would run on a specific hardware
>> configuration or why it's running poorly and should they buy the
>> larger machine.  Poorly written applications will still perform
>> poorly on a bigger machine, just faster.  Spend time designing
>> something well with growth, scale, archival, and _yes Virginia,
>> fricken backups_ and you'll save yourself some time and stress.
>> Developers never think of "how do I back this up?", they just ask
>> the datacenter to do it.  After they've gone live.
>
> Well, for better or worse, I have already designed a backup system for
> this warehouse.
>
> But there are issues of large datasets that just do not matter in
> small datasets. I liken it to having a pet that happens to be an
> elephant. It is not the same as having a mouse, cat or dog for a pet
> -- a pet elephant comes with its own set of challenges.
>
> For example, I remember the horrors and upsets of REPAIR TABLE
> statements that took almost a day. That was with a MyISAM table around
> 40 gigs (I do not remember the size exactly).
>
> This means that I cannot use MyISAM, period, end of story, for
> something 100 times the size of the above.
>
> And that is why I am asking for people with relevant experience to
> share what it is.
>
>> I have no idea how well you've designed your database or your experience
>> in such matters.  You sound like a hardware guy from what you've
>> described.
>
> I am actually a computer programmer.
>
>> If this is just personal project, build it using MySQL, run some
>> benchmarks, rebuild it with the same data on Postgres and run the
>> same benchmarks.  If they're the same, add more data and see what
>> happens?  Keep adding more data until you see a difference.  I've
>> only used MySQL without transactions, so I can't tell you how well
>> it will perform compared to Postgres which I think had them from day
>> 1.  Some DBMS' have other ways of indexing tables beyond b-trees
>> (DEC's RdB had a hashing method that allowed for very fast reads on
>> certain types of tables).  I don't know if MySQL or Postgres offer
>> that level of optimization.
>>
>> I just backup the stuff.
>
> Doing it in both mysql and postgres, is not such a bad idea.
>
> I have backups relatively well covered, as my paranoia about backups
> has no bounds.
>
> i

   It seems to me that if you could split the db into smaller tables 
based on per site usage or timeframe (monthly, quarterly, etc) any 
REPAIR TABLE operations would be significantly reduced. How much that 
would impact your scripts I don't know. That and a 64bit linux (is the 
hw 64bit) to cover the file structure and data structure limits.

http://jeremy.zawodny.com/blog/archives/000796.html
http://www.fromdual.ch/mysql-limitations




-- 
Norman
Registered Linux user #461062
-Have you been to www.mysql.com yet?-

[toc] | [prev] | [next] | [standalone]

#1398

From	Matthew Woodcraft <mattheww@chiark.greenend.org.uk>
Date	2011-09-11 10:24 +0100
Message-ID	<Mlb*t+XMt@news.chiark.greenend.org.uk>
In reply to	#1389

Ignoramus5263  <ignoramus5263@NOSPAM.5263.invalid> wrote:
>So... Can Postgres support a 5-10 TB database with the use pattern
>stated above?

If nobody here gives you a concrete answer, I suggest asking on the
pgsql-performance mailing list; it's the sort of question they're used
to answering.

http://archives.postgresql.org/pgsql-performance/

-M-

[toc] | [prev] | [next] | [standalone]

#1399

From	Ignoramus21330 <ignoramus21330@NOSPAM.21330.invalid>
Date	2011-09-11 07:35 -0500
Message-ID	<aYmdnU_PzfKWNvHTnZ2dnUVZ_t-dnZ2d@giganews.com>
In reply to	#1398

On 2011-09-11, Matthew Woodcraft <mattheww@chiark.greenend.org.uk> wrote:
> Ignoramus5263  <ignoramus5263@NOSPAM.5263.invalid> wrote:
>>So... Can Postgres support a 5-10 TB database with the use pattern
>>stated above?
>
> If nobody here gives you a concrete answer, I suggest asking on the
> pgsql-performance mailing list; it's the sort of question they're used
> to answering.
>
> http://archives.postgresql.org/pgsql-performance/
>
> -M-

Thanks. I sbuscribed and sent the same question, slightly reworded, to that list.

[toc] | [prev] | [next] | [standalone]

#1515

From	Bodo <bodo@nomail.org.invalid>
Date	2011-09-23 13:01 +0200
Message-ID	<bodo-F453C8.13010023092011@news.lrz-muenchen.de>
In reply to	#1389

> The largest table will consume the most space
> and will take, perhaps, 200,000,000 rows. 
> 

That should be no problem. 
We run a 250m rows beast here unpartitioned.
Updated in real time.

BTW select count(*) needs 5 minutes under full load and 2 minutes on a smaller but less busy box.

regards,

Toni

MySQL Innobase 5.5 and 5.1

[toc] | [prev] | [standalone]

csiph-web

MySQL vs Postgres for a "data warehouse", 5-10 TB

Contents

#1389 — MySQL vs Postgres for a "data warehouse", 5-10 TB

#1390

#1391

#1392

#1393

#1394

#1400

#1401

#1402

#1396

#1398

#1399

#1515