Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.databases.ms-sqlserver > #1283 > unrolled thread
| Started by | migurus <migurus@yahoo.com> |
|---|---|
| First post | 2012-09-13 14:47 -0700 |
| Last post | 2012-09-17 08:46 -0700 |
| Articles | 8 — 5 participants |
Back to article view | Back to comp.databases.ms-sqlserver
table design for massive GPS data migurus <migurus@yahoo.com> - 2012-09-13 14:47 -0700
Re: table design for massive GPS data bradbury9 <ray.bradbury9@gmail.com> - 2012-09-14 00:06 -0700
Re: table design for massive GPS data "Bob Barrows" <reb01501@NOSPAMyahoo.com> - 2012-09-14 07:33 -0400
Re: table design for massive GPS data migurus <migurus@yahoo.com> - 2012-09-14 10:13 -0700
Re: table design for massive GPS data rja.carnegie@gmail.com - 2012-09-14 05:05 -0700
Re: table design for massive GPS data migurus <migurus@yahoo.com> - 2012-09-14 10:25 -0700
Re: table design for massive GPS data Erland Sommarskog <esquel@sommarskog.se> - 2012-09-15 19:21 +0200
Re: table design for massive GPS data rja.carnegie@gmail.com - 2012-09-17 08:46 -0700
| From | migurus <migurus@yahoo.com> |
|---|---|
| Date | 2012-09-13 14:47 -0700 |
| Subject | table design for massive GPS data |
| Message-ID | <37389863-6d11-4092-9f04-2e27dbccbd2b@googlegroups.com> |
Windows Server/SQL Server 2008, we need to store GPS readings from roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data comes in in real-time. Records are never updated after insert. That gives us roughly 120 million records per year. We need to keep last 5 years of records. The usage of stored data is to query vehicle positions and event attributes within entire day by given vehicle and date. Geospatial features and functions are not really needed for the reporting. I'd like to discuss table design for these requirements. Here is a preliminary record content: - vehicle id - gps reading X and Y - gps reading time - one-letter event attribute - event id (nullable) First idea is to keep data normalized in one table, one record per GPS reading. It looks very simple to implement, but I doubt the server will be happy with this high number of small records. I we were to do this, what would be a good candidate for primary key here? Just for illustration I show table def for idea# 1 create table idea1 ( VEHICLE_ID int not null , GPS_X real not null , GPS_Y real not null , GPS_TIME datetime not null , EVENT_LETTER char not null , EVENT_ID int null ); I don't show indexes as I am not clear what to use. The typical query would look like this: select GPS_X , GPS_Y , GPS_TIME , EVENT_LETTER , EVENT_ID from idea1 where VEHICLE_ID=@id and GPS_TIME between @from and @to Second idea is to keep a small table where records are inserted as they come in the structure shown above and then on the daily basis aggregate it and store as one record per vehicle in the another table, which will be used for queries. By doing this we shrink number of records from 120 million to be ~ 350 thousands per year. This looks much more manageable. The cons here are additional effort to come up with the way of storing and retrieving gps readings as a blob. So, the table would look: create table idea2 ( VEHICLE_ID int not null , RUN_DATE date not null , GPS_READINGS varbinary or whatever appropriate to keep daily portion of readings -- we maight need reading_count as int here to help extract them from blob ); And typical query would look like: select GPS_READINGS from idea2 where VEHICLE_ID=@id and RUN_DATE=@date This result blob will need to be 'expanded', we might use sp to do it. Any comments are welcome, critique is appreciated. Thanks in advance. migurus.
[toc] | [next] | [standalone]
| From | bradbury9 <ray.bradbury9@gmail.com> |
|---|---|
| Date | 2012-09-14 00:06 -0700 |
| Message-ID | <f14f137d-2075-48ae-bd86-04bb6c189d8f@googlegroups.com> |
| In reply to | #1283 |
El jueves, 13 de septiembre de 2012 23:47:43 UTC+2, migurus escribió: > Windows Server/SQL Server 2008, we need to store GPS readings from roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data comes in in real-time. Records are never updated after insert. That gives us roughly 120 million records per year. We need to keep last 5 years of records. > > > > The usage of stored data is to query vehicle positions and event attributes within entire day by given vehicle and date. Geospatial features and functions are not really needed for the reporting. > > > > I'd like to discuss table design for these requirements. > > > > Here is a preliminary record content: > > - vehicle id > > - gps reading X and Y > > - gps reading time > > - one-letter event attribute > > - event id (nullable) > > > > First idea is to keep data normalized in one table, one record per GPS reading. It looks very simple to implement, but I doubt the server will be happy with this high number of small records. I we were to do this, what would be a good candidate for primary key here? > > Just for illustration I show table def for idea# 1 > > create table idea1 ( > > VEHICLE_ID int not null > > , GPS_X real not null > > , GPS_Y real not null > > , GPS_TIME datetime not null > > , EVENT_LETTER char not null > > , EVENT_ID int null > > ); > > I don't show indexes as I am not clear what to use. > > The typical query would look like this: > > select > > GPS_X > > , GPS_Y > > , GPS_TIME > > , EVENT_LETTER > > , EVENT_ID > > from > > idea1 > > where > > VEHICLE_ID=@id > > and GPS_TIME between @from and @to > > > > > > Second idea is to keep a small table where records are inserted as they come in the structure shown above and then on the daily basis aggregate it and store as one record per vehicle in the another table, which will be used for queries. By doing this we shrink number of records from 120 million to be ~ 350 thousands per year. This looks much more manageable. The cons here are additional effort to come up with the way of storing and retrieving gps readings as a blob. > > So, the table would look: > > create table idea2 ( > > VEHICLE_ID int not null > > , RUN_DATE date not null > > , GPS_READINGS varbinary or whatever appropriate to keep daily portion of readings > > -- we maight need reading_count as int here to help extract them from blob > > ); > > And typical query would look like: > > select > > GPS_READINGS > > from > > idea2 > > where > > VEHICLE_ID=@id > > and RUN_DATE=@date > > This result blob will need to be 'expanded', we might use sp to do it. > > > > Any comments are welcome, critique is appreciated. > > Thanks in advance. > > migurus. 120.000.000 record per year up to 5 years = 600.000.000. That is a pretty big database as you already said. I wonder if it is not better creating an indexed view. That way you avoid creating 547.000 extra records that consume disk space. BTW, make sure you run the SQL Server quey optimizer so the select used in the view has good indexes.
[toc] | [prev] | [next] | [standalone]
| From | "Bob Barrows" <reb01501@NOSPAMyahoo.com> |
|---|---|
| Date | 2012-09-14 07:33 -0400 |
| Message-ID | <k2v4pe$n3t$1@dont-email.me> |
| In reply to | #1283 |
migurus wrote: > Windows Server/SQL Server 2008, we need to store GPS readings from > roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data > comes in in real-time. Records are never updated after insert. That > gives us roughly 120 million records per year. We need to keep last 5 > years of records. > > The usage of stored data is to query vehicle positions and event > attributes within entire day by given vehicle and date. Geospatial > features and functions are not really needed for the reporting. > > I'd like to discuss table design for these requirements. > > Here is a preliminary record content: > - vehicle id > - gps reading X and Y > - gps reading time > - one-letter event attribute > - event id (nullable) > > First idea is to keep data normalized in one table, one record per > GPS reading. It looks very simple to implement, but I doubt the > server will be happy with this high number of small records. Why not? Narrow and long tables are perfect for relational databases > I we > were to do this, what would be a good candidate for primary key here? What combination of columns identifies a row? Vehicle_ID and GPS_TIME? If so, there's your clustered primary key (wtp?) What is that event id and why is it nullable? > Just for illustration I show table def for idea# 1 > create table idea1 ( > VEHICLE_ID int not null > , GPS_X real not null > , GPS_Y real not null > , GPS_TIME datetime not null > , EVENT_LETTER char not null > , EVENT_ID int null > ); > I don't show indexes as I am not clear what to use. > The typical query would look like this: > select > GPS_X > , GPS_Y > , GPS_TIME > , EVENT_LETTER > , EVENT_ID > from > idea1 > where > VEHICLE_ID=@id > and GPS_TIME between @from and @to The proposed clustered primary key above would suit this query perfectly > > > Second idea is to keep a small table where records are inserted as > they come in the structure shown above and then on the daily basis > aggregate it and store as one record per vehicle in the another > table, which will be used for queries. By doing this we shrink number > of records from 120 million to be ~ 350 thousands per year. This > looks much more manageable. The cons here are additional effort to > come up with the way of storing and retrieving gps readings as a > blob. > Nah, this data isn't really suitable for aggregation, IMO.
[toc] | [prev] | [next] | [standalone]
| From | migurus <migurus@yahoo.com> |
|---|---|
| Date | 2012-09-14 10:13 -0700 |
| Message-ID | <316dd77d-8a3b-4a09-9477-56fd928d6008@googlegroups.com> |
| In reply to | #1285 |
On Friday, September 14, 2012 4:35:11 AM UTC-7, Bob Barrows wrote: > migurus wrote: > > > Windows Server/SQL Server 2008, we need to store GPS readings from > > > roughly 1000 vehicles, ~ 300 readings per vehicle per day. The data > > > comes in in real-time. Records are never updated after insert. That > > > gives us roughly 120 million records per year. We need to keep last 5 > > > years of records. > > > > > > The usage of stored data is to query vehicle positions and event > > > attributes within entire day by given vehicle and date. Geospatial > > > features and functions are not really needed for the reporting. > > > > > > I'd like to discuss table design for these requirements. > > > > > > Here is a preliminary record content: > > > - vehicle id > > > - gps reading X and Y > > > - gps reading time > > > - one-letter event attribute > > > - event id (nullable) > > > > > > First idea is to keep data normalized in one table, one record per > > > GPS reading. It looks very simple to implement, but I doubt the > > > server will be happy with this high number of small records. > > > > Why not? Narrow and long tables are perfect for relational databases > > Thanks, I did not know that, as I have limited exposure in db field. > > > I we > > > were to do this, what would be a good candidate for primary key here? > > > > What combination of columns identifies a row? Vehicle_ID and GPS_TIME? If > > so, there's your clustered primary key (wtp?) > > What is that event id and why is it nullable? Some events do not require id, this will not be any foreign key or anything > > > > > > > Just for illustration I show table def for idea# 1 > > > create table idea1 ( > > > VEHICLE_ID int not null > > > , GPS_X real not null > > > , GPS_Y real not null > > > , GPS_TIME datetime not null > > > , EVENT_LETTER char not null > > > , EVENT_ID int null > > > ); > > > I don't show indexes as I am not clear what to use. > > > The typical query would look like this: > > > select > > > GPS_X > > > , GPS_Y > > > , GPS_TIME > > > , EVENT_LETTER > > > , EVENT_ID > > > from > > > idea1 > > > where > > > VEHICLE_ID=@id > > > and GPS_TIME between @from and @to > > > > The proposed clustered primary key above would suit this query perfectly > > > > > > > > > > > Second idea is to keep a small table where records are inserted as > > > they come in the structure shown above and then on the daily basis > > > aggregate it and store as one record per vehicle in the another > > > table, which will be used for queries. By doing this we shrink number > > > of records from 120 million to be ~ 350 thousands per year. This > > > looks much more manageable. The cons here are additional effort to > > > come up with the way of storing and retrieving gps readings as a > > > blob. > > > > > Nah, this data isn't really suitable for aggregation, IMO. Thanks for your input.
[toc] | [prev] | [next] | [standalone]
| From | rja.carnegie@gmail.com |
|---|---|
| Date | 2012-09-14 05:05 -0700 |
| Message-ID | <aa20c002-6e6e-4f36-b678-d4c2c736dada@googlegroups.com> |
| In reply to | #1283 |
Off the top of a semi-qualified head: If the table is indexed well, then query performance will depend on the volume of data in a query, but not so much on the size of a table and the rows that you aren't looking at. If the typical query is all the records for one vehicle on one calendar day (one working day), then a primary key of date, vehicle id, time, looks possible. If you want to count work after midnight for the previous day, that could be messier - you could think about using datetimeoffset. I expect such a design also to be efficient when you are deleting records that are 5 years old every day. If you use datetime order, then vehicle, then the server has to scan records of 1000 vehicles to get the data that you actually want to see. On the other hand, if you want the query for 1000 vehicles, then it's probably okay if they're mixed together. If data arrives in real time, then a clustered index of date, vehicle, time, may slow down updates, while favouring the query. Usually you want it that way around. The clustered index makes SQL Server physically store neighbouring records together. So you'll have 1000 distinct points in the data file where each vehicle's records are being added. You'd also see data filling pages that are then split into two, a lot. But I expect the server still to perform well in that case. There also may be an implication of the growth of the transaction log, but if data is inserted one row at a time then it won't make much of a difference - the log ray be pretty big whatever you do. Your primary key is of course unique and doesn't allow NULL and by default is a clustered index, but none of this is necessary, including even having a primary key as such. Indexes that aren't the primary key will work just as well. The clustered index does matter; it's also desirable for its key to be short, since it's incorporated into any nonclustered indexes, so if there isn't a need to use the clustered index to reorganise the data then a clustered index on an IDENTITY(int, 1, 1) column can be pretty good. /Then/ you set your nonclustered primary key (which does have to be unique and not NULL). Including a non-meaningful IDENTITY in a primary key wouldn't be right, but it's right if it is, in fact, the key of your table in relationships with other data. The point is, if your data is not unique in the date, time, and vehicle ID fields, then don't get sidetracked into worrying about a primary key that you can't have. Alternatively, add 1 second to one of the duplicate key values, and it'll be unique. Primary key also is a constraint that must have a unique object name (PK_nameoftable probably), whereas you can and I do name other indexes IX_1, IX_2, IX_3, on every table in your database. This doesn't help other developers understand the database, but other developers don't often help me, so if they want to know what my indexes are, they can look 'em up. I probably have 500 nearly identical tables created at the same time. (Arguably I should have one table with 500 times as many rows and a couple of extra columns, but that isn't how we do things here.) So I don't want to invent 500 meaningful index names. By the way, this project sounds important enough that you'll be thinking carefully about data backup and recovering from disaster or downtime, if you don't already have that covered thoroughly for your platform. So, your thinking about backup and recovery also may bear on your database design. For instance, do you want that big, fat transaction log that holds everything that happened in your database between the last backup and the failure - or do you have files of the last 7 days' GPS data that you can re-load as part of your disaster recovery process, so that it doesn't matter if the database is blown away except for the twice-weekly taken-off-site backup?
[toc] | [prev] | [next] | [standalone]
| From | migurus <migurus@yahoo.com> |
|---|---|
| Date | 2012-09-14 10:25 -0700 |
| Message-ID | <86717762-58b1-4a3a-a68d-2177c6047df7@googlegroups.com> |
| In reply to | #1286 |
On Friday, September 14, 2012 5:05:05 AM UTC-7, rja.ca...@gmail.com wrote: > Off the top of a semi-qualified head: > > > > If the table is indexed well, then query performance will depend on the > > volume of data in a query, but not so much on the size of a table and the > > rows that you aren't looking at. > > Thanks, I had an impression that hundreds of millions of small records is not a sound design. > > If the typical query is all the records for one vehicle on one calendar day > > (one working day), then a primary key of date, vehicle id, time, looks > > possible. If you want to count work after midnight for the previous day, > > that could be messier - you could think about using datetimeoffset. > > > > I expect such a design also to be efficient when you are deleting > > records that are 5 years old every day. > > > > If you use datetime order, then vehicle, then the server has to scan > > records of 1000 vehicles to get the data that you actually want to see. > > On the other hand, if you want the query for 1000 vehicles, then > > it's probably okay if they're mixed together. > One vehicle at a time queries required. > > > If data arrives in real time, then a clustered index of date, vehicle, time, > > may slow down updates, while favouring the query. Usually you want it > > that way around. The clustered index makes SQL Server physically store > > neighbouring records together. So you'll have 1000 distinct points in > > the data file where each vehicle's records are being added. You'd also > > see data filling pages that are then split into two, a lot. But I expect > > the server still to perform well in that case. There also may be an > > implication of the growth of the transaction log, but if data is inserted > > one row at a time then it won't make much of a difference - the log ray be > > pretty big whatever you do. > There is no updates, as I mentioned. Inserts only. I love your idea of building three part index: date, vehicle_id, time. This is right to the point and not a big overhead space wise. > > > Your primary key is of course unique and doesn't allow NULL and by > > default is a clustered index, but none of this is necessary, > > including even having a primary key as such. Indexes that aren't > > the primary key will work just as well. The clustered index does > > matter; it's also desirable for its key to be short, since it's > > incorporated into any nonclustered indexes, so if there isn't a need > > to use the clustered index to reorganise the data then a clustered > > index on an IDENTITY(int, 1, 1) column can be pretty good. > > /Then/ you set your nonclustered primary key (which does have > > to be unique and not NULL). Including a non-meaningful IDENTITY > > in a primary key wouldn't be right, but it's right if it is, > > in fact, the key of your table in relationships with other data. > > > > The point is, if your data is not unique in the date, time, and > > vehicle ID fields, then don't get sidetracked into worrying about > > a primary key that you can't have. Alternatively, add 1 second > > to one of the duplicate key values, and it'll be unique. > > > > Primary key also is a constraint that must have a unique object name > > (PK_nameoftable probably), whereas you can and I do name other indexes > > IX_1, IX_2, IX_3, on every table in your database. This doesn't > > help other developers understand the database, but other developers > > don't often help me, so if they want to know what my indexes are, > > they can look 'em up. I probably have 500 nearly identical tables > > created at the same time. (Arguably I should have one table with > > 500 times as many rows and a couple of extra columns, but that isn't > > how we do things here.) So I don't want to invent 500 meaningful > > index names. > > > > By the way, this project sounds important enough that you'll be > > thinking carefully about data backup and recovering from disaster > > or downtime, if you don't already have that covered thoroughly > > for your platform. So, your thinking about backup and recovery > > also may bear on your database design. For instance, do you > > want that big, fat transaction log that holds everything that > > happened in your database between the last backup and the failure - > > or do you have files of the last 7 days' GPS data that you can > > re-load as part of your disaster recovery process, so that it > > doesn't matter if the database is blown away except for the > > twice-weekly taken-off-site backup? I was thinking about an hourly log shipping and daily or weekly full backups. Would really appreciate a comment on this idea. Thanks for your thoughts.
[toc] | [prev] | [next] | [standalone]
| From | Erland Sommarskog <esquel@sommarskog.se> |
|---|---|
| Date | 2012-09-15 19:21 +0200 |
| Message-ID | <XnsA0CFC4E06222AYazorman@127.0.0.1> |
| In reply to | #1283 |
migurus (migurus@yahoo.com) writes: > First idea is to keep data normalized in one table, one record per GPS > reading. It looks very simple to implement, but I doubt the server will > be happy with this high number of small records. Why not? As long it is properly indexed, I don't see any problem. What you should consider though, is to partition the table (which requires Enterprise Edition). When five years has passed, and you want to purge old data, this is a swift affair with partitioning, provided that your partition aligns with what you want to purge. > I we were to do this, what would be a good candidate for primary key here? (VEHICLE_ID, GPS_TIME). Although, you should probably have GPS_TIME as the first column in the clustered index to avoid fragmentation. GPS_TIME would also be your partitioning column. > So, the table would look: > create table idea2 ( > VEHICLE_ID int not null > , RUN_DATE date not null > , GPS_READINGS varbinary or whatever appropriate to keep daily > portion of readings You may save some space with this, because you save row overhead. But you get a much more complex solution which is more difficult to understand and maintain. And if the neeeds changes in the future and people want to run queries like "which vehicles have been in that area at some time?", that is hopeless with the above. I strongly recommend against it. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Links for SQL Server Books Online: SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
[toc] | [prev] | [next] | [standalone]
| From | rja.carnegie@gmail.com |
|---|---|
| Date | 2012-09-17 08:46 -0700 |
| Message-ID | <c7f8f434-150c-4b5b-b223-9c4fe5431912@googlegroups.com> |
| In reply to | #1289 |
On Saturday, September 15, 2012 6:21:13 PM UTC+1, Erland Sommarskog wrote: > migurus (migurus@yahoo.com) writes: > > First idea is to keep data normalized in one table, one record per > > GPS reading. It looks very simple to implement, but I doubt the > > server will be happy with this high number of small records. > > Why not? As long it is properly indexed, I don't see any problem. > What you should consider though, is to partition the table (which > requires Enterprise Edition). When five years has passed, and you > want to purge old data, this is a swift affair with partitioning, > provided that your partition aligns with what you want to purge. Partitioning might be not a great boost to performance, depending on how it's done. With methods possibly out of date or wrong, I estimate the original design row size is 23 bytes, counting one byte to map NULL columns (for up to 8 table columns whether nullable or not), and an overhead of 11 bytes per table row, although I don't remember how I worked that out - maybe by trying it out. Then I think that one day's new data is a little less than 10 megabytes, and five years is around 20 GB. That's before indexing - which would probably be, well, somewhat smaller than the data. If it's stupid to make tables as large as that, I've been stupider. 300,000 transactions per day, however, I put at about 2 GB or more in the transaction log if it's fully logged, so keeping that under control might be an important part of the backup/maintenance plan. On the other hand, since I'm assuming storing one 8 KB page in the log for storing each 34 bytes row, it's a lousy way to keep a backup version of the data as far as efficiency goes. But efficient isn't the same thing as effective. If you can bear the inefficiency, it'll work. Whoops, I forgot that these additions will probably update the index as well, so, double that log requirement. For deleting data /without/ taking table partitioning into account, you might remove one day's records each day after five years, but that won't be logged so badly if the rows are grouped together by clustered index - you're probably okay even if they're ordered by vehicle ID and /then/ date/time. And then the physical storage of data rows will be merrily jumbled within the data file - fragmented somewhat as time goes on - but if you only want to query one day's and one vehicle's data at a time, I think that should be fine. If partitioning the table, instead, what size of partition would you make - one day, one month, one year - and how would you maintain that? Can it be done automatically? Backup should be whatever you need to have in order to restore as much as you need to restore, as promptly as you need to have it, with depth of design in case you discover just at the wrong time that your backup tape or disc isn't any good. There are some horror stories that it's good to read, but not at bedtime. So, what you should set up depends on the circumstances, and the budget. You may need to explain some of the horror stories to whoever pays for stuff. I might include separately storing at least a day's worth of GPS messages to re-load into your database from the external format, in case of (a) equipment failure and data loss and (b) finding out that you were mishandling and corrupting the messages for a while. There are horror stories about that, too. And if you can restore yesterday's data in that way, then you may only need a daily backup of the entire database, just after midnight - which will save you from storing copies of the transaction log. On the other hand, your recovery scenario may be, you have database backups, but someone stole all of your computers... Also, would it be bad if somebody stole your data... such as, to rob the vehicles. Like, if the truck full of money drives the same route every Friday... sure, the bad guys could go out and follow it anyway, but at least then they have to pay for gas. I don't think you want to tell /us/ if you have trucks full of money. So, anyway, you should stop that from happening.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.databases.ms-sqlserver
csiph-web