db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Olivier Chedru" <oChe...@infovista.com>
Subject RE: Horizontal partitioning?
Date Thu, 19 Jul 2007 07:12:21 GMT
Hi David,

This looks like a very elegant solution and I will give it a try!

Thanks for the tip!
Olivier.

-----Original Message-----
From: david.vancouvering@gmail.com [mailto:david.vancouvering@gmail.com]
On Behalf Of David Van Couvering
Sent: Wednesday, July 18, 2007 8:46 PM
To: Derby Discussion
Subject: Re: Horizontal partitioning?

Hi, Olivier, thanks for your response,  I understand your situation
now.  The use of the term "horizontal partitioning" triggered a
standard interpretation for me.

One thing you could do is write a Java procedure, see

http://wiki.apache.org/db-derby/DerbySQLroutines

Using this approach, you could take as parameters the begin date, the
end date, and then a string for the actual query (without the
timestamp constraints).  You could then compose a query against the
relevant tables, given the timestamp constraints, and execute this
query, which is returned as a result set to the user.

David

On 7/18/07, Olivier Chedru <oChedru@infovista.com> wrote:
> Hi David,
>
> The purpose of my application is to keep measurement data locally, and
> Derby is perfect to manage a local database. Because of deployment
> reasons, I cannot use a distributed database.
> The measurement data never change (they are just dropped when their
> lifetime is reached). This application aims at running 24/24, and I
need
> to keep the database size stable once its lifetime is reached.
Database
> users will want to query data over a given short period of time, with
> reasonable response times.
>
> If I use a single table, I will fall in the following issues:
> - Index maintenance will become more and more expensive as the table
> grows. And it will grow a lot: typically, I will store 10000 rows per
> second and keep 1 week of data. This would generate more than 6
billion
> rows!
> - Dropping obsolete data will mean using DELETE FROM, which is much
more
> expensive than dropping a table. And this will probably lead to
database
> fragmentation over time.
>
> So I try to proceed this way:
> 1 - When some criteria is met (period of time, number of rows...), I
> create a new table without indexes.
> 2 - Data are stored using batch insert statements.
> 3 - Once the table is considered full (using the same criteria as
> above), I create the necessary indexes on the table.
> 4 - Goto 1.
>
> To provide an easy access to data and hide all those tables, a view
> seems very appropriate. Unfortunately, as shown in my first post, the
> view does not take into account constraints on the timestamp columns,
> and scans all tables. No need to say the query never ends...
>
> Any idea about this problem is welcome!
>
> Thanks,
> Olivier.
>
> -----Original Message-----
> From: david.vancouvering@gmail.com
[mailto:david.vancouvering@gmail.com]
> On Behalf Of David Van Couvering
> Sent: Wednesday, July 18, 2007 7:09 AM
> To: Derby Discussion
> Subject: Re: Horizontal partitioning?
>
> Hm, if you're creating a view on all the tables, that must mean all
> the tables are under a single database?  If that's the case, then this
> doesn't really sound like horizontal partitioning (which usually means
> splitting it into multiple databases, each with its own disk and
> usually each with its own CPU).
>
> If you're keeping everything within one database, then partitioning
> across multiple tables really is not useful or necessary, as Derby
> itself accomplishes quick access to any row in the table using the
> index on the primary key.
>
> Normally, if you have N partitions, then you would have N databases on
> N machines.  There would be no way to create a view across all of
> them.
>
> This is how folks like eBay and Google do it, for example.  This
> usually only works if the data in each partition is completely or
> almost completely independent and you don't need to do queries that
> span tables (e.g. you can issue the query against one specific
> partition depending on things like timestamp, user id, geographic
> location, etc).  If your queries can't be isolated to a single
> partition, then your application code is going to have to deal with
> logic normally assigned to the database, such as sorting and merging
> data, and now you're in the realm of a clustered database, and having
> worked on two of these in the past, I think I can safely say you don't
> want to go there...
>
> David
>
> On 7/16/07, Olivier Chedru <oChedru@infovista.com> wrote:
> >
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I am trying to achieve horizontal partitioning with Derby.
> >
> >
> >
> > My application stores *a lot* of data as time passes, and it needs
to
> drop
> > obsolete data.
> >
> >
> >
> > Table partitioning would be the ideal solution, but it is not
> implemented in
> > Derby. So I create one table per period of time. All tables have the
> same
> > columns, the first one being the data timestamp. Dropping obsolete
> data is
> > then made easy: just drop old tables.
> >
> >
> >
> > Now, I need to query data without knowing how tables are organized.
> So, I
> > create a view on all tables (the view is deleted/created every time
a
> data
> > table is created/dropped): fine.
> >
> > However, checking the runtime statistics, I notice ALL tables are
> scanned
> > when I use the view! Note I added constraints on the Timestamp
field.
> >
> >
> >
> > Sample SQL:
> >
> >
> >
> > create table t1 (ts timestamp, a int, b int, c int);
> >
> > create table t2 (ts timestamp, a int, b int, c int);
> >
> > create table t3 (ts timestamp, a int, b int, c int);
> >
> > alter table t1 add constraint t1_c check (ts >= '2007-07-16
> 09:00:00.0' and
> > ts < '2007-07-16 09:01:00.0');
> >
> > alter table t2 add constraint t2_c check (ts >= '2007-07-16
> 09:01:00.0' and
> > ts < '2007-07-16 09:02:00.0');
> >
> > alter table t3 add constraint t3_c check (ts >= '2007-07-16
> 09:02:00.0' and
> > ts < '2007-07-16 09:03:00.0');
> >
> > create view t_view as select * from t1 union select * from t2 union
> select *
> > from t3;
> >
> > CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> >
> > CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
> >
> > select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> > '2007-07-16 09:02:00.0';
> >
> > VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
> >
> >
> >
> > The displayed execution plan shows t1, t2 and t3 are scanned. I
expect
> only
> > t2 is used. I also tried to force using constraints by defining the
> view
> > this way, but it did not improve the behavior:
> >
> >
> >
> > create view t_view as select * from t1 --DERBY-PROPERTIES
> constraint=t1_c
> > union select * from t2 --DERBY-PROPERTIES constraint=t2_c union
select
> *
> > from t3 --DERBY-PROPERTIES constraint=t3_c;
> >
> >
> >
> > Any idea on how to achieve horizontal partitioning?
> >
> >
> >
> > Thanks!
> >
> > Olivier.
>

Mime
View raw message