db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nurullah Akkaya <nurullah_akk...@yahoo.com>
Subject Re: keeping the table ordered
Date Tue, 06 Feb 2007 22:51:48 GMT
for who ever searches these archives creating a composite index that  
includes all the columns i query solved the problem. i can read 2.7  
million records in 11 seconds another query that selects around  800k  
records dropped to about 4 secs.



On Feb 6, 2007, at 12:40 PM, Craig L Russell wrote:

> It sounds like the issue is the structure of the rows on the data  
> pages. IIUC, once a row is stored on a data page, it's going to be  
> there for a long long time. So as you add new rows, they are added  
> to data pages where there is free space. The only way to move a row  
> is to delete it and insert it so it ends up in another place.
>
> It might be that for this purpose, you might have to restructure  
> the database so that it is organized the way you want it. For  
> example, perform an index scan based on the desired order and  
> insert rows into a new table, which will have the effect of  
> ordering the rows on data pages according to the insert order.
>
> If availability is an issue, you might have to partition the data  
> so that while the old table is being reorganized into the new  
> table, you keep track of insert and delete activity so you can  
> apply the changes to the new table before putting the new table  
> into service.
>
> Craig
>
> On Feb 6, 2007, at 10:08 AM, Mamta Satoor wrote:
>
>> To answer your question on compound index. It just means to define  
>> an index which includes more than one column. eg
>>
>> CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT)
>> CREATE INDEX i1_2_4 ON t1(c1, c4, c2)
>>
>> HTH,
>> Mamta
>>
>>
>> On 2/6/07, Nurullah Akkaya <nurullah_akkaya@yahoo.com> wrote:
>>
>> On Feb 6, 2007, at 11:25 AM, Michael Segel wrote:
>>
>> > Sorry to top post, on my crackberry...
>> >
>> > I think you missed my point.
>> > Select the count of your documents that use the word 'the'.
>> >
>> > Ok so let's say that you want to search for all of the documents
>> > that use the word 'the'.
>> > You first lookup the integer representation of the word. Let's say
>> > that its = 100.
>> >
>> > How many times is the value 100 going to be in your index?
>> that varies with the document set with 2 million documents i have
>> around 2.5 million 'the' entries.
>>
>> >
>> > Ok?
>> >
>> > But to your other point... You see that your data is not
>> > contiguous. Hmmm ok,so assuming that your primary index is wordID,
>> > how do you handle documents that have multiple words? So if you
>> > search on 'the' you'll get one set of data and if you then search
>> > on the wordID for 'is', you'll have data that isn't in sort order
>> > on the disk.
>> assume the following ids.
>> the -> 100
>> is -> 150
>> 101 -> linux
>>
>> i want my tables to be sorted like the following. not just the word
>> the but all id's are sorted
>> 100
>> 100
>> 100
>> 100
>> 100
>> 100
>> 101
>> 101
>> 101
>> 101
>> 150
>> 150
>> 150
>>
>> from my knowledge of databases they sorted in random order thus we
>> have indexes pointing where the data is. from the upper example i am
>> going to read one big chunk of data from the disk but in the bottom
>> example i will read 100 then jump a buch of records and read next.
>>
>> 100
>> 101
>> 150
>> 101
>> 101
>> 100
>> 150
>>
>> where can i learn more about the compound index. create index
>> statement in ref manual doesn't mention it?
>>
>> >
>> > Now here's something that may help,
>> > Drop all of your indexes and create a single compound index where
>> > the first field is wordID.
>> >
>> > That may help you out...
>> >
>> >
>> > Sent via BlackBerry.
>> >
>> > -Mike Segel
>> > Principal
>> > MSCC
>> > 312 952 8175
>> >
>> >
>> > -----Original Message-----
>> > From: Nurullah Akkaya < nurullah_akkaya@yahoo.com>
>> > Date: Tue, 6 Feb 2007 11:14:02
>> > To:Derby Discussion <derby-user@db.apache.org>
>> > Subject: Re: keeping the table ordered
>> >
>> > It is not quite clear to me what you are trying to achieve.  Why do
>> > you want a sequential read?  Scanning the entire table of 100
>> > million records should take longer time than looking up a record
>> > using a index on wordid.  Have you retrieved the query plan and
>> > made sure the index on wordid is used?  Or are you talking about
>> > doing a lookup of many different wordids in sorted order?
>> >
>> >
>> >
>> > i did not meant sequential scanning of the whole table i meant disk
>> > i/o( bottom paragraph explains it )
>> > yes i checked the query plan and derby uses index to lookup records
>> > and index look up checks only two index pages. so i came to the
>> > conclusion that most of the time is lost making random i/o request
>> > for the data thats why i am trying to keep the table sorted. since
>> > sequential hard disk access is much faster than random i/o .
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Feb 6, 2007, at 8:09 AM, Michael Segel wrote:
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > What exactly are you trying to do?
>> > Based on the little snippet, it looks like this is an exercise to
>> > create a
>> > "google like" search on a series of documents.
>> >
>> >
>> > The problem is that your wordID, while an integer, is not going to
>> > be unique
>> > enough.
>> >
>> >
>> >
>> > wordId isn't unique at all each word in a document gets a
>> > corresponding posting entry i look up wordId for the word the then
>> > select all docId's containg the wordId. that posting list is
>> > basicly a big inverted list. what i am trying to do is keep the
>> > table sorted by wordId so insted of keeping values randomly on disk
>> > they are being written sequentialy to the file so that instead of
>> > doing random i/o i just do a sequential read from the hard drive. i
>> > don't want  sequential scanning of the whole table.
>> >
>> >
>> >
>> >
>> >
>> > For example, search your documents where the wordID is the integer
>> > look up for
>> > the word "the".
>> >
>> >
>> > Do you see the problem?
>> >
>> >
>> > --
>> > --
>> > Michael Segel
>> > Principal
>> > Michael Segel Consulting Corp.
>> > derby -=-@segel.com: <mailto:-@segel.com>
>> > (312) 952-8175 [mobile]
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>
> Craig Russell
> Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
> 408 276-5638 mailto:Craig.Russell@sun.com
> P.S. A good JDO? O, Gasp!
>


Mime
View raw message