db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Segel" <mse...@mycingular.blackberry.net>
Subject Re: keeping the table ordered
Date Tue, 06 Feb 2007 17:25:31 GMT
Sorry to top post, on my crackberry...

I think you missed my point.
Select the count of your documents that use the word 'the'.

Ok so let's say that you want to search for all of the documents that use the word 'the'.
You first lookup the integer representation of the word. Let's say that its = 100.

How many times is the value 100 going to be in your index?

Ok?

But to your other point... You see that your data is not contiguous. Hmmm ok,so assuming that
your primary index is wordID, how do you handle documents that have multiple words? So if
you search on 'the' you'll get one set of data and if you then search on the wordID for 'is',
you'll have data that isn't in sort order on the disk.

Now here's something that may help,
Drop all of your indexes and create a single compound index where the first field is wordID.

That may help you out...


Sent via BlackBerry.

-Mike Segel
Principal
MSCC
312 952 8175


-----Original Message-----
From: Nurullah Akkaya <nurullah_akkaya@yahoo.com>
Date: Tue, 6 Feb 2007 11:14:02 
To:Derby Discussion <derby-user@db.apache.org>
Subject: Re: keeping the table ordered

It is not quite clear to me what you are trying to achieve.  Why do you want a sequential
read?  Scanning the entire table of 100 million records should take longer time than looking
up a record using a index on wordid.  Have you retrieved the query plan and made sure the
index on wordid is used?  Or are you talking about doing a lookup of many different wordids
in sorted order?



i did not meant sequential scanning of the whole table i meant disk i/o( bottom paragraph
explains it )
yes i checked the query plan and derby uses index to lookup records and index look up checks
only two index pages. so i came to the conclusion that most of the time is lost making random
i/o request for the data thats why i am trying to keep the table sorted. since sequential
hard disk access is much faster than random i/o .






On Feb 6, 2007, at 8:09 AM, Michael Segel wrote:








What exactly are you trying to do?
Based on the little snippet, it looks like this is an exercise to create a
"google like" search on a series of documents.


The problem is that your wordID, while an integer, is not going to be unique
enough.



wordId isn't unique at all each word in a document gets a corresponding posting entry i look
up wordId for the word the then select all docId's containg the wordId. that posting list
is basicly a big inverted list. what i am trying to do is keep the table sorted by wordId
so insted of keeping values randomly on disk they are being written sequentialy to the file
so that instead of doing random i/o i just do a sequential read from the hard drive. i don't
want  sequential scanning of the whole table.





For example, search your documents where the wordID is the integer look up for
the word "the".


Do you see the problem?


-- 
--
Michael Segel
Principal
Michael Segel Consulting Corp.
derby -=-@segel.com: <mailto:-@segel.com> 
(312) 952-8175 [mobile]




                    




 
Mime
View raw message