From derby-user-return-6113-apmail-db-derby-user-archive=db.apache.org@db.apache.org Tue Feb 06 18:40:54 2007 Return-Path: Delivered-To: apmail-db-derby-user-archive@www.apache.org Received: (qmail 91640 invoked from network); 6 Feb 2007 18:40:53 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 6 Feb 2007 18:40:53 -0000 Received: (qmail 93014 invoked by uid 500); 6 Feb 2007 18:40:54 -0000 Delivered-To: apmail-db-derby-user-archive@db.apache.org Received: (qmail 92982 invoked by uid 500); 6 Feb 2007 18:40:54 -0000 Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: List-Id: Reply-To: "Derby Discussion" Delivered-To: mailing list derby-user@db.apache.org Received: (qmail 92962 invoked by uid 99); 6 Feb 2007 18:40:54 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Feb 2007 10:40:54 -0800 X-ASF-Spam-Status: No, hits=2.0 required=10.0 tests=HTML_MESSAGE,UNPARSEABLE_RELAY X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [192.18.42.249] (HELO nwk-ea-fw-1.sun.com) (192.18.42.249) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Feb 2007 10:40:41 -0800 Received: from d1-sfbay-10.sun.com ([192.18.39.120]) by nwk-ea-fw-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id l16Ie7ii027462 for ; Tue, 6 Feb 2007 10:40:07 -0800 (PST) Received: from conversion-daemon.d1-sfbay-10.sun.com by d1-sfbay-10.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) id <0JD200M0114ES000@d1-sfbay-10.sun.com> (original mail from Craig.Russell@Sun.COM) for derby-user@db.apache.org; Tue, 06 Feb 2007 10:40:07 -0800 (PST) Received: from [129.145.133.119] by d1-sfbay-10.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPSA id <0JD200GGT16ULR20@d1-sfbay-10.sun.com> for derby-user@db.apache.org; Tue, 06 Feb 2007 10:40:06 -0800 (PST) Date: Tue, 06 Feb 2007 10:40:04 -0800 From: Craig L Russell Subject: Re: keeping the table ordered In-reply-to: Sender: Craig.Russell@Sun.COM To: Derby Discussion Message-id: <315BC3BD-715C-412C-9EEE-E5BAE74D2893@SUN.com> MIME-version: 1.0 X-Mailer: Apple Mail (2.752.2) Content-type: multipart/signed; protocol="application/pkcs7-signature"; boundary=Apple-Mail-33--793522743; micalg=sha1 References: <610430523-1170782996-cardhu_blackberry.rim.net-9455-@engine19-cell05> X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail-33--793522743 Content-Type: multipart/alternative; boundary=Apple-Mail-32--793523585 --Apple-Mail-32--793523585 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed It sounds like the issue is the structure of the rows on the data pages. IIUC, once a row is stored on a data page, it's going to be there for a long long time. So as you add new rows, they are added to data pages where there is free space. The only way to move a row is to delete it and insert it so it ends up in another place. It might be that for this purpose, you might have to restructure the database so that it is organized the way you want it. For example, perform an index scan based on the desired order and insert rows into a new table, which will have the effect of ordering the rows on data pages according to the insert order. If availability is an issue, you might have to partition the data so that while the old table is being reorganized into the new table, you keep track of insert and delete activity so you can apply the changes to the new table before putting the new table into service. Craig On Feb 6, 2007, at 10:08 AM, Mamta Satoor wrote: > To answer your question on compound index. It just means to define > an index which includes more than one column. eg > > CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT) > CREATE INDEX i1_2_4 ON t1(c1, c4, c2) > > HTH, > Mamta > > > On 2/6/07, Nurullah Akkaya wrote: > > On Feb 6, 2007, at 11:25 AM, Michael Segel wrote: > > > Sorry to top post, on my crackberry... > > > > I think you missed my point. > > Select the count of your documents that use the word 'the'. > > > > Ok so let's say that you want to search for all of the documents > > that use the word 'the'. > > You first lookup the integer representation of the word. Let's say > > that its = 100. > > > > How many times is the value 100 going to be in your index? > that varies with the document set with 2 million documents i have > around 2.5 million 'the' entries. > > > > > Ok? > > > > But to your other point... You see that your data is not > > contiguous. Hmmm ok,so assuming that your primary index is wordID, > > how do you handle documents that have multiple words? So if you > > search on 'the' you'll get one set of data and if you then search > > on the wordID for 'is', you'll have data that isn't in sort order > > on the disk. > assume the following ids. > the -> 100 > is -> 150 > 101 -> linux > > i want my tables to be sorted like the following. not just the word > the but all id's are sorted > 100 > 100 > 100 > 100 > 100 > 100 > 101 > 101 > 101 > 101 > 150 > 150 > 150 > > from my knowledge of databases they sorted in random order thus we > have indexes pointing where the data is. from the upper example i am > going to read one big chunk of data from the disk but in the bottom > example i will read 100 then jump a buch of records and read next. > > 100 > 101 > 150 > 101 > 101 > 100 > 150 > > where can i learn more about the compound index. create index > statement in ref manual doesn't mention it? > > > > > Now here's something that may help, > > Drop all of your indexes and create a single compound index where > > the first field is wordID. > > > > That may help you out... > > > > > > Sent via BlackBerry. > > > > -Mike Segel > > Principal > > MSCC > > 312 952 8175 > > > > > > -----Original Message----- > > From: Nurullah Akkaya < nurullah_akkaya@yahoo.com> > > Date: Tue, 6 Feb 2007 11:14:02 > > To:Derby Discussion > > Subject: Re: keeping the table ordered > > > > It is not quite clear to me what you are trying to achieve. Why do > > you want a sequential read? Scanning the entire table of 100 > > million records should take longer time than looking up a record > > using a index on wordid. Have you retrieved the query plan and > > made sure the index on wordid is used? Or are you talking about > > doing a lookup of many different wordids in sorted order? > > > > > > > > i did not meant sequential scanning of the whole table i meant disk > > i/o( bottom paragraph explains it ) > > yes i checked the query plan and derby uses index to lookup records > > and index look up checks only two index pages. so i came to the > > conclusion that most of the time is lost making random i/o request > > for the data thats why i am trying to keep the table sorted. since > > sequential hard disk access is much faster than random i/o . > > > > > > > > > > > > > > On Feb 6, 2007, at 8:09 AM, Michael Segel wrote: > > > > > > > > > > > > > > > > > > What exactly are you trying to do? > > Based on the little snippet, it looks like this is an exercise to > > create a > > "google like" search on a series of documents. > > > > > > The problem is that your wordID, while an integer, is not going to > > be unique > > enough. > > > > > > > > wordId isn't unique at all each word in a document gets a > > corresponding posting entry i look up wordId for the word the then > > select all docId's containg the wordId. that posting list is > > basicly a big inverted list. what i am trying to do is keep the > > table sorted by wordId so insted of keeping values randomly on disk > > they are being written sequentialy to the file so that instead of > > doing random i/o i just do a sequential read from the hard drive. i > > don't want sequential scanning of the whole table. > > > > > > > > > > > > For example, search your documents where the wordID is the integer > > look up for > > the word "the". > > > > > > Do you see the problem? > > > > > > -- > > -- > > Michael Segel > > Principal > > Michael Segel Consulting Corp. > > derby -=-@segel.com: > > (312) 952-8175 [mobile] > > > > > > > > > > > > > > > > > > > > > > Craig Russell Architect, Sun Java Enterprise System http://java.sun.com/products/jdo 408 276-5638 mailto:Craig.Russell@sun.com P.S. A good JDO? O, Gasp! --Apple-Mail-32--793523585 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=ISO-8859-1 It sounds like the issue is the = structure of the rows on the data pages. IIUC, once a row is stored on a = data page, it's going to be there for a long long time. So as you add = new rows, they are added to data pages where there is free space. The = only way to move a row is to delete it and insert it so it ends up in = another place.

It = might be that for this purpose, you might have to restructure the = database so that it is organized the way you want it. For example, = perform an index scan based on the desired order and insert rows into a = new table, which will have the effect of ordering the rows on data pages = according to the insert order.

If availability is an = issue, you might have to partition the data so that while the old table = is being reorganized into the new table, you keep track of insert and = delete activity so you can apply the changes to the new table before = putting the new table into service.

Craig

On Feb 6, 2007, at 10:08 AM, Mamta Satoor wrote:

To = answer your question on compound index. It just means to define an index = which includes more than one column. eg
=
=A0
CREATE TABLE t1(c1 INT, c2 INT, c3 = INT, c4 INT)
CREATE INDEX i1_2_4 ON t1(c1, c4, c2)
=
=A0
HTH,
=
Mamta

=A0
On 2/6/07, = Nurullah Akkaya <nurullah_akkaya@yahoo.com>= ; wrote:

On Feb = 6, 2007, at 11:25 AM, Michael Segel wrote:

> Sorry to top = post, on my crackberry...
>
> I think you missed my = point.
> Select the count of your documents that use the word = 'the'.
>
> Ok so let's say that you want to search for all = of the documents
> that use the word 'the'.
> You first = lookup the integer representation of the word. Let's say
> that = its =3D 100.
>
> How many times is the value 100 going to be = in your index?
that varies with the document set with 2 million = documents i have
around 2.5 million 'the' = entries.

>
> Ok?
>
> But to your other = point... You see that your data is not
> contiguous. Hmmm ok,so = assuming that your primary index is wordID,
> how do you handle = documents that have multiple words? So if you
> search on 'the' = you'll get one set of data and if you then search
> on the wordID = for 'is', you'll have data that isn't in sort order
> on the = disk.
assume the following ids.
the -> 100
is -> = 150
101 -> linux

i want my tables to be sorted like the = following. not just the word
the but all id's are sorted
100
= 100
100
100
100
100
101
101
101
101
150
150<= BR>150

from my knowledge of databases they sorted in random order = thus we
have indexes pointing where the data is. from the upper = example i am
going to read one big chunk of data from the disk but = in the bottom
example i will read 100 then jump a buch of records and = read = next.

100
101
150
101
101
100
150

where = can i learn more about the compound index. create index
statement in = ref manual doesn't mention it?

>
> Now here's something = that may help,
> Drop all of your indexes and create a single = compound index where
> the first field is wordID.
>
> = That may help you out...
>
>
> Sent via = BlackBerry.
>
> -Mike Segel
> Principal
> = MSCC
> 312 952 8175
>
>
> -----Original = Message-----
> From: Nurullah Akkaya < nurullah_akkaya@yahoo.com>= ;
> Date: Tue, 6 Feb 2007 11:14:02
> To:Derby Discussion = <derby-user@db.apache.org> =
> Subject: Re: keeping the table ordered
>
> It is = not quite clear to me what you are trying to achieve.=A0=A0Why = do
> you want a sequential read?=A0=A0Scanning the entire table of = 100
> million records should take longer time than looking up a = record
> using a index on wordid.=A0=A0Have you retrieved the = query plan and
> made sure the index on wordid is used?=A0=A0Or = are you talking about
> doing a lookup of many different wordids = in sorted order?
>
>
>
> i did not meant = sequential scanning of the whole table i meant disk
> i/o( bottom = paragraph explains it )
> yes i checked the query plan and derby = uses index to lookup records
> and index look up checks only two = index pages. so i came to the
> conclusion that most of the time = is lost making random i/o request
> for the data thats why i am = trying to keep the table sorted. since
> sequential hard disk = access is much faster than random i/o .
= >
>
>
>
>
>
> On Feb 6, 2007, at = 8:09 AM, Michael Segel = wrote:
>
>
>
>
>
>
>
>
= > What exactly are you trying to do?
> Based on the little = snippet, it looks like this is an exercise to
> create a
> = "google like" search on a series of documents.
>
>
> = The problem is that your wordID, while an integer, is not going = to
> be unique
> enough.
>
>
>
> = wordId isn't unique at all each word in a document gets a
> = corresponding posting entry i look up wordId for the word the = then
> select all docId's containg the wordId. that posting list = is
> basicly a big inverted list. what i am trying to do is keep = the
> table sorted by wordId so insted of keeping values randomly = on disk
> they are being written sequentialy to the file so that = instead of
> doing random i/o i just do a sequential read from = the hard drive. i
> don't want=A0=A0sequential scanning of the = whole table.
>
>
>
>
>
> For = example, search your documents where the wordID is the integer
> = look up for
> the word "the".
>
>
> Do you see = the problem?
>
>
> --
> --
> Michael = Segel
> Principal
> Michael Segel Consulting Corp.
> = derby -=3D-@segel.com: <mailto:-@segel.com>
> = (312) 952-8175 = [mobile]
>
>
>
>
>
>
>
> = >
>



=

Craig Russell

Architect, Sun Java = Enterprise System http://java.sun.com/products/jdo=

408 = 276-5638 mailto:Craig.Russell@sun.com

P.S. A = good JDO? O, Gasp!


= --Apple-Mail-32--793523585-- --Apple-Mail-33--793522743 Content-Transfer-Encoding: base64 Content-Type: application/pkcs7-signature; name=smime.p7s Content-Disposition: attachment; filename=smime.p7s MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIGUDCCAwkw ggJyoAMCAQICECpJVMO68ii+Xfsc1O1YYFIwDQYJKoZIhvcNAQEFBQAwYjELMAkGA1UEBhMCWkEx JTAjBgNVBAoTHFRoYXd0ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4xLDAqBgNVBAMTI1RoYXd0ZSBQ ZXJzb25hbCBGcmVlbWFpbCBJc3N1aW5nIENBMB4XDTA2MTIwOTE5NTEwNVoXDTA3MTIwOTE5NTEw NVowbDEQMA4GA1UEBBMHUnVzc2VsbDEUMBIGA1UEKhMLQ3JhaWcgTGFpcmQxHDAaBgNVBAMTE0Ny YWlnIExhaXJkIFJ1c3NlbGwxJDAiBgkqhkiG9w0BCQEWFUNyYWlnLlJ1c3NlbGxAU3VuLkNPTTCC ASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMNAB4Ih+ShsCt89HzqIdwEx8L4o1UHiY6V7 16mrCedfd4Y0/uI7z9Zam8ysgEh+F7aDnQEKmEsVFN35G4nPMfLU6dZYkvADwUjbq82t/dJ3FDDg Q945nHHpqECZff/S/UMho9AFfj6PZvZBAlDCJAayb4RdKIlfuvPW9YcQStQ1IfVJcVuKnC0Q+tdc a4A7zn7IzLOQohO1lTc3hXSBigEIGiGYn6Ny0wmexfA3X1WsXekFx5czd+M4GjDjswn8CNoBmnBr jOTGK1mOsXR6GSRHnly2s9xTdE4qv9qimM+7C2yzMHbKcszV7OQoLsRsZKDh+6u9wYU+TrjcY4ym bA8CAwEAAaMyMDAwIAYDVR0RBBkwF4EVQ3JhaWcuUnVzc2VsbEBTdW4uQ09NMAwGA1UdEwEB/wQC MAAwDQYJKoZIhvcNAQEFBQADgYEAU/EpPDztnb55Fz7iGSVm1mYEVj5m2OQKTYG26POUAomCBRrt /CdBBvqYmcHUTpra0qLELHAQadYFl2v11iQkqwF5PPJs19oU/zA0m5qFnOMTAiCvel7IprIwA2r6 eJR9siaPwDRgVJ/Sj71dD+utwf+nRrNy0/7PMNK5y+ocsYQwggM/MIICqKADAgECAgENMA0GCSqG SIb3DQEBBQUAMIHRMQswCQYDVQQGEwJaQTEVMBMGA1UECBMMV2VzdGVybiBDYXBlMRIwEAYDVQQH EwlDYXBlIFRvd24xGjAYBgNVBAoTEVRoYXd0ZSBDb25zdWx0aW5nMSgwJgYDVQQLEx9DZXJ0aWZp Y2F0aW9uIFNlcnZpY2VzIERpdmlzaW9uMSQwIgYDVQQDExtUaGF3dGUgUGVyc29uYWwgRnJlZW1h aWwgQ0ExKzApBgkqhkiG9w0BCQEWHHBlcnNvbmFsLWZyZWVtYWlsQHRoYXd0ZS5jb20wHhcNMDMw NzE3MDAwMDAwWhcNMTMwNzE2MjM1OTU5WjBiMQswCQYDVQQGEwJaQTElMCMGA1UEChMcVGhhd3Rl IENvbnN1bHRpbmcgKFB0eSkgTHRkLjEsMCoGA1UEAxMjVGhhd3RlIFBlcnNvbmFsIEZyZWVtYWls IElzc3VpbmcgQ0EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMSmPFVzVftOucqZWh5owHUE cJ3f6f+jHuy9zfVb8hp2vX8MOmHyv1HOAdTlUAow1wJjWiyJFXCO3cnwK4Vaqj9xVsuvPAsH5/Ef kTYkKhPPK9Xzgnc9A74r/rsYPge/QIACZNenprufZdHFKlSFD0gEf6e20TxhBEAeZBlyYLf7AgMB AAGjgZQwgZEwEgYDVR0TAQH/BAgwBgEB/wIBADBDBgNVHR8EPDA6MDigNqA0hjJodHRwOi8vY3Js LnRoYXd0ZS5jb20vVGhhd3RlUGVyc29uYWxGcmVlbWFpbENBLmNybDALBgNVHQ8EBAMCAQYwKQYD VR0RBCIwIKQeMBwxGjAYBgNVBAMTEVByaXZhdGVMYWJlbDItMTM4MA0GCSqGSIb3DQEBBQUAA4GB AEiM0VCD6gsuzA2jZqxnD3+vrL7CF6FDlpSdf0whuPg2H6otnzYvwPQcUCCTcDz9reFhYsPZOhl+ hLGZGwDFGguCdJ4lUJRix9sncVcljd2pnDmOjCBPZV+V2vf3h9bGCE6u9uo05RAaWzVNd+NWIXiC 3CEZNd4ksdMdRv9dX2VPMYIDEDCCAwwCAQEwdjBiMQswCQYDVQQGEwJaQTElMCMGA1UEChMcVGhh d3RlIENvbnN1bHRpbmcgKFB0eSkgTHRkLjEsMCoGA1UEAxMjVGhhd3RlIFBlcnNvbmFsIEZyZWVt YWlsIElzc3VpbmcgQ0ECECpJVMO68ii+Xfsc1O1YYFIwCQYFKw4DAhoFAKCCAW8wGAYJKoZIhvcN AQkDMQsGCSqGSIb3DQEHATAcBgkqhkiG9w0BCQUxDxcNMDcwMjA2MTg0MDA1WjAjBgkqhkiG9w0B CQQxFgQUZUDxT8cTjN1d36vJCrV20DHkIsUwgYUGCSsGAQQBgjcQBDF4MHYwYjELMAkGA1UEBhMC WkExJTAjBgNVBAoTHFRoYXd0ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4xLDAqBgNVBAMTI1RoYXd0 ZSBQZXJzb25hbCBGcmVlbWFpbCBJc3N1aW5nIENBAhAqSVTDuvIovl37HNTtWGBSMIGHBgsqhkiG 9w0BCRACCzF4oHYwYjELMAkGA1UEBhMCWkExJTAjBgNVBAoTHFRoYXd0ZSBDb25zdWx0aW5nIChQ dHkpIEx0ZC4xLDAqBgNVBAMTI1RoYXd0ZSBQZXJzb25hbCBGcmVlbWFpbCBJc3N1aW5nIENBAhAq SVTDuvIovl37HNTtWGBSMA0GCSqGSIb3DQEBAQUABIIBADE9zlCHmIFWzTXhV/hc5yzfsl1lyygs lLn3KhT554ZaipKBDYMf75V/OyQDr+gFqhFBXueAkrmnsL3YXW3HIxU+VJ5v50BE/VZJ/DDLL5Th B/Rl9v7OPCYPwZ6OX9KU+FbDx8mBPeU35Mv0Jl3jdxexIl7uF26Ng4lN1K/HA0i29PKQAs9wAQyp 0xB6D4OXS7Qfn3Y0QqGGnZJvo7PVP1QC9UUAxU8dYN2Nvsg3AXDiXGBvrECX3/tweDPymcK/tqD9 74qAYYXdoauNzvq0CCjUuTakRNR8jLuXR9+UgJgqPmAu3pb9tGPp6mVhFuN4iDfiPKarGr0DMMnC YsKv2N8AAAAAAAA= --Apple-Mail-33--793522743--