Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Message-ID: <28798.85230.qm@web111813.mail.gq1.yahoo.com>
Date: Tue, 19 May 2009 07:14:52 -0700 (PDT)
From: Alex Steward <alex_lucene@yahoo.com>
Subject: Re: lucene code changes
To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="0-1744116595-1242742492=:85230"

--0-1744116595-1242742492=:85230
Content-Type: multipart/alternative; boundary="0-370331009-1242742492=:85230"

--0-370331009-1242742492=:85230
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

I need to implement a custom inverted index in Lucene.
I have files like the ones attached here. Each file contains words and
the score assigned to each word. There will be hundreds of such files,
and each file will have at least 50,000 such name-value pairs.
Note: the attached files show only tens of name-value pairs, but my real
production data will have 50,000 or more pairs per file.

Currently I index the data using Lucene's inverted index. The query
executed against the index has 100 words, and the result is returned in
about 100 milliseconds.

Problem: once I have the results of the query, I have to go through each
file (for example, attached file one) and, for each word in the user's
query, look up its score to compute the total score. Doing this against
hundreds of files and hundreds of keywords makes the score computation
slow: about 3-5 seconds.

I need help resolving this so that score computation takes less than
about 200 milliseconds.
One resolution I was considering is modifying the Lucene source code
that creates the inverted index, so that the score is stored in the
index itself. When the results of the query are returned, we would get
the scores along with the file names, thereby eliminating the need to
search each file for the keyword and its score. I need to compute the
total of all scores that belong to a single file.
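
A minimal sketch of this idea, in plain Java rather than Lucene code. It
is not Lucene's index format or API; the names ScoredIndex, add, and
totals, and the sample words and scores, are made up for illustration.
Each posting keeps the score next to the file name, so the per-file total
can be summed straight from the postings of the query words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-rolled in-memory index, not Lucene's.
class ScoredIndex {

    // One posting: a file containing the word, plus that word's score.
    private static final class Posting {
        final String file;
        final double score;

        Posting(String file, double score) {
            this.file = file;
            this.score = score;
        }
    }

    // word -> postings for every file that lists that word
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Called once per (word, score) pair while reading a file.
    void add(String file, String word, double score) {
        postings.computeIfAbsent(word, k -> new ArrayList<>())
                .add(new Posting(file, score));
    }

    // Sums, per file, the stored scores of the query words it contains.
    Map<String, Double> totals(List<String> queryWords) {
        Map<String, Double> totalByFile = new HashMap<>();
        for (String word : queryWords) {
            List<Posting> list = postings.get(word);
            if (list == null) {
                continue; // word occurs in no file
            }
            for (Posting p : list) {
                totalByFile.merge(p.file, p.score, Double::sum);
            }
        }
        return totalByFile;
    }

    public static void main(String[] args) {
        ScoredIndex index = new ScoredIndex();
        // Made-up sample data standing in for the attached files.
        index.add("file1.txt", "apple", 2.0);
        index.add("file1.txt", "pear", 1.5);
        index.add("file2.txt", "apple", 0.5);

        Map<String, Double> totals =
                index.totals(Arrays.asList("apple", "pear", "plum"));
        System.out.println(totals); // per-file sums for the query words
    }
}
```

With this layout each query word costs one map lookup plus a walk over
its postings, and the files are not re-read at query time. A word that
appears twice in the query is counted twice in this sketch.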

I am also open to any other ideas you may have. Any suggestions would be
very helpful.

a.


--0-370331009-1242742492=:85230--

--0-1744116595-1242742492=:85230
Content-Type: text/plain; charset=us-ascii


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
--0-1744116595-1242742492=:85230--