lucene-java-user mailing list archives

From "Michael D. Curtin" <m...@curtin.com>
Subject Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)
Date Wed, 02 Nov 2005 17:00:35 GMT
Richard Jones wrote:

>>If you're willing to continue subsetting / summarizing the data out into
>>Lucene, how about subsetting it out into a dedicated MySQL instance for
>>this purpose?  100 artists * 1M profiles * 2 ints * 4 bytes/int =
>>roughly 1 GB of data, which would easily fit into RAM.  Queries should
>>be pretty fast off of that.  Good luck!
> 
> We used to do this, but there is a lot of overhead involved in 
> updating/deleting/inserting all those rows and db indexes: more wasted cycles 
> and disk activity than we see with Lucene. Even ignoring the fancy ACID stuff 
> with MyISAM (no ref. integrity), it's still slower.
> 
> Furthermore, with Lucene I can query "artists:1" and it returns what Lucene 
> deems to be the "best" matches for artist 1 (Radiohead). This is far easier 
> than with an SQL database, because the person whose listen counter for 
> Radiohead is highest isn't necessarily the "biggest fan"; it depends on the 
> size of the profile.  This gets even more complicated when trying to find the 
> "best" fan of a combination of a few artists. Lucene is more useful for this 
> than a database query.

Ah, so the fact that "1" actually appears many times in the string you 
give Lucene is important.  Neat application!

Sounds like the custom Analyzer (really a custom TokenStream) approach 
suggested by others may be the way for you to go.  If the information 
you get from the MySQL profiles is an artistid and a count, you could code 
up an Analyzer to emit N "1" tokens in a row, rather than concatenating N 
"1"s together into a single string and then letting QueryParser, et al., 
parse them back apart again.  Even if you get all N of the "1"s from 
MySQL, a custom Analyzer could emit them one at a time and skip the 
concatenate-then-parse cycle.
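
To make the idea concrete, here is a minimal sketch of just the token-emitting logic, written against plain java.util rather than Lucene's own classes.  The class name RepeatedTokenIterator and the artistid -> count map are hypothetical names of mine, not anything from Lucene or from your schema; in a real Analyzer the same counting would sit inside the TokenStream's per-token method.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical helper: emits each artist id as a token, repeated "count" times.
// This is NOT a Lucene class; it only illustrates the emit-N-tokens logic.
class RepeatedTokenIterator implements Iterator<String> {
    private final Iterator<Map.Entry<Integer, Integer>> entries;
    private String currentToken;
    private int remaining;

    RepeatedTokenIterator(Map<Integer, Integer> listenCounts) {
        this.entries = listenCounts.entrySet().iterator();
    }

    @Override
    public boolean hasNext() {
        // Move to the next artist once the current one is used up;
        // artists with a zero or negative count are skipped.
        while (remaining <= 0 && entries.hasNext()) {
            Map.Entry<Integer, Integer> e = entries.next();
            currentToken = String.valueOf(e.getKey());
            remaining = e.getValue();
        }
        return remaining > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return currentToken;
    }

    public static void main(String[] args) {
        // Assumed input shape: artistid -> listen count for one profile.
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(1, 3);
        counts.put(7, 2);

        StringBuilder sb = new StringBuilder();
        Iterator<String> tokens = new RepeatedTokenIterator(counts);
        while (tokens.hasNext()) {
            sb.append(tokens.next()).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

No big string is ever built or re-parsed here: each token is produced on demand from the (artistid, count) pair, which is the saving described above.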

--MDC

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

