lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Jones>
Subject Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)
Date Wed, 02 Nov 2005 15:41:10 GMT
> I can think of a few ways.  If elegance is your goal, then a little
> relational database theory might help.  Specifically, instead of having
> one record per listener, have one record per listener-artist
> combination, with three fields:  listenerid, artistid, and count.  Your
> example above would then look like
> listenerid  artistid  count
> ----------  --------  -----
>           X         1  00010
>           X         2  00005
>           X         3  00002
> You could compose queries to get all artists somebody every listened to
> (listenerid:X), all Radiohead listeners (artistid:1), anybody who
> listened to Coldplay 5 or more times (artistid:2 and count:[00005 to
> 99999]) or what-have-you.  This approach would require two-stage
> processing for queries of the form "find everybody who listened to
> Radiohead three times and Coldplay twice", though.
> Really, though, your problem sounds more like a relational db problem
> than a text search problem.  A simple MySQL database with a few tables
> might be a better fit ...

Hi, thanks for the feedback.
The data i'm dealing with is stored over a few mysql dbs on different 
machines, horizontally partitioned so each user is assigned to a single db. 
The queries i'm doing can be done in SQL in parallel over all machines then 
combined, which i've tested - it's unacceptably slow :(

There is rather a lot of data; each profile can contain in excess of 1000 
artists, and there are almost 1million profiles created. I'm only adding the 
top N artists for each user to a lucene index, which is fast and still yields 
good results (even with just the top 100 artists each).

I think this could be done reasonably well with SQL in a DBMS like Oracle RAC 
or something like mysql cluster.. sadly the former doesnt scale financially, 
and the latter requires everything to be held in RAM .. which is why i'm 
using lucene.


> --MDC
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message