lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sven <>
Subject constructing a mini-index with just the number of hits for a term
Date Thu, 13 Nov 2008 16:08:47 GMT
First - I apologize for the double post on my earlier email.  The first 
time I sent it I received an error message from 
saying that I should instead send email to so I 
thought it did not go through. 

My question is this - is there a way to use the Lucene/Solr 
infrastructure to create a mini-index that simply contains a lookup 
table of terms and the number of times they have appeared? 

I do not need to record which documents have them nor do I need to know 
where in the documents they appear.  There could be (and probably will 
be) more than 2^32 terms, however.  I'm not sure if that makes a 
difference to the Lucene backend, but thought it might be relevant. 

This question coincides with my earlier question about counting the 
times a given term is associated with another term.  I figure that this 
would be more easily accomplished by making the mini-index described 
above alongside the normal index when a document is indexed.  For 
example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot.  He was not afraid 
to die!  Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to 
write something on the backend to also index:

bold|camelot  ("from" being a stop word)
...and so on

I only need to keep a running total of each "bravely|bold" term, 
however, since the number of terms will be quite large and keeping track 
of the document/termpositions would translate to a lot of wasted HD space. 

If such a thing is not already in place, could someone let me know if 
there are some tutorials, documentation, or presentations that describe 
the inner workings of Lucene and the theories/implementation at work for 
the actual file formats, structures, data manipulations, etc?  (The 
javadocs don't go into this kind of detail.)  I'm sure I can sift 
through the code and eventually make sense of it, but if there is 
documentation out there, I'd prefer to peruse that first.  My thought 
being that I can simply generate my own kind of hash for each combined 
term and write it out to a custom file structure similar to Lucene - but 
the specifics of how to (optimally) do so are not plain to me yet. 

Thanks again!

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message