lucene-java-user mailing list archives

From Edward Drapkin <edwa...@wolfram.com>
Subject Re: Converting an existing index format to Lucene Index
Date Fri, 25 Feb 2011 08:53:57 GMT
On 2/25/2011 12:26 AM, Lokendra Singh wrote:
> Hi all,
>
> I am seeking some guidelines for directly converting an already
> existing index into a Lucene index.
> The index available to me is a set of <value1, value2> pairs, where
> each pair is:
> <word, fileName>
> i.e. 'value1' is a word, and 'value2' is the fileName
> containing that word.
>
> A word might appear in several files, and the same file can contain
> multiple copies of a word. For example, the following index is possible:
> < "my"  , "file1" >
> < "you" , "file2" >
> < "my",  "file2" >
> < "my", "file1">
>
> My actual problem is that the index available to me is very large,
> so I am a bit reluctant to create a 'Document' object for each
> file: that would require reading through all the pairs first and
> storing them in memory. Alternatively, I would have to 'update' the
> 'Document' object of a particular file while iterating through the
> pairs of my index, and this 'update', again, is a costly operation.
>
> Please correct me if my understanding of Lucene is wrong, or suggest
> alternative approaches.
>
> Regards
> Lokendra


Er, sorry for the blank email, hit the wrong button!

There are basically two ways to do this:

1) Buffer everything in RAM and then write it all out at once - this is 
probably the quickest way to do it, but also the most resource-intensive 
and the most prone to failure (an OOM will lose all of your work, for 
example).
2) Iterate through the list, collecting some number of values and then 
periodically committing them to the index.

There's not really any other way: you either write it out in chunks or 
you write it out all at once.  However, there is some leeway in how you 
iterate through your old index.  Iterating through the entire index, 
buffering everything in RAM, and writing it all out at once is, as you 
said, probably prohibitively resource-intensive.  You could, on the 
other hand, iterate through the index collecting values for only one 
particular file, commit that file's Document, then iterate again for the 
next file.  I would imagine this is a much slower approach, but it is 
far less memory-intensive (see the sketch below).
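To make that concrete, here is a rough, untested sketch of the 
file-at-a-time variant against the Lucene 3.x API.  The Pair class and 
the Iterable<Pair> source are placeholders for however your existing 
index is actually read, and the field names are my own choices, not 
anything Lucene dictates:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Placeholder for one <word, fileName> entry from your existing index.
class Pair {
    final String word, fileName;
    Pair(String word, String fileName) {
        this.word = word;
        this.fileName = fileName;
    }
}

class FileAtATimeConverter {
    static void convert(Iterable<Pair> oldIndex, IndexWriter writer)
            throws IOException {
        // Pass 1: collect the distinct filenames.
        Set<String> files = new HashSet<String>();
        for (Pair pair : oldIndex) {
            files.add(pair.fileName);
        }
        // One further full scan of the old index per file -- slow, but
        // only one Document is ever held in memory at a time.
        for (String file : files) {
            Document doc = new Document();
            doc.add(new Field("file", file,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            for (Pair pair : oldIndex) {
                if (pair.fileName.equals(file)) {
                    doc.add(new Field("contents", pair.word,
                            Field.Store.NO, Field.Index.ANALYZED));
                }
            }
            writer.addDocument(doc);
            writer.commit();
        }
    }
}

The repeated full scans are exactly what makes this variant slow; its 
only virtue is the tiny memory footprint.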

Personally, the way I'd approach this problem is to iterate through the 
old index in one pass.  Every time I encountered a new file, I'd create 
a new Document and store it somewhere (something trivial like a 
Map<String, Document> where the key is the filename).  I'd also ensure 
that each Document has a field called "file" so that I could easily 
query for it later.  On every iteration I'd keep adding to the 
Documents, and every n iterations I'd commit all the Documents to the 
index (presumably by calling IndexWriter.updateDocument).  By tuning the 
number of iterations that triggers an index write, you can adjust the 
balance between RAM usage and CPU/IO time spent: n=1 would obviously be 
the most CPU/IO-intensive, n=infinity would be the most RAM-intensive, 
and the "sweet spot" for your requirements is very probably somewhere 
between those two extremes.
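In code, that could look roughly like the sketch below (again untested, 
reusing the hypothetical Pair type from the earlier sketch; the 
"contents" field name and the flush interval of 10,000 are arbitrary 
choices of mine):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class BatchingConverter {
    // Flush to the Lucene index every n pairs; tune n to balance RAM
    // usage against CPU/IO time.
    private static final int FLUSH_INTERVAL = 10000;

    static void convert(Iterable<Pair> oldIndex, IndexWriter writer)
            throws IOException {
        // One Document per distinct file, keyed by filename.  The map
        // lives for the whole pass so that each updateDocument call
        // replaces a file's entry with everything collected so far.
        Map<String, Document> docs = new HashMap<String, Document>();

        int i = 0;
        for (Pair pair : oldIndex) {
            Document doc = docs.get(pair.fileName);
            if (doc == null) {
                doc = new Document();
                doc.add(new Field("file", pair.fileName,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                docs.put(pair.fileName, doc);
            }
            // Duplicate words are simply added again, preserving
            // per-file term frequency in the new index.
            doc.add(new Field("contents", pair.word,
                    Field.Store.NO, Field.Index.ANALYZED));

            if (++i % FLUSH_INTERVAL == 0) {
                flush(writer, docs);
            }
        }
        flush(writer, docs);  // write out whatever is left
    }

    private static void flush(IndexWriter writer, Map<String, Document> docs)
            throws IOException {
        for (Map.Entry<String, Document> e : docs.entrySet()) {
            // updateDocument deletes any earlier document whose "file"
            // term matches, then adds the current version, so
            // re-flushing a Document is safe, just not free.
            writer.updateDocument(new Term("file", e.getKey()), e.getValue());
        }
        writer.commit();
    }
}

Note the map is never cleared: evicting a file's Document after a flush 
would only be safe if the old index happened to be sorted by filename, 
because a later pair for an evicted file would otherwise start a fresh 
Document that clobbers the flushed one.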

How big is this old index, by the way?  Have you run tests to confirm 
that the memory or CPU cost of either method is actually a problem?  If 
you haven't tested already, I think you may be surprised at the speeds 
you get.

Thanks,
Eddie

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

