lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lokendra Singh <lsingh....@gmail.com>
Subject Re: Converting an existing index format to Lucene Index
Date Fri, 25 Feb 2011 09:55:36 GMT
Eddie and Aditya: Thanks a lot for your suggestions!!.

The collect-and-commit approach looks good.

@Eddie: My old index size is of order of  ~Milion pairs. Though, I haven't
run any such tests, I will be running them quickly for analysis.

Regards
Lokendra


On Fri, Feb 25, 2011 at 2:23 PM, Edward Drapkin <edwardd@wolfram.com> wrote:

> On 2/25/2011 12:26 AM, Lokendra Singh wrote:
>
>> Hi all,
>>
>> I am seeking for some guidelines to directly convert an already existing
>> index to Lucene index.
>> The index available to me is of a set of <value1,value2> pairs. Where each
>> pair is :
>> < word ,  fileName >
>> i.e a word as a 'value1', and the 'value2' being the fileName containing
>> that word.
>>
>> A word might appear in several fileNames as well a same file can contain
>> multiple copies of a word. For eg, following index is possible:
>> < "my"  , "file1" >
>> < "you" , "file2" >
>> < "my",  "file2" >
>> < "my", "file1">
>>
>> My actual problem is that the index available to me is very large in size,
>> hence I am bit reluctant to create 'Document' object for each file because
>> for that I will have to read through all the pairs first and store them in
>> memory. Or I will have to 'update' the 'Document' object of a particular
>> file while iterating through the Pairs of my index, this 'update', again, is
>> a costly operation.
>>
>> Please correct me if my understanding of Lucene is wrong or other
>> alternative ways.
>>
>> Regards
>> Lokendra
>>
>
>
> Er, sorry for the blank email, hit the wrong button!
>
> There are basically two ways to do this:
>
> 1) Buffer everything in RAM and then write all at once - this is probably
> the quickest way to do it, but the most resource intensive and prone to
> failure (OOM will lose all work, for example).
> 2) Iterate through the list, collecting some number of values and then
> periodically committing them to the index.
>
> There's not really any other way: you either write it out in chunks or you
> write it out all at once.  However, there is some leeway in how you iterate
> through your old index.  Iterating through the entire index and buffering
> everything in RAM and writing it all out at once is, like you said, probably
> prohibitively resource intensive.  You could, on the other hand, iterate
> through the index and only collect values for a particular file, then commit
> that, then iterate again.  I would imagine this is a much slower approach,
> but it will be less memory intensive.
>
> Personally, the way I'd approach this problem... I'd iterate through the
> old index in one pass.  Every time I encountered a new file, I'd create a
> new Document and store it somewhere (something trivial like Map<String,
> Document> where the key is the filename).   I'd also ensure that the
> Documents have a field called "file" so that I could easily query them
> later.  Every iteration, I'd continue to add to the Documents and every n
> iterations I'd commit all the Documents to the index (ostensibly calling
> IndexWriter.updateDocument).  By tuning the number of iterations that
> triggers an index write optimization, you can adjust the balance between RAM
> usage and CPU/IO time spent.  n=1 would obviously be the most CPU/IO
> intensive and n=inf would be the most RAM intensive and the "sweet spot" for
> your requirements is very probably somewhere between those two points.
>
> How big is this old index, by the way?   Have you run tests to ensure that
> the memory limit or cpu cost in either method is actually a problem?  I
> think you may be surprised at the speeds you get, if you haven't run tests
> already.
>
> Thanks,
> Eddie
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message