lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Re: Fields, Index segments and docIds (second Try)
Date Thu, 01 May 2014 22:57:08 GMT
On 05/01/2014 10:28 AM, Shai Erera wrote:
> I'm glad it helped you. Good luck with the implementation.

Thanks. I have started by reading the Lucene internals, to understand
when, where and why docIds change or need to be changed (during merges
and document deletions).
I have always wanted to understand this, and I think that understanding
may help me somehow.
>
> One thing I didn't mention (though it's in the jdocs) -- it's not enough to
> have the documents of each index aligned, you also have to have the
> segments aligned. That is, if both indexes have documents 0-5 aligned, but
> one index contains a single segment and the other one 2 segments, that's
> not going to work.

That's good to know.
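
To make the constraint concrete, here is a small plain-Java model. It does not use Lucene, and the class and method names are made up for illustration. Each index is described only by the document counts of its segments, in order, and two indexes are treated as aligned when those lists are identical.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Lucene API: an index is just the ordered list of its segment sizes.
final class SegmentAlignment {
    // Two indexes can be read in parallel only if they have the same number of
    // segments and each pair of segments holds the same number of documents.
    static boolean aligned(List<Integer> a, List<Integer> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<Integer> common = Arrays.asList(6);     // docs 0-5 in one segment
        List<Integer> german = Arrays.asList(3, 3);  // docs 0-5 split over two segments
        System.out.println(aligned(common, german));            // false
        System.out.println(aligned(common, Arrays.asList(6)));  // true
    }
}
```

Both indexes hold documents 0-5, yet the first pair is rejected because the segment boundaries differ, which is the situation described in the quoted paragraph.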

> It is possible to do w/ some care -- when you build the German index,
> disable merges (use NoMergePolicy) and flush whenever you indexed enough
> documents to match an existing segment on e.g. the Common index.
>
> Or, if rebuilding all indexes won't take long, you can always rebuild all
> of them.
Yes, that's what I usually do (it takes less than 1 minute).
Yet I usually run a forceMerge too, so that I end up with only 1 segment :/
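
A minimal sketch of the rebuild-everything approach, written as a plain-Java model rather than against the Lucene API (the class, method and constant names are hypothetical). Each "index" is a list whose position plays the role of the docId inside a single merged segment, and every source document gets exactly one entry in every list, with an empty placeholder when there is no translation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Lucene API: list position stands in for the docId.
final class ParallelRebuild {
    // Marker for a document that exists only to keep positions aligned.
    static final String EMPTY = "";

    // Emits exactly one entry per source document, in source order, using the
    // EMPTY placeholder when a document has no translation.
    static List<String> buildColumn(List<String> docKeys, Map<String, String> translations) {
        List<String> column = new ArrayList<>();
        for (String key : docKeys) {
            String text = translations.get(key);
            column.add(text == null ? EMPTY : text);
        }
        return column;
    }

    public static void main(String[] args) {
        List<String> docKeys = Arrays.asList("FirstDoc", "SecondDoc", "ThirdDoc");
        Map<String, String> german = new HashMap<>();
        german.put("FirstDoc", "DE_words");
        german.put("ThirdDoc", "DE_words");  // SecondDoc is not translated

        List<String> column = buildColumn(docKeys, german);
        // Position i in every column refers to docKeys.get(i).
        for (int i = 0; i < docKeys.size(); i++) {
            System.out.println(i + " " + docKeys.get(i) + " -> [" + column.get(i) + "]");
        }
    }
}
```

In a real build the placeholder would presumably be an empty Lucene Document passed to IndexWriter.addDocument, with forceMerge(1) called on each index afterwards so that every index ends up with one segment of the same size.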

> Shai
>
>
> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <olivier.binda@wanadoo.fr> wrote:
>
>> On 04/30/2014 10:48 AM, Shai Erera wrote:
>>
>>> I hope I got all the details right, if I didn't then please clarify. Also,
>>> I haven't read the entire thread, so if someone already suggested this ...
>>> well, it probably means it's the right solution :)
>>>
>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>> already handles multiple IndexReaders that are aligned by their internal
>>> document IDs. The way it would work, as far as I understand your scenario
>>> is something like the following table (columns denote different indexes).
>>> Each index contains a subset of relevant fields, where common contains the
>>> common fields, and each language index contains the respective language
>>> fields.
>>>
>>> DocID        LuceneID  Common  English       German        ....
>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>                                  EN_sentences  DE_sentences
>>> "SecondDoc"  1         A,B,C
>>> "ThirdDoc"   2         A,B,C
>>>
>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>> not all documents have a value for the 'B' field in the 'common' index).
>>> What's absolutely essential here though is that the indexes are
>>> created very carefully, and if e.g. SecondDoc is not translated into
>>> German, *you must still have an empty document* in the German index, or
>>> otherwise, document IDs will not align.
>>>
>> That's exactly how I saw it and what I need to do. So, I'll have a very
>> good look at ParallelCompositeReader.
>>
>>
>>> Lucene does not offer a way to build those indexes though (patches
>>> welcome!!).
>>>
>> This answers my question 1. Thanks.  :)
>> I somehow hoped that there was already support for that kind of situation
>> in Lucene, but well,
>> now at least I know that I won't find a ready-made solution to my
>> problem in the Lucene classes and that I will have to code one myself,
>> taking inspiration from the Lucene classes that do similar processing.
>>
>>> We started some effort a very long time ago on LUCENE-1879
>>> (there's a patch and a discussion for an alternative approach) as well as
>>> there is a very useful suggestion in ParallelCompositeReader's jdocs (use
>>> LogDocMergePolicy).
>>>
>> Wow, priceless. This gives me some headstart and inspiration. :)
>>
>>
>>> One challenge is how to support multi-threaded indexing, but perhaps this
>>> isn't a problem in your application? It sounds like, by you writing that a
>>> user will "download the german index", that the indexes are built offline?
>>>
>> Indeed. The index is built offline, in a single thread, and once it is
>> built, it is read only.
>> Can't find an easier situation. :)
>>
>>
>>> Another challenge is how to control segment merging, so that the *exact
>>> same segments* are merged over the parallel indexes. Again, if your
>>> application builds the indexes offline, then this should be easier to
>>> accomplish.
>>>
>>> I assume though that when you index e.g. the German documents, then the
>>> already indexed 'common' fields do not change for a document. If they do,
>>> you will need to rebuild the 'common' index too.
>>>
>>> Once you achieve a correct parallel index, it is very easy to open a
>>> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
>>> Common+German, or Common+English+German and search it, since the internal
>>> document IDs are perfectly aligned.
>>>
>>> Shai
>>>
>> Many thanks for the awesome answer and the help (I love you).
>> As I really really really need this to happen, I'm going to start working
>> on this really soon.
>>
>> I'm definitely not an expert on threads, filesystems and Lucene's inner
>> workings, so I can't promise to contribute a miraculous patch though.
>> Especially since I won't work on the multi-thread aspect of the problem.
>> But I'll do the best I can and contribute back whatever code I can produce.
>>
>> Many thanks, again. :)
>>
>>>
>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>>> jose.carlos.canova@gmail.com> wrote:
>>>
>>>> My suggestion is not to worry about the docId. In practice it is an
>>>> "internal lucene" id, quite similar to a rowId in a database. Each index
>>>> may generate a different docId for a translated document (that is its
>>>> own business), so you can use your own ID to relate one document to
>>>> another across different indexes. As you mention, these are translated
>>>> documents, which in theory can be ranked differently from language to
>>>> language (a set of documents in different languages is not obliged to
>>>> have the same rank order, but I am not 100% sure about this).
>>>>
>>>> The second reason is that they may change the internal structure of
>>>> Lucene without any guarantee, and then you lose forward compatibility.
>>>>
>>>> I am not a Lucene expert like Schindler, but from reading the
>>>> documentation I understood that they pay special attention to the
>>>> "internal lucene" and "experimental lucene" markers: internal means
>>>> "compatibility not guaranteed", and experimental means "may be removed".
>>>>
>>>> For example, suppose they (apache-lucene) discover a more efficient way
>>>> to relate documents and change some mechanism. If your application uses
>>>> an internal mechanism that is tightly coupled to Lucene version xxx
>>>> (marked as "internal-lucene"), you can get stuck on a specific version
>>>> and later have to rewrite some code. This might cause some "management
>>>> conflict" if your project follows continuous integration and you are
>>>> subordinate to a management structure (bad for you).
>>>>
>>>> I saw this on several projects that use Lucene: they do not upgrade
>>>> their Lucene components in their new releases. One example, if I am not
>>>> wrong, still uses Lucene 3, and there are others that I saw around (e.g.
>>>> Luke), which means that "the project was abandoned because the way it
>>>> integrated with Lucene was not fully functional".
>>>>
>>>> Another interesting thing is that developing around Lucene is more
>>>> effective: you guarantee that your product will work and they guarantee
>>>> that Lucene works too. This is related to design by contract.
>>>>
>>>> Regards.
>>>>
>>>>
>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.binda@wanadoo.fr>
>>>> wrote:
>>>>>
>>>>> Hello.
>>>>>
>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>> disrespect, but after thinking it through today,
>>>>> I need to and would really love to have the answer to the following
>>>>> question :
>>>>>
>>>>> 1) At Lucene indexing time, is it possible to rewrite a read-only index so
>>>>> that some fields are only found in some segments (and how?)
>>>>>
>>>>>
>>>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>>>> needs, and it probably answers my second question, better formulated as
>>>>> "Is it possible to restrict an index to some of its segments?", as a
>>>>> CompositeReader with AtomicReaders (or a custom Directory) that reads the
>>>>> aforementioned segments might do the trick.
>>>>>
>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't solve
>>>>> my needs, as I have around 300000 documents of the following kind:
>>>>>
>>>>> READ ONLY Document :
>>>>> // common fields shipped with the App that aren't language related
>>>>> A:
>>>>> B:
>>>>> C:
>>>>> // fields shipped with the English package (a zip)
>>>>> EN:
>>>>> EN_Words:
>>>>> EN_Sentences:
>>>>> some DocValues
>>>>> // fields shipped with the German package (a zip)
>>>>> DE:
>>>>> DE_Words:
>>>>> DE_Sentences:
>>>>> some DocValues
>>>>> ...
>>>>> There might be hundreds of language packages that my users might use.
>>>>>
>>>>>
>>>>> If I use different indexes
>>>>> indexA for the common stuff,
>>>>> indexEN for the English package,
>>>>> indexDE for the german package,
>>>>>
>>>>> For sure, I will be able to make a big index out of those by using a
>>>>> MultiReader,
>>>>> BUT it really makes a union of the three indexes (right?), which means
>>>>> I'll have 900000 documents,
>>>>> and the documents in indexA won't have any relation to the documents
>>>>> in indexEN (right?) unless I give each document an id in each index and
>>>>> make a join at query time, which is a big no-no because I use a
>>>>> queryParser
>>>>> and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)"
>>>>>
>>>>> Or am I mistaken, and is there a way to create a document in three
>>>>> different indexes so that its parts stay related through the same docId?
>>>>>
>>>>>
>>>>> My solution if question 1 is possible :
>>>>>
>>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>>> Documents are stored in
>>>>>
>>>>> SEGMENT 1
>>>>> // common fields shipped with the App that aren't language related
>>>>> A:
>>>>> B:
>>>>> C:
>>>>>
>>>>> SEGMENT 2
>>>>> // fields shipped with the English package (a zip)
>>>>> EN:
>>>>> EN_Words:
>>>>> EN_Sentences:
>>>>> some DocValues
>>>>>
>>>>> SEGMENT 3
>>>>> // fields shipped with the German package (a zip)
>>>>> DE:
>>>>> DE_Words:
>>>>> DE_Sentences:
>>>>> some DocValues
>>>>>
>>>>>
>>>>> I only need to ship SEGMENT 1 in the App and let users download
>>>>> SEGMENT 2 or SEGMENT 3, depending on whether they want English or German,
>>>>> and use a composite reader with atomic readers (right?) to use my
>>>>> frankenstein index at query time with a queryparser.
>>>>>
>>>>>
>>>>> Also, in case question 1 is possible, I would really like to know
>>>>> whether it is possible to remap docIds at build time in a read-only index.
>>>>> An application of this would be:
>>>>>
>>>>> On day 1, I ship my app with 2 language packages: English and German
>>>>> (documents are uniquely identified by a docId... or by an external id,
>>>>> thanks to a docId <-> external id map).
>>>>>
>>>>> On day 2, I ship an additional language package (French), because I'm
>>>>> able to build an index with English, German and French that has exactly
>>>>> the same docIds for each document as the index shipped on day 1.
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>

