lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Re: Fields, Index segments and docIds (second Try)
Date Wed, 30 Apr 2014 21:00:50 GMT
On 04/30/2014 10:48 AM, Shai Erera wrote:
> I hope I got all the details right, if I didn't then please clarify. Also,
> I haven't read the entire thread, so if someone already suggested this ...
> well, it probably means it's the right solution :)
>
> It sounds like you could use Lucene's ParallelCompositeReader, which
> already handles multiple IndexReaders that are aligned by their internal
> document IDs. The way it would work, as far as I understand your scenario
> is something like the following table (columns denote different indexes).
> Each index contains a subset of relevant fields, where common contains the
> common fields, and each language index contains the respective language
> fields.
>
> DocID        LuceneID  Common  English       German        ....
> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>                                 EN_sentences  DE_sentences
> "SecondDoc"  1         A,B,C
> "ThirdDoc"   2         A,B,C
>
> Each index can contain all relevant fields, or only a subset (e.g. maybe
> not all documents have a value for the 'B' field in the 'common' index).
> What's absolutely very important here though is that the indexes are
> created very carefully, and if e.g. SecondDoc is not translated into
> German, *you must still have an empty document* in the German index, or
> otherwise, document IDs will not align.

That's exactly how I saw it and what I need to do. So I'll have a very
good look at ParallelCompositeReader.
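
A toy model of the alignment in the table above may help make it concrete. This is plain Java, not Lucene's API: the class ParallelViewSketch and its method parallelView are invented for illustration, each "index" is just a list of field maps, and the list position stands in for the internal document ID. It is only a sketch of the idea, under the assumption that every index holds exactly one entry per document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not Lucene code: an "index" is a list of field maps and
// the position in the list plays the role of the internal document ID.
final class ParallelViewSketch {

    // Combine the fields found at the same position in every index.
    // Refuses to work if the indexes do not have the same number of entries.
    static List<Map<String, String>> parallelView(List<List<Map<String, String>>> indexes) {
        int maxDoc = indexes.get(0).size();
        for (List<Map<String, String>> index : indexes) {
            if (index.size() != maxDoc) {
                throw new IllegalArgumentException(
                        "indexes are not aligned: " + index.size() + " != " + maxDoc);
            }
        }
        List<Map<String, String>> view = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            Map<String, String> merged = new LinkedHashMap<>();
            for (List<Map<String, String>> index : indexes) {
                merged.putAll(index.get(docId));
            }
            view.add(merged);
        }
        return view;
    }

    public static void main(String[] args) {
        List<Map<String, String>> common = List.of(
                Map.of("A", "first"), Map.of("A", "second"), Map.of("A", "third"));
        // The second document has no German translation, so it still gets
        // an empty placeholder entry to keep the positions aligned.
        List<Map<String, String>> german = List.of(
                Map.of("DE_words", "erste"),
                Map.<String, String>of(),
                Map.of("DE_words", "dritte"));
        System.out.println(parallelView(List.of(common, german)));
    }
}
```

Dropping the empty placeholder from the German list makes the sizes differ, and the sketch then refuses to combine the indexes. That is the misalignment the quoted message warns about.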

>
> Lucene does not offer a way to build those indexes though (patches
> welcome!!).

This answers my question 1. Thanks.  :)
I had hoped that Lucene already supported this kind of situation, but
now at least I know that I won't find a ready-made solution to my
problem in the Lucene classes and that I will have to code one myself,
taking inspiration from the Lucene classes that do similar processing.
> We've started some effort a very long time ago on LUCENE-1879
> (there's a patch and a discussion for an alternative approach) as well as
> there is a very useful suggestion in ParallelCompositeReader's jdocs (use
> LogDocMergePolicy).

Wow, priceless. This gives me a head start and some inspiration. :)

>
> One challenge is how to support multi-threaded indexing, but perhaps this
> isn't a problem in your application? It sounds like, by you writing that a
> user will "download the german index", that the indexes are built offline?
Indeed. The index is built offline, in a single thread, and once it is
built, it is read-only.
Can't find an easier situation. :)
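
For the offline, single-threaded case, the build side can be pictured with another toy model. Again this is plain Java rather than Lucene code: AlignedBuildSketch, build, and the package names "common" and "de" are made up for illustration. The only point it shows is that one pass over the documents, in a fixed order, adds exactly one entry to every index, with an empty entry where a translation is missing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not Lucene code: an "index" is a list of field maps and
// the position in the list stands in for the internal document ID.
final class AlignedBuildSketch {

    // docs: one entry per source document, in the order that fixes the IDs.
    // Each entry maps a package name to that package's fields for the
    // document; a missing package means "not translated".
    static Map<String, List<Map<String, String>>> build(
            List<Map<String, Map<String, String>>> docs, List<String> packages) {
        Map<String, List<Map<String, String>>> indexes = new LinkedHashMap<>();
        for (String pkg : packages) {
            indexes.put(pkg, new ArrayList<>());
        }
        // Single thread, single pass: every document adds exactly one entry
        // to every index, so entry N describes the same document everywhere.
        for (Map<String, Map<String, String>> doc : docs) {
            for (String pkg : packages) {
                Map<String, String> fields = doc.get(pkg);
                indexes.get(pkg).add(
                        fields != null ? fields : Collections.<String, String>emptyMap());
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> first = new LinkedHashMap<>();
        first.put("common", Map.of("A", "a1"));
        first.put("de", Map.of("DE_Words", "schlafen"));
        Map<String, Map<String, String>> second = new LinkedHashMap<>();
        second.put("common", Map.of("A", "a2")); // no German translation

        System.out.println(build(List.of(first, second), List.of("common", "de")));
    }
}
```

A real build would also have to keep segment boundaries identical across the indexes, which this sketch does not attempt to model.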


> Another challenge is how to control segment merging, so that the *exact
> same segments* are merged over the parallel indexes. Again, if your
> application builds the indexes offline, then this should be easier to
> accomplish.
>
> I assume though that when you index e.g. the German documents, then the
> already indexed 'common' fields do not change for a document. If they do,
> you will need to rebuild the 'common' index too.
>
> Once you achieve a correct parallel index, it is very easy to open a
> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
> Common+German, or Common+English+German and search it, since the internal
> document IDs are perfectly aligned.
>
> Shai

Many thanks for the awesome answer and the help (I love you).
As I really, really need this to happen, I'm going to start working on it
very soon.

I'm definitely not an expert on threads, filesystems, or Lucene's inner
workings, so I can't promise to contribute a miraculous patch,
especially since I won't work on the multi-threaded aspect of the problem.
But I'll do the best I can and contribute back whatever code I can produce.

Many thanks, again. :)
>
>
> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
> jose.carlos.canova@gmail.com> wrote:
>
>> My suggestion is not to worry about the docId. In practice it is an
>> "internal Lucene" ID, much like a rowId in a database. Each index may
>> generate a different docId for a translated document (that is the
>> index's business), so you can use your own ID to relate a document in one
>> index to its counterpart in another. This matters mainly because, as you
>> mention, these are translated documents, which in theory can be ranked
>> differently from language to language (a set of documents in different
>> languages is not obliged to share the same rank order, though I am not
>> 100% sure about this).
>>
>> The second reason is that they may change the internal structure of Lucene
>> without any guarantee, and then you lose forward compatibility.
>>
>> I am not a Lucene expert like Schindler, but from reading the
>> documentation I understood that they pay special attention to what is
>> marked "internal" and "experimental": internal means "compatibility is
>> not guaranteed", and experimental means "may be removed".
>>
>> For example, suppose they (Apache Lucene) discover a more efficient way to
>> relate documents and change some mechanism. If your application relies on
>> an internal mechanism that is tightly coupled to Lucene version xxx
>> (marked as "internal"), you can get stuck on that specific version and
>> later have to rewrite some code. That might cause some "management
>> conflict" if your project follows continuous integration and you answer
>> to a management structure (bad for you).
>>
>> I have seen this in several projects that use Lucene: they do not upgrade
>> their Lucene components in new releases. One example, if I am not wrong,
>> still uses Lucene 3, and there are others I have seen around (e.g. Luke),
>> which means that "the project was abandoned because the way it
>> integrated with Lucene was not fully functional".
>>
>> Another interesting point is that developing around Lucene is more
>> effective: you guarantee that your product works, and they guarantee
>> that Lucene works too. This is related to design by contract.
>>
>> Regards.
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.binda@wanadoo.fr>
>> wrote:
>>> Hello.
>>>
>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>> disrespect, but after thinking it through today, I need, and would
>>> really love, to have the answer to the following question:
>>>
>>> 1) At Lucene indexing time, is it possible to rewrite a read-only index
>>> so that some fields are only found in some segments (and how?)
>>>
>>>
>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>> needs, and that probably answers my second question, better formulated as
>>> "Is it possible to restrict an index to some of its segments?", since a
>>> CompositeReader with AtomicReaders (or a custom Directory) that reads the
>>> aforementioned segments might do the trick.
>>>
>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>> solve my needs, as I have around 300000 documents of the following kind:
>>>
>>> READ ONLY Document :
>>> // common fields shipped with the App that aren't language related
>>> A:
>>> B:
>>> C:
>>> // fields shipped with the English package (a zip)
>>> EN:
>>> EN_Words:
>>> EN_Sentences:
>>> some DocValues
>>> // fields shipped with the German package (a zip)
>>> DE:
>>> DE_Words:
>>> DE_Sentences:
>>> some DocValues
>>> ...
>>> There might be hundreds of language packages that my users might use
>>>
>>>
>>> If I use different indexes
>>> indexA for the common stuff,
>>> indexEN for the English package,
>>> indexDE for the German package,
>>>
>>> For sure, I will be able to make a big index out of those by using a
>>> MultiReader,
>>> BUT it really makes a union out of the three indexes (right?), which
>>> means I'll have 900000 documents,
>>> and the documents in indexA won't have any relation to the documents
>>> in indexEN (right?) unless I give each document an id in each index and
>>> make a join at query time, which is a big no-no because I use a
>>> queryParser and users may enter queries like
>>> "A:gah AND (DE:schlaffen OR EN:sleep)"
>>>
>>> Or am I mistaken, and there is a way to create a document in three
>>> different indexes that stays related through the same docId?
>>>
>>>
>>> My solution, if question 1 is possible:
>>>
>>> In contrast, if I am able to build my index so that my READ ONLY Documents
>>> are stored in
>>>
>>> SEGMENT 1
>>> // common fields shipped with the App that aren't language related
>>> A:
>>> B:
>>> C:
>>>
>>> SEGMENT 2
>>> // fields shipped with the English package (a zip)
>>> EN:
>>> EN_Words:
>>> EN_Sentences:
>>> some DocValues
>>>
>>> SEGMENT 3
>>> // fields shipped with the German package (a zip)
>>> DE:
>>> DE_Words:
>>> DE_Sentences:
>>> some DocValues
>>>
>>>
>>> I only need to ship SEGMENT 1 in the App and let users download SEGMENT 2
>>> or SEGMENT 3 depending on whether they want English or German,
>>> and use a composite reader with atomic readers (right?) to query my
>>> Frankenstein index with a queryParser.
>>>
>>>
>>> Also, in case question 1 is possible, I would really like to know whether
>>> it is possible to remap docIds at build time in a read-only index.
>>> An application of this would be:
>>>
>>> On day 1, I ship my app with 2 language packages: English and German
>>> (documents are uniquely identified by a docId... or by an external id,
>>> thanks to a docId <-> external id map).
>>>
>>> On day 2, I ship an additional language package (French), because I'm able
>>> to build an index with English, German and French with exactly the same
>>> docIds for each document as in the index shipped on day 1.
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
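
The 900000 versus 300000 concern in the quoted message comes down to how document ID spaces combine. The following is toy arithmetic in plain Java, not Lucene code, and the names DocIdSpaceSketch, concatenatedMaxDoc, concatenatedDocId and parallelMaxDoc are invented for illustration. It assumes that a concatenating reader appends each sub-index after the previous one, while a parallel reader shares a single ID space across indexes of equal size.

```java
// Toy arithmetic only, not Lucene code: it mirrors how document ID spaces
// combine under concatenation versus parallel composition.
final class DocIdSpaceSketch {

    // Concatenation: the ID spaces are appended, so the sizes add up.
    static int concatenatedMaxDoc(int... maxDocs) {
        int total = 0;
        for (int maxDoc : maxDocs) {
            total += maxDoc;
        }
        return total;
    }

    // Concatenation: a document's global ID is its local ID shifted by the
    // sizes of all sub-indexes that come before its own.
    static int concatenatedDocId(int readerIndex, int localDocId, int... maxDocs) {
        int base = 0;
        for (int i = 0; i < readerIndex; i++) {
            base += maxDocs[i];
        }
        return base + localDocId;
    }

    // Parallel composition: all indexes must have the same size, and the
    // combined view keeps that size because the IDs are shared.
    static int parallelMaxDoc(int... maxDocs) {
        for (int maxDoc : maxDocs) {
            if (maxDoc != maxDocs[0]) {
                throw new IllegalArgumentException("indexes are not aligned");
            }
        }
        return maxDocs[0];
    }

    public static void main(String[] args) {
        System.out.println(concatenatedMaxDoc(300000, 300000, 300000));
        System.out.println(concatenatedDocId(1, 7, 300000, 300000, 300000));
        System.out.println(parallelMaxDoc(300000, 300000, 300000));
    }
}
```

With three indexes of 300000 documents each, concatenation gives 900000 IDs and places document 7 of the second index at 300007, so the same logical document has three unrelated IDs. The parallel view keeps 300000 IDs, and ID 7 means the same document in every index.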



