From Olivier Binda <>
Subject Some feedback on parallel index building (Fields, Index segments and docIds)
Date Fri, 09 May 2014 21:24:34 GMT
Some feedback (might be useful for other users) :

I have experimented a bit and it seems that I have been able to build a 
parallel index for my use case
(9 different indexes, with docIds in sync, with only 1 segment each).

I had to configure the IndexWriterConfig of all my IndexWriters (with
setRAMBufferSizeMB() and setMaxBufferedDocs()) to first build everything
in RAM, adding one document (sometimes empty, so as to build a whole row)
to every index (the columns) for each row.
Then I called forceMerge(1, true) on each IndexWriter, followed by close().
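
The build order described above can be sketched as a toy model, in plain Java and without any Lucene classes. The names ColumnIndex and addRow are made up for this illustration; a list position stands in for the docId, and an empty map stands in for the empty document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: each "index" is a list of documents, and a document's
// position in that list plays the role of its docId. No Lucene classes are used.
final class ParallelBuildSketch {

    /** One column group (one index); a document is just a field -> value map here. */
    static final class ColumnIndex {
        final String name;
        final List<Map<String, String>> docs = new ArrayList<>();

        ColumnIndex(String name) {
            this.name = name;
        }
    }

    /**
     * Adds one logical row to every index. An index with no data for the row
     * still receives an empty document, so the row keeps the same position everywhere.
     */
    static void addRow(List<ColumnIndex> indexes, Map<String, Map<String, String>> fieldsByIndex) {
        for (ColumnIndex index : indexes) {
            Map<String, String> fields = fieldsByIndex.getOrDefault(index.name, Map.of());
            index.docs.add(fields);
        }
    }

    public static void main(String[] args) {
        List<ColumnIndex> indexes = List.of(new ColumnIndex("common"), new ColumnIndex("de"));
        addRow(indexes, Map.of("common", Map.of("A", "a0"), "de", Map.of("DE", "de0")));
        addRow(indexes, Map.of("common", Map.of("A", "a1")));   // no German data: placeholder
        addRow(indexes, Map.of("common", Map.of("A", "a2"), "de", Map.of("DE", "de2")));
        for (ColumnIndex index : indexes) {
            System.out.println(index.name + " has " + index.docs.size() + " documents");
        }
    }
}
```

In this model every index ends up with the same number of documents, and row N sits at position N in each of them, which is the property the real indexes need.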

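To test if it was ok, I had added a docValues field to each document:
docAddedOrd = 0L
for (index in indexes) {
    document = Document()
    // the ordinal records the order in which the row was added
    document.add(NumericDocValuesField("docAddedOrd", docAddedOrd))
    // ... then add the document to this index
}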

And then I checked if the docId was equal to the docValue
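
A toy version of that check, in plain Java with no Lucene classes: a long[] stands in for one index, the array position for the docId, and the stored value for the "docAddedOrd" written at build time. The class and method names are made up for this illustration.

```java
// Toy model only: position in the array = docId, value = ordinal stored at build time.
final class DocOrdCheckSketch {

    /** Returns the first docId whose stored ordinal differs from it, or -1 if all match. */
    static int firstMisaligned(long[] ordByDocId) {
        for (int docId = 0; docId < ordByDocId.length; docId++) {
            if (ordByDocId[docId] != docId) {
                return docId;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] aligned = {0, 1, 2, 3};
        long[] shuffled = {0, 1, 3, 2};   // two documents swapped places
        System.out.println(firstMisaligned(aligned));    // prints -1
        System.out.println(firstMisaligned(shuffled));   // prints 2
    }
}
```

Running the same check against every index tells which of them, if any, lost the insertion order.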

I had less success without the calls to setRAMBufferSizeMB() and 
setMaxBufferedDocs() :
I managed to build some small indexes with LogDocMergePolicy, but as 
soon as the index got too big,
the docIds went out of sync (merges probably happened and shuffled the 
docIds).

I tried to commit() -> it made it worse
LogByteSizeMergePolicy, NoMergePolicy -> didn't fix it

There. Now that I'm able to build a parallel index, I'll check if I can 
read it with a parallel reader.
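
Earlier in this thread Shai points out that aligned documents are not enough for reading: the segments must line up too. A toy check of that condition, in plain Java with no Lucene classes (the class and method names are made up, and each index is reduced to the list of its segment sizes):

```java
import java.util.Arrays;

// Toy model only: an index is described by the document count of each of its
// segments, in order. No Lucene classes are used.
final class SegmentLayoutSketch {

    /** True when both indexes have the same number of segments with the same sizes, in order. */
    static boolean sameLayout(int[] segmentSizesA, int[] segmentSizesB) {
        return Arrays.equals(segmentSizesA, segmentSizesB);
    }

    public static void main(String[] args) {
        int[] common = {6};      // documents 0-5 in a single segment
        int[] german = {4, 2};   // the same six documents split over two segments
        System.out.println(sameLayout(common, german));          // prints false
        System.out.println(sameLayout(common, new int[] {6}));   // prints true
    }
}
```

After forceMerge(1, true) on every index, each layout collapses to a single segment of the same size, which is the simplest way to satisfy this check.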

Best regards,

On 05/02/2014 02:42 PM, Shai Erera wrote:
> I don't think that you need to be concerned with the internal docIDs much.
> Just imagine the indexes as a big table with multiple columns, where columns
> are grouped together. Each group is a different index. If a document does
> not have a value in one column, then you have an empty cell. If a document
> doesn't have a value in an entire group of columns, then you denote that by
> adding an empty document.
> Oh, and make sure to use a LogMergePolicy, so segments are merged in the
> same order across all indexes.
> And given that you rebuild the indexes every time, you can create them
> one-by-one. You don't need to do that in parallel to all indexes, unless
> it's more convenient for you.
> Shai
> On Fri, May 2, 2014 at 9:28 AM, Olivier Binda wrote:
>> On 05/02/2014 06:05 AM, Shai Erera wrote:
>>> If you're always rebuilding, let alone forceMerge, you shouldn't have too
>>> much trouble implementing it. Just make sure that you add documents in the
>>> same order to all indexes.
>>> If you're always rebuilding, how come you have deletions? Anyway, you must
>>> also delete in all indexes.
>> Indeed, I don't have deletions and I'm mainly concerned with merges.
>> But I just want to understand the whole docId remapping process,
>> out of curiosity and also because obtaining a docId (and not losing it)
>> seems to be the key of parallel indexes
>>   On May 2, 2014 1:57 AM, "Olivier Binda" <> wrote:
>>>   On 05/01/2014 10:28 AM, Shai Erera wrote:
>>>>> I'm glad it helped you. Good luck with the implementation.
>>>> Thanks. First I started looking at the Lucene internal code, to understand
>>>> when, where and why docIds change or need to be changed (in merges and
>>>> document deletions).
>>>> I have always wanted to understand this and I think the understanding may
>>>> help me somehow.
>>>>> One thing I didn't mention (though it's in the jdocs) -- it's not enough
>>>>> to have the documents of each index aligned, you also have to have the
>>>>> segments aligned. That is, if both indexes have documents 0-5 aligned,
>>>>> but one index contains a single segment and the other one 2 segments,
>>>>> that's not going to work.
>>>> That's good to know.
>>>>> It is possible to do w/ some care -- when you build the German index,
>>>>> disable merges (use NoMergePolicy) and flush whenever you indexed enough
>>>>> documents to match an existing segment on e.g. the Common index.
>>>>> Or, if rebuilding all indexes won't take long, you can always rebuild
>>>>> all of them.
>>>> Yes. That's what I am usually doing (it takes less than 1 minute).
>>>> Yet, I usually do a forceMerge too, to only have 1 segment :/
>>>>> Shai
>>>>> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <
>>>>> wrote:
>>>>>    On 04/30/2014 10:48 AM, Shai Erera wrote:
>>>>>>> I hope I got all the details right, if I didn't then please clarify.
>>>>>>> Also, I haven't read the entire thread, so if someone already suggested
>>>>>>> ... well, it probably means it's the right solution :)
>>>>>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>>>>>> already handles multiple IndexReaders that are aligned by their
>>>>>>> internal document IDs. The way it would work, as far as I understand
>>>>>>> your scenario, is something like the following table (columns denote
>>>>>>> different indexes). Each index contains a subset of relevant fields,
>>>>>>> where 'common' contains the common fields, and each language index
>>>>>>> contains the respective language fields.
>>>>>>> DocID        LuceneID  Common  English       German        ....
>>>>>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>>>>>                                EN_sentences  DE_sentences
>>>>>>> "SecondDoc"  1         A,B,C
>>>>>>> "ThirdDoc"   2         A,B,C
>>>>>>> Each index can contain all relevant fields, or only a subset (maybe
>>>>>>> not all documents have a value for the 'B' field in the 'common'
>>>>>>> index).
>>>>>>> What's absolutely very important here though is that the indexes are
>>>>>>> created very carefully, and if e.g. SecondDoc is not translated to
>>>>>>> German, *you must still have an empty document* in the German index,
>>>>>>> or otherwise, document IDs will not align.
>>>>>> That's exactly how I saw it and what I need to do. So, I'll have a very
>>>>>> good look at ParallelCompositeReader.
>>>>>>> Lucene does not offer a way to build those indexes though (patches
>>>>>>> welcome!!).
>>>>>> This answers my question 1. Thanks.  :)
>>>>>> I somehow hoped that there was already support for that kind of
>>>>>> situation in Lucene but well, now at least I know that I won't find an
>>>>>> already made solution to my problem in the Lucene classes and that I
>>>>>> will have to code one myself, by taking inspiration from the Lucene
>>>>>> classes that do similar processing.
>>>>>>> We've started some effort very long time ago on LUCENE-1879 (there's a
>>>>>>> patch and a discussion for an alternative approach) as well as there is
>>>>>>> a very useful suggestion in ParallelCompositeReader's jdocs (use
>>>>>>> LogDocMergePolicy).
>>>>>> Wow, priceless. This gives me some headstart and inspiration.
>>>>>>> One challenge is how to support multi-threaded indexing, but perhaps
>>>>>>> this isn't a problem in your application? It sounds like, by you
>>>>>>> writing that a user will "download the german index", that the indexes
>>>>>>> are built offline?
>>>>>> Indeed. The index is built offline, in a single thread, and once it is
>>>>>> built, it is read only.
>>>>>> Can't find an easier situation. :)
>>>>>>> Another challenge is how to control segment merging, so that the *exact
>>>>>>> same segments* are merged over the parallel indexes. Again, if your
>>>>>>> application builds the indexes offline, then this should be easier to
>>>>>>> accomplish.
>>>>>>> I assume though that when you index e.g. the German documents, the
>>>>>>> already indexed 'common' fields do not change for a document. If they
>>>>>>> do, you will need to rebuild the 'common' index too.
>>>>>>> Once you achieve a correct parallel index, it is very easy to open a
>>>>>>> ParallelCompositeReader on any subset of the indexes, e.g.
>>>>>>> Common+English, Common+German, or Common+English+German and search it,
>>>>>>> since the internal document IDs are perfectly aligned.
>>>>>>> Shai
>>>>>> Many thanks for the awesome answer and the help (I love you).
>>>>>> As I really really really need this to happen, I'm going to start
>>>>>> working on this really soon.
>>>>>> I'm definitely not an expert on threads, filesystems and Lucene inner
>>>>>> workings, so I can't promise to contribute a miraculous patch though.
>>>>>> Especially since I won't work on the multi-thread aspect of the problem.
>>>>>> But I'll do the best I can and contribute back whatever code I can
>>>>>> produce.
>>>>>> Many thanks, again. :)
>>>>>>   On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <
>>>>>>>> wrote:
>>>>>>>> My suggestion is that you not worry about the docId; in practice it is
>>>>>>>> an "internal lucene" id, quite similar to a rowId in a database. Each
>>>>>>>> index may generate a different docId (it is their problem) for a
>>>>>>>> translated document. You may use your own ID that relates one document
>>>>>>>> to another in a different index, mainly because, like you mention, they
>>>>>>>> are translated documents that in theory can be ranked differently from
>>>>>>>> language to language (it is not an obligation that a set of documents
>>>>>>>> in different languages spans the same rank order, but I am not 100%
>>>>>>>> sure about this).
>>>>>>>> The second reason is that 'they may change the internal structure of
>>>>>>>> lucene without warrant', and then you lose forward compatibility.
>>>>>>>> I am not an expert on Lucene like Schindler, but reading the
>>>>>>>> documentation I understood that they pay special attention to
>>>>>>>> "internal lucene" and "experimental lucene", where internal means
>>>>>>>> "non warrant compatible" and experimental means "may be removed".
>>>>>>>> For example, if they (apache-lucene) discover a "new manner" to relate
>>>>>>>> each document that is more efficient and change some mechanism, and
>>>>>>>> your application uses an internal mechanism that is highly coupled with
>>>>>>>> lucene version xxx (marked as "internal-lucene"), you can get stuck on
>>>>>>>> a specific version and in the future have to rewrite some code, and
>>>>>>>> this might cause some "management conflict" if your project follows
>>>>>>>> continuous integration and you are subordinated on a management
>>>>>>>> structure to you).
>>>>>>>> I saw this on several projects that use Lucene: they do not upgrade
>>>>>>>> their lucene components in their new releases. One example, if I am
>>>>>>>> not wrong, still uses Lucene 3, and another that I saw around (e.g.
>>>>>>>> Luke) means that "The project was abandoned because the manner how
>>>>>>>> they with Lucene was not fully functional".
>>>>>>>> Another interesting thing is that developing around Lucene is more
>>>>>>>> effective: you guarantee that your product will work and they
>>>>>>>> guarantee that Lucene works too. This is related to design by contract.
>>>>>>>> Regards.
>>>>>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <
>>>>>>>>    wrote:
>>>>>>>>> Hello.
>>>>>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>>>>>> disrespect, but after thinking it through today,
>>>>>>>>> I need to and would really love to have the answer to the following
>>>>>>>>> question :
>>>>>>>>> 1) At lucene indexing time, is it possible to rewrite a read-only
>>>>>>>>> index so that some fields are only found in some segments (and how ?)
>>>>>>>>> Uwe Schindler suggested using different indexes and a MultiReader for
>>>>>>>>> my needs and it probably answers my second question, better as
>>>>>>>>> "Is it possible to restrict an index to some of its segments ?", as a
>>>>>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read
>>>>>>>>> the aforementioned segments might do the trick.
>>>>>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>>>>>>>> solve my needs as I have around 300000 documents of the following
>>>>>>>>> kind :
>>>>>>>>> READ ONLY Document :
>>>>>>>>> // common fields shipped with the App that aren't language
>>>>>>>>> A:
>>>>>>>>> B:
>>>>>>>>> C:
>>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>>> EN:
>>>>>>>>> EN_Words:
>>>>>>>>> EN_Sentences:
>>>>>>>>> some DocValues
>>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>>> DE:
>>>>>>>>> DE_Words:
>>>>>>>>> DE_Sentences:
>>>>>>>>> some DocValues
>>>>>>>>> ...
>>>>>>>>> There might be hundreds of language packages that my users might use.
>>>>>>>>> If I use different indexes
>>>>>>>>> indexA for the common stuff,
>>>>>>>>> indexEN for the English package,
>>>>>>>>> indexDE for the German package,
>>>>>>>>> For sure, I will be able to make a big index out of those by using a
>>>>>>>>> MultiReader
>>>>>>>>> BUT it really makes a union out of the three indexes (right ?) which
>>>>>>>>> means I'll have 900000 documents
>>>>>>>>> and the documents in indexA won't have any relation to the documents
>>>>>>>>> in indexEN (right ?) except if I give each document an id in each
>>>>>>>>> index and make a join at query time which is a big no no, because I
>>>>>>>>> use a queryParser and users may enter queries like
>>>>>>>>> "A:gah AND (DE:schlaffen EN:sleep)"
>>>>>>>>> Or I am mistaken and there is a way to create a document in three
>>>>>>>>> different indexes that stay in relation with the same docId ?
>>>>>>>>> My solution if question 1 is possible :
>>>>>>>>> In contrast, if I am able to build my index so that my Documents
>>>>>>>>> are stored in
>>>>>>>>> SEGMENT 1
>>>>>>>>> // common fields shipped with the App that aren't language
>>>>>>>>> A:
>>>>>>>>> B:
>>>>>>>>> C:
>>>>>>>>> SEGMENT 2
>>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>>> EN:
>>>>>>>>> EN_Words:
>>>>>>>>> EN_Sentences:
>>>>>>>>> some DocValues
>>>>>>>>> SEGMENT 3
>>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>>> DE:
>>>>>>>>> DE_Words:
>>>>>>>>> DE_Sentences:
>>>>>>>>> some DocValues
>>>>>>>>> I only need to ship SEGMENT 1 in the App and let users download
>>>>>>>>> SEGMENT 2 or SEGMENT 3 depending on whether they want English or
>>>>>>>>> German
>>>>>>>>> and use a composite reader with atomic readers (right ?) to use my
>>>>>>>>> frankenstein index at query time with a queryparser
>>>>>>>>> Also, in case question 1 is possible, I would really like to know
>>>>>>>>> too if it is possible to remap docIds at build time in a read-only
>>>>>>>>> index.
>>>>>>>>> An application of this would be :
>>>>>>>>> At day 1, I ship my app with 2 language packages : English and German
>>>>>>>>> (documents are uniquely identified by a docId... or by an external id
>>>>>>>>> (thanks to a docId <-> external id map))
>>>>>>>>> At day 2, I ship an additional language package (French) because I'm
>>>>>>>>> able to build an index with English, German, French with the same
>>>>>>>>> exact docIds for each document as in the index shipped at day 1
