lucene-java-user mailing list archives

From Denis Bazhenov <>
Subject Document serializable representation
Date Thu, 30 Mar 2017 07:45:50 GMT

We have an in-house distributed Lucene setup: 40 dual-socket servers with approximately 700 cores, divided into 7 partitions. Those machines do index searches only. Indexes are prepared on several isolated machines (so-called Index Masters) and distributed over the cluster with plain rsync.

The search speed is great, but we need more indexing throughput. The Index Masters have become CPU-bound lately. The reason is that we use a fairly complicated analysis pipeline, with a morphological dictionary (as opposed to stemming) and some NER elements. Right now indexing throughput is about 1-1.5K documents per second. Given a corpus of 140 million documents, a full reindex takes about a day. We want better: our current target is >10K documents per second. It seems Lucene itself can handle this requirement; it's just that our comparatively slow analysis pipeline can't.
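As a quick sanity check on those figures, the arithmetic behind "about a day" and the gain from hitting the target rate (rates here are the rounded numbers quoted above):

```java
// Back-of-the-envelope reindex time at the quoted throughput numbers.
public class ReindexTime {
    public static void main(String[] args) {
        long corpus = 140_000_000L;   // documents in the corpus
        double currentRate = 1_500;   // docs/sec, upper bound of 1-1.5K
        double targetRate = 10_000;   // docs/sec, the stated target

        System.out.printf("current: %.1f hours%n", corpus / currentRate / 3600);
        System.out.printf("target:  %.1f hours%n", corpus / targetRate / 3600);
        // current: 25.9 hours  -- roughly the "day or so" above
        // target:  3.9 hours
    }
}
```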

So we have a Plan.

Move the analysis stage off the dedicated Index Master boxes to somewhere it can be scaled easily, since it is stateless. The problem we are facing is that Lucene currently has no serializable Document representation that could be sent over the network.

We are planning to implement this kind of representation. The question is: are there any pitfalls or problems we'd better know about before starting? :)
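For illustration, a minimal sketch of what such a wire format could look like, using only plain Java I/O. Everything here (the flat name/value field list, the binary layout, the class name) is a hypothetical representation of ours, not part of Lucene's API; on the receiving side the field list would be turned back into Lucene Field objects:

```java
import java.io.*;
import java.util.*;

// Hypothetical wire format: a document as a flat list of (name, value) fields,
// length-prefixed so the receiver knows how many fields to read back.
public class DocOverWire {

    public static byte[] serialize(List<String[]> fields) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(fields.size());        // field count
        for (String[] f : fields) {
            out.writeUTF(f[0]);             // field name
            out.writeUTF(f[1]);             // field value
        }
        out.flush();
        return buf.toByteArray();
    }

    public static List<String[]> deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int n = in.readInt();
        List<String[]> fields = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            fields.add(new String[] { in.readUTF(), in.readUTF() });
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> doc = Arrays.asList(
            new String[] { "id", "42" },
            new String[] { "title", "Document serializable representation" });
        List<String[]> roundTrip = deserialize(serialize(doc));
        System.out.println(roundTrip.get(1)[1]);
    }
}
```

One caveat worth noting: if the analysis itself runs on the remote boxes, raw field values are not enough; the format would also have to carry pre-analyzed token attributes (terms, position increments, offsets, payloads) so the Index Master can index without re-analyzing. Solr's PreAnalyzedField takes a similar approach.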
