lucene-dev mailing list archives

From Konstantin <acc4konstan...@gmail.com>
Subject Re: Near real time search improvement
Date Thu, 14 Jul 2016 12:14:52 GMT
Hello Michael,
Maybe this problem is already solved, or could be solved, at a different level
of abstraction (in Solr or Elasticsearch): write new documents to both the
persistent index and a RAMDirectory, so that new docs can be queried from the
RAMDirectory immediately.
My motivation here is to learn from Lucene. Could you please suggest
any sources of information on BytesRefHash, TermsHash, and the whole
indexing process?
Changing anything in there looks like a complex task to me too.
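
To make the dual-write idea above concrete, here is a minimal sketch in plain Java. It is only a toy model under stated assumptions: ordinary collections stand in for the persistent index and the RAMDirectory, and the names DualWriteIndex, add, refresh, search and searchPersistentOnly are hypothetical, not Lucene, Solr or Elasticsearch APIs. Tokenization is a naive lowercase split on whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model of the dual-write idea. Plain collections stand in for the real
// indexes; none of these names are Lucene, Solr or Elasticsearch APIs.
final class DualWriteIndex {
    // Postings that a refresh has already made searchable (the "persistent" side).
    private final Map<String, TreeSet<Integer>> persistent = new HashMap<>();
    // Postings accepted by the persistent side but not yet visible to searches.
    private final Map<String, TreeSet<Integer>> buffered = new HashMap<>();
    // In-memory side index: searchable as soon as add() returns.
    private final Map<String, TreeSet<Integer>> recent = new HashMap<>();

    private static void post(Map<String, TreeSet<Integer>> index, String term, int docId) {
        index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    private static TreeSet<Integer> lookup(Map<String, TreeSet<Integer>> index, String term) {
        TreeSet<Integer> docIds = index.get(term);
        return docIds == null ? new TreeSet<Integer>() : docIds;
    }

    // Write every token of the document to both sides.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            post(buffered, token, docId);
            post(recent, token, docId);
        }
    }

    // Make buffered documents visible on the persistent side, then drop the side index.
    void refresh() {
        for (Map.Entry<String, TreeSet<Integer>> e : buffered.entrySet()) {
            persistent.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        buffered.clear();
        recent.clear();
    }

    // What a search sees without the side index.
    List<Integer> searchPersistentOnly(String term) {
        return new ArrayList<>(lookup(persistent, term));
    }

    // Union of both sides, deduplicated and sorted by doc ID.
    List<Integer> search(String term) {
        TreeSet<Integer> hits = new TreeSet<>(lookup(persistent, term));
        hits.addAll(lookup(recent, term));
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        DualWriteIndex index = new DualWriteIndex();
        index.add(1, "near real time search");
        System.out.println(index.searchPersistentOnly("search")); // before refresh
        System.out.println(index.search("search"));               // through the side index
        index.refresh();
        System.out.println(index.searchPersistentOnly("search")); // after refresh
    }
}
```

The cost of this design is that every document is indexed twice, and each query has to merge results from two places.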


2016-07-14 11:54 GMT+03:00 Michael McCandless <lucene@mikemccandless.com>:

> Another example is Michael Busch's work while at Twitter, extending Lucene
> so you can do real-time searches of the write cache ... here's a paper
> describing it:
> http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
>
> But this was a very heavy modification of Lucene and wasn't ever
> contributed back.
>
> I do think it should be possible (just complex!) to have real-time
> searching of recently indexed documents, and sorted terms are really
> only needed if you must support multi-term queries.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Jul 12, 2016 at 12:29 PM, Adrien Grand <jpountz@gmail.com> wrote:
>
>> This is not something I am very familiar with, but this issue
>> https://issues.apache.org/jira/browse/LUCENE-2312 tried to improve NRT
>> latency by adding the ability to search directly into the indexing buffer
>> of the index writer.
>>
>> On Tue, Jul 12, 2016 at 4:11 PM, Konstantin <acc4konstantin@gmail.com> wrote:
>>
>>> Hello everyone,
>>> As far as I understand, NRT requires flushing a new segment to disk. Is it
>>> correct that the write cache is not searchable?
>>>
>>> The competing search library Groonga
>>> <http://groonga.org/docs/characteristic.html> claims to have much
>>> smaller real-time search latency (as far as I understand, via a searchable
>>> write cache), but loading data into its index takes almost three times
>>> longer (per a benchmark in a Japanese blog post
>>> <http://blog.createfield.com/entry/2014/07/22/080958>; the data seems to be
>>> Wikipedia XML, though I'm not sure whether it's the English one).
>>>
>>> I've created an incomplete prototype of a searchable write cache in my pet
>>> project <https://github.com/kk00ss/Rhinodog>, and it takes twice as
>>> long as Lucene to index a fraction of Wikipedia using the same
>>> EnglishAnalyzer from lucene.analysis (there is probably room for
>>> optimization). While loading data into Lucene I didn't reuse Document
>>> instances. The searchable write cache was implemented as a bunch of
>>> persistent Scala SortedMap[TermKey, Measure]() instances, one per logical
>>> core, where TermKey is defined as TermKey(termID: Int, docID: Long) and
>>> Measure is just frequency and norm (but could be extended).
>>>
>>> Do you think it's worth the slowdown? If so, I'm interested in learning how
>>> this part of Lucene works while implementing this feature. However, it is
>>> unclear to me how hard it would be to change the existing implementation. I
>>> cannot wrap my head around TermsHash and the whole flush process. Is there
>>> any documentation or good blog posts to read about it?
>>>
>>>
>
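
The write-cache layout described in the first message of the thread (a sorted map from TermKey(termID, docID) to a Measure holding frequency and norm) can be sketched roughly as follows. This is not the Rhinodog code: it swaps the persistent Scala SortedMap for a single mutable java.util.TreeMap, ignores the one-map-per-core partitioning and all concurrency, and the class name WriteCache, the method names, and the field types of Measure are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Composite key: ordering by termId first keeps all postings of a term adjacent.
final class TermKey implements Comparable<TermKey> {
    final int termId;
    final long docId;

    TermKey(int termId, long docId) {
        this.termId = termId;
        this.docId = docId;
    }

    @Override
    public int compareTo(TermKey other) {
        int byTerm = Integer.compare(termId, other.termId);
        return byTerm != 0 ? byTerm : Long.compare(docId, other.docId);
    }
}

// Per-posting payload; the field types are assumptions.
final class Measure {
    final int frequency;
    final float norm;

    Measure(int frequency, float norm) {
        this.frequency = frequency;
        this.norm = norm;
    }
}

// Hypothetical single-threaded write cache backed by one sorted map.
final class WriteCache {
    private final TreeMap<TermKey, Measure> entries = new TreeMap<>();

    // Record one (term, document) posting; searchable as soon as this returns.
    void put(int termId, long docId, int frequency, float norm) {
        entries.put(new TermKey(termId, docId), new Measure(frequency, norm));
    }

    // Payload for one posting, or null if the pair was never added.
    Measure measure(int termId, long docId) {
        return entries.get(new TermKey(termId, docId));
    }

    // Doc IDs containing the term, in ascending order, via one range scan.
    List<Long> postings(int termId) {
        TermKey from = new TermKey(termId, Long.MIN_VALUE);
        TermKey to = new TermKey(termId, Long.MAX_VALUE);
        List<Long> docIds = new ArrayList<>();
        for (TermKey key : entries.subMap(from, true, to, true).keySet()) {
            docIds.add(key.docId);
        }
        return docIds;
    }

    public static void main(String[] args) {
        WriteCache cache = new WriteCache();
        cache.put(7, 42L, 3, 0.5f);
        cache.put(7, 5L, 1, 1.0f);
        cache.put(8, 6L, 2, 0.25f);
        System.out.println(cache.postings(7)); // doc IDs for term 7, ascending
    }
}
```

Because the key sorts by term ID and then by doc ID, the postings for one term form a contiguous range, so a term lookup is a single range scan. Every insert pays the logarithmic cost of the sorted map, which is one plausible source of the indexing slowdown mentioned in that message.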
