lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene in-memory index
Date Fri, 18 Oct 2013 19:36:58 GMT
Unfortunately, SpanNearQuery is a very costly query.  What slop are you passing?

You might want to check out
https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds
proximity boosting to queries, but it's still at a very early stage of
iteration, and if you need a precise count of only those documents
matching the SpanNearQuery, then that issue won't help.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov
<ishalyminov@yandex-team.ru> wrote:
> Mike,
>
> For now I'm just running a SpanQuery over a ~600MB index segment in a single thread (one segment per thread; the complete setup is 30 segments totaling 20GB).
>
> I'm trying to use Lucene for a morphologically annotated text corpus (namely, the Russian National Corpus).
> The main query type in it is co-occurrence search, constrained by the desired morphological features of the words and the distance between tokens.
>
> In my test case I work with a single field, grammar (it is word-level: every word in the corpus has one). The full grammar annotation of a word is a set of atomic grammar features.
> For example, the verb "book" has in its grammar:
> - POS tag (V);
> - tense (pres);
>
> and the noun "book":
> - POS tag (N);
> - number (sg).
>
> In general one grammar annotation has approximately 8 atomic features.
>
> Words are treated as initially ambiguous, so for an occurrence of the word "book" in the text we get the grammar tokens:
> V    pres    N    sg
> The 2 parses, "V,pres" and "N,sg", are just independent tokens with positionIncrement=0 in the index.
>
> Moreover, each such token has a parse bitmask in its payload:
> V|0001    pres|0001    N|0010    sg|0010
>
> Here, V and pres appeared in the 1st parse, and N and sg in the 2nd, with a maximum of 4 parse variants. This allows me to find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".
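
A minimal sketch of the bitmask check described above, in plain Java with no Lucene dependency. The class and method names (ParseMaskSketch, matches, bookTokens) are hypothetical and exist only for illustration; the assumption is that a set of features matches a word position when the bitwise AND of their payload masks is non-zero, i.e. when all of them occur in at least one common parse.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Lucene API. Each grammar token at one word
// position carries a bitmask of the parses (at most 4) in which it occurs.
class ParseMaskSketch {

    // True if all requested features share at least one parse at this position.
    static boolean matches(Map<String, Integer> tokenMasks, String... features) {
        int common = 0b1111; // assumption: at most 4 parse variants
        for (String f : features) {
            Integer mask = tokenMasks.get(f);
            if (mask == null) {
                return false; // feature absent at this position
            }
            common &= mask;
        }
        return common != 0;
    }

    // Tokens for one occurrence of "book": V|0001 pres|0001 N|0010 sg|0010
    static Map<String, Integer> bookTokens() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("V", 0b0001);
        m.put("pres", 0b0001);
        m.put("N", 0b0010);
        m.put("sg", 0b0010);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> book = bookTokens();
        System.out.println(matches(book, "V", "pres")); // true
        System.out.println(matches(book, "V", "sg"));   // false
    }
}
```

In an actual index the masks would be read from the payloads of the tokens stacked at the same position, rather than from a map.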
>
> So, I'm performing a SpanNearQuery ("A,sg" that goes right before "N,sg") with position and payload checking over a 600MB segment, and getting the precise number of doc hits and the overall number of matches by iterating over getSpans().
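
The counting described above can be sketched without Lucene as a merge over sorted position lists. This is an assumption-laden simplification: the names (AdjacentSpanCountSketch, countAdjacent, count) are hypothetical, each array is taken to hold the positions in one document where one side of the query already matched, and the payload check is left out.

```java
import java.util.Arrays;

// Hypothetical illustration, not Lucene API.
class AdjacentSpanCountSketch {

    // Counts positions p in `first` such that p + 1 occurs in `second`.
    // Both arrays must be sorted ascending and contain no duplicates.
    static int countAdjacent(int[] first, int[] second) {
        int i = 0, j = 0, matches = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;
            if (second[j] < wanted) {
                j++; // second side is behind; advance it
            } else {
                if (second[j] == wanted) {
                    matches++;
                }
                i++; // move on to the next candidate on the first side
            }
        }
        return matches;
    }

    // Returns {docHits, totalMatches} over per-document position lists.
    static int[] count(int[][] firstByDoc, int[][] secondByDoc) {
        int docHits = 0, total = 0;
        for (int d = 0; d < firstByDoc.length; d++) {
            int m = countAdjacent(firstByDoc[d], secondByDoc[d]);
            if (m > 0) {
                docHits++;
            }
            total += m;
        }
        return new int[] {docHits, total};
    }

    public static void main(String[] args) {
        int[][] adjSg = {{0, 4, 9}, {2}, {}};
        int[][] nounSg = {{1, 5, 7}, {5}, {1}};
        System.out.println(Arrays.toString(count(adjSg, nounSg))); // [1, 2]
    }
}
```

Every match has to be visited to produce the overall count, so the cost grows with the number of candidate positions, not with the number of matching documents.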
>
> This takes about 20 seconds, even when everything is in RAM.
> The next thing I'm going to explore is compression; I'll try DirectPostingsFormat as you suggested.
>
> --
> Best Regards,
> Igor
>
> 17.10.2013, 20:26, "Michael McCandless" <lucene@mikemccandless.com>:
>> DirectPostingsFormat holds all postings in RAM, uncompressed, as
>> simple java arrays.  But it's quite RAM heavy...
>>
>> The hotspots may also be in the queries you are running ... maybe you
>> can describe more how you're using Lucene?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>> <ishalyminov@yandex-team.ru> wrote:
>>
>>>  Hello!
>>>
>>>  I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ).
>>>  Thus, I think my problem is not disk access (although I always see getPayload() at the top in VisualVM).
>>>  So, maybe the hard part of postings traversal is decompression?
>>>  Are there Lucene codecs that use light postings compression (or none at all)?
>>>
>>>  And, getting back to the in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>>
>>>  --
>>>  Best Regards,
>>>  Igor
>>>
>>>  10.10.2013, 03:01, "Vitaly Funstein" <vfunstein@gmail.com>:
>>>>  I don't think you want to load indexes of this size into a RAMDirectory.
>>>>  The reasons have been listed multiple times here... in short, just use
>>>>  MMapDirectory.
>>>>
>>>>  On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>  <ishalyminov@yandex-team.ru> wrote:
>>>>>   Hello!
>>>>>
>>>>>   I need to perform an experiment of loading the entire index into RAM and
>>>>>   seeing how the search performance changes.
>>>>>   My index has TermVectors with payload and position info, StoredFields, and
>>>>>   DocValues. It takes ~30GB on disk (the server has 48).
>>>>>
>>>>>   _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>>>>   File(_indexDirectory)));
>>>>>
>>>>>   Is the line above the only thing I have to do to complete my goal?
>>>>>
>>>>>   And also:
>>>>>   - will all the data be loaded into RAM right after opening, or during
>>>>>   the reading stage?
>>>>>   - will the index data be stored in RAM as it is on disk, or will it be
>>>>>   uncompressed first?
>>>>>
>>>>>   --
>>>>>   Best Regards,
>>>>>   Igor
>>>>>
>>>>>   ---------------------------------------------------------------------
>>>>>   To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>   For additional commands, e-mail: java-user-help@lucene.apache.org

