lucene-java-user mailing list archives

From Gregory Dearing <gregdear...@gmail.com>
Subject Re: Avoid memory issues when indexing terms with multiplicity
Date Fri, 04 Apr 2014 21:09:16 GMT
Hi David,

I'm not an expert, but I've climbed through the consumers myself in the
past.  The big limit is that the full postings for a single document (or
document block) must fit in memory.  There may be other hidden processing
limits (e.g. memory used per field).

I think it would be possible to create a custom consumer chain that avoids
these limits, but it would be a lot of work.

My suggestions would be...

1.) If you're able to index your documents when not expanding terms,
consider whether expansion is really necessary.

If you're expanding them for relevance purposes, then consider storing the
frequency as a payload.  You can use something like PayloadTermQuery and
Similarity.scorePayload() to adjust scoring based on the value.  I wouldn't
expect this to noticeably affect query times but, of course, it will depend
on your use case.
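To make the payload idea concrete, here is a minimal, self-contained sketch
(plain Java, no Lucene dependency; the class and method names are just for
illustration) of the transformation a payload-producing filter would
perform: split each "term:frequency" token on the delimiter and encode the
frequency as a 4-byte big-endian payload, which is how Lucene's
IntegerEncoder encodes integer payloads.  In a real analyzer you would do
this in a TokenFilter that sets a PayloadAttribute, or try the stock
DelimitedPayloadTokenFilter with ':' as the delimiter.

```java
import java.nio.ByteBuffer;

public class FreqPayload {

    /** Bare term before the last ':' ("dog:3" -> "dog"). */
    static String term(String token) {
        int i = token.lastIndexOf(':');
        return i < 0 ? token : token.substring(0, i);
    }

    /** Frequency after the last ':' as a 4-byte big-endian payload;
     *  defaults to 1 when the token carries no ':' suffix. */
    static byte[] payload(String token) {
        int i = token.lastIndexOf(':');
        int freq = i < 0 ? 1 : Integer.parseInt(token.substring(i + 1));
        return ByteBuffer.allocate(4).putInt(freq).array();
    }

    public static void main(String[] args) {
        System.out.println(term("dog:3"));        // dog
        System.out.println(payload("dog:3")[3]);  // 3
    }
}
```

At query time, a Similarity override could decode those four bytes back to
an int and fold the value into the score.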

2.) I think you could override your TermsConsumer's implementation of
finishTerm() to rewrite "dog:3" as "dog" and multiply Term Frequency by 3,
right before the term is written to the postings.  This is not for the
faint of heart, and I wouldn't recommend trying unless #1 doesn't meet your
needs.
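For illustration, here is a sketch of just the arithmetic such a
finishTerm() override would perform (plain Java, no Lucene types; the
codec/consumer wiring is omitted and these names are hypothetical): strip
the ":N" suffix off the raw term and scale the frequency statistics by N.

```java
public class TermFreqRewrite {

    /** Result of rewriting "dog:3": term "dog", statistics scaled by 3. */
    static class Rewritten {
        final String term;
        final long totalTermFreq;
        Rewritten(String term, long totalTermFreq) {
            this.term = term;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Splits the multiplier off the raw term and multiplies
     *  totalTermFreq by it; terms without a ':' pass through. */
    static Rewritten rewrite(String rawTerm, long totalTermFreq) {
        int i = rawTerm.lastIndexOf(':');
        if (i < 0) return new Rewritten(rawTerm, totalTermFreq);
        long mult = Long.parseLong(rawTerm.substring(i + 1));
        return new Rewritten(rawTerm.substring(0, i), totalTermFreq * mult);
    }

    public static void main(String[] args) {
        Rewritten r = rewrite("dog:3", 2);  // "dog:3" seen at 2 positions
        System.out.println(r.term + " " + r.totalTermFreq);  // dog 6
    }
}
```

Note this only covers the statistics; a real override would also have to
keep the rewritten terms in sorted order, since collapsing "dog:2" and
"dog:3" can merge postings that the consumer expects to see once per term.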

-Greg



On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey <david@cliqz.com> wrote:

> Hi guys,
>
> I have just recently (re-)joined the list. I have an issue with indexing;
> I hope
> someone can help me with it.
>
> The use-case is that some of the fields in the document are made up of
> term:frequency pairs. What I am doing right now is to expand these with a
> TokenFilter, so that e.g. for "dog:3 cat:2", I return "dog dog dog cat
> cat", and index that. However, the problem is that when these fields
> contain real data (anchor text, references, etc.), the resulting field
> texts for some documents can be really huge; so huge, in fact, that I get
> OutOfMemory exceptions.
>
> I would be grateful if someone could tell me how this issue could be
> solved. I thought of circumventing the problem by capping the frequency
> or using its logarithm, but it would be nice to know if there is a proper
> solution to the problem. I have had a look at the code, but got lost in
> all the different Consumers. Here are a few questions I have come up
> with, but the real solution might be something entirely different...
>
> 1. Is there information on how much using payloads (and hence positions)
> slows down querying?
> 2. Provided that I do not want payloads, can I extend something (perhaps a
> Consumer) to achieve what I want?
> 3. Is there documentation somewhere that describes how indexing works,
> i.e. which Consumer, Writer, etc. is invoked when?
> 4. Am I better off just post-processing indices, perhaps by writing the
> frequency to a payload during indexing, then running through the index,
> removing the payloads and positions, and writing the posting lists myself?
>
> Thank you very much.
>
> Best,
> Dávid Nemeskey
