Subject: Re: Avoid memory issues when indexing terms with multiplicity
From: Gregory Dearing <gregdearing@gmail.com>
To: java-user@lucene.apache.org, Dávid Nemeskey
Date: Fri, 4 Apr 2014 17:09:16 -0400

Hi David,

I'm not an expert, but I've climbed through the consumers myself in the
past. The big limit is that the full postings for a document, or document
block, must fit into memory. There may be other hidden processing limits
(e.g. memory used per field).

I think it would be possible to create a custom consumer chain that avoids
these limits, but it would be a lot of work.

My suggestions would be...

1.) If you're able to index your documents without expanding terms,
consider whether expansion is really necessary. If you're expanding them
for relevance purposes, then consider storing the frequency as a payload.
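For concreteness, here is a minimal, untested sketch of what the index-time
half could look like, assuming a Lucene 4.x analysis chain; the class name
FrequencyPayloadFilter and the "term:frequency" token format are just
placeholders for whatever your setup actually uses.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical filter: instead of expanding "dog:3" into "dog dog dog",
 * emit "dog" once and carry the frequency (3) as a payload on that position.
 */
public final class FrequencyPayloadFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public FrequencyPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String token = termAtt.toString();
    int sep = token.lastIndexOf(':');
    if (sep > 0) {
      float freq = Float.parseFloat(token.substring(sep + 1));
      // keep only the bare term...
      termAtt.setLength(sep);
      // ...and encode the frequency as a 4-byte float payload on this position
      payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
    }
    return true;
  }
}

Each document then contributes one position per distinct term instead of one
per occurrence, which is what keeps the per-document postings small.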
You can use something like PayloadTermQuery and Similarity.scorePayload() to
adjust scoring based on the value (a rough query-side sketch is appended
below the quoted message). I wouldn't expect this to noticeably affect query
times but, of course, it will depend on your use case.

2.) I think you could override your TermsConsumer's implementation of
finishTerm() to rewrite "dog:3" as "dog" and multiply the term frequency by
3, right before the term is written to the postings. This is not for the
faint of heart, and I wouldn't recommend trying it unless #1 doesn't meet
your needs.

-Greg


On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey wrote:

> Hi guys,
>
> I have just recently (re-)joined the list. I have an issue with indexing;
> I hope someone can help me with it.
>
> The use case is that some of the fields in the document are made up of
> term:frequency pairs. What I am doing right now is to expand these with a
> TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat
> cat", and index that. However, the problem is that when these fields
> contain real data (anchor text, references, etc.), the resulting field
> texts for some documents can be really huge; so much so, in fact, that I
> get OutOfMemory exceptions.
>
> I would be grateful if someone could tell me how this issue could be
> solved. I thought of circumventing the problem by capping the frequency I
> allow, or using the logarithm thereof, but it would be nice to know if
> there is a proper solution to the problem. I have had a look at the code,
> but got lost in all the different Consumers. Here are a few questions I
> have come up with, but the real solution might be something entirely
> different...
>
> 1. Is there information on how much using payloads (and hence positions)
> slows down querying?
> 2. Provided that I do not want payloads, can I extend something (perhaps
> a Consumer) to achieve what I want?
> 3. Is there documentation somewhere that describes how indexing works,
> which Consumer, Writer, etc. is invoked when?
> 4. Am I better off just post-processing indices, perhaps by writing the
> frequency to a payload during indexing, and then running through the
> index, removing the payloads and positions, and writing the posting lists
> myself?
>
> Thank you very much.
>
> Best,
> Dávid Nemeskey
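
As promised above, here is a rough, untested sketch of the query-time half of
suggestion #1, again assuming Lucene 4.x. The "anchors" field name, the class
names, and the frequencyBoostedQuery() helper are all made up for the example;
the only real APIs used are PayloadTermQuery, AveragePayloadFunction,
DefaultSimilarity.scorePayload() and PayloadHelper.

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class PayloadScoringExample {

  /** Decodes the stored frequency payload and returns it as the payload score factor. */
  public static class FrequencyPayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
      if (payload == null) {
        return 1f;  // positions without a payload score normally
      }
      return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
    }
  }

  /** Builds a payload-aware term query against the hypothetical "anchors" field. */
  public static Query frequencyBoostedQuery(IndexSearcher searcher) {
    searcher.setSimilarity(new FrequencyPayloadSimilarity());
    return new PayloadTermQuery(new Term("anchors", "dog"), new AveragePayloadFunction());
  }
}

AveragePayloadFunction averages the payload factors over the matching
positions; MinPayloadFunction and MaxPayloadFunction are alternatives if that
suits your scoring better.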