lucene-java-user mailing list archives

From Dávid Nemeskey <da...@cliqz.com>
Subject Avoid memory issues when indexing terms with multiplicity
Date Fri, 04 Apr 2014 10:16:05 GMT
Hi guys,

I have just recently (re-)joined the list. I have an issue with indexing; I hope
someone can help me with it.

The use case is that some of the fields in the document are made up of
term:frequency pairs. What I am doing right now is expanding these with a
TokenFilter, so that for "dog:3 cat:2", for example, I return
"dog dog dog cat cat" and index that. The problem is that when these fields
contain real data (anchor text, references, etc.), the resulting field texts
for some documents can be really huge; so huge, in fact, that I get
OutOfMemoryErrors.
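
To make the expansion concrete, here is a minimal plain-Java sketch of what the
filter effectively produces. It does not use the Lucene API at all;
FrequencyExpansion, expand and expandedLength are hypothetical names made up for
this illustration, and the input is assumed to be whitespace-separated
term:frequency pairs with a non-negative integer after the last colon.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only mimics the token stream
// that an expanding filter would emit for a "term:frequency" field.
final class FrequencyExpansion {

    // Expands "dog:3 cat:2" into [dog, dog, dog, cat, cat].
    static List<String> expand(String field) {
        List<String> tokens = new ArrayList<>();
        for (String pair : field.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // blank input yields no tokens
            }
            int colon = pair.lastIndexOf(':');
            String term = pair.substring(0, colon);
            int freq = Integer.parseInt(pair.substring(colon + 1));
            for (int i = 0; i < freq; i++) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    // Counts the tokens the expansion would produce without materializing them.
    static long expandedLength(String field) {
        long total = 0;
        for (String pair : field.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue;
            }
            total += Long.parseLong(pair.substring(pair.lastIndexOf(':') + 1));
        }
        return total;
    }
}
```

The expanded length is the sum of all frequencies, so a field with a handful of
very frequent terms already turns into millions of tokens.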

I would be grateful if someone could tell me how this issue could be solved. I
thought of circumventing the problem by capping the frequency I allow or by
using its logarithm, but it would be nice to know if there is a proper
solution. I have had a look at the code, but got lost in all the
different Consumers. Here are a few questions I have come up with, but the real
solution might be something entirely different...

1. Is there information on how much using payloads (and hence positions) slows
down querying?
2. Provided that I do not want payloads, can I extend something (perhaps a
Consumer) to achieve what I want?
3. Is there documentation somewhere that describes how indexing works, i.e.
which Consumer, Writer, etc. is invoked when?
4. Am I better off just post-processing the indices, perhaps by writing the
frequency to a payload during indexing, and then running through the index,
removing the payloads and positions, and writing the posting lists myself?
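
For reference, the two workarounds mentioned before the questions (capping the
frequency, or replacing it with its logarithm) could look roughly like the
sketch below. This is plain Java, independent of Lucene; FrequencyDamping, cap
and logScale are hypothetical names, and using floor(log2(freq)) + 1 as the
damped value is just one possible choice of logarithm, not something I have
settled on.

```java
// Hypothetical helper, not part of Lucene: two ways to shrink a raw term
// frequency before it is expanded into repeated tokens.
final class FrequencyDamping {

    // Caps the frequency at maxFreq.
    static int cap(int freq, int maxFreq) {
        return Math.min(freq, maxFreq);
    }

    // Assumed damping: floor(log2(freq)) + 1, computed with integer
    // arithmetic (the bit length of freq) to avoid floating-point rounding.
    // Non-positive frequencies map to 0.
    static int logScale(int freq) {
        if (freq <= 0) {
            return 0;
        }
        return 32 - Integer.numberOfLeadingZeros(freq);
    }
}
```

With the cap, the expanded field length is bounded by the number of distinct
terms times maxFreq; with the logarithm, a frequency of 1000 shrinks to 10
repetitions. Both change the stored term frequencies, and therefore the
scores, which is why I would prefer a proper solution.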

Thank you very much.

Best,
Dávid Nemeskey