Subject: Re: Avoid memory issues when indexing terms with multiplicity
From: Gregory Dearing <gregdearing@gmail.com>
To: java-user@lucene.apache.org, Dávid Nemeskey
Date: Fri, 4 Apr 2014 17:09:16 -0400

Hi David,

I'm not an expert, but I've climbed through the consumers myself in the
past. The big limit is that the full postings for a document, or document
block, must fit into memory. There may be other hidden processing limits
(e.g. memory used per field).

I think it would be possible to create a custom consumer chain that avoids
these limits, but it would be a lot of work.

My suggestions would be...

1.) If you're able to index your documents without expanding terms,
consider whether expansion is really necessary. If you're expanding them
for relevance purposes, then consider storing the frequency as a payload.
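For concreteness, here is a minimal, untested sketch of what the index-time
half could look like, assuming a Lucene 4.x analysis chain; the class name
FrequencyPayloadFilter and the "term:frequency" token format are just
placeholders for whatever your setup actually uses.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical filter: instead of expanding "dog:3" into "dog dog dog",
 * emit "dog" once and carry the frequency (3) as a payload on that position.
 */
public final class FrequencyPayloadFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public FrequencyPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String token = termAtt.toString();
    int sep = token.lastIndexOf(':');
    if (sep > 0) {
      float freq = Float.parseFloat(token.substring(sep + 1));
      // keep only the bare term...
      termAtt.setLength(sep);
      // ...and encode the frequency as a 4-byte float payload on this position
      payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
    }
    return true;
  }
}

Each document then contributes one position per distinct term instead of one
per occurrence, which is what keeps the per-document postings small.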
You can use something like PayloadTermQuery and Similarity.scorePayload() to
adjust scoring based on the value (a rough query-side sketch is appended
below the quoted message). I wouldn't expect this to noticeably affect query
times but, of course, it will depend on your use case.

2.) I think you could override your TermsConsumer's implementation of
finishTerm() to rewrite "dog:3" as "dog" and multiply the term frequency by
3, right before the term is written to the postings. This is not for the
faint of heart, and I wouldn't recommend trying it unless #1 doesn't meet
your needs.

-Greg


On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey wrote:

> Hi guys,
>
> I have just recently (re-)joined the list. I have an issue with indexing;
> I hope someone can help me with it.
>
> The use case is that some of the fields in the document are made up of
> term:frequency pairs. What I am doing right now is to expand these with a
> TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat
> cat", and index that. However, the problem is that when these fields
> contain real data (anchor text, references, etc.), the resulting field
> texts for some documents can be really huge; so much so, in fact, that I
> get OutOfMemory exceptions.
>
> I would be grateful if someone could tell me how this issue could be
> solved. I thought of circumventing the problem by capping the frequency I
> allow, or using the logarithm thereof, but it would be nice to know if
> there is a proper solution to the problem. I have had a look at the code,
> but got lost in all the different Consumers. Here are a few questions I
> have come up with, but the real solution might be something entirely
> different...
>
> 1. Is there information on how much using payloads (and hence positions)
> slows down querying?
> 2. Provided that I do not want payloads, can I extend something (perhaps
> a Consumer) to achieve what I want?
> 3. Is there documentation somewhere that describes how indexing works,
> which Consumer, Writer, etc. is invoked when?
> 4. Am I better off just post-processing indices, perhaps by writing the
> frequency to a payload during indexing, and then running through the
> index, removing the payloads and positions, and writing the posting lists
> myself?
>
> Thank you very much.
>
> Best,
> Dávid Nemeskey
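
As promised above, here is a rough, untested sketch of the query-time half of
suggestion #1, again assuming Lucene 4.x. The "anchors" field name, the class
names, and the frequencyBoostedQuery() helper are all made up for the example;
the only real APIs used are PayloadTermQuery, AveragePayloadFunction,
DefaultSimilarity.scorePayload() and PayloadHelper.

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class PayloadScoringExample {

  /** Decodes the stored frequency payload and returns it as the payload score factor. */
  public static class FrequencyPayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
      if (payload == null) {
        return 1f;  // positions without a payload score normally
      }
      return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
    }
  }

  /** Builds a payload-aware term query against the hypothetical "anchors" field. */
  public static Query frequencyBoostedQuery(IndexSearcher searcher) {
    searcher.setSimilarity(new FrequencyPayloadSimilarity());
    return new PayloadTermQuery(new Term("anchors", "dog"), new AveragePayloadFunction());
  }
}

AveragePayloadFunction averages the payload factors over the matching
positions; MinPayloadFunction and MaxPayloadFunction are alternatives if that
suits your scoring better.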