From: Mike Sokolov
Date: Thu, 05 May 2011 18:01:18 -0400
To: java-user@lucene.apache.org
CC: Chris Schilling
Subject: Re: new to lucene, non standard index

Are the tokens unique within a document? If so, why not store a document for every doc/token pair with fields:

  id (doc#/token#)
  doc-id (doc#)
  token
  weight1
  weight2
  frequency

Then search for token, sort by weight1, weight2 or frequency.

If the token matches are unique within a document you will only get each document listed once. If they aren't unique, it's not clear what you want to sort by anyway....

-Mike

On 05/05/2011 04:12 PM, Chris Schilling wrote:
> Hi,
>
> I am trying to figure out how to solve this problem:
>
> I have about 500,000 files that I would like to index, but the files are structured. So, each file has the following layout:
>
> doc1
> token1, weight11, frequency1, weight21
> token2, weight12, frequency2, weight22
> .
> .
> .
>
> etc for 500,000 docs.
>
> Basically, I would like to index the tokens for each doc. When I search for a token, I would like to be able to return the top docs sorted by weight1, frequency, or weight2.
>
> So, in my naive setup, I loop through the files in the directory, then I loop through the lines of the file.
> Inside the loop through each file, I call this function:
>
> public Document processKeywords(Document doc, String keyword, Float weight1, Float weight2, Integer frequency) throws Exception {
>     doc.add(new Field("keywords", keyword, Field.Store.NO, Field.Index.ANALYZED));
>     doc.add(new NumericField(keyword+"weight1", Field.Store.YES, true).setFloatValue(weight1));
>     doc.add(new NumericField(keyword+"weight2", Field.Store.YES, true).setFloatValue(weight2));
>     doc.add(new NumericField(keyword+"frequency", Field.Store.YES, true).setFloatValue(frequency));
>     return doc;
> }
>
> So, for each token, I create 3 new fields each time. Notice how I am trying to index the keyword in the "keywords" field. For the weights and frequency, I create a new field with a name based on the keyword. On average, I have 100 tokens per document, so each document will have about 300 distinct fields.
>
> When running my program, the Lucene portion eats up tons of memory, and when it gets to the max allotted by the JVM (I have tried allowing up to 4 GB), the program slows to a crawl. I assume it is spending all of its time in garbage collection due to all these fields.
>
> My code above seems like a very hacky way of accomplishing what I want (sorting documents based on keyword search using different numeric fields associated with that keyword).
>
> FYI, here is the main search code, where q is the token I am searching for and sortby is the field I want to use to sort. I set up a QP to search for the keyword in the "keywords" field. Then, I can extract the stats that I indexed for the given query keyword.
>
> private static final QueryParser parser = new QueryParser(Version.LUCENE_30, "keywords", new StandardAnalyzer(Version.LUCENE_30));
>
> public void search(String q, String sortby) throws IOException, ParseException {
>     Query query = parser.parse(q);
>     long start = System.currentTimeMillis();
>     TopDocs hits = this.is.search(query, null, 10, new Sort(new SortField(q+sortby, SortField.FLOAT, true)));
>     long end = System.currentTimeMillis();
>     System.out.println("Found " + hits.totalHits +
>         " document(s) (in " + (end - start) +
>         " milliseconds) that matched query '" +
>         q + "':");
>     for(ScoreDoc scoreDoc : hits.scoreDocs) {
>         Document doc = this.is.doc(scoreDoc.doc);
>         String hash = doc.get("hash");
>         System.out.println(hash + " " + doc.get(q+sortby) + " " + hash);
>     }
> }
>
> I am pretty new to Lucene, so I hope this makes sense. I tried to pare my problem down as much as possible. Like I said, the main problem I am running into is that after processing about 30000 documents, the indexing slows to a crawl and seems to spend all of its time in the garbage collector. I am looking for a more efficient/effective way of solving this problem. Code tidbits would help, but are not necessary :)
>
> Thanks for your help,
> Chris S.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
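
The layout suggested in the reply at the top of this message (one entry per doc/token pair, with the same few field names on every entry) can be sketched in plain Java. This is only an in-memory illustration of the data model, assuming tokens are unique within a document; the class and method names (TokenIndexSketch, Entry, add, search) are made up for the example and are not Lucene API. In a real index each Entry would presumably become one Lucene document with these six fields, so the index holds a handful of field names in total instead of three new ones per keyword.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory illustration only: plain Java objects, not Lucene calls.
class TokenIndexSketch {

    // One entry per doc/token pair, mirroring the fields listed in the reply.
    static final class Entry {
        final String id;      // "doc#/token#", unique per pair
        final String docId;   // the source document
        final String token;
        final float weight1;
        final float weight2;
        final int frequency;

        Entry(String docId, String token, float weight1, float weight2, int frequency) {
            this.id = docId + "/" + token;
            this.docId = docId;
            this.token = token;
            this.weight1 = weight1;
            this.weight2 = weight2;
            this.frequency = frequency;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Called once per line of an input file (hypothetical helper).
    void add(String docId, String token, float weight1, float weight2, int frequency) {
        entries.add(new Entry(docId, token, weight1, weight2, frequency));
    }

    // Returns up to n doc ids that contain the token, largest sortBy value first.
    List<String> search(String token, String sortBy, int n) {
        Comparator<Entry> order;
        switch (sortBy) {
            case "weight1":
                order = Comparator.comparingDouble(e -> e.weight1);
                break;
            case "weight2":
                order = Comparator.comparingDouble(e -> e.weight2);
                break;
            case "frequency":
                order = Comparator.comparingInt(e -> e.frequency);
                break;
            default:
                throw new IllegalArgumentException("unknown sort field: " + sortBy);
        }

        List<Entry> matches = new ArrayList<>();
        for (Entry e : entries) {
            if (e.token.equals(token)) {
                matches.add(e);
            }
        }
        matches.sort(order.reversed());

        List<String> docIds = new ArrayList<>();
        for (int i = 0; i < Math.min(n, matches.size()); i++) {
            docIds.add(matches.get(i).docId);
        }
        return docIds;
    }

    public static void main(String[] args) {
        TokenIndexSketch index = new TokenIndexSketch();
        index.add("doc1", "apple", 0.2f, 0.9f, 5);
        index.add("doc2", "apple", 0.8f, 0.1f, 3);
        index.add("doc2", "pear", 0.5f, 0.5f, 7);
        System.out.println(index.search("apple", "weight1", 10));
    }
}
```

Because the field names no longer depend on the keyword, the sort field is just "weight1", "weight2" or "frequency", and each source document appears at most once per token in the results.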