Subject: Re: Problems with ItemBasedRecommender with Lucene
From: Grant Ingersoll
To: java-user@lucene.apache.org
Date: Wed, 16 Sep 2009 11:06:09 -0400
Message-Id: <4001AD6D-7371-4508-AD4B-C0E6BC92D799@apache.org>
In-Reply-To: <4AB0EC9D.3050002@mufin.com>

On Sep 16, 2009, at 9:48 AM, Thomas Rewig wrote:

> Hello,
>
> I built a "real-time item-based recommender" with Lucene, based on a
> user's history and a (sparse) item-item similarity matrix. Some time
> ago, Ted Dunning recommended this approach to me on the Mahout
> mailing list for creating an ItemBasedRecommender:
>
> "It is actually very easy to do. The output of the recommendation
> off-line process is generally a sparse matrix of item-item links.
> Each line of this sparse matrix can be considered a document in
> creating a Lucene index. You will have to use a correct analyzer and
> a line by line document segmenter, but that is trivial. Then
> recommendation is a simple query step."
>
> So for 100,000 items this works fine - but for 1 million items the
> indexing fails, and I have no idea how to avoid this. Maybe you can
> give me a hint.
>
> First I create an item-item similarity matrix with Mahout's Taste,
> and in a second step I index it. The matrix is sparse because only
> item-item relations with a high correlation are saved.
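As context for that first step, here is a rough sketch of building such a thresholded item-item matrix with Taste. It assumes the later Mahout-style Taste API (FileDataModel, LogLikelihoodSimilarity, long item IDs), which differs in detail from the Taste releases of 2009, and saveRelation is a hypothetical persistence hook - treat the signatures as approximate, not as Thomas's actual code:

    import java.io.File;
    import java.io.IOException;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class SparseSimilarityMatrixBuilder {

        // Naive O(n^2) pair scan, kept simple to show only the thresholding;
        // a real off-line job over millions of items would batch and
        // parallelize this step.
        public void build(File preferences, double threshold)
                throws TasteException, IOException {
            DataModel model = new FileDataModel(preferences);
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            LongPrimitiveIterator rows = model.getItemIDs();
            while (rows.hasNext()) {
                long itemA = rows.nextLong();
                LongPrimitiveIterator cols = model.getItemIDs();
                while (cols.hasNext()) {
                    long itemB = cols.nextLong();
                    if (itemA == itemB) {
                        continue;
                    }
                    double s = similarity.itemSimilarity(itemA, itemB);
                    // keep only highly correlated pairs, so the matrix stays
                    // sparse (the NaN guard matters for measures like Pearson)
                    if (!Double.isNaN(s) && s >= threshold) {
                        saveRelation(itemA, itemB, s);
                    }
                }
            }
        }

        // hypothetical hook: e.g. write the relation to the database that
        // the indexing step later reads back
        private void saveRelation(long itemA, long itemB, double score) {
        }
    }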
> Here are the code snippets for this indexing:
>
>     CachedRowSetImpl rowSetMainItemList = null;  // mapping of items
>     ArrayList listBelongingItems = null;  // most similar items for a main item
>     Document aDocument = null;
>     Field aField = null;
>     Field aField1 = null;
>     Analyzer aAnalyzer = new StandardAnalyzer();
>     IndexWriter aWriter = new IndexWriter(this.indexDirectory, aAnalyzer,
>             true, IndexWriter.MaxFieldLength.UNLIMITED);
>     aWriter.setRAMBufferSizeMB(48);
>     rowSetMainItemList = getRowSetItemList();  // get all items
>     // reuse this field:
>     aField1 = new Field("Item1", "", Field.Store.YES, Field.Index.ANALYZED);
>     while (rowSetMainItemList.next()) {
>         aDocument = new Document();
>         aField1.setValue(rowSetMainItemList.getString(1));
>         aDocument.add(aField1);
>         // get the most similar items for an item:
>         listBelongingItems = getRowSetBelongingItems(rowSetMainItemList.getString(1));
>         Iterator itrBelongingItems = listBelongingItems.iterator();
>         while (itrBelongingItems.hasNext()) {
>             String strBelongingItem = (String) itrBelongingItems.next();
>             // no reuse of the field possible because of different field names:
>             aField = new Field(strBelongingItem, "1",
>                     Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
>             aDocument.add(aField);
>         }
>         aWriter.addDocument(aDocument);
>     }
>     aWriter.optimize();
>     aWriter.close();
>     aAnalyzer.close();
>
> Actually, each BelongingItem field would have to be boosted with the
> MainItem-BelongingItem correlation value to get accurate
> recommendations, but then the index would be about 80 GB for
> 6 million items... without the boosts it is only about 2 GB.
> But given that only relevant correlations are saved in the
> similarity matrix, the recommendation quality is good enough.
>
> The item recommendation for a user is a simple BooleanQuery made of
> TermQuerys boosted by the user's history (see the sketch after this
> excerpt). I search for the documents with the largest overlap with
> the user's history: I look at which documents have the most fields
> named after a BelongingItem set (with value "1") and recommend the
> "key" value stored in aField1 ("Item1").
>
> Anyway, as I mentioned, this works for 100,000 items. But with
> 1 million items the indexing crashes after a while with:
>
>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.HashMap.resize(HashMap.java:462)
>         at java.util.HashMap.addEntry(HashMap.java:755)
>         at java.util.HashMap.put(HashMap.java:385)
>         at java.util.HashSet.add(HashSet.java:200)
>         at org.apache.lucene.index.DocInverter.flush(DocInverter.java:66)
>         at org.apache.lucene.index.DocFieldConsumers.flush(DocFieldConsumers.java:75)
>         at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:60)
>         at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
>         at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3540)
>         at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3450)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1937)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
>
> If I increase the Java heap space, I get an "OutOfMemoryError:
> PermGen space" instead. If I increase the PermGen space with
> -XX:MaxPermSize=1024m, the Java heap space is again the limiting
> factor. I can increase both to the maximum of my system - 20 GB of
> RAM are available - but this doesn't solve the problem.
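For concreteness, here is a minimal sketch of that query step under the one-field-per-item layout Thomas describes, against the Lucene 2.4 API. historyItemIds and weights are illustrative stand-ins for the user's history; each field named after an item holds only the token "1", so matching many boosted history fields pushes a document up the ranking:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    static String[] recommend(IndexSearcher searcher, String[] historyItemIds,
                              float[] weights, int n) throws IOException {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < historyItemIds.length; i++) {
            // one clause per history item; the field *name* is the item ID
            TermQuery clause = new TermQuery(new Term(historyItemIds[i], "1"));
            clause.setBoost(weights[i]);  // boost taken from the user's history
            query.add(clause, BooleanClause.Occur.SHOULD);
        }
        // documents sharing the most (and most heavily boosted) fields score highest
        TopDocs top = searcher.search(query, null, n);
        String[] recommendations = new String[top.scoreDocs.length];
        for (int i = 0; i < top.scoreDocs.length; i++) {
            ScoreDoc sd = top.scoreDocs[i];
            // return the stored "key" value held in aField1
            recommendations[i] = searcher.doc(sd.doc).get("Item1");
        }
        return recommendations;
    }

Note that BooleanQuery caps clauses at 1,024 by default, so a very long user history would need BooleanQuery.setMaxClauseCount raised, or the history truncated to its highest-weight items.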
> During indexing the RAM consumption grows steadily until the JVM
> crashes. It does not matter whether I index the data in segments,
> opening and closing the IndexWriter each time, or whether I optimize
> the index periodically - the RAM consumption keeps growing...
>
> I think the problem is that I can't reuse the field aField in this
> approach, and it seems the GC never collects the old fields.
> Extrapolated, that's 600 million unique fields...
>
> I'm using Lucene 2.4.1 and Java version "1.6.0_16".
>
> Does anyone have an idea how to avoid the growing memory usage? Or
> does anybody know another approach for a real-time item-based
> recommender with Lucene?

You might want to ask on mahout-user, but I'm guessing Ted didn't mean
a new field for every item-item pair, but instead meant to represent
them as tokens and then create the corresponding appropriate queries
(it seems like payloads may be useful, or function queries). That, to
me, is the only way you would achieve the sparseness savings you are
after.

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene: http://www.lucidimagination.com/search
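For what it's worth, a sketch of the token-based layout Grant is pointing at, against the Lucene 2.4 API: each item becomes one document with exactly two fields - a stored "item" key and one analyzed "neighbors" field holding the similar items' IDs as whitespace-separated tokens - so the number of unique field names stays at two no matter how many items exist, which is what removes the per-field bookkeeping that blew up above. All identifiers are illustrative (the row-set helpers mirror Thomas's), and the per-neighbor correlation weights he wanted as boosts would need Grant's payload or function-query route, which this sketch leaves out:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // indexing: one document per row of the sparse similarity matrix
    IndexWriter writer = new IndexWriter(indexDirectory, new WhitespaceAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    while (rowSetMainItemList.next()) {
        String itemId = rowSetMainItemList.getString(1);
        StringBuilder neighbors = new StringBuilder();
        for (Object o : getRowSetBelongingItems(itemId)) {
            neighbors.append((String) o).append(' ');  // item IDs become tokens
        }
        Document doc = new Document();
        doc.add(new Field("item", itemId, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("neighbors", neighbors.toString(), Field.Store.NO,
                Field.Index.ANALYZED_NO_NORMS));
        writer.addDocument(doc);
    }
    writer.optimize();
    writer.close();

    // querying: the user's history becomes boosted terms in the single
    // "neighbors" field, instead of boosted field names
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < historyItemIds.length; i++) {
        TermQuery clause = new TermQuery(new Term("neighbors", historyItemIds[i]));
        clause.setBoost(weights[i]);
        query.add(clause, BooleanClause.Occur.SHOULD);
    }
    IndexSearcher searcher = new IndexSearcher(indexDirectory);
    TopDocs top = searcher.search(query, null, 10);
    for (ScoreDoc sd : top.scoreDocs) {
        String recommendation = searcher.doc(sd.doc).get("item");
    }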