<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>lucene-net-user@incubator.apache.org Archives</title>
<link rel="self" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/?format=atom"/>
<link href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/"/>
<id>http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/</id>
<updated>2009-12-10T18:27:37Z</updated>
<entry>
<title>Re: idf on per-field basis</title>
<author><name>Artem Chereisky &lt;a.chereisky@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3cBF90C35D-3883-4289-8063-805197D23E79@gmail.com%3e"/>
<id>urn:uuid:%3cBF90C35D-3883-4289-8063-805197D23E79@gmail-com%3e</id>
<updated>2009-12-10T09:40:03Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Michael, thank you.

Query filter only solves half of my problem. Unfortunately I do need  
to have a proper score for some fields.

I ended up extending the Term class (I removed the sealed attribute, which is a
bad thing). The new myTerm class has one boolean member, omitIdf.
Then, when I compile my queries, I use myTerm with omitIdf set to
true for some fields. Then I extended the Similarity class, and I cast
the Term passed into the Idf method to myTerm and only calculate Idf if
omitIdf is false. Seems to work.
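
A possible variation that avoids unsealing Term is to key the decision on the
field name instead of a flag on the term. The sketch below is self-contained
and does not call Lucene.Net: PerFieldIdf, NoIdfFields and the field names are
made up for illustration, and the formula is the usual default,
1 + ln(numDocs / (docFreq + 1)). In a real Similarity subclass the field name
would have to come from the Term handed to the Idf method; that wiring is an
assumption and is left out here.

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net: decides per field whether IDF applies.
public static class PerFieldIdf
{
    // Assumed names of fields for which IDF should be neutral (always 1).
    public static readonly string[] NoIdfFields = { "category", "status" };

    // Same shape as the default formula: 1 + ln(numDocs / (docFreq + 1)).
    public static float DefaultIdf(int docFreq, int numDocs)
    {
        return (float)(Math.Log(numDocs / (double)(docFreq + 1)) + 1.0);
    }

    // Returns 1 for fields listed in NoIdfFields, the default IDF otherwise.
    public static float Idf(string field, int docFreq, int numDocs)
    {
        if (Array.IndexOf(NoIdfFields, field) != -1)
        {
            return 1.0f;
        }
        return DefaultIdf(docFreq, numDocs);
    }
}

public static class Program
{
    public static void Main()
    {
        // A rare term in a scored field keeps its IDF weight.
        Console.WriteLine(PerFieldIdf.Idf("body", 9, 1000));
        // The same statistics in a neutral field contribute a constant 1.
        Console.WriteLine(PerFieldIdf.Idf("category", 9, 1000));
    }
}
```

Keying on the field name keeps the query-building code unchanged, at the cost
of hard-coding the list of neutral fields in one place.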

I don't like the solution but that's the best I could do today.

Any thoughts?

Regards,
Artem


On 10/12/2009, at 15:51, Michael Garski &lt;mgarski@myspace-inc.com&gt; wrote:

&gt; Artem,
&gt;
&gt; Do you need any scoring information at all on that field?  How about  
&gt; using a QueryFilter for those fields?
&gt;
&gt; Michael
&gt;
&gt;
&gt; -----Original Message-----
&gt; From: Artem Chereisky [mailto:a.chereisky@gmail.com]
&gt; Sent: Wed 12/9/2009 4:53 PM
&gt; To: lucene-net-user@incubator.apache.org; lucene-net-developer@incubator.apache.org
&gt; Subject: idf on per-field basis
&gt;
&gt; Hi,
&gt;
&gt; I came across a situation where my scores are adversely affected by the IDF
&gt; component. Let me explain.
&gt;
&gt; My index documents contain a number of fields. For some, TF and IDF are
&gt; important and need to be taken into account; for others, neither TF nor IDF
&gt; should apply. I dealt with TF by omitting norms during indexing, but I can't
&gt; find a way to calculate IDF for certain fields only.
&gt;
&gt; The formula for IDF is defined in Similarity. I have my own implementation
&gt; of Similarity where I can set it to 1 or use the default implementation.
&gt; mySearcher.SetSimilarity is where I tell Lucene which similarity instance to
&gt; use, but that's global, so it applies to all fields in the index.
&gt;
&gt; So, here's my question. Is there a way to calculate IDF on a per-field basis?
&gt;
&gt; Regards,
&gt; Art
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>AW: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;Markus Wolters&quot; &lt;markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c007d01ca7974$7a042660$6e0c7320$@com%3e"/>
<id>urn:uuid:%3c007d01ca7974$7a042660$6e0c7320$@com%3e</id>
<updated>2009-12-10T08:40:49Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I am going to use Lucene for a new public community portal. I would love to
be the one who tests 2.9.1 in production, but I think Michael should first
test at least the numeric fields feature, as he said.

Markus

-----Original Message-----
From: George Aroush [mailto:george@aroush.net]
Sent: Thursday, December 10, 2009 04:19
To: lucene-net-user@incubator.apache.org
Subject: RE: Safe to use Release 2.9.1 Beta for production?

I really want to see folks using 2.9.1, if possible, in production too.
It's the only way we can be sure of its quality.  And yes, all NUnit tests
are passing, but I want to see how it's doing outside NUnit too.

-- George

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Wednesday, December 09, 2009 3:42 AM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

I especially need the numeric range feature, because I have a
full-text search within a specified distance, so I need to filter the results
by spatial data.

Is the Lucene Spatial contrib the right way to go? It's been ported by Roger
Chapman.

Markus


-----Original Message-----
From: Michael Garski [mailto:mgarski@myspace-inc.com]
Sent: Wednesday, December 9, 2009 02:26
To: lucene-net-user@incubator.apache.org
Subject: RE: Safe to use Release 2.9.1 Beta for production?

Two of the features I've been looking forward to and am now testing are
numeric fields, which provide significant performance improvements on
numeric range queries, and the underlying change to per-segment searching
and caching.  The latter sounds rather innocuous at first glance, but when
coupled with a custom merge policy can improve heap utilization for large
indexes.

You can certainly build your application against 2.4 and then migrate to 2.9
at a later time if you are under time constraints to deliver something
quickly.

Michael

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Tuesday, December 08, 2009 1:24 PM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

Thanks for your advice.

Are there any mature features I might miss when using version 2.4.0?

Markus

-----Original Message-----
From: xzxz@mail.ru [mailto:xzxz@mail.ru]
Sent: Tuesday, December 8, 2009 18:23
To: lucene-net-user@incubator.apache.org
Subject: Re: Safe to use Release 2.9.1 Beta for production?

I would not recommend using 2.9.1 in production right now (it may be
ok, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would it be better to use the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest versions would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   











</pre>
</div>
</content>
</entry>
<entry>
<title>AW: Spatial search with Lucene.Net</title>
<author><name>&quot;Markus Wolters&quot; &lt;markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c007c01ca7974$22ce67c0$686b3740$@com%3e"/>
<id>urn:uuid:%3c007c01ca7974$22ce67c0$686b3740$@com%3e</id>
<updated>2009-12-10T08:38:23Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi George,

actually, there IS a port by Robert Chapman unless I'm mistaken:

http://issues.apache.org/jira/browse/LUCENENET-199

I have not yet had the time to check it.

Markus

-----Original Message-----
From: George Aroush [mailto:george@aroush.net]
Sent: Thursday, December 10, 2009 04:16
To: lucene-net-user@incubator.apache.org
Subject: RE: Spatial search with Lucene.Net

There isn't any port of Spatial.Net under contrib.  If someone has ported it and wants to
contribute it, that would be great.
 
-- George

-----Original Message-----
From: Michael Garski [mailto:mgarski@myspace-inc.com] 
Sent: Wednesday, December 09, 2009 5:12 PM
To: lucene-net-user@incubator.apache.org
Subject: RE: Spatial search with Lucene.Net

I'm not familiar with any of the spatial search libraries in the contrib section, however
they may need to be updated to use 2.9's numeric fields.

Michael 

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Wednesday, December 09, 2009 12:46 AM
To: lucene-net-user@incubator.apache.org
Subject: Spatial search with Lucene.Net

What is the best way to incorporate spatial search data into lucene?

The Spatial.Net contrib, which has been ported by Robert Chapman, or the code
found here:

http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html

Markus









</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to search number in text field?</title>
<author><name>Floyd Wu &lt;floyd.wu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c103c59c90912092150v6b17cd9awfa5babf2c40f6b13@mail.gmail.com%3e"/>
<id>urn:uuid:%3c103c59c90912092150v6b17cd9awfa5babf2c40f6b13@mail-gmail-com%3e</id>
<updated>2009-12-10T05:50:24Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Michael

Thanks, I finally found my problem. The final query text executed by the Lucene
QueryParser was accidentally modified by another function. Using a wildcard
query works, with no problems so far.

Thanks.

Floyd



2009/12/10 Michael Garski &lt;mgarski@myspace-inc.com&gt;

&gt; Be sure to use the same analyzer at search time that you do at index
&gt; time.  The StandardAnalyzer will keep the numbers and allow you to
&gt; search for the text "2009123" in the field "title" with "title:2009123"
&gt; or "title:2009*".  I'm not sure why you are having difficulty using the
&gt; wildcard query.
&gt;
&gt; Wildcard queries do not perform the best if you have a large number of
&gt; terms that could match the wildcard.  If that is the case you will want
&gt; to alter the analysis of your text.
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Floyd Wu [mailto:floyd.wu@gmail.com]
&gt; Sent: Sunday, December 06, 2009 7:44 PM
&gt; To: lucene-net-user@incubator.apache.org
&gt;  Subject: Re: How to search number in text field?
&gt;
&gt; Hi Michael
&gt;
&gt; The query is constructed using queryText.
&gt; Part of my code is as follows:
&gt;
&gt; string queryText = "title:2009123";
&gt; QueryParser parser = new QueryParser(SpecialFields.Title, Analyzer);
&gt; parser.SetLowercaseExpandedTerms(false);
&gt; Lucene.Net.Search.Query query = parser.Parse(queryText);
&gt; searcher.Search(query); // searcher is an IndexSearcher instance
&gt;
&gt; Floyd
&gt;
&gt;
&gt;
&gt; 2009/12/5 Michael Garski &lt;mgarski@myspace-inc.com&gt;
&gt;
&gt; &gt; Floyd,
&gt; &gt;
&gt; &gt; How are you constructing the query?
&gt; &gt;
&gt; &gt; Michael
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Floyd Wu [mailto:floyd.wu@gmail.com]
&gt; &gt; Sent: Thursday, December 03, 2009 10:33 PM
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: How to search number in text field?
&gt; &gt;
&gt; &gt; Hi all,
&gt; &gt; When documents are indexed with StandardAnalyzer, Lucene.Net does not
&gt; &gt; find numbers in text fields that contain numbers.
&gt; &gt; For example, I have built an index in which one title is 2009123.
&gt; &gt; The query string "title:2009" returns no records, and neither does
&gt; &gt; title:2009*
&gt; &gt; even though a record with title value 2009123 does exist.
&gt; &gt; How can I find this record when my client really wants to use 2009 as
&gt; &gt; the search keyword?
&gt; &gt;
&gt; &gt; Thanks
&gt; &gt;
&gt; &gt; Floyd
&gt; &gt;
&gt; &gt;
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: idf on per-field basis</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c7112862FD2F84D49927A5A5E0758451E046CFF@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E046CFF@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-12-10T04:51:18Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Artem,

Do you need any scoring information at all on that field?  How about using a QueryFilter for
those fields?

Michael


-----Original Message-----
From: Artem Chereisky [mailto:a.chereisky@gmail.com]
Sent: Wed 12/9/2009 4:53 PM
To: lucene-net-user@incubator.apache.org; lucene-net-developer@incubator.apache.org
Subject: idf on per-field basis
 
Hi,

I came across a situation where my scores are adversely affected by the IDF
component. Let me explain.

My index documents contain a number of fields. For some, TF and IDF are
important and need to be taken into account; for others, neither TF nor IDF
should apply. I dealt with TF by omitting norms during indexing, but I can't
find a way to calculate IDF for certain fields only.

The formula for IDF is defined in Similarity. I have my own implementation
of Similarity where I can set it to 1 or use the default implementation.
mySearcher.SetSimilarity is where I tell Lucene which similarity instance to
use, but that's global, so it applies to all fields in the index.

So, here's my question. Is there a way to calculate IDF on a per-field basis?

Regards,
Art

 


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;George Aroush&quot; &lt;george@aroush.net&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c008601ca7947$729f6c90$57de45b0$@net%3e"/>
<id>urn:uuid:%3c008601ca7947$729f6c90$57de45b0$@net%3e</id>
<updated>2009-12-10T03:18:30Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I really want to see folks using 2.9.1, if possible, in production too.
It's the only way we can be sure of its quality.  And yes, all NUnit tests
are passing, but I want to see how it's doing outside NUnit too.

-- George

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Wednesday, December 09, 2009 3:42 AM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

I especially need the numeric range feature, because I have a
full-text search within a specified distance, so I need to filter the results
by spatial data.

Is the Lucene Spatial contrib the right way to go? It's been ported by Roger
Chapman.

Markus


-----Original Message-----
From: Michael Garski [mailto:mgarski@myspace-inc.com]
Sent: Wednesday, December 9, 2009 02:26
To: lucene-net-user@incubator.apache.org
Subject: RE: Safe to use Release 2.9.1 Beta for production?

Two of the features I've been looking forward to and am now testing are
numeric fields, which provide significant performance improvements on
numeric range queries, and the underlying change to per-segment searching
and caching.  The latter sounds rather innocuous at first glance, but when
coupled with a custom merge policy can improve heap utilization for large
indexes.

You can certainly build your application against 2.4 and then migrate to 2.9
at a later time if you are under time constraints to deliver something
quickly.

Michael

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Tuesday, December 08, 2009 1:24 PM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

Thanks for your advice.

Are there any mature features I might miss when using version 2.4.0?

Markus

-----Original Message-----
From: xzxz@mail.ru [mailto:xzxz@mail.ru]
Sent: Tuesday, December 8, 2009 18:23
To: lucene-net-user@incubator.apache.org
Subject: Re: Safe to use Release 2.9.1 Beta for production?

I would not recommend using 2.9.1 in production right now (it may be
ok, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would it be better to use the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest versions would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   









</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Spatial search with Lucene.Net</title>
<author><name>&quot;George Aroush&quot; &lt;george@aroush.net&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c008401ca7947$0a849360$1f8dba20$@net%3e"/>
<id>urn:uuid:%3c008401ca7947$0a849360$1f8dba20$@net%3e</id>
<updated>2009-12-10T03:15:35Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
There isn't any port of Spatial.Net under contrib.  If someone has ported it and wants to
contribute it, that would be great.
 
-- George

-----Original Message-----
From: Michael Garski [mailto:mgarski@myspace-inc.com] 
Sent: Wednesday, December 09, 2009 5:12 PM
To: lucene-net-user@incubator.apache.org
Subject: RE: Spatial search with Lucene.Net

I'm not familiar with any of the spatial search libraries in the contrib section, however
they may need to be updated to use 2.9's numeric fields.

Michael 

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Wednesday, December 09, 2009 12:46 AM
To: lucene-net-user@incubator.apache.org
Subject: Spatial search with Lucene.Net

What is the best way to incorporate spatial search data into lucene?

The Spatial.Net contrib, which has been ported by Robert Chapman, or the code
found here:

http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html

Markus







</pre>
</div>
</content>
</entry>
<entry>
<title>idf on per-field basis</title>
<author><name>Artem Chereisky &lt;a.chereisky@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c9c1e6c8f0912091653u12d77764y25069e35d13bbe92@mail.gmail.com%3e"/>
<id>urn:uuid:%3c9c1e6c8f0912091653u12d77764y25069e35d13bbe92@mail-gmail-com%3e</id>
<updated>2009-12-10T00:53:10Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

I came across a situation where my scores are adversely affected by the IDF
component. Let me explain.

My index documents contain a number of fields. For some, TF and IDF are
important and need to be taken into account; for others, neither TF nor IDF
should apply. I dealt with TF by omitting norms during indexing, but I can't
find a way to calculate IDF for certain fields only.

The formula for IDF is defined in Similarity. I have my own implementation
of Similarity where I can set it to 1 or use the default implementation.
mySearcher.SetSimilarity is where I tell Lucene which similarity instance to
use, but that's global, so it applies to all fields in the index.

So, here's my question. Is there a way to calculate IDF on a per-field basis?

Regards,
Art


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: How to search number in text field?</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93B1B@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93B1B@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-12-09T22:15:13Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Be sure to use the same analyzer at search time that you do at index
time.  The StandardAnalyzer will keep the numbers and allow you to
search for the text "2009123" in the field "title" with "title:2009123"
or "title:2009*".  I'm not sure why you are having difficulty using the
wildcard query.  

Wildcard queries do not perform the best if you have a large number of
terms that could match the wildcard.  If that is the case you will want
to alter the analysis of your text.
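
Both points can be pictured with a small sketch that does not use Lucene.Net at
all; TermDictionary and its contents are made up for illustration. An exact
term query only matches a whole indexed term, so "2009" does not match
"2009123", while a prefix query such as 2009* has to be expanded against every
indexed term that starts with the prefix, which is where the cost comes from
when many terms match.

```csharp
using System;
using System.Collections;

// Hypothetical, simplified stand-in for the terms indexed in one field.
public static class TermDictionary
{
    public static readonly string[] Terms = { "2008991", "2009123", "2009124", "2010001" };

    // Exact term lookup: only a whole indexed term matches.
    public static bool HasTerm(string term)
    {
        return Array.IndexOf(Terms, term) != -1;
    }

    // Prefix expansion: every indexed term is checked against the prefix.
    public static string[] Expand(string prefix)
    {
        ArrayList matches = new ArrayList();
        foreach (string term in Terms)
        {
            if (term.StartsWith(prefix, StringComparison.Ordinal))
            {
                matches.Add(term);
            }
        }
        return (string[])matches.ToArray(typeof(string));
    }
}

public static class Program
{
    public static void Main()
    {
        // The bare term 2009 is not in the dictionary; the full title is.
        Console.WriteLine(TermDictionary.HasTerm("2009"));
        Console.WriteLine(TermDictionary.HasTerm("2009123"));
        // The prefix 2009 expands to all terms that start with it.
        Console.WriteLine(string.Join(", ", TermDictionary.Expand("2009")));
    }
}
```

The more terms share the prefix, the larger the expansion, which is why a
different analysis (for example indexing the year as its own term) can help.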

Michael

-----Original Message-----
From: Floyd Wu [mailto:floyd.wu@gmail.com] 
Sent: Sunday, December 06, 2009 7:44 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: How to search number in text field?

Hi Michael

The query is constructed using queryText.
Part of my code is as follows:

string queryText = "title:2009123";
QueryParser parser = new QueryParser(SpecialFields.Title, Analyzer);
parser.SetLowercaseExpandedTerms(false);
Lucene.Net.Search.Query query = parser.Parse(queryText);
searcher.Search(query); // searcher is an IndexSearcher instance

Floyd



2009/12/5 Michael Garski &lt;mgarski@myspace-inc.com&gt;

&gt; Floyd,
&gt;
&gt; How are you constructing the query?
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Floyd Wu [mailto:floyd.wu@gmail.com]
&gt; Sent: Thursday, December 03, 2009 10:33 PM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: How to search number in text field?
&gt;
&gt; Hi all,
&gt; When documents are indexed with StandardAnalyzer, Lucene.Net does not
&gt; find numbers in text fields that contain numbers.
&gt; For example, I have built an index in which one title is 2009123.
&gt; The query string "title:2009" returns no records, and neither does
&gt; title:2009*
&gt; even though a record with title value 2009123 does exist.
&gt; How can I find this record when my client really wants to use 2009 as
&gt; the search keyword?
&gt;
&gt; Thanks
&gt;
&gt; Floyd
&gt;
&gt;



</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Spatial search with Lucene.Net</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93B1A@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93B1A@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-12-09T22:11:44Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I'm not familiar with any of the spatial search libraries in the contrib section, however they
may need to be updated to use 2.9's numeric fields.

Michael 

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Wednesday, December 09, 2009 12:46 AM
To: lucene-net-user@incubator.apache.org
Subject: Spatial search with Lucene.Net

What is the best way to incorporate spatial search data into lucene?

The Spatial.Net contrib, which has been ported by Robert Chapman, or the code
found here:

http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html

Markus





</pre>
</div>
</content>
</entry>
<entry>
<title>AW:  Re: System.TypeInitializationException on linux (vbnc)</title>
<author><name>&quot;Johannes Drachenfels&quot; &lt;johannes@drachenfels.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c93515712821C364AA2A6A889A3497D5838B3A4@Bleich32.intern.drachenfels.de%3e"/>
<id>urn:uuid:%3c93515712821C364AA2A6A889A3497D5838B3A4@Bleich32-intern-drachenfels-de%3e</id>
<updated>2009-12-09T15:13:56Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Thanks a lot! This works perfectly! I added the line

System.Environment.SetEnvironmentVariable("OS", "linux")

Do you know how to file a bug with the developers?

Thanks &amp; Regards,

Johannes



Johannes von Drachenfels
Phone: +49-7231-9223800
Mobile: +49-171-6710815
 
Drachenfels GmbH
Bleichstrasse 56
75173 Pforzheim
Germany
 
Managing Director: Johannes von Drachenfels
Registered office: Pforzheim
Commercial register: Mannheim Registry Court, HRB 504389
 
Notice: This transmittal and/or attachments may be privileged or confidential. If you are
not the intended recipient, you are hereby notified that you have received this transmittal
in error; any review, dissemination, or copying is strictly prohibited. If you received this
transmittal in error, please notify us immediately by reply and immediately delete this message
and all its attachments. Thank you. 


-----Original Message-----
From: news [mailto:news@ger.gmane.org] On Behalf Of Robert Jordan
Sent: Wednesday, December 9, 2009 15:17
To: lucene-net-user@incubator.apache.org
Subject: Re: System.TypeInitializationException on linux (vbnc)

On 09.12.2009 14:57, Johannes Drachenfels wrote:
&gt; Hi,
&gt;
&gt;
&gt;
&gt; sorry for re-posting - I am new to this list...
&gt;
&gt;
&gt;
&gt; I am switching from lucene.net 2.0.0.4 to lucene.net 2.9.1.1 and I have
&gt; problems at a very early stage.
&gt;
&gt;
&gt;
&gt; With version 2.0.0.4 it works fine!
&gt;
&gt; With version 2.9.1.1 on Microsoft it works fine
&gt;
&gt; With version 2.9.1.1 on linux I always get the following errors:
&gt;
&gt;
&gt;
&gt; ################################
&gt;
&gt; System.TypeInitializationException: An exception was thrown by the type
&gt; initializer for Lucene.Net.Store.FSDirectory ---&gt;
&gt; System.TypeInitializationException: An exception was thrown by the type
&gt; initializer for Lucene.Net.Util.Constants ---&gt;
&gt; System.NullReferenceException: Object reference not set to an instance
&gt; of an object
&gt;
&gt;    at Lucene.Net.Util.Constants..cctor () [0x00000]
&gt;
&gt;    --- End of inner exception stack trace ---
&gt;
&gt;    at Lucene.Net.Store.FSDirectory..cctor () [0x00000]
&gt;
&gt;    --- End of inner exception stack trace ---
&gt;
&gt;    at ConsoleApplication1.Module1.Main () [0x00000]


This is probably caused by this line in Lucene.Net.Util.Constants.cs:

public static readonly System.String OS_NAME = 
System.Environment.GetEnvironmentVariable("OS");

Try to assign some value to the env var "OS":

	OS=foo mono yourapp.exe

Robert



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: System.TypeInitializationException on linux (vbnc)</title>
<author><name>Robert Jordan &lt;robertj@gmx.net&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3chfobgq$g7u$1@ger.gmane.org%3e"/>
<id>urn:uuid:%3chfobgq$g7u$1@ger-gmane-org%3e</id>
<updated>2009-12-09T14:16:57Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
On 09.12.2009 14:57, Johannes Drachenfels wrote:
&gt; Hi,
&gt;
&gt;
&gt;
&gt; sorry for re-posting - I am new to this list...
&gt;
&gt;
&gt;
&gt; I am switching from lucene.net 2.0.0.4 to lucene.net 2.9.1.1 and I have
&gt; problems at a very early stage.
&gt;
&gt;
&gt;
&gt; With version 2.0.0.4 it works fine!
&gt;
&gt; With version 2.9.1.1 on Microsoft it works fine
&gt;
&gt; With version 2.9.1.1 on linux I always get the following errors:
&gt;
&gt;
&gt;
&gt; ################################
&gt;
&gt; System.TypeInitializationException: An exception was thrown by the type
&gt; initializer for Lucene.Net.Store.FSDirectory ---&gt;
&gt; System.TypeInitializationException: An exception was thrown by the type
&gt; initializer for Lucene.Net.Util.Constants ---&gt;
&gt; System.NullReferenceException: Object reference not set to an instance
&gt; of an object
&gt;
&gt;    at Lucene.Net.Util.Constants..cctor () [0x00000]
&gt;
&gt;    --- End of inner exception stack trace ---
&gt;
&gt;    at Lucene.Net.Store.FSDirectory..cctor () [0x00000]
&gt;
&gt;    --- End of inner exception stack trace ---
&gt;
&gt;    at ConsoleApplication1.Module1.Main () [0x00000]


This is probably caused by this line in Lucene.Net.Util.Constants.cs:

public static readonly System.String OS_NAME = 
System.Environment.GetEnvironmentVariable("OS");

Try to assign some value to the env var "OS":

	OS=foo mono yourapp.exe
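
If that is indeed the cause, the underlying issue is that
GetEnvironmentVariable returns null when the variable is not set, and the
static initializer then presumably uses the value without a null check. A
minimal sketch of a null-safe read, assuming a fallback string is acceptable;
SafeConstants and the value "unknown" are invented for illustration and are not
the actual Constants.cs, and the StartsWith check is only an assumed example of
a later use of the value:

```csharp
using System;

// Hypothetical stand-in for the static fields in Constants.cs.
public static class SafeConstants
{
    // Falls back to "unknown" when the OS variable is not set.
    public static string ReadOsName()
    {
        return Environment.GetEnvironmentVariable("OS") ?? "unknown";
    }

    public static readonly string OS_NAME = ReadOsName();

    // Assumed later use of the value; safe because OS_NAME is never null.
    public static readonly bool WINDOWS = OS_NAME.StartsWith("Windows");
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine("OS_NAME = " + SafeConstants.OS_NAME);
        Console.WriteLine("WINDOWS = " + SafeConstants.WINDOWS);
    }
}
```

With a fallback like this the type initializer no longer depends on the
environment of the process, so no wrapper script or extra variable is needed.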

Robert



</pre>
</div>
</content>
</entry>
<entry>
<title>System.TypeInitializationException on linux (vbnc)</title>
<author><name>&quot;Johannes Drachenfels&quot; &lt;johannes@drachenfels.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c93515712821C364AA2A6A889A3497D5838B3A1@Bleich32.intern.drachenfels.de%3e"/>
<id>urn:uuid:%3c93515712821C364AA2A6A889A3497D5838B3A1@Bleich32-intern-drachenfels-de%3e</id>
<updated>2009-12-09T13:57:20Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

 

sorry for re-posting - I am new to this list...

 

I am switching from lucene.net 2.0.0.4 to lucene.net 2.9.1.1 and I have
problems at a very early stage.

 

With version 2.0.0.4 it works fine!

With version 2.9.1.1 on Microsoft it works fine

With version 2.9.1.1 on linux I always get the following errors:

 

################################

System.TypeInitializationException: An exception was thrown by the type
initializer for Lucene.Net.Store.FSDirectory ---&gt;
System.TypeInitializationException: An exception was thrown by the type
initializer for Lucene.Net.Util.Constants ---&gt;
System.NullReferenceException: Object reference not set to an instance
of an object

  at Lucene.Net.Util.Constants..cctor () [0x00000]

  --- End of inner exception stack trace ---

  at Lucene.Net.Store.FSDirectory..cctor () [0x00000]

  --- End of inner exception stack trace ---

  at ConsoleApplication1.Module1.Main () [0x00000]

################################

System.TypeInitializationException: An exception was thrown by the type
initializer for Lucene.Net.Index.IndexWriter ---&gt;
System.TypeInitializationException: An exception was thrown by the type
initializer for Lucene.Net.Index.DocumentsWriter ---&gt;
System.TypeInitializationException: An exception was thrown by the type
initializer for Lucene.Net.Util.Constants ---&gt;
System.NullReferenceException: Object reference not set to an instance
of an object

  at Lucene.Net.Util.Constants..cctor () [0x00000]

  --- End of inner exception stack trace ---

  at Lucene.Net.Store.FSDirectory..cctor () [0x00000]

  --- End of inner exception stack trace ---

  at Lucene.Net.Index.IndexWriter..cctor () [0x00000]

  --- End of inner exception stack trace ---

  at ConsoleApplication1.Module1.Main () [0x00000]

 

From debugging I found the reason for the errors:

 

            dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, True)

 

and

 

            indexWriter = New Lucene.Net.Index.IndexWriter(dirR, analyzer, True)

 

but I have no idea what is happening here. I have attached the
source of my test application!

 

Thanks for help &amp; Regards,

 

Johannes

 



</pre>
</div>
</content>
</entry>
<entry>
<title>Spatial search with Lucene.Net</title>
<author><name>&quot;Markus Wolters&quot; &lt;markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c00cc01ca78ab$f7ee70e0$e7cb52a0$@com%3e"/>
<id>urn:uuid:%3c00cc01ca78ab$f7ee70e0$e7cb52a0$@com%3e</id>
<updated>2009-12-09T08:45:32Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
What is the best way to incorporate spatial search data into lucene?

The Spatial.Net contrib, which has been ported by Robert Chapman, or the code
found here:

http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html

Markus





</pre>
</div>
</content>
</entry>
<entry>
<title>AW: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;Markus Wolters&quot; &lt;markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c00bf01ca78ab$83d84550$8b88cff0$@com%3e"/>
<id>urn:uuid:%3c00bf01ca78ab$83d84550$8b88cff0$@com%3e</id>
<updated>2009-12-09T08:42:13Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I especially need the numeric range feature: I run a full-text search
restricted to a specified distance, so I need to filter the results
by spatial data.

Is the Lucene Spatial contrib the right way to go? It's been ported by Roger
Chapman.

Markus


-----Original Message-----
From: Michael Garski [mailto:mgarski@myspace-inc.com]
Sent: Wednesday, December 9, 2009 02:26
To: lucene-net-user@incubator.apache.org
Subject: RE: Safe to use Release 2.9.1 Beta for production?

Two of the features I've been looking forward to and am now testing are
numeric fields, which provide significant performance improvements on
numeric range queries, and the underlying change to per-segment searching
and caching.  The latter sounds rather innocuous at first glance, but when
coupled with a custom merge policy can improve heap utilization for large
indexes.

You can certainly build your application against 2.4 and then migrate to 2.9
at a later time if you are under time constraints to deliver something
quickly.

Michael

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Tuesday, December 08, 2009 1:24 PM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

Thanks for your advice.

Are there any mature features I might miss when using version 2.4.0?

Markus

-----Original Message-----
From: xzxz@mail.ru [mailto:xzxz@mail.ru]
Sent: Tuesday, December 8, 2009 18:23
To: lucene-net-user@incubator.apache.org
Subject: Re: Safe to use Release 2.9.1 Beta for production?

I would not recommend using 2.9.1 in production right now (it may be
OK, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would I be better off with the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest version would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   









</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93AE0@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93AE0@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-12-09T01:25:35Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Two of the features I've been looking forward to and am now testing are numeric fields, which
provide significant performance improvements on numeric range queries, and the underlying
change to per-segment searching and caching.  The latter sounds rather innocuous at first
glance, but when coupled with a custom merge policy can improve heap utilization for large
indexes.

You can certainly build your application against 2.4 and then migrate to 2.9 at a later time
if you are under time constraints to deliver something quickly.

Michael

-----Original Message-----
From: Markus Wolters [mailto:markus@naxma.com] 
Sent: Tuesday, December 08, 2009 1:24 PM
To: lucene-net-user@incubator.apache.org
Subject: AW: Safe to use Release 2.9.1 Beta for production?

Thanks for your advice.

Are there any mature features I might miss when using version 2.4.0?

Markus

-----Original Message-----
From: xzxz@mail.ru [mailto:xzxz@mail.ru]
Sent: Tuesday, December 8, 2009 18:23
To: lucene-net-user@incubator.apache.org
Subject: Re: Safe to use Release 2.9.1 Beta for production?

I would not recommend using 2.9.1 in production right now (it may be
OK, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would I be better off with the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest version would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   







</pre>
</div>
</content>
</entry>
<entry>
<title>AW: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;Markus Wolters&quot; &lt;markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c019701ca784c$cba29d40$62e7d7c0$@com%3e"/>
<id>urn:uuid:%3c019701ca784c$cba29d40$62e7d7c0$@com%3e</id>
<updated>2009-12-08T21:24:15Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Thanks for your advice.

Are there any mature features I might miss when using version 2.4.0?

Markus

-----Original Message-----
From: xzxz@mail.ru [mailto:xzxz@mail.ru]
Sent: Tuesday, December 8, 2009 18:23
To: lucene-net-user@incubator.apache.org
Subject: Re: Safe to use Release 2.9.1 Beta for production?

I would not recommend using 2.9.1 in production right now (it may be
OK, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would I be better off with the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest version would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   





</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;xzxz@mail.ru&quot; &lt;xzxz@mail.ru&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c4B1E8B80.9080503@mail.ru%3e"/>
<id>urn:uuid:%3c4B1E8B80-9080503@mail-ru%3e</id>
<updated>2009-12-08T17:23:12Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I would not recommend using 2.9.1 in production right now (it may be
OK, but it may not). It is safer to use 2.4.0.

---
Andrei

Markus Wolters wrote:
&gt; Hi @all,
&gt;
&gt;  
&gt;
&gt; I'm new to Lucene(.NET) and want to include it into my current ASP.NET
&gt; project I'm working on to support fuzzy full-text searches.
&gt;
&gt;  
&gt;
&gt; I am unsure which version to use. Is it already safe to use the /trunk
&gt; version, which is 2.9.1 Beta right now? Or would I be better off with the
&gt; 2.4.0 tag? I need special queries including spatial data, so I thought the
&gt; latest version would be a good starting point...
&gt;
&gt;  
&gt;
&gt; Anyways, keep up the good work!
&gt;
&gt;  
&gt;
&gt; I did some comparison to Sphinx and Xapian, but I think at least for me,
&gt; Lucene is the right way to go.
&gt;
&gt;  
&gt;
&gt; Cheers,
&gt;
&gt; Markus
&gt;
&gt;  
&gt;
&gt;
&gt;   



</pre>
</div>
</content>
</entry>
<entry>
<title>Safe to use Release 2.9.1 Beta for production?</title>
<author><name>&quot;Markus Wolters&quot; &lt;Markus@naxma.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3cA65FF87DCF246B4D9302C35FFC6050749AB3@dumbledore.hogwarts.naxma.net%3e"/>
<id>urn:uuid:%3cA65FF87DCF246B4D9302C35FFC6050749AB3@dumbledore-hogwarts-naxma-net%3e</id>
<updated>2009-12-08T17:10:58Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi @all,

 

I'm new to Lucene(.NET) and want to include it into my current ASP.NET
project I'm working on to support fuzzy full-text searches.

 

I am unsure which version to use. Is it already safe to use the /trunk
version, which is 2.9.1 Beta right now? Or would I be better off with the
2.4.0 tag? I need special queries including spatial data, so I thought the
latest version would be a good starting point...

 

Anyways, keep up the good work!

 

I did some comparison to Sphinx and Xapian, but I think at least for me,
Lucene is the right way to go.

 

Cheers,

Markus

 



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to search number in text field?</title>
<author><name>Floyd Wu &lt;floyd.wu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c103c59c90912061944g6eb78ca4i1bdb4dbbc7a6643a@mail.gmail.com%3e"/>
<id>urn:uuid:%3c103c59c90912061944g6eb78ca4i1bdb4dbbc7a6643a@mail-gmail-com%3e</id>
<updated>2009-12-07T03:44:12Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Michael

The query is constructed from a query string using QueryParser.
The relevant part of my code is as follows:

string queryText = "title:2009123";
QueryParser parser = new QueryParser(SpecialFields.Title, Analyzer);
parser.SetLowercaseExpandedTerms(false);
Lucene.Net.Search.Query query = parser.Parse(queryText);
searcher.Search(query); // searcher is an IndexSearcher instance

Floyd



2009/12/5 Michael Garski &lt;mgarski@myspace-inc.com&gt;

&gt; Floyd,
&gt;
&gt; How are you constructing the query?
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Floyd Wu [mailto:floyd.wu@gmail.com]
&gt; Sent: Thursday, December 03, 2009 10:33 PM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: How to search number in text field?
&gt;
&gt; Hi all,
&gt; When documents are indexed with StandardAnalyzer, Lucene.Net does not seem
&gt; to find numbers in text fields that contain numbers.
&gt; For example, I have built an index containing a document whose title is
&gt; 2009123.
&gt; The query string "title:2009" returns no records, and neither does
&gt; title:2009*, even though a record with the title value 2009123 exists.
&gt; How can I find this record when my client really wants to use 2009 as
&gt; the search keyword?
&gt;
&gt; Thanks
&gt;
&gt; Floyd
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: How to search number in text field?</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93A33@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93A33@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-12-04T18:04:44Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Floyd,

How are you constructing the query?

Michael

-----Original Message-----
From: Floyd Wu [mailto:floyd.wu@gmail.com] 
Sent: Thursday, December 03, 2009 10:33 PM
To: lucene-net-user@incubator.apache.org
Subject: How to search number in text field?

Hi all,
When documents are indexed with StandardAnalyzer, Lucene.Net does not seem
to find numbers in text fields that contain numbers.
For example, I have built an index containing a document whose title is
2009123.
The query string "title:2009" returns no records, and neither does
title:2009*, even though a record with the title value 2009123 exists.
How can I find this record when my client really wants to use 2009 as
the search keyword?

Thanks

Floyd



</pre>
</div>
</content>
</entry>
<entry>
<title>How to search number in text field?</title>
<author><name>Floyd Wu &lt;floyd.wu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200912.mbox/%3c103c59c90912032233w58aa9bddua09420769815025@mail.gmail.com%3e"/>
<id>urn:uuid:%3c103c59c90912032233w58aa9bddua09420769815025@mail-gmail-com%3e</id>
<updated>2009-12-04T06:33:19Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi all,
When documents are indexed with StandardAnalyzer, Lucene.Net does not seem
to find numbers in text fields that contain numbers.
For example, I have built an index containing a document whose title is
2009123.
The query string "title:2009" returns no records, and neither does
title:2009*, even though a record with the title value 2009123 exists.
How can I find this record when my client really wants to use 2009 as
the search keyword?

Thanks

Floyd


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: [ask] index out of range exceptions.</title>
<author><name>Marcelino Ponty &lt;marcelino_ponty@yahoo.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c392053.43134.qm@web56602.mail.re3.yahoo.com%3e"/>
<id>urn:uuid:%3c392053-43134-qm@web56602-mail-re3-yahoo-com%3e</id>
<updated>2009-11-27T04:42:41Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Michael, thanks for your assistance!

I haven't tried Lucene 2.4 yet, although I have downloaded it from SVN, because I would prefer
a prebuilt .dll file. I haven't tried building it in Visual Studio; I'm afraid building it by
hand will take me some more time. In the end I worked around the problem by indexing 30000 docs,
closing the writer, and then indexing the next 30000 docs. That seems to work, though I still
don't understand why. I suspect it's because each of my documents has too many fields and each
field's content is rather long. I tried indexing one field only, and that works even for 100000
docs, so it must have something to do with the content of the docs.

As for the catch statement, thanks for the link; it gave me some insight into how to use it properly!

Thanks again, Michael, for your suggestions!




________________________________
From: Michael Garski &lt;mgarski@myspace-inc.com&gt;
To: lucene-net-user@incubator.apache.org
Sent: Fri, November 27, 2009 4:05:59 AM
Subject: RE: [ask] index out of range exceptions.

Marcelino - 

Give it a try with Lucene 2.4 - there is a tag in the SVN repository for it.  If it still
occurs, a stack trace with line numbers inside of Lucene would be helpful.  You may want to
remove your catch statement completely as you are just re-throwing the exception anyways and
could be suppressing the source.

http://blogs.msdn.com/jmstall/archive/2007/02/07/catch-rethrow.aspx

Michael


-----Original Message-----
From: Marcelino Ponty [mailto:marcelino_ponty@yahoo.com]
Sent: Thu 11/26/2009 12:51 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: [ask] index out of range exceptions.

Hi Michael!

Thanks for your fast response!

In the stack trace:
at Lucene.Net.Index.DocumentsWriter.Abort(AbortException ae)
at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer analyzer, Term delTerm)
at Lucene.Net.Index.DocumentsWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc)
at Atmalib.Lucene.Indexing.ExecuteIndexing(Directory dir) in C:\Documents and Settings\Administrator\My Documents\Visual Studio 2005\Projects\CreateIndex\CreateIndex\App_Code\Indexing.cs:line 95

I use Lucene.NET version 2.3.1.2. 

And my program is absolutely simple: 

//main program
        protected static void ExecuteIndexing(Directory dir)
        {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            try
            {
                string query = "select top 40000 * from v_simple_artikel";
//use public class method to query database
                SqlDataReader row = Atmalib.DataAccess.GetDataReaderFromQuery(query); 

//read query results and add document
                while (row.Read())
                {
                   Document doc = new Document();
               doc.Add(new Field("kode_koleksi", row["kode_koleksi"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));
               doc.Add(new Field("kode_artikel", row["kode_artikel"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));

//....
//adding many other fields
//...

Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
               doc.Add(new Field("kata_kunci_jurnal", row["kata_kunci_jurnal"].ToString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
                   writer.AddDocument(doc);
                }
                writer.Optimize();
            }
            catch (Exception exp)
            {
                throw exp;
            }

            finally
            {
                writer.Close();
            }
        }
//end of main program

I'll be very grateful if you can give me assistance. Thank you!!



________________________________
From: Michael Garski &lt;mgarski@myspace-inc.com&gt;
To: lucene-net-user@incubator.apache.org
Sent: Fri, November 27, 2009 2:51:19 AM
Subject: Re: [ask] index out of range exceptions.

Hi Marcelino,

Can you provide the stack trace from the exception and a code snippet/description of what
you are doing when it is thrown along with the version of Lucene.net you are using?

Michael

On Nov 26, 2009, at 11:48 AM, "Marcelino Ponty" &lt;marcelino_ponty@yahoo.com&gt; wrote:

&gt; Hi all!
&gt; 
&gt; I'm a new user of Lucene.NET, but I have experience once in using Ferret which is written
in Ruby language. I've come to a problem and hope any of you can help me.
&gt; 
&gt; I'm going to index 50,000 doc, but it failed and give
&gt; 
&gt; System.IndexOutOfRangeExceptions: Index was outside the bound of the array.
&gt; 
&gt; I test with 30,000 doc and it succeed. I think this has something to do with basic predefined
parameters while setting up the indexing process which I should have set, but I don't know
which parameter it is and where should I set it. Do you guys have any idea? I think the answer
should be simple. In Ferret, I didn't meet this problem.
&gt; 
&gt; Thanks for any assistance!
&gt; Regards,
&gt; Marcelino Ponty
&gt; (Phone +62819 - 3223 54 84)
&gt; "Ad Maiorem Dei Gloriam"
&gt; 
&gt; 


      

</pre>
</div>
</content>
</entry>
<entry>
<title>RE: [ask] index out of range exceptions.</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E046CFB@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E046CFB@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-26T21:05:59Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Marcelino - 

Give it a try with Lucene 2.4 - there is a tag in the SVN repository for it.  If it still
occurs, a stack trace with line numbers inside of Lucene would be helpful.  You may want to
remove your catch statement completely as you are just re-throwing the exception anyways and
could be suppressing the source.

http://blogs.msdn.com/jmstall/archive/2007/02/07/catch-rethrow.aspx
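
As a minimal sketch of the point above (not code from this thread; the class
and method names are made up for illustration), the following C# console
program contrasts "throw exp;", which restarts the stack trace at the rethrow
site, with a bare "throw;", which keeps the frames of the original failure:

```csharp
using System;

static class RethrowDemo
{
    // Hypothetical stand-in for a call that fails deep inside a library.
    static void Fail()
    {
        throw new InvalidOperationException("indexing failed");
    }

    static void RethrowWithReset()
    {
        try { Fail(); }
        catch (Exception exp)
        {
            throw exp;   // stack trace now starts here; the Fail() frame is lost
        }
    }

    static void RethrowPreserving()
    {
        try { Fail(); }
        catch (Exception)
        {
            throw;       // original stack trace, including Fail(), is kept
        }
    }

    static void Main()
    {
        try { RethrowWithReset(); }
        catch (Exception e) { Console.WriteLine(e.StackTrace); }

        Console.WriteLine("----");

        try { RethrowPreserving(); }
        catch (Exception e) { Console.WriteLine(e.StackTrace); }
    }
}
```

If the catch block does nothing but rethrow, removing it entirely, as
suggested above, has much the same effect as the bare "throw;".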

Michael


-----Original Message-----
From: Marcelino Ponty [mailto:marcelino_ponty@yahoo.com]
Sent: Thu 11/26/2009 12:51 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: [ask] index out of range exceptions.
 
Hi Michael!

Thanks for your fast response!

In the stack trace:
at Lucene.Net.Index.DocumentsWriter.Abort(AbortException ae)
at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer analyzer, Term delTerm)
at Lucene.Net.Index.DocumentsWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc)
at Atmalib.Lucene.Indexing.ExecuteIndexing(Directory dir) in C:\Documents and Settings\Administrator\My Documents\Visual Studio 2005\Projects\CreateIndex\CreateIndex\App_Code\Indexing.cs:line 95

I use Lucene.NET version 2.3.1.2. 

And my program is absolutely simple: 

//main program
        protected static void ExecuteIndexing(Directory dir)
        {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            try
            {
                string query = "select top 40000 * from v_simple_artikel";
//use public class method to query database
                SqlDataReader row = Atmalib.DataAccess.GetDataReaderFromQuery(query); 

//read query results and add document
                while (row.Read())
                {
                   Document doc = new Document();
               doc.Add(new Field("kode_koleksi", row["kode_koleksi"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));
               doc.Add(new Field("kode_artikel", row["kode_artikel"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));

//....
//adding many other fields
//...

Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
               doc.Add(new Field("kata_kunci_jurnal", row["kata_kunci_jurnal"].ToString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
                   writer.AddDocument(doc);
                }
                writer.Optimize();
            }
            catch (Exception exp)
            {
                throw exp;
            }

            finally
            {
                writer.Close();
            }
        }
//end of main program

I'll be very grateful if you can give me assistance. Thank you!!



________________________________
From: Michael Garski &lt;mgarski@myspace-inc.com&gt;
To: lucene-net-user@incubator.apache.org
Sent: Fri, November 27, 2009 2:51:19 AM
Subject: Re: [ask] index out of range exceptions.

Hi Marcelino,

Can you provide the stack trace from the exception and a code snippet/description of what
you are doing when it is thrown along with the version of Lucene.net you are using?

Michael

On Nov 26, 2009, at 11:48 AM, "Marcelino Ponty" &lt;marcelino_ponty@yahoo.com&gt; wrote:

&gt; Hi all!
&gt; 
&gt; I'm a new user of Lucene.NET, but I have experience once in using Ferret which is written
in Ruby language. I've come to a problem and hope any of you can help me.
&gt; 
&gt; I'm going to index 50,000 doc, but it failed and give
&gt; 
&gt; System.IndexOutOfRangeExceptions: Index was outside the bound of the array.
&gt; 
&gt; I test with 30,000 doc and it succeed. I think this has something to do with basic predefined
parameters while setting up the indexing process which I should have set, but I don't know
which parameter it is and where should I set it. Do you guys have any idea? I think the answer
should be simple. In Ferret, I didn't meet this problem.
&gt; 
&gt; Thanks for any assistance!
&gt; Regards,
&gt; Marcelino Ponty
&gt; (Phone +62819 - 3223 54 84)
&gt; "Ad Maiorem Dei Gloriam"
&gt; 
&gt; 


      

 


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: [ask] index out of range exceptions.</title>
<author><name>Marcelino Ponty &lt;marcelino_ponty@yahoo.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c39217.72311.qm@web56608.mail.re3.yahoo.com%3e"/>
<id>urn:uuid:%3c39217-72311-qm@web56608-mail-re3-yahoo-com%3e</id>
<updated>2009-11-26T20:51:37Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Michael!

Thanks for your fast response!

In the stack trace:
at Lucene.Net.Index.DocumentsWriter.Abort(AbortException ae)
at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer analyzer, Term delTerm)
at Lucene.Net.Index.DocumentsWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)
at Lucene.Net.Index.IndexWriter.AddDocument(Document doc)
at Atmalib.Lucene.Indexing.ExecuteIndexing(Directory dir) in C:\Documents and Settings\Administrator\My Documents\Visual Studio 2005\Projects\CreateIndex\CreateIndex\App_Code\Indexing.cs:line 95

I use Lucene.NET version 2.3.1.2. 

And my program is absolutely simple: 

//main program
        protected static void ExecuteIndexing(Directory dir)
        {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            try
            {
                string query = "select top 40000 * from v_simple_artikel";
//use public class method to query database
                SqlDataReader row = Atmalib.DataAccess.GetDataReaderFromQuery(query); 

//read query results and add document
                while (row.Read())
                {
                   Document doc = new Document();
               doc.Add(new Field("kode_koleksi", row["kode_koleksi"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));
               doc.Add(new Field("kode_artikel", row["kode_artikel"].ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO));

//....
//adding many other fields
//...

Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
               doc.Add(new Field("kata_kunci_jurnal", row["kata_kunci_jurnal"].ToString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
                   writer.AddDocument(doc);
                }
                writer.Optimize();
            }
            catch (Exception exp)
            {
                throw exp;
            }

            finally
            {
                writer.Close();
            }
        }
//end of main program

I'll be very grateful if you can give me assistance. Thank you!!



________________________________
From: Michael Garski &lt;mgarski@myspace-inc.com&gt;
To: lucene-net-user@incubator.apache.org
Sent: Fri, November 27, 2009 2:51:19 AM
Subject: Re: [ask] index out of range exceptions.

Hi Marcelino,

Can you provide the stack trace from the exception and a code snippet/description of what
you are doing when it is thrown along with the version of Lucene.net you are using?

Michael

On Nov 26, 2009, at 11:48 AM, "Marcelino Ponty" &lt;marcelino_ponty@yahoo.com&gt; wrote:

&gt; Hi all!
&gt; 
&gt; I'm a new user of Lucene.NET, but I have experience once in using Ferret which is written
in Ruby language. I've come to a problem and hope any of you can help me.
&gt; 
&gt; I'm going to index 50,000 doc, but it failed and give
&gt; 
&gt; System.IndexOutOfRangeExceptions: Index was outside the bound of the array.
&gt; 
&gt; I test with 30,000 doc and it succeed. I think this has something to do with basic predefined
parameters while setting up the indexing process which I should have set, but I don't know
which parameter it is and where should I set it. Do you guys have any idea? I think the answer
should be simple. In Ferret, I didn't meet this problem.
&gt; 
&gt; Thanks for any assistance!
&gt; Regards,
&gt; Marcelino Ponty
&gt; (Phone +62819 - 3223 54 84)
&gt; "Ad Maiorem Dei Gloriam"
&gt; 
&gt; 


      

</pre>
</div>
</content>
</entry>
<entry>
<title>Re: [ask] index out of range exceptions.</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7C62CEB0-6DD0-452D-BB73-E2AB15A6AD47@myspace-inc.com%3e"/>
<id>urn:uuid:%3c7C62CEB0-6DD0-452D-BB73-E2AB15A6AD47@myspace-inc-com%3e</id>
<updated>2009-11-26T19:51:19Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Marcelino,

Can you provide the stack trace from the exception and a code
snippet/description of what you are doing when it is thrown, along with
the version of Lucene.net you are using?

Michael

On Nov 26, 2009, at 11:48 AM, "Marcelino Ponty" &lt;marcelino_ponty@yahoo.com 
 &gt; wrote:

&gt; Hi all!
&gt;
&gt; I'm a new user of Lucene.NET, but I have experience once in using  
&gt; Ferret which is written in Ruby language. I've come to a problem and  
&gt; hope any of you can help me.
&gt;
&gt; I'm going to index 50,000 doc, but it failed and give
&gt;
&gt; System.IndexOutOfRangeExceptions: Index was outside the bound of the  
&gt; array.
&gt;
&gt; I test with 30,000 doc and it succeed. I think this has something to  
&gt; do with basic predefined parameters while setting up the indexing  
&gt; process which I should have set, but I don't know which parameter it  
&gt; is and where should I set it. Do you guys have any idea? I think the  
&gt; answer should be simple. In Ferret, I didn't meet this problem.
&gt;
&gt; Thanks for any assistance!
&gt; Regards,
&gt; Marcelino Ponty
&gt; (Phone +62819 - 3223 54 84)
&gt; "Ad Maiorem Dei Gloriam"
&gt;
&gt;



</pre>
</div>
</content>
</entry>
<entry>
<title>[ask] index out of range exceptions.</title>
<author><name>Marcelino Ponty &lt;marcelino_ponty@yahoo.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c714151.16138.qm@web56601.mail.re3.yahoo.com%3e"/>
<id>urn:uuid:%3c714151-16138-qm@web56601-mail-re3-yahoo-com%3e</id>
<updated>2009-11-26T19:47:46Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi all!

I'm a new user of Lucene.NET, though I have some prior experience with Ferret, which is written
in Ruby. I've run into a problem and hope one of you can help me.

I'm trying to index 50,000 docs, but it fails with

System.IndexOutOfRangeException: Index was outside the bounds of the array.

A test with 30,000 docs succeeds. I think this has to do with some basic predefined parameter
of the indexing process that I should have set, but I don't know which parameter it is or
where to set it. Do you have any idea? I think the answer should be simple. In Ferret, I
didn't run into this problem.

Thanks for any assistance!
Regards,
Marcelino Ponty
(Phone +62819 - 3223 54 84)
"Ad Maiorem Dei Gloriam"


      

</pre>
</div>
</content>
</entry>
<entry>
<title>RE: IndexWriter is slow when reader is open</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D9378A@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D9378A@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-17T20:20:57Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Eran,

The transactional functionality can roll back changes to an index should
something happen during a commit.  Refer to the methods PrepareCommit &amp;
Rollback.  You would have to implement your own logic to re-process any
changes that were rolled back.
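
A rough sketch of that pattern, assuming the Lucene.Net 2.9 IndexWriter API
(AddDocument, PrepareCommit, Commit, Rollback); the BatchIndexer class, the
AddBatch method, and the caller-managed batch array are hypothetical names
used only for illustration:

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;

static class BatchIndexer
{
    // Adds a whole batch and commits once, instead of once per document.
    // The caller keeps the source data for the batch until this returns.
    public static void AddBatch(IndexWriter writer, Document[] batch)
    {
        try
        {
            foreach (Document doc in batch)
            {
                writer.AddDocument(doc);
            }

            writer.PrepareCommit();  // first phase: write the changes out
            writer.Commit();         // second phase: make them visible
        }
        catch
        {
            // Rollback discards everything since the last commit and closes
            // the writer, so a new IndexWriter is needed before the caller
            // re-processes the batch.
            writer.Rollback();
            throw;
        }
    }
}
```

The batch size is a trade-off left to the caller: larger batches mean fewer
commits, but more source data to keep around for re-processing.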

Michael

-----Original Message-----
From: Eran Sevi [mailto:eransevi@gmail.com] 
Sent: Tuesday, November 17, 2009 9:30 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: IndexWriter is slow when reader is open

Thanks, Michael, for the detailed explanation. It's much clearer now.

By "transactional capabilities" do you mean that if something happens in the
middle of a commit, it is guaranteed that either all the data added since the
last commit is in the index or all of it is discarded?

We have a steady stream of documents for indexing coming in (unfortunately
only one at a time, but at a rate of up to 50 per second) and I hoped I
could guarantee that when the add method returns, the document is secured on
disk. We keep a status for each document in our DB and want to discard the
original data.

We'll just have to hang on to the original data until each commit has
finished and in case of a crash or error reindex the original data.

Eran.

On Tue, Nov 17, 2009 at 5:59 PM, Michael Garski
&lt;mgarski@myspace-inc.com&gt; wrote:

&gt; Eran,
&gt;
&gt; Make no mistake, the poor performance you are experiencing is due to
&gt; calling commit on every document addition and not due to internal 'coding by
&gt; exception'.  There are transactional capabilities of Lucene that will ensure
&gt; that your documents are added and persisted to disk.  Check out the
&gt; IndexWriter documentation for more information.
&gt;
&gt; The only 'connection' between the reader and the writer are the files on
&gt; disk.  The writer writes them once, they are not updated, and the reader
&gt; holds a reference to the file to ensure it is not deleted out from
&gt; underneath it as it still needs to read from it to perform searches.
&gt;
&gt; During a commit, all of your changes are written to disk and any necessary
&gt; segment merges take place, which leaves the older segments that were merged
&gt; together as 'orphans' that are no longer referenced by the segments file and
&gt; are cleaned up during the final stage of the commit process after all of the
&gt; new segments have been written.  An attempt is made to then clean up the
&gt; older segments that are no longer necessary, which will fail as your reader
&gt; still has them open.  It fails gracefully in that the file names are
&gt; persisted internally to attempt to delete again later, hopefully after the
&gt; reader has been reopened and a reference to the orphaned files is no longer
&gt; being held.
&gt;
&gt; I suggest you step through the commit process in a debugger or use a
&gt; profiler to demonstrate this issue.
&gt;
&gt; Michael
&gt;
&gt;
&gt;
&gt; -----Original Message-----
&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; Sent: Tue 11/17/2009 4:55 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: IndexWriter is slow when reader is open
&gt;
&gt; Michael,
&gt; Thanks for the answer.
&gt;
&gt; I thought the reader was less connected to the writer. Basically what your
&gt; saying is that as long as at least one reader is open, exceptions are thrown
&gt; when trying to commit changes (or more accurately, when trying to merge
&gt; segments) ?
&gt; Can you point me to the place in the source code where that happens?
&gt;
&gt; What happens to the new documents that were added? are they still saved in
&gt; another segments?
&gt;
&gt; It's very important to us to make sure every document is persistent in the
&gt; index so working in batches could be a problem.
&gt; But if there's a way to save each added document to disk without merging the
&gt; segment with older segments, this can solve our problem. And since the
&gt; reader can't see the new segments anyway until it's reopened, I don't see a
&gt; problem continuing writing documents to new segments without performing a
&gt; merge. I'll try to change the merge policy/scheduler and see what happens.
&gt;
&gt; Anyway, coding by exception is quite bad practice. Since we're following the
&gt; java versions I guess it'll take time to be able to change that.
&gt;
&gt; Eran.
&gt;
&gt; On Mon, Nov 16, 2009 at 8:56 PM, Michael Garski
&lt;mgarski@myspace-inc.com
&gt; &gt;wrote:
&gt;
&gt; &gt; Eran,
&gt; &gt;
&gt; &gt; The root cause of the issue is due to calling commit after every
document
&gt; &gt; addition while having a reader open.  Calls to commit should be
batched
&gt; up -
&gt; &gt; we frequently use batches of 100 or 1000 between commits.
&gt; &gt;
&gt; &gt; This is by design within Lucene.  Adding documents will cause
segments to
&gt; &gt; merge and the writer will then delete the older segments that have
been
&gt; &gt; merged together to create a new one, however with an open reader the
&gt; writer
&gt; &gt; will not be able to delete the older segment due to a file lock held
by
&gt; the
&gt; &gt; reader.  On the call to delete the file an exception is thrown and
&gt; swallowed
&gt; &gt; internally and the name of the file that the delete was attempted
upon is
&gt; &gt; added to a list of files that can be deleted on another call.
&gt; &gt;
&gt; &gt; I suggest you refrain from calling commit so often, as that is why
you
&gt; are
&gt; &gt; experiencing performance issues.
&gt; &gt;
&gt; &gt; Michael
&gt; &gt;
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; Sent: Mon 11/16/2009 5:07 AM
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: Re: IndexWriter is slow when reader is open
&gt; &gt;
&gt; &gt; I've tried to use it with read-only mode and it looks like it's even
&gt; worse
&gt; &gt; right now.
&gt; &gt;
&gt; &gt; I must admit that we're abusing the indexing a bit by commiting
after
&gt; each
&gt; &gt; document addition, but still when there's no reader open, each
document
&gt; is
&gt; &gt; indexed in about 30-50ms and when there's a read-only reader open
then
&gt; each
&gt; &gt; document is indexed in about 150-500ms.
&gt; &gt; Why should an open reader affect the commit process so deeply?
&gt; &gt;
&gt; &gt; I wonder if no one encountered this phenomena before.
&gt; &gt;
&gt; &gt;
&gt; &gt; On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt
&lt;mbhoneycutt@gmail.com
&gt; &gt; &gt;wrote:
&gt; &gt;
&gt; &gt; &gt; 2.4 does indeed support read-only mode. I don't know how much it
will
&gt; &gt; &gt; help, but I would definitely try it.
&gt; &gt; &gt;
&gt; &gt; &gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; &gt; &gt; I'm still using version 2.4 so I think there's still no read
only
&gt; mode.
&gt; &gt; &gt; &gt; Is there no other way to prevent this slow down in previous
versions?
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; Eran.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;&gt; Eran,
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; What version of Lucene are you using?  Are you opening the
&gt; IndexReader
&gt; &gt; &gt; &gt;&gt; in read-only mode?
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Michael
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; -----Original Message-----
&gt; &gt; &gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; &gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt; &gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Hi,
&gt; &gt; &gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index
&gt; &gt; documents
&gt; &gt; &gt; &gt;&gt; while there's at least one IndexReader open on that index (even
&gt; &gt; without
&gt; &gt; &gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3
to 5.
&gt; &gt; &gt; &gt;&gt; When
&gt; &gt; &gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt; &gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;  My index is going to be updated regularly and there's going to
be a
&gt; &gt; &gt; &gt;&gt; reader/searcher in use almost all the time so this might be a
big
&gt; &gt; &gt; &gt;&gt; problem
&gt; &gt; &gt; &gt;&gt; for me.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it
&gt; happen
&gt; &gt; &gt; &gt;&gt; and
&gt; &gt; &gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Thanks,
&gt; &gt; &gt; &gt;&gt; Eran.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt;
&gt;
&gt;



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Eran Sevi &lt;eransevi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c74f928500911170929r5dab9d6fu7d8222d1666a93d1@mail.gmail.com%3e"/>
<id>urn:uuid:%3c74f928500911170929r5dab9d6fu7d8222d1666a93d1@mail-gmail-com%3e</id>
<updated>2009-11-17T17:29:47Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Thanks, Michael, for the detailed explanation. It's much clearer now.

By "transactional capabilities" do you mean that if something happens in the
middle of a commit, it is guaranteed that either all the data added since the
last commit is in the index or all of it is discarded?

We have a steady stream of documents for indexing coming in (unfortunately
only one at a time, but at a rate of up to 50 per second) and I hoped I
could guarantee that when the add method returns, the document is secured on
disk. We keep a status for each document in our DB and want to discard the
original data.

We'll just have to hang on to the original data until each commit has
finished and in case of a crash or error reindex the original data.

Eran.

On Tue, Nov 17, 2009 at 5:59 PM, Michael Garski &lt;mgarski@myspace-inc.com&gt; wrote:

&gt; Eran,
&gt;
&gt; Make no mistake, the poor performance you are experiencing is due to
&gt; calling commit on every document addition and not due to internal 'coding by
&gt; exception'.  There are transactional capabilities of Lucene that will ensure
&gt; that your documents are added and persisted to disk.  Check out the
&gt; IndexWriter documentation for more information.
&gt;
&gt; The only 'connection' between the reader and the writer are the files on
&gt; disk.  The writer writes them once, they are not updated, and the reader
&gt; holds a reference to the file to ensure it is not deleted out from
&gt; underneath it as it still needs to read from it to perform searches.
&gt;
&gt; During a commit, all of your changes are written to disk and any necessary
&gt; segment merges take place, which leaves the older segments that were merged
&gt; together as 'orphans' that are no longer referenced by the segments file and
&gt; are cleaned up during the final stage of the commit process after all of the
&gt; new segments have been written.  An attempt is made to then clean up the
&gt; older segments that are no longer necessary, which will fail as your reader
&gt; still has them open.  It fails gracefully in that the file names are
&gt; persisted internally to attempt to delete again later, hopefully after the
&gt; reader has been reopened and a reference to the orphaned files is no longer
&gt; being held.
&gt;
&gt; I suggest you step through the commit process in a debugger or use a
&gt; profiler to demonstrate this issue.
&gt;
&gt; Michael
&gt;
&gt;
&gt;
&gt; -----Original Message-----
&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; Sent: Tue 11/17/2009 4:55 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: IndexWriter is slow when reader is open
&gt;
&gt; Michael,
&gt; Thanks for the answer.
&gt;
&gt; I thought the reader was less connected to the writer. Basically what your
&gt; saying is that as long as at least one reader is open, exceptions are
&gt; thrown
&gt; when trying to commit changes (or more accurately, when trying to merge
&gt; segments) ?
&gt; Can you point me to the place in the source code where that happens?
&gt;
&gt; What happens to the new documents that were added? are they still saved in
&gt; another segments?
&gt;
&gt; It's very important to us to make sure every document is persistent in the
&gt; index so working in batches could be a problem.
&gt; But if there's a way to save each added document to disk without merging
&gt; the
&gt; segment with older segments, this can solve our problem. And since the
&gt; reader can't see the new segments anyway until it's reopened, I don't see a
&gt; problem continuing writing documents to new segments without performing a
&gt; merge. I'll try to change the merge policy/scheduler and see what happens.
&gt;
&gt; Anyway, coding by exception is quite bad practice. Since we're following
&gt; the
&gt; java versions I guess it'll take time to be able to change that.
&gt;
&gt; Eran.
&gt;
&gt; On Mon, Nov 16, 2009 at 8:56 PM, Michael Garski &lt;mgarski@myspace-inc.com
&gt; &gt;wrote:
&gt;
&gt; &gt; Eran,
&gt; &gt;
&gt; &gt; The root cause of the issue is due to calling commit after every document
&gt; &gt; addition while having a reader open.  Calls to commit should be batched
&gt; up -
&gt; &gt; we frequently use batches of 100 or 1000 between commits.
&gt; &gt;
&gt; &gt; This is by design within Lucene.  Adding documents will cause segments to
&gt; &gt; merge and the writer will then delete the older segments that have been
&gt; &gt; merged together to create a new one, however with an open reader the
&gt; writer
&gt; &gt; will not be able to delete the older segment due to a file lock held by
&gt; the
&gt; &gt; reader.  On the call to delete the file an exception is thrown and
&gt; swallowed
&gt; &gt; internally and the name of the file that the delete was attempted upon is
&gt; &gt; added to a list of files that can be deleted on another call.
&gt; &gt;
&gt; &gt; I suggest you refrain from calling commit so often, as that is why you
&gt; are
&gt; &gt; experiencing performance issues.
&gt; &gt;
&gt; &gt; Michael
&gt; &gt;
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; Sent: Mon 11/16/2009 5:07 AM
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: Re: IndexWriter is slow when reader is open
&gt; &gt;
&gt; &gt; I've tried to use it with read-only mode and it looks like it's even
&gt; worse
&gt; &gt; right now.
&gt; &gt;
&gt; &gt; I must admit that we're abusing the indexing a bit by commiting after
&gt; each
&gt; &gt; document addition, but still when there's no reader open, each document
&gt; is
&gt; &gt; indexed in about 30-50ms and when there's a read-only reader open then
&gt; each
&gt; &gt; document is indexed in about 150-500ms.
&gt; &gt; Why should an open reader affect the commit process so deeply?
&gt; &gt;
&gt; &gt; I wonder if no one encountered this phenomena before.
&gt; &gt;
&gt; &gt;
&gt; &gt; On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com
&gt; &gt; &gt;wrote:
&gt; &gt;
&gt; &gt; &gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt; &gt; &gt; help, but I would definitely try it.
&gt; &gt; &gt;
&gt; &gt; &gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; &gt; &gt; I'm still using version 2.4 so I think there's still no read only
&gt; mode.
&gt; &gt; &gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; Eran.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;&gt; Eran,
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; What version of Lucene are you using?  Are you opening the
&gt; IndexReader
&gt; &gt; &gt; &gt;&gt; in read-only mode?
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Michael
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; -----Original Message-----
&gt; &gt; &gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; &gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt; &gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Hi,
&gt; &gt; &gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index
&gt; &gt; documents
&gt; &gt; &gt; &gt;&gt; while there's at least one IndexReader open on that index (even
&gt; &gt; without
&gt; &gt; &gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to
5.
&gt; &gt; &gt; &gt;&gt; When
&gt; &gt; &gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt; &gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;  My index is going to be updated regularly and there's going to be
a
&gt; &gt; &gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt; &gt; &gt; &gt;&gt; problem
&gt; &gt; &gt; &gt;&gt; for me.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it
&gt; happen
&gt; &gt; &gt; &gt;&gt; and
&gt; &gt; &gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt; Thanks,
&gt; &gt; &gt; &gt;&gt; Eran.
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;&gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt;
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: IndexWriter is slow when reader is open</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E046CF6@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E046CF6@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-17T15:59:19Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Eran,

Make no mistake, the poor performance you are experiencing is due to calling commit on every
document addition and not due to internal 'coding by exception'.  There are transactional
capabilities of Lucene that will ensure that your documents are added and persisted to disk.
Check out the IndexWriter documentation for more information.

The only 'connection' between the reader and the writer are the files on disk.  The writer
writes them once, they are not updated, and the reader holds a reference to the file to ensure
it is not deleted out from underneath it as it still needs to read from it to perform searches.

During a commit, all of your changes are written to disk and any necessary segment merges
take place, which leaves the older segments that were merged together as 'orphans' that are
no longer referenced by the segments file.  In the final stage of the commit process, after
all of the new segments have been written, an attempt is made to clean up these older
segments, which will fail as your reader still has them open.  It fails gracefully in that
the file names are persisted internally so the delete can be attempted again later, hopefully
after the reader has been reopened and a reference to the orphaned files is no longer
being held.
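
The retry behaviour described above can be modelled in a few lines. This is a
toy Python sketch and not Lucene's actual file-deletion code: FileCleaner, its
methods, and the segment file names are all hypothetical, and a failed delete
is represented by a False return value instead of a swallowed exception.

```python
class FileCleaner:
    """Toy model: deleting a file that a reader holds open fails, so the name
    is remembered and the delete is retried on a later pass."""

    def __init__(self):
        self.open_by_reader = set()  # files a reader still has open
        self.on_disk = set()         # files currently present
        self.pending_deletes = []    # deletes that failed and await a retry

    def try_delete(self, name):
        if name in self.open_by_reader:
            return False             # stands in for the swallowed exception
        self.on_disk.discard(name)
        return True

    def delete_orphans(self, names):
        # Retry earlier failures first, then the newly orphaned files.
        retry = self.pending_deletes + list(names)
        self.pending_deletes = [n for n in retry if not self.try_delete(n)]


cleaner = FileCleaner()
cleaner.on_disk = {"_1.cfs", "_2.cfs", "_3.cfs"}
cleaner.open_by_reader = {"_1.cfs"}

cleaner.delete_orphans(["_1.cfs", "_2.cfs"])    # "_1.cfs" is still open
first_pass = list(cleaner.pending_deletes)

cleaner.open_by_reader.clear()                  # the reader was reopened
cleaner.delete_orphans([])                      # the retry now succeeds
print(first_pass, cleaner.pending_deletes, sorted(cleaner.on_disk))
```

Every commit made while the reader is open pays for a failed delete and
lengthens the retry list, which is consistent with the slowdown reported when
committing after each document.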

I suggest you step through the commit process in a debugger or use a profiler to demonstrate
this issue. 

Michael



-----Original Message-----
From: Eran Sevi [mailto:eransevi@gmail.com]
Sent: Tue 11/17/2009 4:55 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: IndexWriter is slow when reader is open
 
Michael,
Thanks for the answer.

I thought the reader was less connected to the writer. Basically what your
saying is that as long as at least one reader is open, exceptions are thrown
when trying to commit changes (or more accurately, when trying to merge
segments) ?
Can you point me to the place in the source code where that happens?

What happens to the new documents that were added? are they still saved in
another segments?

It's very important to us to make sure every document is persistent in the
index so working in batches could be a problem.
But if there's a way to save each added document to disk without merging the
segment with older segments, this can solve our problem. And since the
reader can't see the new segments anyway until it's reopened, I don't see a
problem continuing writing documents to new segments without performing a
merge. I'll try to change the merge policy/scheduler and see what happens.

Anyway, coding by exception is quite bad practice. Since we're following the
java versions I guess it'll take time to be able to change that.

Eran.

On Mon, Nov 16, 2009 at 8:56 PM, Michael Garski &lt;mgarski@myspace-inc.com&gt; wrote:

&gt; Eran,
&gt;
&gt; The root cause of the issue is due to calling commit after every document
&gt; addition while having a reader open.  Calls to commit should be batched up -
&gt; we frequently use batches of 100 or 1000 between commits.
&gt;
&gt; This is by design within Lucene.  Adding documents will cause segments to
&gt; merge and the writer will then delete the older segments that have been
&gt; merged together to create a new one, however with an open reader the writer
&gt; will not be able to delete the older segment due to a file lock held by the
&gt; reader.  On the call to delete the file an exception is thrown and swallowed
&gt; internally and the name of the file that the delete was attempted upon is
&gt; added to a list of files that can be deleted on another call.
&gt;
&gt; I suggest you refrain from calling commit so often, as that is why you are
&gt; experiencing performance issues.
&gt;
&gt; Michael
&gt;
&gt;
&gt; -----Original Message-----
&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; Sent: Mon 11/16/2009 5:07 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: IndexWriter is slow when reader is open
&gt;
&gt; I've tried to use it with read-only mode and it looks like it's even worse
&gt; right now.
&gt;
&gt; I must admit that we're abusing the indexing a bit by commiting after each
&gt; document addition, but still when there's no reader open, each document is
&gt; indexed in about 30-50ms and when there's a read-only reader open then each
&gt; document is indexed in about 150-500ms.
&gt; Why should an open reader affect the commit process so deeply?
&gt;
&gt; I wonder if no one encountered this phenomena before.
&gt;
&gt;
&gt; On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com
&gt; &gt;wrote:
&gt;
&gt; &gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt; &gt; help, but I would definitely try it.
&gt; &gt;
&gt; &gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; &gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt; &gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt; &gt; &gt;
&gt; &gt; &gt; Eran.
&gt; &gt; &gt;
&gt; &gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt; &gt;
&gt; &gt; &gt;&gt; Eran,
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt; &gt; &gt;&gt; in read-only mode?
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Michael
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; -----Original Message-----
&gt; &gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Hi,
&gt; &gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index
&gt; documents
&gt; &gt; &gt;&gt; while there's at least one IndexReader open on that index (even
&gt; without
&gt; &gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt; &gt; &gt;&gt; When
&gt; &gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt; &gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt; &gt; &gt;&gt; problem
&gt; &gt; &gt;&gt; for me.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt; &gt; &gt;&gt; and
&gt; &gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Thanks,
&gt; &gt; &gt;&gt; Eran.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;
&gt; &gt;
&gt;
&gt;
&gt;

 


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Eran Sevi &lt;eransevi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c74f928500911170455s2a492c48n58b800cb1ac18d42@mail.gmail.com%3e"/>
<id>urn:uuid:%3c74f928500911170455s2a492c48n58b800cb1ac18d42@mail-gmail-com%3e</id>
<updated>2009-11-17T12:55:46Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Michael,
Thanks for the answer.

I thought the reader was less connected to the writer. Basically, what you're
saying is that as long as at least one reader is open, exceptions are thrown
when trying to commit changes (or more accurately, when trying to merge
segments)?
Can you point me to the place in the source code where that happens?

What happens to the new documents that were added? Are they still saved in
other segments?

It's very important to us to make sure every document is persisted in the
index, so working in batches could be a problem.
But if there's a way to save each added document to disk without merging the
segment with older segments, this can solve our problem. And since the
reader can't see the new segments anyway until it's reopened, I don't see a
problem with continuing to write documents to new segments without performing a
merge. I'll try to change the merge policy/scheduler and see what happens.

Anyway, coding by exception is quite a bad practice. Since we're following the
Java versions, I guess it'll take time to be able to change that.

Eran.

On Mon, Nov 16, 2009 at 8:56 PM, Michael Garski &lt;mgarski@myspace-inc.com&gt; wrote:

&gt; Eran,
&gt;
&gt; The root cause of the issue is due to calling commit after every document
&gt; addition while having a reader open.  Calls to commit should be batched up -
&gt; we frequently use batches of 100 or 1000 between commits.
&gt;
&gt; This is by design within Lucene.  Adding documents will cause segments to
&gt; merge and the writer will then delete the older segments that have been
&gt; merged together to create a new one, however with an open reader the writer
&gt; will not be able to delete the older segment due to a file lock held by the
&gt; reader.  On the call to delete the file an exception is thrown and swallowed
&gt; internally and the name of the file that the delete was attempted upon is
&gt; added to a list of files that can be deleted on another call.
&gt;
&gt; I suggest you refrain from calling commit so often, as that is why you are
&gt; experiencing performance issues.
&gt;
&gt; Michael
&gt;
&gt;
&gt; -----Original Message-----
&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; Sent: Mon 11/16/2009 5:07 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: IndexWriter is slow when reader is open
&gt;
&gt; I've tried to use it with read-only mode and it looks like it's even worse
&gt; right now.
&gt;
&gt; I must admit that we're abusing the indexing a bit by commiting after each
&gt; document addition, but still when there's no reader open, each document is
&gt; indexed in about 30-50ms and when there's a read-only reader open then each
&gt; document is indexed in about 150-500ms.
&gt; Why should an open reader affect the commit process so deeply?
&gt;
&gt; I wonder if no one encountered this phenomena before.
&gt;
&gt;
&gt; On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com
&gt; &gt;wrote:
&gt;
&gt; &gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt; &gt; help, but I would definitely try it.
&gt; &gt;
&gt; &gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; &gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt; &gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt; &gt; &gt;
&gt; &gt; &gt; Eran.
&gt; &gt; &gt;
&gt; &gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt; &gt;
&gt; &gt; &gt;&gt; Eran,
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt; &gt; &gt;&gt; in read-only mode?
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Michael
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; -----Original Message-----
&gt; &gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Hi,
&gt; &gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index
&gt; documents
&gt; &gt; &gt;&gt; while there's at least one IndexReader open on that index (even
&gt; without
&gt; &gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt; &gt; &gt;&gt; When
&gt; &gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt; &gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt; &gt; &gt;&gt; problem
&gt; &gt; &gt;&gt; for me.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt; &gt; &gt;&gt; and
&gt; &gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt; Thanks,
&gt; &gt; &gt;&gt; Eran.
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;&gt;
&gt; &gt; &gt;
&gt; &gt;
&gt;
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: IndexWriter is slow when reader is open</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E046CF0@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E046CF0@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-16T18:56:01Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Eran,

The root cause of the issue is calling commit after every document addition while having
a reader open.  Calls to commit should be batched up - we frequently use batches of 100 or
1000 between commits.
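
The batching pattern can be sketched as follows. This is a Python model, not
Lucene.Net code: FakeWriter and index_in_batches are hypothetical names, and
the writer only counts calls so that the effect of the batch size is visible.

```python
class FakeWriter:
    """Hypothetical stand-in for an index writer; it only counts calls."""

    def __init__(self):
        self.pending = 0       # documents added since the last commit
        self.committed = 0     # documents made durable by a commit
        self.commit_calls = 0  # how many commits were paid for

    def add_document(self, doc):
        self.pending += 1

    def commit(self):
        self.committed += self.pending
        self.pending = 0
        self.commit_calls += 1


def index_in_batches(writer, docs, batch_size=100):
    """Add every document, committing once per full batch and once at the end."""
    for count, doc in enumerate(docs, start=1):
        writer.add_document(doc)
        if count % batch_size == 0:
            writer.commit()
    if writer.pending:
        writer.commit()  # flush the final partial batch


writer = FakeWriter()
index_in_batches(writer, range(250), batch_size=100)
print(writer.commit_calls, writer.committed)  # prints: 3 250
```

With a batch size of 100, 250 documents cost three commits instead of 250.
The trade-off is that documents in an uncommitted batch are not yet durable,
so the source data has to be kept until the commit returns.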

This is by design within Lucene.  Adding documents will cause segments to merge and the writer
will then delete the older segments that have been merged together to create a new one.  However,
with an open reader the writer will not be able to delete the older segments due to a file
lock held by the reader.  On the call to delete the file an exception is thrown and swallowed
internally, and the name of the file that the delete was attempted upon is added to a list
of files to be deleted on a later call.

I suggest you refrain from calling commit so often, as that is why you are experiencing performance
issues.

Michael
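
The batching advice above can be sketched roughly as follows.  This is a
minimal illustration in Python, not Lucene.Net code: CountingWriter and
index_in_batches are hypothetical names, and the writer only counts calls so
that the commit pattern is visible.

```python
class CountingWriter:
    """Hypothetical stand-in for an index writer; it only counts calls."""

    def __init__(self):
        self.added = 0
        self.commits = 0

    def add_document(self, doc):
        self.added += 1

    def commit(self):
        self.commits += 1


def index_in_batches(writer, docs, batch_size=1000):
    """Add every document, committing once per full batch and once at the end."""
    pending = 0
    for doc in docs:
        writer.add_document(doc)
        pending += 1
        if pending == batch_size:
            writer.commit()
            pending = 0
    if pending != 0:
        writer.commit()  # flush the final partial batch


writer = CountingWriter()
index_in_batches(writer, range(2500), batch_size=1000)
print(writer.added, writer.commits)  # 2500 3
```

With 2500 documents and a batch size of 1000 the writer commits three times
instead of 2500 times; the right batch size for a real index is something to
measure rather than assume.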


-----Original Message-----
From: Eran Sevi [mailto:eransevi@gmail.com]
Sent: Mon 11/16/2009 5:07 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: IndexWriter is slow when reader is open
 
I've tried to use it with read-only mode and it looks like it's even worse
right now.

I must admit that we're abusing the indexing a bit by committing after each
document addition, but still when there's no reader open, each document is
indexed in about 30-50ms and when there's a read-only reader open then each
document is indexed in about 150-500ms.
Why should an open reader affect the commit process so deeply?

I wonder if no one encountered this phenomenon before.


On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com&gt;wrote:

&gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt; help, but I would definitely try it.
&gt;
&gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt; &gt;
&gt; &gt; Eran.
&gt; &gt;
&gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt;
&gt; &gt;&gt; Eran,
&gt; &gt;&gt;
&gt; &gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt; &gt;&gt; in read-only mode?
&gt; &gt;&gt;
&gt; &gt;&gt; Michael
&gt; &gt;&gt;
&gt; &gt;&gt; -----Original Message-----
&gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt;&gt;
&gt; &gt;&gt; Hi,
&gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index documents
&gt; &gt;&gt; while there's at least one IndexReader open on that index (even without
&gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt; &gt;&gt; When
&gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt;&gt;
&gt; &gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt; &gt;&gt; problem
&gt; &gt;&gt; for me.
&gt; &gt;&gt;
&gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt; &gt;&gt; and
&gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt;&gt;
&gt; &gt;&gt;
&gt; &gt;&gt; Thanks,
&gt; &gt;&gt; Eran.
&gt; &gt;&gt;
&gt; &gt;&gt;
&gt; &gt;
&gt;

 


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Eran Sevi &lt;eransevi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c74f928500911160706l5d83ebf0j809f178f86b65b18@mail.gmail.com%3e"/>
<id>urn:uuid:%3c74f928500911160706l5d83ebf0j809f178f86b65b18@mail-gmail-com%3e</id>
<updated>2009-11-16T15:06:28Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I've noticed that the slow down only happens when the reader created is
"ReadOnlyMultiSegmentReader".
When the index is fully optimized (thus the reader created is
"ReadOnlySegmentReader"), the writer that is opened afterwards still
functions at full speed.
Since most of the time the index is far from being optimized, this is still
a major problem.

I can only guess that it's because of locking issues. I'll continue to
research it and update if I find something new.

On Mon, Nov 16, 2009 at 3:07 PM, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:

&gt; I've tried to use it with read-only mode and it looks like it's even worse
&gt; right now.
&gt;
&gt; I must admit that we're abusing the indexing a bit by committing after each
&gt; document addition, but still when there's no reader open, each document is
&gt; indexed in about 30-50ms and when there's a read-only reader open then each
&gt; document is indexed in about 150-500ms.
&gt; Why should an open reader affect the commit process so deeply?
&gt;
&gt; I wonder if no one encountered this phenomenon before.
&gt;
&gt;
&gt; On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com&gt;wrote:
&gt;
&gt;&gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt;&gt; help, but I would definitely try it.
&gt;&gt;
&gt;&gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt;&gt; &gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt;&gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt;&gt; &gt;
&gt;&gt; &gt; Eran.
&gt;&gt; &gt;
&gt;&gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt;&gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt;&gt; &gt;
&gt;&gt; &gt;&gt; Eran,
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt;&gt; &gt;&gt; in read-only mode?
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; Michael
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; -----Original Message-----
&gt;&gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt;&gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt;&gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt;&gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; Hi,
&gt;&gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index documents
&gt;&gt; &gt;&gt; while there's at least one IndexReader open on that index (even without
&gt;&gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt;&gt; &gt;&gt; When
&gt;&gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt;&gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt;&gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt;&gt; &gt;&gt; problem
&gt;&gt; &gt;&gt; for me.
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt;&gt; &gt;&gt; and
&gt;&gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt; Thanks,
&gt;&gt; &gt;&gt; Eran.
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;&gt;
&gt;&gt; &gt;
&gt;&gt;
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Eran Sevi &lt;eransevi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c74f928500911160507m58f25abcu8a03d3e94d5b8edf@mail.gmail.com%3e"/>
<id>urn:uuid:%3c74f928500911160507m58f25abcu8a03d3e94d5b8edf@mail-gmail-com%3e</id>
<updated>2009-11-16T13:07:05Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I've tried to use it with read-only mode and it looks like it's even worse
right now.

I must admit that we're abusing the indexing a bit by committing after each
document addition, but still when there's no reader open, each document is
indexed in about 30-50ms and when there's a read-only reader open then each
document is indexed in about 150-500ms.
Why should an open reader affect the commit process so deeply?

I wonder if no one encountered this phenomenon before.


On Sat, Nov 14, 2009 at 8:27 PM, Matt Honeycutt &lt;mbhoneycutt@gmail.com&gt;wrote:

&gt; 2.4 does indeed support read-only mode. I don't know how much it will
&gt; help, but I would definitely try it.
&gt;
&gt; On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; &gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt; &gt; Is there no other way to prevent this slow down in previous versions?
&gt; &gt;
&gt; &gt; Eran.
&gt; &gt;
&gt; &gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt;
&gt; &gt;&gt; Eran,
&gt; &gt;&gt;
&gt; &gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt; &gt;&gt; in read-only mode?
&gt; &gt;&gt;
&gt; &gt;&gt; Michael
&gt; &gt;&gt;
&gt; &gt;&gt; -----Original Message-----
&gt; &gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; &gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; &gt;&gt; To: lucene-net-user@incubator.apache.org
&gt; &gt;&gt; Subject: IndexWriter is slow when reader is open
&gt; &gt;&gt;
&gt; &gt;&gt; Hi,
&gt; &gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index documents
&gt; &gt;&gt; while there's at least one IndexReader open on that index (even without
&gt; &gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt; &gt;&gt; When
&gt; &gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt; &gt;&gt; I'm not doing any deletes, only adds.
&gt; &gt;&gt;
&gt; &gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt; &gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt; &gt;&gt; problem
&gt; &gt;&gt; for me.
&gt; &gt;&gt;
&gt; &gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt; &gt;&gt; and
&gt; &gt;&gt; how can I avoid such a big loss in performance?
&gt; &gt;&gt;
&gt; &gt;&gt;
&gt; &gt;&gt; Thanks,
&gt; &gt;&gt; Eran.
&gt; &gt;&gt;
&gt; &gt;&gt;
&gt; &gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Matt Honeycutt &lt;mbhoneycutt@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3cf9ca0e290911141027g3f976aaua3eb1759db981c26@mail.gmail.com%3e"/>
<id>urn:uuid:%3cf9ca0e290911141027g3f976aaua3eb1759db981c26@mail-gmail-com%3e</id>
<updated>2009-11-14T18:27:23Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
2.4 does indeed support read-only mode. I don't know how much it will
help, but I would definitely try it.

On 11/14/09, Eran Sevi &lt;eransevi@gmail.com&gt; wrote:
&gt; I'm still using version 2.4 so I think there's still no read only mode.
&gt; Is there no other way to prevent this slow down in previous versions?
&gt;
&gt; Eran.
&gt;
&gt; On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski
&gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt;
&gt;&gt; Eran,
&gt;&gt;
&gt;&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt;&gt; in read-only mode?
&gt;&gt;
&gt;&gt; Michael
&gt;&gt;
&gt;&gt; -----Original Message-----
&gt;&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt;&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt;&gt; To: lucene-net-user@incubator.apache.org
&gt;&gt; Subject: IndexWriter is slow when reader is open
&gt;&gt;
&gt;&gt; Hi,
&gt;&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index documents
&gt;&gt; while there's at least one IndexReader open on that index (even without
&gt;&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt;&gt; When
&gt;&gt; closing the reader, the indexing speed goes back to normal.
&gt;&gt; I'm not doing any deletes, only adds.
&gt;&gt;
&gt;&gt;  My index is going to be updated regularly and there's going to be a
&gt;&gt; reader/searcher in use almost all the time so this might be a big
&gt;&gt; problem
&gt;&gt; for me.
&gt;&gt;
&gt;&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt;&gt; and
&gt;&gt; how can I avoid such a big loss in performance?
&gt;&gt;
&gt;&gt;
&gt;&gt; Thanks,
&gt;&gt; Eran.
&gt;&gt;
&gt;&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: IndexWriter is slow when reader is open</title>
<author><name>Eran Sevi &lt;eransevi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c74f928500911140908o2340c6eew70a1f5009ee97177@mail.gmail.com%3e"/>
<id>urn:uuid:%3c74f928500911140908o2340c6eew70a1f5009ee97177@mail-gmail-com%3e</id>
<updated>2009-11-14T17:08:39Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I'm still using version 2.4 so I think there's still no read only mode.
Is there no other way to prevent this slow down in previous versions?

Eran.

On Thu, Nov 12, 2009 at 8:16 PM, Michael Garski &lt;mgarski@myspace-inc.com&gt;wrote:

&gt; Eran,
&gt;
&gt; What version of Lucene are you using?  Are you opening the IndexReader
&gt; in read-only mode?
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Eran Sevi [mailto:eransevi@gmail.com]
&gt; Sent: Thursday, November 12, 2009 9:06 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: IndexWriter is slow when reader is open
&gt;
&gt; Hi,
&gt; I'm using Lucene.Net 2.4 and I just noticed that when I index documents
&gt; while there's at least one IndexReader open on that index (even without
&gt; doing anything), the indexing speed is slower by a factor of 3 to 5.
&gt; When
&gt; closing the reader, the indexing speed goes back to normal.
&gt; I'm not doing any deletes, only adds.
&gt;
&gt;  My index is going to be updated regularly and there's going to be a
&gt; reader/searcher in use almost all the time so this might be a big
&gt; problem
&gt; for me.
&gt;
&gt; Does anyone have a clue if this is normal behavior? why does it happen
&gt; and
&gt; how can I avoid such a big loss in performance?
&gt;
&gt;
&gt; Thanks,
&gt; Eran.
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: FieldLookup for field with multiple values</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93680@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93680@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-12T20:48:07Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
It scales well with tens of millions of documents or more with enough
RAM, provided you have a mechanism for expiring the cache when a reader
is no longer in use and pool cached items for re-use so your process
doesn't incur massive GC sweeps when the cache is expired.

Michael
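
A rough sketch of the kind of cache described above, written in Python for
illustration only.  ReaderCache and its methods are hypothetical names, the
reader is represented by a plain identifier, and the pooling shown here only
illustrates the idea of re-using cached items instead of discarding them.

```python
class ReaderCache:
    """Per-reader lookup tables with explicit expiry and a pool of reusable lists."""

    def __init__(self):
        self._by_reader = {}  # reader id -> {doc id -> list of values}
        self._pool = []       # emptied lists waiting to be reused

    def _take_list(self):
        # Reuse a pooled list when one is available, otherwise allocate.
        return self._pool.pop() if self._pool else []

    def put(self, reader_id, doc_id, values):
        table = self._by_reader.setdefault(reader_id, {})
        slot = self._take_list()
        slot.extend(values)
        table[doc_id] = slot

    def get(self, reader_id, doc_id):
        return self._by_reader.get(reader_id, {}).get(doc_id)

    def expire(self, reader_id):
        """Drop a closed reader's table, returning its lists to the pool."""
        table = self._by_reader.pop(reader_id, {})
        for slot in table.values():
            slot.clear()
            self._pool.append(slot)


cache = ReaderCache()
cache.put("reader-1", 7, ["org42", "org99"])
print(cache.get("reader-1", 7))
cache.expire("reader-1")
print(cache.get("reader-1", 7))
```

Calling expire when a reader is closed keeps the cache from holding data for
readers that are no longer in use, and the next put takes its list from the
pool rather than allocating a new one.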

-----Original Message-----
From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com] 
Sent: Thursday, November 12, 2009 12:35 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: FieldLookup for field with multiple values

I shudder to think about what the higher-ups will say about the cost of
faster storage.  They're very stingy on things that make our lives easier.
:)

I like the idea of caching the term vectors somehow, though I wonder how
that would scale with millions of documents.  I will add that to the list of
things to prototype for our long-term solution.

Thanks for the feedback,

---Matt

On Thu, Nov 12, 2009 at 12:14 PM, Michael Garski
&lt;mgarski@myspace-inc.com&gt;wrote:

&gt; Matt,
&gt;
&gt; Metadata can be collected during the course of a search in a
&gt; HitCollector (note that 2.9 will deprecate the current HitCollector in
&gt; favor of the more flexible Collector).  One approach to collecting
&gt; facets is to index the field with the desired facet metadata with
&gt; TermVectors enabled.  During collection you can retrieve the
&gt; TermFreqVector for a document with Reader.GetTermFreqVector for each
&gt; document and collect/aggregate metadata however you like.
&gt;
&gt; Making the call to get the TermFreqVector for each hit will be a
&gt; performance hit due to IO and deserialization (even though it's very
&gt; fast, 50K times in one search is a lot).  At that point you can add
&gt; faster IO (Fusion IO rules this space) or during an index warmup create
&gt; an in-memory structure with lookups to the vectors keyed off the
&gt; document id.
&gt;
&gt; Hope that helps.
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; Sent: Thursday, November 12, 2009 7:13 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: FieldLookup for field with multiple values
&gt;
&gt; I can elaborate a little on what our *planned* approach for utilizing SQL
&gt; Server is.  I don't know if this will work, but I've done similar things
&gt; with SQL CLR and haven't had it explode (yet), so I'm hopeful.  Anyway:
&gt;
&gt; Our system needs two types of output, basically: the full-text reports
&gt; (I'll call them documents to reduce ambiguity), and then statistical
&gt; reports built on those documents.  The documents can easily be retrieved
&gt; from Lucene, so the challenge is building the reports.  For that, my plan
&gt; is to submit the same query string to SQL Server that I sent to Lucene.
&gt; Internally, SQL Server would then pass the query back to Lucene and
&gt; retrieve a list of document IDs that matched.  The communication may be
&gt; over WCF or something similar, and will be compressed during transit to
&gt; reduce IO overhead.  Once SQL Server has the IDs, they will be loaded into
&gt; a temporary table (with indexes) or a table variable, which will then be
&gt; used to filter the metadata that the statistical reports are built from.
&gt;
&gt; I have no idea how such a system would perform.  I do hope to do some
&gt; feasibility tests sometime Real Soon (like in the next few weeks), and
&gt; I'll
&gt; post my results if I manage to get it working.
&gt;
&gt; If anyone has any other suggestions, please do share.
&gt;
&gt; On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie &lt;
&gt; mmcconna@oxford-analytica.com&gt; wrote:
&gt;
&gt; &gt; &gt;While we're discussing this, anyone have any advice or suggestions
&gt; &gt; &gt;for a better solution?  We've considered a few things for our
&gt; &gt; &gt;long-term solution.
&gt; &gt;
&gt; &gt; I'd be very interested to hear thoughts on intersecting SQL and Lucene
&gt; &gt; too, as in our case we have very large lists of organisations which
&gt; &gt; have different permissions (stored in SQL) for different documents
&gt; &gt; stored in Lucene. Showing in the search results only those documents to
&gt; &gt; which the organisation has permission is quite expensive for queries
&gt; &gt; with lots of results. Storing it in the documents is not manageable
&gt; &gt; because they need to be updated frequently across multiple documents.
&gt; &gt; Currently we precompute a list for each organisation, cache that in
&gt; &gt; memory, and recache it every time that organisation is updated. However
&gt; &gt; this too is costly.
&gt; &gt;
&gt; &gt; Storing the Lucene document nos in SQL during indexing, and then passing
&gt; &gt; the list of Lucene document nos matching a search to SQL for filtering
&gt; &gt; seems the right way to go. But Matt is right, the problem is with
&gt; &gt; scaling this to searches returning many thousands of documents.
&gt; &gt;
&gt; &gt; Yours,
&gt; &gt; M.
&gt; &gt; -------------------------------------
&gt; &gt; Moray McConnachie
&gt; &gt; Director of IT    +44 1865 261 600
&gt; &gt; Oxford Analytica  http://www.oxan.com
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; Sent: 12 November 2009 13:53
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: Re: FieldLookup for field with multiple values
&gt; &gt;
&gt; &gt; Yeah, it is sort of like your standard faceting scenario, except there
&gt; &gt; are about 20,000 facets (organizations), and there are complex
&gt; &gt; relationships among the facets.
&gt; &gt;
&gt; &gt; The reports we're dealing with only occasionally break the funding up
&gt; &gt; by organization, so we decided (for now) to just store a single funding
&gt; &gt; value, then break it up after-the-fact by dividing it by the number of
&gt; &gt; organizations.  So no, the funding is only stored once.
&gt; &gt;
&gt; &gt; While we're discussing this, anyone have any advice or suggestions for
&gt; &gt; a better solution?  We've considered a few things for our long-term
&gt; &gt; solution.
&gt; &gt; One is to put this metadata in a SQL Server instance, and use SQL CLR
&gt; &gt; to build a temporary table based on document IDs from a Lucene index
&gt; &gt; (hosted over WCF or something similar), then do the reporting within
&gt; &gt; SQL Server.  We plan to compress the list of IDs going back from Lucene
&gt; &gt; to SQL Server to cut down on IO overhead, but we're still concerned
&gt; &gt; that approach won't scale as we go from hundreds of thousands to
&gt; &gt; millions of reports.
&gt; &gt;
&gt; &gt; Another option we've discussed is to precompute data cubes and use
&gt; &gt; these to calculate reporting information.  The concern here is the high
&gt; &gt; dimensionality of the data (we have about 20,000 distinct organizations
&gt; &gt; now, but fully expect that to increase by an order of magnitude) as
&gt; &gt; well as the accuracy of the generated reports, since there's (probably)
&gt; &gt; not a good way to divide the cube based on arbitrary Lucene queries.
&gt; &gt;
&gt; &gt; On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski
&gt; &gt; &lt;mgarski@myspace-inc.com&gt;wrote:
&gt; &gt;
&gt; &gt; &gt; Sounds like a full-text search with the results simply being facets
&gt; &gt; &gt; on the organizations sorted by the funding amount?
&gt; &gt; &gt;
&gt; &gt; &gt; You mentioned adding the org ID once for each document.  Do you do
&gt; &gt; &gt; the same for the funding, with the funding for each corresponding
&gt; &gt; &gt; organization?
&gt; &gt; &gt;
&gt; &gt; &gt; Michael
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; -----Original Message-----
&gt; &gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; &gt; Sent: Wed 11/11/2009 10:17 PM
&gt; &gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; Subject: Re: FieldLookup for field with multiple values
&gt; &gt; &gt;
&gt; &gt; &gt; Well, let me preface what I'm about to describe by saying that I know
&gt; &gt; &gt; that I'm doing something with Lucene that it wasn't meant to do.  This
&gt; &gt; &gt; is for a "proof of concept" system that I'm helping put together on a
&gt; &gt; &gt; tight schedule with very limited resources, and we're trying to get to
&gt; &gt; &gt; a mostly-working state as quickly as possible.
&gt; &gt; &gt;
&gt; &gt; &gt; That said, we are basically storing reports in Lucene.  The reports
&gt; &gt; &gt; are fairly standard documents for the most part: they have a title,
&gt; &gt; &gt; body, abstract, etc, all of which we index and search with Lucene.
&gt; &gt; &gt; However, they also have a few fields that aren't standard, including a
&gt; &gt; &gt; list of involved organizations as well as a dollar amount for each
&gt; &gt; &gt; report.  The organizations are stored as IDs, and we add the org ID
&gt; &gt; &gt; field multiple times, once for each organization involved in the
&gt; &gt; &gt; report.  The funding is also stored as a non-indexed field on the
&gt; &gt; &gt; Lucene document.
&gt; &gt; &gt;
&gt; &gt; &gt; What I'm trying to do is build a quick-and-dirty org-by-dollar report
&gt; &gt; &gt; off of the reports that match the user's query.  So, a query for
&gt; &gt; &gt; "aerospace" might match 50,000 documents, and I want to show the user
&gt; &gt; &gt; the top 5 organizations in terms of dollars.  Again, I know reporting
&gt; &gt; &gt; like this isn't what Lucene was meant for, and we do have some ideas
&gt; &gt; &gt; on how to handle it long-term, but for now, I'm trying to get it
&gt; &gt; &gt; working as well as I can using Lucene alone, and Lucene does do a
&gt; &gt; &gt; great job of finding the relevant set of documents to build a report
&gt; &gt; &gt; from.
&gt; &gt; &gt;
&gt; &gt; &gt; On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski
&gt; &gt; &gt; &lt;mgarski@myspace-inc.com&gt; wrote:
&gt; &gt; &gt;
&gt; &gt; &gt; &gt; Matt,
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; StringIndex is for use when a field has only one value in it for
&gt; &gt; &gt; &gt; the purposes of sorting results, not for tokenized fields with
&gt; &gt; &gt; &gt; multiple values.  TermVectors might be a better approach, but for
&gt; &gt; &gt; &gt; 50K docs, you'll encounter an IO hit on reading them.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; I'm curious why you are looking to grab all of the terms for a
&gt; &gt; &gt; &gt; ScoreDoc...  can you shed some light on that?
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; Michael
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; -----Original Message-----
&gt; &gt; &gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; &gt; &gt; Sent: Wednesday, November 11, 2009 4:57 PM
&gt; &gt; &gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; &gt; Subject: FieldLookup for field with multiple values
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; It seems that the StringIndex returned by
&gt; &gt; &gt; &gt; FieldCache.Fields.Default.GetStringIndex() only indexes one value
&gt; &gt; &gt; &gt; for a document even when the document has multiple values for the
&gt; &gt; &gt; &gt; field.  Is there a performant way to get all the values for a
&gt; &gt; &gt; &gt; particular field in a ScoreDoc?  I'm having to do this across the
&gt; &gt; &gt; &gt; entire result set of ScoreDocs (up to 50,000), and retrieving the
&gt; &gt; &gt; &gt; values through LuceneDocument.GetFields is not going to cut it.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt;
&gt; &gt;
&gt;
&gt;



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: FieldLookup for field with multiple values</title>
<author><name>Matt Honeycutt &lt;mbhoneycutt@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3cf9ca0e290911121234s7a7d821dq3deda65c9fad94d0@mail.gmail.com%3e"/>
<id>urn:uuid:%3cf9ca0e290911121234s7a7d821dq3deda65c9fad94d0@mail-gmail-com%3e</id>
<updated>2009-11-12T20:34:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I shudder to think about what the higher-ups will say about the cost of
faster storage.  They're very stingy on things that make our lives easier.
:)

I like the idea of caching the term vectors somehow, though I wonder how
that would scale with millions of documents.  I will add that to the list of
things to prototype for our long-term solution.

Thanks for the feedback,

---Matt
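
For reference, a rough sketch of how an in-memory lookup keyed by document id
could feed the org-by-dollar report discussed in this thread.  It is plain
Python rather than Lucene.Net code; top_orgs_by_funding, orgs_by_doc and
funding_by_doc are hypothetical names, and the even split of each report's
funding across its organizations is the simplification mentioned earlier in
the thread.

```python
from collections import defaultdict
import heapq


def top_orgs_by_funding(hit_doc_ids, orgs_by_doc, funding_by_doc, n=5):
    """Sum each hit's funding per organization, splitting it evenly among its orgs."""
    totals = defaultdict(float)
    for doc_id in hit_doc_ids:
        orgs = orgs_by_doc.get(doc_id, [])
        if not orgs:
            continue  # no organizations recorded for this document
        share = funding_by_doc.get(doc_id, 0.0) / len(orgs)
        for org in orgs:
            totals[org] += share
    # Return the n organizations with the largest totals, largest first.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# Hypothetical data: lookups built once at warmup, keyed by document id.
orgs_by_doc = {1: ["A", "B"], 2: ["B"], 3: ["C"]}
funding_by_doc = {1: 100.0, 2: 30.0, 3: 40.0}

print(top_orgs_by_funding([1, 2, 3], orgs_by_doc, funding_by_doc, n=2))
# [('B', 80.0), ('A', 50.0)]
```

The lookups would have to be rebuilt whenever a new reader is opened, since
document ids are only meaningful for the reader they came from.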

On Thu, Nov 12, 2009 at 12:14 PM, Michael Garski &lt;mgarski@myspace-inc.com&gt;wrote:

&gt; Matt,
&gt;
&gt; Metadata can be collected during the course of a search in a
&gt; HitCollector (note that 2.9 will deprecate the current HitCollector in
&gt; favor of the more flexible Collector).  One approach to collecting
&gt; facets is to index the field with the desired facet metadata with
&gt; TermVectors enabled.  During collection you can retrieve the
&gt; TermFreqVector for a document with Reader.GetTermFreqVector for each
&gt; document and collect/aggregate metadata however you like.
&gt;
&gt; Making the call to get the TermFreqVector for each hit will be a
&gt; performance hit due to IO and deserialization (even though it's very
&gt; fast, 50K times in one search is a lot).  At that point you can add
&gt; faster IO (Fusion IO rules this space) or during an index warmup create
&gt; an in-memory structure with lookups to the vectors keyed off the
&gt; document id.
&gt;
&gt; Hope that helps.
&gt;
&gt; Michael
&gt;
&gt; -----Original Message-----
&gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; Sent: Thursday, November 12, 2009 7:13 AM
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: FieldLookup for field with multiple values
&gt;
&gt; I can elaborate a little on what our *planned* approach for utilizing
&gt; SQL
&gt; Server is.  I don't know if this will work, but I've done similar things
&gt; with SQL CLR and haven't had it explode (yet), so I'm hopeful.  Anyway:
&gt;
&gt; Our system needs two types of output, basically: the full-text reports
&gt; (I'll
&gt; call them documents to reduce ambiguity), and then statistical reports
&gt; built
&gt; on those documents.  The documents can easily be retrieved from Lucene,
&gt; so
&gt; the challenge is building the reports.  For that, my plan is to submit
&gt; the
&gt; same query string to SQL Server that I sent to Lucene.  Internally, SQL
&gt; Server would then pass the query back to Lucene and retrieve a list of
&gt; document IDs that matched.  The communication may be over WCF or
&gt; something
&gt; similar, and will be compressed during transit to reduce IO overhead.
&gt; Once
&gt; SQL Server has the IDs, they will be loaded into a temporary table (with
&gt; indexes) or a table variable, which will then be used to filter the
&gt; metadata
&gt; that the statistical reports are built from.
&gt;
&gt; I have no idea how such a system would perform.  I do hope to do some
&gt; feasibility tests sometime Real Soon (like in the next few weeks), and
&gt; I'll
&gt; post my results if I manage to get it working.
&gt;
&gt; If anyone has any other suggestions, please do share.
&gt;
&gt; On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie &lt;
&gt; mmcconna@oxford-analytica.com&gt; wrote:
&gt;
&gt; &gt; &gt;While we're discussing this, anyone have any advice or suggestions
&gt; for
&gt; &gt; a better solution?  We've considered a few things for our long-term
&gt; &gt; solution.
&gt; &gt;
&gt; &gt; I'd be very interested to hear thoughts on intersecting SQL and Lucene
&gt; &gt; too, as in our case we have very large lists of organisations which
&gt; have
&gt; &gt; different permissions (stored in SQL) for different documents stored
&gt; in
&gt; &gt; Lucene. Showing in the search results only those documents to which
&gt; the
&gt; &gt; organisation has permission is quite expensive for queries with lots
&gt; of
&gt; &gt; results. Storing it in the documents is not manageable because they
&gt; need
&gt; &gt; to be updated frequently across multiple documents. Currently we
&gt; &gt; precompute a list for each organisation, cache that in memory, and
&gt; &gt; recache it every time that organisation is updated. However this too
&gt; is
&gt; &gt; costly.
&gt; &gt;
&gt; &gt; Storing the Lucene document nos in SQL during indexing, and then
&gt; passing
&gt; &gt; the list of Lucene document nos matching a search to SQL for filtering
&gt; &gt; seems the right way to go. But Matt is right, the problem is with
&gt; &gt; scaling this to searches returning many thousands of documents.
&gt; &gt;
&gt; &gt; Yours,
&gt; &gt; M.
&gt; &gt; -------------------------------------
&gt; &gt; Moray McConnachie
&gt; &gt; Director of IT    +44 1865 261 600
&gt; &gt; Oxford Analytica  http://www.oxan.com
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; Sent: 12 November 2009 13:53
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: Re: FieldLookup for field with multiple values
&gt; &gt;
&gt; &gt; Yeah, it is sort of like your standard faceting scenario, except there
&gt; &gt; are about 20,000 facets (organizations), and there's complex
&gt; &gt; relationships among the facets.
&gt; &gt;
&gt; &gt; The reports we're dealing with only occasionally break the funding up
&gt; by
&gt; &gt; organization, so we decided (for now) to just store a single funding
&gt; &gt; value, then break it up after-the-fact by dividing it by the number of
&gt; &gt; organizations.  So no, the funding is only stored once.
&gt; &gt;
&gt; &gt; While we're discussing this, anyone have any advice or suggestions for
&gt; a
&gt; &gt; better solution?  We've considered a few things for our long-term
&gt; &gt; solution.
&gt; &gt; One is to put this metadata in a SQL Server instance, and use SQL CLR
&gt; to
&gt; &gt; build a temporary table based on document IDs from a Lucene index
&gt; &gt; (hosted over WCF or something similar), then do the reporting within
&gt; SQL
&gt; &gt; Server.  We plan to compress the list of IDs going back from Lucene to
&gt; &gt; SQL Server to cut down on IO overhead, but we're still concerned that
&gt; &gt; approach won't scale as we go from hundreds of thousands to millions
&gt; of
&gt; &gt; reports.
&gt; &gt;
&gt; &gt; Another option we've discussed is to precompute data cubes and use
&gt; these
&gt; &gt; to calculate reporting information.  The concern here is the high
&gt; &gt; dimensionality of the data (we have about 20,000 distinct
&gt; organizations
&gt; &gt; now, but fully expect that to increase by an order of magnitude) as
&gt; well
&gt; &gt; as the accuracy of the generated reports, since there's (probably) not
&gt; a
&gt; &gt; good way to divide the cube based on arbitrary Lucene queries.
&gt; &gt;
&gt; &gt; On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski
&gt; &gt; &lt;mgarski@myspace-inc.com&gt; wrote:
&gt; &gt;
&gt; &gt; &gt; Sounds like a full-text search with the results simply being facets
&gt; on
&gt; &gt;
&gt; &gt; &gt; the organizations sorted by the funding amount?
&gt; &gt; &gt;
&gt; &gt; &gt; You mentioned adding the org ID once for each document.  Do you do
&gt; the
&gt; &gt;
&gt; &gt; &gt; same for the funding, with the funding for each corresponding
&gt; &gt; organization?
&gt; &gt; &gt;
&gt; &gt; &gt; Michael
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; -----Original Message-----
&gt; &gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; &gt; Sent: Wed 11/11/2009 10:17 PM
&gt; &gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; Subject: Re: FieldLookup for field with multiple values
&gt; &gt; &gt;
&gt; &gt; &gt; Well, let me preface what I'm about to describe by saying that I know
&gt; &gt; &gt; that I'm doing something with Lucene that it wasn't meant to do.
&gt; This
&gt; &gt;
&gt; &gt; &gt; is for a "proof of concept" system that I'm helping put together on
&gt; a
&gt; &gt; &gt; tight schedule with very limited resources, and we're trying to get
&gt; to
&gt; &gt;
&gt; &gt; &gt; a mostly-working state as quickly as possible.
&gt; &gt; &gt;
&gt; &gt; &gt; That said, we are basically storing reports in Lucene.  The reports
&gt; &gt; &gt; are fairly standard documents for the most part: they have a title,
&gt; &gt; &gt; body, abstract, etc, all of which we index and search with Lucene.
&gt; &gt; &gt; However, they also have a few fields that aren't standard, including
&gt; a
&gt; &gt;
&gt; &gt; &gt; list of involved organizations as well as a dollar amount for each
&gt; &gt; &gt; report.  The organizations are stored as IDs, and we add the org ID
&gt; &gt; &gt; field multiple times, once for each organization involved in the
&gt; &gt; &gt; report.  The funding is also stored as a non-indexed field on the
&gt; &gt; &gt; Lucene document.
&gt; &gt; &gt;
&gt; &gt; &gt; What I'm trying to do is build a quick-and-dirty org-by-dollar
&gt; report
&gt; &gt; &gt; off of the reports that match the user's query.  So, a query for
&gt; &gt; &gt; "aerospace" might match 50,000 documents, and I want to show the
&gt; user
&gt; &gt; &gt; the top 5 organizations in terms of dollars.  Again, I know
&gt; reporting
&gt; &gt; &gt; like this isn't what Lucene was meant for, and we do have some ideas
&gt; &gt; &gt; on how to handle it long-term, but for now, I'm trying to get it
&gt; &gt; &gt; working as well as I can using Lucene alone, and Lucene does do a
&gt; &gt; &gt; great job of finding the relevant set of documents to build a report
&gt; &gt; &gt; from.
&gt; &gt; &gt;
&gt; &gt; &gt; On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski
&gt; &gt; &gt; &lt;mgarski@myspace-inc.com
&gt; &gt; &gt; &gt;wrote:
&gt; &gt; &gt;
&gt; &gt; &gt; &gt; Matt,
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; StringIndex is for use when a field has only one value in it for
&gt; the
&gt; &gt;
&gt; &gt; &gt; &gt; purposes of sorting results, not for tokenized fields with
&gt; multiple
&gt; &gt; &gt; &gt; values.  TermVectors might be a better approach, but for 50K docs,
&gt; &gt; &gt; &gt; you'll encounter an IO hit on reading them.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; I'm curious why you are looking to grab all of the terms for a
&gt; &gt; &gt; &gt; ScoreDoc...  can you shed some light on that?
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; Michael
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; -----Original Message-----
&gt; &gt; &gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; &gt; &gt; Sent: Wednesday, November 11, 2009 4:57 PM
&gt; &gt; &gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; &gt; Subject: FieldLookup for field with multiple values
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; It seems that the StringIndex returned by
&gt; &gt; &gt; &gt; FieldCache.Fields.Default.GetStringIndex() only indexes one value
&gt; &gt; &gt; &gt; for a document even when the document has multiple values for the
&gt; &gt; &gt; &gt; field.  Is there a performant way to get all the values for a
&gt; &gt; &gt; &gt; particular field in a ScoreDoc?  I'm having to do this across the
&gt; &gt; &gt; &gt; entire result set of ScoreDocs (up to 50,000), and retrieving the
&gt; &gt; &gt; &gt; values through LuceneDocument.GetFields is not going to cut it.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt;
&gt; &gt;
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: IndexWriter is slow when reader is open</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93671@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93671@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-12T18:16:24Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Eran,

What version of Lucene are you using?  Are you opening the IndexReader
in read-only mode?

Michael

-----Original Message-----
From: Eran Sevi [mailto:eransevi@gmail.com] 
Sent: Thursday, November 12, 2009 9:06 AM
To: lucene-net-user@incubator.apache.org
Subject: IndexWriter is slow when reader is open

Hi,
I'm using Lucene.Net 2.4 and I just noticed that when I index documents
while there's at least one IndexReader open on that index (even without
doing anything), the indexing speed is slower by a factor of 3 to 5.
When the reader is closed, the indexing speed goes back to normal.
I'm not doing any deletes, only adds.

My index is going to be updated regularly and there's going to be a
reader/searcher in use almost all the time, so this might be a big
problem for me.

Does anyone have a clue if this is normal behavior? Why does it happen,
and how can I avoid such a big loss in performance?


Thanks,
Eran.



</pre>
</div>
</content>
</entry>
<entry>
<title>RE: FieldLookup for field with multiple values</title>
<author><name>Michael Garski &lt;mgarski@myspace-inc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200911.mbox/%3c7112862FD2F84D49927A5A5E0758451E01D93670@fegplmsexmb14.ffe.foxeg.com%3e"/>
<id>urn:uuid:%3c7112862FD2F84D49927A5A5E0758451E01D93670@fegplmsexmb14-ffe-foxeg-com%3e</id>
<updated>2009-11-12T18:14:41Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Matt,

Metadata can be collected during the course of a search in a
HitCollector (note that 2.9 will deprecate the current HitCollector in
favor of the more flexible Collector).  One approach to collecting
facets is to index the field with the desired facet metadata with
TermVectors enabled.  During collection you can retrieve the
TermFreqVector for a document with Reader.GetTermFreqVector for each
document and collect/aggregate metadata however you like.

Making the call to get the TermFreqVector for each hit will be a
performance hit due to IO and deserialization (even though it's very
fast, 50K times in one search is a lot).  At that point you can add
faster IO (Fusion IO rules this space) or during an index warmup create
an in-memory structure with lookups to the vectors keyed off the
document id.
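
As a rough illustration of the in-memory structure described above,
here is a minimal Python sketch (not Lucene.Net code).  The names
doc_orgs, doc_funding and top_orgs_by_funding are hypothetical, the
sample data is made up, and it assumes the equal split of funding
across a document's organizations mentioned elsewhere in this thread:

```python
from collections import defaultdict

# Hypothetical lookups built once during index warmup, keyed by document id.
doc_orgs = {0: ["org1", "org2"], 1: ["org2"], 2: ["org3", "org1"]}
doc_funding = {0: 100.0, 1: 50.0, 2: 30.0}


def top_orgs_by_funding(hit_doc_ids, n=5):
    """Aggregate funding per organization over the matching documents."""
    totals = defaultdict(float)
    for doc_id in hit_doc_ids:
        orgs = doc_orgs.get(doc_id, [])
        if not orgs:
            continue  # no organizations recorded for this document
        # Assumption: funding is split equally among the organizations.
        share = doc_funding.get(doc_id, 0.0) / len(orgs)
        for org in orgs:
            totals[org] += share
    # Sort by total descending, then by org id for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]
```

A collector would pass the ids of the hits it sees to a function like
this, so the per-hit cost is a dictionary lookup rather than a read of
the term vector from disk.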

Hope that helps.

Michael

-----Original Message-----
From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com] 
Sent: Thursday, November 12, 2009 7:13 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: FieldLookup for field with multiple values

I can elaborate a little on what our *planned* approach for utilizing
SQL Server is.  I don't know if this will work, but I've done similar
things with SQL CLR and haven't had it explode (yet), so I'm hopeful.
Anyway:

Our system needs two types of output, basically: the full-text reports
(I'll call them documents to reduce ambiguity), and then statistical
reports built on those documents.  The documents can easily be retrieved
from Lucene, so the challenge is building the reports.  For that, my
plan is to submit the same query string to SQL Server that I sent to
Lucene.  Internally, SQL Server would then pass the query back to Lucene
and retrieve a list of document IDs that matched.  The communication may
be over WCF or something similar, and will be compressed during transit
to reduce IO overhead.  Once SQL Server has the IDs, they will be loaded
into a temporary table (with indexes) or a table variable, which will
then be used to filter the metadata that the statistical reports are
built from.
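
The filtering step might look roughly like the sketch below, with
Python's built-in sqlite3 module standing in for SQL Server.  The table
and column names (report_meta, hit_ids, doc_id, org_id, funding) are
hypothetical and the rows are made up; it only illustrates the
temp-table join, not the SQL CLR or WCF parts:

```python
import sqlite3


def funding_by_org(conn, matching_doc_ids):
    """Load matching ids into a temp table and use it to filter metadata."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS hit_ids (doc_id INTEGER PRIMARY KEY)"
    )
    cur.execute("DELETE FROM hit_ids")  # clear ids from any earlier query
    cur.executemany(
        "INSERT INTO hit_ids (doc_id) VALUES (?)",
        [(d,) for d in matching_doc_ids],
    )
    cur.execute(
        "SELECT m.org_id, SUM(m.funding) FROM report_meta m "
        "JOIN hit_ids h ON h.doc_id = m.doc_id "
        "GROUP BY m.org_id ORDER BY SUM(m.funding) DESC, m.org_id"
    )
    return cur.fetchall()


def make_demo_db():
    """Build an in-memory database with a few made-up metadata rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_meta (doc_id INTEGER, org_id TEXT, funding REAL)"
    )
    conn.executemany(
        "INSERT INTO report_meta VALUES (?, ?, ?)",
        [(1, "org1", 10.0), (2, "org1", 5.0), (2, "org2", 20.0), (3, "org3", 99.0)],
    )
    return conn
```

Here the ids returned by the search play the role of matching_doc_ids;
the primary key on the temp table is the index used by the join.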

I have no idea how such a system would perform.  I do hope to do some
feasibility tests sometime Real Soon (like in the next few weeks), and
I'll post my results if I manage to get it working.

If anyone has any other suggestions, please do share.

On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie &lt;
mmcconna@oxford-analytica.com&gt; wrote:

&gt; &gt;While we're discussing this, anyone have any advice or suggestions
&gt; &gt;for a better solution?  We've considered a few things for our
&gt; &gt;long-term solution.
&gt;
&gt; I'd be very interested to hear thoughts on intersecting SQL and Lucene
&gt; too, as in our case we have very large lists of organisations which
&gt; have different permissions (stored in SQL) for different documents
&gt; stored in Lucene. Showing in the search results only those documents to
&gt; which the organisation has permission is quite expensive for queries
&gt; with lots of results. Storing it in the documents is not manageable
&gt; because they need to be updated frequently across multiple documents.
&gt; Currently we precompute a list for each organisation, cache that in
&gt; memory, and recache it every time that organisation is updated. However
&gt; this too is costly.
&gt;
&gt; Storing the Lucene document nos in SQL during indexing, and then
&gt; passing the list of Lucene document nos matching a search to SQL for
&gt; filtering seems the right way to go. But Matt is right, the problem is
&gt; with scaling this to searches returning many thousands of documents.
&gt;
&gt; Yours,
&gt; M.
&gt; -------------------------------------
&gt; Moray McConnachie
&gt; Director of IT    +44 1865 261 600
&gt; Oxford Analytica  http://www.oxan.com
&gt;
&gt; -----Original Message-----
&gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; Sent: 12 November 2009 13:53
&gt; To: lucene-net-user@incubator.apache.org
&gt; Subject: Re: FieldLookup for field with multiple values
&gt;
&gt; Yeah, it is sort of like your standard faceting scenario, except there
&gt; are about 20,000 facets (organizations), and there's complex
&gt; relationships among the facets.
&gt;
&gt; The reports we're dealing with only occasionally break the funding up
by
&gt; organization, so we decided (for now) to just store a single funding
&gt; value, then break it up after-the-fact by dividing it by the number of
&gt; organizations.  So no, the funding is only stored once.
&gt;
&gt; While we're discussing this, anyone have any advice or suggestions for
a
&gt; better solution?  We've considered a few things for our long-term
&gt; solution.
&gt; One is to put this metadata in a SQL Server instance, and use SQL CLR
to
&gt; build a temporary table based on document IDs from a Lucene index
&gt; (hosted over WCF or something similar), then do the reporting within
SQL
&gt; Server.  We plan to compress the list of IDs going back from Lucene to
&gt; SQL Server to cut down on IO overhead, but we're still concerned that
&gt; approach won't scale as we go from hundreds of thousands to millions
of
&gt; reports.
&gt;
&gt; Another option we've discussed is to precompute data cubes and use
these
&gt; to calculate reporting information.  The concern here is the high
&gt; dimensionality of the data (we have about 20,000 distinct
organizations
&gt; now, but fully expect that to increase by an order of magnitude) as
well
&gt; as the accuracy of the generated reports, since there's (probably) not
a
&gt; good way to divide the cube based on arbitrary Lucene queries.
&gt;
&gt; On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski
&gt; &lt;mgarski@myspace-inc.com&gt; wrote:
&gt;
&gt; &gt; Sounds like a full-text search with the results simply being facets
on
&gt;
&gt; &gt; the organizations sorted by the funding amount?
&gt; &gt;
&gt; &gt; You mentioned adding the org ID once for each document.  Do you do
the
&gt;
&gt; &gt; same for the funding, with the funding for each corresponding
&gt; organization?
&gt; &gt;
&gt; &gt; Michael
&gt; &gt;
&gt; &gt;
&gt; &gt; -----Original Message-----
&gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; Sent: Wed 11/11/2009 10:17 PM
&gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; Subject: Re: FieldLookup for field with multiple values
&gt; &gt;
&gt; &gt; Well, let me preface what I'm about to describe by saying that I know
&gt; &gt; that I'm doing something with Lucene that it wasn't meant to do.
This
&gt;
&gt; &gt; is for a "proof of concept" system that I'm helping put together on
a
&gt; &gt; tight schedule with very limited resources, and we're trying to get
to
&gt;
&gt; &gt; a mostly-working state as quickly as possible.
&gt; &gt;
&gt; &gt; That said, we are basically storing reports in Lucene.  The reports
&gt; &gt; are fairly standard documents for the most part: they have a title,
&gt; &gt; body, abstract, etc, all of which we index and search with Lucene.
&gt; &gt; However, they also have a few fields that aren't standard, including
a
&gt;
&gt; &gt; list of involved organizations as well as a dollar amount for each
&gt; &gt; report.  The organizations are stored as IDs, and we add the org ID
&gt; &gt; field multiple times, once for each organization involved in the
&gt; &gt; report.  The funding is also stored as a non-indexed field on the
&gt; &gt; Lucene document.
&gt; &gt;
&gt; &gt; What I'm trying to do is build a quick-and-dirty org-by-dollar
report
&gt; &gt; off of the reports that match the user's query.  So, a query for
&gt; &gt; "aerospace" might match 50,000 documents, and I want to show the
user
&gt; &gt; the top 5 organizations in terms of dollars.  Again, I know
reporting
&gt; &gt; like this isn't what Lucene was meant for, and we do have some ideas
&gt; &gt; on how to handle it long-term, but for now, I'm trying to get it
&gt; &gt; working as well as I can using Lucene alone, and Lucene does do a
&gt; &gt; great job of finding the relevant set of documents to build a report
&gt; &gt; from.
&gt; &gt;
&gt; &gt; On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski
&gt; &gt; &lt;mgarski@myspace-inc.com
&gt; &gt; &gt;wrote:
&gt; &gt;
&gt; &gt; &gt; Matt,
&gt; &gt; &gt;
&gt; &gt; &gt; StringIndex is for use when a field has only one value in it for
the
&gt;
&gt; &gt; &gt; purposes of sorting results, not for tokenized fields with
multiple
&gt; &gt; &gt; values.  TermVectors might be a better approach, but for 50K docs,
&gt; &gt; &gt; you'll encounter an IO hit on reading them.
&gt; &gt; &gt;
&gt; &gt; &gt; I'm curious why you are looking to grab all of the terms for a
&gt; &gt; &gt; ScoreDoc...  can you shed some light on that?
&gt; &gt; &gt;
&gt; &gt; &gt; Michael
&gt; &gt; &gt;
&gt; &gt; &gt; -----Original Message-----
&gt; &gt; &gt; From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
&gt; &gt; &gt; Sent: Wednesday, November 11, 2009 4:57 PM
&gt; &gt; &gt; To: lucene-net-user@incubator.apache.org
&gt; &gt; &gt; Subject: FieldLookup for field with multiple values
&gt; &gt; &gt;
&gt; &gt; &gt; It seems that the StringIndex returned by
&gt; &gt; &gt; FieldCache.Fields.Default.GetStringIndex() only indexes one value
&gt; &gt; &gt; for a document even when the document has multiple values for the
&gt; &gt; &gt; field.  Is there a performant way to get all the values for a
&gt; &gt; &gt; particular field in a ScoreDoc?  I'm having to do this across the
&gt; &gt; &gt; entire result set of ScoreDocs (up to 50,000), and retrieving the
&gt; &gt; &gt; values through LuceneDocument.GetFields is not going to cut it.
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt;
&gt;



</pre>
</div>
</content>
</entry>
</feed>
