Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 65427 invoked from network); 28 Feb 2007 21:39:49 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 28 Feb 2007 21:39:49 -0000 Received: (qmail 61808 invoked by uid 500); 28 Feb 2007 21:39:51 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 61776 invoked by uid 500); 28 Feb 2007 21:39:50 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 61765 invoked by uid 99); 28 Feb 2007 21:39:50 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Feb 2007 13:39:50 -0800 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [66.104.95.4] (HELO listing.marketingbrokers.com) (66.104.95.4) by apache.org (qpsmtpd/0.29) with SMTP; Wed, 28 Feb 2007 13:39:38 -0800 Received: from ip66-104-95-21.z95-104-66.customer.algx.net ([66.104.95.21]) by listing.marketingbrokers.com (JAMES SMTP Server 2.2.0) with SMTP ID 79 for ; Wed, 28 Feb 2007 13:38:55 -0800 (PST) Mime-Version: 1.0 (Apple Message framework v752.2) In-Reply-To: <6B6D3063DF0A7A4DB8A989D40C61BEB303EFDBDC@EMAIL02.wescodist.com> References: <6B6D3063DF0A7A4DB8A989D40C61BEB303EFDBDC@EMAIL02.wescodist.com> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: "Peter W." Subject: Re: Soliciting Design Thoughts on Date Searching Date: Wed, 28 Feb 2007 13:38:56 -0800 To: java-user@lucene.apache.org X-Mailer: Apple Mail (2.752.2) X-Virus-Checked: Checked by ClamAV on apache.org Hello, There are a few ways to solve this but no Date extraction filter I know of. Adding a hundred fields for each Lucene doc seems bloated. First, get your text out of the various source documents (.doc,.pdf,.html) using available tools out there described in the Lucene in Action book. It sounds like you know Perl, so next try regexes to pull out dates from the text using java.util.regex and make sure to remove extra whitespace. Put your clean date Strings into a java TreeMap or TreeSet collection to eliminate duplicates. Finally, loop thru the collection adding items to a StringBuffer delimited by commas, then make one long String (holding all your dates) and add to the Lucene doc as one Field.Text. You might be able to set that Field to indexed, but not stored to save space. Regards, Peter W. On Feb 28, 2007, at 11:22 AM, Aigner, Thomas wrote: > Walt, > I am no expert, but it sounds like you need to associate many > dates to a single record. > ... > > Tom > > > -----Original Message----- > From: Walt Stoneburner [mailto:walt.stoneburner@gmail.com] > Sent: Wednesday, February 28, 2007 2:13 PM > To: java-user@lucene.apache.org > Subject: Re: Soliciting Design Thoughts on Date Searching > > ... > The issue isn't in using DateRange on a Date Field, but in knowing if > there is some filter that already exists which extracts dates from a > body of text to put into a Date Field. If not, the DateTool solution > is a helpful step in building my own filter; I just don't want to > reinvent the wheel if it already exists. > > Now this is where my personal knowledge of Lucene breaks down. > Assuming I can extract each date from a source's body and convert it > to a usable format, can a Lucene Date Field hold more than one date? > For example, is a strict name/value pair, or can the value be a array > of dates, or can I append additional dates under the same name? > > Super generalizing, to break the discussion from a date specific > example, suppose I did this: > document.add( Field.Text( "title", "Learning Perl, Fourth Edition" ) > ); // real title > document.add( Field.Text( "title", "Camel Book" ) ); // my wife knows > it by the cover > > Could I do a search for both the long and short title against the > title > field? > > If the answer is yes, problem solved! I'll just pile on a ton of > dates as I find them and add them to the document. (Note, I could > easily have hundreds.) > > for ( Date somedate : allDatesFoundInSource[] ) { > document.add( Field.Text( "embeddedDates", somedate ) ); // Right > way to do this? > } > > > If the answer is no, it better illustrates the problem I face: > searching across an arbitrary collection of dates. > > > Erick, if I've missed something obvious in the archives, I'll happily > accept my public flogging. Thanks for your help so far. > > -wls --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org