Reply-To: java-user@lucene.apache.org
Subject: RE: Extracting contact data
Date: Thu, 14 Jan 2010 11:04:15 +0100
Message-ID:
<08FC5E29EDEA6247B7E499F2EC96BAD3869394@exchq-001.wdnet.org>
References: <08FC5E29EDEA6247B7E499F2EC96BAD3869392@exchq-001.wdnet.org> <359a92831001130905o3477165dg6202efe9c2ea079b@mail.gmail.com>
From: "Ortelli, Gian Luca"

Well, we're going to find out the exact definition empirically, as we run
an implementation through our data and look at the quality of the results...
For now, I would use the number of tokens between the finding
("abc@def.com") and the word that gives context ("Contact").

Anyway, replying to Karl: I'm not searching for a given email/street/time
interval/etc.; I need to extract EVERY email/street/time interval/etc. from
the text. That is the kind of need for which you suggest a natural language
processing tool.

Gianluca

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wednesday, January 13, 2010 6:06 PM
To: java-user@lucene.apache.org
Subject: Re: Extracting contact data

Before answering, how do you measure "proximity"? You can make Lucene work
with locations (there's an example in Lucene In Action) readily enough
though....

HTH
Erick

On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca <gianluca.ortelli@truvo.com> wrote:

> Hi community,
>
> I have a general understanding of Lucene concepts, and I'm wondering if
> it's the right tool for my job:
>
> - I need to extract data such as time intervals ("8am - 12pm") or street
> addresses from a set of files. The common issue with these data units is
> that they contain spaces and are not always definable through regexes.
>
> - the extraction must take into consideration the "proximity": for
> example, a mail address which is close to the word "Contacts" will
> receive a higher rank, since I'm looking for contact data.
>
> Do you think I can get any advantage from building a solution on Lucene?
>
> Gianluca
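
The token-count notion of proximity mentioned in the reply can be sketched
in a few lines. This is only an illustration in plain Python, not Lucene
code: the names EMAIL_RE, CONTEXT_WORDS and rank_emails are hypothetical,
the e-mail regex is deliberately simplistic, and whitespace splitting
stands in for a real tokenizer.

```python
import re

# Simplistic e-mail pattern (an assumption; real addresses are messier).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical context words that signal "contact data".
CONTEXT_WORDS = {"contact", "contacts"}

PUNCTUATION = ".,:;()\"'"


def rank_emails(text):
    """Return (email, tokens_between) pairs, closest to a context word first.

    tokens_between is the number of whitespace-separated tokens lying
    between the e-mail and the nearest context word, or None when the
    text contains no context word at all.
    """
    tokens = text.split()
    context_positions = [
        i for i, tok in enumerate(tokens)
        if tok.strip(PUNCTUATION).lower() in CONTEXT_WORDS
    ]
    found = []
    for i, tok in enumerate(tokens):
        match = EMAIL_RE.search(tok)
        if match is None:
            continue
        if context_positions:
            # Adjacent tokens have zero tokens between them.
            distance = min(abs(i - p) for p in context_positions) - 1
        else:
            distance = None
        found.append((match.group(0), distance))
    # Findings without any context sort last.
    found.sort(key=lambda pair: (pair[1] is None, pair[1] or 0))
    return found


if __name__ == "__main__":
    sample = "Write to far@example.org for anything else; our Contact is near@example.org"
    for email, distance in rank_emails(sample):
        print(email, distance)
```

Every e-mail found is returned, not just the best one, which matches the
"extract EVERY email" requirement; the distance only decides the order.
Time intervals and street addresses would need their own extractors, since
they span several tokens.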