From: Ilya Zavorin
To: "java-user@lucene.apache.org"
Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date: Thu, 14 Jun 2012 16:49:23 +0000
In-Reply-To: <6CA6F368A005475EA26A2833C8CAD86A@JackKrupansky>

OK, so I figured out what the problem was. It wasn't with the digits but rather with the various delimiters like "(" and "-" that I use. Essentially, the statement

    String[] subTerms = qstr.split("\\s+");

does not split a query the same way the query parser would. And thanks, query.toString() helped me see that.

My question now is this: is there an easy way to extract a sequence of substrings from the query to use in place of the subTerms array I get from split? I see that sometimes query.toString() returns things like

    "contents:800 contents:555 contents:1212"

but other times it's something like

    "contents:800 (contents:555 contents:1212)"

So instead of trying to guess what other formats query.toString() can produce and trying to parse those, can I somehow extract the substrings of the query reliably?

Thanks!

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Wednesday, June 13, 2012 11:42 PM
To: java-user@lucene.apache.org
Subject: Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Try putting the phone number in quotes in the query:

    String qstr = "\"800-555-1212\"";

And check query.toString() to see how the query parser analyzed the term, both with and without quotes.

And make sure you initialized the query parser with "contents" as the default field.

-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Wednesday, June 13, 2012 10:52 PM
To: java-user@lucene.apache.org
Subject: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Hello All,

I am using Lucene 3.4. I need to find locations of query hits in a document. What I've implemented works fine for textual queries but does not work for phone numbers.

Here's how I index my docs:

    String oc = "Joe dialed 800-555-1212 but got a busy signal";
    doc.add(new Field("contents", oc, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

Now, here's how I find locations. I search for a query. If I get a hit, I split my query (in case it's multi-word) into words and look up each of them using TermFreqVector like this:

    //String qstr = "my multiword query"; // for queries like this it works fine...
    String qstr = "800-555-1212"; // ...but not for ones like this
    Query query = parser.parse(qstr);
    TopDocs results = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] hits = results.scoreDocs;
    String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
    for (int i = 0; i < hits.length; i++) {
        int docId = hits[i].doc;
        Document doc = searcher.doc(docId);
        TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
        TermPositionVector tpvector = (TermPositionVector)tfvector;
        for (String subTerm : subTerms) {
            String subq = subTerm.toLowerCase();
            int termidx = tfvector.indexOf(subq); // get termidx = -1 here
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
            for (int j=0;j
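
The mismatch discussed in this thread can be reproduced without Lucene: qstr.split("\\s+") leaves "800-555-1212" as a single string, while the analyzed query shows three separate terms (800, 555, 1212), so indexOf() on the term vector finds nothing for the unsplit string. The sketch below is only an illustration under stated assumptions. The class TokenSplitDemo and the method roughAnalyzerSplit are hypothetical names, and the regex is a crude stand-in for analyzer behavior, not Lucene's actual StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it does not use or depend on Lucene.
class TokenSplitDemo {

    // Mirrors the thread's original approach: split on whitespace only.
    static List<String> whitespaceSplit(String q) {
        return Arrays.asList(q.trim().split("\\s+"));
    }

    // Crude approximation of analyzer tokenization (an assumption, not
    // Lucene's real rules): lowercase, then break on every run of
    // characters that is neither a letter nor a digit.
    static List<String> roughAnalyzerSplit(String q) {
        List<String> terms = new ArrayList<String>();
        for (String t : q.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {   // a leading delimiter yields an empty first piece
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceSplit("800-555-1212"));      // [800-555-1212]
        System.out.println(roughAnalyzerSplit("(800) 555-1212")); // [800, 555, 1212]
    }
}
```

For plain multi-word text both methods return the same list, which would explain why only the phone-number queries failed. To get the real term list rather than an approximation, Lucene 3.x documents Query.extractTerms(Set<Term>), which collects the terms of a rewritten query; it is not shown here because it needs the Lucene jar and has not been tried against this setup.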