lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date Thu, 14 Jun 2012 16:57:25 GMT
Just take the BooleanQuery returned by the QueryParser and get its clauses
(sub-queries such as TermQuery, PhraseQuery, or nested BooleanQuery). That
gives you all the query components. In most cases, a recursive walk with
instanceof checks for the various Query subclasses is enough.
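
A minimal sketch of that recursive walk, assuming the Lucene 3.x API
(BooleanQuery.getClauses(), BooleanClause.getQuery(), TermQuery.getTerm(),
PhraseQuery.getTerms()). The class name QueryTermCollector and the method
collectTermTexts are made up for illustration, and skipping prohibited
clauses is my own choice here, not something the parser requires:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper, written against Lucene 3.x; not part of Lucene itself.
public class QueryTermCollector {

    /** Returns the text of every term in the query tree, in clause order. */
    public static List<String> collectTermTexts(Query query) {
        List<String> out = new ArrayList<String>();
        collect(query, out);
        return out;
    }

    private static void collect(Query query, List<String> out) {
        if (query instanceof TermQuery) {
            // Leaf: a single analyzed term such as contents:800
            out.add(((TermQuery) query).getTerm().text());
        } else if (query instanceof PhraseQuery) {
            // A phrase holds several terms in order
            for (Term t : ((PhraseQuery) query).getTerms()) {
                out.add(t.text());
            }
        } else if (query instanceof BooleanQuery) {
            // Recurse into each clause; nested BooleanQuery is handled the same way
            for (BooleanClause clause : ((BooleanQuery) query).getClauses()) {
                if (!clause.isProhibited()) { // assumption: ignore NOT clauses
                    collect(clause.getQuery(), out);
                }
            }
        }
        // Other Query subclasses (prefix, wildcard, range, ...) are ignored in this sketch.
    }
}
```

Calling collectTermTexts(parser.parse(qstr)) should then give the analyzed
terms to use in place of the subTerms array, whether toString() prints the
query flat or with nested parentheses.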

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ilya Zavorin [mailto:izavorin@caci.com]
> Sent: Thursday, June 14, 2012 6:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: need to find locations of query hits in doc: works fine for regular
> text but not for phone numbers
> 
> OK, so I figured out what the problem was. It wasn't with the digits but rather
> with the various delimiters like "(" and "-" that I use.
> 
> Essentially, the statement
> 
> 	String[] subTerms = qstr.split("\\s+");
> 
> does not split a query the same way the query parser would. And thanks,
> query.toString() helped me see that.
> 
> My question now is this: is there an easy way to extract a sequence of
> substrings from the query to use in place of the subTerms array I get from split?
> 
> I see that sometimes query.toString() returns things like
> 
> 	"contents:800 contents:555 contents:1212"
> 
> but other times it's something like
> 
> 	"contents:800 (contents:555 contents:1212)"
> 
> So instead of trying to guess what other formats query.toString() can produce
> and trying to parse those, can I somehow extract the substrings of the query
> reliably?
> 
> Thanks!
> 
> 
> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, June 13, 2012 11:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: need to find locations of query hits in doc: works fine for regular
> text but not for phone numbers
> 
> Try putting the phone number in quotes in the query:
> 
> String qstr = "\"800-555-1212\"";
> 
> And check query.toString to see how the query parser analyzed the term, both
> with and without quotes.
> 
> And make sure you initialized the query parser with "contents" as the default
> field.
> 
> -- Jack Krupansky
> 
> -----Original Message-----
> From: Ilya Zavorin
> Sent: Wednesday, June 13, 2012 10:52 PM
> To: java-user@lucene.apache.org
> Subject: need to find locations of query hits in doc: works fine for regular text
> but not for phone numbers
> 
> Hello All,
> 
> I am using 3.4. I need to find locations of query hits in a document. What I've
> implemented works fine for textual queries but does not work for phone
> numbers.
> 
> Here's how I index my docs:
> 
> String oc = "Joe dialed 800-555-1212 but got a busy signal";
> doc.add(new Field("contents", oc, Field.Store.NO, Field.Index.ANALYZED,
>     Field.TermVector.WITH_POSITIONS_OFFSETS));
> 
> 
> Now, here's how I find locations. I search for a query. If I get a hit, I split my
> query (in case it's multi-word) into words and search for each of them using
> TermFreqVector like this:
> 
> 
> //String qstr = "my multiword query"; // for queries like this it works fine...
> String qstr = "800-555-1212"; // ...but not for ones like this
> Query query = parser.parse(qstr);
> TopDocs results = searcher.search(query, Integer.MAX_VALUE);
> ScoreDoc[] hits = results.scoreDocs;
> 
> String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
> 
> for (int i = 0; i < hits.length; i++) {
>     int docId = hits[i].doc;
>     Document doc = searcher.doc(docId);
> 
>     TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
>     TermPositionVector tpvector = (TermPositionVector) tfvector;
> 
>     for (String subTerm : subTerms) {
>         String subq = subTerm.toLowerCase();
>         int termidx = tfvector.indexOf(subq);  // get termidx = -1 here
> 
>         TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
>         for (int j = 0; j < tvoffsetinfo.length; j++) {
>             int offsetStart = tvoffsetinfo[j].getStartOffset();
>             int offsetEnd = tvoffsetinfo[j].getEndOffset(); // ...
> 
> For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing
> wrong?
> 
> Thanks,
> 
> Ilya Zavorin
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

