lucene-java-user mailing list archives

From Ilya Zavorin <izavo...@caci.com>
Subject RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date Thu, 14 Jun 2012 18:36:35 GMT


Uwe, sorry, but I am having trouble understanding this. Can you point me to a place in the documentation
that explains this in more detail (I've read http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html
but am still confused) or to some example code?

Thanks much,

Ilya


-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Thursday, June 14, 2012 12:57 PM
To: java-user@lucene.apache.org
Subject: RE: need to find locations of query hits in doc: works fine for regular text but
not for phone numbers

Just take the BooleanQuery returned by the QueryParser and get its clauses (sub-queries like
TermQuery, PhraseQuery, other BooleanQuery...). That way you get all query components. In most
cases, some recursive instanceof checking for the various Query subclasses can do this.
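
The recursive instanceof walk described above can be sketched as follows. This is only an illustration under stated assumptions: the nested Query, TermQuery, PhraseQuery and BooleanQuery classes are hypothetical stand-ins so the example compiles without Lucene on the classpath, and QueryTermCollector, collect and terms are made-up names. Against the real Lucene 3.x classes the same walk would, to my understanding, read the clauses through BooleanQuery.getClauses() and BooleanClause.getQuery(), and the term text through TermQuery.getTerm().text() and PhraseQuery.getTerms().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QueryTermCollector {
    // Hypothetical stand-ins for Lucene's query classes (NOT the real API),
    // kept minimal so the sketch is self-contained.
    static abstract class Query {}

    static final class TermQuery extends Query {
        final String text;
        TermQuery(String text) { this.text = text; }
    }

    static final class PhraseQuery extends Query {
        final List<String> texts;
        PhraseQuery(String... texts) { this.texts = Arrays.asList(texts); }
    }

    static final class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<Query>();
        BooleanQuery add(Query q) { clauses.add(q); return this; }
    }

    // Recursively descend into boolean clauses and collect term text in order.
    static void collect(Query q, List<String> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).text);
        } else if (q instanceof PhraseQuery) {
            out.addAll(((PhraseQuery) q).texts);
        } else if (q instanceof BooleanQuery) {
            for (Query sub : ((BooleanQuery) q).clauses) {
                collect(sub, out);
            }
        }
        // Any other query type is silently skipped in this sketch.
    }

    static List<String> terms(Query q) {
        List<String> out = new ArrayList<String>();
        collect(q, out);
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the nested shape "contents:800 (contents:555 contents:1212)".
        Query nested = new BooleanQuery()
                .add(new TermQuery("800"))
                .add(new BooleanQuery()
                        .add(new TermQuery("555"))
                        .add(new TermQuery("1212")));
        System.out.println(terms(nested)); // prints [800, 555, 1212]
    }
}
```

Because the walk works on the parsed query objects, it gives the same list whether the parser nests the clauses or keeps them flat, so there is no need to parse the toString() output.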

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ilya Zavorin [mailto:izavorin@caci.com]
> Sent: Thursday, June 14, 2012 6:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: need to find locations of query hits in doc: works fine for regular
> text but not for phone numbers
> 
> OK, so I figured out what the problem was. It wasn't with the digits but rather
> with the various delimiters like "(" and "-" that I use.
> 
> Essentially, the statement
> 
> 	String[] subTerms = qstr.split("\\s+");
> 
> does not split the query the same way the query parser does.
> Thanks, query.toString() helped me see that.
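
To make the mismatch concrete, here is a small standard-library sketch. It assumes the analyzer breaks text at every non-alphanumeric character and lowercases it, which is what the query.toString() output quoted in this thread suggests; the thread does not say which analyzer is in use, and the regex below only approximates one. SplitVersusTokens, whitespaceSplit and alnumTokens are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVersusTokens {
    // What the original code does: break only on runs of whitespace.
    static List<String> whitespaceSplit(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // Rough approximation of an analyzer that treats "(", ")" and "-"
    // as delimiters and lowercases each token (an assumption, see above).
    static List<String> alnumTokens(String s) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile("[A-Za-z0-9]+").matcher(s);
        while (m.find()) {
            out.add(m.group().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        String qstr = "800-555-1212";
        System.out.println(whitespaceSplit(qstr)); // prints [800-555-1212]
        System.out.println(alnumTokens(qstr));     // prints [800, 555, 1212]
    }
}
```

The whitespace split keeps the phone number as one string, while the delimiter-aware split yields the three pieces that the parsed query actually contains.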
> 
> My question now is this: is there an easy way to extract a sequence of
> substrings from the query to use in place of the subTerms array I get from split?
> 
> I see that sometimes query.toString() returns things like
> 
> 	"contents:800 contents:555 contents:1212"
> 
> but other times it's something like
> 
> 	"contents:800 (contents:555 contents:1212)"
> 
> So instead of trying to guess what other formats query.toString() can produce
> and trying to parse those, can I somehow extract the substrings of the query
> reliably?
> 
> Thanks!
> 
> 
> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, June 13, 2012 11:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: need to find locations of query hits in doc: works fine for regular
> text but not for phone numbers
> 
> Try putting the phone number in quotes in the query:
> 
> String qstr = "\"800-555-1212\"";
> 
> And check query.toString() to see how the query parser analyzed the term, both
> with and without quotes.
> 
> And make sure you initialized the query parser with "contents" as the default
> field.
> 
> -- Jack Krupansky
> 
> -----Original Message-----
> From: Ilya Zavorin
> Sent: Wednesday, June 13, 2012 10:52 PM
> To: java-user@lucene.apache.org
> Subject: need to find locations of query hits in doc: works fine for regular text
> but not for phone numbers
> 
> Hello All,
> 
> I am using 3.4. I need to find locations of query hits in a document. What I've
> implemented works fine for textual queries but does not work for phone
> numbers.
> 
> Here's how I index my docs:
> 
> String oc = "Joe dialed 800-555-1212 but got a busy signal";
> doc.add(new Field("contents", oc, Field.Store.NO,
>         Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
> 
> 
> Now, here is how I find locations. I search for a query. If I get a hit, I split my
> query (in case it's multi-word) into words and search for each of them using
> TermFreqVector like this:
> 
> 
> //String qstr = "my multiword query"; // for queries like this it works fine...
> String qstr = "800-555-1212"; // ...but not for ones like this
> Query query = parser.parse(qstr);
> TopDocs results = searcher.search(query, Integer.MAX_VALUE);
> ScoreDoc[] hits = results.scoreDocs;
> 
> String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
> 
> for (int i = 0; i < hits.length; i++) {
>     int docId = hits[i].doc;
>     Document doc = searcher.doc(docId);
> 
>     TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
>     TermPositionVector tpvector = (TermPositionVector) tfvector;
> 
>     for (String subTerm : subTerms) {
>         String subq = subTerm.toLowerCase();
>         int termidx = tfvector.indexOf(subq); // get termidx = -1 here
> 
>         TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
>         for (int j = 0; j < tvoffsetinfo.length; j++) {
>             int offsetStart = tvoffsetinfo[j].getStartOffset();
>             int offsetEnd = tvoffsetinfo[j].getEndOffset(); // ...
> 
> For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing
> wrong?
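
A result of -1 is what one would expect if the indexed field holds only the pieces of the number and not the whole string. The sketch below imitates that situation with plain collections. It is not Lucene code: OffsetLookup, index and offsets are hypothetical names, the map stands in for a term vector with offsets, and the regex tokenizer is an assumption about how the analyzer splits the text. In the real loop, the equivalent safeguard would be to check termidx for -1 before calling getOffsets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetLookup {
    // Map each lowercased alphanumeric token to its {start, end} character
    // offsets in the text (end is exclusive). Stand-in for a term vector.
    static Map<String, List<int[]>> index(String text) {
        Map<String, List<int[]>> idx = new LinkedHashMap<String, List<int[]>>();
        Matcher m = Pattern.compile("[A-Za-z0-9]+").matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase();
            List<int[]> hits = idx.get(term);
            if (hits == null) {
                hits = new ArrayList<int[]>();
                idx.put(term, hits);
            }
            hits.add(new int[] { m.start(), m.end() });
        }
        return idx;
    }

    // Return the offsets for a term, or an empty list when the term was
    // never indexed (the analogue of indexOf returning -1).
    static List<int[]> offsets(Map<String, List<int[]>> idx, String term) {
        List<int[]> hits = idx.get(term.toLowerCase());
        return hits == null ? Collections.<int[]>emptyList() : hits;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> idx =
                index("Joe dialed 800-555-1212 but got a busy signal");
        // The whole phone string was never stored as a single term.
        System.out.println(offsets(idx, "800-555-1212").size()); // prints 0
        // Each piece was stored, with its own offsets.
        for (String piece : new String[] { "800", "555", "1212" }) {
            int[] o = offsets(idx, piece).get(0);
            System.out.println(piece + " " + o[0] + "-" + o[1]);
        }
    }
}
```

Looking up the pieces one at a time returns their offsets, and the span of the full number can then be taken from the start of the first piece to the end of the last.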
> 
> Thanks,
> 
> Ilya Zavorin
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org