lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date Thu, 14 Jun 2012 19:30:23 GMT
Look at this code: QueryTermExtractor.getTerms(Query query)
http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html

-- Jack Krupansky

-----Original Message----- 
From: Ilya Zavorin
Sent: Thursday, June 14, 2012 2:36 PM
To: java-user@lucene.apache.org
Subject: RE: need to find locations of query hits in doc: works fine for 
regular text but not for phone numbers



Uwe, sorry, but I am having trouble understanding this. Can you point me to a 
place in the documentation that explains this in more detail, or to some 
example code? I've read 
http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html
but am still confused.

Thanks much,

Ilya


-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Thursday, June 14, 2012 12:57 PM
To: java-user@lucene.apache.org
Subject: RE: need to find locations of query hits in doc: works fine for 
regular text but not for phone numbers

Just take the BooleanQuery returned by the QueryParser and get its clauses 
(sub-queries like TermQuery, PhraseQuery, other BooleanQuery...). That gives 
you all the query components. In most cases, some recursive instanceof 
checking for the various Query subclasses can do this.
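
A minimal sketch of the recursive instanceof walk described above. To keep it
runnable without Lucene on the classpath, the Query, TermQuery, PhraseQuery and
BooleanQuery classes here are hypothetical stand-ins that only mimic the shape
of the real ones; with Lucene 3.x you would read the clauses and terms from the
actual query objects instead. The sample query mirrors the
"contents:800 (contents:555 contents:1212)" output quoted later in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryTermWalk {
    // Hypothetical stand-ins for Lucene's query classes (NOT the real API).
    public static abstract class Query {}

    public static class TermQuery extends Query {
        final String text;
        public TermQuery(String text) { this.text = text; }
    }

    public static class PhraseQuery extends Query {
        final List<String> texts;
        public PhraseQuery(String... texts) { this.texts = Arrays.asList(texts); }
    }

    public static class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<Query>();
        public BooleanQuery add(Query q) { clauses.add(q); return this; }
    }

    // Recursively collect term texts, descending into nested boolean queries.
    static void collect(Query q, List<String> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).text);
        } else if (q instanceof PhraseQuery) {
            out.addAll(((PhraseQuery) q).texts);
        } else if (q instanceof BooleanQuery) {
            for (Query sub : ((BooleanQuery) q).clauses) {
                collect(sub, out);
            }
        }
        // Other query types are silently skipped in this sketch.
    }

    public static List<String> terms(Query q) {
        List<String> out = new ArrayList<String>();
        collect(q, out);
        return out;
    }

    public static void main(String[] args) {
        // Same shape as "contents:800 (contents:555 contents:1212)".
        Query q = new BooleanQuery()
                .add(new TermQuery("800"))
                .add(new BooleanQuery()
                        .add(new TermQuery("555"))
                        .add(new TermQuery("1212")));
        System.out.println(terms(q)); // prints [800, 555, 1212]
    }
}
```

Because the walk flattens nested clauses, both the flat and the parenthesized
toString() forms would yield the same list of terms.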

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ilya Zavorin [mailto:izavorin@caci.com]
> Sent: Thursday, June 14, 2012 6:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: need to find locations of query hits in doc: works fine for
> regular text but not for phone numbers
>
> OK, so I figured out what the problem was. It wasn't with the digits but
> rather with the various delimiters like "(" and "-" that I use.
>
> Essentially, the statement
>
> String[] subTerms = qstr.split("\\s+");
>
> does not split a query the same way the query parser does.
> And thanks, query.toString() helped me see that.
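
The mismatch described above can be illustrated with a small sketch. The
roughTokens helper is a hypothetical, simplified stand-in for an analyzer: it
lowercases the text and breaks on every non-letter, non-digit character. It is
not Lucene's tokenizer and will differ from it on other inputs; it only shows
why whitespace splitting keeps the phone number as one string while the index
holds separate pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitVsTokens {
    // Simplified stand-in for an analyzer (an assumption, not Lucene's behavior):
    // lowercase, then break on any run of non-letter, non-digit characters.
    public static List<String> roughTokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) { // a leading delimiter such as "(" yields an empty piece
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String qstr = "800-555-1212";
        // Whitespace split leaves the phone number intact.
        System.out.println(Arrays.asList(qstr.split("\\s+"))); // prints [800-555-1212]
        // Delimiter-aware split yields the three pieces seen in query.toString().
        System.out.println(roughTokens(qstr)); // prints [800, 555, 1212]
    }
}
```
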
>
> My question now is this: is there a way of easily extracting a
> sequence of substrings from query to use in place of the subTerms
> array I get from split?
>
> I see that sometimes query.toString() returns things like
>
> "contents:800 contents:555 contents:1212"
>
> but other times it's something like
>
> "contents:800 (contents:555 contents:1212)"
>
> So instead of trying to guess what other formats query.toString can
> produce and trying to parse those, can I somehow extract the substrings
> of the query reliably?
>
> Thanks!
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, June 13, 2012 11:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: need to find locations of query hits in doc: works fine for
> regular text but not for phone numbers
>
> Try putting the phone number in quotes in the query:
>
> String qstr = "\"800-555-1212\"";
>
> And check query.toString to see how the query parser analyzed the
> term, both with and without quotes.
>
> And make sure you initialized the query parser with "contents" as the
> default field.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Ilya Zavorin
> Sent: Wednesday, June 13, 2012 10:52 PM
> To: java-user@lucene.apache.org
> Subject: need to find locations of query hits in doc: works fine for
> regular text but not for phone numbers
>
> Hello All,
>
> I am using 3.4. I need to find locations of query hits in a document.
> What I've implemented works fine for textual queries but does not work
> for phone numbers.
>
> Here's how I index my docs:
>
> String oc = "Joe dialed 800-555-1212 but got a busy signal";
> doc.add(new Field("contents", oc, Field.Store.NO,
> Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>
>
> Now, here is how I find locations. I search for a query. If I get a hit,
> I split my query (in case it's multi-word) into words and search for each
> of them using TermFreqVector like this:
>
>
> //String qstr = "my multiword query"; // for queries like this it works fine...
> String qstr = "800-555-1212"; // ...but not for ones like this
> Query query = parser.parse(qstr);
> TopDocs results = searcher.search(query, Integer.MAX_VALUE);
> ScoreDoc[] hits = results.scoreDocs;
>
> String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
>
> for (int i = 0; i < hits.length; i++) {
>     int docId = hits[i].doc;
>     Document doc = searcher.doc(docId);
>
>     TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
>     TermPositionVector tpvector = (TermPositionVector) tfvector;
>
>     for (String subTerm : subTerms) {
>         String subq = subTerm.toLowerCase();
>         int termidx = tfvector.indexOf(subq); // get termidx = -1 here
>
>         TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
>         for (int j = 0; j < tvoffsetinfo.length; j++) {
>             int offsetStart = tvoffsetinfo[j].getStartOffset();
>             int offsetEnd = tvoffsetinfo[j].getEndOffset();
>             // ...
>
> For a query like "800-555-1212", tfvector.indexOf returns -1. What am
> I doing wrong?
>
> Thanks,
>
> Ilya Zavorin
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org