lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date Thu, 14 Jun 2012 03:41:48 GMT
Try putting the phone number in quotes in the query:

String qstr = "\"800-555-1212\"";

And check query.toString() to see how the query parser analyzed the term, 
both with and without quotes.

And make sure you initialized the query parser with "contents" as the 
default field.
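
One possible reason for the -1, sketched in plain Java rather than Lucene: if the analyzer used at index time splits the number at the hyphens, the term vector would hold "800", "555" and "1212" but never the full string, so looking up the raw query text finds nothing. The names toyAnalyze and indexOf below are hypothetical stand-ins, not Lucene APIs, and the splitting rule is an assumption; what your analyzer really emits is what query.toString() will tell you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TermLookupSketch {

    // Hypothetical stand-in for an analyzer (NOT Lucene code): lowercases the
    // text and splits it on every run of characters that are not letters or digits.
    static List<String> toyAnalyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical lookup with the same "index or -1" contract as a term
    // vector lookup: returns the position of the term, or -1 if absent.
    static int indexOf(List<String> indexedTerms, String term) {
        return indexedTerms.indexOf(term);
    }

    public static void main(String[] args) {
        List<String> indexed =
            toyAnalyze("Joe dialed 800-555-1212 but got a busy signal");

        String qstr = "800-555-1212";

        // Looking up the raw query string: no such term was indexed.
        System.out.println(qstr + " -> " + indexOf(indexed, qstr));

        // Looking up each analyzed token instead: every one is found.
        for (String token : toyAnalyze(qstr)) {
            System.out.println(token + " -> " + indexOf(indexed, token));
        }
    }
}
```

If that is what is happening, derive the terms to look up by running the query text through the same analyzer used for indexing instead of splitting on whitespace, and check for -1 before asking for offsets.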

-- Jack Krupansky

-----Original Message----- 
From: Ilya Zavorin
Sent: Wednesday, June 13, 2012 10:52 PM
To: java-user@lucene.apache.org
Subject: need to find locations of query hits in doc: works fine for regular 
text but not for phone numbers

Hello All,

I am using Lucene 3.4. I need to find locations of query hits in a document. What 
I've implemented works fine for textual queries but does not work for phone 
numbers.

Here's how I index my docs:

String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents",
                  oc,
                  Field.Store.NO,
                  Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));


Now, here's how I find locations. I search for a query. If I get a hit, I 
split my query (in case it's multi-word) into words and search for each of 
them using TermFreqVector like this:


//String qstr = "my multiword query"; // for queries like this it works fine...
String qstr = "800-555-1212"; // ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

String[] subTerms = qstr.split("\\s+"); // phone string stays intact here

for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);

    TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
    TermPositionVector tpvector = (TermPositionVector) tfvector;

    for (String subTerm : subTerms) {
        String subq = subTerm.toLowerCase();
        int termidx = tfvector.indexOf(subq);  // get termidx = -1 here

        TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
        for (int j = 0; j < tvoffsetinfo.length; j++) {
            int offsetStart = tvoffsetinfo[j].getStartOffset();
            int offsetEnd = tvoffsetinfo[j].getEndOffset();
            // ...

For a query like "800-555-1212", tfvector.indexOf returns -1. What am I 
doing wrong?

Thanks,

Ilya Zavorin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 

