lucene-java-user mailing list archives

From Darin McBeath <ddmcbe...@yahoo.com.INVALID>
Subject Re: SpanQuery not working as expected
Date Mon, 09 Jun 2014 18:19:40 GMT
Hi Tim.

Thanks for your help.  A friend gave me some code (snippets below) that dumps the spans
the query actually matches, which provided some more insight.  Perhaps my findings will
help someone fix the bug.

So, I added my 2 documents:

public static String[] DOCS = {
  // Three authors: mcbeath/darin william, fulford/darby, mcbeath/darby
  "bauthors  bauthor blname  mcbeath elname slname  bfname  darin william efname sfname  eauthor sauthor"
    + "  bauthor blname  fulford elname slname  bfname  darby efname sfname  eauthor sauthor"
    + "  bauthor blname  mcbeath elname slname  bfname  darby efname sfname  eauthor sauthor"
    + "  eauthors sauthors",
  // Two authors: mcbeath/darin, fulford/darin
  "bauthors  bauthor blname  mcbeath elname slname  bfname  darin efname sfname  eauthor sauthor"
    + "  bauthor blname fulford elname slname bfname darin efname sfname eauthor sauthor"
    + "  eauthors sauthors",
};

I then coded the following SpanQuery.

  // Simple query for fname:darin and lname:fulford 
  ArrayList<SpanQuery> innerSpans = new ArrayList<SpanQuery>();
  
  // Construct the last name span
  ArrayList<SpanQuery> spansln = new ArrayList<SpanQuery>();
  spansln.add(new SpanTermQuery(new Term("content", "blname")));
  spansln.add(new SpanTermQuery(new Term("content", "fulford")));    
  spansln.add(new SpanTermQuery(new Term("content", "elname")));
  
  SpanNearQuery lnInnerIncludeQuery = new SpanNearQuery(
      spansln.toArray(new SpanQuery[spansln.size()]), Integer.MAX_VALUE, true);
  // Add the sep marker to the not clause
  SpanQuery lnInnerExcludeQuery = new SpanTermQuery(new Term("content", "slname"));
  innerSpans.add(new SpanNotQuery(lnInnerIncludeQuery,lnInnerExcludeQuery));
  
  // Construct the first name span
  ArrayList<SpanQuery> spansfn = new ArrayList<SpanQuery>();
  spansfn.add(new SpanTermQuery(new Term("content", "bfname")));
  spansfn.add(new SpanTermQuery(new Term("content", "darin")));    
  spansfn.add(new SpanTermQuery(new Term("content", "efname")));
  SpanNearQuery fnInnerIncludeQuery = new SpanNearQuery(
      spansfn.toArray(new SpanQuery[spansfn.size()]), Integer.MAX_VALUE, true);
  // Add the sep marker to the not clause
  SpanQuery fnInnerExcludeQuery = new SpanTermQuery(new Term("content", "sfname"));
  innerSpans.add(new SpanNotQuery(fnInnerIncludeQuery,fnInnerExcludeQuery));
  
  // Make the first/last name spans unordered
  SpanNearQuery innerSpanQuery = new SpanNearQuery(
      innerSpans.toArray(new SpanQuery[innerSpans.size()]), Integer.MAX_VALUE, false);
  
  ArrayList<SpanQuery> outerSpanQuery = new ArrayList<SpanQuery>();
  outerSpanQuery.add(new SpanTermQuery(new Term("content", "bauthor")));
  outerSpanQuery.add(innerSpanQuery);
  outerSpanQuery.add(new SpanTermQuery(new Term("content", "eauthor")));
  SpanNearQuery includeQuery = new SpanNearQuery(
      outerSpanQuery.toArray(new SpanQuery[outerSpanQuery.size()]), Integer.MAX_VALUE, true);
   
  // Add the sep marker to the not clause
  SpanQuery excludeQuery = new SpanTermQuery(new Term("content", "sauthor"));
  SpanNotQuery finalQuery = new SpanNotQuery(includeQuery,excludeQuery);  
  doSpanQuery(finalQuery, searcher, "fname:darin AND lname:fulford"); 

I noticed that this incorrectly matches doc 4 (results are below).

BEGIN QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor, spanNear([spanNot(spanNear([content:blname,
content:fulford, content:elname], 2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)], 2147483647, false),
content:eauthor], 2147483647, true), content:sauthor, 0, 0)
Score Doc: doc=5 score=1.0829407 shardIndex=-1
'bauthors  bauthor blname  mcbeath elname slname  bfname  darin efname sfname  eauthor sauthor
 bauthor blname fulford elname slname bfname darin efname sfname eauthor sauthor  eauthors
sauthors'

Score Doc: doc=4 score=0.610962 shardIndex=-1
'bauthors  bauthor blname  mcbeath elname slname  bfname  darin william efname sfname  eauthor
sauthor  bauthor blname  fulford elname slname  bfname  darby efname sfname  eauthor sauthor
 bauthor blname  mcbeath elname slname  bfname  darby efname sfname  eauthor sauthor  eauthors
sauthors'

Doc: 4 Start: 1 End: 12
Doc: 5 Start: 1 End: 11
Doc: 5 Start: 12 End: 22
END QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor, spanNear([spanNot(spanNear([content:blname,
content:fulford, content:elname], 2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)], 2147483647, false),
content:eauthor], 2147483647, true), content:sauthor, 0, 0)
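
To read the span dump above, it can help to map the reported positions back to tokens.
The following is a minimal plain-Java sketch, not Lucene code.  It assumes the field is
tokenized on whitespace with one position per token, and that the reported end position
is exclusive (as Spans.end() is documented to be).  SpanDumpReader and tokensInSpan are
made-up names for illustration.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): maps a dumped span back to tokens.
// Assumes whitespace tokenization, one position per token, and an exclusive end.
class SpanDumpReader {

    // Returns the tokens at positions [start, end), joined by single spaces.
    static String tokensInSpan(String text, int start, int end) {
        String[] tokens = text.trim().split("\\s+");
        int safeEnd = Math.min(end, tokens.length); // do not read past the last token
        return String.join(" ", Arrays.copyOfRange(tokens, start, safeEnd));
    }

    public static void main(String[] args) {
        // First author block of the document reported as doc=4 above
        String doc4 = "bauthors bauthor blname mcbeath elname slname bfname darin william"
                + " efname sfname eauthor sauthor bauthor blname fulford elname slname"
                + " bfname darby efname sfname eauthor sauthor";
        // "Doc: 4 Start: 1 End: 12" from the dump
        System.out.println(tokensInSpan(doc4, 1, 12));
    }
}
```

Under those assumptions, the "Doc: 4 Start: 1 End: 12" span covers only the first author
block (mcbeath / darin william), which contains no 'fulford' at all.  The "Doc: 5 Start: 1
End: 11" span would likewise cover only the first author block of doc 5 (mcbeath / darin).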

I then made one small change (made this SpanNearQuery 'ordered')

  // Make the first/last name spans ordered
  SpanNearQuery innerSpanQuery = new SpanNearQuery(
      innerSpans.toArray(new SpanQuery[innerSpans.size()]), Integer.MAX_VALUE, true);

And I get the correct results.

BEGIN QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor, spanNear([spanNot(spanNear([content:blname,
content:fulford, content:elname], 2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)], 2147483647, true),
content:eauthor], 2147483647, true), content:sauthor, 0, 0)
Score Doc: doc=5 score=0.76575476 shardIndex=-1
'bauthors  bauthor blname  mcbeath elname slname  bfname  darin efname sfname  eauthor sauthor
 bauthor blname fulford elname slname bfname darin efname sfname eauthor sauthor  eauthors
sauthors'

Doc: 5 Start: 12 End: 22
END QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor, spanNear([spanNot(spanNear([content:blname,
content:fulford, content:elname], 2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)], 2147483647, true),
content:eauthor], 2147483647, true), content:sauthor, 0, 0)

I'm not sure why switching from 'unordered' to 'ordered' makes it work correctly, but it
certainly sounds like a bug in Lucene.

If you have any thoughts for a workaround, I would be interested.
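
One stopgap that comes to mind, purely as an untested sketch: run the span query to collect
candidates, then re-check each candidate's stored field text in plain Java before accepting
it.  The code below is not Lucene code.  It assumes the marker scheme used in this thread
(bauthor/eauthor, blname/elname, bfname/efname), whitespace tokenization, and well-formed
markers; AuthorScopeFilter and hasAuthor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter (not part of Lucene): verifies a candidate document's text.
class AuthorScopeFilter {

    // True if a single bauthor ... eauthor block has firstName between bfname/efname
    // and lastName between blname/elname.
    static boolean hasAuthor(String text, String firstName, String lastName) {
        List<String> first = new ArrayList<>();
        List<String> last = new ArrayList<>();
        List<String> current = null; // list being filled, or null outside name markers
        for (String tok : text.trim().split("\\s+")) {
            switch (tok) {
                case "bauthor": // new author block: forget the previous author's names
                    first.clear();
                    last.clear();
                    current = null;
                    break;
                case "bfname":
                    current = first;
                    break;
                case "blname":
                    current = last;
                    break;
                case "efname":
                case "elname":
                    current = null;
                    break;
                case "eauthor": // block complete: both names must be in this block
                    if (first.contains(firstName) && last.contains(lastName)) {
                        return true;
                    }
                    break;
                default: // separators and other markers fall here and are ignored
                    if (current != null) {
                        current.add(tok);
                    }
            }
        }
        return false;
    }
}
```

With the two documents above, hasAuthor(text, "darin", "fulford") should reject the first
document and accept the second.  This costs an extra pass over each candidate's text, so it
only makes sense if the candidate set is small.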

Thanks again.

Darin.







----- Original Message -----
From: "Allison, Timothy B." <tallison@mitre.org>
To: Darin McBeath <ddmcbeath@yahoo.com>; "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Cc: 
Sent: Monday, June 9, 2014 2:10 PM
Subject: RE: SpanQuery not working as expected

Darin,
  I confirmed the behavior you reported.  This is probably the same bug that was reported
in LUCENE-5331. The trigger there seems to be multiple examples of the same token (which you
have plenty of).  I tested with just this:

[[darin fulford]~100 sauthor]!~0,0

darin fulford (non-directional) but no intervening sauthor

And that works correctly.

I also tested:
[[darin fulford]~100 (bauthor sauthor)]!~0,0

Same as above but with a SpanOr for bauthor|sauthor. And that works correctly, too.

So, yes, I think what you've found is a bug, unfortunately a known one that hasn't been fixed. 
There's also a chance that something else is going on...when I took your query and removed
b[lf]name and e[fl]name, the query still brought back both docs.  So, if you want to go this
route, I'd recommend flattening the markup as much as possible, but it still just might not
be possible.

I'm not sure that I understand all of your use cases, but, in general, the more you can do
with adding non-hierarchical meta-fields and the less you have to hack markup, the better. 
That said, it sounds like your problem is what the child/parent block join queries were built
for, and given your response, it sounds like you've already gone that route and you've found
performance not to be sufficient.

I'm sorry that I can't be of more help.

Best,

         Tim




-----Original Message-----
From: Darin McBeath [mailto:ddmcbeath@yahoo.com] 
Sent: Friday, June 06, 2014 1:03 PM
To: Allison, Timothy B.; java-user@lucene.apache.org
Subject: Re: SpanQuery not working as expected

Thanks Tim.

I have thought about this for the author field, and (as you suggest) it would probably work. 
I was actually going to experiment with this later today.

But I have another field that has a bit more nesting (and it contains authors).

For example, within a given document, I have the following:

References [ one or more]
  Authors [one or more]
     First Name
     Last Name

 
So, I would need to search for a specific author (matching first name and last name) within
a specific reference for a document.  With this double level of nesting, I don't think the
multivalued field approach would work (please correct me if I'm wrong).  That's why I decided
to use span queries.  My index has more than 100 fields, but I only have 2 or 3 fields that
require this structure search capability.  There are also many documents (100M) so I didn't
really want to get into a parent-child type approach.

Plus, there are also many other fields (both within an author and within an individual
reference) that need to be scoped.  For example, there is a 'source title' at the reference
level and an 'article title' at the reference level.  I would need to search within a given
reference where the 'source title' contains some words, where the 'article title' contains
some words, and where a specific author within this reference has 'john' as the first name
and 'smith' as the last name.

I guess I'm curious if what I was doing with the SpanQuery should have worked, whether I misunderstood
something, or if this is a bug.

Darin.




________________________________
From: "Allison, Timothy B." <tallison@mitre.org>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>; Darin McBeath <ddmcbeath@yahoo.com>

Sent: Friday, June 6, 2014 10:12 AM
Subject: RE: SpanQuery not working as expected


Hi Darin,

Have you thought about using multivalued fields?  If you set the positionIncrementGap to
something kind of big (well > 1, say :) ), and you know that your data is always authorfirst,
authorlast,  you could just search for "darin fulford".

The positionIncrementGap will prevent matching on Doc2 below.

Doc1
Authorsfield:
    Darin fulford

Doc2 
Authorsfield:
    Matilda darin
    Fulford alexandria
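
To illustrate why the gap matters, here is a plain-Java sketch that mimics how positions
are assigned across the values of a multivalued field.  It is a simplified model, not
Lucene code: it assumes whitespace tokenization, lowercasing, one position per token, and
that the first token of each later value lands gap + 1 positions after the last token of
the previous value.  PositionGapDemo and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Simplified model of token positions in a multivalued field (not Lucene code).
class PositionGapDemo {

    // Assigns a position to every token of every value; each later value starts
    // gap + 1 positions after the previous value's last token.
    static Map<Integer, String> positions(String[] values, int gap) {
        Map<Integer, String> byPosition = new HashMap<>();
        int pos = -1;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                pos += gap; // the gap is applied between values only
            }
            for (String tok : values[v].trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                byPosition.put(++pos, tok);
            }
        }
        return byPosition;
    }

    // Exact two-term phrase: 'second' must sit at the position right after 'first'.
    static boolean phraseMatches(String[] values, int gap, String first, String second) {
        Map<Integer, String> byPosition = positions(values, gap);
        for (Map.Entry<Integer, String> e : byPosition.entrySet()) {
            if (e.getValue().equals(first)
                    && second.equals(byPosition.get(e.getKey() + 1))) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, with a gap of 100, "darin fulford" matches Doc1 but not Doc2, because
'darin' and 'fulford' in Doc2 end up 101 positions apart.  With a gap of 0, Doc2 would
match, since the last token of one value is immediately followed by the first token of
the next.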

Don't get me wrong, I love the capabilities of SpanQuery, but will this simple solution meet
your needs?





-----Original Message-----
From: Darin McBeath [mailto:ddmcbeath@yahoo.com.INVALID] 
Sent: Thursday, June 05, 2014 7:17 PM
To: java-user@lucene.apache.org
Subject: SpanQuery not working as expected

I read through http://searchhub.org/2009/07/18/the-spanquery/, which provided a good
overview of how one can construct fairly complex span queries.  I was particularly interested
in the ability to construct nested span queries.  I'm trying to apply this concept to search
a field that contains some structure (as below).  I have a couple of other fields that will
have a bit more nesting, but this should give the general idea.  

authors
  author [one or more]
    first name
    last name

Prior to indexing the content with Lucene, I added some 'markers' around the various bits
I might want to search.  For example 'bauthor' implies beginning author, 'eauthor' implies
ending author, and 'sauthor' implies a separator between individual authors (that would be
used as part of the exclude clause in a not span query).  I do similar things for 'first
name' and 'last name'.
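
For concreteness, the marker scheme can be sketched as a small plain-Java encoder.  This
is an illustrative reconstruction inferred from the sample documents in this message, not
the actual indexing code; MarkerEncoder and encodeAuthors are hypothetical names, and each
author is passed as a {lastName, firstName} pair.

```java
// Hypothetical encoder for the marker scheme described above (illustration only).
class MarkerEncoder {

    // Wraps each author in bauthor ... eauthor sauthor, each last name in
    // blname ... elname slname, and each first name in bfname ... efname sfname.
    static String encodeAuthors(String[][] authors) {
        StringBuilder sb = new StringBuilder("bauthors");
        for (String[] author : authors) {
            sb.append(" bauthor")
              .append(" blname ").append(author[0]).append(" elname slname")
              .append(" bfname ").append(author[1]).append(" efname sfname")
              .append(" eauthor sauthor");
        }
        return sb.append(" eauthors sauthors").toString();
    }

    public static void main(String[] args) {
        String[][] authors = {{"mcbeath", "darin"}, {"fulford", "darin"}};
        System.out.println(encodeAuthors(authors));
    }
}
```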

My constructed query (as interpreted by Lucene) is included below.  This was extracted from
the 'parsed string' returned from the query when I set debug=true.  Within a given 'authscope'
field, I'm trying to find a situation where the author first name is 'darin' and the last
name is 'fulford' within a given 'author'.   

spanNot(
    spanNear(
        [authscope:bauthor, 
        spanNear(
            [spanNot(
                spanNear(
                    [authscope:bfname, 
                    authscope:darin, 
                    authscope:efname], 
                    2147483647, true), 
                authscope:sfname, 0, 0), 
             spanNot(
                spanNear(
                    [authscope:blname, 
                    authscope:fulford, 
                    authscope:elname], 
                    2147483647, true), 
                authscope:slname, 0, 0)], 
             2147483647, false), 
         authscope:eauthor], 
         2147483647, true), 
     authscope:sauthor, 0, 0)

I have loaded the following  2 documents into my index.

[
  {"id":"1", "authscope":" bauthors  bauthor blname mcbeath elname slname  bfname 
darin efname sfname  eauthor sauthor  bauthor blname  fulford elname slname  bfname 
darby efname sfname  eauthor sauthor  bauthor blname  mcbeath elname slname  bfname 
darby efname sfname  eauthor sauthor  eauthors sauthors "},
  {"id":"2", "authscope":" bauthors  bauthor blname  mcbeath elname slname  bfname 
darin efname sfname  eauthor sauthor  bauthor blname  fulford elname slname  bfname 
darin efname sfname  eauthor sauthor  eauthors sauthors "}
]

What I can't figure out is why the above query would match on both documents.  It should
only match the document with id:2.


Any insights would be appreciated.  I'm using Lucene 4.7.2.

Darin.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


