lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?
Date Tue, 03 Nov 2015 13:14:34 GMT
Thank you.  Yes, the Map<Integer, OffsetAttribute> charOffsets maps each token position
(Integer) to the character offsets for that token, so I think I'm good?
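
To make the final lookup step concrete, here is a minimal sketch of how span positions could be resolved against that map. It assumes plain int[] pairs in place of OffsetAttribute so that it runs without Lucene on the classpath; the names SpanCharOffsets and resolve are hypothetical and are not part of Lucene or LUCENE-5317.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: resolves span token positions to character offsets.
final class SpanCharOffsets {

    /**
     * @param spanPositions one {firstTokenPosition, lastTokenPosition} pair per span (both inclusive)
     * @param charOffsets   token position -> {startOffset, endOffset}, as gathered by the collector
     * @return one {startOffset, endOffset} pair per span whose boundary tokens were both collected
     */
    static List<int[]> resolve(List<int[]> spanPositions, Map<Integer, int[]> charOffsets) {
        List<int[]> result = new ArrayList<>();
        for (int[] span : spanPositions) {
            int[] first = charOffsets.get(span[0]);
            int[] last = charOffsets.get(span[1]);
            if (first == null || last == null) {
                continue; // a boundary token was not collected; skip rather than guess
            }
            // the span runs from the start of its first token to the end of its last token
            result.add(new int[] {first[0], last[1]});
        }
        return result;
    }

    public static void main(String[] args) {
        // "the quick brown fox jumped over the lazy dog": quick is position 1, fox is position 3
        Map<Integer, int[]> charOffsets = new HashMap<>();
        charOffsets.put(1, new int[] {4, 9});   // "quick"
        charOffsets.put(3, new int[] {16, 19}); // "fox"
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {1, 3});
        for (int[] o : resolve(spans, charOffsets)) {
            System.out.println(o[0] + "-" + o[1]); // prints "4-19"
        }
    }
}
```

Because the map is keyed on token position, this only works if the span boundaries are positions too. If several tokens share a position (synonyms, for example), the last one collected wins.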

As part of LUCENE-5317, I have a DocTokenOffsetsVisitor interface and SpanCrawler that runs
that visitor against an IndexSearcher...

The visitor sees a Document and a list of character offsets for hits in that document.
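
As a rough illustration of that shape only (these are not the actual LUCENE-5317 signatures), a visitor of this kind might look like the sketch below. The names HitOffsetsVisitor and SnippetVisitor are hypothetical, and a Map<String, String> of stored fields stands in for a Lucene Document so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical shapes for illustration only; the real LUCENE-5317 types may differ.
interface HitOffsetsVisitor {
    /** Called once per matching document with the character offsets of its hits. */
    void visit(Map<String, String> storedFields, List<int[]> hitCharOffsets);
}

// Example visitor: cuts the matched text out of one stored field.
final class SnippetVisitor implements HitOffsetsVisitor {
    private final String field;
    final List<String> snippets = new ArrayList<>();

    SnippetVisitor(String field) {
        this.field = field;
    }

    @Override
    public void visit(Map<String, String> storedFields, List<int[]> hitCharOffsets) {
        String text = storedFields.get(field);
        if (text == null) {
            return; // field not stored for this document
        }
        for (int[] o : hitCharOffsets) {
            // guard against offsets that do not fit the stored text
            if (o[0] >= 0 && o[0] <= o[1] && o[1] <= text.length()) {
                snippets.add(text.substring(o[0], o[1]));
            }
        }
    }
}
```

The crawler side would then only need to compute the offsets per document and hand them to whichever visitor the caller supplies.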

Once I update that to use the more modern code below (as an option for those who store offsets), is
there any interest in integrating LUCENE-5317, or components of it, so that others don't have
to reinvent the wheel below?


Cheers,

           Tim
-----Original Message-----
From: Alan Woodward [mailto:alan@flax.co.uk] 
Sent: Tuesday, November 03, 2015 4:25 AM
To: java-user@lucene.apache.org
Subject: Re: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

The second parameter passed to SpanCollector.collectLeaf() is the position, rather than an
index of any kind, which I think is going to mess things up for you.  But other than that,
you've got the right idea. :-)

Alan Woodward
www.flax.co.uk


On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:

> All,
> 
>  I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1.
> I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the matching
> terms, not the full Spans (related to LUCENE-6796?).
> 
>  My current code iterates through the spans, stores the span positions in one array and
> gathers the character offsets via a SpanCollector in a Map<Integer, OffsetAttribute>.
> Is there a simpler way?
> 
> Something like this:
> 
> String s = "the quick brown fox jumped over the lazy dog";
> String field = "f";
> Analyzer analyzer = new StandardAnalyzer();
> 
> SpanQuery spanQuery = new SpanNearQuery(
>        new SpanQuery[] {
>                new SpanTermQuery(new Term(field, "fox")),
>                new SpanTermQuery(new Term(field, "quick"))
>        },
>        3,
>        false
> );
> 
> 
> MemoryIndex index = new MemoryIndex(true);
> 
> 
> index.addField(field, s, analyzer);
> index.freeze();
> 
> IndexSearcher searcher = index.createSearcher();
> IndexReader reader = searcher.getIndexReader();
> spanQuery = (SpanQuery) spanQuery.rewrite(reader);
> SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
> Spans spans = weight.getSpans(reader.leaves().get(0),
>        SpanWeight.Postings.OFFSETS);
> 
> if (spans == null) {
> //do something with full string
>     return;
> }
> 
> OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
> List<OffsetAttribute> spanPositions = new ArrayList<>();
> while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>    while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
>        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>        offsetAttribute.setOffset(spans.startPosition(), spans.endPosition()-1);
>        spanPositions.add(offsetAttribute);
>        spans.collect(offsetSpanCollector);
>    }
> }
> Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
> //now iterate through the list of spanPositions and grab the character
> //offsets for the start and end tokens of each span from the charOffsets
> ...
> 
> 
> 
> 
> private class OffsetSpanCollector implements SpanCollector {
>    // token position -> character offsets of the token at that position
>    Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();
> 
>    @Override
>    public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
>        // i is the token position of the collected term
>        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>        offsetAttribute.setOffset(postingsEnum.startOffset(), postingsEnum.endOffset());
>        charOffsets.put(i, offsetAttribute);
>    }
> 
>    @Override
>    public void reset() {
>      //don't think I need to do anything with this?
>    }
> 
>    public Map<Integer, OffsetAttribute> getOffsets() {
>        return charOffsets;
>    }
> }
> 
> 

