lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Lucene searching across documents
Date Wed, 08 Apr 2009 16:58:06 GMT
Hi Dan,

My guess, though you didn't say so directly, is that you're representing each sentence/"line"
as a separate Lucene document.  To answer your question directly: as far as I know, Lucene
can't query inter-document relations (like database joins).  The only workaround I can think
of is to run multiple searches, feeding the results of one query into the next.  For example:
first query for all lines with tag X and retrieve their line-ID and transcript-ID field values;
then query for tag Y, requiring the same transcript-ID field value and any one of the line-ID
values that fall within the window you want.
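
To make the two-pass idea concrete, here is a minimal sketch in plain Java. It does not call
Lucene at all: the in-memory list stands in for an index with one document per line, and the
names (LineDoc, transcriptId, lineId, pairsWithin) are made up for illustration. In a real
index, each pass would be a separate search and the filtering would be expressed as query
clauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory stand-in for an index that holds one "document" per transcript line.
final class TwoPassTagSearch {

    // Hypothetical per-line document; the field names are assumptions.
    static final class LineDoc {
        final String transcriptId;
        final int lineId;
        final List<String> tags;

        LineDoc(String transcriptId, int lineId, String... tags) {
            this.transcriptId = transcriptId;
            this.lineId = lineId;
            this.tags = Arrays.asList(tags);
        }
    }

    // One pass: every line carrying the given tag (stands in for a single-term search).
    static List<LineDoc> linesWithTag(List<LineDoc> index, String tag) {
        List<LineDoc> hits = new ArrayList<>();
        for (LineDoc doc : index) {
            if (doc.tags.contains(tag)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Pair each tagX line with every tagY line in the same transcript whose
    // line ID is at most `window` lines away. A line carrying both tags
    // pairs with itself (distance 0).
    static List<LineDoc[]> pairsWithin(List<LineDoc> index, String tagX, String tagY, int window) {
        List<LineDoc[]> pairs = new ArrayList<>();
        List<LineDoc> yHits = linesWithTag(index, tagY);
        for (LineDoc x : linesWithTag(index, tagX)) {
            for (LineDoc y : yHits) {
                boolean sameTranscript = x.transcriptId.equals(y.transcriptId);
                if (sameTranscript && Math.abs(x.lineId - y.lineId) <= window) {
                    pairs.add(new LineDoc[] {x, y});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<LineDoc> index = Arrays.asList(
                new LineDoc("t1", 1, "Symptom"),
                new LineDoc("t1", 4, "Brand"),
                new LineDoc("t1", 20, "Brand"),
                new LineDoc("t2", 2, "Brand"));
        for (LineDoc[] pair : pairsWithin(index, "Symptom", "Brand", 5)) {
            System.out.println(pair[0].transcriptId + ": line " + pair[0].lineId
                    + " / line " + pair[1].lineId);
        }
    }
}
```

The cost of this approach is that the second pass grows with the number of first-pass hits,
which is one reason to consider indexing whole transcripts instead.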

If instead (or perhaps in addition, depending on your other needs), each full transcript is
a Lucene document, you can perform the kinds of searches you're talking about with tools available
in Lucene.

I'm thinking of a Lucene document with a "line-tags" field, populated with the tags you've
associated with each line, and with the position of each line tag adjusted so that two tags
assigned to the same line are given the same position (Lucene users sometimes call terms with
the same position "synonyms", because that's the most common use of this capability).

Then you can run a SpanNearQuery over the line-tags field, to return matches where tag X is
within N lines of tag Y.

(See <http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/spans/package-summary.html>
for info on the Lucene Span Query family.)
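
As a rough illustration of the position scheme, the plain-Java sketch below assigns each tag
the number of the line it belongs to and records the increment from the previous tag, which
is the value a custom TokenStream would pass to Token.setPositionIncrement(). It makes one
assumption the description above doesn't spell out: untagged lines still advance the
position, so that the distance between two positions equals the distance in lines. The class
and method names (LineTagPositions, TagTerm, assignPositions, withinLines) are hypothetical,
and withinLines only models the check; it is not how SpanNearQuery is implemented. How N maps
onto the query's slop argument is worth confirming against the Span documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models the contents of a hypothetical "line-tags" field for one transcript.
final class LineTagPositions {

    // A tag term, its position (the 0-based line number), and its increment
    // relative to the previously emitted term. Increment 0 means "same line".
    static final class TagTerm {
        final String tag;
        final int position;
        final int increment;

        TagTerm(String tag, int position, int increment) {
            this.tag = tag;
            this.position = position;
            this.increment = increment;
        }
    }

    // tagsPerLine.get(i) holds the tags of line i; untagged lines are empty lists.
    static List<TagTerm> assignPositions(List<List<String>> tagsPerLine) {
        List<TagTerm> terms = new ArrayList<>();
        int previous = -1; // so a tag on line 0 gets increment 1 and position 0
        for (int line = 0; line < tagsPerLine.size(); line++) {
            for (String tag : tagsPerLine.get(line)) {
                terms.add(new TagTerm(tag, line, line - previous));
                previous = line;
            }
        }
        return terms;
    }

    // True if some tagX term and some tagY term are at most n positions (lines) apart.
    static boolean withinLines(List<TagTerm> terms, String tagX, String tagY, int n) {
        for (TagTerm x : terms) {
            if (!x.tag.equals(tagX)) continue;
            for (TagTerm y : terms) {
                if (y.tag.equals(tagY) && Math.abs(x.position - y.position) <= n) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> transcript = Arrays.asList(
                Arrays.asList("Symptom"),
                new ArrayList<String>(),          // untagged line still advances the position
                Arrays.asList("Brand", "Dosage")); // both tags share position 2
        for (TagTerm t : assignPositions(transcript)) {
            System.out.println(t.tag + " position=" + t.position + " increment=" + t.increment);
        }
    }
}
```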

Steve

On 4/8/2009 at 9:33 AM, Dan Scrima wrote:
> So I have a requirement where I have a directory filled with xml files.
> I wrote a parser to parse these files, and index all of the xml
> attributes and properties into documents. An example of one of these
> documents is below. I'm parsing sentences into words, and tagging the
> sentences based on certain criteria.
> 
> My issue is trying to find out if lucene can handle cross-document
> searching. So below is indexed as a single document... and there will
> be multiple sentences before, after, and throughout an entire
> transcript. Is it possible somehow to say, "I want a result where one
> line marked as Symptom is 5 lines away from another line marked as
> Brand." So in essence, I'm trying to search across multiple lucene
> documents.
> 
> Any thoughts or literature out there?
> 
> <transcript>
>   <line id="1">
>     <tag id="10" type="Symptom" />
>     <tag id="12" type="Brand" />
>     <word>
>       <token>Coughing</token>
>       <part-of-speech>SBJ</part-of-speech>
>     </word>
>     <word>
>       <token>is</token>
>       <part-of-speech>VB</part-of-speech>
>     </word>
>     <word>
>       <token>caused</token>
>       <part-of-speech>NP</part-of-speech>
>     </word>
>     <word>
>       <token>by</token>
>       <part-of-speech>PP</part-of-speech>
>     </word>
>     <word>
>       <token>Mucinex</token>
>       <part-of-speech>PDC</part-of-speech>
>     </word>
>   </line>
> </transcript>



