lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: getting started
Date Sat, 02 Aug 2008 17:45:43 GMT
Well, this could get to be a really ugly query. Let's say you have 10 lines.
Then the
doc would have 10 different fields? ("line1", "line2" etc.)? Then to search
it
you have to have an or clause across all fields. And a file with 100,000
lines would be
a 100,000 term query...... Or I misunderstand you completely.

Calling doc.add with the *same* field (say "text") is a possibility,
especially if you
provide your own tokenizer that returns a large increment gap, say 1000.
This offset
gets added to each call to doc.add on a field. So say you have 10 lines,
each with 5 tokens.
The first token of each line would be at offsets
0, 15, 30, 45...

You have a couple of choices here. Say you can guarantee that no line will
be longer than 100 terms.
Each line could begin on an even 100 offset (assuming you're not indexing
something with many millions
of lines). Now, to find the line you just divide the offset by 100.

Another possibility is to keep a field in the document that correlates
offsets to lines and read that
in when you need to.

It all depends upon what the purpose of needing to keep track of lines. If
it's for a single document,
this kind of thing can work. But if you want line information for all the
hits, it could be too expensive.

The increment gap will play interesting games with Span queries (or slop in
phrase queries). If you need
proximity to span lines, this scheme needs some modification. Say I want
hits when "firstname" is within 10
terms of "lastname". Well, if you have a large increment gap this won't
work.

So it would be a good thing to tell us a bit more about why you want to
distinguish lines to get
better advice <G>.

Best
Erick

On Fri, Aug 1, 2008 at 9:59 AM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh
S) <
nageshblore@gmail.com> wrote:

> Why should each line be a Document ? If there is a single document having
> each line as a Field, then the search would result in a single Document as
> a
> 'hit' not the individual lines matching it. Is this right ?
>
> Nagesh
>
> On Fri, Aug 1, 2008 at 7:21 PM, <roy-lucene-user@xemaps.com> wrote:
>
> > Hello Brittany,
> >
> > I think the easiest thing for you to do is make each line a Document.
>  You
> > might want a FileName and LineNumber field on top of a "Text" field, this
> > way if you need to gather all the lines of your File back together again
> > you
> > can do a search on the FileName.
> >
> > So in your case:
> >
> > Document 1
> >  FileName: [the file]
> >  LineNumber: 1
> >  Text: I like apples
> > Document 2
> >  ...etc
> >
> > Regards,
> > Roy
> >
> > On Fri, Aug 1, 2008 at 9:28 AM, Brittany Jacobs <
> bjacobs@jbmanagement.com
> > >wrote:
> >
> > > Just trying to grasp the concept.
> > >
> > >
> > >
> > > I want to search a text file where each line is a separate item to be
> > > searched.  When text it entered by the user, I want to return all the
> > lines
> > > in which that text appears.
> > >
> > > For example, if the text file has:
> > >
> > > I like apples.
> > >
> > > I went to the store.
> > >
> > > I bought an apple.
> > >
> > >
> > >
> > > If the user searches "apple", I want it to return the first and third
> > > sentences.
> > >
> > >
> > >
> > > Is each sentence a Token?  Is the user input going to be a QueryParser?
> > >  How
> > > should I read in the file so that each line of text is a token to
> search?
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Brittany Jacobs
> > >
> > > Java Developer
> > >
> > > JBManagement, Inc.
> > >
> > > 12 Christopher Way, Suite 103
> > >
> > > Eatontown, NJ 07724
> > >
> > > ph: 732-542-9200 ext. 229
> > >
> > > fax: 732-380-0678
> > >
> > > email:  <mailto:bjacobs@jbmanagement.com> bjacobs@jbmanagement.com
> > >
> > >
> > >
> > >
> >
>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message