lucene-java-user mailing list archives

From "Brittany Jacobs" <bjac...@jbmanagement.com>
Subject RE: getting started
Date Mon, 04 Aug 2008 18:30:33 GMT
Ok, say each line is an address.  So the text file would look like:
123 Water St. Somerville, GA 12345
456 Easy St. Hope, CA 45676
34 Ocean Blvd. Staten Island, NY 93843

The file would have hundreds of thousands of addresses.

So the user would type "34, St" in the search box and press a "Search" button.
In the table below the search box, the first and third records from the
addresses above would be displayed, because each has a "34" somewhere in it
and a "St" somewhere in it.

So the table would show:
123 Water St. Somerville, GA 12345
34 Ocean Blvd. Staten Island, NY 93843

because they match both criteria as pointed out here:
123 Water "St". Somerville, GA 12"34"5
"34" Ocean Blvd. "St"aten Island, NY 93843


Thanks.
Brittany 
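
To pin down the matching rule described above, here is a minimal sketch in plain Java. It is not Lucene code: `AddressFilter`, `parseTerms`, and `search` are hypothetical names, and it assumes the search box text is comma-separated and that matching is case-sensitive substring containment, which is what the "34, St" example implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): a linear scan that spells out
// the "every term appears somewhere in the line" rule.
final class AddressFilter {

    // Splits the search box text on commas and trims each piece,
    // so "34, St" becomes ["34", "St"]. Empty pieces are ignored.
    static List<String> parseTerms(String input) {
        List<String> terms = new ArrayList<>();
        for (String piece : input.split(",")) {
            String term = piece.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Returns, in file order, the lines that contain every term as a
    // case-sensitive substring.
    static List<String> search(List<String> lines, String input) {
        List<String> terms = parseTerms(input);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            boolean matchesAll = true;
            for (String term : terms) {
                if (!line.contains(term)) {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "123 Water St. Somerville, GA 12345",
                "456 Easy St. Hope, CA 45676",
                "34 Ocean Blvd. Staten Island, NY 93843");
        for (String hit : search(lines, "34, St")) {
            System.out.println(hit);
        }
    }
}
```

Note that this is substring matching, not token matching. An index built with a typical word-based analyzer would likely store "12345" as a single token, so a plain term query for "34" would probably not find the first address; getting that behaviour out of an index would need something like wildcard queries or n-gram tokens.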



Well, this could get to be a really ugly query. Let's say you have 10 lines.
Then the doc would have 10 different fields ("line1", "line2", etc.)? Then to
search it you have to have an OR clause across all fields. And a file with
100,000 lines would be a 100,000 term query... Or I misunderstand you
completely.

Calling doc.add with the *same* field (say "text") is a possibility,
especially if you provide your own analyzer whose getPositionIncrementGap
returns a large value, say 1000. This gap gets added to the token position
on each subsequent call to doc.add for that field. So say you have 10 lines,
each with 5 tokens. The first token of each line would be at positions
0, 1005, 2010, 3015...
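
The position arithmetic can be sketched in plain Java. This is only an illustration, not Lucene code: `PositionGapSketch` and `firstPositions` are hypothetical names, and the sketch assumes every token advances the position by exactly one and that the gap is added once before the first token of each value after the first.

```java
import java.util.Arrays;

// Hypothetical illustration (not Lucene code) of where each line's first
// token lands when the same field is added repeatedly with a position gap.
final class PositionGapSketch {

    // Assumes each token advances the position by 1 and that `gap` is added
    // once between consecutive values of the field.
    static int[] firstPositions(int values, int tokensPerValue, int gap) {
        int[] starts = new int[values];
        int position = 0;
        for (int i = 0; i < values; i++) {
            if (i > 0) {
                position += gap;        // gap between consecutive values
            }
            starts[i] = position;       // first token of this value
            position += tokensPerValue; // one position per token
        }
        return starts;
    }

    public static void main(String[] args) {
        // 5 tokens per line, gap of 10
        System.out.println(Arrays.toString(firstPositions(4, 5, 10)));   // [0, 15, 30, 45]
        // 5 tokens per line, gap of 1000
        System.out.println(Arrays.toString(firstPositions(4, 5, 1000))); // [0, 1005, 2010, 3015]
    }
}
```

Under these assumptions, the sequence 0, 15, 30, 45 corresponds to a gap of 10, while a gap of 1000 gives 0, 1005, 2010, 3015.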

You have a couple of choices here. Say you can guarantee that no line will
be longer than 100 terms. Each line could then begin at a position that is
an even multiple of 100 (assuming you're not indexing something with many
millions of lines). Now, to find the line, you just divide the position by
100.
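
A minimal sketch of that fixed-slot scheme in plain Java follows. The names `LineSlots`, `startOf`, and `lineOf` are hypothetical, and lines are numbered from 0 here. One possible reading of the "many millions of lines" caveat is that term positions are ints, so `line * 100` runs out of range at roughly 21 million lines; the sketch makes that limit explicit.

```java
// Hypothetical illustration (not Lucene code): every line owns a fixed slot
// of 100 positions, so the line number can be recovered by integer division.
final class LineSlots {

    static final int SLOT = 100; // assumes no line has more than 100 terms

    // Position of the first term of a 0-based line.
    // Math.multiplyExact throws ArithmeticException on int overflow,
    // which happens once line exceeds Integer.MAX_VALUE / SLOT.
    static int startOf(int line) {
        return Math.multiplyExact(line, SLOT);
    }

    // 0-based line that a (non-negative) term position belongs to.
    static int lineOf(int position) {
        return position / SLOT;
    }

    public static void main(String[] args) {
        System.out.println(startOf(2));  // 200
        System.out.println(lineOf(257)); // 2
    }
}
```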

Another possibility is to keep a field in the document that correlates
offsets to lines and read that in when you need to.

It all depends on why you need to keep track of lines. If it's for a single
document, this kind of thing can work. But if you want line information for
all the hits, it could be too expensive.

The increment gap will play interesting games with Span queries (or slop in
phrase queries). If you need proximity to span lines, this scheme needs some
modification. Say I want hits when "firstname" is within 10 terms of
"lastname". Well, if you have a large increment gap, this won't work when
the two terms fall on different lines.

So it would be a good thing to tell us a bit more about why you want to
distinguish lines to get better advice <G>.

Best
Erick

On Fri, Aug 1, 2008 at 9:59 AM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S) <nageshblore@gmail.com> wrote:

> Why should each line be a Document? If there is a single document having
> each line as a Field, then the search would result in a single Document as
> a 'hit', not the individual lines matching it. Is this right?
>
> Nagesh
>
> On Fri, Aug 1, 2008 at 7:21 PM, <roy-lucene-user@xemaps.com> wrote:
>
> > Hello Brittany,
> >
> > I think the easiest thing for you to do is make each line a Document.
> > You might want a FileName and LineNumber field on top of a "Text" field;
> > this way, if you need to gather all the lines of your File back together
> > again, you can do a search on the FileName.
> >
> > So in your case:
> >
> > Document 1
> >  FileName: [the file]
> >  LineNumber: 1
> >  Text: I like apples
> > Document 2
> >  ...etc
> >
> > Regards,
> > Roy
> >
> > On Fri, Aug 1, 2008 at 9:28 AM, Brittany Jacobs <bjacobs@jbmanagement.com> wrote:
> >
> > > Just trying to grasp the concept.
> > >
> > >
> > >
> > > I want to search a text file where each line is a separate item to be
> > > searched.  When text is entered by the user, I want to return all the
> > > lines in which that text appears.
> > >
> > > For example, if the text file has:
> > >
> > > I like apples.
> > >
> > > I went to the store.
> > >
> > > I bought an apple.
> > >
> > >
> > >
> > > If the user searches "apple", I want it to return the first and third
> > > sentences.
> > >
> > >
> > >
> > > Is each sentence a Token?  Is the user input going to be a QueryParser?
> > > How should I read in the file so that each line of text is a token to
> > > search?
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Brittany Jacobs
> > >
> > > Java Developer
> > >
> > > JBManagement, Inc.
> > >
> > > 12 Christopher Way, Suite 103
> > >
> > > Eatontown, NJ 07724
> > >
> > > ph: 732-542-9200 ext. 229
> > >
> > > fax: 732-380-0678
> > >
> > > email: bjacobs@jbmanagement.com
> > >
> > >
> > >
> > >
> >
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

