Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: error (idunn.apache.osuosl.org: domain isoco.com from
 62.81.148.13 cause and error)
From: "mcarcelen" <mcarcelen@isoco.com>
To: <java-user@lucene.apache.org>
Subject: RE: Queries in Lucene
Date: Wed, 13 Sep 2006 14:57:00 +0200
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <359a92830609130520o4517a0abx9badfa30dcb94b0b@mail.gmail.com>
Thread-Index: AcbXLwOSSbKjSQ8wRkuSeEqaO70NtAAA7ehg
Message-Id: <20060913125703.AB8D54F5A2@mad.isoco.net>

Thank you very much.

Yes, I=B4m very new to Lucene. I=B4m sorry

With the help of Lucene we want to classify 724.827 legal files that in =
the
first line contained the word "Auto" or "Providencia". We can to =
separate in
two groups. That=B4s why I=B4ve indexed these files with Lucene before, =
and we
thought that we could reused the index and apply a special query for =
that

Thanks for your help.
Best regards
Teresa


=20


-----Mensaje original-----
De: Erick Erickson [mailto:erickerickson@gmail.com]=20
Enviado el: mi=E9rcoles, 13 de septiembre de 2006 14:20
Para: java-user@lucene.apache.org
Asunto: Re: Queries in Lucene

I'm assuming that you're new to Lucene, so if you're an old pro you =
probably
already know all this....

I think you'll have difficulty here. Lucene has no concept of lines, =
just
tokens and offsets. So here are a couple of suggestions off the top  of =
my
head...

If the first line is the *only* way you want to restrict this, index the
tokens in the first line in a separate field for each document, and =
search
on that field (call it "firstline" <G>). Obviously, this won't work for
searching lines 2-n.

If you're going to want to ask if terms are in line 2, 3, 4..., you =
could
bump your term position at the start of each line by, say, 500 and then =
do
some fancy dancing with TermPositions to get terms from a particular =
line.
This is going to be complicated though to get right, especially when you
want to do arbitrary boolean queries.

You could creatively index things. Index a document with fields line1,
line2, line3, line4...., and when you wanted to search in a particular =
line,
form your query with a field corresponding to the correct line. You =
could
even index the full text of the document in a "fulltext" field if you =
wanted
to search over an entire document.

There are space tradeoffs to all this, so be sure you understand
Field.Store.YES and NO as they apply to your problem, and what effect
analyzers have on your indexing AND search streams. Lots of people are
confused by this issue.

If you haven't already, get a copy of Luke so you can poke around at =
your
index. Google luke lucene and it'll pop right up.

Before diving into this as stated, is there a way to re-think the =
problem to
make it easier? What question are you *really* trying to answer by =
asking
whether certain tokens are in a particular line?

Best
Erick


On 9/13/06, mcarcelen <mcarcelen@isoco.com> wrote:
>
>
> Hi all,
> I=B4ve got a index and now I=B4m trying to create a query with =
lucene-2.0.0,
> I=B4d like to find files that in the first line get the following:
>
> <DIV class=3DMy-Word1> AND Word2
>
> I=B4m tried with the package org.apache.lucene.demo.SearchFiles
> but I get files where the word "Word2" is not in the first line.
>
> I don=B4t know how to do the query filtered or if I have to use =
another file
>
> Can anyone help me?
>
> Thanks
>
> Best Regards
> Teresa
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org