lucene-java-user mailing list archives

From Sachin Kulkarni <kulk...@hawk.iit.edu>
Subject Re: Can lucene index tokenized files?
Date Thu, 25 Sep 2014 20:21:20 GMT
Hi,

I'd like to follow up on this question.
I was able to index the tokenized data into Lucene as Uwe and Erick
suggested, and I see a bunch of files created in my index folder.
But as an initial test, when I run queries against the index, no
documents match.
I have a relevance mapping file which tells me which queries should
match which documents.

So my question is: does Lucene need additional information about the
tokens at index time in order to perform the search?
And do I have to process the queries in some way before they can be
searched against the index I have created?
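
A quick way to localize a "no hits" problem like this is to query the
index directly with a TermQuery, bypassing query parsing entirely. A
minimal sketch against the Lucene 4.x API of the time; the field name
"contents" and the sample term "comput" are placeholders, and the term
must match an indexed token exactly, i.e. already stemmed and
stopworded like the tokens in the files:

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    // The Term must name the field used at index time and carry an exact
    // indexed token (already stemmed/stopworded, typically lowercased).
    TopDocs hits = searcher.search(new TermQuery(new Term("contents", "comput")), 10);
    System.out.println("total hits: " + hits.totalHits);
    reader.close();

If a TermQuery on a token known to be in some document returns nothing,
the problem is on the indexing side; if it returns hits, the queries are
not being analyzed to match the stored tokens.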

Thank you in advance.

Regards,
Sachin

On Mon, Sep 15, 2014 at 4:36 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
wrote:

> Hi Erick,
>
> Thank you.
>
> Yes, the data is in text form with space-delimited tokens.
> The queries are categories that the documents belong to.
> They are regular text files and will need to be transformed at my end.
>
> Regards,
> Sachin
>
> On Mon, Sep 15, 2014 at 12:31 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> How are they delimited? If they're just a text stream, it seems
>> all you need is a whitespace tokenizer. Won't that work?
>>
>> How are you going to search them though? Is your query submission
>> process going to _also_ do the transformations or will you have
>> to construct a query-time analysis chain that mimics the pre-tokenization
>> you have at index time?
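
If the pre-tokenization is whitespace-only, one way to keep the two
chains consistent is to use a WhitespaceAnalyzer at query time as well.
A minimal sketch against the Lucene 4.x API; the field name "contents"
and the query text are placeholders, and any stemming or stopwording
already baked into the files would have to be applied to the raw query
text before parsing:

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // Same analysis as index time: split on whitespace, nothing else.
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_4_9);
    QueryParser parser = new QueryParser(Version.LUCENE_4_9, "contents", analyzer);
    // "comput scienc" stands in for query text that has already been
    // stemmed/stopworded the same way as the indexed files.
    Query q = parser.parse("comput scienc");  // throws ParseException

The point is that whatever happened to the documents before indexing
must also happen to the queries, or the terms will never line up.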
>>
>> Best,
>> Erick
>>
>> On Sun, Sep 14, 2014 at 8:34 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>> wrote:
>> > Hi Uwe,
>> >
>> > Thank you.
>> > I do not have the tokens serialized, so that reduces one step.
>> > I am reading the javadocs and will try it the way you mentioned.
>> >
>> > Regards,
>> > Sachin
>> >
>> > On Sun, Sep 14, 2014 at 5:11 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> >
>> >> Hi,
>> >>
>> >> If you have the serialized tokens in a file, you can write a custom
>> >> TokenStream that deserializes them and feeds them to IndexWriter as a
>> >> Field instance in a Document instance. Please read the javadocs on how
>> >> to write your own TokenStream implementation and pass it using
>> >> "new TextField(name, yourTokenStream)".
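
A minimal sketch of such a TokenStream against the Lucene 4.x API,
assuming the tokens of one document have already been read from the
file; the class name PreTokenizedStream, the field name "contents", and
the variables writer and line are illustrative, not part of the thread:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;

    // Replays tokens that were produced outside Lucene, one per call.
    final class PreTokenizedStream extends TokenStream {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final String[] tokens;  // already tokenized/stemmed/stopworded
      private int pos = 0;

      PreTokenizedStream(String[] tokens) { this.tokens = tokens; }

      @Override
      public boolean incrementToken() throws IOException {
        if (pos >= tokens.length) return false;    // stream exhausted
        clearAttributes();
        termAtt.setEmpty().append(tokens[pos++]);  // emit next stored token
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pos = 0;
      }
    }

    // Usage: a field built from a TokenStream is indexed but not stored.
    Document doc = new Document();
    doc.add(new TextField("contents", new PreTokenizedStream(line.trim().split("\\s+"))));
    writer.addDocument(doc);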
>> >>
>> >> Uwe
>> >>
>> >> -----
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: uwe@thetaphi.de
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Sachin Kulkarni [mailto:kulksac@hawk.iit.edu]
>> >> > Sent: Sunday, September 14, 2014 10:06 PM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: Can lucene index tokenized files?
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a dataset whose files are in the form of tokens: the original
>> >> > data has been tokenized, stemmed, and stopworded.
>> >> >
>> >> > Is it possible to skip the Lucene analyzers and index this dataset in
>> >> > Lucene?
>> >> >
>> >> > So far, the dataset I have dealt with was raw and used Lucene's
>> >> > tokenization and stemming schemes.
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Regards,
>> >> > Sachin
>> >>
>
