lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ghinwa Choueiter <ghi...@csail.mit.edu>
Subject RE: How to index word-pairs and phrases
Date Tue, 19 Feb 2008 16:28:25 GMT
What about
\contrib\analyzers\src\java\org\apache\lucene\analysis\ngram  ??

Does this tokenizer do what I need?

thank you,
-Ghinwa

On Tue, 19 Feb 2008, Steven A Rowe wrote:

> Mark,
>
> The ShingleFilter contrib has not been committed yet - it's still here:
>
>   https://issues.apache.org/jira/browse/LUCENE-400
>
> Steve
>
> On 02/19/2008 at 2:33 AM, markharw00d wrote:
>> Further to Grant's useful background - there is an analyzer specifically
>> for multi-word terms in "contrib". See
>> Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
>>
>> Cheers
>> Mark
>>> Hi Ghinwa,
>>>
>>> A Term is simply a unit of tokenization that has been indexed for a
>>> Field, produced by a TokenStream.   In the demo, on the main site,
>>> this can be seen in the file called IndexFiles.java on line 56:
>>> IndexWriter writer = new IndexWriter(INDEX_DIR, new
>>> StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
>>>
>>> The key being that the StandardAnalyzer is used to get a TokenStream.
>>> The TokenStream.next() method returns a Token.  All a Term is, is a
>>> Token that has been indexed for a given Field.  In a nutshell, though,
>>> a Term is whatever you want it to be, based on your Analyzer.  It could
>>> be a whole document as a single string or it could be a string
>>> containing a single character.  In reality, it usually corresponds to a
>>> single word.
>>>
>>> Have a look at the wiki, also, as there are many talks and articles
>>> explaining all of this.  Lucene In Action, while outdated, is also an
>>> excellent book for explaining this stuff (with the exception that some
>>> of the code examples no longer work on the latest Lucene version)
>>>
>>> Getting beyond that, you will need to look into adding spell checking
>>> (in Lucene's "contrib" area) and other features like stemming and
>>> synonyms to handle the various issues you bring up.
>>>
>>> Hope that helps,
>>> Grant
>>>
>>> On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Lucene and have been reading the documentation. I would
>>>> like to use Lucene to query a song database by lyrics. The query could
>>>> potentially contain typos, or even wrong words, word contractions
>>>> (can't versus cannot), etc..
>>>>
>>>> I would like to create an inverted list by word pairs and possibly
>>>> phrases and not just by isolated words. For example:
>>>> <w1,w2>   < d1, d10, d27>
>>>> <w2,w3>   <d2, d13>
>>>> ...
>>>>
>>>> OR even
>>>> <phrase 1> <d1, d3,...>
>>>> <phrase 2> <...>
>>>> ...
>>>>
>>>> It seems to me that, by default, the index in Lucene stores statistics
>>>> for isolated words. The Lucene documentation refers to the word "Term"
>>>> all the time and seems to imply that "Term" can be a word or a phrase,
>>>> but I can't see how IndexWriter can read a document and index it by
>>>> word pairs.
>>>>
>>>> thank you in advance for the answers and my apologies if I did not
>>>> get the terminology quite right.
>>>>
>>>> -Ghinwa
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>> http://www.lucenebootcamp.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>>> additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>>
>>
>> --------------------------------------------------------------------- To
>> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>> additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message