lucene-java-user mailing list archives

From Mark Wilson <...@sanger.ac.uk>
Subject Is there a limit for a field size in Lucene 3.0.2
Date Thu, 21 Feb 2013 16:59:17 GMT
I am having an issue with an old Search Application we are using.

We have a search app (using Lucene 3.0.2) that queries an index generated by
Nutch 1.3. One really long page (approx. 124 KB) is crawled and inserted into
the index, but when I search for it (using a web-based application based on
Lucene 3.0.2), only terms from the top ~20% of the page content return
results.

If I open the index in Luke 1.0.1, I can see the full contents of the field,
but if I search for a term that I know is in there and that is not in the
top ~20% of the page, the search comes back empty.

So my question is: is there a size limit for a field in Lucene 3.0.2?

Regards Mark


On 21/02/2013 15:14, "Dyer, James" <James.Dyer@ingramcontent.com> wrote:

> Samuel,
> 
> Do you think you could write a failing unit test and open a JIRA issue?  Or at
> the least open a JIRA issue with all the details without a test?
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -----Original Message-----
> From: Samuel García Martínez [mailto:samuelgmartinez@gmail.com]
> Sent: Thursday, February 21, 2013 2:33 AM
> To: java-user@lucene.apache.org
> Subject: Re: possible bug on Spellchecker
> Importance: Low
> 
> I'm using Solr 3.6, and DirectSpellChecker is only available in v4+.
> Moreover, for "big" indexes I prefer using a sidekick index rather than
> iterating over the term dictionary.
> 
> 
> On Thu, Feb 21, 2013 at 8:19 AM, Jack Krupansky
> <jack@basetechnology.com> wrote:
> 
>> Any reason that you are not using the DirectSpellChecker?
>> 
>> See:
>> http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Samuel García Martínez
>> Sent: Wednesday, February 20, 2013 3:34 PM
>> To: java-user@lucene.apache.org
>> Subject: possible bug on Spellchecker
>> 
>> 
>> Hi all,
>> 
>> While debugging the Solr spellchecker (IndexBasedSpellChecker, which
>> delegates to the Lucene SpellChecker), I think I found a bug when the input
>> is a 6-letter word:
>>  - george
>>  - anthem
>>  - argued
>>  - fluent
>> 
>> Because of getMin() and getMax(), the grams indexed for these terms have
>> lengths 3 and 4. So the fields would be something like this:
>>  - for "*george*"
>> 
>>     - start3: "geo"
>>     - start4: "geor"
>>     - end3: "rge"
>>     - end4: "orge"
>>     - 3: "geo", "eor", "org", "rge"
>>     - 4: "geor", "eorg", "orge"
>>  - for "*anthem*"
>> 
>>     - start3: "ant"
>>     - start4: "anth"
>>     - end3: "hem"
>>     - end4: "them"
>>     - 3: "ant", "nth", "the", "hem"
>>     - 4: "anth", "nthe", "them"
>> 
>> The problem shows up when the user swaps the 3rd and 4th characters,
>> misspelling the word like this:
>>  - geroge
>>  - anhtem
>> 
>> The queries generated for these terms are: (SHOULD boolean queries)
>> - for "*geroge*"
>> 
>>  - start3: "ger"
>>  - start4: "gero"
>>  - end3: "oge"
>>  - end4: "roge"
>>  - 3: "ger", "ero", "rog", "oge"
>>  - 4: "gero", "erog", "roge"
>> - for "*anhtem*"
>> 
>>  - start3: "anh"
>>  - start4: "anht"
>>  - end3: "tem"
>>  - end4: "htem"
>>  - 3: "anh", "nht", "hte", "tem"
>>  - 4: "anht", "nhte", "htem"
>> 
>> So, as you can see, this kind of misspelling never matches the suitable
>> suggestions, even though the similarity score is 0.95555556.
>> 
>> I think getMin(int l) and getMax(int l) should return 2 and 3,
>> respectively, for l==6. Debugging other values, I did not find any problem
>> with any kind of misspelling.
>> 
>> Any thoughts about this?
>> 
>> --
>> Un saludo,
>> Samuel García
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
> 
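
To make the quoted spellchecker report concrete, here is a minimal Java sketch that rebuilds the start/end/inner grams listed above and checks which ones a word and its misspelling have in common. It is not Lucene's SpellChecker code: the class name GramOverlapSketch, the helper names, and the "startN:"/"endN:"/"gramN:" key prefixes are hypothetical, and the gram lengths (3 and 4, versus the proposed 2 and 3) are taken from the report rather than from the library source.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class GramOverlapSketch {

    // Collect start, end, and inner n-grams of a word for every n in [min, max].
    // Each gram is prefixed with a field-like key so "start3:geo" and "gram3:geo" stay distinct.
    static Set<String> grams(String word, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        int len = word.length();
        for (int n = min; n <= max && n <= len; n++) {
            out.add("start" + n + ":" + word.substring(0, n));
            out.add("end" + n + ":" + word.substring(len - n));
            for (int i = 0; i + n <= len; i++) {
                out.add("gram" + n + ":" + word.substring(i, i + n));
            }
        }
        return out;
    }

    // Grams that the correct word and the misspelling have in common.
    static Set<String> shared(String word, String misspelled, int min, int max) {
        Set<String> common = grams(word, min, max);
        common.retainAll(grams(misspelled, min, max));
        return common;
    }

    public static void main(String[] args) {
        // Gram lengths 3 and 4, as described in the report: nothing in common.
        System.out.println(shared("george", "geroge", 3, 4)); // prints []
        System.out.println(shared("anthem", "anhtem", 3, 4)); // prints []
        // Gram lengths 2 and 3, as proposed in the report: some grams now match.
        System.out.println(shared("george", "geroge", 2, 3)); // prints [start2:ge, end2:ge, gram2:ge]
    }
}
```

With lengths 3 and 4, swapping the 3rd and 4th characters of a six-letter word touches every gram, so a query made only of optional (SHOULD) clauses has nothing to match. With lengths 2 and 3, the leading and trailing 2-grams survive the swap.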



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


