lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Storing special characters in Lucene
Date Thu, 21 Aug 2008 22:30:23 GMT
Here's a unit test:
import junit.framework.TestCase;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;


public class SpanishTest extends TestCase {

   public void testSpanish() throws Exception {
     RAMDirectory directory = new RAMDirectory();
     String content = "niños";
     IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
     Document document = new Document();
     document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
     SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer("Spanish");
     writer.addDocument(document, snowballAnalyzer);
     writer.close();

     IndexSearcher searcher = new IndexSearcher(directory);
     QueryParser parser = new QueryParser("name", snowballAnalyzer);
     Query query = parser.parse(content);
     System.out.println("Query: " + query);
     Hits hits = searcher.search(query);
     assertTrue("hits Size: " + hits.length() + " is not: " + 1, hits.length() == 1);
     Document theDoc = hits.doc(0);
     String nombre = theDoc.get("name");
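     System.out.println("Nombre: " + nombre);
     // The stored value should come back exactly as it went in.
     assertEquals("stored value should round-trip unchanged", content, nombre);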
   }
}


When I run this in IntelliJ, I get:

Query: name:niñ
Nombre: niños

Process finished with exit code 0


Are you by chance indexing XML?
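
The reason for asking: a U+FFFD replacement character in a stored field often means the text was already damaged before it reached Lucene, for example because bytes in one encoding were decoded as another while reading a file or an XML stream. Here is a minimal sketch of that effect using only the JDK. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It is one possible cause, not a diagnosis of your setup.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: encode as ISO-8859-1, then decode as UTF-8,
    // mimicking input that is read with the wrong encoding.
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // A lone 0xF1 byte is not valid UTF-8, so the decoder substitutes U+FFFD.
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: encode and decode with the same charset.
    static String roundTripUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escape for the n-with-tilde, so the result does not depend
        // on the encoding the compiler assumes for this source file.
        String content = "ni\u00f1os";

        String garbled = decodeLatin1BytesAsUtf8(content);
        System.out.println("contains U+FFFD: " + (garbled.indexOf('\uFFFD') >= 0));
        System.out.println("round trip intact: " + content.equals(roundTripUtf8(content)));
    }
}
```

If the string is correct in memory before it goes into the Field (as it is in the unit test above), the places worth checking are the reader that produced it and the encoding javac assumes for source files containing literals like "niños".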



On Aug 21, 2008, at 1:16 PM, Juan Pablo Morales wrote:

> I have an index in Spanish and I use Snowball to stem and analyze, and it
> works perfectly. However, I am running into trouble storing (not indexing,
> only storing) words that have special characters.
>
> That is, I store the special character, but it comes out garbled when
> I read it back.
> To provide an example:
>
> String content = "niños";
> document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
> writer.addDocument(document, new SnowballAnalyzer("Spanish"));
>
> When I read the field back:
> String nombre = doc.get("name");
>
> Then nombre will contain "ni�os"
>
> Looking at the index with Luke, it shows me "ni&#65533;os", but when I want to
> see the full text (by right-clicking) it shows me ni�os.
>
> I know Lucene is supposed to store fields in UTF-8, but then how can I make
> sure I store something and get it back just as it was, including special
> characters?
>
> Thanks
> -- 
> Juan Pablo Morales
> Ingenian Software ltda
> Bogotá, Colombia

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







