lucene-solr-user mailing list archives

From Adam Estrada <estrada.adam.gro...@gmail.com>
Subject [Free Text] Field Tokenizing
Date Thu, 09 Jun 2011 14:56:36 GMT
All,

I am at a bit of a loss here, so any help would be greatly appreciated. I am
using the DIH to pull data from a DB. The field I am most interested in holds
anywhere from one word to several paragraphs of free text. What I would
really like to do is pull out phrases like "Joe's coffee shop" rather than
the three individual words. I have tried the KeywordTokenizerFactory, and it
does keep phrases together, but only because it is not actually tokenizing
anything: the whole field value comes through as a single token. So it is
not creating the tokens that I need for further analysis in apps like Mahout.

We can play with combinations of tokenizers and filters all day long and
see what the results are after a quick reindex. I typically just view them
in Solritas as facets, which may be part of the problem for me too. Does
anyone have an example fieldType they can share with me that shows how to
extract phrases, where they exist, from the data I described above? Am I
even going about this the right way? I am using today's trunk build of Solr,
and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
            outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
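
One variation that may be worth trying, offered as an untested sketch rather
than a known-good config: because KeywordTokenizerFactory emits the whole
field value as one token, ShingleFilterFactory has nothing to combine.
Swapping in a word-level tokenizer should give it individual words to build
phrases from. The name text_shingle is hypothetical, and the chain is cut
down to the minimum on purpose; stop words, stemming, and folding are left
out and would need to be re-added and checked on the Solr admin analysis
page.

```xml
<!-- Hypothetical fieldType name; an assumption for illustration only. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strip markup before tokenizing, as in the config above. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Split the text into word tokens instead of one token per field. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Combine adjacent words into shingles of up to 4 words,
         keeping the single words as well. -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

With maxShingleSize="4" and outputUnigrams="true", this should index single
words alongside shingles of two to four adjacent words, so a lowercased form
of a phrase like "Joe's coffee shop" would be expected to appear as one token
among the others.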

Thanks,
Adam
