lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sengly Heng" <sengly.h...@gmail.com>
Subject Re: TF-IDF API
Date Thu, 29 Mar 2007 03:40:07 GMT
Thank you very much for your time. Here is a sample of a vector of terms :

v1 = {"sad", "john", "intelligent", "news", "USA", "disneyland", "MIT",
"cambridge", "marry", ...}

I'll try out your method.

Best regards,

Sengly



On 3/28/07, karl wettin <karl.wettin@gmail.com> wrote:
>
>
> 28 mar 2007 kl. 15.24 skrev Sengly Heng:
>
> > Thank you but I still have have no clue of how to do that by using
> > Weka
> > after taking a look at its API. Let me reformulate my problem :
> >
> > I have a collection of vector of terms (actually each vector of terms
> > represents the list of tokens extracted from a file) and I do not
> > have the
> > original files. I would like to calculate TF as well as TFIDF of
> > each term
> > and sorted them by these value respectively. As suggested by Grant
> > Ingersoll, I could index those vectors of terms again using Lucene
> > and then
> > use its API to measure TF and TFIDF. However I guess there should be a
> > simpler way or API just fit-in this case.
>
> To my knowledge there is no thing in Lucene that makes it simpler for
> you than what Grant suggests. And according to me, Weka really must
> be the simplest way around. However, perhaps you should supply us
> with an example of what these vectors look like. That might change
> everything. Perhaps we are talking of completely different things here.
>
> Let me reformulate my suggestion:
>
> 1. rebuild your vector to a string.
> 2. put the data in a file called myvectors.arff:
>
> @relation termvectors
> @attribute the_vector string
> @data
> "first term vector as a string"
> "second term vector as a string"
>
> 3. open the file in the weka explorer application.
> 4. select filter/unsupervised/attribute/string to word vector
> 5. set your preferences of normalization, et c.
> 6. apply the filter.
> 7. save the output.
>
> All this can be done progamatically too, with only a few lines of code.
>
> >
> > Thanks once again everyone.
> >
> > Best regards,
> >
> > Sengly
> >
> >
> > On 3/28/07, karl wettin <karl.wettin@gmail.com> wrote:
> >>
> >>
> >> 28 mar 2007 kl. 10.36 skrev Sengly Heng:
> >>
> >> > Does anyone of you know any Java API that directly handle this
> >> > problem?
> >> > or I have to implement from scratch.
> >>
> >> You can also try
> >> weka.filters.unsupervised.attribute.StringToWordVector, it has many
> >> neat features you might be interested in. And if applicable to what
> >> you attempt to do, the feature selection algorithms of the same
> >> project (Weka) does a great job reducing the data set.
> >>
> >> http://www.cs.waikato.ac.nz/ml/weka/
> >>
> >> It is GPL.
> >>
> >> --
> >> karl
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message