lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: term vectors
Date Wed, 15 Nov 2006 16:32:49 GMT
Why do you think you need term frequencies in the first place? What is it
that you're trying to do that just searching wouldn't accomplish?

I've often jumped into the middle of something and made it waaaaay too
complex, so I'm asking to see if you're doing something similar <G>.

Lucene has no requirements that each document have identical fields. So you
could simply index a varying number of fields with each document,
corresponding to the attributes of your files. After a search, you could
determine what you needed to by which docs had which attributes (fields). It
seems to me that if you form your query appropriately, the search results
*are* the results you want and this kind of analysis wouldn't be necessary.
Since your search would only return documents that had the fields you
specify and the values you want in those fields (attributes).

But that may just reflect that I don't understand the problem you're trying
to solve very well <G>...

Best
Erick

On 11/15/06, Phil Rosen <prosen@optaros.com> wrote:
>
> Hello,
>
>
>
> Thanks in advance for your help, I am really stumped I feel.
>
>
>
> I am building an application that requires I index a set of documents on
> the scale of hundreds of thousands.
>
>
>
> A document can have a varying number of attribute fields with an unknown
> set of potential values. I realize that just indexing a blob of fields
> would be much faster, however I need to bin the search results based on
> common attributes; as different types of attributes could potentially have
> overlapping values a single blob for all attributes wont work.
>
>
>
> My question is this, is there a way to get term frequencies for a set of
> documents or hits, without using getTermFreqVector() on each document and
> each attribute field? As I could have hundreds of results, each with
> dozens of attribute fields, looping getTermFreqVector() would be very
> slow. If there isn't something inherent to lucene, has anyone seen an
> extension that could accomplish this?
>
>
>
> Thanks!
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message