lucene-java-user mailing list archives

From Sandeep Khanzode <sandeep_khanz...@yahoo.com.INVALID>
Subject Re: Facets in Lucene 4.7.2
Date Mon, 16 Jun 2014 13:57:14 GMT
Correction on [4] below. I do get doc/pos/tim/tip/dvd/dvm files in either case. What I meant
was that the number of those files appears to differ between the two cases. Also, does commit()
stop the world and flush the contents serially?
 
-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode


On Monday, June 16, 2014 7:10 PM, Sandeep Khanzode <sandeep_khanzode@yahoo.com.INVALID>
wrote:
 


Hi Shai,

Thanks for the response. Appreciated! I understand that this particular use case has to be
handled in a different way.

Can you please help me with the below questions? 

1.] Is there any API that gives me the count of a specific dimension from FacetsCollector in
response to a search query? Currently, I call getTopChildren() with some value and then
check the FacetResult object for the actual number of dimensions hit along with their occurrences.
Also, getSpecificValue() does not work without a path argument.

2.] Can I find the MAX or MIN value of a Numeric type field written to the index?

3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I could determine
that ES does search-time faceting and returns the response dynamically, without any prior faceting
at indexing time. If index-time lag is not my concern, can I assume that, in general,
Lucene facets would be faster performance-wise?

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not use IndexWriter.commit(),
I get standard files like cfe/cfs/si in the index directory. However, if I do use the commit(),
then as I understand it, the state is persisted to the disk. But this time, there are additional
file extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this difference and
its cause. 

5.] Does the RAMBufferSizeMB() control the commit intervals, so that when the limit is reached across
all writing threads, the contents are flushed to disk periodically?

Appreciate your response to the above queries. Thanks again,

 
-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode



On Sunday, June 15, 2014 10:40 AM, Shai Erera <serera@gmail.com> wrote:



Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field types Lucene currently
offers are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.
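
As a rough illustration of the two points above, here is a minimal plain-Java
sketch. It is not Lucene code and not necessarily the approach intended in this
message: the Marker enum, the bitmask encoding (one bit per marker packed into
a long) and the MarkerCounts class are all assumptions made for the example,
and a long[] stands in for the per-document NumericDocValues values. In a real
index the counting loop would live in a custom Facets implementation and the
matching loop in a Filter.

```java
import java.util.BitSet;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class MarkerCounts {
    // Hypothetical fixed marker list, named after the examples in this thread.
    enum Marker { HISTORICALLY_INTERESTING, FINANCIALLY_RELEVANT, MEDIA, AGRICULTURE }

    // Pack a set of markers into one long: bit i is set when the marker
    // with ordinal i is present. This encoding supports up to 64 markers.
    static long encode(Set<Marker> markers) {
        long bits = 0L;
        for (Marker m : markers) {
            bits |= 1L << m.ordinal();
        }
        return bits;
    }

    static boolean hasMarker(long bits, Marker m) {
        return (bits & (1L << m.ordinal())) != 0;
    }

    // "Count on this field": number of documents carrying each marker.
    // perDocValues[doc] stands in for the numeric doc value of document doc.
    static Map<Marker, Integer> count(long[] perDocValues) {
        Map<Marker, Integer> counts = new EnumMap<>(Marker.class);
        for (Marker m : Marker.values()) {
            counts.put(m, 0);
        }
        for (long bits : perDocValues) {
            for (Marker m : Marker.values()) {
                if (hasMarker(bits, m)) {
                    counts.merge(m, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Filter queries": the set of document ids that carry one marker,
    // playing the role a DocIdSet would play in a Filter.
    static BitSet matching(long[] perDocValues, Marker m) {
        BitSet docs = new BitSet(perDocValues.length);
        for (int doc = 0; doc < perDocValues.length; doc++) {
            if (hasMarker(perDocValues[doc], m)) {
                docs.set(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        long[] values = {
            encode(EnumSet.of(Marker.FINANCIALLY_RELEVANT)),
            encode(EnumSet.of(Marker.FINANCIALLY_RELEVANT, Marker.AGRICULTURE)),
            encode(EnumSet.noneOf(Marker.class))
        };
        System.out.println(count(values));
        System.out.println(matching(values, Marker.FINANCIALLY_RELEVANT));
    }
}
```

With this encoding, changing a document's markers means writing one new long
for that document, which is the kind of change a field-level numeric update is
meant for. If each document can carry only one marker, storing the marker's
ordinal directly would be simpler than a bitmask.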

It's not the best approach; updatable Terms would better suit your use case.
However, we don't offer them yet, and it will be a while until we do (IF
we do). You should also benchmark that approach vs re-indexing the
documents, since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

Shai



On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode <
sandeep_khanzode@yahoo.com.invalid> wrote:

> Hi Shai,
>
> Thanks so much for the clear explanation.
>
> I agree on the first question. Taxonomy Writer with a separate index would
> probably be my approach too.
>
> For the second question:
> I am a little new to the Facets API so I will try to figure out the
> approach that you outlined below.
>
> However, the scenario is such: Assume a document corpus that is indexed.
> For a user query, a document is returned and selected by the user for
> editing as part of some use case/workflow. That document is now marked as
> either historically interesting or not, financially relevant, specific to
> media or entertainment domain, etc. by the user. So, essentially the user
> is flagging the document with certain markers.
> Another set of users could possibly want to query on these markers. So,
> let's say a second user comes along and wants to see the top documents
> belonging to one category, say, agriculture or farming. Since these markers
> are run time activities, how can I use the facets on them? So, I was
> envisioning facets as the various markers. But, if I constantly re-index or
> update the documents whenever a marker changes, I believe it would not be
> very efficient.
>
> Is there anything, facets or otherwise, in Lucene that can help me solve
> this use case?
>
> Please let me know. And, thanks!
>
> -----------------------
> Thanks n Regards,
> Sandeep Ramesh Khanzode
>
>
> On Friday, June 13, 2014 9:51 PM, Shai Erera <serera@gmail.com> wrote:
>
>
>
> Hi
>
> You can check the demo code here:
>
> https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/
> .
> This code is updated with each release, so you always get working code
> examples, even when the API changes.
>
> If you don't mind managing the sidecar index, which I agree isn't such a
> big deal, then yes - the taxonomy index currently performs the fastest. I
> plan to explore porting the taxonomy-based approach from BinaryDocValues to
> the new SortedNumericDocValues (coming out in 4.9) since it might perform
> even faster.
>
> I didn't quite get the marker/flag facet. Can you give an example? For
> instance, if you can model that as a NumericDocValuesField added to
> documents (w/ the different markers/flags translated to numbers), then you
> can use Lucene's updatable numeric DocValues and write a custom Facets to
> aggregate on that NumericDocValues field.
>
> Shai
>
>
>
> On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode <
> sandeep_khanzode@yahoo.com.invalid> wrote:
>
> > Hi,
> >
> > I am evaluating Lucene Facets for a project. Since there is a lot of
> > change in 4.7.2 for Facets, I am relying on UTs for reference. Please let
> > me know if there are other sources of information.
> >
> > I have a couple of questions:
> >
> > 1.] All categories in my application are flat, not hierarchical. But it
> > seems from a few sources that, even so, you would want to use a
> > Taxonomy-based index for performance reasons. It is faster but uses
> > more RAM. Or is the deterrent to using it the fact that it is a separate
> > data structure? If one can live with the life-cycle management of the
> > extra index, should we go ahead with the taxonomy index for better
> > performance across tens of millions of documents?
> >
> > Another note to add is that I do not see a scenario wherein I would want
> > to re-index my collection over and over again or, in other words, the
> > changes would be spread over time.
> >
> > 2.] I need a type of dynamic facet that allows me to add a flag or marker
> > to the document at runtime since it will change/update every time a user
> > modifies or adds to the list of markers. Is this possible to do with the
> > current implementation? I ask because I believe that currently all
> > faceting is done at indexing time.
> >
> >
> > -----------------------
> > Thanks n Regards,
> > Sandeep Ramesh Khanzode