lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase
Date Sun, 07 Apr 2013 01:43:30 GMT
Solr would not be storing the original source form of the documents in any 
case. Whether you use Tika or SolrCell, only the text stream of the content 
and the metadata would ever get indexed or stored in Solr.

Solr completely decouples "indexing" and "storing" of data values. If you 
don't want to "store" the text stream in Solr, then don't.
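To make the indexed vs. stored distinction concrete, here is a toy sketch in Python. It is only an illustration of the idea, not how Lucene or Solr is implemented; the class name ToyIndex and the field names "url" and "text" are made up for the example.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of per-field indexed/stored flags (not Lucene's real design)."""

    def __init__(self, field_flags):
        # field_flags maps a field name to {"indexed": bool, "stored": bool}.
        self.field_flags = field_flags
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> stored field values

    def add(self, doc_id, fields):
        self.stored[doc_id] = {}
        for name, value in fields.items():
            flags = self.field_flags[name]
            if flags["indexed"]:
                # Searchable: each whitespace-separated term points at the doc.
                for term in value.lower().split():
                    self.postings[(name, term)].add(doc_id)
            if flags["stored"]:
                # Retrievable: the original value is kept for display.
                self.stored[doc_id][name] = value

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))

    def get_stored(self, doc_id):
        return self.stored[doc_id]


index = ToyIndex({
    "url": {"indexed": False, "stored": True},   # returned, not searchable
    "text": {"indexed": True, "stored": False},  # searchable, not returned
})
index.add("doc1", {
    "url": "http://example.com/a",
    "text": "Solr decouples indexing and storing",
})
hits = index.search("text", "indexing")
print(hits, index.get_stored(hits[0]))
```

The text can be found by a search, but only the URL comes back with the hit, which is the situation you get when the text field is indexed and not stored.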

If you want to "store" the original blob of the source documents in some 
other data store, that's your choice. You can store the original URL or a 
document ID or URL for some alternate document store. That's your choice to 
make. Solr in no way forces you one way or the other. And whether that URL 
or document ID refers to HBase or a web site, doesn't matter to Solr either.
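A minimal sketch of that pointer approach, in Python. Everything named here is an assumption for illustration: the field names "id", "url" and "content", the SHA-1 key scheme, and the raw_store dict, which only stands in for HBase or whatever store you pick. The list of dicts is the general shape of a JSON update body for Solr, but nothing is actually sent.

```python
import hashlib
import json

# Stand-in for the external document store (HBase in the question).
raw_store = {}


def make_row_key(url):
    # Hypothetical key scheme: SHA-1 hex digest of the URL.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def save_page(url, raw_bytes, text):
    """Keep the raw bytes outside Solr and return the document Solr would get."""
    key = make_row_key(url)
    raw_store[key] = raw_bytes
    # Solr sees only the key, the URL and the extracted text.
    # Whether "content" is stored or only indexed is decided in the schema.
    return {"id": key, "url": url, "content": text}


doc = save_page("http://example.com/a", b"<html>hello</html>", "hello")
payload = json.dumps([doc])        # JSON body for an update request
original = raw_store[doc["id"]]    # later: follow the pointer from a search hit
```

A search hit only needs to carry the "id" (or the URL); the original bytes are fetched from the other store with that key.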

Whether or not you could more efficiently store the original document bytes 
in Lucene/Solr DocValues vs. HBase is a separate matter - I don't know one 
way or the other whether DocValues help or not. Or whether a Solr 
BinaryField might be suitable for storing the original bytes of a document 
(but without indexing the bytes).

In other words, maybe you could just use two separate Solr servers, one for 
text index and metadata store, and the other for raw store of the original 
document bytes.

-- Jack Krupansky

-----Original Message----- 
From: Furkan KAMACI
Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Pointing to Hbase for Docuements or Directly Saving Documents at 
Hbase

Hi;

First of all, I should mention that I am new to Solr and am researching it.
What I am trying to do is crawl some websites with Nutch and then index
them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2)

I wonder about something. I have a cloud of machines that crawls websites
and stores those documents. Then I send those documents to SolrCloud. Solr
indexes them, generates the indexes, and saves them. I know from
Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be some solutions for this; if you
can explain them, you are welcome to.)

However, Solr stores some document content too (e.g. for highlighting), so
some of my documents will be duplicated. Since I will have many documents,
those duplicated documents may cause a problem for me. So is there any way
to avoid storing those documents in Solr and point to them in HBase (where
I save my crawled documents), or, instead of pointing, to store them
directly in HBase (is that efficient or not)?

