lucene-dev mailing list archives

From "jian chen" <chenjian1...@gmail.com>
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Tue, 02 May 2006 04:15:52 GMT
Hi, Chuck,

Using standard UTF-8 is very important for the Lucene index, so that any
program can read the index easily, whether it is written in Perl, C/C++, or
any future programming language.

It is like storing data in a database for a web application. You want to
store it in such a way that programs other than the web app can manipulate
it easily. There will be cases where you want to mass update or mass change
the data, and you don't want to have to write web apps just to do that,
right?
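
To make the difference concrete, here is a minimal sketch comparing the two
encodings. It uses the JDK's DataOutputStream.writeUTF as a stand-in for
Java's modified UTF-8; that is an assumption for illustration, not Lucene's
own index-writing code. The class name Utf8Comparison and its helper methods
are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Comparison {

    /** Standard UTF-8 bytes, as a reader written in another language would expect. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Java "modified UTF-8" bytes produced by DataOutputStream.writeUTF.
     * writeUTF prepends a 2-byte length header, which is stripped here so
     * that only the encoded characters are compared.
     */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",                                   // plain ASCII
            "\0",                                    // U+0000
            new String(Character.toChars(0x1D11E))   // a character outside the BMP
        };
        for (String s : samples) {
            System.out.println("standard: " + hex(standardUtf8(s)));
            System.out.println("modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

For plain ASCII the two encodings agree. They differ for U+0000 (one byte
in standard UTF-8, two bytes in the modified form) and for characters
outside the Basic Multilingual Plane (four bytes in standard UTF-8, two
3-byte surrogate encodings in the modified form), so a strict UTF-8 decoder
in another language may reject or misread such terms.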

Cheers,

Jian


On 5/1/06, Chuck Williams <chuck@manawiz.com> wrote:
>
> Could someone summarize succinctly why it is considered a major issue
> that Lucene uses the Java modified UTF-8 encoding within its index
> rather than the standard UTF-8 encoding?  Is the only concern
> compatibility with index formats in other Lucene variants?  The API to
> the values is a String, which uses Java's char representation, so I'm
> confused why the encoding in the index is so important.
>
> One possible benefit of a standard UTF-8 index encoding would be
> streaming content into and out of the index with no copying or
> conversions.  This relates to the lazy field loading mechanism.
>
> Thanks for any clarification,
>
> Chuck
>
>
> jian chen wrote on 05/01/2006 04:24 PM:
> > Hi, Marvin,
> >
> > Thanks for your quick response. I am in the camp of fearless refactoring,
> > even at the expense of breaking compatibility with previous releases. ;-)
> >
> > Compatibility aside, I am trying to identify if changing the
> > implementation of Term is the right way to go for this problem.
> >
> > If it is, I think it would be worthwhile rather than putting a band-aid
> > on the existing API.
> >
> > Cheers,
> >
> > Jian
> >
> >> Changing the implementation of Term
> >> would have a very broad impact; I'd look for other ways to go about
> >> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
> >> doesn't use the same technique for merging.
> >>
> >> My plan was to first submit a patch that made the change to the file
> >> format but didn't touch SegmentMerger, then attack SegmentMerger and
> >> also see if other developers could suggest optimizations.
> >>
> >> However, I have an awful lot on my plate right now, and I basically
> >> get paid to do KinoSearch-related work, but not Lucene-related work.
> >> It's hard for me to break out the time to do the java coding,
> >> especially since I don't have that much experience with java and I'm
> >> slow.  I'm not sure how soon I'll be able to get back to those
> >> bytecount patches.
> >>
> >> Marvin Humphrey
> >> Rectangular Research
> >> http://www.rectangular.com/
> >>
> >
