lucene-solr-user mailing list archives

From Giovanni Gherdovich <g.gherdov...@gmail.com>
Subject Re: indexing unstructured text (tweets)
Date Mon, 28 May 2012 14:35:21 GMT
Hello Jack and Anuj,

2012/5/28 Jack Krupansky <jack@basetechnology.com>:
> The Twitter API extracts hash tag and user mentions for you, in addition to
> giving you the full raw text. You'll have to read up on the Twitter API.

That's what I thought just after hitting "send" on the message above ;-)
I am pretty sure the Twitter API format maps very nicely onto a suitable
input format for Solr, if it is not already good enough to feed
into Solr directly.

I am a bit unlucky here because I have been provided with
only the raw text for about 1.5 million tweets; so I would have
to write a few lines of code to restore at least user mentions,
hashtags and URLs.
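
A minimal sketch of what those few lines could look like, assuming Python and deliberately simplified regular expressions. The function name `extract_entities` and the output keys are made up for illustration; they are not part of the Twitter API or of any Solr schema, and Twitter's own entity rules are stricter than these patterns (for example, trailing punctuation after a URL is not stripped here).

```python
import re

# Simplified patterns (assumptions, not Twitter's official tokenization rules).
# The lookbehind keeps e-mail addresses such as foo@bar.com from matching.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})")
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_&])#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the mentions, hashtags and URLs found in one raw tweet."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so that a '#fragment' is not taken for a hashtag.
    without_urls = URL_RE.sub(" ", text)
    return {
        "text": text,
        "mentions": MENTION_RE.findall(without_urls),
        "hashtags": HASHTAG_RE.findall(without_urls),
        "urls": urls,
    }


if __name__ == "__main__":
    tweet = "RT @jack: indexing #tweets with #Solr http://example.com/a#frag now"
    print(extract_entities(tweet))
```

Each resulting dictionary keeps the raw text next to the three recovered lists, so it could later be mapped onto multi-valued fields, whatever the schema ends up calling them.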


2012/5/28 Anuj Kumar <anujsays@gmail.com>:
> This is a bit old but provides good information for schema design-
> http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php
>
> Found this link as well- https://gist.github.com/702360
>
> The types of the field may depend on the search requirements.

Anuj, thanks for the very interesting links,
even though those kinds of specifics might already be covered
in the Twitter API docs.
Once I am done with my first Solr setup, I might
set up the whole pipeline (fetching the Twitter feeds myself)
on my own machines, so that I can exploit all the
information Twitter provides.

Cheers,
Giovanni
