Thursday, August 28, 2014

Large scale topic detection on social networks

Authors: Nemanja Spasojevic, Jinyun Yan, Adithya Rao, Prantik Bhattacharyya (Klout, Inc., San Francisco, CA, USA)

Millions of people use social networks every day to talk about a variety of subjects, publish opinions and share information. Understanding this data to infer users' topical interests is a challenging problem with applications in various data-powered products. In this paper, we present 'LASTA' (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and is reactive to fresh information, updating topics for users as interests shift. LASTA generates over 50 distinct features derived from signals such as user generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests as compared to using only generated text or only graph based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.

Monday, August 25, 2014

Latex to MS Word Doc Shenanigans

# mathplayer

rm -r /tmp/topic_paper_mathplayer/
mkdir -p /tmp/topic_paper_mathplayer/
htlatex indg0731-spasojevic "xhtml,mathplayer" "" -d/tmp/topic_paper_mathplayer/

#Open office

rm -r /tmp/topic_paper_open_office/
mkdir -p /tmp/topic_paper_open_office/
htlatex indg0731-spasojevic "xhtml,ooffice" "ooffice/! -cmozhtf" -d/tmp/topic_paper_open_office/ -coo -cvalidate

htlatex indg0731-spasojevic "xhtml,ooffice,bib-,mathml-" " -cmozhtf" "-coo" 

Wednesday, June 4, 2014

Get MongoDB master from the command line

MONGO_HOST_PORT_MASTER=$(ssh remote@host  "mongo ${MONGO_HOST}:${MONGO_PORT} --eval 'printjson(rs.isMaster())' "| grep primary | cut -d"\"" -f4)
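A quick way to sanity-check the grep|cut parsing above without a live cluster is to run it against canned rs.isMaster() output (the sample JSON and host name below are made up):

```shell
# Canned output in the shape printjson(rs.isMaster()) produces.
SAMPLE_OUTPUT='{
        "setName" : "rs0",
        "ismaster" : false,
        "primary" : "db-master.example.com:27017",
        "ok" : 1
}'
# Same parsing as the one-liner above: grab the 4th quote-delimited field
# of the "primary" line.
MONGO_HOST_PORT_MASTER=$(echo "$SAMPLE_OUTPUT" | grep primary | cut -d"\"" -f4)
echo "$MONGO_HOST_PORT_MASTER"
```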

MongoDb <-> Hive Hook

create external table if not exists mongo_bla_display_order(
  mongo_id string,
  user_id string,
  display_order string
)
stored by 'org.yong3.hive.mongo.MongoStorageHandler'
with serdeproperties( "mongo.column.mapping" = "_id,uid,displayOrder" )
tblproperties ( "mongo.host" = "${mongoHost}" , "mongo.port" = "${mongoPort}" ,
     "mongo.db" = "my_db" , "mongo.collection" = "persistence" );

Monday, June 2, 2014

Uncompress *.xz file

# pure xz file
unxz <filename>.xz

# tar xz file
tar -Jxf <filename>.tar.xz
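For completeness, the reverse direction (compressing), plus a round trip to verify it; the file names and scratch directory are just for this sketch:

```shell
# work in a scratch directory
mkdir -p /tmp/xz_demo
echo "hello xz" > /tmp/xz_demo/sample.txt

# pure xz file: -k keeps the original next to sample.txt.xz, -f overwrites
xz -kf /tmp/xz_demo/sample.txt

# tar xz file: -J selects xz compression
tar -cJf /tmp/xz_demo/sample.tar.xz -C /tmp/xz_demo sample.txt

# round trip: decompress to stdout and compare with the original
unxz -c /tmp/xz_demo/sample.txt.xz > /tmp/xz_demo/roundtrip.txt
cmp /tmp/xz_demo/sample.txt /tmp/xz_demo/roundtrip.txt && echo OK
```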

Saturday, April 12, 2014

Play setup:

IntelliJ IDEA and Play: ambiguous 'index' definitions and similar issues:

This was the fix (Play 2.2 and IntelliJ 12):!topic/play-framework/X78Ikg9PMyE



I have a problem in IntelliJ when I create a new Java Play application, generate the IDE configuration and open the project.

I see the following error in IntelliJ - "Reference to 'index' is ambiguous, both 'views.html.index$' and 'views.html.index' match"

This only occurs in the following scenarios:

Enable: Play 2.0 Support plugin, Scala plugin, and built-in Playframework Support plugin that comes with IntelliJ Ultimate
Enable: Play 2.0 Support plugin and Scala plugin and Disable: the Playframework Support plugin that comes with IntelliJ Ultimate

There are no issues when I:

Enable: Scala plugin, and Playframework Support plugin that comes with IntelliJ Ultimate and Disable: Play 2.0 Support plugin

If I change the import statement from import views.html.*; to import views.html.index; then all of the above configurations work.

Would someone be able to explain why this issue is occurring? I'm happy to submit a PR with the above change if this is a reasonable fix. Before I figured out how to resolve it I did some searching and there are definitely a number of other people experiencing this issue without being able to find a solution, for example:



Tuesday, September 3, 2013

Start HBase rest (stargate)

ssh <user>@<hbase-remote-host>-nn1
> hbase rest start -p 7000 &
> disown
> exit

Tuesday, August 13, 2013

Crawl / Curl from Hive

package com.blout.thunder.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.log4j.Logger;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

/**
 * Author: Nemanja Spasojevic
 */
@Description(
    name = "curl",
    value = " Given url returns content of given web page; if curl failed for whatever" +
        " reason it returns null. " +
        "string _FUNC_(string) \n")
public class CurlUDF extends UDF {
  private static final Logger LOG = Logger.getLogger(CurlUDF.class);

  private static final int DEFAULT_SLEEP_TIME_MS = 1000;
  private static final int DEFAULT_RE_TRIES      = 3;
  private static final int LOG_STACK_FIRST_TIMES = 100;
  private static int counter_ = 0;

  public String evaluate(String webPageURL) throws Exception {
    return fetch(webPageURL, DEFAULT_SLEEP_TIME_MS);
  }

  public String evaluate(String webPageURL, int sleepTimeMS) {
    return fetch(webPageURL, sleepTimeMS);
  }

  public String fetch(String webPageURL, int sleepTimeMS) {
    ++counter_;
    for (int i = 1; i <= DEFAULT_RE_TRIES; ++i) {
      try {
        StringBuffer output = new StringBuffer();
        URL url = new URL(webPageURL);
        System.out.println(counter_ + ") Fetching try [" + i + "]: " + webPageURL);
        InputStream response = url.openStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(response));
        for (String line; (line = reader.readLine()) != null;) {
          output.append(line).append('\n');
        }
        System.out.println(counter_ + ") Fetching try [" + i + "]: success");
        return output.toString();
      } catch (Exception e) {
        // Log the stack only for the first few failures to avoid log spam.
        if (LOG_STACK_FIRST_TIMES > counter_) {
          LOG.warn("Fetch failed: " + webPageURL, e);
        }
        // Quadratic back-off before the next retry.
        try { Thread.sleep(sleepTimeMS * i * i); } catch (Exception et) {}
      }
    }
    return null;
  }
}

CREATE TEMPORARY FUNCTION curl AS 'com.blout.thunder.hive.udf.CurlUDF';

# Get the gradient colors (red for low values, green for high). Always needed, so here is an example in JS:

getColor : function(value, maxValue) {
    var h = 120 + 240 - Math.max(0, Math.min(240, 240 * value / maxValue));
    return 'hsl(' + h + ',100%,90%)';
}
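The same hue mapping can be sanity-checked from the shell with awk (the hue helper below is just for this sketch): value 0 maps to hue 360 (red) and value = maxValue maps to hue 120 (green).

```shell
# Same formula as the JS getColor above: clamp 240*value/maxValue to
# [0, 240], then subtract from the 120 + 240 offset.
hue() {
  awk -v v="$1" -v m="$2" 'BEGIN {
    c = 240 * v / m
    if (c < 0) c = 0
    if (c > 240) c = 240
    print 120 + 240 - c
  }'
}
hue 0 100     # red end
hue 100 100   # green end
```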

Friday, May 10, 2013

Hive: Force UDF execution to happen on reducer side

Doing a quick and dirty URL fetch from Hive, I wanted the URLs to be distributed among 5 jobs. Since the input is small, it is very hard to tune the mapper side so that things happen on, say, 5 mappers.


Naive version (the UDF runs on the mapper side):

insert overwrite table url_raw_content partition(dt = 20130606)
select full_url,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from url_queue_table_sharded_temp;

Forcing UDF execution onto the reducers (5 reducers):

set mapred.reduce.tasks=5;
insert overwrite table url_raw_content_table partition(dt = 20130606)
select full_url,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from (
    select full_url, priority
    from url_queue_table_sharded_temp
    distribute by md5(full_url) % 5
    sort by md5(full_url) % 5, priority desc
) d
distribute by md5(full_url) % 5;

Thursday, March 21, 2013

Hive Make mapper re-use JVM

Useful if your UDF has some kind of static initialization (e.g. from the distributed cache), and you want the initialized object to be reused across multiple map tasks.

SET mapred.job.reuse.jvm.num.tasks=100;

Monday, March 18, 2013

Port Forwarding on VM (VMWare Fusion)

Use Case:
Usually you spin up your VM on a local box, but sometimes you need to share a server link with other folks in the office. By forwarding a VM port to a local port you can easily share access: others hit your machine and it forwards requests to the VM.

1) Edit 
 /Library/Preferences/VMware Fusion/vmnet8/nat.conf

# Use these with care - anyone can enter into your VM through these...
# The format and example are as follows:
#<external port number> = <VM's IP address>:<VM's port number>
#8080 =
8081 =

2) Apply 
For the change to become effective, shut down the VM, exit the VMware Fusion application, then start the application and the VM again.

Thursday, February 21, 2013

Maven Java Exec

mvn exec:java -Dexec.mainClass="com.klout.thunder.hive.udf.ExtractDictionaryKeyValuesUDF"  2>&1 |  grep BENCH_DICT

Wednesday, November 7, 2012

Freebase autosuggest

Great jQuery plugin for the auto suggest:

# Schema explorer

Saturday, October 13, 2012

Small cross domain handler for the Scala Play

I rarely use Scala and Play, but occasionally I need to add some dashboard functionality. In this case I wanted to use the HBase REST API to get cells from HBase that were encoded as JSON. You could read HBase from a Scala client, but using REST is simpler; however, it requires cross-domain calls, which is why I needed some kind of cross-domain proxy. It may be useful to you if you use the Play/Scala framework. It works, but use it at your own risk.


object CrossDomainExample extends Controller {

  def crossDomain = Action(parse.json) {
    request =>
      request.body match {
        case JsObject(fields) =>
          val jsonMap = fields.toMap
          val url: JsValue = jsonMap("url")
          val acceptOpt: Option[JsValue] = jsonMap.get("accept")
          val acceptValue = acceptOpt match {
            case Some(header) => header.as[String]
            case _ => "application/json"
          }
          val requestUrl: String = url.as[String]
          Async {
            for (response <- WS.url(requestUrl).withHeaders("Accept" -> acceptValue).get()) yield {
              println("Sending : " + response.body)
              Ok(response.body)
            }
          }
        case _ => Ok("received something else: " + request.body + '\n')
      }
  }
}

// Routes
POST    /crossdomain                  controllers.CrossDomainExample.crossDomain

# Test from Bash 
 curl  --header "Content-type: application/json"  --request POST  --data '{"url": "", "accept" : "application/json"}'  http://localhost:9000/crossdomain -v

Tuesday, October 9, 2012

Proto buffer and JSON (de)serialization


// JSON -> proto
Message.Builder builder = SomeProto.newBuilder();
String jsonFormat = _load json document from a source_;
JsonFormat.merge(jsonFormat, builder);

// proto -> JSON
Message someProto = SomeProto.getDefaultInstance();
String jsonOutput = JsonFormat.printToString(someProto);

Tuesday, September 25, 2012


MySQL: update one table from another with a JOIN

UPDATE table1
JOIN table2
ON table1.sourceId = table2.sourceId
SET table1.sourceInfo = table2.sourceInfo;