Friday, September 19, 2014

Git Merge branch of a branch

Master to a branch, then a branch of that branch, and merging it all back:

# finish the in-progress rebase, then publish the sub-branch
git rebase --continue
git status
git log
git status
git push --force origin ns-cd-improvements-idf

# merge the sub-branch into its parent branch
git checkout ns-cd-improvements
git merge ns-cd-improvements-idf
git log
git status
git branch

# bring the parent branch up to date with master
git checkout master; git pull -r; git checkout ns-cd-improvements; git rebase master
git push origin ns-cd-improvements -f

# merge the parent branch into master and publish
git checkout master
git pull -r
git merge ns-cd-improvements
git push origin master

git pull --rebase origin master

Wednesday, September 17, 2014

Simple Web Server to Serve content of directory

As simple as:

python -m SimpleHTTPServer 10808

It serves the content of your current directory on port 10808. Perfect for non-prod stuff! (On Python 3 the equivalent is python3 -m http.server 10808.)
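
If you want the same thing from inside a script, here is a minimal Python 3 sketch using the standard http.server module. The helper name `serve_directory` and the choice to bind to 127.0.0.1 are my own assumptions for illustration, not part of the one-liner above.

```python
import functools
import http.server
import threading


def serve_directory(directory=".", port=10808, host="127.0.0.1"):
    """Serve `directory` over HTTP in a background thread and return the server.

    Hypothetical helper for illustration. Pass port=0 to let the OS pick a free port.
    """
    # Bind the handler to a specific directory instead of the process cwd.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call `serve_directory(".")` to start serving, and `server.shutdown()` on the returned object to stop. As with the one-liner, this is for non-prod use only.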

Tuesday, September 9, 2014

Deep Learning

Great Overview of Caffe

Hive JAR version mismatch / dependency collision (e.g. guava 11.0.2 vs 18.0)

Say you build your UDFs against guava 18.0 and deploy them. At execution time (not on local Hive, but in production, e.g. via Oozie) the utilities that use guava fail because a method is missing (e.g. hash.hashString(str) in 11.0.2 vs hash.hashUnencodedChars(str) in 18.0). This probably happens because a native MapReduce dependency depends on an older version of the same JAR, and at class-loading time the MapReduce libs get registered first. You can fix this by giving precedence to user-defined libs; just add the param below to the conf (hive-site.xml):


Thursday, August 28, 2014

Large scale topic detection on social networks

(download here)

Authors:
Nemanja Spasojevic, Klout, Inc., San Francisco, CA, USA
Jinyun Yan, Klout, Inc., San Francisco, CA, USA
Adithya Rao, Klout, Inc., San Francisco, CA, USA
Prantik Bhattacharyya, Klout, Inc., San Francisco, CA, USA

Millions of people use social networks everyday to talk about a variety of subjects, publish opinions and share information. Understanding this data to infer user's topical interests is a challenging problem with applications in various data-powered products. In this paper, we present 'LASTA' (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and is reactive to fresh information, updating topics for users as interests shift. LASTA generates over 50 distinct features derived from signals such as user generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests as compared to using only generated text or only graph based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.

Monday, August 25, 2014

LaTeX to MS Word Doc Shenanigans

# mathplayer

rm -r /tmp/topic_paper_mathplayer/
mkdir -p /tmp/topic_paper_mathplayer/
htlatex indg0731-spasojevic "xhtml,mathplayer" "" -d/tmp/topic_paper_mathplayer/

# OpenOffice

rm -r /tmp/topic_paper_open_office/
mkdir -p /tmp/topic_paper_open_office/
htlatex indg0731-spasojevic "xhtml,ooffice" "ooffice/! -cmozhtf"  -d/tmp/topic_paper_open_office/  -coo -cvalidate

htlatex indg0731-spasojevic "xhtml,ooffice,bib-,mathml-" " -cmozhtf" "-coo" 

Wednesday, June 4, 2014

Get the MongoDB master from the command line

MONGO_HOST_PORT_MASTER=$(ssh remote@host  "mongo ${MONGO_HOST}:${MONGO_PORT} --eval 'printjson(rs.isMaster())' "| grep primary | cut -d"\"" -f4)
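
The grep/cut pipeline relies on the shell printing a line of the form "primary" : "host:port". If the surrounding script is in Python anyway, the same extraction can be sketched with a regular expression. `extract_primary` is a hypothetical helper name, and it assumes the output has that same "primary" line.

```python
import re

# Matches a line like:   "primary" : "db1.example.com:27017",
_PRIMARY_RE = re.compile(r'"primary"\s*:\s*"([^"]+)"')


def extract_primary(is_master_output):
    """Return the "host:port" of the replica-set primary, or None if absent.

    A regex is used instead of json.loads because the legacy shell's printjson
    output may contain non-JSON values such as ISODate(...).
    """
    match = _PRIMARY_RE.search(is_master_output)
    return match.group(1) if match else None
```

Feed it the text captured from the `mongo ... --eval 'printjson(rs.isMaster())'` call and it returns the same value the `cut -d"\"" -f4` step picks out.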

MongoDB <-> Hive Hook

create external table if not exists  mongo_bla_display_order(
  mongo_id string,
  user_id string,
  display_order string
)
stored by 'org.yong3.hive.mongo.MongoStorageHandler'
with serdeproperties( "mongo.column.mapping" = "_id,uid,displayOrder" )
tblproperties ( "mongo.host" = "${mongoHost}" , "mongo.port" = "${mongoPort}" ,
     "mongo.db" = "my_db" , "mongo.collection" = "persistence" );

Monday, June 2, 2014

Uncompress *.xz file

# pure xz file
unxz <filename>.xz

# tar xz file
tar -Jxf <filename>.tar.xz
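
When the archive has to be unpacked from a script rather than a shell, Python's standard library covers both cases. A minimal sketch follows; the helper names `unxz_file` and `untar_xz` are made up for this example.

```python
import lzma
import shutil
import tarfile


def unxz_file(src_path, dst_path):
    """Decompress a plain .xz file to dst_path (like `unxz`, but keeps the source)."""
    with lzma.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def untar_xz(archive_path, dest_dir):
    """Extract a .tar.xz archive into dest_dir (like `tar -Jxf`).

    Only use this on archives you trust: older Python versions do not
    sandbox member paths during extractall.
    """
    with tarfile.open(archive_path, "r:xz") as archive:
        archive.extractall(dest_dir)
```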

Saturday, April 12, 2014

Play setup:

IntelliJ IDEA and Play: ambiguous 'index' definitions and similar issues:

This was the fix (Play 2.2 and IntelliJ 12): !topic/play-framework/X78Ikg9PMyE



I have a problem in IntelliJ when I create a new Java Play application, generate the IDE configuration and open the project.

I see the following error in IntelliJ - "Reference to 'index' is ambiguous, both 'views.html.index$' and 'views.html.index' match"

This only occurs in the following scenarios:

Enable: Play 2.0 Support plugin, Scala plugin, and built-in Playframework Support plugin that comes with IntelliJ Ultimate
Enable: Play 2.0 Support plugin and Scala plugin and Disable: the Playframework Support plugin that comes with IntelliJ Ultimate

There are no issues when I:

Enable: Scala plugin, and Playframework Support plugin that comes with IntelliJ Ultimate and Disable: Play 2.0 Support plugin

If I change the import statement:

import views.html.*; to be: import views.html.index; all of the above configurations work.

Would someone be able to explain why this issue is occurring? I'm happy to submit a PR with the above change if this is a reasonable fix. Before I figured out how to resolve it I did some searching and there are definitely a number of other people experiencing this issue without being able to find a solution, for example:



Tuesday, September 3, 2013

Start HBase REST (Stargate)

ssh <user>@<hbase-remote-host>-nn1
> hbase rest start -p 7000 &
> disown
> exit

Tuesday, August 13, 2013

Crawl / Curl from Hive

package com.blout.thunder.hive.udf;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.log4j.Logger;

/**
 * Author: Nemanja Spasojevic
 */
@Description(
    name = "curl",
    value =  " Given url returns content of given web page, if curl failed for whatever" +
        " reason it returns null. " +
        "string _FUNC_(string) \n"
)
public class CurlUDF extends UDF {
  private static final Logger LOG = Logger.getLogger(CurlUDF.class);
  private ListObjectInspector listInspector;

  private static int DEFAULT_SLEEP_TIME_MS = 1000;
  private static int DEFAULT_RE_TRIES      = 3;
  private static int LOG_STACK_FIRST_TIMES = 100;
  private static int counter_ = 0;

  public String evaluate(String webPageURL) throws Exception {
    return fetch(webPageURL, DEFAULT_SLEEP_TIME_MS);
  }

  public String evaluate(String webPageURL, int sleepTimeMS) {
    return fetch(webPageURL, sleepTimeMS);
  }

  public String fetch(String webPageURL, int sleepTimeMS) {
    // Assumed: the increment of counter_ is missing from the original listing.
    ++counter_;
    for (int i = 1; i <= DEFAULT_RE_TRIES; ++i) {
      try {
        StringBuffer output = new StringBuffer();
        URL url = new URL(webPageURL);
        System.out.println(counter_ + ") Fetching try [" + i + "]: " + webPageURL);
        InputStream response = url.openStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(response));
        for (String line; (line = reader.readLine()) != null;) {
          // Assumed loop body; it is missing from the original listing.
          output.append(line).append('\n');
        }
        reader.close();
        System.out.println(counter_ + ") Fetching try [" + i + "]: success");
        return output.toString();
      } catch (Exception e) {
        // Log only the first failures so the task logs do not get flooded.
        if (LOG_STACK_FIRST_TIMES > counter_) {
          LOG.warn("Fetch failed for " + webPageURL, e);
        }
        // Quadratic back-off, then fall through to the next try. The listing
        // had a "return null;" here, which would have disabled the retries.
        try { Thread.sleep(sleepTimeMS * i * i ); } catch (Exception et) {};
      }
    }
    // All tries failed.
    return null;
  }
}

CREATE TEMPORARY FUNCTION curl AS 'com.blout.thunder.hive.udf.CurlUDF';

# Get the gradient colors (from green to red). Always needed, so here is an example in JS:

getColor :  function(value, maxValue) {
    var h = 120 + 240 - Math.max(0, Math.min(240, 240 * value / maxValue));
    return 'hsl('+ h + ',100%,90%)';
}
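
To double-check the hue mapping, here is a Python port of the same formula. This is my own transliteration, and `get_color` is just an illustrative name. A value of 0 gives hue 360 (red), and a value of max_value or more gives hue 120 (green).

```python
def get_color(value, max_value):
    """Return a CSS hsl() string for `value` on a 0..max_value scale.

    Port of the JS getColor snippet; assumes max_value > 0.
    """
    # Scale value into 0..240 and clamp, exactly as the JS version does.
    clamped = max(0, min(240, 240 * value / max_value))
    hue = 120 + 240 - clamped
    return "hsl(%g,100%%,90%%)" % hue
```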

Tuesday, May 21, 2013

Friday, May 10, 2013

Hive: Force UDF execution to happen on reducer side

Doing a quick and dirty URL fetch from Hive, I wanted the URLs to be distributed among 5 jobs. The input is small, so it is very hard to tune things on the mapper side so that the work happens on, say, 5 mappers. The straightforward query runs the UDF on the mapper side:


insert overwrite table url_raw_contant partition(dt = 20130606)
select full_url,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from url_queue_table_sharded_temp;

Forcing UDF execution onto the reducer side (5 reducers):

set mapred.reduce.tasks=5;
insert overwrite table url_raw_contant_table partition(dt = 20130606)
select full_url,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from (
    select full_url, priority
    from url_queue_table_sharded_temp
    distribute by md5(full_url) % 5
    sort by md5(full_url) % 5, priority desc
) d
distribute by md5(full_url) % 5;
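
The distribute by clause is what spreads the rows: rows with the same value of that expression go to the same reducer. Here is a small Python sketch of the bucketing idea. It assumes the MD5 digest is read as an integer before taking the modulus; how the md5 function used in the query converts to a number is not shown in this post, so `bucket_for` and `shard` are illustrative only.

```python
import hashlib
from collections import defaultdict


def bucket_for(url, num_buckets=5):
    """Map a URL to a bucket in [0, num_buckets) via its MD5 digest."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def shard(urls, num_buckets=5):
    """Group URLs by bucket, mimicking what each reducer would receive."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[bucket_for(url, num_buckets)].append(url)
    return dict(buckets)
```

Because the bucket depends only on the URL, the same URL always lands in the same bucket, and a reasonably large input spreads roughly evenly over the 5 buckets.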

Thursday, March 21, 2013

Hive: make mappers re-use the JVM

Useful if your UDF has some kind of static initialization (e.g. from the distributed cache) and you want the initialized object to be reused across multiple map tasks.

SET mapred.job.reuse.jvm.num.tasks=100;

Monday, March 18, 2013

Port Forwarding on VM (VMWare Fusion)

Use Case:
Usually you spin up your VM on a local box, but sometimes you need to share a server link with other folks in the office. By forwarding a VM port to a local port, you can share access to your machine and have it forward requests to the VM.

1) Edit 
 /Library/Preferences/VMware Fusion/vmnet8/nat.conf

# Use these with care - anyone can enter into your VM through these...
# The format and example are as follows:
#<external port number> = <VM's IP address>:<VM's port number>
#8080 =
8081 =

2) Apply
For the change to take effect, shut down the VM, exit the VMware Fusion application, then start the application and the VM again.

Thursday, March 14, 2013

Thursday, February 21, 2013

Maven Java Exec

mvn exec:java -Dexec.mainClass="com.klout.thunder.hive.udf.ExtractDictionaryKeyValuesUDF"  2>&1 |  grep BENCH_DICT