Friday, May 10, 2013

Hive: Force UDF execution to happen on reducer side

Doing a quick and dirty URL fetch from Hive, I wanted the URLs to be distributed among 5 tasks. The input is small, so it is very hard to tune things on the mapper side so that the work happens on, say, 5 mappers.


Regular:


insert overwrite table url_raw_contant partition(dt = 20130606)
select full_url,
       priority,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from url_queue_table_sharded_temp;



Forced UDF execution on the reducer side (5 reducers):



set mapred.reduce.tasks=5;
insert overwrite table url_raw_contant_table partition(dt = 20130606)
select full_url,
       priority,
       regexp_replace(curl_url(full_url),  '\n|\r', ' ') as raw_html
from (
    select full_url, priority
    from url_queue_table_sharded_temp
    distribute by md5(full_url) % 5
    sort by md5(full_url) % 5, priority desc
) d
distribute by md5(full_url) % 5;
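
To get a feel for what the distribute by md5(full_url) % 5 clause does to the rows, here is a minimal Python sketch of the same bucketing idea. It is only an illustration: it assumes the md5 digest is read as an integer before the modulus is taken, and reducer_for, shard and the sample URLs are made-up names that are not part of the Hive job.

```python
import hashlib
from collections import defaultdict

NUM_REDUCERS = 5  # mirrors mapred.reduce.tasks=5


def reducer_for(url, num_reducers=NUM_REDUCERS):
    """Return the bucket (0..num_reducers-1) that a URL hashes into."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Assumption: the hex digest is interpreted as one big integer.
    return int(digest, 16) % num_reducers


def shard(urls, num_reducers=NUM_REDUCERS):
    """Group URLs by bucket, the way rows are grouped per reducer."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[reducer_for(url, num_reducers)].append(url)
    return dict(buckets)


if __name__ == "__main__":
    sample = ["http://example.com/page/%d" % i for i in range(20)]
    for bucket, urls in sorted(shard(sample).items()):
        print(bucket, len(urls))
```

The same URL always lands in the same bucket, so each reducer gets a stable slice of the queue no matter how few mappers the small input produced.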

Thursday, March 21, 2013

Hive Make mapper re-use JVM

Useful if your UDF has some kind of static initialization (e.g. from the distributed cache), and you want the initialized object to be reused across multiple map tasks.

SET mapred.job.reuse.jvm.num.tasks=100;


Monday, March 18, 2013

Port Forwarding on VM (VMWare Fusion)


Use Case:
Usually you spin up your VM on your local box, but sometimes you need to share a server link with other folks in the office. By forwarding a VM port to a local port you can easily share access to your machine and have it forward requests to the VM.


1) Edit 
 /Library/Preferences/VMware Fusion/vmnet8/nat.conf

....
# Use these with care - anyone can enter into your VM through these...
# The format and example are as follows:
#<external port number> = <VM's IP address>:<VM's port number>
#8080 = 172.16.3.128:80
8081 = 192.168.242.128:8081
...

2) Apply 
For the change to take effect, shut down the VM, then exit the VMware Fusion application, and start the application and the VM again.
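
Once everything is back up, it can be handy to check that the forwarded port actually answers. A minimal sketch, assuming plain TCP and using a made-up helper name is_port_open (nothing here is VMware specific):

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: treat all of them as closed.
        return False
```

For the nat.conf example above, is_port_open("localhost", 8081) would be the call to try on the host, and colleagues would use your machine's address instead of localhost.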

Thursday, March 14, 2013

Thursday, February 21, 2013

Maven Java Exec

mvn exec:java -Dexec.mainClass="com.klout.thunder.hive.udf.ExtractDictionaryKeyValuesUDF"  2>&1 |  grep BENCH_DICT

Wednesday, November 7, 2012

Freebase autosuggest

Great jQuery plugin for autosuggest:

http://wiki.freebase.com/wiki/Freebase_Suggest

# Schema explorer

http://schemas.freebaseapps.com/type?id=/type/property

Saturday, October 13, 2012

Small cross domain handler for the Scala Play

I rarely use Scala and Play, but occasionally I need to add some dashboard functionality. In this case I wanted to use the HBase REST API to get cells from HBase, which were encoded as JSON. You could read HBase from a Scala client, but using REST is simpler. It does require cross-domain calls, however, which is why I wrote a small cross-domain proxy. It may be useful to you if you use the Play/Scala framework. It works, but use it at your own risk.

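CrossDomainExample.scala

package controllers

// Imports assume Play 2.0.x
import play.api.libs.json.{JsObject, JsValue}
import play.api.libs.ws.WS
import play.api.mvc.{Action, Controller}

object CrossDomainExample extends Controller {

  // Expects a JSON body like {"url": "...", "accept": "..."} and proxies a GET to that URL.
  def crossDomain = Action(parse.json) {
    request =>
      request.body match {
        case JsObject(fields) =>
          val jsonMap = fields.toMap
          val url: JsValue = jsonMap("url")
          // Fall back to JSON when the caller does not name an Accept type.
          val acceptOpt: Option[JsValue] = jsonMap.get("accept")
          val acceptValue = acceptOpt match {
            case Some(header) => header.as[String]
            case _ => "application/json"
          }
          // Named targetUrl so it does not shadow the incoming request.
          val targetUrl: String = url.as[String]
          Async {
            for (response <- WS.url(targetUrl).withHeaders("Accept" -> acceptValue).get()) yield {
              println("Sending : " + response.body)
              Ok(response.body).as(acceptValue)
            }
          }
        case _ => Ok("received something else: " + request.body + '\n')
      }
  }
}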

// Routes
POST    /crossdomain                  controllers.CrossDomainExample.crossDomain

# Test from Bash 
 curl  --header "Content-type: application/json"  --request POST  --data '{"url": "http://sample-url-that-serves-json.com/getJson?id=123456789", "accept" : "application/json"}'  http://localhost:9000/crossdomain -v
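
If you would rather script the test than use curl, the same POST can be sketched in Python with only the standard library. This is an illustration under assumptions: build_proxy_request and fetch_via_proxy are made-up helper names, and the proxy is assumed to be running on localhost:9000 as in the curl call.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:9000/crossdomain"  # same endpoint as the curl test


def build_proxy_request(target_url, accept="application/json", proxy_url=PROXY_URL):
    """Build the POST the proxy expects: a JSON body with 'url' and 'accept'."""
    payload = json.dumps({"url": target_url, "accept": accept}).encode("utf-8")
    return urllib.request.Request(
        proxy_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_via_proxy(target_url, accept="application/json"):
    """Send the request and return the proxied body (needs the Play app running)."""
    with urllib.request.urlopen(build_proxy_request(target_url, accept)) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_via_proxy("http://sample-url-that-serves-json.com/getJson?id=123456789") should behave like the curl line, minus the verbose output.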


Tuesday, October 9, 2012

Protocol Buffers and JSON (de)serialization


http://code.google.com/p/protobuf-java-format/

From JSON:

// JsonFormat is com.googlecode.protobuf.format.JsonFormat from protobuf-java-format
Message.Builder builder = SomeProto.newBuilder();
String jsonFormat = _load json document from a source_;
JsonFormat.merge(jsonFormat, builder);

To JSON:

Message someProto = SomeProto.getDefaultInstance();
String jsonFormat = JsonFormat.printToString(someProto);

Tuesday, September 25, 2012

UPDATE / SET on JOIN (MySQL)


UPDATE table1
JOIN table2
ON table1.sourceId = table2.sourceId
SET table1.sourceInfo = table2.sourceInfo;

Thursday, September 13, 2012

\001 and sed madness


# When you need to replace your separator (or whatever) with a control character like \001 (Ctrl-A)
cat   ~/Desktop/topic.csv  | sed -e "s/_ESCAPE_TAG_/$(echo -e \\001)/g" &> /tmp/t.table
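
The same replacement can be done without fighting shell quoting. A minimal Python sketch, assuming the file is UTF-8 text; replace_tag and convert_file are made-up names and the paths are whatever you pass in:

```python
def replace_tag(text, tag="_ESCAPE_TAG_", separator="\x01"):
    """Swap every occurrence of the placeholder tag for the \\001 byte."""
    return text.replace(tag, separator)


def convert_file(src_path, dst_path, tag="_ESCAPE_TAG_", separator="\x01"):
    """Stream src_path to dst_path line by line, replacing the tag."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(replace_tag(line, tag, separator))
```

In Python "\x01" is the same character that \001 denotes in octal, so the output should match what the sed line writes.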

Wednesday, August 15, 2012

Tuesday, July 24, 2012

Reading binary key data using hbase shell


// Whatever bytes you have, you can print them as a hex string in Java (or any other language) using something like:
public static String getHexString(byte[] b) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < b.length; i++) {
        // & 0xff treats the byte as unsigned; + 0x100 and substring(1) keep the leading zero
        result.append("\\x").append(Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1));
    }
    return result.toString();
}

# In the HBase shell (note: the key must be in double quotes " for the shell to interpret it as a binary key)
hbase shell 
hbase> get 'stream_table', "\x00\x00\x00\x00\x00\x00\x00\x2c\x7b\x2e\x04\x00\x03"
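
For the "any other language" part, here is a rough Python equivalent of the hex helper. The function name to_hbase_hex is made up, and the sample key is just the bytes of the key used in the get above.

```python
def to_hbase_hex(data):
    """Render bytes as \\xNN escapes, ready to paste into a double-quoted shell key."""
    return "".join("\\x%02x" % b for b in data)


if __name__ == "__main__":
    key = bytes([0, 0, 0, 0, 0, 0, 0, 0x2C, 0x7B, 0x2E, 0x04, 0x00, 0x03])
    print(to_hbase_hex(key))
```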

Thursday, July 12, 2012

Identity reducer in hadoop

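org.apache.hadoop.mapreduce.Reducer
is itself the identity reducer (its default reduce() writes every value through unchanged), so you better fo'get 'bout looking for org.apache.hadoop.mapreduce.IdentityReducer; that class only exists in the old API as org.apache.hadoop.mapred.lib.IdentityReducer
(the same is true for the identity mapper)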
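
In case "identity reducer" is not obvious: it emits every value it receives under the same key, unchanged. A toy sketch in plain Python (no Hadoop API involved; identity_reduce and run_reduce are made-up names):

```python
def identity_reduce(key, values):
    """Emit each (key, value) pair exactly as it came in."""
    for value in values:
        yield key, value


def run_reduce(grouped, reducer=identity_reduce):
    """Feed key -> values groups to the reducer in sorted key order."""
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output
```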

Wednesday, April 4, 2012

Recover dropped stash in git

# Today I was just cleaning my stash and dropped the CL I cared about
...

git stash list

stash@{3}: WIP on liProfile: 14ee7fd  Updated something to something bla.
...

git stash drop stash@{3}
Dropped stash@{3} (1a3c7ff9cbf09317fe7261d2a779225e7344d37c)
....


# 2 min later I figured out I needed the stash back, but luckily the reference to it is still there (the hash printed next to the drop message)
git diff  1a3c7ff9cbf09317fe7261d2a779225e7344d37c &> /tmp/jesus


So although the stash was dropped, the change was still there, so no big deal. (git stash apply with that hash should bring it back as well.)

Sunday, March 11, 2012

IntelliJ increase memory (bump up Xmx/Xms via VMOptions)

# Find the plist of IntelliJ on your machine
find /Applications/Intell* | grep Info.plist
/Applications/IntelliJ IDEA 11.app/Contents/Info.plist


# Then edit the memory flags under VMOptions as you wish.
# In my file I had a couple of VMOptions.* entries for different platforms; I simply edited them all
# so as not to think too much. Below is my diff (< lines are the edited file, > lines are the copy in /tmp):


Nemanja-Spasjevics-MacBook-Pro:generic_scoring nemanjaspasjevic$ vi /Applications/IntelliJ\ IDEA\ 11.app/Contents/Info.plist
Nemanja-Spasjevics-MacBook-Pro:generic_scoring nemanjaspasjevic$ diff /Applications/IntelliJ\ IDEA\ 11.app/Contents/Info.plist /tmp/Info.plist
153c153
<       <string>-Xms256m -Xmx1512m -XX:MaxPermSize=1256m  -ea -Xverify:none -Xbootclasspath/a:../lib/boot.jar</string>
---
>       <string>-ea -Xverify:none -Xbootclasspath/a:../lib/boot.jar</string>
156c156
<       <string>-Xms256m -Xmx1512m -XX:MaxPermSize=1256m -XX:ReservedCodeCacheSize=254m</string>
---
>       <string>-Xms128m -Xmx512m -XX:MaxPermSize=950m -XX:ReservedCodeCacheSize=254m</string>
159c159
<       <string>-Xms256m -Xmx1512m -XX:MaxPermSize=1256m -XX:ReservedCodeCacheSize=254m -XX:+UseCompressedOops</string>
---
>       <string>-Xms128m -Xmx800m -XX:MaxPermSize=950m -XX:ReservedCodeCacheSize=254m -XX:+UseCompressedOops</string>


# Just to reiterate, the relevant part to be edited is:
...
      <key>VMOptions</key>
      <string>-ea -Xverify:none -Xbootclasspath/a:../lib/boot.jar</string>

      <key>VMOptions.i386</key>
      <string>-Xms128m -Xmx512m -XX:MaxPermSize=950m -XX:ReservedCodeCacheSize=254m</string>

      <key>VMOptions.x86_64</key>
      <string>-Xms128m -Xmx800m -XX:MaxPermSize=950m -XX:ReservedCodeCacheSize=254m -XX:+UseCompressedOops</string>
...

Tuesday, February 14, 2012

MR Unit (unittesting in hadoop Map / Reduce / MapReduce)

Pretty cool presentation and good examples. Simply use MapDriver/ReduceDriver/MapReduceDriver and do a quick test of your MR jobs.

Wednesday, December 7, 2011

Free book on 'Data-Intensive Text Processing'

The free book (the pre-production manuscript) Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer, is available here: http://www.umiacs.umd.edu/~jimmylin/book.html

Sunday, November 20, 2011

Make your own eBook sleeve using Google Books

I have always loved Google Books, not because I like books or reading that much, but because it is such a rich and unique corpus. Try searching for 16th-century anatomy, geography, medicine, or alchemy books and see what is out there. Looking at those books you can see that the Earth used to be a different place compared to now. Most of those books are in Latin or old German/Italian/French, none of which I understand, which is why I always hunt for illustrated books. Inspired by those, I decided to make a leather sleeve for my iRiver Story HD (the same procedure applies to any other kind of device). The one requirement was that no sewing is needed, as I do not know how to sew. I recently joined TechShop and took the Laser Cutting Class, so I decided to make this my first project. Here are the steps:
  1. Get inspiration:
    I chose Petri Bellonii Cenomani De aquatilibus, featured in one of the BibliOdyssey posts. It was printed in 1553 and has cool illustrations which I wanted to use.
  2. Measure your device:
    The iRiver Story HD is 7.5in x 5in, so the sleeve was sized so that the device fits in (the outline of the design was 8.5in x 6.5in, i.e. an extra 0.5in to 0.75in on each side).
  3. Fire up Illustrator and create the design:
    This was rather easy. The vectors for the back and front sides are the same: a rounded-corner rectangle for the base, with a sequence of holes (1/8in diameter eyelets) along the left, right and bottom sides. (Note: the Epilog Helix laser cutter treats lines of 0.001pt thickness and lower as vectors, so wherever you want a cut you need a vector line that thin.) On top of each design I put a raster image (a selected black and white graphic). I cut and pasted the raster images from the PDF of the book. To obtain a PDF with clean images I used the Google Books 'PDF' button (in viewing mode, look at the top right side; there is a down arrow with PDF written next to it). Overview of designs is shown below:
  4. Get leather:
    My wife got cheap leather for me from Franks Leather, which was ~$4 per square foot, and leather lace, which was $0.25 per foot.
  5. Make it burn, a.k.a. fire up the laser:
    Now that you have the design and the materials ready, just print your design to the laser cutter (yes, you have to figure out the printing parameters yourself, as different materials require different settings). Below is video of laser cutting phase:
  6. Wash, dust off char:
    As laser cutting burns a lot of material, your piece will smell of burning, so you might want to wash it, dry it, and treat it so it gets soft again. You can use suede leather protector spray, for example.
  7. Connect the pieces and there you are:
    Put the back and front faces together so the graphics face outward and the eyelets align. Simply take the lace and pass it through the eyelets.
  8. Start using it :-) :
    Now you have a one-of-a-kind sleeve, a true piece of history.

Saturday, September 10, 2011

Python regex

# useful command line whenever you are testing a regexp

import re
print(re.compile(r"^53\d{7}(\d|X|x)$").search("5320075221"))
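
A slightly longer version of the same check, handy when you want to try several inputs at once. The helper name matches and the extra sample strings are mine, picked only to exercise the pattern: it wants "53", then seven digits, then a final digit or X/x.

```python
import re

# Raw string so the backslashes reach the regex engine untouched.
PATTERN = re.compile(r"^53\d{7}(\d|X|x)$")


def matches(candidate):
    """Return True if the whole candidate fits the pattern."""
    return PATTERN.search(candidate) is not None


if __name__ == "__main__":
    for candidate in ["5320075221", "532007522X", "532007522", "6320075221"]:
        print(candidate, matches(candidate))
```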