Wednesday 8 January 2014

Pig Example - Loading Data From HBase

Background

This is a simple Pig script that uses HBase as its data source.

Column 1 - timestamp - long
Column 2 - composite numbers - bytearray

The Pig script loads the data into the specified schema and then performs the various operations on it. A UDF is used to convert the composite bytearray into its two separate parts. A param_file is also used to store the timestamps that will be tested.

smallIndex is the column family in this case.

hbase_iq1_pig.pig is the name of the Pig script.

Bash Command

pig -param_file paramfile.txt hbase_iq1_pig.pig
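
As a rough sketch, paramfile.txt might look like the following, assuming the start and end dates are stored as epoch milliseconds to match the long comparison in the script (the values shown are made up):

# paramfile.txt - one parameter per line; hypothetical values
STARTDATE = 1136073600000
ENDDATE = 1167609600000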

Pig Script

-- Register the jar containing the composite-key UDFs
REGISTER ./ConvertCompositeKey.jar;
DEFINE article allan.myudf.ConvertFirst();
DEFINE revision allan.myudf2.ConvertFirst();

-- -loadKey true loads the row key as the first field (ts); smallIndex:1 is the column read into comp
in = LOAD 'hbase://Table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('smallIndex:1', '-loadKey true -caster HBaseBinaryConverter')
 AS (ts: long, comp: bytearray);
-- Keep rows whose timestamp falls between the two parameters
modify = FILTER in BY (((long)'$STARTDATE') <= ts) AND (((long)'$ENDDATE') >= ts);
-- Split the composite value into its two component longs
titles = FOREACH modify GENERATE article(comp), revision(comp);
DUMP titles;

Hive Example - Loading From .txt File on HDFS

Background

This is a simple Hive script that uses a .txt file stored on HDFS as its data source.

The text file being loaded is a space-separated list with a newline character between each entry. Here is an example of the format:

Lewis 210210201201 156 Iolaire 2006-11-01T00:00:00Z donald16a ds16a
Sam 21021987501 110 LARS 2006-11-01T00:00:00Z donald16a ds16a
J0hn 210207896201 516 Sproule 2006-11-01T00:00:00Z kayleigh9a k8a
etc.

The Hive script loads the data into the specified schema and then performs the various operations on it. The parameters here are given through the command line. Because ISO 8601 timestamps sort lexicographically, the script can compare the ts strings directly to filter on a date range. HDFS_q1_hive.hql is the name of the Hive script.

Bash Command

hive -f HDFS_q1_hive.hql -hiveconf starttime='2006-11-01T00:00:00Z' -hiveconf endtime='2007-11-11T00:00:00Z'

Hive Script

-- Recreate the table definition on each run
DROP TABLE IF EXISTS table1;

-- External table over the space-delimited text file(s) on HDFS
CREATE EXTERNAL TABLE table1(type STRING, aid BIGINT, rid BIGINT, title STRING, ts STRING, uname STRING, uid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/textfilelocation/textfile';

-- Filter revisions inside the configured time window
SELECT aid, rid FROM table1 WHERE (ts >= '${hiveconf:starttime}' AND ts < '${hiveconf:endtime}' AND type == 'REVISION');

Pig Example - Loading From .txt File on HDFS

Background

This is a simple Pig script that uses a .txt file on HDFS as its data source.

The text file being loaded is a space-separated list with a newline character between each entry. Here is an example of the format:

16485442 7896 11
21512131 2151516 9761651
20899996 12 7896
etc.

The Pig script loads the data into the specified schema and then performs the various operations on it. The piggybank ISOToUnix UDF is used to convert an ISO timestamp to a Unix long. A param_file is also used to store the timestamps that will be tested. HDFS_iq1 is the name of the Pig script.

Bash Command

pig -param_file paramfile.txt HDFS_iq1
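
For this script the param_file would hold ISO 8601 strings rather than longs, since the values are passed straight to ISOToUnix. A hypothetical paramfile.txt:

# paramfile.txt - hypothetical values
STARTDATE = 2006-11-01T00:00:00Z
ENDDATE = 2007-11-11T00:00:00Z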

Pig Script

REGISTER /path/to/piggybank.jar;  -- piggybank ships ISOToUnix; adjust the path for your setup
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

-- PigStorage(' ') because the file is space separated (the default delimiter is tab)
data = LOAD '../output_folder/textfile' USING PigStorage(' ') AS (ts:long, a_id:long, rev_id:long);
b = FILTER data BY (ts >= ISOToUnix('$STARTDATE')) AND (ts < ISOToUnix('$ENDDATE'));
out = FOREACH b GENERATE a_id, rev_id;
DUMP out;

Tuesday 7 January 2014

Hive UDF - Convert Date to Unix Timestamp

Background

This little UDF will convert a date string, in any specified format, into a Unix timestamp (milliseconds since the epoch). To change the expected format, just edit the pattern string passed to SimpleDateFormat. Here is how we did it.

I have also left in the imports; you will need to find the jar files that contain these classes.

Implementation

package allan.DtoT;

import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.hive.ql.exec.UDF;

public class DateToTime extends UDF {
 public long evaluate(final String d) {
  try {
   // Edit this pattern to match your input date format
   SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
   // getTime() returns milliseconds since the epoch
   return sf.parse(d.trim()).getTime();
  } catch (ParseException pe) {
   // Unparseable input is signalled with -1
   return -1;
  }
 }
}
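
As a usage sketch, the compiled class could be registered and called from Hive like this (the jar name and function alias are my own placeholders):

ADD JAR ./DateToTime.jar;
CREATE TEMPORARY FUNCTION to_unix_ts AS 'allan.DtoT.DateToTime';

-- e.g. convert the ts column of the example table above
SELECT aid, to_unix_ts(ts) FROM table1;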

Pig UDF - Converting HBase Key to Long

Background

This little UDF will convert the first 8 bytes of an HBase key into a long. The key we had was a composite key made up of two 8-byte longs, and we needed to convert the first 8 bytes and then the second 8 bytes to get the two values separately. Here is how we did it.

I have also left in the imports; you will need to find the jar files that contain these classes.

Implementation

package allan.myudf;

import java.io.IOException;

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.hadoop.hbase.HBaseBinaryConverter;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

public class ConvertFirst extends EvalFunc<Long> {
 public Long exec(Tuple input) throws IOException {
  if (input != null && input.size() == 1) {
   try {
    DataByteArray a = (DataByteArray) input.get(0);
    HBaseBinaryConverter b = new HBaseBinaryConverter();
    // Interpret the first 8 bytes of the composite key as a long
    return Bytes.toLong(b.toBytes(a), 0, 8);
   } catch (IllegalArgumentException e) {
    System.err.println("...");
   }
  }
  // Null for missing or malformed input
  return null;
 }
}
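
Only the first-half conversion is shown above. A companion UDF for the second long would presumably differ only in the byte offset; here is a sketch of what it might look like, assuming the allan.myudf2 package name used in the DEFINE of the earlier HBase script (the class body itself is my assumption):

package allan.myudf2;

import java.io.IOException;

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.hadoop.hbase.HBaseBinaryConverter;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

// Hypothetical companion UDF: extracts the SECOND 8 bytes of the key
public class ConvertFirst extends EvalFunc<Long> {
 public Long exec(Tuple input) throws IOException {
  if (input != null && input.size() == 1) {
   try {
    DataByteArray a = (DataByteArray) input.get(0);
    HBaseBinaryConverter b = new HBaseBinaryConverter();
    // Offset 8 instead of 0 selects the second long of the 16-byte key
    return Bytes.toLong(b.toBytes(a), 8, 8);
   } catch (IllegalArgumentException e) {
    System.err.println("...");
   }
  }
  return null;
 }
}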
