Hadoop Datatypes

Data Type Reference

Working with big data types can be a challenge. Integers, decimals, timestamps, and dates are not native in some services such as Parquet. The default data types can be difficult to translate with tools like Sqoop incremental saved as Parquet. If I were to use only Sqoop compressed with Snappy, the date and timestamp types would carry over to Hive. Once Parquet is in the picture, the date becomes a string and the timestamp becomes an Epoch time.

My goal is to bring data in as Parquet, due to space challenges, and increment records with most of the data already transformed. Developers in turn would not have to worry about converting Epoch into UTC, unix_timestamp, and timestamp casting.

The challenge is to properly reference Oracle data types into Parquet, Hive, etc., data types. Here is a table reference I use to load into memory and save it as a properties file for Java. Once loaded into an array, I can map Oracle and Parquet data types. The way I define a DECIMAL in Hadoop is by looking at the Oracle precision and scale; i.e., NUMBER(3,2). Perhaps there is better approach but for now this works well.

fields custom_type custom_format scala_type scala_format spark_type spark_format java_type java_format hive_type hive_format parquet_type parquet_format oracle_type oracle_format mysql_type mysql_format
ByteType Integer Byte ByteType BYTE TINYINT 10Y CHAR,NCHAR CHAR 255
ShortType Integer Short ShortType SHORT SMALLINT 10S
IntegerType Integer Int IntegerType INT INT 10 INT 32 NUMBER (=0,<10)
LongType Integer Long LongType LONG BIGINT 10L BIGINT 64 NUMBER (=0,<19)
FloatType DecimalType (0,0) Float FloatType FLOAT FLOAT FLOAT BINARY_FLOAT
DoubleType DecimalType (0,0) Double DoubleType DOUBLE DOUBLE DOUBLE BINARY_DOUBLE
DecimalType DecimalType (0,0) java.math.BigDecimal DecimalType DECIMAL DECIMAL NUMBER(0,0) (>0,>19)
StringType String String StringType STRING VARCHAR,CHAR (=1,<65355), (=255) STRING VARCHAR2,NVARCHAR2,BLOB,CLOB
BinaryType Array[Byte] BinaryType BINARY RAW
BooleanType Boolean BooleanType BOOLEAN BOOLEAN BOOLEAN
TimestampType timestamp (yyyy-MM-dd HH:mm:ss.F),(yyyy-MM-dd’T’HH:mm:ss.F’Z’) java.sql.Timestamp TimestampType TIMESTAMP (YYYY-MM-DD HH:MM:SS.fffffffff),(yyyy-mm-dd hh:mm:ss.ffffffffff) TIMESTAMP TIMESTAMP TIMESTAMPTZ |TIMESTAMPLTZ | | |
DateType timestamp (yyyy-MM-dd) java.sql.Date DateType TIMESTAMP DATE
ArrayType scala.collection.Seq ArrayType ARRAY ARRAY impala
MapType scala.collection.Map MapType MAP MAP impala
StructType org.apache.spark.sql.Row StructType (structFields) STRUCT STRUCT impala
StructField The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType) StructField (name,datatype,[nullable]) STRUCT STRUCT impala

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax