Big Data and related technologies

Dean Wampler – What’s Ahead for Big Data?

I just watched the talk above, a nice overview of some of the technologies around big data.

Big data is something of a buzzword, but the presentation defines it as

data so big that traditional solutions are too slow, too small, or too expensive to use.

So it involves a technological leap from traditional solutions, and the talk covered interesting topics that I didn’t know much about.

Notes

  • Recent trend: data sizes are increasing while schemas are becoming less formal, and more data-driven programs are appearing.
  • Hadoop, with its MapReduce backend, is the most popular solution in this field so far. However, converting an ordinary task into a MapReduce job is not trivial.
  • There are other solutions such as Spark. It can run 10-100x faster by avoiding the saving and reloading of intermediate data that Hadoop imposes.
  • NoSQL is much discussed these days, but SQL is not obsolete and is striking back with new engines such as Impala and Presto. These tools are attracting attention by exploiting SQL’s powerful and concise expressiveness. Also, some NoSQL databases (Cassandra, MongoDB) are adding query language features.
  • MapReduce is not suitable for real-time event processing. Storm addresses this area.
  • Search is one sub-domain of big data solutions. Lucene, together with Solr and Elasticsearch, which are built on it, covers this area.
  • On top of MapReduce, functional-style APIs and SQL-like query languages are appearing. The talk argues that big data is mathematics and that functional languages are the best tools for it.
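To make the map-reduce idea in the notes more concrete, here is a minimal single-process sketch of a word count split into map, shuffle, and reduce steps. It is plain Python, not Hadoop’s API, and the function names (map_phase, shuffle, reduce_phase, word_count) are my own for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse the values of one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Even for a task this small, the logic has to be cut into independent per-record and per-key functions, which hints at why converting an ordinary task is not trivial.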

Reference

After watching this presentation, I looked around the official sites of the software mentioned.

Spark (http://spark.incubator.apache.org/)

Spark provides in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Impala (http://impala.io/)

Impala provides a SQL-like query engine and is reported to be about 10x faster than the Hadoop-based Apache Hive. The following blog post describes in detail how Impala was ‘newly’ created.

http://vision.cloudera.com/impala-v-hive/

Presto (http://prestodb.io/)

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
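The appeal of SQL engines like these is that an aggregation is a single declarative statement. Below is a rough sketch using Python’s built-in sqlite3 module as a stand-in. It is not Impala or Presto, and the visits table, its columns, and the top_pages helper are hypothetical.

```python
import sqlite3


def top_pages(rows):
    # rows: iterable of (page, user) tuples; the schema is made up for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (page TEXT, user TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    query = """
        SELECT page, COUNT(*) AS hits
        FROM visits
        GROUP BY page
        ORDER BY hits DESC, page ASC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

The whole grouping and counting step is expressed in the four-line query; a distributed engine would plan and parallelize the equivalent statement across many machines.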

Storm (http://storm-project.net/)

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
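To illustrate the difference from batch processing, here is a small sketch of handling an unbounded stream one event at a time, updating the result as each event arrives instead of waiting for the full data set. This is only a single-process illustration with a Python generator, not Storm’s API, and running_counts is a name I made up.

```python
from collections import Counter


def running_counts(events):
    # Consume events lazily and yield the updated count after each one,
    # so a result is available without seeing the whole stream.
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]
```

Because the function is a generator, it also works on an iterator that never ends, which is the situation a realtime system has to deal with.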

Lucene (https://lucene.apache.org/core/) and Solr (http://lucene.apache.org/solr/)

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
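Full-text search libraries such as Lucene are built around an inverted index, a map from each term to the documents that contain it. The following is a toy sketch of that idea only; it uses none of Lucene’s API, skips real text analysis and scoring, and build_index and search are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    # docs: mapping of document id -> text. Returns term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    # Return the ids of documents that contain every term in the query.
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query is answered by intersecting a few small sets instead of scanning every document, which is what keeps search fast as the collection grows.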

HIVE, IMPALA AND PRESTO – THE WAR ON SQL OVER HADOOP

http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/

A nice comparison blog post.

Solr vs. ElasticSearch

http://stackoverflow.com/questions/10213009/solr-vs-elasticsearch

A discussion comparing the two.

Posted on January 16, 2014, in Conference, Web.