Big Data and related technologies
Just watched the nice explanation above about some of the technologies around big data.
"Big data" is something of a buzzword, but the presentation defines it as
data so big that traditional solutions are too slow, too small, or too expensive to use.
It involves a technological leap from traditional solutions, and the presentation covers several interesting topics that I didn’t know much about.
- Recent trend: data volume is growing while schemas are becoming less formal, and more data-driven programs are appearing.
- Hadoop, with its MapReduce backend, is the most popular solution in this field so far. However, converting an ordinary task into a MapReduce job is not trivial.
- There are other solutions such as Spark. It can run 10-100x faster by avoiding the saving and reloading of intermediate data that Hadoop imposes.
- NoSQL has been discussed a lot recently, but SQL is not obsolete and is now striking back with new frameworks like Impala and Presto. These tools are gaining attention by taking advantage of SQL’s powerful and concise expressiveness. Also, some NoSQL databases (Cassandra, MongoDB) are adding query language features.
- MapReduce is not suitable for real-time event processing. Storm covers this area.
- Search is one sub-domain of big data solutions. Lucene, together with Solr and Elasticsearch, covers this area.
- On top of MapReduce, functional-style expressions and SQL-like query languages are appearing. Big data is mathematics, and functional languages are the best tools.
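To get a feel for why converting an ordinary task into MapReduce form takes some thought, here is a minimal sketch of word counting split into map, shuffle, and reduce steps. It is plain single-process Python, not Hadoop code, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are my own for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Even for this simple task, the logic has to be reshaped into independent per-line and per-key steps, which hints at why less regular tasks are harder to convert. The functional flavor (map, then reduce over grouped values) is also visible here.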
After watching this presentation, I’ve looked around the official sites of the software mentioned.
Spark provides in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
Impala provides an SQL-like query engine, and offers 10x faster performance compared with the Hadoop-based Apache Hive. The following blog post describes the details of how Impala was ‘newly’ created.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
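As a rough illustration of what processing an "unbounded stream" means, the sketch below updates a result after every event instead of waiting for a complete batch. It is plain Python with generators, not the Storm API, and `event_stream` is a hypothetical source I made up for the example.

```python
from collections import Counter
from itertools import islice


def event_stream():
    """Hypothetical unbounded source: cycles through a few event types forever."""
    kinds = ["click", "view", "click", "purchase"]
    i = 0
    while True:
        yield kinds[i % len(kinds)]
        i += 1


def running_counts(events):
    """Yield an updated snapshot of the counts after every single event."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield dict(counts)


if __name__ == "__main__":
    # The stream never ends, so only look at the first five snapshots.
    for snapshot in islice(running_counts(event_stream()), 5):
        print(snapshot)
```

A batch job would need the whole input before producing any count; here a result is available after each event, which is the property that makes this style a fit for real-time use.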
Lucene (https://lucene.apache.org/core/) and Solr (http://lucene.apache.org/solr/)
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
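Full-text search engines like Lucene are built around an inverted index, which maps each term to the documents containing it. The following is only a toy sketch of that idea in plain Python, not Lucene's API; `build_index` and `search` are names I picked for illustration, and real engines add tokenization, ranking, and on-disk storage on top of this.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Hadoop batch processing",
        2: "Storm realtime processing",
        3: "Lucene text search",
    }
    index = build_index(docs)
    print(search(index, "realtime processing"))
```

Because the lookup goes from word to documents, a query only touches the entries for its own terms rather than scanning every document.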
HIVE, IMPALA AND PRESTO – THE WAR ON SQL OVER HADOOP
A nice comparison blog post.
Solr vs. ElasticSearch
A discussion comparing the two.