I started my career as an application developer mostly having worked on OOP languages, databases, REST APIs, a little client side programming, etc. It was around that time that I heard about “Big Data” and “Hadoop”, but didn’t know much in detail. My journey into Big Data started when, in the project that I was working on, we started migrating to Hadoop in order to support the large volume of data for our client. In my first month, I did a few things that most people do when they start learning Big Data —
The above things help me to understand Big Data to some extent, but I didn’t get the whole picture. Here are a few reasons —
While working on big data projects, I observed couple of things that we need to focus —
While solving problem for silo-computing, we mostly focus on performant APIs, immutability, heap memory and few other things. But, for distributed computing, you also have to consider in-memory vs disk I/O, CPU usage and network bottleneck. At every stage from analysis to production deployment, we need to test and verify all above parameters for each and every small change.
Let’s say we want to do an average of 1000 numbers, we can implement it using any language and run it on single machine. But, what if the numbers are 100 billions? We cannot efficiently solve this using the earlier approach. There are mainly two parameters that we need to consider — the volume of data and parallel execution. We need to split the data and store it in multiple machines and want to run each split in parallel to get the answer in minimum time.
The Average Problem using map reduce — Find the average of 1 to 10 numbers. Numbers are partitioned into three machines.
As you may know, MapReduce is a two phase process: Map phase and Reduce phase. In map phase, the mapper will take any Hadoop directory as input, one line at a time as key-value pair. The goal of the mapper is to take key-value as the input, process it and emit the other key-value pair. Between the map and reduce phase, all the values emitted by mapper are grouped together according the key. Reduce phase will take one key and all associated values at a time and process it and emit the other key-value pair.
In the above diagram, all mappers are run in parallel and generates key as number of elements and value as sum of all the elements. Then, all the grouped values will be consumed by the reduce task and emits key as zero and value as tuple of number of element and its total respectively. The emitted pairs will pass to one more mapper and it will give the average of one to ten number.
Some points to remember:
Try thinking how one can solve problems like cumulative sum, transpose of matrix, top N most frequently used words etc. in distributed fashion.
The framework like Spark provides higher level abstraction such as RDD (Resilient Distributed Datasets). When you do any operation on RDD, behind the scene it will execute on each of the split available on different machine. That’s why it is more important to understand the internal working of each API that you are going use for any transformation. For example, word count can be computed using either reduceByKey or groupByKey. Both will give the correct answer, but reduceByKey is much better on large volume of data.
When I started exploring tools in Big Data space, I was confused because you will find many tools or frameworks to fulfill your single requirement. Each of them has its own set of pros and cons. Here is a snapshot of how the big data landscape looks like —
Some frameworks give more flexibility but don’t perform well on very large volume of data. Some tools are good in iterative processing and some tools are not. Some frameworks are good in writing the data and some frameworks are good in querying the data. Understand your use cases, volume of data that you want to process by that framework and the performance. Before finalizing any tool or framework, I would suggest to spike out the framework on test environment and measure the performance on production-like environment to get better understanding of the framework on large data set. If everything goes right, use it in production!
Thanks for reading. If you found this blog helpful, please recommend it and share it. Follow me for more articles on Big Data.
Ground Floor, Verse Building, 18 Brunswick Place, London, N1 6DZ
108 E 16th Street, New York, NY 10003
Join over 111,000 others and get access to exclusive content, job opportunities and more!