Earlier software applications have been developed to run on single computer. Some examples are calculators, word processing packages, drawing applications etc. With the introduction of client server architecture vast amount of software systems were developed with databases. Most of web applications, business systems are built databases and concurrency, transaction handling are some new terms introduced with the architecture. Now world has moved to new computer era with concepts of high performance computing.
Think about scenarios like,
- Process thousands of face images using image processing algorithms.
- Count words of large numbers of books.
- In natural language processing calculating N-grams with training large corpus
- Buying behavior analyzing of large amount of customers
- Large video processing
I had a requirement of align large number of DNA sequences in a bioinformatics related project. When I used single compute to do the task I had to wait hours. Single machine cannot afford such processing task with memory and time. Powerful server can be a solution but cost again provides more issues with computation. Big data processing is the concept we used whenwe need vast amount of data to be processed in rapid manner. There are number of frameworks are available to write applications and run on production environment with big data processing. These frameworks are connected with computer clusters, parallel architectures and programming models and distributed file systems.
Some of key things are to be considered when using big data processing frameworks.
- You should have vast amount data. Millions of docs, images, sequences etc.Since big data processing runs on computer cluster it need some time to prepare cluster, share data and collect results on each node. So some initial configuration time is required. So when data set is not large this preparing time will be considerable regard to data set processing time.
- Tasks or task set should be independent. They can have shared data but some individual processing has to be done with data set. Then only we can use parallel processing.As an Example one face image processing is independent with other. But processing of first image need to start the second image and so on applicability of big data processing may be complex.
- Better cluster set up
Apache Hadoop
Apache Hadoop is a well known big data processing solution for large scale data processing. This post we will look at basics of Apache Hadoop. It is the framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Rather than depend on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named Map Reduce.
MapReduce
MapReduce is a programming model. Programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale. Programs will use two list operations called Mappers and Reducers.
Mapping Lists
Mapping is the first phase of a Map Reduce program. A list of data elements are provided, one at a time, to a function called the Mapper, which transforms each element independently to an output data element.
Reducing Lists
Reducing lets you aggregate values together. A reducer function receives iterate of input values from an input list. It then combines these values together, returning a single output value.
Figure shows the High-level Map Reduce architecture.
High-level Map Reduce
Architecture.
Map Reduce get the input files from cluster in Hadoop Distributed File System HDFS. Then input files are evenly spread all our nodes. Map Reduce program will be running mapping tasks on many or all of the nodes in our cluster. Mapping tasks are doing the same thing in different nodes.. Therefore, any Mapper can process any input file. Each Mapper loads the set of files local to that machine and processes them. When the mapping phase has completed outcome will be processes as key value pairs.
These pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the Mappers. This is the only communication step in Map Reduce. Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. Finally reducers will output the key value pairs collected from all nodes with summarised manner. I will describe this with real time example in next post.
HDFS is called Hadoop Distributed File System which is a distributed file system designed to hold very large amounts of data and provide high-throughput access to this information. A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways. (Ex. NFS, the Network File System, GFS Google file System). Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. When configuring HDFS two nodes to be considered. The Name Node stores all the metadata for the file system and Data node will store data. Name node will be run in master and
data nodes will be operated in slave nodes.
For more information,
http://developer.yahoo.com/hadoop/tutorial/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster
No comments :
Post a Comment