Improving Efficiency of GEO-Distributed Data Sets using Pact
Pages: 1284-1287
Abstract
In the Internet era, one report states that 2.5 quintillion bytes of data are created every day. This data comes from many sources, such as sensors that gather climate information, trajectory information, transaction records, and web site usage data. Such data is known as big data. Hadoop is scalable; that is, it can reliably store and process petabytes of data. Hadoop plays an important role in processing and handling big data. It includes MapReduce (an offline computing engine), HDFS (the Hadoop Distributed File System), and HBase (online data access). MapReduce works by dividing input files into chunks and processing them in a series of parallelizable steps; mapping and reducing constitute the essential phases of a MapReduce job. This framework provides a solution for large data nodes by providing a distributed environment. However, moving all input data to a single datacenter before processing it is expensive. Hence, we concentrate on geo-distributed data and on the sequential execution of MapReduce jobs over it, in order to optimize execution time. Various results show, however, that the map and reduce functions are not sufficient for all types of data processing. The fixed execution strategy of a MapReduce program is not optimal for many tasks, because the framework knows nothing about the behavior of the functions. To overcome these issues, we enhance our proposed work with parallelization contracts (PACTs). These contracts capture a reasonable amount of semantics for executing any type of task with reduced time consumption. The parallelization contracts comprise an input contract and an output contract, which include the constraints and functions of data execution. The main aim of this paper is to discuss the known MapReduce techniques available for geo-distributed data sets. The paper also describes the implementation of these techniques, their advantages and disadvantages, and the results measured.
Future trends include the use of query optimization techniques to improve query results and to reduce the cost of computation. To achieve this, we apply an indexing mechanism to the cache system to preserve query search results.
Keywords: Geo-distributed, MapReduce, PACT, big data
Article published in International Journal of Current Engineering and Technology, Vol. 4, No. 3 (June 2014)