In the modern era, processing the huge volume of digital data held in repositories is challenging because of the variety of data formats and the extraction techniques available. Modern tools still have limits on the accuracy and speed of data processing over large networks, so quick results are difficult to obtain. The major problems in extracting data from a repository are locating the data and coping with dynamic changes to the existing data. Although many researchers have created tools and algorithms for processing data from warehouses, they have not delivered accurate results or low latency, largely because batch processing is carried out over large networks. Database scalability therefore has to be tuned with a powerful distributed framework and suitable programming languages so that the latest real-time applications can process huge datasets over the network. In big data analytics, such processing is done effectively with the modern tools Hadoop and Spark. Moreover, a recent programming language such as Python provides solutions based on the concepts of map-reduce and erasure coding, but it still faces challenges and limitations on huge datasets across network clusters. This review examines the features of Hadoop and Spark, together with their challenges and limitations under different criteria such as file size, file format, and scheduling technique. It surveys in detail the challenges and limitations that arise during the processing phase of big data analytics and offers solutions through the choice of languages and techniques in modern tools. The paper is intended to help researchers working in big data analytics improve the speed of data processing over digital data in huge repositories by selecting a proper algorithm.
Keywords: HADOOP; SPARK; scalability; batch processing; big-data
In this digital world, everyone generates a huge volume of information as data for real-world applications and needs. Every day, plenty of data is created in domains such as healthcare, retail, banking, industries, and companies [[
Fig. 1 Main V's used in Big Data.
To handle high-volume data storage, the classic scale-in database storage was developed first, but it was not sufficient. Later, scale-out concepts on commodity hardware were introduced by Hadoop [[
For parallel processing of data over networks, the Hadoop framework is used. Data processing between nodes depends on where the data are located and how they migrate, so aligning and arranging all the individual nodes to perform distributed data processing at the same time is complicated [[
In earlier days, files in a distributed network were sent through client-server architectures of limited size. When huge files are sent across the network, the latency is poor and the throughput and speed of the transfer are very low [[
Fig. 2 History of Hadoop.
Hadoop supports many data mining algorithms and methods for accessing data from huge datasets, with modern tools acting as a supporting system. Data collected from different sources and stored in a warehouse has to be controlled and monitored for data-flow access; this helps find a minimal or optimal solution to the time-consumption issues in the Hadoop framework. Nevertheless, data generation and extraction have to be monitored using one of the tools in the Hadoop ecosystem so that the required data reach the user on time across the clusters. Researchers normally find it difficult to optimize the network time of the ETL (Extraction, Transformation, and Loading) process because of the CAP (Consistency, Availability, Partition tolerance) theorem concepts [[
Table 1 Hadoop Eco System
Name | Sqoop | FLUME | HIVE | HBASE | PIG | R
Functions | Structured data | Collect the logs | SQL query | Store data in data warehouse | Latin programming | Refine data from the warehouse
Language | DML | Java | Java | Java | Latin | R
Database Model | RDBMS | NoSQL | JSON | JSON | NoSQL | RDBMS, NoSQL
Consistency Concepts | Yes | Yes | Yes | Yes | Yes | Yes
Concurrency | Yes | Yes | Yes | Yes | Yes | Yes
Durability | Yes | Yes | Yes | Yes | Yes | Yes
Replication | Default | Default | No | No | No | No
Storage Method | Local | HDFS | HDFS | HDFS | HDFS | HDFS
Hadoop Distributed File System (HDFS) consists of a Name Node (NN) and Data Nodes (DN) in a single-node or multi-node cluster setup. Classic Hadoop uses a Job Tracker on the name node and Task Trackers on the data nodes to track the flow of data access. The limitations of Hadoop led this architecture to a concept called replication: for each and every input job that completes, the output data are stored on three data nodes as replicas [[
Fig. 3 HDFS Architecture.
The Hadoop framework is used to provide parallel distributed database access with a basic Java programming paradigm [[
Hadoop 1.x is the basic version, built around two major components: MapReduce and HDFS storage. MapReduce is a programming model in which the input file is divided into a number of splits handled by map tasks and converted into key-value pairs. The combiner and reducer take these maps as input and reduce them according to the keys produced by the mappers. Finally, the reduced data are stored in HDFS. This makes HDFS a reliable, redundant storage system for a distributed database; it uses a default replication factor of 3 in a master-slave architecture, and the data nodes store input data in blocks of 64 MB.
Hadoop is master-slave architecture by nature; it is controlled by the Name Node (NN) as the master, and the remaining nodes connected to the Name Node are the Data Nodes (DN), the slaves. If the NN fails or is disconnected from the cluster, the entire system collapses. To handle this critical situation, the Name Node keeps a copy of its metadata on a different node called the Secondary Name Node (SNN) over the network, to satisfy the CAP theorem concepts. This additional feature is available in Hadoop 2.x under the name YARN (Yet Another Resource Negotiator). Here too the replication factor is 3, but the block size for input data storage is 128 MB [[
Table 2 Hadoop versions differences
HADOOP 1.x | HADOOP 2.x
4,000 nodes per cluster | 10,000 nodes per cluster
Job Tracker work is the bottleneck | YARN cluster is used
One namespace | Multiple namespaces in HDFS
Static maps and reducers | Not restricted
Only MapReduce jobs | Any application integrated with Hadoop
Works based on the number of tasks in a cluster | Works based on cluster size
Hadoop 3.x is the latest version of Apache Hadoop, developed to overcome the problems of the previous versions. The main problem in previous versions lies in the number of blocks allocated for input data. For example, if 6 blocks are needed to store the input data, 6 × 3 = 18 blocks are required once replication is included. The storage overhead is calculated as the extra blocks divided by the original blocks, multiplied by 100, which gives 200 percent. This extra memory allocation raises costs for businesses, so Hadoop 3.x introduced erasure coding [[
Fig. 4 Replication.
Fig. 5 Erasure Coding.
The above diagram describes erasure coding, a Hadoop 3.x feature. The three replicas of each block are replaced by parity blocks computed with an XOR-style function: for the same 6 input blocks, only 9 blocks are allocated for storage instead of 18, which means only 3 extra blocks. The storage overhead is therefore 3 divided by 6, multiplied by 100, which gives only 50%. This storage is also referred to as a Data Lake [[
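A minimal sketch of the overhead arithmetic described above, assuming a replication factor of 3 and a 6-data-block / 3-parity-block erasure-coding layout as in the example; the function names are illustrative, not part of Hadoop.

```python
def replication_overhead(data_blocks: int, replication: int = 3) -> float:
    """Extra blocks / original blocks * 100 for plain replication."""
    total = data_blocks * replication          # e.g. 6 * 3 = 18 blocks
    extra = total - data_blocks                # 12 extra blocks
    return extra / data_blocks * 100           # 200 %

def erasure_coding_overhead(data_blocks: int = 6, parity_blocks: int = 3) -> float:
    """Extra (parity) blocks / original blocks * 100 for erasure coding."""
    return parity_blocks / data_blocks * 100   # 3 / 6 * 100 = 50 %

if __name__ == "__main__":
    print(replication_overhead(6))     # 200.0 -> Hadoop 1.x/2.x replication
    print(erasure_coding_overhead())   # 50.0  -> Hadoop 3.x erasure coding
```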
One more feature of Hadoop 3.x is that the YARN architecture has changed slightly to adapt to the reduction of data blocks in HDFS. The Resource Manager allocates the jobs to the Node Managers, and execution is monitored by the Application Master. A container is a new feature that carries each node's resource request to the Application Master [[
Fig. 6 Hadoop 3.x YARN Architecture.
Many technical features changed in each version of Hadoop, improving the speed of data processing in big data analytics. Table 3 lists the technical features of these versions.
Table 3 Hadoop latest and previous version differences
Features | Hadoop 2.x | Hadoop 3.x
Java version | JDK 7 | JDK 8
Fault tolerance | Replication | Erasure coding
Data balancing | HDFS balancer | CLI disk balancer
Storage overhead | 200% | 50%
Data storage | Data skew | Data lake
YARN services | Scalability issues | V2 improves
Container | Delay due to feedback | Queue
Nodes per cluster | 10,000 | More than 10,000
Speed | Low | High
Single point of failure | Overcome automatically | No manual intervention
Heap size memory | Configured | Auto-tuning
Job monitoring | Resource Manager | Node Manager
Task monitoring | Resource Manager | Application Manager
Secondary Name Node support | Only one | More than 2
In Hadoop, many clients send their jobs for execution. These jobs are handled by the Job Tracker or Resource Manager. Two major versions are available, Hadoop 1.x and Hadoop 2.x, where x denotes the release/update. If Hadoop 1.x is used in the cluster, the tasks are controlled by the Job Tracker/Resource Manager. With Hadoop 2.x, a secondary node may be used as a replica of the Name Node for copying metadata [[
- i. FIFO
- ii. Capacity
- iii. FAIR
The following Table 4 [[
Table 4 Schedulers' drawbacks
Type of Scheduler | Pros | Cons | Remarks
FIFO | Effective implementation | Poor data locality | Static allocation
FAIR | Short response time | Unbalanced workload | Homogeneous system
CAPACITY | Uses unused capacity for jobs | Complex implementation | Homogeneous system, non-preemptive
Delay | Simple scheduling | Does not work in all situations | Homogeneous system, static
Matchmaking | Good data locality | More response time | Homogeneous system, static
LATE | Heterogeneity | Lack of reliability | Homogeneous system & heterogeneity
Deadline Constraint | Optimizes timing | Cost is high | Homogeneous system, heterogeneity, dynamic
Resource Aware | Cluster node monitoring | Extra time for monitoring | Homogeneous system, heterogeneity, dynamic
HPCA | High hit rate and redundancy | Cluster change state | Homogeneous system, heterogeneity, dynamic
Round Robin | Proper work completion | No priority is given | Homogeneous system, heterogeneity, dynamic
The data processing speed is improved using the Hadoop framework because of its features [[
Table 5 Hadoop Features
Features | Usage
Various data sources | Data can be taken from multiple networks and sources.
Availability | Hadoop has a replication feature: data stored on one node is replicated on three different nodes, so there is no availability problem.
Scalable | Many nodes can be connected in a cluster, as single-node or multi-node setups, anytime and anywhere.
Cost-effective | Hadoop is an open-source framework usable by all companies that create huge volumes of data dynamically.
Low network traffic | Traffic does not affect the data processing task because of the connectivity among cluster nodes.
High throughput | The MapReduce programming paradigm provides high throughput between the connected nodes through its divide-and-conquer job processing.
Compatibility | Hadoop accepts all operating system platforms, programming languages, and modern tools of the Hadoop ecosystem.
Multiple language support | Hadoop is suitable for object-oriented programming languages such as Java, Python, and Scala, and integrates effectively with Hadoop ecosystem tools.
Though Hadoop has many features for processing huge data in clusters, it has some drawbacks while executing tasks, because those features come with limitations when distributed data processing runs inside the clusters [[
- • When accessing small files [[28]], speed drops and memory allocation grows because of the default block size. Merging small files, Hadoop Archive (HAR) files, and HBase can be used to avoid this.
- • When big files are handled, the speed of retrieval is slow; they can be processed by the Spark framework.
- • Unstructured data processing suffers from latency because of the different file formats; this can be handled by Spark and Flink, with RDDs (Resilient Distributed Datasets) used for storage.
- • High-level data storage and network-level problems arise around security concerns [[29]] in larger networks; these can be addressed using HDFS ACLs for access control and YARN (Yet Another Resource Negotiator) as application manager.
- • Batch-wise input processing works, but real-time data access does not; tools like Spark and Flink are used to handle that.
- • Very large amounts of code (on the order of 1,20,000 lines) [[30]] cannot be handled, but with Spark and Flink this is possible.
- • It does not support repetitive computations or delta iterations, whereas Spark supports them all with its in-memory analytics technique.
- • No caching and abstraction features run in the Hadoop framework, whereas Spark provides them.
Hadoop is used to perform parallel distributed data processing across different clusters, but parallel processing among nodes brings several problems. Some bottlenecks affect the performance of Hadoop processing over the network. They are [[
- • CPU: all key CPU resources must be utilized properly for the map and reduce processes.
- • Main memory: the master-slave architecture runs in the data nodes' main memory (RAM).
- • Network: traffic and bandwidth problems arise when huge files are accessed.
- • I/O: throughput problems of input-output devices for data storage over the network.
Hadoop tuning problems in data processing are discussed below with solutions.
- ∘ A large volume of source data can be tuned by compressing the huge I/O input at the map stage [[32]] with the LZO/LZ4 codecs.
- ∘ Spilled records in the partition and sort phases use a circular memory buffer that can be sized with the formula
  Sort size (MB) = (16 + R) × N / 1,048,576
  where R is the average size of a map output record in bytes (each record also carries 16 bytes of metadata) and N is the number of map output records per map task (the total map output records divided by the number of map tasks); the buffer, which spills to mapred.local.dir, defaults to 100 MB.
- ∘ Network traffic on the map and reduce sides can be tuned by writing small snippets to enable or disable features in the map-reduce program and by choosing the replication factor (1, 3, 5, or 7 nodes) in single-node and multi-node cluster configurations.
- ∘ Insufficient parallel tasks [[33]] that leave resources idle are handled by adjusting the numbers of map and reduce tasks and their memory. By default, 2 map and reduce tasks with 1 CPU vcore and 1024 MB of memory are allocated. For example, with 8 CPU cores and 16 GB RAM on a Node Manager, 4 map and 2 reduce tasks with 1024 MB of memory each can be allocated, which leaves 2 CPU cores as a buffer for other work (a sketch of this arithmetic follows this list).
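As a rough illustration of the two tuning rules above, the following sketch computes the sort-buffer size from R and N and a simple map/reduce task split for a Node Manager; the numbers and helper names are illustrative assumptions, not Hadoop defaults read from any configuration file.

```python
def sort_buffer_mb(avg_record_bytes: float, records_per_map_task: int) -> float:
    """Sort size (MB) = (16 + R) * N / 1,048,576 for the circular spill buffer."""
    return (16 + avg_record_bytes) * records_per_map_task / 1_048_576

def split_tasks(cpu_cores: int, ram_gb: int, task_mem_mb: int = 1024,
                reserve_cores: int = 2):
    """Divide the remaining cores between map and reduce tasks (2:1 here),
    keeping each task within task_mem_mb of memory."""
    usable = max(cpu_cores - reserve_cores, 0)
    by_memory = (ram_gb * 1024) // task_mem_mb   # how many 1024 MB tasks fit in RAM
    usable = min(usable, by_memory)
    maps = (2 * usable) // 3
    reduces = usable - maps
    return maps, reduces

print(round(sort_buffer_mb(100, 1_000_000), 1))   # ~110.6 MB buffer needed
print(split_tasks(8, 16))                         # (4, 2) as in the example above
```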
The Hadoop framework runs the Java-based map-reduce model for data processing over huge dataset warehouses in real-time applications. Complicated real-world analysis problems can be solved easily by Hadoop as a low-cost open-source framework. Although data warehouse engines work effectively, the speed of data retrieval remains the major problem [[
MapReduce is an important programming model used in the Hadoop framework that accesses a high volume of data in parallel by disseminating the whole job into individual tasks, so that the input file can be processed by the map and reduce functions and the output can be minimized through compression [[
Map-reduce is used to access huge datasets stored in HDFS in parallel, and it plays a major role in increasing the velocity and reliability of the cluster. The latency and throughput of the entire system are determined by the time taken to complete the job.
Multiple phases work in the map-reduce programming model: huge files are divided into independent tasks, each of which works in parallel, and separate work is done in every stage of the map-reduce pipeline.
The MapReduce model works only on data stored in HDFS, because all operations in a Hadoop cluster are based on HDFS storage. So feeding the input from various sources into map-reduce through HDFS is the first step of MapReduce. According to the data size, the entire file is disseminated into individual tasks by a splitter. The record reader function changes the input text format into key-value pairs. The combiner takes care of matching keys, and partitions are made over the HDFS disk based on the file size. The partitions are stored as the intermediate data of the mapper function and handed to the next phase; misalignment here is the major problem that leads to latency or throughput issues. Shuffling of key-value pairs for each partition therefore runs on the HDFS disk. The next important process in MapReduce is sorting [[
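To make the phase ordering above concrete, here is a small, self-contained Python sketch that imitates split → map → shuffle/sort → reduce in memory; it only illustrates the data flow, not Hadoop's actual implementation.

```python
from collections import defaultdict
from itertools import chain

def split(text, parts=3):
    """Splitter: break the input into independent chunks (input splits)."""
    lines = text.splitlines()
    step = max(1, len(lines) // parts)
    return [lines[i:i + step] for i in range(0, len(lines), step)]

def map_phase(split_lines):
    """Record reader + mapper: emit (word, 1) key-value pairs."""
    return [(word.lower(), 1) for line in split_lines for word in line.split()]

def shuffle_sort(all_pairs):
    """Shuffle and sort: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reducer: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

text = "big data needs hadoop\nhadoop needs hdfs\nspark needs memory"
pairs = chain.from_iterable(map_phase(s) for s in split(text))
print(reduce_phase(shuffle_sort(pairs)))
# {'big': 1, 'data': 1, 'hadoop': 2, 'hdfs': 1, 'memory': 1, 'needs': 3, 'spark': 1}
```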
MapReduce is designed on the Java programming platform and works on a Hadoop cluster; the cluster may be single-node or multi-node, with master-slave architecture. The main problem of MapReduce is extracting data from a huge dataset within a stipulated time, which is often not achieved because of the size of the input data read from HDFS. The challenge in map-reduce is to minimize or optimize the whole volume of data into a compressed, low-volume output, but the time to complete that process is very high; in other words, latency is high and throughput is low. Normal data extraction from the data warehouse is also somewhat slower because of the patterns and algorithms used for processing [[
MapReduce runs as batch processing on the Hadoop cluster's input data, which means that once one input has been taken, the next input waits for the completion of the previous task. This is the most important problem in MapReduce, and it is accessed through iterations [[
Fig. 7 Data Sharing in Map Reduce.
The best-known example for MapReduce is the Java-based word count program on a Hadoop cluster. Initially, three sentences are taken as input and split into individual tasks (input splits). The mapping phase takes each task and converts the input split into keys and values, that is, the number of occurrences of each word is calculated. The keys are then shuffled and sorted alphabetically as the output of the mapper. The reducer collects those outputs, with the combiner aligning the key-value pairs, and finally counts the occurrences of each word across the three sentences and returns the result to the client or user. The final output is a compressed form of the input data, which determines the latency and throughput of the data processing. If the input file is small, say a few KB, map-reduce finishes within a few seconds; if it is in MB or GB, more map and reduce tasks are needed to perform the MapReduce function [[
Fig. 8 Word Count Example for Map Reduce.
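The same word-count idea can be expressed as a pair of Hadoop Streaming scripts in Python, the language the paper discusses as an alternative to pure Java; this is a hedged sketch of the usual mapper/reducer pattern, and the file names mapper.py and reducer.py are only illustrative.

```python
# mapper.py -- reads raw text from stdin, emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py -- receives pairs already sorted by key, sums counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such scripts are typically passed to the Hadoop Streaming jar as the mapper and reducer; Hadoop itself performs the shuffle and sort between them, which is exactly the stage the word-count description above calls shuffling and sorting.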
The MapReduce function is carried out in the Hadoop cluster by the Job Tracker and Task Tracker. The classic MapReduce version 1 (MRv1) works with these trackers, but the latest version, MRv2, runs on the YARN architecture, which provides tracking of the MapReduce job at every stage [[
Fig. 9 Map Reduce Version 1.
Fig. 10 Map Reduce Version 2.
MapReduce performance depends on several factors of the Hadoop framework and its features; it can be affected in terms of speed, latency, throughput, and time taken to complete the task. Several other factors that arise during data transmission in the Hadoop cluster also affect map-reduce [[
- a. Performance
- b. Programming model & Domain
- c. Configuration and automation
- d. Trends
- e. Memory
Initialization of Hadoop and MapReduce affects performance because of the techniques used in the overall data processing system. Hadoop 1.x gives only the output and cannot report the time to complete the task; Hadoop 2.x overcomes this issue and tracks the status of the job throughout the task. Finally, the latest Hadoop 3.x version adds the advanced MRv2 process for quick response over the network on the Hadoop cluster through its erasure coding techniques [[
Fig. 11 Performance issue 1.
Scheduling of jobs in map-reduce is an important concept in the Hadoop cluster. Clients continuously submit jobs to the Hadoop framework, and deciding the order in which jobs are taken for MapReduce is a typical process, so schedulers perform this work with the help of queues. Three main schedulers are available in Hadoop, namely FIFO, Capacity, and FAIR [[
Fig. 12 Performance issue.
Writing the map and reduce functions in a suitable programming language is essential for users. Various programming languages are supported by Hadoop for performing MapReduce operations, each being platform-dependent or platform-independent according to its characteristics. Some of the languages and formats that support the Hadoop ecosystem are SQL, NoSQL, Java, Python, Scala, and JSON [[
Fig. 13 Programming Model issue 1.
Fig. 14 Programming Model issue 2.
Self-tuning of the workload between nodes can be balanced by a load balancer in Hadoop, and controlling how the data flow is shared among the nodes adequately is a big challenge. If this fails, MapReduce will give poor output for the task. Minimizing input-output disk access is a major drawback in Hadoop MR for data that is accessed regularly; performance changes with the size of the input data and the methods used for splitting. Using fewer reducers may increase MR performance, and code written in a specific language can support static code generation [[
Data warehouse data are accessed by the database engine through map-reduce. But when the data size is very large, extracting a small amount of data from that engine becomes difficult, and the time taken to complete the process is very high. Processing directly in memory instead of on disk improves MR performance with respect to the I/O disks. Indexing [[
The MapReduce function depends entirely on the number of maps and reducers used for every task in the Hadoop cluster. If this number increases, the performance of the system immediately becomes very slow in terms of time taken to complete the task.
- • Calculation of the number of maps
  The number of maps assigned for every job submitted by a client is calculated from the size of the input file [[59]] and the blocks allocated for accessing those data. The following formula denotes the number of maps required for performing a MapReduce operation:
  (1) Number of maps = total input data size / block (input split) size
  By default, a minimum of 10 to 100 maps per node is assigned for the job, and a maximum of about 300 maps can be allocated for a MapReduce job. For example, with a 10 TB input file and the 128 MB block size used by Hadoop 2.x, 10 TB / 128 MB ≈ 82,000 maps are assigned to complete that job.
- • Calculation of the number of reduces
  Normally one reducer is allocated for every map-reduce job. If the number of reducers has to be increased for huge processes, the configuration file can be changed during installation or afterwards using speculative tasks. The following formula denotes the default number of reduces required for performing MapReduce operations (a hedged sketch of the map and reduce count arithmetic is given after this list).
  (2)
- • Skipping bad records
  Bad records created during the MapReduce process can be eliminated through the configuration files by enabling or disabling the relevant property. For example, in the Java word count MapReduce program, if only case-sensitive output is required, passing -Dwordcount.case.sensitive=true/false at run time gives better performance than before [[59]], because the bad records are eliminated by these settings.
- • Task execution & environment
  The Task Tracker in the data nodes keeps track of all information about the jobs and sends it to the YARN Resource Manager. But there is a limitation on these operations in terms of the memory allocated to map and reduce task execution; a command such as -Djava.library.path=... with -Xmx512M or -Xmx1024M executes the MapReduce environment [[60]] within that memory limit successfully. The following Tables 6 and 7 provide details of MapReduce implementation methods and their applications.
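A minimal sketch of the sizing arithmetic referenced above: the map count follows formula (1) (input size divided by block size), while the reduce count uses the commonly cited Hadoop heuristic of 0.95 or 1.75 × (nodes × reduce slots per node), which is an assumption here since formula (2) is not reproduced in the text.

```python
def num_maps(input_size_bytes: int, block_size_bytes: int = 128 * 1024**2) -> int:
    """Formula (1): one map task per input split (block); ceiling division."""
    return -(-input_size_bytes // block_size_bytes)

def num_reduces(nodes: int, reduce_slots_per_node: int, factor: float = 0.95) -> int:
    """Commonly cited heuristic (assumed): 0.95 or 1.75 x nodes x slots per node."""
    return max(1, int(factor * nodes * reduce_slots_per_node))

ten_tb = 10 * 1024**4
print(num_maps(ten_tb))       # 81920 maps, roughly the 82,000 quoted above
print(num_reduces(20, 2))     # 38 reducers for a 20-node cluster with 2 slots each
```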
Table 6 Map Reduce Implementations
Map Reduce Implementation Method | Advantages | Disadvantages
Google MapReduce | Multiple data blocks on different nodes to avoid fault-tolerance problems | Batch-processing-based architecture is not suitable for real-time applications
Hadoop | High scalability | Cluster maintenance is difficult
GridGain | Subtask distribution and load balancing | Does not support non-Java applications
Mars | Massive thread parallelism on GPU | Not for atomic operations; expensive
Tiled-MapReduce | Convergence and generalization | Cost is high
Phoenix | Multicore CPU | Scalability is less
Twister | Tools are used effectively | Not possible to break huge datasets
Table 7 Map Reduce Applications
Map Reduce Application | Pros | Cons
Distributed Grep | Generic data analysis | Less response time
Word Count | Occurrence counts over massive document collections | Limited only
TeraSort | Load balancing transparency | 
Inverted Index | Collection of unique posting lists | Lots of pairs in shuffling & sorting
Term Vector | Host analysis search | Sequential tasks
Random Forest | Scalability is high | Low
Extreme Learning Machine | Union and simplification | Uncertainty
Spark | Data fit in memory | Huge memory needed
Algorithms | Data-exhaustive applications | Time uncontrollable
DNA Fragment | Parallel algorithm | Large memory
Mobile sensor data | Extracting data is easy | Difficult to implement
Social Networks | Quick response | Need more techniques for analysis
The MapReduce job allocated by the Hadoop resource manager improves the speed and accuracy of data processing depending on the cluster configuration and the proper allocation of map and reduce tasks to their type of input data. Although LZO compression helps to compress the input file size, a combiner between mapper and reducer is a must for improving map-reduce job performance optimization [[
There are some other important aspects of the map-reduce programming model that provide solutions for improving map-reduce job performance in the Hadoop framework. All of these factors concern the flow of jobs from resource managers to data nodes and how data can deviate from that flow at run time. If these factors are rectified, even a big job running on the Hadoop cluster will give output with low latency. Figure 15 below lists the factors for job optimization.
Fig. 15 Data colocation.
This is mainly used in the map-reduce concept for aggregating databases, so that filtered data can be used for operations like grouping, sorting, and converting [[
The result of map-reduce is approximate in terms of size, time, and accuracy. Even though performance can be increased at run time, the output cannot be predicted: any file can be taken as the input format and map-reduce will produce an output, but that output may not be accurate or reliable in such cases.
Since map-reduce works with key-value pairs, it is complicated for the resource manager to align the order of the jobs; it allocates tasks to data nodes in a way that can rapidly cause conflicts [[
MapReduce is specially designed for handling multiple jobs in parallel. If multiple jobs run simultaneously, it is recommended to share those jobs across individual maps [[
Data used by the MapReduce function from HDFS storage can be reused for next-level changes on the same input file. Reusability [[
Skew mitigation is a main issue in map-reduce, solved by different techniques that avoid unnecessary data transmission. Classical skew-mitigation problems are solved using skew-resilient operators. With the repartitioning concept, skew mitigation can be handled in a big data environment using three major methods: minimizing the number of times any task is repartitioned, which reduces repartitioning overhead; minimizing repartitioning side effects, which removes ambiguity while stragglers are being handled; and finally avoiding unnecessary recomputations, which minimizes the total cost of making skew mitigation transparent [[
Files from the same location are collocated on the same set of nodes, a new concept based on a locator attribute in the file characteristics. When a new file is created, its location, the list of data nodes, and the number of files in the same case can be identified, and all those input files are stored on the same set of nodes automatically [[
The MapReduce function can be written in Java or other higher-level languages, and performance changes according to the features of the selected language. Table 8 narrates the differences between Java and Python when map-reduce code is written.
Table 8 Map Reduce written in java and python differences
Features | Java | Python
File size handling | <1 GB is easy | >1 GB is easy
Library files | All in JAR format | Separate library files
File extension | .java | .py
Method of calling | main | No main method
Data collections | Arrays, index | List, set, dictionary, tuples
Object-oriented | Required | Optional
Case sensitive | Required | Optional
Compilation | Easy on all platforms | Easy in Linux
Productivity | Less | More
Applications | Desktop, mobile, web | Analytics, mathematical calculations
Type of files | Batch processing, embedded applications | Real-time processing files also
Functions | Return 0 & 1 is used | Dict is used for return
Programming concepts | Less dynamic | Cannot push threads of a single processor to another
Syntax | Specific types | Simple
Basic programming | C, C++ basics (OOP) | Higher-end concepts like ML
Number of lines of code | High | Less code size
Input data format | Streaming with STDIN/STDOUT as binary, not text | Both binary and text
Working areas | Architecture, tester, developer, administrator | Analytics, manipulation, retrieval, visual reports, AI, neural networks
Speed | About 25 times greater than Python | Low due to interpreter
Execution time | High because of code length | Easy
Typing | Static | Dynamic
Verbose syntax | Low | Normal
Frameworks | Spring, Blade | Django, Flask
Gaming | jMonkeyEngine | Panda3D, Cocos
ML libraries | Weka, Mallet | TensorFlow, PyTorch
The Apache Spark framework is an open-source framework used for distributed cloud computing clusters. It works as a data processing engine and is meant to be faster than Hadoop MapReduce for data analytics. Though Hadoop provides big data analytics effectively, it has some drawbacks [[
- • In-memory processing: this technique captures data moving in and out of the disk and processes it without spending much time, so it obviously works faster than Hadoop, approximately 100 times better than MapReduce on Hadoop thanks to memory.
- • Stream processing: Spark supports stream processing, in which input and output data are accessed continuously. It is mainly used for real-time application data processing.
- • Latency: the Resilient Distributed Dataset (RDD) caches data in memory between the nodes of the cluster. RDDs manage logical partitions for distributed data processing and data format conversion; this is where Spark does most of its operations, such as transforming and managing the data. RDDs are used in logical partitions [[71]], which can be manipulated on the Hadoop cluster.
- • Lazy evaluation: computations are executed only when their results are actually needed; otherwise they remain idle (a sketch of this behaviour follows this list).
- • Fewer lines of code: Spark uses the Scala language and processes data with fewer lines of code compared to Hadoop.
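A small PySpark sketch of the RDD and lazy-evaluation points above, assuming a local pyspark installation; nothing is computed until an action such as collect() or count() is called, and the word-count logic itself takes only a few lines, in line with the "fewer lines of code" claim.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data needs hadoop", "spark keeps data in memory"])

# Transformations only build the lineage graph; nothing is computed yet.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())               # keep the result in memory for reuse

# Actions trigger the actual (lazy) evaluation.
print(counts.collect())                # [('big', 1), ('data', 2), ...] order may vary
print(counts.count())                  # reuses the cached partitions
spark.stop()
```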
Figure 16 and Fig. 17 explain the working principles of the Hadoop map-reduce engine and the Spark engine.
Fig. 16 Working of Hadoop Map Reduce.
Fig. 17 Working of Spark.
Many companies create terabytes of data through human- and machine-generated applications. Apache Spark is used to improve these companies' business insights [[
- • E-commerce: to improve consumer satisfaction in a competitive setting, a few industries are implementing Spark to handle this situation. They include:
  (a) eBay: discounts and offers for online purchases and other purchase transactions can be driven by Spark using real-time data. It provides updated status and consistency of data every second, so the customer relationship stays strong through feedback.
  (b) Alibaba: Alibaba analyzes big data and extracts image data using Spark as an implementation tool, working on large graphs to derive results.
- • Healthcare: MyFitnessPal, which is used to promote a healthier lifestyle through diet, scans the food calorie data of about 100 million users to assess the quality of the food system using Spark's in-memory processing techniques.
- • Media and entertainment: Netflix uses Apache Spark for video streaming to control and monitor its users in comparison with the shows they have watched earlier.
- • Standalone, Mesos, and cloud deployments are the places where Spark can run on Hadoop.
- • Machine learning algorithms can be executed faster in memory using Spark's MLlib, providing solutions that are not easily obtained with Hadoop MapReduce [[73]].
- • Cluster administration and data management can be done by combining Spark and Hadoop, because Spark does not have its own distributed file system (DFS).
- • Enhanced security can be provided by Hadoop for workloads, while Spark can be deployed on available resources across the cluster, so there is no manual allocation and tracking of individual tasks. For the above features, Spark is still used by big companies and industries working on real-world applications.
The Hadoop framework works on the principle of master-slave architecture, with a name node and data nodes using the replication principle. The output of each step in Hadoop is continuously stored in the HDFS cluster, so if a client needs to retrieve data from the database it is very easy to extract in Hadoop, because the framework keeps replicas of every job's output data on the HDFS cluster disks.
Spark is a distributed cluster framework whose processing engine handles data in the memory of the nodes. In-memory analytics is used in Spark, so the output of each step is stored in the memories of the nodes for the clients; for this it consumes a lot of memory for storage. One big advantage of Spark is frequent access to real-time applications. Since it is used for processing data generated online, streaming is its main use: plenty of data is generated online every second, and an engine of this kind is needed to maintain and access all that heavy storage. Spark therefore uses many memory units in the nodes along the network path, and the time to complete a job is also much lower with Spark [[
Fig. 18 Working difference between SPARK and Hadoop.
In general, the Spark framework is used to access real-time data with its in-memory analytics processing over a big network without delay or traffic. A normal Spark architecture consists of a software driver program that is typically written in the Scala language [[
The cluster manager monitors all this work and sits between the worker nodes and the driver program node. The Spark context is a small program written to carry out the data processing job on the nodes; the main difference lies in the memory storage part. Each worker node runs the tasks assigned by the cluster manager inside an executor module. Once the program is launched through the cluster manager, the executor module in the worker node reads the input data from HDFS and immediately stores the output in memory; if the client wants the intermediate data at any step of the execution, it is retrieved from there. Figure 19 clarifies the architecture of Spark, and a small configuration sketch follows it.
Fig. 19 Architecture of SPARK.
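A hedged sketch of how the driver, cluster manager, and executors described above are wired together from the driver program's side, again using PySpark; the master URL, executor memory, core counts, and HDFS path are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

# The driver program builds the SparkSession; the master URL tells it which
# cluster manager to contact (local[*], spark://host:7077, yarn, mesos, ...).
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("local[4]")                      # illustrative: 4 local worker threads
         .config("spark.executor.memory", "1g")   # memory given to each executor
         .config("spark.executor.cores", "2")     # cores per executor
         .getOrCreate())

# The executors read the input (e.g. from HDFS) and keep intermediate results
# in memory; persist() marks the dataset to be held by the executors.
df = spark.read.text("hdfs:///data/input.txt")    # illustrative HDFS path
cached = df.persist()
print(cached.count())                             # action executed by the executors
spark.stop()
```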
In the Hadoop Distributed File System, MapReduce processes data through mapper functions on the nodes of the cluster. The input file is disseminated into a number of tasks by the splitter, and each task works individually on the MapReduce operation. Every mapper output is collected as key-value pairs and stored in a circular buffer [[
Fig. 20 SPARK file system.
The features of the big data ecosystem tools are listed below in Table 9. There are plenty of differences between Hadoop and Spark; the experimental results of multi-node clusters are displayed in Table 10.
Table 9 Specifications of all tools
Features | Hadoop | SPARK | Flink | Storm | Kafka | Samza
Performance | Slower | 100 times faster than Hadoop | Closed-loop iteration | Fast | Fast | Fast
Language support | Java, Python | Scala, Python and R | Java, with APIs in Scala and Python | All languages | Best with Java, works with all languages | JVM languages
Processing | Batch | Stream & batch | Single stream | Native stream | Native stream | Native stream
Latency | High (minutes) | Low (seconds) | Low (sub-second) | Very low (ms) | Low (1-2 s) | Low (less than a second)
Security | Kerberos and ACL | Low, secured using only passwords | Kerberos | Kerberos | TLS, ACL, Kerberos, SASL | No security
Fault tolerance | High | Less | Snapshot method | High | High | High
Scalability | Large, 14,000 nodes | High, 8,000 nodes | High, 1,000 nodes | High | Average | Average
Table 10 Experimental results of multi-node cluster
Parameters | Hadoop record | SPARK record | FLINK record
Data size | 102.5 TB | 100 TB | >100 TB
Elapsed time | 72 min | 23 min | >23 min
Nodes | 2,100 | 206 | 190
Cores | 50,400 physical | 6,592 virtualized | 6,080 virtualized
Cluster throughput | 3,150 GB/s | 618 GB/s | 570 GB/s
Network | 10 Gbps | EC2 | >10 Gbps
Sort rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min
Sort rate per node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min
There are plenty of technical differences between Hadoop and Spark. Based on these results, anyone can decide which framework is better to select for processing their big data. Moreover, these technical differences carry a message for people who plan to start a computing company: they can select the framework that matches their requirements in all aspects. Table 11 summarizes the features of both Hadoop and Spark.
Table 11 Hadoop Vs SPARK
Features | Hadoop | SPARK
File processing method | Batch processing | Batch/real-time/iterative/graph processing
Programming language | Java, Python | Scala
Data storage type | Scale-out | Data lake or pool
Programming model | MapReduce | In-memory processing
Job scheduler | External | Not required
Cost | Low | High
RAM usage | Less | A lot of RAM
Memory type | Single memory | Separate execution & storage memory
Data size | Up to GB is fine | PB is fine
Data taken as input | Text, images, videos | RDD
Disk type | HDD (hard disk) | SSD (solid-state disk)
Network performance | Low | High
Speed rate | <3x | <3x with 1/10 of the nodes
Default algorithm | Divide and conquer | ALS (Alternating Least Squares)
Data location details | Index table | Abstraction using MLlib
Data hiding | Low | High, using function calls
Dataset size | Small set | Huge set, > TB
Shuffle speed | Low | High
Storage of mapper output | Directly on disk | RAM to disk
Container usage | Released after every map | Released only after the entire job completes
Dynamic allocation | Not possible | Possible but hectic
Replication | 1, 3, 5 nodes | Pipelines
Delay | High, due to assigning a JVM for each task | Low, due to quick launch
Mechanism for message passing | Parsing and JAR files | Remote Procedure Call (RPC)
Time taken to complete a job | Minutes for a small dataset | Hours for a big dataset
Allocating memory | Erasure coding | DAG (Directed Acyclic Graph)
Data input method | Hadoop Streaming | Spark Streaming
Data conversion formats | Text to binary | All forms
Job memory | Large | Low
Input memory | Less | High
Processing type | Parallel and distributed | Parallel and distributed
Data extraction | Disk-based | Memory-based
I/O processing | Disk | RAM
Resource usage | More | Less
Data status | Stateless | Stateful
Iterative process | Not taken | Taken
Caching | Does not support | Supported in RAM
R/W to HDFS | YARN cluster | Spark engine
Tools supported | Pig, Hive, HBase | All in one
Accessibility | Command user interface (CUI) | Graphical user interface (GUI)
Traceability | Easy via YARN | Not possible
Fault tolerance | High | Low
Security | High (tracking) | Low (no tracking)
Storage architecture | Distributed | Not distributed
Resource slots taken for data | Only one slot | Any slots (real time)
Time lag | Yes | No
Program written | MapReduce | Driver program
Controller | YARN | Cluster manager
Partition type | Single partition for all map outputs | Separate partition for every map output
Companies used | Industries and companies that do not need real-time data analytics: Cloudera, Hortonworks, IBM, British Airways, Facebook, Twitter, LinkedIn | Companies that need real-time data processing: Yahoo, eBay, Alibaba, Netflix, Oracle, Cisco, Verizon, Microsoft, Databricks, and Amazon
To summarize the contribution of this paper, the authors explain the challenges and limitations faced in modern tools like Hadoop and Spark for data processing through the following points:
I. The authors reviewed techniques from many research papers on tuning database performance as scalability increases; all of these papers discuss techniques for extracting data from huge repositories with low latency and high accuracy over large networks.
II. The authors wrote this review of Hadoop versions and their features for extracting data from repositories, together with the features and latest techniques of the Spark tool. A detailed review of how to select a tool for extraction, with its advantages and disadvantages, is given.
III. The authors suggest ways to improve the performance of database extraction from repositories and describe the difficulties faced by previous methods. Even though modern tools are used for data extraction, writing a map-reduce program in Hadoop with a recent algorithm is a challenging task, while Spark is an advanced tool whose cost is prohibitive for small-scale companies. The authors therefore give suggestions for improving performance with both tools.
Big data analytics is an important technology of this era, used to access huge datasets in parallel in a distributed cluster environment. Every software company decides how to deploy its software and hardware frameworks based on the requirements of its clients or users, and many start-up companies are confused about the infrastructure to build. This paper provides guidance for companies and research-oriented people in selecting a framework for rapid data processing. The basic factors of data processing projects, such as speed and cost, must be considered in all situations, and the technologies and examples above give a transparent view of how big companies build infrastructure to deal with real-world problems effectively. A million-dollar question in the software industry is whether real-world problems can be solved only by big industries or by those ready to invest more money; but other factors are also considered by different industries, and the main challenge is deriving value from large datasets with fewer resources. This paper covers the points needed to improve data processing velocity in big data analytics with the well-known frameworks Hadoop and Spark. The data generated day by day in the real world can be handled with the latest analytics algorithms, and processing huge volumes becomes possible by tuning already existing methods or trends; proper analysis and problem-finding capacity are needed to implement innovative solutions to real-world problems. Finally, when a user wants to solve a problem with big data analytics, Hadoop and Spark are the main frameworks that provide solutions, but the best one has to be chosen according to the user's requirements. For example, if a client wants to start a company with low investment but must handle a complex big data problem, Hadoop is the best choice because of its cost and the type of data it handles. If the same company needs to handle real-world application data and is ready to make a huge investment, Spark is obviously the better tool. Considering technical aspects such as algorithms and methodology, both tools use some common techniques, but the final decision is usually taken based on cost and the type of data handled. For everyone dealing with big data analytics, Hadoop MapReduce is suitable for low-cost batch processing, whereas Spark is apt for real-time processing as a higher-cost tool.
There are plenty of tools available for handling big data in the IT world, but only a limited number are popular among companies and industries because of their user-friendliness or cost. Hadoop and Spark are tools used for very high-speed data processing, depending on various factors. How long will these tools rule the world with their updated versions and techniques? New Apache tools such as Flume, Flink, and Kafka [[
By M.R. Sundarakumar; G. Mahadevan; R. Natchadalingam; G. Karthikeyan; J. Ashok; J. Samuel Manoharan; V. Sathya and P. Velmurugadass