Hadoop is built from a set of modules: HDFS, YARN, MapReduce, and Hadoop Common, plus related projects such as Oozie and Spark MLlib that are covered later in this article. Hadoop was originally developed by Doug Cutting and Mike Cafarella, and an image of an elephant remains the symbol for Hadoop. Hadoop works across clusters of commodity servers, so there needs to be a way to coordinate activity across that hardware. Hadoop can work with any distributed file system, but the Hadoop Distributed File System (HDFS) is the primary means for doing so and is the heart of Hadoop technology.
HDFS manages how data files are divided and stored across the cluster. Data is split into blocks, the blocks are spread across the servers in the cluster, and each block is replicated on more than one server for built-in redundancy.
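To make this concrete, here is a minimal sketch, assuming a file already stored in HDFS, that uses the Hadoop FileSystem API to list the blocks of a file and the servers holding each replica. The path /data/example.csv and the class name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists how HDFS has split one file into blocks and where the replicas live.
public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.csv");       // placeholder: any file in HDFS
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: its offset, length, and the DataNodes
        // holding its replicas (the built-in redundancy described above).
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}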
YARN (Yet Another Resource Negotiator), as the full name implies, helps manage resources across the cluster environment. It breaks up resource management, job scheduling, and job monitoring into separate daemons. Think of the ResourceManager as the final authority for assigning resources to all the applications in the system.
The NodeManagers are per-server agents that manage resources (e.g., CPU, memory, and network) and report to the ResourceManager. The ApplicationMaster is a per-application library that sits between the two: it negotiates resources with the ResourceManager and works with one or more NodeManagers to execute the tasks for which those resources were allocated.
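As an illustration of that division of labor, here is a minimal sketch that uses the YarnClient API to ask the ResourceManager how many NodeManagers have registered and what resources each one reports. It assumes a yarn-site.xml on the classpath and a reasonably recent Hadoop release (getMemorySize() is not available in older versions); the class name is a placeholder.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Queries the ResourceManager for cluster-wide resource information.
public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // assumes yarn-site.xml on the classpath
        yarn.start();

        // The ResourceManager is the final authority on cluster resources.
        System.out.println("Registered NodeManagers: "
                + yarn.getYarnClusterMetrics().getNumNodeManagers());

        // Each NodeReport reflects what one NodeManager reports: memory, vcores, node id.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s memory=%dMB vcores=%d%n",
                    node.getNodeId(),
                    node.getCapability().getMemorySize(),
                    node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}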
MapReduce provides a method for parallel processing on distributed servers. Before processing data, MapReduce converts large blocks into smaller data sets called tuples. Tuples, in turn, can be organized and processed according to their key-value pairs. The shorthand version of MapReduce is that it breaks big data blocks into smaller chunks that are easier to work with. Both map tasks and reduce tasks use worker nodes to carry out their functions. The JobTracker is the component of the MapReduce engine that manages how client applications submit MapReduce jobs.
It distributes work to TaskTracker nodes, and the TaskTracker attempts to assign processing as close to where the data resides as possible. (In Hadoop 2 and later, YARN's ResourceManager and per-application ApplicationMaster take over these scheduling duties.)
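To make the key-value flow concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The map step emits a (word, 1) pair for every word it sees; the reduce step sums the counts for each word. Class names are illustrative, and the input and output directories come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: turn each word in a line of text into a (word, 1) pair.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);               // emit (word, 1)
            }
        }
    }

    // Reduce task: sum the counts that arrived for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and submitted with the hadoop jar command, the job's map and reduce tasks are scheduled across the worker nodes by YARN (or by the JobTracker on older clusters).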
Common, which is also known as Hadoop Core, is a set of utilities that support the other Hadoop components. Common is intended to give the Hadoop framework ways to handle the hardware failures that are routine on commodity servers. Oozie is the workflow scheduler that was developed as part of the Apache Hadoop project. It manages how workflows start and execute, and it also controls the execution path.
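As a hedged sketch of how a client interacts with Oozie, the following program submits a workflow through the Oozie Java client. The server URL, the HDFS application path (which must already contain a workflow.xml), and the user name are all placeholder assumptions.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Submits a pre-defined workflow to an Oozie server and checks its status.
public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");  // placeholder URL

        // Job properties: where the workflow definition lives and who runs it.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/my-workflow");  // placeholder path
        conf.setProperty("user.name", "demo");                                    // placeholder user

        // Oozie starts the workflow and controls its execution path from here.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job id: " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}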
Beyond the core Hadoop modules, MLlib, Apache Spark's scalable machine learning library, is often used alongside Hadoop. MLlib can use any Hadoop data source (e.g., HDFS, HBase, or local files), making it easy to plug into Hadoop workflows. Spark excels at iterative computation, enabling MLlib to run fast. The Spark developers also care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and it can yield better results than the one-pass approximations sometimes used on MapReduce. Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it works against diverse data sources. MLlib is developed as part of the Apache Spark project, so it is tested and updated with each Spark release. If you have questions about the library, ask on the Spark mailing lists.
MLlib is still a rapidly growing project and welcomes contributions: if you would like to submit an algorithm to MLlib, read the guide on how to contribute to Spark and send in a patch. The Spark project summarizes MLlib's performance as high-quality algorithms running up to 100x faster than MapReduce, illustrated with a logistic regression benchmark comparing Hadoop and Spark.
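As a minimal sketch of that kind of workload, the following program trains a logistic regression model with Spark MLlib's DataFrame-based API. The HDFS path, the LIBSVM input format, and the hyperparameter values are placeholder assumptions.

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Trains a logistic regression model on data read from a Hadoop data source.
public class MLlibLogisticRegression {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MLlibLogisticRegression")
                .getOrCreate();

        // Any Hadoop data source works; here a LIBSVM file on HDFS (placeholder path).
        Dataset<Row> training = spark.read().format("libsvm")
                .load("hdfs:///data/sample_libsvm_data.txt");

        // Iterative optimization (up to 10 passes over the cached data) is where
        // Spark's in-memory computation pays off compared to MapReduce.
        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.01);

        LogisticRegressionModel model = lr.fit(training);
        System.out.println("Coefficients: " + model.coefficients());
        System.out.println("Intercept: " + model.intercept());

        spark.stop();
    }
}

Submitted with spark-submit against a YARN master, a job like this runs on the same cluster, and against the same HDFS data, as the other Hadoop workloads described above.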