Hadoop (HDFS) | CompecTA - IT Consulting & Integration Services

Homepage Contact Add to Favorites EN | TR

Hadoop (HDFS)

HDFS, ZooKeeper, Hadoop, Free system software, Distributed file systems, Cloud computing, Cloud infrastructure.

The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files.

The HDFS does not provide high availability, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode's directory information, which is then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions, the edit log to create an up-to-date directory structure.

An advantage of using the HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. An example of this would be if node A contained data (x,y,z) and node B contained data (a,b,c). The jobtracker will schedule node B to perform map/reduce tasks on (a,b,c) and node A would be scheduled to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on the performance of job completion times, which has been demonstrated when running data intensive jobs.

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.

References:

Yahoo! Launches World's Largest Hadoop Production Application

Hadoop creator goes to Cloudera

The Hadoop Distributed File System: Architecture and Design

HADOOP-6704: add support for Parascale filesystem

Apple Embraces Hadoop

Related Solutions

Cloud Computing

Easy Access
Solutions&Products related pages
- Solutions
  - » HPC
  - » Cloud Computing
  - » Virtualization
  - » Benchmarking
  - Post-production Rendering
    - » Broadcasting
    - » Editing, VFX and Parallelization
- Products
  - File Systems
    - » Lustre
    - » Hadoop (HDFS)
    - » Panassas pNFS
    - » IBM GPFS
    - » Red Hat GFS
    - » SANBOLIC MelioFS
    - » HP Ibrix FS
  - » ScaleMP
  - » InfiniBand
  - » DDN SFA10000
  - » HP EVA P6000
  - » IBM DS5000 Series
  - » IBM Storwize V7000
  - » IBM DS3500
  - » Fujitsu Eternus DX400
  - HVCEM Tools
    - » xCAT
    - » HP CMU
    - » Platform ISF
    - » Platform LSF
    - » Platform Symphony
    - » Platform Cluster Manager
    - » MOAB Cluster Suite
    - » Moab Adaptive HPC Suite
    - » Moab Grid Suite
    - Open Source Products
      - » Torque
      - » Maui
  - » HPC Tools
  - » HPC Applications

E-mail Group
Join our e-mail group to hear the latest news and updates about Compecta.

Customer Login

tercüme bürosu

CompecTA Bilisim ve Iletisim Hizmetleri Sanayi Ticaret Ltd. Sti.
Yenişehir Mh. Barajyolu Cad. Özgür Sk. Tellioğlu Apt. No:7/4 Ataşehir/İSTANBUL Tel: 0216 455 18 65, Fax: 0216 455 39 88

Related Solutions:

References:

Related Solutions