Lustre is a massively parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau of Linux and cluster. Available under the GNU GPL, the project provides a high-performance file system for clusters of tens of thousands of nodes with petabytes of storage capacity.
Lustre file systems are used in computer clusters ranging from small workgroup installations to large-scale, multi-site deployments.
Lustre file systems can support tens of thousands of client systems, tens of petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Because of this scalability, organizations such as Internet service providers, financial institutions, and oil and gas companies deploy Lustre file systems in their data centers.
A Lustre file system has three major functional units (a minimal sketch of their division of labor follows this list):
A single metadata server (MDS) with a single metadata target (MDT) per Lustre filesystem. The MDT stores namespace metadata, such as filenames, directories, access permissions, and file layouts, in a single local disk filesystem.
One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs). Depending on the server’s hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
Client(s) that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
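To make this division of labor concrete, here is a minimal Python sketch; it is not Lustre code, and the class names (OST, MDT, LustreFS) and their fields are illustrative assumptions. It models the MDT as the sole home of namespace metadata and the filesystem capacity as the sum of the OST capacities, as described above.

from dataclasses import dataclass, field

@dataclass
class OST:
    """Object storage target: stores data objects on a local filesystem."""
    capacity_bytes: int

@dataclass
class MDT:
    """Metadata target: holds the namespace (names, permissions, layouts)."""
    namespace: dict = field(default_factory=dict)  # path -> file layout

class LustreFS:
    """Toy model: one MDT for metadata, many OSTs for data."""
    def __init__(self, osts):
        self.mdt = MDT()
        self.osts = osts

    @property
    def capacity_bytes(self):
        # The filesystem capacity is the sum of the OST capacities.
        return sum(ost.capacity_bytes for ost in self.osts)

# Four 8 TiB OSTs yield a 32 TiB filesystem.
fs = LustreFS([OST(8 * 2**40) for _ in range(4)])
print(fs.capacity_bytes == 32 * 2**40)  # True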
The MDT, OST, and client can be on the same node, but in typical installations these functions are on separate nodes communicating over a network. The Lustre Network (LNET) layer supports several network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet and other networks, Myrinet, Quadrics, and other proprietary network technologies. Lustre takes advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and normally formatted as ext4 file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use an enhanced version of ext4 called ldiskfs to store data. Work started in 2008 at Sun to port Lustre to Sun's ZFS/DMU for back-end data storage and continues as an open source project.
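The byte-range object interface that an OST exports can be pictured with a small Python sketch; the ObjectStore class and its read/write methods are hypothetical stand-ins, not the actual Lustre object protocol:

class ObjectStore:
    """Toy byte-range object interface of the kind an OST exports."""
    def __init__(self):
        self._objects = {}  # object_id -> bytearray

    def write(self, object_id, offset, data):
        buf = self._objects.setdefault(object_id, bytearray())
        if len(buf) < offset + len(data):
            # Grow the object with zero fill, then overwrite the range.
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, object_id, offset, length):
        return bytes(self._objects.get(object_id, b"")[offset:offset + length])

ost = ObjectStore()
ost.write(42, 4096, b"hello")
print(ost.read(42, 4096, 5))  # b'hello'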
When a client accesses a file, it performs a filename lookup on the MDS. As a result, either a new file is created on behalf of the client or the layout of an existing file is returned. For read or write operations, the client then interprets the layout in the logical object volume (LOV) layer, which maps the file offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. This approach eliminates bottlenecks in client-to-OST communication, so the total bandwidth available for clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
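The LOV offset-to-object mapping behaves like RAID-0 striping across objects. The following Python sketch is a simplification that assumes a plain round-robin layout with a fixed stripe size and stripe count (real Lustre layouts are set per file and can be richer); it shows how a byte range decomposes into per-OST extents that can be issued in parallel:

def map_extent(offset, size, stripe_size, stripe_count):
    """RAID-0-style mapping of a file byte range onto striped objects.

    Returns a list of (stripe_index, object_offset, length) tuples, one per
    contiguous piece; stripe_index selects which object/OST holds the piece.
    """
    extents = []
    end = offset + size
    while offset < end:
        stripe_no = offset // stripe_size          # which stripe of the file
        stripe_index = stripe_no % stripe_count    # which object/OST
        within = offset % stripe_size              # offset inside the stripe
        # Each object holds every stripe_count-th stripe of the file.
        obj_offset = (stripe_no // stripe_count) * stripe_size + within
        length = min(stripe_size - within, end - offset)
        extents.append((stripe_index, obj_offset, length))
        offset += length
    return extents

# A 3 MiB write at offset 0 with 1 MiB stripes over 2 objects touches
# object 0 twice and object 1 once; the pieces can go out in parallel.
print(map_extent(0, 3 * 2**20, 2**20, 2))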
Clients do not directly modify the objects on the OST filesystems; instead, they delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem, which increases the risk of filesystem corruption from misbehaving or defective clients.
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay while the backup server takes over the storage.
Lustre MDSes are configured as an active/passive pair, while OSSes are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.
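As a rough model of the active/active OSS arrangement, purely illustrative and not Lustre code, the sketch below pairs two servers so that each serves half of the OSTs in normal operation and the survivor takes over its partner's targets on failure:

# Hypothetical model of an active/active OSS pair: both servers do useful
# work in normal operation, and either can take over the other's OSTs.
class OSSPair:
    def __init__(self, osts_a, osts_b):
        self.serving = {"oss_a": list(osts_a), "oss_b": list(osts_b)}

    def fail(self, dead):
        survivor = "oss_b" if dead == "oss_a" else "oss_a"
        # The surviving server mounts the failed server's targets, so all
        # OSTs stay available; clients see a delay, not an outage.
        self.serving[survivor] += self.serving.pop(dead)
        return self.serving

pair = OSSPair(["ost0", "ost1"], ["ost2", "ost3"])
print(pair.fail("oss_a"))  # {'oss_b': ['ost2', 'ost3', 'ost0', 'ost1']}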