deepseek-ai

3FS

#Infra #Distributed Systems #Storage #AI #NVMe #RDMA

9,806

// summary

The Fire-Flyer File System (3FS) is a high-performance distributed file system built for the I/O demands of AI training and inference workloads. It combines a disaggregated architecture with RDMA networking and SSDs, and gives distributed applications strong consistency behind familiar file interfaces. Typical uses include large-scale data preparation, efficient dataset loading, and high-throughput checkpointing.

// technical analysis

3FS decouples storage from compute. It pools modern SSDs behind RDMA networks into a shared, high-throughput storage layer that applications can access without regard to data locality. Applications use standard file interfaces, while file metadata is kept in a transactional key-value store. Replicas are kept strongly consistent through Chain Replication with Apportioned Queries (CRAQ): writes travel down a chain of replicas and are committed at the tail, while reads can be served by any replica in the chain.
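To make the CRAQ idea concrete, here is a minimal in-memory sketch of the protocol as it is generally described. It is a toy model, not 3FS code: the `Replica` and `Chain` classes, their method names, and the split of a write into `begin_write` and `finish_write` are all assumptions made for illustration.

```python
class Replica:
    """One storage target in a chain (toy, in-memory)."""

    def __init__(self):
        self.versions = {}   # key -> {version: value}
        self.committed = {}  # key -> newest version this replica knows is committed

    def store(self, key, version, value):
        self.versions.setdefault(key, {})[version] = value

    def mark_committed(self, key, version):
        self.committed[key] = version
        # Older versions are no longer needed once a newer one is committed.
        kept = {v: val for v, val in self.versions[key].items() if v >= version}
        self.versions[key] = kept


class Chain:
    """Hypothetical chain of replicas: index 0 is the head, index -1 is the tail."""

    def __init__(self, length=3):
        self.replicas = [Replica() for _ in range(length)]
        self.last_version = {}  # key -> last version number handed out
        self.in_flight = {}     # (key, version) -> value not yet at the tail

    def begin_write(self, key, value):
        """Propagate a new (still uncommitted) version to every replica but the tail."""
        version = self.last_version.get(key, 0) + 1
        self.last_version[key] = version
        self.in_flight[(key, version)] = value
        for replica in self.replicas[:-1]:
            replica.store(key, version, value)
        return version

    def finish_write(self, key, version):
        """The write reaches the tail, which commits; the ack flows back to the head."""
        value = self.in_flight.pop((key, version))
        self.replicas[-1].store(key, version, value)
        for replica in reversed(self.replicas):
            replica.mark_committed(key, version)

    def read(self, key, replica_index):
        """Serve a read from any replica without exposing uncommitted data."""
        replica = self.replicas[replica_index]
        versions = replica.versions.get(key)
        if not versions:
            return None
        newest = max(versions)
        if newest <= replica.committed.get(key, 0):
            return versions[newest]  # clean: answer locally
        # Dirty: ask the tail which version is committed and return that one.
        committed = self.replicas[-1].committed.get(key, 0)
        return versions.get(committed)


if __name__ == "__main__":
    chain = Chain(length=3)
    v1 = chain.begin_write("chunk-0", "a")
    chain.finish_write("chunk-0", v1)
    v2 = chain.begin_write("chunk-0", "b")   # in flight, not yet committed
    print([chain.read("chunk-0", i) for i in range(3)])  # ['a', 'a', 'a']
    chain.finish_write("chunk-0", v2)
    print([chain.read("chunk-0", i) for i in range(3)])  # ['b', 'b', 'b']
```

The point of the sketch is the read path: every replica can answer reads, yet no reader sees a value that the tail has not committed.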

// key highlights

Uses a disaggregated architecture to combine the throughput of thousands of SSDs with the network bandwidth of hundreds of storage nodes, so applications can access storage without regard to data locality.

Implements Chain Replication with Apportioned Queries (CRAQ) to provide strong consistency, which keeps application code simple and easy to reason about.

Provides standard file interfaces backed by transactional key-value stores, allowing developers to use familiar APIs without learning new storage protocols.

Enables efficient data loading by supporting random access to training samples across compute nodes, eliminating the need for manual prefetching or shuffling.

Supports high-throughput parallel checkpointing, so large-scale training jobs can save their state frequently and resume from it after a failure.

Offers a cost-effective alternative to DRAM-based KVCache for LLM inference, with high throughput and significantly larger capacity.
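The random-access data loading highlight can be sketched with nothing but standard file calls, which is the point of exposing a familiar file interface. The sketch below assumes a hypothetical dataset layout of fixed-size records in one file; `RECORD_SIZE`, `write_records`, `read_sample`, and `sample_batch` are illustrative names, and the demo uses a local temporary file rather than a real 3FS mount.

```python
import os
import random
import tempfile

RECORD_SIZE = 16  # assumption: every sample is a fixed-size record


def write_records(path, count):
    """Create a toy dataset file of `count` fixed-size records."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(f"sample-{i:09d}".encode())  # exactly 16 bytes


def read_sample(fd, index, record_size=RECORD_SIZE):
    """Read one record by index with a positioned read (no seek, no prefetch)."""
    data = os.pread(fd, record_size, index * record_size)
    if len(data) != record_size:
        raise IndexError(f"sample {index} is out of range")
    return data


def sample_batch(path, indices, record_size=RECORD_SIZE):
    """Fetch an arbitrary set of samples in the order requested."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return [read_sample(fd, i, record_size) for i in indices]
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "samples.bin")
        write_records(path, 1000)
        # Each worker can draw its own random indices; the file is never reordered.
        indices = random.sample(range(1000), k=4)
        for record in sample_batch(path, indices):
            print(record.decode())
```

On a shared file system, every compute node could run the same code against the same path with its own indices, which is why no shuffled copy of the dataset is needed.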

// use cases

High-throughput parallel checkpointing for large-scale AI training

Efficient data preparation and management for analytics pipelines

Cost-effective KVCache storage for LLM inference optimization
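As an illustration of the parallel checkpointing use case, the following sketch has each worker write its own shard file into a shared directory at the same time. It is a generic pattern under stated assumptions, not the 3FS API: the shard naming scheme and the `write_shard`, `save_checkpoint`, and `load_checkpoint` helpers are hypothetical, threads stand in for training ranks, and the write-then-rename step assumes POSIX rename semantics, which this sketch does not verify for 3FS.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_shard(ckpt_dir, rank, payload):
    """Write one rank's shard to a temporary name, then rename it into place."""
    final_path = os.path.join(ckpt_dir, f"shard-{rank:05d}.bin")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are durable before the rename
    os.replace(tmp_path, final_path)
    return final_path


def save_checkpoint(ckpt_dir, shards, workers=8):
    """Write all shards concurrently; `shards` maps rank -> bytes."""
    os.makedirs(ckpt_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_shard, ckpt_dir, rank, payload)
            for rank, payload in sorted(shards.items())
        ]
        return [future.result() for future in futures]


def load_checkpoint(ckpt_dir):
    """Read back every completed shard, keyed by rank."""
    shards = {}
    for name in sorted(os.listdir(ckpt_dir)):
        if name.startswith("shard-") and name.endswith(".bin"):
            rank = int(name[len("shard-"):-len(".bin")])
            with open(os.path.join(ckpt_dir, name), "rb") as f:
                shards[rank] = f.read()
    return shards


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = os.path.join(tmp, "step-000100")
        state = {rank: bytes([rank]) * 1024 for rank in range(4)}
        save_checkpoint(ckpt_dir, state)
        print(load_checkpoint(ckpt_dir) == state)
```

Writing one file per rank keeps the writers independent, so aggregate checkpoint bandwidth can scale with the number of writers and storage nodes rather than being limited by a single stream.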

// getting started

To begin using 3FS, clone the repository, initialize its Git submodules, and run the provided patch script. Install the system dependencies for your Linux distribution (Ubuntu, openEuler, or OpenCloudOS), make sure FoundationDB and a Rust toolchain are installed, and build the project with CMake. Finally, follow the documentation in the deploy directory to set up and run a test cluster.
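Before starting a build, it can help to confirm that the tools named above are actually on the PATH. The following is a small preflight sketch, not part of the 3FS repository: the tool list is an assumption derived from the prerequisites mentioned here (`cargo` standing in for Rust, `fdbcli` for the FoundationDB client), and `check_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil

# Assumption: these executables represent the prerequisites named in the text.
REQUIRED_TOOLS = ("git", "cmake", "cargo", "fdbcli")


def check_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: which(tool) for tool in tools}


def missing_tools(found):
    """Return the sorted names of tools that could not be located."""
    return sorted(tool for tool, path in found.items() if path is None)


if __name__ == "__main__":
    found = check_tools()
    for tool, path in found.items():
        print(f"{tool:8s} {path or 'NOT FOUND'}")
    missing = missing_tools(found)
    if missing:
        raise SystemExit("Install before building: " + ", ".join(missing))
```

The check only confirms that the executables exist; required versions and distribution-specific packages are listed in the project's own build instructions.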