DAFT: Disk Geometry-Aware File System Traversal
Fanglu Guo Tzi-cker Chiueh
Symantec Research Labs
Culver City, CA 90230
Abstract--Bulk file access is a read access to a large number of files in a file system. Example applications that use bulk file access extensively are anti-virus (AV) scanners, file-level data back-up agents, file system defragmentation tools, etc. This paper describes the design, implementation, and evaluation of an optimization to modern file systems that is designed to improve the read efficiency of bulk file accesses. The resulting scheme, called DAFT (Disk geometry-Aware File system Traversal), provides a bulk file access application with individual files while fetching these files into memory in a way that respects the disk geometry and thus is as efficient as it can be. We have successfully implemented a fully operational DAFT prototype, and tested it with commercial AV scanners and data back-up agents. Empirical measurements on this prototype demonstrate that it can reduce the elapsed time of enumerating all files in a file system by a factor of 5 to 15 for both fragmented and non-fragmented file systems on fast and slow disks.

I. INTRODUCTION

A file system provides the abstractions of file, file attribute, and directory/folder on top of raw disk blocks to simplify the design and implementation of applications accessing data stored on disk. However, this simplification sometimes carries a significant performance price, because the existence of a file system layer insulates these applications from the disk and prevents them from accessing their disk-resident data in the most efficient way. For example, many applications require the ability to access every file in a file system or in a selective set of directories in a file system. A natural way to develop such applications is to go through every file in the target list, open it, and read in its contents. Depending on the order in which the files in the target list are traversed, the end-to-end elapsed time required to access them can vary significantly. Conventional file systems are not designed to optimize for such bulk file access patterns because (1) they don't provide an API for applications to express their intention to access a target file set, which could be a set of directories or an entire file system, and (2) they lack a multi-file disk scheduling mechanism to retrieve the requested file blocks in a way most friendly to the disk geometry.

One example application with a bulk file access pattern is a file-level data back-up application that retrieves individual files on a host's file system and transfers them to a back-up media server for persistent storage. To overcome the poor performance of file-level data back-up, an alternative back-up method, block-level data back-up, was invented. Block-level data back-up directly reads the disk volume underlying a file system and bypasses the file system layer completely. Because this approach accesses disk volumes sequentially, its performance is excellent. However, because block-level data back-up cannot capture per-file information, it cannot provide several important benefits associated with file-level data back-up, such as more restore flexibility [1], better deduplication efficiency [2], etc.

As a result, despite being slower, file-level data back-up is still commonly supported in many commercial back-up applications, which in turn require fast bulk file access. Another example application with a bulk file access pattern is the AV (Anti-Virus) scanner. When an AV scanner performs a full-system scan, it needs to retrieve each file and scan the file content to determine if the file is infected. Because AV scanners need to parse individual files, e.g., PE executable files, MPEG files, etc., it is impossible for them to operate directly on a raw disk volume.

This paper describes the design, implementation and evaluation of a file system extension for Microsoft's NTFS file system called DAFT (Disk geometry-Aware File system Traversal), which is specifically designed to speed up bulk file accesses. The goal of the DAFT project is to develop a bulk file access mechanism that provides the flexibility of file-level bulk file access and the performance benefit of block-level access, thus achieving the best of both worlds. Through a bulk file access API, an application specifies a set of files that it needs. DAFT then delivers each individual file to the application by reading the file's content directly from the disk. As a result, DAFT is able to make the best use of the raw disk transfer bandwidth while returning the target file set on a file-by-file basis.

More concretely, given a target file set, DAFT first accesses the file system's metadata to identify the disk location of every file fragment to be accessed, sorts these disk locations in increasing order, and fetches the requested file fragments according to this order. Therefore, only one sequential pass through the disk is needed to retrieve the target file set. As file fragments are brought into memory, DAFT associates them with their corresponding files, completes the assembly of a file as soon as all of its file fragments become available in memory, and delivers each completely assembled file to the requesting application. Consequently, DAFT is able to enumerate individual files in an application-specified file set with the same efficiency as block-level bulk file access. At the same time, the application can still operate on each individual file for the purpose of AV scanning or file-level data back-up.

We have successfully implemented the first DAFT prototype on Microsoft's NTFS file system. Compared with the original version of the NTFS file system, DAFT is able to improve the sustained throughput of bulk file access by a factor of 5 to 15, depending on the degree of fragmentation in the test file systems and the speed of the underlying disks.
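To make the baseline concrete, here is a minimal sketch (ours, not code from the paper) of the "natural" per-file traversal described above: files are visited in directory order, so the resulting disk access order is dictated by the directory tree rather than by the on-disk layout.

```python
import os

def vanilla_traversal(root):
    """Naive bulk file access: visit files in directory order,
    opening and reading each one individually. The read order is
    dictated by the directory tree, not by the disk layout, so on a
    fragmented volume each open/read can trigger a long seek."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    data = f.read()   # one essentially random disk access per file
            except OSError:
                continue              # skip files we cannot read
            total += len(data)        # stand-in for AV scanning / back-up work
    return total
```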
[Figure 1 depicts four files and their fragments (f1.1-f1.3 of file1, f2.1-f2.2 of file2, f3.1-f3.2 of file3, and f4.1 of file4) laid out across the disk.]
Fig. 1. An example to show how DAFT works when the file is fragmented. Each file may have several fragments, and the fragment location in the figure represents where the fragments are physically on the disk. The dark areas on the disk represent blocks not processed by DAFT.

II. DAFT DESIGN

In this section, we describe the design of DAFT in the context of Microsoft's NTFS file system.

A. Bulk File Access API

DAFT provides a new API, BulkFileRead(FileList, CallBackFunction), for a bulk file access application to specify a list of files or directories that it intends to access, and a call-back function that DAFT should call when it brings each requested file in the list into memory. FileList could be a list of directories, in which case it represents all the files recursively under these directories. Through this API, a bulk file access application indicates to DAFT that the order in which files are brought into memory is unimportant and is thus left to DAFT for performance optimization. In addition, this API dictates that the control flow of the calling application be structured as data-driven, i.e., the processing logic associated with a file, e.g., AV scanning or data back-up, is triggered only upon the file's arrival in memory.

B. How DAFT Works

DAFT divides the file access into two phases. In the first phase, all the files' metadata are brought into memory and analyzed. In the second phase, DAFT accesses the files' data according to the analysis result from the metadata. In this two-phase approach, accesses to both file metadata and file data are sequential.

In the first phase, DAFT determines the disk location of every fragment of every file in the target file set. On NTFS, a file's fragment information is described by one or more MFT records. An NTFS file system's MFT records are stored in a special file called $MFT, which is usually stored in a reserved disk region and is itself rarely fragmented. DAFT reads the $MFT file into memory, parses the MFT records, and puts the file fragment location information into a fragment list.

The total memory requirement for file fragment disk location information is relatively modest, because most files have a small number of fragments. It costs DAFT 20 bytes to maintain each file fragment's disk location: 8 bytes for the fragment's location, 4 bytes for the fragment's length, 4 bytes for the fragment's offset in the file, and 4 bytes for pointing back to the file to which the fragment belongs. For a 150,000-file NTFS system, the total amount of memory required to record its file fragments' disk location information is about 10 Mbytes, which suggests the average number of file fragments per file is smaller than 4.

When the number of files in FileList is so large that the file fragment information cannot fit in memory, we can divide the files into several small subsets and process one subset at a time.

In the second phase, DAFT first sorts the fragments of the files in the target set according to their disk location, and then reads them into memory in the sorted order. Figure 1 shows the physical location of several file fragments on the disk. DAFT first reads in the first fragment of file2, fragment f2.1, and then fragment f1.1. DAFT ignores the file boundary, and does not need to fetch a file completely before reading in fragments of another file. When fragments are sufficiently close to each other, they can be merged into one disk read request. For example, f2.1 and f1.1 in Figure 1 can be fetched in one physical disk access because they are sufficiently close to each other. The version of DAFT with this optimization is called DAFTM (DAFT with merging of adjacent fragments), and it greatly improves DAFT's disk access efficiency, especially on disks that do not support automatic coalescing of physical disk access requests.

When the first fragment of a file is read into memory, DAFT allocates memory for the entire file, not just for that file fragment. When subsequent fragments of the file are read into memory, they are stored at the corresponding locations in the file's allocated memory. When all fragments of a file have been read into memory, DAFT invokes the application-specific call-back function CallBackFunction and frees the memory occupied by the file. DAFT continues to fetch file fragments until all fragments in the sorted fragment list are brought into memory.

To summarize, the key ideas in DAFT that make it more efficient for bulk file access are:
• DAFT ignores file and file fragment boundaries when accessing the target file set of a bulk file access. This allows DAFT to use large disk reads in most cases, and reduces the seek distance between consecutive disk reads.
• After bringing file fragments into memory through a raw disk interface, DAFT transparently assembles file fragments into complete files, essentially converting inefficient random disk accesses into more efficient random memory accesses. This allows DAFT to present bulk file access applications with individual files.

The basic idea behind DAFT could be applied and ported to arbitrary file systems, as long as it is possible to access the file system's metadata that contains the mapping from a file to the disk locations of its file fragments. In the case of Unix-like systems, this mapping information is stored in an inode table. In the case of Microsoft's NTFS file system, it is stored in the Master File Table (MFT).
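As an illustration of the second phase described above, here is a sketch of ours (not the paper's implementation; the fragment-record layout and all names such as Fragment, bulk_file_read, and read_disk are assumptions): sort the fragment list by disk location, merge runs of nearby fragments into single reads (the DAFTM optimization), scatter each read into per-file re-assembly buffers, and fire the call-back as soon as a file completes.

```python
from collections import namedtuple

# One entry per fragment, mirroring DAFT's fragment list: disk location,
# length, offset within the file, and a pointer back to the owning file.
Fragment = namedtuple("Fragment", "disk_off length file_off file_id")

def bulk_file_read(fragments, file_sizes, read_disk, callback, gap=0):
    """Phase two of the scheme: fetch fragments in disk order, merging
    consecutive fragments whose inter-fragment gap is <= `gap`, re-assemble
    files in memory, and invoke `callback(file_id, data)` on each completed
    file. Phase one (parsing $MFT) is assumed done and is represented by
    the `fragments` list."""
    frags = sorted(fragments, key=lambda f: f.disk_off)
    buffers = {}                                  # file_id -> bytearray under assembly
    remaining = {fid: sum(1 for f in frags if f.file_id == fid)
                 for fid in file_sizes}           # fragments still missing per file
    i = 0
    while i < len(frags):
        # Merge adjacent fragments that are close on disk into one read.
        j = i
        while (j + 1 < len(frags) and
               frags[j + 1].disk_off <= frags[j].disk_off + frags[j].length + gap):
            j += 1
        run_start = frags[i].disk_off
        run_len = frags[j].disk_off + frags[j].length - run_start
        chunk = read_disk(run_start, run_len)     # one physical disk read
        for f in frags[i:j + 1]:
            # Allocate the whole file's buffer on its first fragment.
            buf = buffers.setdefault(f.file_id, bytearray(file_sizes[f.file_id]))
            lo = f.disk_off - run_start
            buf[f.file_off:f.file_off + f.length] = chunk[lo:lo + f.length]
            remaining[f.file_id] -= 1
            if remaining[f.file_id] == 0:         # file fully assembled
                callback(f.file_id, bytes(buffers.pop(f.file_id)))
        i = j + 1
```

With gap=0 the sketch behaves like DAFTNM (every fragment is a separate read); a positive gap trades a few extra bytes read for fewer, larger disk requests, which is the DAFTM tradeoff discussed in Section III.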
DAFT can easily be implemented on any file system as a user-level library and benefit any bulk file access application. Furthermore, DAFT doesn't need to defragment the disk or have any other run-time component. It doesn't introduce any overhead to a system, but can provide improvement at the spot where it is needed.

C. Re-assembly Buffer Memory Management

Although DAFT drastically improves the throughput of bulk file access, it achieves this feat at the expense of a larger memory requirement. In the process of fetching file fragments according to the sorted order, DAFT needs to buffer in memory those files that are waiting to be completed. We define the range between the first and the last physical fragment that belong to a file as the file's physical range. For example, the physical range of file1 in Figure 1 is the range between f1.1 and f1.3. The set of files that need to be buffered in memory when the ith file system fragment is fetched will be all the files whose physical range covers the ith file system fragment. For example, when DAFT reads fragment f2.2, file1, file2, and file3 need to be buffered in memory because their physical range covers fragment f2.2.

Accordingly, the buffer memory requirement when the ith file system fragment is fetched is the sum of the sizes of all those files whose physical range covers it, and the memory requirement of a DAFT run will be the maximum of the buffer memory requirements across all file fragments specified in the target file set.

Because the amount of buffer memory available on machines where DAFT runs is typically limited, we need to adapt DAFT to work with a fixed amount of buffer memory, even at the expense of some performance degradation. More concretely, given a fixed buffer memory budget M, DAFT partitions the file fragment list into a number of sub-lists so that each sub-list can be fetched into memory via one sequential pass through the disk, and the buffer memory requirement of each such pass is lower than M. Intuitively, the fewer the number of passes required, the lower the total elapsed time. Therefore the main goal of DAFT's buffer memory management design is to minimize the total number of sequential disk passes needed in a bulk file access.

This problem can be mapped exactly to the Dynamic Bin Packing (DBP) problem [3], whose goal is to pack a sequence of items that arrive and depart at arbitrary times into fixed-sized bins so that the total number of bins required is minimized. Here a bin of size M corresponds to a DAFT pass through the disk with a buffer memory budget of M, and an item corresponds to a file whose lifetime is the file's physical range and whose size is the file's size. The Dynamic Bin Packing problem is a well-researched problem, and Coffman, Garey and Johnson [3] proved that the lower bound on the competitive ratio of any on-line DBP algorithm is 2.3881 and that the first fit heuristic is very effective and is 2.788-competitive.

Given a target file set, the first fit algorithm statically considers the files in an order according to the disk location of their first fragment and partitions them into multiple bins (passes) as follows. When the first fragment of a file is encountered, the algorithm checks the first bin to see if it has enough free space to hold the file. If it has, the file is put into the first bin. Otherwise, the algorithm checks subsequent bins the same way and creates a new bin if no existing bin can hold the file. After the last fragment of a file is fetched, the algorithm removes the file from its bin and frees up the memory it previously occupied. Files with disjoint life spans can take up the same buffer memory area even when they are assigned to the same pass. As an example in Figure 1, when f2.1, f1.1, f3.1, and f4.1 are read, the first bin is always checked first. Assume f3.1 cannot fit in the first bin; file3 will then be put into another bin. When f4.1 is about to be read, since file2 is completed, file4 may be put into the first bin again if this bin has enough space to hold file4 after the memory taken by file2 is freed up.

Surprisingly, our experiments show that minimizing the number of sequential disk passes, as is done in the first fit algorithm, may not result in the best end-to-end throughput. Instead, a simple greedy algorithm, on-demand cleaning, actually performs better and processes files in the target file set in the same order, as follows. Before the first fragment of a file is read, the algorithm checks if there is enough space in the buffer memory to hold the entire file. If yes, the algorithm allocates the memory required for the file; otherwise, it enters a clean-up mode. In the clean-up mode, the algorithm identifies and sorts the set of file fragments that need to be fetched to complete the currently pending files. Then it reads in all the remaining fragments of the pending files in the sorted order, and completes all the pending files. No memory limit violation is possible in the clean-up mode because no new files are admitted. After completing all the pending files, the algorithm returns to the normal mode and proceeds from the last fetched fragment, the one that triggered the on-demand cleaning it just completed.

Although on-demand cleaning appears greedy, it actually corresponds to a solution to a batched version of the dynamic bin packing problem, which imposes a minimal size limit on the item(s) that are assigned to a bin. That is, whenever a bin accepts new items, the algorithm assigns a batch of items at a time rather than one file at a time, and the total size of these items must be greater than or equal to a threshold S. In the context of DAFT, if S is too small, physically adjacent files may be assigned to different bins/passes, and each pass may thus incur additional disk seeks; if S is too large, having to process each batch of files separately decreases the probability of picking up all relevant file fragments in a disk area as the disk head sweeps through the area, and in turn may require more disk passes than necessary. On-demand cleaning requires S to be the size of the entire buffer memory, whereas file-by-file dynamic bin packing does not have any such constraint, i.e., S = 0.

D. Special Files

There are some special files that DAFT currently does not handle. First, DAFT treats large files specially by separately bringing them into memory and handing them to the application. The rationale of this design is that the additional performance gain from including large files in the standard DAFT flow does not justify the extra memory requirement they introduce. Second, for files compressed or encrypted by the file system, DAFT cannot decompress or decrypt them properly before delivering them to the application, because DAFT retrieves them from disk without going through the file system. This may be acceptable for some applications, such as data back-up, but is not acceptable for other applications, such as AV scanning, because AV scanners cannot make sense of a file if it is in compressed or encrypted form.

One solution to this problem is to access these files through the standard file system API. In the first phase, if the file size is over a certain limit, or the file is compressed or encrypted, the file is put into a special file list, and its file fragment information is not collected. In the second phase, after all fragments are processed, the files in the special file list are handed over to the application one by one with a flag to notify the application that the file content cannot be read directly from memory. Instead, the file content should be read from disk with the standard file system API. Fortunately, large files or compressed/encrypted files are only a small portion of the files on a typical file system. As long as DAFT can handle over 90% of the files, it is expected to produce significant performance improvement when compared with existing file systems without any bulk file access optimization.

E. Consistency Issue

Because DAFT bypasses the file system layer, the metadata and data it reads may not be up to date or consistent with the file system's current image. For example, if a file is being opened for write by an application at the time when DAFT reads it from the disk, the version that DAFT brings into memory may not be up to date with respect to the version in the file system cache, let alone that in the application's address space. As another example, when DAFT reads the file system metadata to build up the requested file fragment list, it does not go through the file system cache, and therefore may miss some pending file system metadata updates that should be but have not been committed to disk.

Fortunately, this inconsistency problem has already been addressed in back-up practice. Windows Volume Shadow Copy Service (VSS) [4] provides a framework for applications to register application-specific quiesce functions. Before a back-up agent takes a back-up, it asks VSS to create a snapshot, which is a point-in-time image of the underlying disk volume. Before creating the snapshot, VSS automatically quiesces the applications through application-provided quiesce call-back functions and flushes the file system cache. The quiesce step forces all applications to flush their caches and stop modifying files they have open. Because a snapshot captures an instant image, there is no consistency issue at all when DAFT works with a VSS snapshot.

Yet another solution to the consistency issue is to run DAFT through the file system cache. In this solution, DAFT accesses the requested file fragments and caches them at the disk volume level, and then reads each requested file through the standard file system interface when it is fully assembled at the disk volume level. By retrieving each requested file through the file system layer, DAFT effectively masks any metadata and data inconsistency problems. Because everything DAFT accesses goes through the file system, this version of DAFT can seamlessly work with files that are encrypted or compressed by the file system layer without any modification. Finally, because in most cases DAFT already caches a requested file's fragments at the disk volume level, most of its file system reads are serviced from this cache and do not require any physical disk accesses.

F. Other Miscellaneous Issues

Disk geometry-aware access. One concern regarding disk geometry is that it is hard to get the exact disk geometry. Spare sectors are provisioned to replace defective sectors. RAID technology can map a volume to multiple disks. Fortunately, DAFT doesn't need to have accurate disk geometry information. The only assumption is that sequential disk access is much faster than random access. By sorting file fragments and accessing them sequentially, we can get more throughput from a disk than by accessing the file fragments randomly. This is true most of the time even if RAID mapping or sector remapping renders the logical volume view drastically different from the physical disk view.

Other concurrent accesses to the disk. DAFT improves performance by accessing the volume sequentially. When there are other concurrent accesses to the disk, DAFT's sequential access may be interrupted by those requests. This might shrink the performance gain of DAFT. But DAFT is still much better than standard file access in that when a disk serves DAFT requests, they correspond to sequential accesses, which are still much more efficient than the random accesses of standard file system traversal.

Asynchronous access. In theory, an asynchronous file system access interface may already be good enough for file system traversal. In practice, the asynchronous access interface only slightly improves the overall performance. When applications use the asynchronous interface to issue multiple requests to a disk, it helps in that the disk has more freedom to schedule the requests by using some form of elevator algorithm to improve access efficiency. This has the same effect as the bulk access feature of DAFT: when more freedom is given, DAFT gets a chance to reorder the requests and make them more efficient. But the similarity ends here. In practice, asynchronous requests are inferior to DAFT for the following reasons. First, asynchronous accesses don't exploit file fragment boundary information. As a result, random access inside each request cannot be avoided. Second, it is difficult to issue all the requests associated with a file system traversal through the asynchronous access interface in one shot. Thus the disk scheduler may not have the broader view that DAFT has. Finally, to allow the disk scheduler more freedom to reorder requests, many small asynchronous access requests need to be issued. The overhead of these system calls and asynchronous signal processing is substantial.
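Before turning to the evaluation, the first fit pass-partitioning of Section II-C can be sketched as follows (our simplification, not the paper's code): files are considered in order of their first fragment's disk location, each file occupies its size in a pass's buffer for the duration of its physical range, and a file goes into the first pass whose still-live files leave enough room under the budget M. Departure is approximated here by comparing physical-range endpoints along the sweep.

```python
def first_fit_passes(files, M):
    """First fit heuristic for the dynamic bin packing view of DAFT's
    buffer management. `files` is a list of (file_id, size, first_off,
    last_off) tuples, where [first_off, last_off] is the file's physical
    range on disk. Returns {file_id: pass_number}. A file occupies `size`
    bytes of a pass's buffer while the sweep is inside its physical range."""
    # Sweep files in order of their first fragment's disk location.
    files = sorted(files, key=lambda f: f[2])
    bins = []                       # per pass: list of (last_off, size) of live files
    assignment = {}
    for fid, size, first, last in files:
        placed = False
        for k, live in enumerate(bins):
            # Files whose physical range ended before `first` have departed,
            # so their buffer space is free again (disjoint life spans share memory).
            live[:] = [(l, s) for (l, s) in live if l >= first]
            if sum(s for _, s in live) + size <= M:
                live.append((last, size))
                assignment[fid] = k
                placed = True
                break
        if not placed:              # no existing pass can hold the file
            bins.append([(last, size)])
            assignment[fid] = len(bins) - 1
    return assignment
```

In the example below, file D reuses buffer space inside pass 0 because its life span barely overlaps file A's, and file C fits in pass 0 only because A has departed by then; only file B is pushed to a second pass.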
III. PERFORMANCE EVALUATION

A. Methodology

To evaluate the performance gain of DAFT when an application accesses a large set of files, we vary the following configuration parameters in the test runs:
• Speed of the test disk: the slow disk is a normal desktop disk with a throughput ranging from 30 MB/sec to 56 MB/sec. The fast disk has a throughput ranging from 76 MB/sec to 128 MB/sec.
• Degree of fragmentation: we used a real-world fragmented file system and a synthetic non-fragmented file system.
• Maximum file size: we varied the maximum size of the files used in the test runs.
• Buffer size: we varied the size of the memory buffer used in re-assembling file fragments into files.

We compared three different methods of accessing a target file set on the Windows platform. The first method, Vanilla, corresponds to the conventional way, which uses the standard Win32 API, i.e., FindFirstFile and FindNextFile to traverse the directories and files, and CreateFile and ReadFile to access each file's content. The performance of the Vanilla method serves as the base case for comparison. The second method is DAFT without merging adjacent fragments (DAFTNM), which traverses and parses the MFT, collects and sorts the file fragments associated with the target file set, and reads these file fragments according to the resulting order. That is, even if two needed file fragments are adjacent to each other, DAFT reads each of them using a separate disk read operation. The third method is DAFT with adjacent fragments merged (DAFTM), which differs from the second method in that it fetches adjacent fragments in the sorted list that are sufficiently close to one another on the disk using a single disk read operation. The performance difference between these two DAFT versions lies in the tradeoff between the decrease in the number of disk read operations required and the increase in the total number of bytes retrieved from disk.

All evaluation tests are carried out on the same test computer to make it easier to compare results from different tests. The test computer has a Xeon 2.4 GHz CPU with 1 GB of memory. The specifications of the two disks used in these tests are shown in Table I. The slow disk is a normal desktop disk accessible through the Parallel ATA interface. The fast disk is a high-performance SCSI disk that is connected to a SCSI adaptor through an Ultra 320 SCSI channel with a maximum throughput of 320 MB/sec. The SCSI adaptor is connected to the computer's 64-bit, 133-MHz PCI-X bus. The raw disk throughputs and access times reported in Table I are measured using the program HD Tune 2.55 [5].

File Size    4 KB   16 KB   64 KB
Fast Disk    15.5    5.5     2.1
Slow Disk     6.2    3.9     2.1
TABLE II
The throughput improvement of DAFTM over the Vanilla method for the non-fragmented test file system. The improvement is more significant for small files because DAFTM is less affected by the file size than the Vanilla method.

B. Non-Fragmented File System

In this test, we evaluate the performance gain of DAFT on an ideal non-fragmented file system, in which (a) files are not fragmented, (b) files in a directory are clustered physically on disk, and (c) the order in which the file system enumerates a directory's files is the same as that in which they are laid out on disk. Note that a file system output by a standard disk defragmentation tool is NOT an ideal non-fragmented file system, because disk defragmentation typically only eliminates intra-file fragmentation (case (a) above), but not necessarily inter-file fragmentation (cases (b) and (c) above). DAFT is immune to inter-file fragmentation but the Vanilla method is not. Therefore, the performance gain of DAFT over the Vanilla method is expected to be higher on a real-world defragmented file system than on an ideal non-fragmented file system.

To synthesize a non-fragmented file system, we start with an empty disk, and use a program to populate it with 1 million files in the following way. The program creates the first directory, and generates 1000 files under that directory. Then it repeats the same process to create another 999 directories one by one, each containing 1000 files. Three file sizes are used in the test runs: 4 KB, 16 KB, and 64 KB. The maximum file size is limited to 64 KB because the capacity of the fast disk is only around 70 GB. We created 1 million files on the fast disk, and then copied the fast disk's image to the slow disk using imaging software (Symantec's Backup Exec System Restore [6]). The imaging software ensures that both disks have the same file system layout.

Figure 2 shows the throughput of using the three bulk file access methods, Vanilla, DAFTNM, and DAFTM, to access the entire synthesized file system, when the file size is varied and the disk used in the tests is the fast disk or the slow disk. As expected, the performance of DAFTM is better than that of DAFTNM, which in turn is better than that of the Vanilla method.

When the file system being enumerated is an ideal non-fragmented file system, the sorted list in DAFT does not help much, and there should not be any performance difference between Vanilla and DAFTNM. However, Figure 2 shows that there is actually a substantial performance gap between Vanilla and DAFTNM, because the disk head needs to move between the MFT and the accessed files in the case of Vanilla, whereas the disk head just needs to move among the accessed files in the case of DAFTNM.

The performance gap between DAFTM and DAFTNM, on the other hand, mainly comes from higher physical disk access efficiency as a result of larger disk access request size, because both use the same sorted fragment order. For example,
            Model                Capacity (GB)   Speed (RPM)   Throughput (MB/sec)   Access Time (msec)
Fast Disk   Seagate ST373455LW   73              15000         76 to 128             5.8
Slow Disk   Maxtor 6Y160P0       163             7200          30 to 56              19.8
TABLE I
The hardware specifications of the fast and slow disks used in this study.

[Figure 2: two bar charts of throughput (MB/sec) versus file size (4 KB, 16 KB, 64 KB) for Vanilla, DAFTNM, and DAFTM: (a) throughput of DAFT and Vanilla on the fast disk; (b) throughput of DAFT and Vanilla on the slow disk.]
Fig. 2. The throughput of DAFTM (DAFT with adjacent fragments merging) and DAFTNM (DAFT without adjacent fragments merging) on the non-fragmented test file system. DAFTM is able to deliver close to the raw disk throughput regardless of the file size and improve over the Vanilla method by a factor of 2 to 15.

File Size Limit            128 KB   512 KB   1024 KB
File Percentage (%)        87.9     95.1     97.3
Fragment Ratio             1.11     1.21     1.28
Memory Requirement (MB)    84       387      718
TABLE III
The characteristics of the fragmented test file system and the total buffer

File Size    128 KB   512 KB   1024 KB
Fast Disk    6.7      5.3      4.4
Slow Disk    7.0      5.7      4.3
TABLE IV
The throughput improvement of DAFTM over the Vanilla method for the
