Windows File System Filter Driver FAQ
What is the difference between
            cached I/O, user non-cached I/O, and paging I/O? 
              
In a file
            system or file system filter driver, read and write operations fall into
            several different categories. For the purpose of discussing them, we normally consider the following types: 
            
            - Cached I/O. This includes normal user I/O, both via the Fast I/O path as well
            as via the IRP_MJ_READ and IRP_MJ_WRITE path. It also includes the MDL
            operations (where the caller requests the FSD return an MDL pointing to the
            data in the cache). 
            
            - Non-cached user I/O. This includes all non-cached I/O operations that
            originate outside the virtual memory system. 
            
            - Paging I/O. These are I/O operations initiated by
            the virtual memory system in order to satisfy the needs of the demand paging
            system. 
            
            Cached I/O is any I/O that can be satisfied by the file
              system data cache. In such a case, the operation is normally to copy the
            data from the virtual cache buffer into the user buffer. If the virtual cache
            buffer contents are resident in memory, the copy is fast and the results
            returned to the application quickly. If the virtual cache buffer contents are
            not all resident in memory, then the copy process will trigger a page fault,
            which generates a second re-entrant I/O operation via the paging mechanism. 
            
            Non-cached user I/O is I/O that must bypass the cache - even if the data is
            present in the cache. For read operations, the FSD can retrieve the data
            directly from the storage device without making any changes to the cache. For
            write operations, however, an FSD must ensure that the cached data is properly invalidated (if this is even possible, which it
            will not be if the file is also memory mapped). 
            
            Paging I/O is I/O that must be satisfied from the storage device (whether local
            to the system or located on some "other" computer system) and it is being requested by the virtual memory system as part of
            the paging mechanism (and hence has special rules that apply to its behavior as
            well as its serialization).
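
            The distinction is visible to a filter driver in the IRP flags. The following is a
            minimal sketch (not a complete dispatch routine) of how a read/write path might
            classify an incoming request; the helper name ClassifyReadWrite is hypothetical.

#include <ntifs.h>

typedef enum _RW_CATEGORY {
    RwCategoryCached,
    RwCategoryNonCachedUser,
    RwCategoryPaging
} RW_CATEGORY;

//
// Classify an IRP_MJ_READ or IRP_MJ_WRITE into one of the three
// categories discussed above, based on the IRP flags.
//
RW_CATEGORY
ClassifyReadWrite(
    _In_ PIRP Irp
    )
{
    if (Irp->Flags & IRP_PAGING_IO) {
        //
        // Initiated by the virtual memory system (page fault or flush
        // of dirty pages) - special rules apply.
        //
        return RwCategoryPaging;
    }

    if (Irp->Flags & IRP_NOCACHE) {
        //
        // Non-cached user I/O - must bypass the cache even if the data
        // is already cached.
        //
        return RwCategoryNonCachedUser;
    }

    //
    // Everything else is cached I/O (including the IRP_MN_MDL
    // minor-code variants).
    //
    return RwCategoryCached;
}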
            
I see I/O requests with the
            IRP_MN_MDL minor function code. What does this mean?
            How should I handle it in my file system or filter driver?
              
Kernel mode
            callers of the read and write interface (IRP_MJ_READ and IRP_MJ_WRITE) can
            utilize an interface that allows retrieval of a pointer to the data as it is
            located in the file system data cache. This allows the kernel mode caller to
            retrieve the data for the file without an additional data copy. 
            
            For example, the AFD driver (the kernel-mode side of Windows sockets) has an
            interface it exports that takes a socket handle and a file handle. The file contents are
            "copied" directly to the corresponding communications socket. The AFD
            driver accomplishes this task by sending an IRP_MJ_READ with the IRP_MN_MDL
            minor operation. The FSD then retrieves an MDL describing the cached data (at Irp->MdlAddress) and completes
            the request. When AFD has completed processing the operation it must return the MDL to the FSD by sending an IRP_MJ_READ with the
            IRP_MN_MDL_COMPLETE minor operation specified. 
            
            For a file system filter driver, the write operation
            may be a bit more confusing. When a caller specifies IRP_MJ_WRITE/IRP_MN_MDL
            the returned MDL may point to uninitialized data regions within the cache. That
            is because the cache manager will refrain from reading the current data in from
            disk (unless necessary) in anticipation of the caller replacing the data. When
            the caller has updated the data, it releases the buffer by sending
            IRP_MJ_WRITE with the IRP_MN_MDL_COMPLETE minor code. At that point the
            data is considered written into the cache. 
            
            An FSD that is integrated with the cache manager can
            implement these minor functions by calling CcMdlRead and CcPrepareMdlWrite. The corresponding functions
            for completing these are CcMdlReadComplete and CcMdlWriteComplete. An FSD that is not integrated with the
            cache manager can either indicate these operations are not supported (in which
            case the caller must send a buffer and support the "standard"
            read/write mechanism) or it can implement them in some other manner that is
            appropriate.
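
            For a cache-integrated FSD, the read side of this interface amounts to a thin
            wrapper around the cache manager. Below is a hedged sketch of servicing
            IRP_MJ_READ with IRP_MN_MDL and the matching IRP_MN_MDL_COMPLETE; locking,
            oplock checks, and error handling are omitted, and the routine names are
            hypothetical.

#include <ntifs.h>

//
// IRP_MJ_READ / IRP_MN_MDL: ask the cache manager for an MDL chain
// describing the cached data and hand it back in Irp->MdlAddress.
//
NTSTATUS
FsdMdlRead(
    _In_ PFILE_OBJECT FileObject,
    _Inout_ PIRP Irp
    )
{
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);

    CcMdlRead(FileObject,
              &irpSp->Parameters.Read.ByteOffset,
              irpSp->Parameters.Read.Length,
              &Irp->MdlAddress,
              &Irp->IoStatus);

    return Irp->IoStatus.Status;
}

//
// IRP_MJ_READ / IRP_MN_MDL_COMPLETE: the caller is finished with the
// MDL chain, so return it to the cache manager.
//
VOID
FsdMdlReadComplete(
    _In_ PFILE_OBJECT FileObject,
    _Inout_ PIRP Irp
    )
{
    CcMdlReadComplete(FileObject, Irp->MdlAddress);
    Irp->MdlAddress = NULL;   // the MDL chain no longer belongs to this IRP
}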
            
Handling
            FILE_COMPLETE_IF_OPLOCKED in a filter driver. 
              
A filter
            driver may be called on behalf of an application that has specified the
            FILE_COMPLETE_IF_OPLOCKED create option. If the filter in turn calls
            ZwCreateFile without taking this into account, it may deadlock the thread. 
            
            A common problem for file systems is that of reentrancy. The Windows operating
            system supports numerous reentrant operations. For example, an asynchronous
            procedure call (APC) can be invoked within a given thread context as needed.
            However, suppose an APC is delivered to a thread while
            it is processing a file system request. Imagine that this APC in turn issues
            another call into the file system. Recall that file systems utilize resource
            locks internally to ensure correct operation. The file systems must also ensure
            the correct order of lock acquisition in order to eliminate the possibility of
            deadlocks arising. However it is not possible for the
            file system to define a locking order in the face of arbitrary reentrancy! 
            
            To resolve this problem, the file systems disable certain types of reentrancy
            that would not be safe. They do this by calling FsRtlEnterFileSystem when they enter a region of code that is not reentrant. When they leave that
            region of code, they call FsRtlExitFileSystem to
            enable reentrancy. 
            
            This matters here because the CIFS file server uses oplocks as part of its
            cache consistency mechanism between remote clients and local clients. That
            mechanism is built on a "callback", which is implemented using
            APCs. 
            
            Normally, the FSD will block waiting for the completion of the APC that breaks
            an oplock. Under certain circumstances, however, the
            CIFS server thread that issued the operation requiring an oplock break is also the thread that must process the APC. Since the file system has
            blocked APC delivery, and now the thread is blocked awaiting completion of the
            APC, this approach leads to deadlock. Because of this, the Windows file system
            developers introduced an additional option that advises the file system that if
            an oplock break is required to process the
            IRP_MJ_CREATE operation, it should not block, but instead should return a
            special status code STATUS_OPLOCK_BREAK_IN_PROGRESS. This return value then
            tells the caller that the file is not completely opened.
            Instead, a subsequent call to the file system, using the
            FSCTL_OPLOCK_BREAK_NOTIFY, must be made to ensure that
            the oplock break has been completed. 
            
            This works because, once the status code is returned and the thread exits the
            file system driver, the pending APC can be delivered. 
            
            Note that FSCTL_OPLOCK_BREAK_NOTIFY, and the other calls for the oplock protocol, are documented in the Windows Platform
            SDK.
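
            One way a filter can avoid this deadlock on its own opens is to propagate
            FILE_COMPLETE_IF_OPLOCKED and then wait for the break explicitly. The sketch
            below is hedged: the routine name, access rights, and share modes are
            illustrative assumptions, not a prescription.

#include <ntifs.h>

//
// Open a file from filter context without blocking on an oplock break.
// The helper name is hypothetical; FileName is a fully qualified name.
//
NTSTATUS
FilterOpenWithoutBlockingOnOplock(
    _In_ PUNICODE_STRING FileName,
    _Out_ PHANDLE Handle
    )
{
    OBJECT_ATTRIBUTES objAttr;
    IO_STATUS_BLOCK iosb;
    NTSTATUS status;

    InitializeObjectAttributes(&objAttr,
                               FileName,
                               OBJ_KERNEL_HANDLE | OBJ_CASE_INSENSITIVE,
                               NULL,
                               NULL);

    status = ZwCreateFile(Handle,
                          GENERIC_READ | SYNCHRONIZE,
                          &objAttr,
                          &iosb,
                          NULL,                              // no allocation size
                          FILE_ATTRIBUTE_NORMAL,
                          FILE_SHARE_READ | FILE_SHARE_WRITE,
                          FILE_OPEN,
                          FILE_SYNCHRONOUS_IO_NONALERT |
                              FILE_COMPLETE_IF_OPLOCKED,     // do not block on the break
                          NULL,
                          0);

    if (status == STATUS_OPLOCK_BREAK_IN_PROGRESS) {
        //
        // The handle is valid, but the oplock break has not completed.
        // Wait for it before depending on the file contents.
        //
        status = ZwFsControlFile(*Handle,
                                 NULL, NULL, NULL,
                                 &iosb,
                                 FSCTL_OPLOCK_BREAK_NOTIFY,
                                 NULL, 0, NULL, 0);
    }

    return status;
}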
  
What are the rules for my file
            system/filter driver for handling paging I/O? What about paging file I/O? 
              
The rules for
            handling page faults are quite strict because incorrect handling can lead to
            catastrophic system failure. For this reason, there are specific rules used to
            ensure correct cooperation between file systems (and file system filter
            drivers) and the virtual memory system. This is necessary because page faults are trapped by the VM system, but are then ultimately
            satisfied by the file system (and associated storage stack). Thus, the file
            system must not generate additional page faults because this may lead to an
            "infinite" recursion. Normally, hardware platforms have a finite
            limit to the number of page faults they will handle in a "stacked"
            fashion. 
            
            Thus, the most reliable of the paging paths is that for the paging file. Any
            access to the paging file must not generate any additional page faults.
            Further, to avoid serialization problems, the file system is not responsible
            for serializing this access - the paging file belongs to the VM system, and the
            VM system is responsible for serializing access to this file (this eliminates
            some of the re-entrant locking problems that occur with general file access).
            To achieve this, the drivers involved must ensure they will not generate a page
            fault: they must not call any routines that could themselves cause a page fault
            (e.g., routines located in pageable code sections). The file system may be
            called at APC_LEVEL on this path, but it should only call routines that are
            safe to call at DISPATCH_LEVEL, since such routines are guaranteed not to cause
            page faults. None of the data used along this path may be pageable. 
            
            For all other files, paging I/O has less stringent rules. For a file system
            driver the code paths cannot be paged - otherwise, the
            page fault might be to fetch the very piece of code needed to satisfy the page
            fault! For any file system, data may be paged, since
            such a page fault can always be resolved eventually by retrieving the correct
            contents off disk. Paging activity occurs at APC_LEVEL and thus limits
            arbitrary re-entrancy in order to prevent calling
            code paths that could generate yet more page faults.
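
            As a concrete, hedged illustration, a legacy filter's read dispatch might look
            roughly like the sketch below: the routine itself stays in non-paged code and
            avoids anything pageable when IRP_PAGING_IO is set. The device-extension layout
            is a hypothetical assumption.

#include <ntifs.h>

typedef struct _FILTER_DEVICE_EXTENSION {
    PDEVICE_OBJECT AttachedToDeviceObject;   // device we are filtering
} FILTER_DEVICE_EXTENSION, *PFILTER_DEVICE_EXTENSION;

//
// Deliberately NOT placed in a pageable code section (no
// #pragma alloc_text(PAGE, ...)), because it can be called in order to
// satisfy a page fault.
//
NTSTATUS
FilterRead(
    _In_ PDEVICE_OBJECT DeviceObject,
    _Inout_ PIRP Irp
    )
{
    PFILTER_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

    if (Irp->Flags & IRP_PAGING_IO) {
        //
        // Paging path: no paged-pool allocations, no pageable code or
        // data, nothing that could block on another page fault. Just
        // pass the request through.
        //
        IoSkipCurrentIrpStackLocation(Irp);
        return IoCallDriver(ext->AttachedToDeviceObject, Irp);
    }

    //
    // Non-paging path: normal processing (which may touch pageable
    // code and data) would go here.
    //
    IoSkipCurrentIrpStackLocation(Irp);
    return IoCallDriver(ext->AttachedToDeviceObject, Irp);
}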
            
IRP_MJ_CLEANUP vs. IRP_MJ_CLOSE
            
The purpose
            of IRP_MJ_CLEANUP is to indicate that the last handle reference against the
            given file object has been released. The purpose of IRP_MJ_CLOSE is to indicate
            that the last system reference against the given file object has been released.
            This is because the operating system uses two distinct reference counts for any
            object, including the file object. These values are stored within the object
            header, with the HandleCount representing the number
            of open handles against the object and the PointerCount representing the number of references against the object. Since the HandleCount always implies a reference (from a handle
            table) to the object, the HandleCount is less than or
            equal to the PointerCount. 
            
            Any kernel mode component may maintain a private reference to the object.
            Routines such as ObReferenceObject, ObReferenceObjectByHandle, and IoGetDeviceObjectPointer all bump the reference count on a specific object. A kernel mode driver
            releases that reference by using ObDereferenceObject to decrement the PointerCount on the given object. 
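
            As a hedged illustration, a driver that needs to keep a file object alive beyond
            the caller's handle lifetime might do something like the following; the routine
            name is illustrative only.

#include <ntifs.h>

//
// Take a private reference on a file object so it cannot be deleted
// while we are using it, then release that reference. Holding such a
// reference delays IRP_MJ_CLOSE (not IRP_MJ_CLEANUP).
//
VOID
ReferenceExample(
    _In_ PFILE_OBJECT FileObject
    )
{
    ObReferenceObject(FileObject);     // bumps the PointerCount

    // ... safe to use FileObject here, even after the last handle closes ...

    ObDereferenceObject(FileObject);   // drops the PointerCount
}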
            
            A file system, or file system filter driver, will often see a long delay
            between the IRP_MJ_CLEANUP and IRP_MJ_CLOSE because a component within the
            operating system is maintaining a reference against the file object. Frequently,
            this is because the memory manager maintains a reference against a file object
            that is backing a section object. So long as the section object remains
            "in use" the file object will be referenced.
            Section objects, in turn, remain referenced for extended periods of time
            because they are used by the memory manager in tracking the usage of memory for
            file-backed shared memory regions (e.g., executables,
            DLLs, memory mapped files). For example, the cache manager uses the section
            object as part of its mappings of file system data within the cache. Thus, the period of time between the IRP_MJ_CLEANUP and the
            IRP_MJ_CLOSE can be arbitrarily long. 
            
            The other complication here is that the memory manager uses only a single file
            object to back the section object. Any subsequent file object created to access
            that file will not be used to back the section and thus for these new file
            objects the IRP_MJ_CLEANUP is typically followed by an IRP_MJ_CLOSE. Thus, the
            first file object may be used for an extended period of time,
            while subsequent file objects have a relatively short lifespan.
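
            In a legacy (non-minifilter) filter, this split typically shows up as two
            separate dispatch routines: per-handle work finishes at cleanup, while
            per-file-object state survives until close. The sketch below assumes a
            hypothetical stream-context package (LookupStreamContext and friends) and
            omits all error handling.

#include <ntifs.h>

typedef struct _FILTER_DEVICE_EXTENSION {
    PDEVICE_OBJECT AttachedToDeviceObject;
} FILTER_DEVICE_EXTENSION, *PFILTER_DEVICE_EXTENSION;

// Hypothetical helpers provided elsewhere by the filter.
PVOID LookupStreamContext(_In_ PFILE_OBJECT FileObject);
VOID  FlushPendingWork(_In_ PVOID StreamContext);
VOID  FreeStreamContext(_In_ PVOID StreamContext);

NTSTATUS
FilterCleanup(
    _In_ PDEVICE_OBJECT DeviceObject,
    _Inout_ PIRP Irp
    )
{
    PFILTER_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
    PVOID ctx = LookupStreamContext(irpSp->FileObject);

    if (ctx != NULL) {
        //
        // Last user handle has gone away: finish per-handle work, but
        // keep the context, because the memory manager or cache manager
        // may hold a reference to this file object for a long time.
        //
        FlushPendingWork(ctx);
    }

    IoSkipCurrentIrpStackLocation(Irp);
    return IoCallDriver(ext->AttachedToDeviceObject, Irp);
}

NTSTATUS
FilterClose(
    _In_ PDEVICE_OBJECT DeviceObject,
    _Inout_ PIRP Irp
    )
{
    PFILTER_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
    PVOID ctx = LookupStreamContext(irpSp->FileObject);

    if (ctx != NULL) {
        //
        // Last reference has gone away: nothing in the system can use
        // this file object again, so release everything tied to it.
        //
        FreeStreamContext(ctx);
    }

    IoSkipCurrentIrpStackLocation(Irp);
    return IoCallDriver(ext->AttachedToDeviceObject, Irp);
}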
            
            
What are the rules for managing MDLs
            and User Buffers? How do I substitute my own buffer in an IRP? 
              
In all
            fairness, there are no "rules" for managing MDLs and user buffers.
            There are suggestions that we can offer based upon observed behavior of the
            file systems. First, we note that there are two basic sources of I/O operations
            for a file system - the applications layer, and other operating system
            components. 
            
            For application programs, most IRPs are still buffered. The operations for
            which this is not necessarily the case are those that transfer larger amounts
            of data: IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA,
            IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, IRP_MJ_SET_QUOTA, and the per-control-code
            options of IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and
            IRP_MJ_FILE_SYSTEM_CONTROL. If the Flags field in the device object specifies
            DO_DIRECT_IO, then the buffer for IRP_MJ_READ, IRP_MJ_WRITE,
            IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA,
            and IRP_MJ_SET_QUOTA is described by a memory descriptor list (MDL) pointed to
            by the MdlAddress field of the IRP. If the Flags field in the device object
            specifies DO_BUFFERED_IO, then the buffer for those same operations is a
            non-paged pool buffer pointed to by the AssociatedIrp.SystemBuffer field of the
            IRP. The most common case for a file system driver is that neither of these two
            flags is specified, in which case the buffer for IRP_MJ_READ, IRP_MJ_WRITE,
            IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_SECURITY, IRP_MJ_SET_SECURITY,
            IRP_MJ_QUERY_EA, and IRP_MJ_SET_EA is a direct pointer to the caller-supplied
            buffer via the UserBuffer field of the IRP.
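
            Putting that together, here is a hedged sketch of how a driver might locate the
            data buffer for a read or write; the routine name is hypothetical, and a NULL
            return from MmGetSystemAddressForMdlSafe must be handled by the caller.

#include <ntifs.h>

//
// Return a usable pointer to the data buffer of an IRP_MJ_READ or
// IRP_MJ_WRITE, based on the buffering method the device advertises.
//
PVOID
LocateReadWriteBuffer(
    _In_ PDEVICE_OBJECT DeviceObject,
    _In_ PIRP Irp
    )
{
    if (DeviceObject->Flags & DO_BUFFERED_IO) {
        //
        // The I/O Manager supplied an intermediate non-paged pool buffer
        // and copies data to/from the caller at completion.
        //
        return Irp->AssociatedIrp.SystemBuffer;
    }

    if (DeviceObject->Flags & DO_DIRECT_IO) {
        //
        // The caller's buffer is described by an MDL; map it into system
        // address space before touching it.
        //
        return (Irp->MdlAddress != NULL)
                   ? MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                  NormalPagePriority)
                   : NULL;
    }

    //
    // Neither flag set (the usual file system case): a raw user pointer,
    // valid only in the requesting thread's context and only under
    // __try/__except protection.
    //
    return Irp->UserBuffer;
}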
            
Interestingly,
            for IRP_MJ_QUERY_SECURITY and IRP_MJ_SET_SECURITY the buffer is always passed
            as METHOD_NEITHER. Thus, the user buffer is pointed to by Irp->UserBuffer, and
            the file system is responsible for validating and managing that buffer
            directly.
            
For the
            control operations (IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL) the buffer descriptions are a
            function of the specified control code, of which there are four:
            METHOD_BUFFERED, METHOD_IN_DIRECT, METHOD_OUT_DIRECT and METHOD_NEITHER. 
            
            - METHOD_BUFFERED. The input data is in the buffer pointed to by
            AssociatedIrp.SystemBuffer, and upon completion the output data is returned in
            that same buffer. Transferring data between user and kernel mode is handled by
            the I/O Manager.
            
            - METHOD_IN_DIRECT. The input data is in the buffer pointed to by
            AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in
            MdlAddress. When the I/O Manager probed and locked the memory for that buffer,
            it probed it for read access (hence the IN in the name). This is a common
            source of confusion, because that buffer is conventionally referred to as the
            output buffer, even though probing it for read means it is really being used
            as a secondary input buffer.
            
            - METHOD_OUT_DIRECT. The input data is in the buffer pointed to by
            AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in
            MdlAddress. When the I/O Manager probed and locked the memory for that buffer,
            it probed it for write access (hence the OUT in the name).
            
            - METHOD_NEITHER. The input data is described by a pointer in the I/O stack
            location (Parameters.DeviceIoControl.Type3InputBuffer) and the output data is
            described by a pointer in the IRP (UserBuffer). Both pointers are direct
            virtual address references to the original buffers, and the memory may, or may
            not, be valid. (A sketch of locating these buffers by transfer method follows
            this list.)
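
            The sketch below shows one way a driver might pick up the buffers for a control
            request according to its transfer method; it is illustrative, the routine name
            is hypothetical, and the probing/capture that METHOD_NEITHER pointers require is
            omitted.

#include <ntifs.h>

//
// Locate the input and output buffers of a device/file-system control
// request according to the control code's transfer method.
//
VOID
LocateControlBuffers(
    _In_ PIRP Irp,
    _Out_ PVOID *InputBuffer,
    _Out_ PVOID *OutputBuffer
    )
{
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
    ULONG code = irpSp->Parameters.DeviceIoControl.IoControlCode;

    switch (METHOD_FROM_CTL_CODE(code)) {

    case METHOD_BUFFERED:
        // Input and output share the system buffer.
        *InputBuffer  = Irp->AssociatedIrp.SystemBuffer;
        *OutputBuffer = Irp->AssociatedIrp.SystemBuffer;
        break;

    case METHOD_IN_DIRECT:
    case METHOD_OUT_DIRECT:
        // Primary buffer is the system buffer; the secondary buffer is
        // described by the MDL (probed for read or write, respectively).
        *InputBuffer  = Irp->AssociatedIrp.SystemBuffer;
        *OutputBuffer = (Irp->MdlAddress != NULL)
                            ? MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                           NormalPagePriority)
                            : NULL;
        break;

    case METHOD_NEITHER:
    default:
        // Raw user-mode pointers; must be probed and captured under
        // __try/__except before use.
        *InputBuffer  = irpSp->Parameters.DeviceIoControl.Type3InputBuffer;
        *OutputBuffer = Irp->UserBuffer;
        break;
    }
}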
            
            Regardless of the transfer method, any pointers embedded within these buffers
            are always raw virtual address references. For example, a driver whose control
            structure carries a pointer to a further buffer must treat that pointer as
            ordinary, unvalidated access to the application's address space. 
            
            For any driver, accessing a user buffer directly can cause an invalid memory
            reference. Such references cause the memory manager to raise an exception (for
            example via ExRaiseStatus). If the driver has not protected against such
            exceptions, the default kernel exception handler is invoked; it calls
            KeBugCheckEx indicating KMODE_EXCEPTION_NOT_HANDLED, with the exception
            reported as STATUS_ACCESS_VIOLATION (0xC0000005). Protecting against such
            exceptions requires the use of structured exception handling, which is
            described elsewhere (see Question Number 1.38 for more information). 
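
            A minimal sketch of that protection, assuming the buffer is being read from a
            user-mode caller; the routine name is hypothetical and the probe is kept to the
            bare minimum.

#include <ntifs.h>

//
// Copy Length bytes from a user-mode buffer into a kernel buffer,
// converting an invalid user address into a status code instead of
// letting it bugcheck the system.
//
NTSTATUS
CaptureUserBuffer(
    _In_reads_bytes_(Length) PVOID UserBuffer,
    _In_ ULONG Length,
    _Out_writes_bytes_(Length) PVOID KernelCopy
    )
{
    NTSTATUS status = STATUS_SUCCESS;

    __try {
        ProbeForRead(UserBuffer, Length, sizeof(UCHAR));
        RtlCopyMemory(KernelCopy, UserBuffer, Length);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        status = GetExceptionCode();   // typically STATUS_ACCESS_VIOLATION
    }

    return status;
}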
            
            A driver that creates an MDL to describe the user buffer will normally
            associate that MDL with the IRP. This is useful because it ensures that when
            the IRP is completed the I/O Manager will clean up the MDL, which eliminates
            the need for the driver to provide special handling for such clean-up.
            As a result, it is normal that if there is both an MdlAddress and UserBuffer address, the MDL describes the
            corresponding user address (you can confirm this by comparing the value
            returned by MmGetMdlVirtualAddress with the value
            stored in the UserBuffer field). Of course, it is possible that a driver might associate multiple MDLs
            with a single IRP (using the Next field of the MDL itself). This could be done
            explicitly (by setting the field directly) or implicitly (by using IoAllocateMdl and indicating TRUE for the SecondaryBuffer parameter). This could be problematic for file system filter drivers, should a file system be
            implemented to exploit this behavior. 
            
            The other source of I/O operations is from other OS components. It is
            acceptable for such OS components to use the same access mechanism used by
            user-mode components, in which case the earlier analysis still applies. In
            addition, kernel mode components may utilize direct I/O - regardless of the
            value in the Flags field of the DEVICE_OBJECT for the given file system. For
            example, paging I/O operations are always submitted to
            the file system by utilizing MDLs that describe the new physical memory that
            will be used to store the data. File systems should not reference the memory
            pointed to by Irp->UserBuffer, even though this address will appear to be valid
            (and will even be the same as the value returned by MmGetMdlVirtualAddress).
            The address is not, in fact, valid, but it may be used by the file system when
            constructing multiple sub-operations. These Memory Manager MDLs cannot be used
            for direct access to the memory they describe: the pages have not been mapped
            at that address and do not yet contain the correct data. 
            
            For a file system filter driver that wishes to modify
            the data in some way, it is important to keep in mind the use of that memory.
            For example, a traditional mistake for an encryption filter is to trap the
            IRP_MJ_WRITE where the IRP_NOCACHE bit is set (which catches both user
            non-cached I/O as well as paging I/O) and, using the provided MDL or user
            buffer, encrypt the data in-place. The risk here is that some other thread will
            gain access to that memory in its encrypted state. For example, if the file is
            memory mapped, the application will observe the modified data, rather than the
            original, cleartext data. Thus, there are a few rules
            that need to be observed by file system filter drivers that choose to modify
            the data buffers associated with a given IRP: 
            
            - The IRP_PAGING_IO bit changes the completion behavior of the I/O Manager.
            MDLs in such IRPs are not discarded or cleaned up by the I/O Manager, because
            they belong to the Memory Manager (see IoPageRead as an example). Thus, filter
            drivers should be careful when setting this bit (e.g., if they create a new
            IRP and send it down with the resulting data).
            
            - The Irp->UserBuffer must have the same value as is returned by
            MmGetMdlVirtualAddress. If it does not, and the underlying file system must
            break up the I/O operation into a series of sub-operations, it will do so
            incorrectly (see how this is handled in deviosup.c within the FAT file system
            example in the IFS Kit, where it builds a partial MDL using IoBuildPartialMdl
            and uses Irp->UserBuffer as an index reference for Irp->MdlAddress). For
            example, when substituting a new piece of memory (such as for the encryption
            driver), make sure that this parameter is set correctly.
            
            - Never modify the buffer provided by the caller unless you are willing to
            make those changes visible to the caller immediately. Keep in mind that in a
            multi-threaded, shared-memory operating system the change is - literally -
            available and visible to other threads/processes/processors as you make it.
            Changes should always be made to a separate buffer, which can then be used in
            lieu of the original buffer, either within the original IRP or by using a new
            IRP for that task (a sketch of this substitution appears after this list).
            
            - Use the correct routine for the type of buffer (e.g., MmBuildMdlForNonPagedPool
            if the memory is allocated from non-paged pool).
            
            - Any reference to a pointer within the user's address space must be protected
            using __try and __except in order to prevent invalid user addresses from
            causing the system to crash.
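
            The sketch below illustrates the substitution approach for a non-cached or
            paging write: the data is copied into a filter-owned non-paged buffer,
            transformed there, and a fresh MDL (built with MmBuildMdlForNonPagedPool) plus
            a matching UserBuffer value are placed in the IRP. The routine name, pool tag,
            and transform step are placeholders; the saved original MdlAddress/UserBuffer
            must be restored, and the shadow buffer and MDL freed, in the filter's
            completion routine, which is not shown.

#include <ntifs.h>

#define SHADOW_TAG 'wShF'   // hypothetical pool tag

//
// Replace the write buffer in Irp with a filter-owned shadow copy.
// The caller must remember OriginalMdl/OriginalUserBuffer and restore
// them (and free the shadow buffer and MDL) at completion.
//
NTSTATUS
SubstituteWriteBuffer(
    _Inout_ PIRP Irp,
    _In_ ULONG Length,
    _Out_ PMDL *OriginalMdl,
    _Out_ PVOID *OriginalUserBuffer,
    _Out_ PVOID *ShadowBuffer
    )
{
    PVOID source;
    PVOID shadow;
    PMDL shadowMdl;

    // Map the caller's data so it can be read (never modified in place).
    source = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority);
    if (source == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    shadow = ExAllocatePoolWithTag(NonPagedPoolNx, Length, SHADOW_TAG);
    if (shadow == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    RtlCopyMemory(shadow, source, Length);
    // ... transform (e.g., encrypt) the data in 'shadow' here ...

    shadowMdl = IoAllocateMdl(shadow, Length, FALSE, FALSE, NULL);
    if (shadowMdl == NULL) {
        ExFreePoolWithTag(shadow, SHADOW_TAG);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    MmBuildMdlForNonPagedPool(shadowMdl);   // correct routine for pool memory

    // Save the caller's buffer description, then substitute our own.
    // Keep UserBuffer consistent with the MDL so the FSD can build
    // partial MDLs correctly if it splits the I/O.
    *OriginalMdl        = Irp->MdlAddress;
    *OriginalUserBuffer = Irp->UserBuffer;
    *ShadowBuffer       = shadow;

    Irp->MdlAddress = shadowMdl;
    Irp->UserBuffer = MmGetMdlVirtualAddress(shadowMdl);

    return STATUS_SUCCESS;
}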
  
            
What are the issues with respect to
            IRQL APC_LEVEL? What does it do? Why should I use (or not use) FsRtlEnterFileSystem? 
              
Windows is designed to be a fully re-entrant operating system. Thus,
            in general, kernel components may make calls back into the OS without worrying
            about deadlocks or other potential re-entrancy problems. 
            
            Windows also provides out-of-band execution of operations, such as asynchronous
            procedure calls (APCs). An APC is a mechanism that
            allows the operating system to invoke a given routine within a specific thread
            context. This is, in turn, done by using a queue of
            pending APC objects that is stored in the control structure used by the OS for
            tracking a thread's state. Periodically, the kernel checks to see if there are pending APC objects that need to be processed by the
            given thread (where "periodically" is an arbitrary decision of the
            operating system). From a pragmatic programming standpoint, an APC can be
            "delivered" (that is, the routine can be called) by a thread between
            any two instructions. The delivery of APCs can be blocked by kernel code using
            one of two mechanisms: 
            
            - Normal kernel APC delivery may be disabled by using KeEnterCriticalRegion and
            re-enabled by using KeLeaveCriticalRegion. Note that special kernel APCs are
            not disabled by this mechanism.
            
            - Special kernel APC delivery may be disabled by raising the IRQL of the thread
            to APC_LEVEL (or higher). 
  
            There are numerous uses for this, many of which stem from the nature of threads
            and APCs on Windows. First, note that a given thread runs on only a single
            processor at any given time. Thus, the operating system can eliminate
            multi-processor serialization issues by requiring that an operation be done in
            one specific thread context. For example, each thread maintains a list of
            outstanding I/O operations it has initiated (this is ETHREAD->IrpList), and
            this list is only modified in the given thread's context. By exploiting this,
            and by raising to APC_LEVEL, the list can be safely modified without resorting
            to more expensive locking mechanisms (such as spin locks). 
            
            The primary disadvantage of running at APC_LEVEL is that it prevents I/O
            completion APC routines from running. This in turn means that the driver must
            be careful to handle completion correctly: it cannot rely on the synchronous Zw
            routines, and when it sends an IRP it must signal completion from its own
            completion routine so that it can detect when the I/O has finished.
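
            A hedged sketch of the usual pattern follows: FsRtlEnterFileSystem (effectively
            KeEnterCriticalRegion) brackets the non-reentrant region, and the ERESOURCE is
            acquired inside it. The context structure and routine name are hypothetical.

#include <ntifs.h>

typedef struct _EXAMPLE_CONTEXT {
    ERESOURCE Resource;   // protects the non-reentrant state below
    ULONG     State;
} EXAMPLE_CONTEXT, *PEXAMPLE_CONTEXT;

VOID
UpdateStateSafely(
    _Inout_ PEXAMPLE_CONTEXT Context,
    _In_ ULONG NewState
    )
{
    //
    // Disable normal kernel APC delivery so this thread cannot be
    // re-entered (ERESOURCE acquisition requires the caller to be in a
    // critical region anyway).
    //
    FsRtlEnterFileSystem();

    ExAcquireResourceExclusiveLite(&Context->Resource, TRUE);

    Context->State = NewState;   // ...non-reentrant work goes here...

    ExReleaseResourceLite(&Context->Resource);

    //
    // Re-enable normal kernel APC delivery; pending APCs may now be
    // delivered to this thread.
    //
    FsRtlExitFileSystem();
}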
            
 


