
Thread: Directory scanning in windows

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Kharkov, Ukraine
    Thanked 1,329 Times in 759 Posts

    Directory scanning in windows

    Wrote up some notes about filesystem scanning and decided to post them here.
    It's kinda offtopic, but in fact any usable archiver has to deal with it somehow.

    Also here're some relevant sources/benchmarks from an older discussion:

    WIN32 Filenames and Directory scanning

    0. Filename examples


    *       // any file
    *.*     // any file (even without a "."!)
    *xxx.   // any file ending with "xxx" (without a trailing dot)
    ?       // any single-character name (one UTF-16 code unit!)
    *a*b*   // any name with letters a and b, in that order

    Symbols not allowed in filenames:
    < > : " / \ | ? * \x00-\x1F

    cmd.exe also matches these masks against 8.3 short names
    (so "dir c:\windows\*1.*" can show some unexpected results)
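    A custom filter for such masks (as suggested in section 3 below) can be sketched
    like this -- a hypothetical literal matcher over UTF-16 code units, which
    deliberately does NOT reproduce the "*.*"-matches-dotless-names and short-name
    quirks above:

```c
#include <wchar.h>

/* Minimal recursive wildcard matcher over UTF-16 code units:
   '*' matches any run (possibly empty), '?' matches exactly one unit.
   Purely literal -- unlike FindFirstFile/cmd.exe it does not make
   "*.*" match dotless names, and never consults 8.3 short names. */
static int wild_match(const wchar_t *mask, const wchar_t *name)
{
    for (;;) {
        if (*mask == L'*') {
            while (*mask == L'*') mask++;   /* collapse runs of '*' */
            if (*mask == 0) return 1;       /* trailing '*' matches the rest */
            for (; *name; name++)
                if (wild_match(mask, name)) return 1;
            return 0;
        }
        if (*mask == 0) return *name == 0;
        if (*name == 0 || (*mask != L'?' && *mask != *name)) return 0;
        mask++; name++;
    }
}
```

    Names enumerated with a plain "*" FindFirst mask would then be run through
    this against the user's mask.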

    1. Codepages

    There're 3 relevant codepages - the default legacy bytewise
    one (ANSI/OEM?), UTF-8 (for internal representation) and
    UTF-16 (for the W-suffixed APIs).

    - UTF-8 is more compact (2x most of the time), and it's easier
    to manipulate filenames as bytewise strings.
    There're more efficient encodings for non-English codepages,
    but those would require converting names even on unixes, where
    UTF-8 is commonly used by file APIs.
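    On Windows the actual conversion would go through
    WideCharToMultiByte(CP_UTF8, ...); as a sketch of what the UTF-16 -> UTF-8
    step looks like, here's a minimal BMP-only converter (surrogate pairs
    deliberately left out, matching the plan to postpone full UTF-16 support):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP UTF-16 code unit as UTF-8.
   Returns the number of bytes written (1..3). */
static size_t u16_to_u8(uint16_t c, unsigned char *out)
{
    if (c < 0x80) { out[0] = (unsigned char)c; return 1; }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```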

    Real UTF-16 support (surrogate pairs etc) can be postponed
    until it's explicitly requested and examples are provided
    (an actual filename containing surrogates, or whatever).

    - Apparently it's possible to write the same string in
    multiple ways in Unicode (eg. U+00C4 == U+0041 U+0308), and
    a NormalizeString() API is even available, but only in
    Vista+, and it's probably too slow to apply to filenames anyway.
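    The two spellings really are distinct as code-unit sequences, which is what
    breaks bytewise filename comparison; a tiny illustration:

```c
#include <wchar.h>

/* "A-umlaut" precomposed vs. decomposed: identical on screen, but
   different code-unit sequences, so a bytewise compare sees two
   distinct filenames unless some normalization pass is applied. */
static const wchar_t precomposed[] = { 0x00C4, 0 };         /* U+00C4     */
static const wchar_t decomposed[]  = { 0x0041, 0x0308, 0 }; /* A + U+0308 */
```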

    2. Scanning order

    Should be Depth-first probably, as it seems to be common.
    Also Width-first would be faster on archive extraction (and
    later repacking of FS extracted in such a way), but that
    would make depth-first programs slower, and there's no
    reliable way to detect the method used for FS creation.

    The best way to scan would be sorting by cluster index,
    but FindFirst doesn't return such info.

    Btw, it seems necessary to build a real FS tree in memory
    which can be enumerated from the root, as reversed
    (child-to-parent) trees are inconvenient for many tasks
    (like FS comparisons etc).
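    A possible shape for such a forward tree (hypothetical field names; real
    nodes would also carry the FindFirst attributes, sizes, timestamps etc):

```c
#include <stdlib.h>
#include <string.h>

/* Forward tree: each node owns a list of children, so the whole tree
   can be walked from the root -- unlike a child->parent "reversed"
   tree, which can only be walked upward. */
typedef struct FsNode {
    char          *name;
    int            is_dir;
    struct FsNode *child;   /* first child (directories only) */
    struct FsNode *next;    /* next sibling */
} FsNode;

static FsNode *fs_add(FsNode *parent, const char *name, int is_dir)
{
    FsNode *n = calloc(1, sizeof *n);
    n->name   = strdup(name);
    n->is_dir = is_dir;
    n->next   = parent->child;  /* prepend; real code would keep scan order */
    parent->child = n;
    return n;
}

/* Depth-first count of files (not dirs) below a node. */
static int fs_count_files(const FsNode *n)
{
    int total = 0;
    for (const FsNode *c = n->child; c; c = c->next)
        total += c->is_dir ? fs_count_files(c) : 1;
    return total;
}
```

    fs_count_files is just a stand-in for any root-down walk (comparison,
    extraction ordering, size totals etc).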

    3. Scanning quirks

    - \\?\C:\ paths. It's supposedly faster and better to use
    such paths: they don't have the 260-char MAX_PATH limit,
    and the APIs also do less processing on them.
    - Short names. Short file/dir names result in shorter paths,
    and a shorter path is faster to construct and manage
    (probably inside the APIs too).
    - It's necessary to enumerate all the names with a "*" mask
    (appended to the FindFirst path), and then filter them with a
    custom wildcard filter.

    4. Truename

    Seems like it's completely impractical to rely on any windows
    APIs to process all the enumerated file names (too slow when
    called per file). So the right way looks like calling
    GetFullPath once for the initial path, and then appending
    directories and masks to it without any further fixups.
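    The "resolve once, then append" idea can be sketched like this (a
    hypothetical helper; bounds checking omitted for brevity, and the base is
    assumed to be a single GetFullPathName result obtained at startup):

```c
#include <string.h>

/* Append one path component to an already-resolved base, inserting a
   backslash only when needed.  No per-call API fixups -- the base was
   canonicalized once up front. */
static void path_append(char *path, const char *component)
{
    size_t len = strlen(path);
    if (len && path[len - 1] != '\\')
        path[len++] = '\\';
    strcpy(path + len, component);
}
```

    Appending subdirectories and then the "*" mask this way builds each
    FindFirst argument with plain string ops.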

    Btw, in 7-zip source, there're 842 lines in FileDir.cpp,
    403 lines in FileFind.cpp, 318 lines in FileIO.cpp,
    459 lines in Wildcard.cpp, and quite a few other related
    files (string operations etc).

    - No "current drive" in windows - there's a "current path"
    instead. Seems like it's necessary to either use GetFullPath,
    or be ready to _set_ the current directory to eg. "C:" to
    get the current path for drive C:.

    - GetFullPathName is troublesome (lots of reported quirks etc),
    but it seems there's no way around it other than using a bunch
    of Get/SetCurrentDirectory calls.

    5. Comparison of trees

    For backup etc it's occasionally necessary to compare
    filesystems and check for differences. But it's troublesome
    to just compare them sequentially, as the file order within a
    directory may change etc.
    So something like filling a hashtable with file/dir paths
    and comparing their stats on match seems reasonable.
    However, another hashtable would probably be required to
    detect file moves and/or renames.

    6. NTFS streams and ACL

    Supporting these requires adding an enumeration loop with
    quite a few API calls per file, so it would be too slow
    not to make it optional. That's problematic though,
    as it's easy to lose some streams that way
    (especially for backup).

  2. #2
    Member chornobyl's Avatar
    Join Date
    May 2008
    Thanked 0 Times in 0 Posts
    tested different directory scanners
    Attached Files

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Kharkov, Ukraine
    Thanked 1,329 Times in 759 Posts
    Err.. this is what it's about (chornobyl's post):

    Ah, and also this:
    Last edited by Shelwien; 26th November 2009 at 14:53.
