Create a dictionary (hash map) in Python with multiple lists (arrays) as values

I’ve been working on a personal project to create a program that will recursively scan multiple directories for duplicate files. A big part of the problem in this project is not parsing the data, but how to store it.

Think about this: you are recursively parsing filesystem paths and listing files, but you also need to know whether there are duplicate files, which means that at some point you will be comparing them.

The data that is initially parsed will therefore need to be stored, so that the program can later go back and reference information about matching files (duplicates) to obtain more information about them.

Python’s os.walk returns an iterator that yields a 3-tuple (directory path, subdirectory names, file names) for each directory it visits. For example:

import os, sys

dirToScan = sys.argv[1]
fsIterator = os.walk(dirToScan, topdown=True)
for path, subdirs, files in fsIterator:
    # do stuff with each directory's files

The most suitable data type to store the data is a dictionary (Python’s name for a hash map) of lists (Python’s closest equivalent to arrays). The keys of the dictionary are the file names, and the values are lists which contain the directory path and any other data. Right now the only other data I want is the size in bytes of the file. In the future I might also want to add another item to the list, such as the md5sum of the file.
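As a sketch of that future md5sum extension, a file’s digest can be computed with the standard hashlib module. The helper name and the chunked read are my assumptions (chunking avoids loading a large file into memory at once):

```python
import hashlib

def md5sum(filePath, chunkSize=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filePath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunkSize), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string could simply be appended to the file’s list alongside the directory and size.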

A practical consideration arises while the program is busy parsing the filesystem and adding keys to the dictionary: what happens when a key is not already in the dictionary, versus when the key already exists (in which case we have a duplicate file name)?

The best way to handle this is Python’s collections.defaultdict, a special subclass of dict that automatically creates a default value (here, an empty list) for any missing key, so you can call the list’s .append() method directly.

Here is a simulation of adding multiple entries to a dictionary for the same filename ‘a’ using a defaultdict:

from collections import defaultdict

filesDict = defaultdict(list)  # default_factory set to 'list'

filesDict['a'].append(['some dir 1', 35])
filesDict['a'].append(['another dir 2', 84])
filesDict['a'].append(['other dir 3', 15])

for key, value in filesDict.items():
    for item in value:
        print("file:", key, "dir:", item[0], "size:", item[1])

If you were to try to use .append() on a value in a regular Python dictionary and the key did not already exist, it would raise a KeyError.

The code above is basically the engine of what will be my duplicate file parsing program. Once the data is parsed and stored in a dictionary similar to the above, it will be a matter of deciding which duplicate entries to remove.
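Once the dictionary is built, the file names with more than one entry are the candidate duplicates. This filtering step is my own sketch of how that might look, not code from the program:

```python
from collections import defaultdict

filesDict = defaultdict(list)
filesDict['a'].append(['some dir 1', 35])
filesDict['a'].append(['another dir 2', 84])
filesDict['b'].append(['other dir 3', 15])

# Keep only file names that appear in more than one directory.
duplicates = {name: entries for name, entries in filesDict.items()
              if len(entries) > 1}
print(duplicates)
```

Note that matching names alone does not prove the files are identical; comparing sizes (and eventually md5sums) would confirm a true duplicate.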

I already plan on having several options for choosing which duplicates to remove. The first will be based on the depth of the duplicate: If the duplicate is at a deeper (or shallower) level in the filesystem, it will be removed.

Another option will be to prioritize filesystem paths provided to the program.
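A rough sketch of the depth rule, using hypothetical helper names of my own (pathDepth, deeperDuplicates); it keeps the shallowest copy and returns the deeper ones as removal candidates:

```python
import os

def pathDepth(path):
    """Count path components as a proxy for filesystem depth."""
    return len(os.path.normpath(path).split(os.sep))

def deeperDuplicates(entries):
    """Given [[dir, size], ...] entries for one file name, return
    the entries deeper than the shallowest one (to be removed)."""
    shallowest = min(pathDepth(e[0]) for e in entries)
    return [e for e in entries if pathDepth(e[0]) > shallowest]

entries = [['top/a', 35], ['top/sub/deeper/a', 35]]
print(deeperDuplicates(entries))
```

The shallower-wins policy could be inverted for the “remove shallower” option, and a priority-path option could sort entries by whether their directory starts with a user-supplied prefix instead of by depth.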

Ultimately I would also like the program to be able to compare files located on a remote machine over SSH.

Here’s something closer to the finished product:

import os, sys
from pathlib import Path
from collections import defaultdict

pathToScan = Path(sys.argv[1])
fsIterator = os.walk(pathToScan, topdown=True)
filesDict = defaultdict(list)  # default_factory set to 'list'

# Walk the tree; for every file, record its directory and size
# under its file name, so duplicate names accumulate in one list.
for path, subdirs, files in fsIterator:
    for name in files:
        filesDict[name].append([path, os.stat(Path(path, name)).st_size])

for key, value in filesDict.items():
    for item in value:
        print("file:", key, "dir:", item[0], "size:", item[1])
