Find Duplicate Files in System

Problem Link:

Problem Statement

Given a list of paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

  • "root/d1/d2/…/dm f1.txt(f1_content) f2.txt(f2_content) … fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt … fn.txt) with content (f1_content, f2_content … fn_content) respectively in the directory “root/d1/d2/…/dm”. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • "directory_path/file_name.txt"

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]


  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

Disclaimer: Before moving on to the solution, it is recommended that you try the problem yourself first!


As soon as we read the problem, the first thing that comes to mind is that it relates to searching a real file system. If we look at it more closely, however, we can take a different perspective: the problem can be solved with simple string splitting and a hashmap/dictionary.

For every directory-info string, we split on blank spaces and build a hashmap/dictionary whose key is the file content and whose value is a list of "directory path + file name" strings. Once the dictionary is fully built, we iterate over every key and check whether its list of values has a size greater than 1. If so, we include that list in our answer.
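As a small sketch of the splitting-and-grouping idea on a single directory-info string (the variable names here are my own, not from the problem statement):

```python
from collections import defaultdict

# One directory-info string from Example 1.
entry = "root/a 1.txt(abcd) 2.txt(efgh)"

# The first token is the directory path; the rest are "name(content)" pairs.
tokens = entry.split(" ")
directory = tokens[0]

groups = defaultdict(list)  # file content -> list of full file paths
for token in tokens[1:]:
    # Drop the trailing ')' and split at '(' into file name and content.
    name, content = token[:-1].split("(")
    groups[content].append(directory + "/" + name)

print(dict(groups))  # {'abcd': ['root/a/1.txt'], 'efgh': ['root/a/2.txt']}
```

Applying this to every string in paths, while sharing one dictionary across all of them, collects every file path under its content.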


  • First, we create an empty dictionary/hashmap that maps a string key (the file content) to a list of strings (the file paths), which will store the data in the required format.
  • We start a for loop over every element in the given paths array and split it on blank spaces. We store the directory path (the first token) in a variable dtry. Then we loop over the remaining tokens and, for each, split it into two variables: path (the file name) and data (the file content). If data already exists as a key, we append dtry + path to its existing list of values; otherwise we add data as a new key with dtry + path as its first value.
  • After the dictionary/hashmap is built, we create an empty array ans to store the duplicate paths. We iterate over the keys in the dictionary/hashmap and check whether the list of values has a size greater than one. If so, we add the whole list to ans.
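The steps above can be put together as a short Python sketch (the function name findDuplicate and the variable names are mine, chosen for illustration):

```python
from collections import defaultdict
from typing import List

def findDuplicate(paths: List[str]) -> List[List[str]]:
    # Step 1: dictionary mapping file content -> list of full file paths.
    groups = defaultdict(list)

    # Step 2: split each entry into the directory and its "name(content)" tokens.
    for entry in paths:
        dtry, *files = entry.split(" ")
        for f in files:
            # Drop the trailing ')' and split at '(' into name and content.
            path, data = f[:-1].split("(")
            groups[data].append(dtry + "/" + path)

    # Step 3: keep only the groups with more than one file.
    return [group for group in groups.values() if len(group) > 1]

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(findDuplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
```

The groups may come out in a different order than shown in the examples, which is fine since the answer may be returned in any order.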


Time Complexity

  • T(n): O(n * x), where n strings of average length x are parsed.
  • S(n): O(n * x). The map and the result list res grow up to size n * x.

If you like the post, don’t forget to share it with your peers and give your reactions!
