Feature #11993

Detect duplicates and near-duplicates in the database

Added by Aurélien PIERRE 17 days ago. Updated 15 days ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version: -
Start date: 02/01/2018
Due date:
% Done: 0%
Affected Version: git master branch
System: all
bitness: 64-bit
hardware architecture: amd64/x86

Description

Problem

The same picture can end up saved multiple times on disk, either as an exact duplicate or as a near-duplicate (a different version or a corrupted copy). The user may then import the same picture into the database several times. It would therefore be useful to display exact duplicates and candidate near-duplicates.

Solution

Exact duplicates can be detected with a plain MD5 hash or equivalent, which is just a string to store in the database. However, small variations in a picture will, by design, produce huge variations in the hash, so such hashes cannot detect near-duplicates (which can arise when one file is corrupted or modified).
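
As a rough illustration of the exact-duplicate case, here is a minimal Python sketch that hashes files in chunks and groups paths by digest. The function names (file_digest, find_exact_duplicates) and the choice of SHA-1 as the default are assumptions made for this example, not existing darktable code.

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large raw files are never fully loaded in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(paths):
    """Group paths by digest and return only the groups holding more than one file."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(str(p))
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Two byte-identical files land in the same group, while a copy that differs by a single byte gets an unrelated digest and is not reported, which is exactly the limitation described above.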

Near-duplicates can be detected with more exotic hashes such as locality-sensitive hashing (LSH): https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/SimilarityAnalysis.pdf
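
To make the idea of a similarity-preserving hash concrete, here is a small Python sketch of an "average hash" compared by Hamming distance. This is a simpler, related technique and not the LSH scheme from the linked lecture. The input format (a tiny grayscale thumbnail as rows of 0-255 values), the function names and the distance threshold are all assumptions for illustration.

```python
def average_hash(gray):
    """Return a bit string: '1' where a pixel is brighter than the mean, else '0'.

    `gray` is a small grayscale thumbnail given as rows of 0-255 values
    (hypothetical input; producing the thumbnail is out of scope here).
    """
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if v > mean else "0" for v in pixels)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(gray_a, gray_b, max_distance=2):
    """Flag two thumbnails as candidate duplicates (the threshold is an arbitrary assumption)."""
    return hamming(average_hash(gray_a), average_hash(gray_b)) <= max_distance
```

Unlike MD5 or SHA-1, small pixel changes flip few or no bits here, so a low Hamming distance marks a candidate near-duplicate. Comparing every pair of images this way scales quadratically with library size; avoiding that is where an LSH-style index would come in.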

Implementation

It's not clear to me yet whether the best way to do this is:
  1. an external script (like the one that purges nonexistent files)
  2. a separate tool (like darktable-chart)
  3. a built-in GUI in the lighttable, possibly a popup that lists duplicated files with their duplicates on the same row, plus action buttons to choose the behavior (keep, remove from database, remove from disk, and which copy to keep)

History

#1 Updated by Tobias Ellinghaus 16 days ago

Our database already contains a field "sha1sum" for every image. Unfortunately, as far as I can tell, it has never been filled with any data. Populating that field would be a great first step. Whether we want to inform the user about duplicates is a separate question, but as long as we don't have any hashes we certainly can't do that.
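
A possible shape for that first step, sketched in Python against SQLite. Only the "sha1sum" field name comes from the comment above. The table name images, the id and path columns, and the read_bytes callback are hypothetical and will not match the real schema.

```python
import hashlib
import sqlite3


def populate_sha1(conn, read_bytes):
    """Fill empty sha1sum values; `read_bytes(path)` must return the file content."""
    rows = conn.execute(
        "SELECT id, path FROM images WHERE sha1sum IS NULL OR sha1sum = ''"
    ).fetchall()
    for image_id, path in rows:
        digest = hashlib.sha1(read_bytes(path)).hexdigest()
        conn.execute("UPDATE images SET sha1sum = ? WHERE id = ?", (digest, image_id))
    conn.commit()
    return len(rows)


def duplicate_groups(conn):
    """Return lists of image ids that share the same non-empty sha1sum."""
    cur = conn.execute(
        "SELECT GROUP_CONCAT(id) FROM images "
        "WHERE sha1sum IS NOT NULL AND sha1sum != '' "
        "GROUP BY sha1sum HAVING COUNT(*) > 1"
    )
    return [sorted(int(i) for i in row[0].split(",")) for row in cur]
```

Once the field is populated, listing exact duplicates reduces to a single GROUP BY query. Whether and how to present them to the user remains the separate question raised above.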
