Detect duplicates and near-duplicates in the database
The same picture can end up saved multiple times on disk, either as an exact duplicate or as a near-duplicate (a different version or a corrupted copy), so the user may import the same picture into the database more than once. It would therefore be useful to display exact duplicates and candidate near-duplicates.
Exact duplicates can be detected with a plain MD5 hash or equivalent, which is just a string to store in the database. However, small variations in a picture will, by design, produce huge variations in the hash, so such hashes cannot detect near-duplicates (which can occur if one file is corrupted or modified).
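To illustrate the exact-duplicate case, here is a minimal Python sketch that groups files by content digest. The function names `file_digest` and `find_exact_duplicates` are hypothetical and not part of darktable, and SHA-1 is used only as an assumption (MD5 would work the same way via `hashlib`).

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(paths):
    """Group paths by content digest and keep only groups with several files."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(str(p))
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Two byte-identical files land in the same group, while a copy that differs by a single byte gets an unrelated digest and is not reported, which is why this approach cannot find near-duplicates.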
Near-duplicates can be detected with more exotic hashes such as LSH (locality-sensitive hashing): https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/SimilarityAnalysis.pdf
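As a rough illustration of the near-duplicate idea, the Python sketch below uses a simple perceptual "difference hash" compared by Hamming distance. This is not LSH proper, just a simpler technique in the same spirit. It assumes the image has already been downscaled to a small grayscale grid by some image library, and the names `dhash_bits`, `hamming_distance`, and `near_duplicates`, as well as the threshold of 10 bits, are hypothetical choices for illustration.

```python
def dhash_bits(gray, hash_size=8):
    """Difference hash of a grayscale grid (hash_size rows, hash_size + 1 columns).

    Each bit records whether a pixel is brighter than its right-hand neighbour.
    """
    if len(gray) != hash_size or any(len(row) != hash_size + 1 for row in gray):
        raise ValueError("expected a grid of hash_size rows by hash_size + 1 columns")
    value = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            value = (value << 1) | (1 if left > right else 0)
    return value


def hamming_distance(a, b):
    """Number of bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes, max_distance=10):
    """Return pairs of names whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if hamming_distance(hashes[x], hashes[y]) <= max_distance
    ]
```

Unlike a cryptographic hash, a small local change in the picture flips only a few bits here, so similar pictures end up at a small Hamming distance. The pairwise comparison is quadratic in the number of images; an LSH-style index would be one way to avoid comparing every pair.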
Implementation
It's not clear to me yet what the best way to do that is:
- an external script (like the one to purge nonexistent files)
- a separate tool (like darktable-chart)
- a built-in GUI in the lighttable, possibly a popup listing duplicated files with their duplicates on the same row, plus action buttons to choose the behavior (keep, remove from database, remove from disk, and which one to keep)
#1 Updated by Tobias Ellinghaus 16 days ago
Our database already contains a field "sha1sum" for every image. Unfortunately, as far as I can tell, it has never been filled with any data. Populating that field would be a great first step. Whether we want to inform the user about duplicates is a different question, but as long as we don't have any hashes we certainly can't do that.
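A minimal Python sketch of what populating that field and querying it might look like, using `sqlite3`. The table layout `images(id, path, sha1sum)` is a simplified assumption for illustration and does not reflect the real darktable schema beyond the "sha1sum" field mentioned above. The helper names and the `read_bytes` callable are likewise hypothetical.

```python
import hashlib
import sqlite3


def populate_sha1sums(conn, read_bytes):
    """Fill sha1sum for rows where it is still empty; return how many were filled.

    read_bytes is a callable mapping a stored path to the file's bytes.
    """
    rows = conn.execute(
        "SELECT id, path FROM images WHERE sha1sum IS NULL OR sha1sum = ''"
    ).fetchall()
    for image_id, path in rows:
        digest = hashlib.sha1(read_bytes(path)).hexdigest()
        conn.execute(
            "UPDATE images SET sha1sum = ? WHERE id = ?", (digest, image_id)
        )
    conn.commit()
    return len(rows)


def duplicate_groups(conn):
    """Return {sha1sum: [ids]} for hashes shared by more than one image."""
    query = (
        "SELECT sha1sum, GROUP_CONCAT(id) FROM images "
        "WHERE sha1sum IS NOT NULL AND sha1sum != '' "
        "GROUP BY sha1sum HAVING COUNT(*) > 1"
    )
    return {
        sha: sorted(int(i) for i in ids.split(","))
        for sha, ids in conn.execute(query)
    }
```

With the hashes stored, listing exact duplicates becomes a single grouped query, and the decision about how (or whether) to present them to the user can be made separately.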
#2 Updated by Aurélien PIERRE 15 days ago
Dropping this for future reference: http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html