Rahul B. Shrestha, GSoC 2021
Link to my PR (contains all of my contributions): https://github.com/activeloopai/Hub/pull/1092
Link to my personal repository used for benchmarks: https://github.com/rahulbshrestha/hash-dataset
Blog documenting my weekly progress: https://blogs.python-gsoc.org/en/rahulbshresthas-blog/
Hi! I'm Rahul. This summer, I worked as a Google Summer of Code open-source contributor at Activeloop [1] under the umbrella organisation, the Python Software Foundation [2]. This post documents my contributions to Activeloop's Hub.
Hub [3] is a dataset management tool that enables users to stream unlimited amounts of data from the cloud to any machine. Datasets can be easily created and hosted on Activeloop Cloud or S3, streamed to any machine, and integrated with PyTorch and TensorFlow with no boilerplate code.
Machine learning datasets are often continuously modified, yet there is no efficient way to determine whether two datasets are the same. This becomes especially problematic when datasets are large (20 GB+) and cannot easily be loaded into memory. This project seeks to design and implement hashing/fingerprinting techniques for comparing large datasets.
I started off by researching similar implementations. Here are the ideas I considered that didn't work out:
Locality Sensitive hashing
My original proposal intended to use Locality Sensitive Hashing (LSH) with the MapReduce paradigm to compare large machine learning datasets.
LSH is an algorithm that hashes similar inputs into the same “buckets” with high probability. Unlike conventional cryptographic hash functions, LSH tries to ensure hash collisions rather than avoid them. Input data is hashed and placed into “buckets”; the more similar two objects are, the higher the probability that they end up in the same bucket. These collisions are helpful because the goal is to reduce the number of comparisons required to find similar items, and similar items are highly likely to have similar hash values. LSH is a form of probabilistic dimensionality reduction for high-dimensional data and is used, for example, in nearest neighbour search on large-scale data [7].
LSH helps find similar pairs in a large dataset in near-optimal time. Brute force, i.e. comparing every possible pair, requires C(N, 2) = N!/(2!(N-2)!) = N(N-1)/2 ≈ N²/2 comparisons, which is O(N²). LSH improves this to roughly O(N).
For finding similar pairs, LSH outperforms brute-force comparison, and the MapReduce paradigm running on Apache Spark/Hadoop makes it scalable to large datasets. This could be integrated with Hub to compare large image and text datasets for duplicates; a sketch of the core idea follows below.
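For reference, here is a minimal sketch of the MinHash-plus-banding flavour of LSH. All names and parameters (NUM_HASHES, BANDS, etc.) are illustrative choices of mine, not part of Hub:

```python
# Minimal MinHash + banding sketch of the LSH idea described above.
import hashlib

NUM_HASHES = 64   # length of each MinHash signature
BANDS = 16        # split signatures into bands; 64 / 16 = 4 rows per band

def _hash(token: str, seed: int) -> int:
    # Derive a family of hash functions by mixing in a per-function seed.
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens: set[str]) -> list[int]:
    # The signature keeps, per hash function, the minimum hash over all
    # tokens; similar sets share many of these minima.
    return [min(_hash(t, i) for t in tokens) for i in range(NUM_HASHES)]

def lsh_buckets(signature: list[int]) -> list[tuple]:
    # Band the signature: two items land in the same bucket if any band
    # matches exactly, which is likely for similar items.
    rows = NUM_HASHES // BANDS
    return [tuple(signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]

a = minhash_signature({"cat", "dog", "fish"})
b = minhash_signature({"cat", "dog", "bird"})
shared = sum(x == y for x, y in zip(lsh_buckets(a), lsh_buckets(b)))
print(f"bands in the same bucket: {shared}/{BANDS}")
```

Only items that collide in at least one band need a full pairwise comparison, which is how LSH avoids the O(N²) blow-up.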
After my initial meeting with my mentors, Vinn and Abhinav, I concluded that LSH was the wrong approach for the problem we were trying to solve. LSH can be thought of as fuzzy matching for finding similar images/text documents. For checking whether a dataset has changed, storing hashes of the individual samples is a better approach.
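To illustrate that contrast (the actual implementation lives in the PR linked above), here is a minimal sketch of per-sample hashing; the names hash_sample and dataset_fingerprint are hypothetical:

```python
import hashlib

def hash_sample(sample: bytes) -> str:
    # Hash each sample individually, so the whole dataset never needs to
    # be held in memory at once.
    return hashlib.sha256(sample).hexdigest()

def dataset_fingerprint(samples) -> str:
    # Sorting the per-sample hashes makes the fingerprint independent of
    # the order in which samples are read from disk.
    per_sample = sorted(hash_sample(s) for s in samples)
    return hashlib.sha256("".join(per_sample).encode()).hexdigest()

a = dataset_fingerprint([b"img0", b"img1", b"img2"])
b = dataset_fingerprint([b"img2", b"img0", b"img1"])  # same samples, new order
print(a == b)  # True
```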
Merkle trees
Merkle trees are constructed from the bottom up: the individual data blocks (leaf nodes) are hashed, and pairs of hashes are then repeatedly hashed together until only one hash remains, the root hash.
The order of the leaf nodes matters: a different ordering of leaf nodes results in a completely different root hash. So this isn't suitable for hashing datasets, as the order of samples depends on the file system. The sketch below illustrates this.
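Here is a minimal Merkle root computation for illustration (not code from the PR); feeding it the same leaves in reverse order produces a different root:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Hash the leaves, then repeatedly hash adjacent pairs until a single
    # hash (the root) remains.
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

samples = [b"a", b"b", b"c", b"d"]
print(merkle_root(samples).hex())
print(merkle_root(samples[::-1]).hex())  # different order -> different root
```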
Problem solution
To make two datasets comparable, I use the following algorithm: