Rahul B. Shrestha, GSoC 2021

0. Deliverables

Link to my PR (contains all of my contributions): https://github.com/activeloopai/Hub/pull/1092

Link to my personal repository used for benchmarks: https://github.com/rahulbshrestha/hash-dataset

Blog documenting my weekly progress: https://blogs.python-gsoc.org/en/rahulbshresthas-blog/

1. Introduction

Hi! I'm Rahul. This summer, I worked as a Google Summer of Code open-source contributor at Activeloop [1], under the Python Software Foundation [2] umbrella organisation. This post documents my contributions to Activeloop's "Hub".

Hub [3] is a dataset management tool that enables users to stream unlimited amounts of data from the cloud to any machine. Datasets can be easily created and hosted on Activeloop Cloud or S3. Datasets can also be streamed to any machine and integrated with PyTorch and TensorFlow with no boilerplate code.

2. Problem

Machine learning datasets are often continuously modified, yet there is no efficient way to determine whether two datasets are the same. This becomes especially problematic when the datasets are large (20 GB+) and cannot be easily loaded into memory. This project seeks to design and implement hashing / fingerprinting techniques for comparing large datasets.
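To make the idea concrete, here is a minimal sketch of fingerprinting a dataset without loading it into memory. It is only an illustration of the general technique, not the code from my PR: the function names (`file_digest`, `dataset_fingerprint`, `similarity`), the choice of SHA-256, the 1 MiB chunk size, and the assumption that a dataset is a plain directory of files are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash one file in fixed-size chunks so it never has to fit in memory."""
    hasher = hashlib.sha256()  # assumption: any stable hash function would do
    with open(path, "rb") as f:
        # Read chunk_size bytes at a time until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def dataset_fingerprint(root):
    """Return the set of per-file digests for every file under `root`."""
    return {file_digest(p) for p in Path(root).rglob("*") if p.is_file()}


def similarity(a, b):
    """Jaccard similarity between two fingerprint sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two datasets with the same file contents produce equal fingerprint sets regardless of file names or ordering, and the similarity score drops as samples are added, removed, or modified.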

3. Possible solutions

I started off by researching similar implementations. Here is a list of ideas I considered that didn't work out:

4. Implementation

Problem solution

To make two datasets comparable, I use the following algorithm: