
feat(blocking): add a new parameter 'repeatable' to MinHashingBlocking

Simon Chabot requested to merge topic/default/repeatable_minhashing into branch/default

The new parameter, repeatable, makes it possible to obtain the same results across calls when the inputs are the same (documents, signature length, etc.).

The MinHashing technique relies on randomly chosen hash functions. To get the same results on every call (when repeatable is set to True), we set the random seed to a predetermined value, which can be customized if need be. With the same random seed, one gets the same hash functions, therefore the same document signatures, and ultimately the same results.
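To illustrate the idea, here is a minimal standalone sketch, not the code from this merge request. The function names, the `seed` keyword, its default value, and the `(a * x + b) % PRIME` hash family are all assumptions made for the example; only the `repeatable` flag comes from the description above. The sketch also uses a local `random.Random` instance rather than seeding a global generator, which may differ from what the actual change does.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def make_hash_params(siglen, seed=None):
    """Draw `siglen` (a, b) pairs defining h(x) = (a * x + b) % PRIME.

    With seed=None the generator is seeded from OS entropy, so the
    pairs differ from run to run; with a fixed seed they are identical.
    """
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(siglen)]


def signature(tokens, params):
    """MinHash signature of a token collection: one minimum per hash function."""
    # zlib.crc32 is stable across processes, unlike the built-in hash() on str.
    values = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    # default=PRIME keeps empty documents from raising ValueError.
    return [min(((a * v + b) % PRIME for v in values), default=PRIME)
            for a, b in params]


def minhash_signatures(documents, siglen=8, repeatable=False, seed=6):
    """Hypothetical helper: `seed` is only used when `repeatable` is True."""
    params = make_hash_params(siglen, seed if repeatable else None)
    return [signature(doc.split(), params) for doc in documents]


docs = ["the quick brown fox", "the quick brown dog", "lorem ipsum dolor"]
first = minhash_signatures(docs, repeatable=True)
second = minhash_signatures(docs, repeatable=True)
print(first == second)
```

With repeatable=True both calls draw the same hash parameters and so produce identical signatures. With repeatable=False each call draws fresh parameters, so signatures will generally differ between calls.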

Edited by Simon Chabot
