feat(blocking): add a new parameter 'repeatable' to MinHashingBlocking
This new parameter, 'repeatable', enables one to obtain the same results when
called with the same parameters (documents, signature length, etc.).
The MinHashing technique is based on randomly chosen hash functions. In order to
get the same results on each call (when 'repeatable' is set to True), the random
seed is set to a predetermined value (customizable if need be). With the same
random seed, one should get the same hash functions, and therefore the same
document signatures and the same results.
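
To illustrate the idea, here is a minimal standalone sketch of why a fixed seed
makes the signatures reproducible. It is not the code of MinHashingBlocking:
the names make_hash_functions, signature and PRIME are hypothetical, and the
hash family (a * x + b) mod p is an assumption chosen for brevity.

```python
import random
import zlib

# Hypothetical modulus for the illustrative hash family (Mersenne prime 2**31 - 1).
PRIME = 2_147_483_647


def make_hash_functions(count, seed=None):
    """Draw `count` random (a, b) pairs; a fixed seed yields the same pairs."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def signature(tokens, hash_functions):
    """MinHash signature of a non-empty token collection."""
    # crc32 gives a stable integer per token; the built-in hash() of str
    # is salted per process and would defeat repeatability.
    ids = [zlib.crc32(token.encode("utf-8")) for token in tokens]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_functions]


document = ["the", "quick", "brown", "fox"]

# Same seed: same hash functions, hence the same signature on every call.
first = signature(document, make_hash_functions(8, seed=42))
second = signature(document, make_hash_functions(8, seed=42))
print(first == second)
```

Using a dedicated random.Random instance rather than reseeding the global
generator is a design choice of this sketch: it keeps the repeatable behaviour
from affecting other code that relies on the random module.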
Edited by Simon Chabot