|
Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
|
MinHash signature generator. More...
#include <minhash.H>
Public Member Functions | |
| MinHash (size_t k=128) | |
| Construct with signature size. | |
| void | update (const T &val) |
| Add an element to the set. | |
| template<std::input_iterator Itor> | |
| void | update (Itor beg, const Itor &end) |
| Add all elements in a range to the MinHash signature. | |
| double | similarity (const MinHash &other) const |
| Estimate Jaccard similarity with another signature. | |
| const Array< uint64_t > & | get_signature () const noexcept |
| Returns the current signature. | |
| void | clear () |
| Reset the signature to the initial all-maximum state. | |
| MinHash & | merge (const MinHash &other) |
| Merge another MinHash signature into this one. | |
| size_t | size () const noexcept |
| Number of hash functions used in the signature. | |
Private Attributes | |
| size_t | k_ |
| Signature size. | |
| Array< uint64_t > | signature_ |
| Minimized hash values. | |
| Array< uint32_t > | seeds_ |
| Random seeds for hash functions. | |
MinHash signature generator.
Maintain a signature of size k representing a set. The Jaccard similarity between two sets can be estimated by the fraction of positions where their MinHash signatures agree.
| T | Element type. |
|
inlineexplicit |
Construct with signature size.
| [in] | k | Number of hashes to use (higher k -> better precision). |
| std::domain_error | if k == 0. |
Definition at line 75 of file minhash.H.
References ah_domain_error_if, Aleph::Array< T >::append(), Aleph::dft_hash_fct(), Aleph::divide_and_conquer_partition_dp(), k, Aleph::MinHash< T >::k_, Aleph::Array< T >::reserve(), Aleph::MinHash< T >::seeds_, and Aleph::MinHash< T >::signature_.
|
inline |
Reset the signature to the initial all-maximum state.
| none |
Definition at line 154 of file minhash.H.
References Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.
|
inlinenoexcept |
Returns the current signature.
| none |
Definition at line 145 of file minhash.H.
References Aleph::MinHash< T >::signature_.
Merge another MinHash signature into this one.
Combines two MinHash signatures by taking the element-wise minimum, which corresponds to the union of the underlying sets.
| [in] | other | Another MinHash with the same number of hashes. |
*this after the merge. | std::domain_error | if signature sizes differ. |
Definition at line 170 of file minhash.H.
References ah_domain_error_if, Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.
Estimate Jaccard similarity with another signature.
| [in] | other | The MinHash signature to compare against. |
| std::domain_error | if signature sizes mismatch. |
Definition at line 127 of file minhash.H.
References ah_domain_error_if, Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.
|
inlinenoexcept |
Number of hash functions used in the signature.
| none |
Definition at line 185 of file minhash.H.
References Aleph::MinHash< T >::k_.
Add an element to the set.
| [in] | val | The element to incorporate into the signature. |
| none |
Definition at line 95 of file minhash.H.
References Aleph::divide_and_conquer_partition_dp(), h, Aleph::MinHash< T >::k_, Aleph::murmur3hash(), Aleph::MinHash< T >::seeds_, and Aleph::MinHash< T >::signature_.
Referenced by Aleph::MinHash< T >::update().
|
inline |
Add all elements in a range to the MinHash signature.
Calls update(val) for each element in [beg, end).
| Itor | Iterator whose value type is compatible with update(const T&). |
| [in] | beg | Iterator to the first element. |
| [in] | end | Past-the-end iterator. |
| none |
Definition at line 113 of file minhash.H.
References Aleph::divide_and_conquer_partition_dp(), and Aleph::MinHash< T >::update().
|
private |
Signature size.
Definition at line 66 of file minhash.H.
Referenced by Aleph::MinHash< T >::MinHash(), Aleph::MinHash< T >::clear(), Aleph::MinHash< T >::merge(), Aleph::MinHash< T >::similarity(), Aleph::MinHash< T >::size(), and Aleph::MinHash< T >::update().
Random seeds for hash functions.
Definition at line 68 of file minhash.H.
Referenced by Aleph::MinHash< T >::MinHash(), and Aleph::MinHash< T >::update().
Minimized hash values.
Definition at line 67 of file minhash.H.
Referenced by Aleph::MinHash< T >::MinHash(), Aleph::MinHash< T >::clear(), Aleph::MinHash< T >::get_signature(), Aleph::MinHash< T >::merge(), Aleph::MinHash< T >::similarity(), and Aleph::MinHash< T >::update().