Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
Loading...
Searching...
No Matches
Aleph::MinHash< T > Class Template Reference

MinHash signature generator. More...

#include <minhash.H>

Collaboration diagram for Aleph::MinHash< T >:
[legend]

Public Member Functions

 MinHash (size_t k=128)
 Construct with signature size.
 
void update (const T &val)
 Add an element to the set.
 
template<std::input_iterator Itor>
void update (Itor beg, const Itor &end)
 Add all elements in a range to the MinHash signature.
 
double similarity (const MinHash &other) const
 Estimate Jaccard similarity with another signature.
 
const Array< uint64_t > & get_signature () const noexcept
 Returns the current signature.
 
void clear ()
 Reset the signature to the initial all-maximum state.
 
MinHashmerge (const MinHash &other)
 Merge another MinHash signature into this one.
 
size_t size () const noexcept
 Number of hash functions used in the signature.
 

Private Attributes

size_t k_
 Signature size.
 
Array< uint64_tsignature_
 Minimized hash values.
 
Array< uint32_tseeds_
 Random seeds for hash functions.
 

Detailed Description

template<typename T>
class Aleph::MinHash< T >

MinHash signature generator.

Maintain a signature of size k representing a set. The Jaccard similarity between two sets can be estimated by the fraction of positions where their MinHash signatures agree.

Template Parameters
TElement type.

Definition at line 64 of file minhash.H.

Constructor & Destructor Documentation

◆ MinHash()

template<typename T >
Aleph::MinHash< T >::MinHash ( size_t  k = 128)
inlineexplicit

Construct with signature size.

Parameters
[in]kNumber of hashes to use (higher k -> better precision).
Exceptions
std::domain_errorif k == 0.

Definition at line 75 of file minhash.H.

References ah_domain_error_if, Aleph::Array< T >::append(), Aleph::dft_hash_fct(), Aleph::divide_and_conquer_partition_dp(), k, Aleph::MinHash< T >::k_, Aleph::Array< T >::reserve(), Aleph::MinHash< T >::seeds_, and Aleph::MinHash< T >::signature_.

Member Function Documentation

◆ clear()

template<typename T >
void Aleph::MinHash< T >::clear ( )
inline

Reset the signature to the initial all-maximum state.

Exceptions
none
Note
Complexity: O(k).

Definition at line 154 of file minhash.H.

References Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.

◆ get_signature()

template<typename T >
const Array< uint64_t > & Aleph::MinHash< T >::get_signature ( ) const
inlinenoexcept

Returns the current signature.

Returns
Const reference to the internal Array of k minimized hash values.
Exceptions
none
Note
Complexity: O(1).

Definition at line 145 of file minhash.H.

References Aleph::MinHash< T >::signature_.

◆ merge()

template<typename T >
MinHash & Aleph::MinHash< T >::merge ( const MinHash< T > &  other)
inline

Merge another MinHash signature into this one.

Combines two MinHash signatures by taking the element-wise minimum, which corresponds to the union of the underlying sets.

Parameters
[in]otherAnother MinHash with the same number of hashes.
Returns
Reference to *this after the merge.
Exceptions
std::domain_errorif signature sizes differ.

Definition at line 170 of file minhash.H.

References ah_domain_error_if, Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.

Referenced by TEST(), and TEST().

◆ similarity()

template<typename T >
double Aleph::MinHash< T >::similarity ( const MinHash< T > &  other) const
inline

Estimate Jaccard similarity with another signature.

Parameters
[in]otherThe MinHash signature to compare against.
Returns
Estimated Jaccard index in [0.0, 1.0].
Exceptions
std::domain_errorif signature sizes mismatch.

Definition at line 127 of file minhash.H.

References ah_domain_error_if, Aleph::divide_and_conquer_partition_dp(), Aleph::MinHash< T >::k_, and Aleph::MinHash< T >::signature_.

◆ size()

template<typename T >
size_t Aleph::MinHash< T >::size ( ) const
inlinenoexcept

Number of hash functions used in the signature.

Returns
size_t The value k passed at construction (signature size k_).
Exceptions
none
Note
Complexity: O(1).

Definition at line 185 of file minhash.H.

References Aleph::MinHash< T >::k_.

◆ update() [1/2]

template<typename T >
void Aleph::MinHash< T >::update ( const T val)
inline

Add an element to the set.

Parameters
[in]valThe element to incorporate into the signature.
Exceptions
none
Note
Complexity: O(k).

Definition at line 95 of file minhash.H.

References Aleph::divide_and_conquer_partition_dp(), h, Aleph::MinHash< T >::k_, Aleph::murmur3hash(), Aleph::MinHash< T >::seeds_, and Aleph::MinHash< T >::signature_.

Referenced by Aleph::MinHash< T >::update().

◆ update() [2/2]

template<typename T >
template<std::input_iterator Itor>
void Aleph::MinHash< T >::update ( Itor  beg,
const Itor &  end 
)
inline

Add all elements in a range to the MinHash signature.

Calls update(val) for each element in [beg, end).

Template Parameters
ItorIterator whose value type is compatible with update(const T&).
Parameters
[in]begIterator to the first element.
[in]endPast-the-end iterator.
Exceptions
none
Note
Complexity: O(k * n) where n = distance(beg, end).

Definition at line 113 of file minhash.H.

References Aleph::divide_and_conquer_partition_dp(), and Aleph::MinHash< T >::update().

Member Data Documentation

◆ k_

◆ seeds_

template<typename T >
Array<uint32_t> Aleph::MinHash< T >::seeds_
private

Random seeds for hash functions.

Definition at line 68 of file minhash.H.

Referenced by Aleph::MinHash< T >::MinHash(), and Aleph::MinHash< T >::update().

◆ signature_


The documentation for this class was generated from the following file: