Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
Loading...
Searching...
No Matches
Aleph::SimHash< T > Class Template Reference

SimHash fingerprint generator. More...

#include <simhash.H>

Collaboration diagram for Aleph::SimHash< T >:
[legend]

Public Member Functions

 SimHash ()
 Construct SimHash accumulator.
 
void update (const T &val, const double weight=1.0)
 Add a feature to the set.
 
template<typename Itor >
void update (Itor beg, const Itor &end)
 Add all features in a range to the SimHash accumulator.
 
uint64_t get_fingerprint () const noexcept
 Returns the fingerprint.
 
void clear ()
 Reset the accumulator.
 

Static Public Member Functions

static double similarity (const uint64_t f1, const uint64_t f2) noexcept
 Estimate similarity with another fingerprint.
 

Static Public Attributes

static constexpr size_t FINGERPRINT_SIZE = 64
 Number of bits in the SimHash fingerprint (always 64).
 

Static Private Member Functions

static uint64_t hash64 (const T &val) noexcept
 Compute a 64-bit hash value from a feature.
 

Private Attributes

Array< doublev_
 Accumulator vector for each bit.
 

Detailed Description

template<typename T>
class Aleph::SimHash< T >

SimHash fingerprint generator.

Maintains a 64-bit fingerprint representing a weighted set of features. The similarity between two SimHash values is proportional to the Hamming distance between their fingerprints.

Template Parameters
TFeature type.

Definition at line 65 of file simhash.H.

Constructor & Destructor Documentation

◆ SimHash()

template<typename T >
Aleph::SimHash< T >::SimHash ( )
inline

Construct SimHash accumulator.

Definition at line 98 of file simhash.H.

Member Function Documentation

◆ clear()

template<typename T >
void Aleph::SimHash< T >::clear ( )
inline

Reset the accumulator.

Definition at line 159 of file simhash.H.

References Aleph::SimHash< T >::FINGERPRINT_SIZE, and Aleph::SimHash< T >::v_.

◆ get_fingerprint()

template<typename T >
uint64_t Aleph::SimHash< T >::get_fingerprint ( ) const
inlinenoexcept

Returns the fingerprint.

Definition at line 139 of file simhash.H.

References Aleph::divide_and_conquer_partition_dp(), Aleph::SimHash< T >::FINGERPRINT_SIZE, and Aleph::SimHash< T >::v_.

◆ hash64()

template<typename T >
static uint64_t Aleph::SimHash< T >::hash64 ( const T val)
inlinestaticprivatenoexcept

Compute a 64-bit hash value from a feature.

On 64-bit platforms this uses one murmur3hash() call. On 32-bit platforms it combines two independently-seeded 32-bit hashes.

Definition at line 84 of file simhash.H.

References Aleph::divide_and_conquer_partition_dp(), and Aleph::murmur3hash().

Referenced by Aleph::SimHash< T >::update().

◆ similarity()

template<typename T >
static double Aleph::SimHash< T >::similarity ( const uint64_t  f1,
const uint64_t  f2 
)
inlinestaticnoexcept

Estimate similarity with another fingerprint.

Returns
Estimated similarity in [0.0, 1.0] based on Hamming distance.

Definition at line 152 of file simhash.H.

References Aleph::SimHash< T >::FINGERPRINT_SIZE.

Referenced by TEST().

◆ update() [1/2]

template<typename T >
void Aleph::SimHash< T >::update ( const T val,
const double  weight = 1.0 
)
inline

Add a feature to the set.

Parameters
[in]valThe feature to add.
[in]weightImportance of this feature (default: 1.0).

Definition at line 107 of file simhash.H.

References Aleph::divide_and_conquer_partition_dp(), Aleph::SimHash< T >::FINGERPRINT_SIZE, Aleph::SimHash< T >::hash64(), and Aleph::SimHash< T >::v_.

Referenced by Aleph::SimHash< T >::update().

◆ update() [2/2]

template<typename T >
template<typename Itor >
void Aleph::SimHash< T >::update ( Itor  beg,
const Itor &  end 
)
inline

Add all features in a range to the SimHash accumulator.

Calls update(item.first, item.second) for each element in [beg, end). Each element must expose .first (the feature value) and .second (its weight as double), e.g. std::pair<T, double>.

Template Parameters
ItorIterator whose value type provides .first and .second.
Parameters
[in]begIterator to the first feature.
[in]endPast-the-end iterator.
Note
Complexity: O(FINGERPRINT_SIZE * n) where n = distance(beg, end).

Definition at line 129 of file simhash.H.

References Aleph::divide_and_conquer_partition_dp(), and Aleph::SimHash< T >::update().

Member Data Documentation

◆ FINGERPRINT_SIZE

template<typename T >
constexpr size_t Aleph::SimHash< T >::FINGERPRINT_SIZE = 64
staticconstexpr

Number of bits in the SimHash fingerprint (always 64).

The fingerprint is always a 64-bit value. On platforms where murmur3hash() returns 32-bit size_t, SimHash combines two independent hash evaluations to populate all 64 bits.

Definition at line 74 of file simhash.H.

Referenced by Aleph::SimHash< T >::clear(), Aleph::SimHash< T >::get_fingerprint(), Aleph::SimHash< T >::similarity(), and Aleph::SimHash< T >::update().

◆ v_

template<typename T >
Array<double> Aleph::SimHash< T >::v_
private

Accumulator vector for each bit.

Definition at line 77 of file simhash.H.

Referenced by Aleph::SimHash< T >::clear(), Aleph::SimHash< T >::get_fingerprint(), and Aleph::SimHash< T >::update().


The documentation for this class was generated from the following file: