|
Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
|
SimHash fingerprint generator. More...
#include <simhash.H>
Public Member Functions | |
| SimHash () | |
| Construct SimHash accumulator. | |
| void | update (const T &val, const double weight=1.0) |
| Add a feature to the set. | |
| template<typename Itor > | |
| void | update (Itor beg, const Itor &end) |
| Add all features in a range to the SimHash accumulator. | |
| uint64_t | get_fingerprint () const noexcept |
| Returns the fingerprint. | |
| void | clear () |
| Reset the accumulator. | |
Static Public Member Functions | |
| static double | similarity (const uint64_t f1, const uint64_t f2) noexcept |
| Estimate similarity with another fingerprint. | |
Static Public Attributes | |
| static constexpr size_t | FINGERPRINT_SIZE = 64 |
| Number of bits in the SimHash fingerprint (always 64). | |
Static Private Member Functions | |
| static uint64_t | hash64 (const T &val) noexcept |
| Compute a 64-bit hash value from a feature. | |
Private Attributes | |
| Array< double > | v_ |
| Accumulator vector for each bit. | |
SimHash fingerprint generator.
Maintains a 64-bit fingerprint representing a weighted set of features. The similarity between two SimHash values is proportional to the Hamming distance between their fingerprints.
| T | Feature type. |
|
inline |
|
inline |
Reset the accumulator.
Definition at line 159 of file simhash.H.
References Aleph::SimHash< T >::FINGERPRINT_SIZE, and Aleph::SimHash< T >::v_.
|
inlinenoexcept |
Returns the fingerprint.
Definition at line 139 of file simhash.H.
References Aleph::divide_and_conquer_partition_dp(), Aleph::SimHash< T >::FINGERPRINT_SIZE, and Aleph::SimHash< T >::v_.
Compute a 64-bit hash value from a feature.
On 64-bit platforms this uses one murmur3hash() call. On 32-bit platforms it combines two independently-seeded 32-bit hashes.
Definition at line 84 of file simhash.H.
References Aleph::divide_and_conquer_partition_dp(), and Aleph::murmur3hash().
Referenced by Aleph::SimHash< T >::update().
|
inlinestaticnoexcept |
Estimate similarity with another fingerprint.
Definition at line 152 of file simhash.H.
References Aleph::SimHash< T >::FINGERPRINT_SIZE.
Referenced by TEST().
Add a feature to the set.
| [in] | val | The feature to add. |
| [in] | weight | Importance of this feature (default: 1.0). |
Definition at line 107 of file simhash.H.
References Aleph::divide_and_conquer_partition_dp(), Aleph::SimHash< T >::FINGERPRINT_SIZE, Aleph::SimHash< T >::hash64(), and Aleph::SimHash< T >::v_.
Referenced by Aleph::SimHash< T >::update().
|
inline |
Add all features in a range to the SimHash accumulator.
Calls update(item.first, item.second) for each element in [beg, end). Each element must expose .first (the feature value) and .second (its weight as double), e.g. std::pair<T, double>.
| Itor | Iterator whose value type provides .first and .second. |
| [in] | beg | Iterator to the first feature. |
| [in] | end | Past-the-end iterator. |
Definition at line 129 of file simhash.H.
References Aleph::divide_and_conquer_partition_dp(), and Aleph::SimHash< T >::update().
|
staticconstexpr |
Number of bits in the SimHash fingerprint (always 64).
The fingerprint is always a 64-bit value. On platforms where murmur3hash() returns 32-bit size_t, SimHash combines two independent hash evaluations to populate all 64 bits.
Definition at line 74 of file simhash.H.
Referenced by Aleph::SimHash< T >::clear(), Aleph::SimHash< T >::get_fingerprint(), Aleph::SimHash< T >::similarity(), and Aleph::SimHash< T >::update().
Accumulator vector for each bit.
Definition at line 77 of file simhash.H.
Referenced by Aleph::SimHash< T >::clear(), Aleph::SimHash< T >::get_fingerprint(), and Aleph::SimHash< T >::update().