Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
Loading...
Searching...
No Matches
quotient-filter.H
Go to the documentation of this file.
1
2/*
3 Aleph_w
4
5 Data structures & Algorithms
6 version 2.0.0b
7 https://github.com/lrleon/Aleph-w
8
9 This file is part of Aleph-w library
10
11 Copyright (c) 2002-2026 Leandro Rabindranath Leon
12
13 Permission is hereby granted, free of charge, to any person obtaining a copy
14 of this software and associated documentation files (the "Software"), to deal
15 in the Software without restriction, including without limitation the rights
16 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
17 copies of the Software, and to permit persons to whom the Software is
18 furnished to do so, subject to the following conditions:
19
20 The above copyright notice and this permission notice shall be included in all
21 copies or substantial portions of the Software.
22
23 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
24 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
25 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
26 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
27 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
28 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
29 SOFTWARE.
30*/
31
118#ifndef QUOTIENT_FILTER_H
119#define QUOTIENT_FILTER_H
120
121#include <cstdint>
122#include <cstddef>
123#include <cmath>
124#include <algorithm>
125#include <utility>
126
127#include <ah-errors.H>
128#include <tpl_array.H>
129#include <hash-fct.H>
130
131namespace Aleph {
132
137template <typename T> class Quotient_Filter
138{
144
145 // Per-slot storage packed into a uint64_t:
146 // bit 0: is_occupied (belongs to the SLOT INDEX, not shifted data)
147 // bit 1: is_continuation
148 // bit 2: is_shifted
149 // bits 3..(r+2): remainder
151
152 static constexpr uint64_t OCC_BIT = 1ULL << 0;
153 static constexpr uint64_t CONT_BIT = 1ULL << 1;
154 static constexpr uint64_t SHFT_BIT = 1ULL << 2;
155 static constexpr uint64_t META_MASK = 0x7ULL;
156
158 {
159 return (1ULL << r_) - 1;
160 }
161
162 [[nodiscard]] bool get_occ(size_t i) const noexcept
163 {
164 return (slots_[i] & OCC_BIT) != 0;
165 }
166
167 [[nodiscard]] bool get_cont(size_t i) const noexcept
168 {
169 return (slots_[i] & CONT_BIT) != 0;
170 }
171
172 [[nodiscard]] bool get_shft(size_t i) const noexcept
173 {
174 return (slots_[i] & SHFT_BIT) != 0;
175 }
176
177 // A slot is truly empty when it holds no data and no occupied marker.
178 [[nodiscard]] bool slot_empty(size_t i) const noexcept
179 {
180 return slots_[i] == 0;
181 }
182
183 // A slot holds data if continuation or shifted is set, or if it
184 // is the unshifted start of a run (occupied and not empty remainder
185 // region... actually simplest: it holds data iff it is not
186 // "metadata-only"). A slot is "has element" iff is_shifted or
187 // is_continuation is set, OR (not shifted and not continuation and
188 // the slot is occupied — i.e., it's a run start at its canonical
189 // position). Equivalently: a slot has an element stored iff
190 // is_continuation or is_shifted, or (is_occupied and not shifted
191 // and not continuation), which simplifies to: the slot is not
192 // completely empty AND (if only occupied is set, there IS data
193 // here — it's the canonical run start).
194 //
195 // Actually the precise check: a slot has data iff any of the three
196 // bits is set (since occupied-only slots DO store data: the first
197 // element of an unshifted run).
198 // CORRECTION: that's wrong. A slot can have occupied=1 but hold
199 // another quotient's shifted data. The occupied bit marks that
200 // SOME run for this quotient exists, stored here or elsewhere.
201 //
202 // The simplest correct check: a slot holds element data iff
203 // shifted or continuation is set, OR (it is unshifted and not a
204 // continuation — i.e., it is a cluster start, meaning it stores
205 // the first element of a run at its canonical position).
206 //
207 // Even simpler: a slot is "element-occupied" iff it is not fully
208 // empty (meta != 0) and not "occupied-only" (meta == OCC_BIT).
209 // Wait, that's also wrong. A slot with only OCC_BIT set has the
210 // run start element at its canonical position!
211 //
212 // Let me think again. Slots that have data:
213 // 1. occ=1, cont=0, shft=0: canonical run start (has data)
214 // 2. occ=X, cont=0, shft=1: shifted run start (has data)
215 // 3. occ=X, cont=1, shft=1: shifted continuation (has data)
216 // 4. occ=X, cont=1, shft=0: impossible (continuation must be shifted
217 // unless it's at its canonical position, but continuation means
218 // it's not the first in its run, so it must be shifted)
219 // Actually wait — cont=1, shft=0: this IS possible if the
220 // continuation happens to land at its original quotient's
221 // canonical position... no, cont=1 means it's not the FIRST
222 // in the run, and shft=0 means it IS at its canonical position.
223 // Can this happen? Only if two elements have the same quotient
224 // and the continuation of the run happens to be at the same
225 // canonical slot. But runs are contiguous and sorted by
226 // remainder, and for the same quotient all have the same
227 // canonical slot. So the 2nd element of a run at slot fq+1
228 // would have cont=1, shft=1 (since it's shifted from fq).
229 // Unless fq itself has two elements? No, each slot holds one
230 // element. The run for quotient fq starts at slot fq (if
231 // unshifted), and continues at fq+1, fq+2, etc. with cont=1,
232 // shft=1.
233 //
234 // Bottom line: a slot has data stored iff (cont or shft) is set,
235 // OR (both are clear AND occ is set — canonical unshifted run start).
236 // Equivalently: (slots_[i] & META_MASK) != 0 and it's not the case
237 // that the slot has ONLY the occupied bit with nothing stored.
238 //
239 // SIMPLIFICATION: In practice, a slot "has data" iff
240 // (slots_[i] & (CONT_BIT | SHFT_BIT)) != 0 ||
241 // slot is a canonical start (occ=1, cont=0, shft=0)
242 // Which is the same as (slots_[i] & META_MASK) != 0 (any bit set).
243 // But the problem is occ-only: does occ=1, cont=0, shft=0 ALWAYS
244 // mean data is stored? YES. If occupied is set and the element is
245 // not shifted, it means this slot IS the start of the run, so data
246 // is here.
247 //
248 // FINAL ANSWER: has_element(i) iff (slots_[i] & META_MASK) != 0.
249 // This is the same as !slot_empty(i).
250 [[nodiscard]] bool has_element(const size_t i) const noexcept
251 {
252 return (slots_[i] & META_MASK) != 0;
253 }
254
255 // A slot is a "run start" if it has data and is_continuation is false.
256 [[nodiscard]] bool is_run_start(const size_t i) const noexcept
257 {
258 return has_element(i) and not get_cont(i);
259 }
260
261 void set_occ(const size_t i, const bool v) noexcept
262 {
263 if (v)
264 slots_(i) |= OCC_BIT;
265 else
266 slots_(i) &= ~OCC_BIT;
267 }
268
269 void set_cont(const size_t i, const bool v) noexcept
270 {
271 if (v)
272 slots_(i) |= CONT_BIT;
273 else
274 slots_(i) &= ~CONT_BIT;
275 }
276
277 void set_shft(const size_t i, const bool v) noexcept
278 {
279 if (v)
280 slots_(i) |= SHFT_BIT;
281 else
282 slots_(i) &= ~SHFT_BIT;
283 }
284
285 [[nodiscard]] uint64_t get_rem(const size_t i) const noexcept
286 {
287 return (slots_[i] >> 3) & rem_mask();
288 }
289
290 void set_rem(const size_t i, const uint64_t rem) noexcept
291 {
292 slots_(i) = (slots_[i] & META_MASK) | ((rem & rem_mask()) << 3);
293 }
294
295 // Set the full "element data" portion of a slot (cont, shft, rem),
296 // but preserve the occupied bit which belongs to the slot index.
297 void set_element(const size_t i, const bool cont, const bool shft, const uint64_t rem) noexcept
298 {
299 const bool occ = get_occ(i);
300 slots_(i) = ((rem & rem_mask()) << 3) | (shft ? SHFT_BIT : 0) | (cont ? CONT_BIT : 0) | (occ ? OCC_BIT : 0);
301 }
302
303 // Clear the element data (cont, shft, rem) but preserve occupied.
304 void clear_element(const size_t i) noexcept
305 {
306 const bool occ = get_occ(i);
307 slots_(i) = occ ? OCC_BIT : 0;
308 }
309
310 [[nodiscard]] size_t incr(const size_t i) const noexcept
311 {
312 return (i + 1) & (num_slots_ - 1);
313 }
314
315 [[nodiscard]] size_t decr(const size_t i) const noexcept
316 {
317 return (i - 1) & (num_slots_ - 1);
318 }
319
320 [[nodiscard]] std::pair<size_t, uint64_t> fingerprint(const T &item) const
321 {
322 const size_t h = murmur3hash(item, seed_);
323 const size_t fq = h & (num_slots_ - 1);
324 const uint64_t fr = (static_cast<uint64_t>(h) >> q_) & rem_mask();
325 return {fq, fr};
326 }
327
328 // Find the start of the run whose canonical slot is `fq`.
329 // Precondition: is_occupied(fq) is true.
330 [[nodiscard]] size_t find_run_start(size_t fq) const noexcept
331 {
332 // 1. Walk backward from fq to find the cluster start.
333 size_t j = fq;
334 while (get_shft(j))
335 j = decr(j);
336
337 // 2. Walk forward: for each occupied canonical slot between
338 // j and fq, skip past its run.
339 size_t i = j;
340 while (j != fq)
341 {
342 do // skip current run
343 i = incr(i);
344 while (get_cont(i));
345
346 do // advance to next occupied canonical slot
347 j = incr(j);
348 while (not get_occ(j));
349 }
350 return i;
351 }
352
353 // Find the first slot at or after `start` that has no element data.
354 [[nodiscard]] size_t find_first_unused(size_t start) const noexcept
355 {
356 size_t i = start;
357 while (has_element(i))
358 i = incr(i);
359 return i;
360 }
361
362 // Shift element data one slot to the right, from `pos` up to (but
363 // not including) `empty`. The occupied bit is never shifted — it
364 // belongs to the slot index and is preserved in place.
365 void shift_elements_right(size_t pos, size_t empty) noexcept
366 {
367 while (empty != pos)
368 {
369 const size_t prev = decr(empty);
370 const bool occ_at_empty = get_occ(empty);
371 slots_(empty) = slots_[prev];
372 set_occ(empty, occ_at_empty); // restore occupied at destination
373 set_shft(empty, true); // everything shifted is shifted
374 empty = prev;
375 }
376 }
377
378 enum class Insert_Result
379 {
380 inserted,
381 duplicate,
382 full
383 };
384
385 // Core insert of a pre-computed (fq, fr) pair.
387 {
388 if (num_elems_ >= num_slots_)
389 return Insert_Result::full;
390
391 // Case 1: canonical slot is completely empty.
392 if (slot_empty(fq))
393 {
394 slots_(fq) = OCC_BIT | ((fr & rem_mask()) << 3);
395 ++num_elems_;
397 }
398
399 const bool was_occupied = get_occ(fq);
400 set_occ(fq, true);
401
402 if (was_occupied)
403 {
404 // A run for this quotient already exists.
405 const size_t s = find_run_start(fq);
406
407 // Scan the run for a duplicate or ordered insertion point.
408 size_t pos = s;
409 do
410 {
411 const uint64_t cur = get_rem(pos);
412 if (cur == fr)
414 if (cur > fr)
415 break;
416 pos = incr(pos);
417 } while (get_cont(pos));
418
419 // `pos` is where we insert (before element at pos).
420 // If we scanned past the last continuation without finding
421 // cur > fr, pos is one past the run's end — still correct.
422
423 const size_t empty = find_first_unused(pos);
424 shift_elements_right(pos, empty);
425
426 const bool at_run_start = (pos == s);
427 set_element(pos, not at_run_start, pos != fq, fr);
428
429 // If we displaced the old run start, it becomes a continuation.
430 if (at_run_start)
431 set_cont(incr(pos), true);
432
433 ++num_elems_;
435 }
436
437 // No run for this quotient exists yet. We must find where the
438 // new single-element run should be inserted within the cluster.
439 // Since we already set occupied(fq), find_run_start now locates
440 // the correct position for this new run.
441 const size_t s = find_run_start(fq);
442
443 const size_t empty = find_first_unused(s);
444 shift_elements_right(s, empty);
445
446 set_element(s, false, s != fq, fr);
447 ++num_elems_;
449 }
450
451 // Core delete: shift elements left after removing position `pos`.
452 // The occupied bit is preserved at each index.
453 void shift_elements_left(size_t pos) noexcept
454 {
455 size_t curr = pos;
456 size_t nxt = incr(curr);
457
458 while (has_element(nxt) and get_shft(nxt))
459 {
460 const bool occ_at_curr = get_occ(curr);
461 slots_(curr) = slots_[nxt];
462 set_occ(curr, occ_at_curr);
463
464 // A run start that lands at its canonical slot is no longer shifted.
465 // By the quotient ordering invariant, a run start at position i
466 // where occupied(i) is set must be the run for quotient i.
467 if (not get_cont(curr) and get_occ(curr))
468 set_shft(curr, false);
469
470 curr = nxt;
471 nxt = incr(nxt);
472 }
473
474 // Clear the last vacated slot, preserving its occupied bit.
475 clear_element(curr);
476 }
477
478public:
487 : q_(q), r_(r), num_slots_(0), num_elems_(0), seed_(seed), slots_()
488 {
489 ah_domain_error_if(q < 1 or q > 32) << "Quotient_Filter: q must be in [1, 32] (got " << static_cast<int>(q) << ")";
490 ah_domain_error_if(r < 1 or r > 60) << "Quotient_Filter: r must be in [1, 60] (got " << static_cast<int>(r) << ")";
491 ah_domain_error_if(q + r > 64) << "Quotient_Filter: q + r must be <= 64 (got " << static_cast<int>(q + r) << ")";
492
493 num_slots_ = size_t{1} << q;
495 }
496
504 static Quotient_Filter from_capacity(size_t expected_n, double fp_rate, uint32_t seed = 0x5F3759DF)
505 {
506 ah_domain_error_if(expected_n == 0) << "Quotient_Filter::from_capacity: expected_n must be > 0";
508 << "Quotient_Filter::from_capacity: fp_rate must be in (0, 1) (got " << fp_rate << ")";
509
510 const double slots_needed = static_cast<double>(expected_n) / 0.75;
511 auto q_v = static_cast<uint8_t>(std::max(1.0, std::ceil(std::log2(slots_needed))));
512 auto r_v = static_cast<uint8_t>(std::max(1.0, std::ceil(-std::log2(fp_rate))));
513 if (q_v + r_v > 64)
514 q_v = static_cast<uint8_t>(64 - r_v);
515 return Quotient_Filter(q_v, r_v, seed);
516 }
517
526 {
527 auto [fq, fr] = fingerprint(item);
528 switch (do_insert(fq, fr))
529 {
532 return *this;
534 if (contains(item))
535 return *this;
536 ah_overflow_error() << "Quotient_Filter::insert: filter is full (" << num_elems_ << "/" << num_slots_ << ")";
537 }
538 return *this;
539 }
540
545 [[nodiscard]] bool contains(const T &item) const noexcept
546 {
547 auto [fq, fr] = fingerprint(item);
548
549 if (not get_occ(fq))
550 return false;
551
552 size_t s = find_run_start(fq);
553 do
554 {
555 uint64_t cur = get_rem(s);
556 if (cur == fr)
557 return true;
558 if (cur > fr)
559 return false;
560 s = incr(s);
561 } while (get_cont(s));
562
563 return false;
564 }
565
570 bool remove(const T &item) noexcept
571 {
572 auto [fq, fr] = fingerprint(item);
573
574 if (not get_occ(fq))
575 return false;
576
577 size_t s = find_run_start(fq);
578 size_t pos = s;
579
580 // Locate the remainder within the run.
581 do
582 {
583 if (get_rem(pos) == fr)
584 goto found;
585 if (get_rem(pos) > fr)
586 return false;
587 pos = incr(pos);
588 } while (get_cont(pos));
589 return false;
590
591 found: {
592 const bool was_run_start = not get_cont(pos);
593 size_t nxt = incr(pos);
595
597 set_cont(nxt, false); // promote next to run start
598
600 set_occ(fq, false); // run becomes empty
601
603 --num_elems_;
604 return true;
605 }
606 }
607
610 {
611 return num_elems_;
612 }
613
616 {
617 return num_slots_;
618 }
619
622 {
623 return static_cast<double>(num_elems_) / num_slots_;
624 }
625
628 {
629 return num_elems_ == 0;
630 }
631
634 {
635 return q_;
636 }
637
640 {
641 return r_;
642 }
643
646 {
647 return std::ldexp(1.0, -static_cast<int>(r_));
648 }
649
652 {
653 return seed_;
654 }
655
658 {
659 return num_slots_ * sizeof(uint64_t);
660 }
661
664 {
665 for (size_t i = 0; i < num_slots_; ++i)
666 slots_(i) = 0;
667 num_elems_ = 0;
668 }
669
674 {
675 ah_domain_error_if(q_ != other.q_ or r_ != other.r_) << "Quotient_Filter::merge: parameter mismatch";
676 ah_domain_error_if(seed_ != other.seed_) << "Quotient_Filter::merge: seed mismatch";
677
678 for (size_t canonical = 0; canonical < other.num_slots_; ++canonical)
679 {
680 if (not other.get_occ(canonical))
681 continue;
682
683 size_t s = other.find_run_start(canonical);
684 do
685 {
686 const auto result = do_insert(canonical, other.get_rem(s));
688 << "Quotient_Filter::merge: destination filter is full (" << num_elems_ << "/" << num_slots_ << ")";
689 s = other.incr(s);
690 } while (other.get_cont(s));
691 }
692 }
693
695 [[nodiscard]] static std::pair<uint8_t, uint8_t> estimate(size_t n, double fp_rate)
696 {
697 ah_domain_error_if(n == 0) << "n must be > 0";
698 ah_domain_error_if(fp_rate <= 0.0 or fp_rate >= 1.0) << "fp_rate must be in (0, 1)";
699 const double slots = static_cast<double>(n) / 0.75;
700 auto qv = static_cast<uint8_t>(std::max(1.0, std::ceil(std::log2(slots))));
701 auto rv = static_cast<uint8_t>(std::max(1.0, std::ceil(-std::log2(fp_rate))));
702 if (qv + rv > 64)
703 qv = static_cast<uint8_t>(64 - rv);
704 return {qv, rv};
705 }
706
708 void swap(Quotient_Filter &other) noexcept
709 {
710 std::swap(q_, other.q_);
711 std::swap(r_, other.r_);
712 std::swap(num_slots_, other.num_slots_);
713 std::swap(num_elems_, other.num_elems_);
714 std::swap(seed_, other.seed_);
715 slots_.swap(other.slots_);
716 }
717};
718
719} // end namespace Aleph
720
721#endif // QUOTIENT_FILTER_H
Exception handling system with formatted messages for Aleph-w.
#define ah_overflow_error_if(C)
Throws std::overflow_error if condition holds.
Definition ah-errors.H:463
#define ah_overflow_error()
Throws std::overflow_error unconditionally.
Definition ah-errors.H:495
#define ah_domain_error_if(C)
Throws std::domain_error if condition holds.
Definition ah-errors.H:522
long double h
Definition btreepic.C:154
Simple dynamic array with automatic resizing and functional operations.
Definition tpl_array.H:139
void swap(Array &s) noexcept
Swap this with s
Definition tpl_array.H:227
Quotient filter for probabilistic set membership with deletion.
bool get_occ(size_t i) const noexcept
static constexpr uint64_t CONT_BIT
Array< uint64_t > slots_
size_t decr(const size_t i) const noexcept
size_t incr(const size_t i) const noexcept
Quotient_Filter & insert(const T &item)
Insert an element.
static Quotient_Filter from_capacity(size_t expected_n, double fp_rate, uint32_t seed=0x5F3759DF)
Construct from desired capacity and false positive rate.
double load_factor() const noexcept
Current load factor.
bool has_element(const size_t i) const noexcept
size_t memory_usage() const noexcept
Memory usage in bytes.
uint8_t q() const noexcept
Quotient bits.
bool get_shft(size_t i) const noexcept
bool is_empty() const noexcept
True if empty.
void set_rem(const size_t i, const uint64_t rem) noexcept
double false_positive_rate() const noexcept
Theoretical FP probability ~ 2^{-r}.
void set_shft(const size_t i, const bool v) noexcept
Insert_Result do_insert(size_t fq, uint64_t fr)
size_t capacity() const noexcept
Number of slots (2^q).
uint8_t r() const noexcept
Remainder bits.
uint64_t get_rem(const size_t i) const noexcept
void set_occ(const size_t i, const bool v) noexcept
void merge(const Quotient_Filter &other)
Merge another filter (same q, r, seed).
size_t find_first_unused(size_t start) const noexcept
uint64_t rem_mask() const noexcept
bool is_run_start(const size_t i) const noexcept
size_t size() const noexcept
Number of elements.
bool contains(const T &item) const noexcept
Test membership.
bool get_cont(size_t i) const noexcept
void set_element(const size_t i, const bool cont, const bool shft, const uint64_t rem) noexcept
std::pair< size_t, uint64_t > fingerprint(const T &item) const
static constexpr uint64_t SHFT_BIT
static constexpr uint64_t OCC_BIT
static constexpr uint64_t META_MASK
uint32_t seed() const noexcept
Hash seed.
bool slot_empty(size_t i) const noexcept
Quotient_Filter(uint8_t q, uint8_t r, uint32_t seed=0x5F3759DF)
Construct a quotient filter.
void shift_elements_right(size_t pos, size_t empty) noexcept
static std::pair< uint8_t, uint8_t > estimate(size_t n, double fp_rate)
Estimate (q, r) for n elements and target FP rate.
size_t find_run_start(size_t fq) const noexcept
void set_cont(const size_t i, const bool v) noexcept
bool remove(const T &item) noexcept
Remove an element.
void clear() noexcept
Remove all elements.
void shift_elements_left(size_t pos) noexcept
void clear_element(const size_t i) noexcept
void swap(Quotient_Filter &other) noexcept
Swap in O(1).
Main namespace for Aleph-w library functions.
Definition ah-arena.H:89
and
Check uniqueness with explicit hash + equality functors.
Divide_Conquer_DP_Result< Cost > divide_and_conquer_partition_dp(const size_t groups, const size_t n, Transition_Cost_Fn transition_cost, const Cost inf=dp_optimization_detail::default_inf< Cost >())
Optimize partition DP using divide-and-conquer optimization.
std::decay_t< typename HeadC::Item_Type > T
Definition ah-zip.H:105
size_t murmur3hash(const Key &key, std::uint32_t seed)
Definition hash-fct.H:334
Dynamic array container with automatic resizing.