|
Aleph-w 3.0
A C++ Library for Data Structures and Algorithms
|
Incremental reservoir sampler (Algorithm R). More...
#include <reservoir-sampling.H>
Public Member Functions | |
| Reservoir_Sampler (const size_t k, const unsigned long seed=static_cast< unsigned long >(std::chrono::high_resolution_clock::now().time_since_epoch().count())) | |
| Initialize sampler. | |
| void | update (const T &val) |
| Process a new element from the stream. | |
| template<std::input_iterator Itor, std::sentinel_for< Itor > Sent> | |
| void | update (Itor beg, Sent end) |
| Process multiple elements. | |
| const Array< T > & | get_sample () const noexcept |
| Returns the current random sample. | |
| size_t | total_seen () const noexcept |
| Total elements processed so far. | |
| size_t | sample_size () const noexcept |
| Target sample size. | |
| uint64_t | rng_range () const noexcept |
| RNG output range (max - min + 1). | |
| void | set_n_seen_for_testing (const size_t n) |
| Forcibly set the internal stream counter (for testing only). | |
| void | clear () |
| Reset the sampler while keeping k and seed. | |
Private Types | |
| using | GslRngPtr = std::unique_ptr< gsl_rng, void(*)(gsl_rng *)> |
Private Attributes | |
| size_t | k_ |
| Sample size. | |
| size_t | n_seen_ = 0 |
| Total elements seen so far. | |
| unsigned long | seed_ |
| Stored seed for RNG reset. | |
| uint64_t | rng_min_ |
| Minimum value produced by RNG. | |
| uint64_t | rng_max_ |
| Maximum value produced by RNG. | |
| uint64_t | rng_range_ |
| Cached range of the RNG (max - min + 1). | |
| Array< T > | reservoir_ |
| Current random sample. | |
| GslRngPtr | rng_ |
| Random number generator. | |
Incremental reservoir sampler (Algorithm R).
Maintain a representative random sample of size k from a stream of elements.
This class is not thread-safe. External synchronization is required for concurrent access from multiple threads.
| T | Element type. |
k == 0, no elements are stored and get_sample() returns an empty array. Definition at line 72 of file reservoir-sampling.H.
|
private |
Definition at line 74 of file reservoir-sampling.H.
|
inlineexplicit |
Initialize sampler.
| [in] | k | Size of the sample to maintain. |
| [in] | seed | Optional seed for RNG (defaults to current time). |
| std::runtime_error | if GSL RNG allocation fails (checked via ah_runtime_error_unless). |
Definition at line 93 of file reservoir-sampling.H.
References ah_runtime_error_unless, Aleph::divide_and_conquer_partition_dp(), Aleph::Reservoir_Sampler< T >::k_, Aleph::Reservoir_Sampler< T >::reservoir_, Aleph::Reservoir_Sampler< T >::rng_, Aleph::Reservoir_Sampler< T >::rng_max_, Aleph::Reservoir_Sampler< T >::rng_min_, Aleph::Reservoir_Sampler< T >::rng_range_, and Aleph::Reservoir_Sampler< T >::seed_.
|
inline |
Reset the sampler while keeping k and seed.
Clears the reservoir, resets the element counter, and reseeds the RNG to its original state so the sampler is fully reproducible.
| none |
Definition at line 218 of file reservoir-sampling.H.
References Aleph::divide_and_conquer_partition_dp(), Aleph::Reservoir_Sampler< T >::n_seen_, Aleph::Reservoir_Sampler< T >::reservoir_, Aleph::Reservoir_Sampler< T >::rng_, and Aleph::Reservoir_Sampler< T >::seed_.
|
inlinenoexcept |
Returns the current random sample.
| none |
Definition at line 171 of file reservoir-sampling.H.
References Aleph::Reservoir_Sampler< T >::reservoir_.
|
inlinenoexcept |
RNG output range (max - min + 1).
| none |
Definition at line 192 of file reservoir-sampling.H.
References Aleph::Reservoir_Sampler< T >::rng_range_.
Referenced by TEST().
|
inlinenoexcept |
Target sample size.
| none |
Definition at line 186 of file reservoir-sampling.H.
References Aleph::Reservoir_Sampler< T >::k_.
|
inline |
Forcibly set the internal stream counter (for testing only).
Allows unit tests to exercise overflow and RNG-range-exhaustion paths without having to feed billions of elements. Do not use in production.
| [in] | n | New value for the internal element counter. |
| std::invalid_argument | if the state would be inconsistent. |
Definition at line 201 of file reservoir-sampling.H.
References ah_invalid_argument_if, Aleph::divide_and_conquer_partition_dp(), Aleph::Reservoir_Sampler< T >::k_, Aleph::Reservoir_Sampler< T >::n_seen_, and Aleph::Reservoir_Sampler< T >::reservoir_.
|
inlinenoexcept |
Total elements processed so far.
| none |
Definition at line 180 of file reservoir-sampling.H.
References Aleph::Reservoir_Sampler< T >::n_seen_.
Process a new element from the stream.
| [in] | val | The element to potentially include in the sample. |
| std::overflow_error | if the stream length exceeds size_t or unsigned long capacity. |
| std::runtime_error | if the stream length exceeds the range of the underlying random number generator. |
Definition at line 117 of file reservoir-sampling.H.
References ah_overflow_error_if, ah_runtime_error_unless, Aleph::divide_and_conquer_partition_dp(), Aleph::Reservoir_Sampler< T >::k_, Aleph::Reservoir_Sampler< T >::n_seen_, Aleph::Reservoir_Sampler< T >::reservoir_, Aleph::Reservoir_Sampler< T >::rng_, Aleph::Reservoir_Sampler< T >::rng_range_, and Aleph::upper_bound().
Referenced by TEST(), TEST(), and Aleph::Reservoir_Sampler< T >::update().
|
inline |
Process multiple elements.
| Itor | Input iterator type. |
| Sent | Sentinel type for Itor. |
| [in] | beg | Iterator to the first element of the range. |
| [in] | end | Sentinel to the end of the range. |
| std::overflow_error | if stream length exceeds safe limits. |
| std::runtime_error | if RNG range is exceeded. |
Definition at line 158 of file reservoir-sampling.H.
References Aleph::divide_and_conquer_partition_dp(), and Aleph::Reservoir_Sampler< T >::update().
|
private |
Sample size.
Definition at line 76 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler(), Aleph::Reservoir_Sampler< T >::sample_size(), Aleph::Reservoir_Sampler< T >::set_n_seen_for_testing(), and Aleph::Reservoir_Sampler< T >::update().
|
private |
Total elements seen so far.
Definition at line 77 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::clear(), Aleph::Reservoir_Sampler< T >::set_n_seen_for_testing(), Aleph::Reservoir_Sampler< T >::total_seen(), and Aleph::Reservoir_Sampler< T >::update().
Current random sample.
Definition at line 82 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler(), Aleph::Reservoir_Sampler< T >::clear(), Aleph::Reservoir_Sampler< T >::get_sample(), Aleph::Reservoir_Sampler< T >::set_n_seen_for_testing(), and Aleph::Reservoir_Sampler< T >::update().
|
private |
Random number generator.
Definition at line 83 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler(), Aleph::Reservoir_Sampler< T >::clear(), and Aleph::Reservoir_Sampler< T >::update().
|
private |
Maximum value produced by RNG.
Definition at line 80 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler().
|
private |
Minimum value produced by RNG.
Definition at line 79 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler().
|
private |
Cached range of the RNG (max - min + 1).
Definition at line 81 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler(), Aleph::Reservoir_Sampler< T >::rng_range(), and Aleph::Reservoir_Sampler< T >::update().
Stored seed for RNG reset.
Definition at line 78 of file reservoir-sampling.H.
Referenced by Aleph::Reservoir_Sampler< T >::Reservoir_Sampler(), and Aleph::Reservoir_Sampler< T >::clear().