NAME
    Data::HyperLogLog::Shared - shared-memory HyperLogLog cardinality
    estimator for Linux

SYNOPSIS
        use Data::HyperLogLog::Shared;

        # default precision 14 (16384 registers, ~0.8% std error), anonymous mapping
        my $hll = Data::HyperLogLog::Shared->new;

        $hll->add("alice");
        $hll->add("bob");
        $hll->add("alice");                 # already seen: no register increases

        my $n = $hll->count;                # ~2 (estimated distinct items)

        # bulk add in a single lock acquisition
        my $new = $hll->add_many([ map { "user-$_" } 1 .. 1000 ]);

        # merge another estimator of equal precision (register-wise max)
        my $other = Data::HyperLogLog::Shared->new;
        $other->add_many([ map { "user-$_" } 500 .. 1500 ]);
        $hll->merge($other);
        my $union = $hll->count;            # ~1500 distinct across both

        # share across processes via a backing file
        my $shared = Data::HyperLogLog::Shared->new("/tmp/visitors.hll", 14);

DESCRIPTION
    A HyperLogLog estimator in shared memory: it counts the number of
    distinct items it has seen (a probabilistic distinct-count) in a fixed,
    tiny amount of memory, regardless of how many items pass through it. It
    never stores the items; it keeps only an array of "m = 2**precision"
    single-byte registers (16 KB at the default precision 14) plus a fixed
    ~16 KB region for the cross-process reader table, so the whole mapping
    is about 32 KB at precision 14 (see "mmap_size" in "STATS"). It yields a
    relative standard error of roughly 0.8%.

    Each item is hashed with XXH3; the top "precision" bits of the hash
    select a register, and the position of the first set bit in the
    remaining bits is folded into that register as a running maximum. The
    harmonic mean of the registers, with a small-range linear-counting
    correction, gives the estimate.

    Because the register array lives in a shared mapping, several processes
    share one estimator: any process that opens the same backing file,
    inherits the anonymous mapping across "fork", or reopens a passed memfd,
    sees the others' additions and contributes its own. A write-preferring
    futex rwlock with dead-process recovery guards mutation, so many
    processes may "add" and "count" concurrently. Two estimators of equal
    precision can be combined with "merge" (register-wise max), which gives
    the cardinality of the union of their item sets without double-counting
    overlaps.

    Items are added by their byte content (encode wide/utf8 strings first).
    Linux-only. Requires 64-bit Perl.

METHODS
  Constructors
        my $hll = Data::HyperLogLog::Shared->new($path, $precision);
        my $hll = Data::HyperLogLog::Shared->new;                       # anonymous, precision 14
        my $hll = Data::HyperLogLog::Shared->new(undef, 16);            # anonymous, precision 16
        my $hll = Data::HyperLogLog::Shared->new_memfd($name, $precision);
        my $hll = Data::HyperLogLog::Shared->new_from_fd($fd);

    $path is the backing file ("undef" or omitted for an anonymous mapping).
    $precision is optional and defaults to 14; it must be between 4 and 18
    ("new" croaks otherwise). The register count is "m = 2**precision"
    (precision 4 -> 16 registers, 18 -> 262144), and the relative standard
    error is about "1.04 / sqrt(m)" (precision 14 -> ~0.81%, 16 -> ~0.41%,
    12 -> ~1.63%). When reopening an existing file or memfd, the stored
    precision wins and the caller's argument is ignored. "new_memfd" creates
    a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd"
    reopens one in another process.

  Adding and counting
        my $bumped = $hll->add($item);          # 1 if a register increased, else 0
        my $added  = $hll->add_many(\@items);   # count of items that increased a register
        my $n      = $hll->count;               # estimated distinct items (rounded)
        $hll->clear;                            # reset to empty (count -> 0)

    "add" hashes $item (taken by its bytes; wide characters croak, encode
    first) and updates the relevant register, returning 1 if that register
    increased and 0 otherwise. A return of 0 does not mean the item was seen
    before -- it means its hash did not beat the current register value;
    treat the return only as "did this change the sketch". "add_many" takes
    an array reference and does the whole batch under a single write lock,
    returning how many of its elements increased a register. "count" returns
    the current estimate rounded to the nearest integer.

  Merging
        $hll->merge($other);

    Folds $other's registers into $hll by register-wise maximum, so $hll
    then estimates the cardinality of the union of the two item sets. Both
    estimators must have the same precision ("merge" croaks on a mismatch).
    $other is read under its own lock into a private snapshot first, so
    merging is deadlock-free even if two processes merge each other
    concurrently; $other is not modified.

  Introspection and lifecycle
        $hll->precision; $hll->registers; $hll->count; $hll->stats;
        $hll->path; $hll->memfd; $hll->sync; $hll->unlink;   # or Class->unlink($path)

    "precision" is the configured precision; "registers" is the register
    count "m" ("2**precision"). "sync" flushes the mapping to its backing
    store (a no-op for anonymous and memfd estimators, which have none);
    "unlink" removes the backing file (also callable as
    "Class->unlink($path)"); "path" returns the backing path ("undef" for
    anonymous, memfd, or fd-reopened estimators) and "memfd" the backing
    descriptor -- the memfd of a "new_memfd" estimator or the dup'd fd of a
    "new_from_fd" estimator, and -1 for file-backed or anonymous estimators.

STATS
    stats() returns a hashref: "precision", "registers" (the register count
    "m"), "count" (the current rounded estimate), "ops" (running count of
    mutating operations -- "add", "add_many", "merge", "clear"), and
    "mmap_size" (bytes of the shared mapping).

ACCURACY
    The relative standard error is approximately "1.04 / sqrt(m)" where "m =
    2**precision". At the default precision 14 that is about 0.8%; roughly
    two thirds of estimates fall within one standard error of the truth and
    almost all within three. For very small cardinalities (when the estimate
    is at or below "2.5 * m" and some registers are still zero) the
    estimator switches to linear counting, which is accurate in that range.
    There is no large-range (2^32) correction: with 64-bit hashes the hash
    space is effectively collision-free for any cardinality this is used
    for.

SHARING ACROSS PROCESSES
    The estimator lives in a shared mapping, shared the same three ways as
    the rest of the family: a backing file (every process calls "new($path,
    ...)" on the same path), an anonymous mapping inherited across "fork",
    or a memfd whose descriptor is passed to an unrelated process (over a
    UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened
    with new_from_fd($fd). Because the mapping is shared, every process adds
    into and reads from the same set of registers, so the cardinality
    reflects the union of what all of them have seen.

        # producer and consumer share one estimator with no coordination
        my $hll = Data::HyperLogLog::Shared->new;            # before fork
        unless (fork) { $hll->add_many([ map { "ev-$_" } 1 .. 1000 ]); exit }
        wait;
        print $hll->count, "\n";   # ~1000, counting the child's additions

SECURITY
    The mmap region is writable by all processes that open it. Do not share
    backing files with untrusted processes.

CRASH SAFETY
    Mutation is guarded by a futex-based write-preferring rwlock with
    PID-encoded ownership; if a holder dies, the next contender detects the
    dead owner and recovers. Each register update is a single byte store, so
    a crash leaves the estimator consistent up to the last completed "add".
    Limitation: PID reuse is not detected (very unlikely in practice).

SEE ALSO
    Data::Intern::Shared, Data::SortedSet::Shared,
    Data::SpatialHash::Shared, and the rest of the "Data::*::Shared" family.

AUTHOR
    vividsnow

LICENSE
    This is free software; you can redistribute it and/or modify it under
    the same terms as Perl itself.

