Comment made by: thomasathorne
Hi Alex,
Sorry, my description was a little terse and 'never return the first
ct
members of the collection' is misleading---I meant it can never
return exactly those members (ie. your example would never return
(link: 0 1 2 3 4 5 6 7 8 9)
, though it should have the possibility---very
unlikely of course---of doing that).
The bias I mentioned is statistical so to observe it properly you
should run the function many times---here is the example I've been
using to test this:
(sort-by first (frequencies (into [] cat (repeatedly 10000 #(g/reservoir-sample 5 (range 10))))))
=> ([0 4534] [1 4388] [2 4371] [3 4417] [4 4426] [5 5607] [6 5609] [7 5600] [8 5436] [9 5612])
As you can see, the first half of the vector is under-represented and
the second half is over-represented. A more extreme example is the
fact that the expression
(g/reservoir-sample 1 (range 2))
will always return (link: 1)
and never (link: 0)
.
The patch has the effect of slightly lowering the probability of
swapping out one of the result vector at each iteration of the loop.
The new version is correct in that the probability of each element
being sampled is the same. (You can see this with a simple proof by
induction, I will post that here if you wish.)