Brief explanation for Reservoir Sampling


  • 124

    Problem:

    • Choose k entries from n numbers. Make sure each number is selected with the probability of k/n

    Basic idea:

    • Choose 1, 2, 3, ..., k first and put them into the reservoir.
    • For k+1, pick it with a probability of k/(k+1), and randomly replace a number in the reservoir.
    • For k+i, pick it with a probability of k/(k+i), and randomly replace a number in the reservoir.
    • Repeat until k+i reaches n

    Proof:

    • For k+i, the probability that it is selected and will replace a number in the reservoir is k/(k+i)
    • For a number in the reservoir before (let's say X), the probability that it keeps staying in the reservoir is
      • P(X was in the reservoir last time) × P(X is not replaced by k+i)
      • = P(X was in the reservoir last time) × (1 - P(k+i is selected and replaces X))
      • = k/(k+i-1) × (1 - k/(k+i) × 1/k
      • = k/(k+i)
    • When k+i reaches n, the probability of each number staying in the reservoir is k/n

    Example

    • Choose 3 numbers from [111, 222, 333, 444]. Make sure each number is selected with a probability of 3/4
    • First, choose [111, 222, 333] as the initial reservior
    • Then choose 444 with a probability of 3/4
    • For 111, it stays with a probability of
      • P(444 is not selected) + P(444 is selected but it replaces 222 or 333)
      • = 1/4 + 3/4*2/3
      • = 3/4
    • The same case with 222 and 333
    • Now all the numbers have the probability of 3/4 to be picked

    This Problem <Linked List Random Node>

    • This problem is the sp case where k=1

    P.S. Thanks for @WKVictor for pointing out my mistake!


  • 20
    W

    Your explanation is nice, but there is an artifact in your proof and the example.
    P(k+i is not chosen) = 1- k/(k+i) = i/(k+i) instead of 1/(k+i). Then, your final result in the proof is not k/(k+i) any more.

    As a matter of fact, the probability of X being kept in reservoir should be
    P(X was in the reservoir last time) * (P(k+i is not chosen) + P(k+i is chosen but X is not replaced)) = k/(k+i-1) * (i/(k+i) + k/(k+i) * (k-1)/k) = k/(k+i). In another word, the term P(X was in the reservoir) was neglected in your proof. In your example, i happens to be 1, so the result in the example seems to be correct.

    Another easier way of calculating the probability is P(X was in the reservoir last time) * P(X is not replaced this time) = P(X was in the reservoir last time) * (1- P(X is replaced this time)) = k/(k+i-1) * (1 - k/(k+i) * 1/k) = k/(k+i).


  • 2

    @WTIFS Why don't you just shuffle the array (with a perfect shuffle) and take the first k elements?


  • 0

    @WKVictor Cool! I'll revise my post. Thanks a lot!


  • 2

    @agave

    Reservoir Sampling is typically used when we don't know the exact n
    For this question and the cases where n is clear, shuffle is fine.


  • 3
    B

    Share a proof:

    1.Probability of first k items still in reservoir finally is that of never getting chosen in the process:
    kth item...k+1th item......nth item
    k/(k+1) * (k+1)/(k+2) * ... * (n-1)/n = k/n

    2.Probability of i'th (i is not the first k) item in reservoir finally is that of getting chosen at its round and never getting chosen later:
    k/i * i/(i + 1) * ... * (n-1)/n = k/n


  • 0
    Q
    P(444 is not selected) + P(444 is selected but it replaces 222 or 333)
    = (1 - k/k+i) + (k/k+i * k-1/k)
    = k+i-1/k+i
    here, k = 3, i = 1, so the result is 3/4, but its NOT k/k+i.
    

  • 0
    W

    @qianzhige Please pay attention to the parentheses -- It is k/(k+i), not k/k+i.


  • 3
    M

    This is so smart! Where did you guys learn it? Why I do not know about this at all?!


  • 0

    The best provement I've ever seen, thanks for the clear explanation!


  • 0

    @myanonymos Now you know this : )


  • 0

    @bwv825 Thanks for your appreciation!


  • 0
    P

    @bigoffer4all easy to be understood


  • 6

    If anybody feels hard to understand those notation: k, k+1,k+i, X... like me.
    You can take a look of my explanation:
    https://discuss.leetcode.com/topic/55049/java-solution-with-cases-explain


  • 0
    J
    This post is deleted!

  • 6

    https://gregable.com/2007/10/reservoir-sampling.html

    This is the best explanation I've ever seen.


  • 2
    B

    Thanks for the nice explanation. Following is my implementation

    public class Solution {
        ListNode head = null;
        Random r = new Random();
    
        public Solution(ListNode head) {
            this.head = head;
        }
        
        public int getRandom() {
            int result = this.head.val;
            ListNode node = this.head.next;
            int k =1;
            int i = 1;
            while(node != null){
                double x = r.nextDouble();
                double y = k / (k+i *1.0);           
                if(x <= y){
                    result = node.val;
                }
                
                i++;
                node = node.next;
            }
            
            return result;
        }
    }
    

  • 1
    T

    Thank you for the explanation! Attached the Java code here for reference. Set k = 1 for this case.

    public class Solution {
        private ListNode head;
        private Random rand;
        /** @param head The linked list's head.
            Note that the head is guaranteed to be not null, so it contains at least one node. */
        public Solution(ListNode head) {
            this.head = head;
            this.rand = new Random();
        }
        
        /** Returns a random node's value. */
        public int getRandom() {
            int k = 1;
            ListNode node = this.head;
            int res = node.val;
            int i = 0;
            ArrayList<Integer> reservoir = new ArrayList<Integer>();
            
            while (i < k && node != null) {
                reservoir.add(node.val);
                node = node.next;
                i++;
            }
            i++; // i == k  =>  i == k+1
            while (node != null) {
                if (rand.nextInt(i) < k) {
                    reservoir.set(rand.nextInt(k), node.val);
                }
                i++;
                node = node.next;
            }
            return reservoir.get(0);// or return reservoir when k > 1;
        }
    }
    

  • 1

    @Tong.Liu Thanks for sharing. I would revise the last loop to be like this:

            while (node != null) {
                int j = rand.nextInt(i);
                if (j < k) {
                    reservoir.set(j, node.val);
                }
                i++;
                node = node.next;
            }
    

    It uses one less random API call, which is controversially expensive, and it does the job anyway.

    The sole random number j is used to

    • calculate the probability of k / (k + i)
    • calculate the index of the candidate in the reservoir to be replaced if success

  • 0
    L

    Shouldn't you keep track of the reservoir?
    Consider 123.
    First 1 gets entered into our reservoir.
    Now choose between 2 and 1, let's pick 1 over 2: so in our reservoir we only have 1.
    When 3 enters we have a 2/3 chance of picking 1 and a 1/3 chance of picking 3, but a 0 chance of picking 2. Meaning moving forward, there is no chance of getting 2. A correct reservoir sampling solution would keep track of all the options that haven't been accepted and possible bring one of those elements in.


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.