"Reservoir sampling" seems inefficient

  • 4

    "What if the linked list is extremely large and its length is unknown to you?"

    Well, then I simply count first. Silly question :-P

    The solution below takes about 180 ms, while the "reservoir sampling" Python solutions posted by others all take about 400 ms or more, presumably due to the many costly calls to the random number generator. So they seem rather inefficient.

    import random

    class Solution(object):
        def __init__(self, head):
            self.head = head

        def getRandom(self):
            # Count the nodes, then advance to a uniformly random index.
            n = 0
            node = self.head
            while node:
                n += 1
                node = node.next
            node = self.head
            for _ in range(random.randrange(n)):
                node = node.next
            return node.val
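
    For comparison, here is a sketch of the reservoir-sampling approach being discussed (the class name ReservoirSolution and the ListNode helper are my own; LeetCode supplies its own node class). It makes one random call per node, which is what the timing difference above suggests is the bottleneck:

```python
import random

class ListNode(object):
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

class ReservoirSolution(object):
    def __init__(self, head):
        self.head = head

    def getRandom(self):
        # Classic reservoir sampling with reservoir size 1: the ith node
        # (1-indexed) replaces the current pick with probability 1/i.
        pick = None
        i = 0
        node = self.head
        while node:
            i += 1
            if random.randrange(i) == 0:  # true with probability 1/i
                pick = node.val
            node = node.next
        return pick
```

    Note that this makes n calls to random.randrange per query, versus a single call in the counting solution above.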

  • 0

    @StefanPochmann Any reason you don't count it in __init__? It's not like the linked list can change after init.

  • 1

    @o_sharp Counting in __init__ is slow if I do get a long list but getRandom never gets called. What I did at first was to determine the length only the first time getRandom is called, but that turned out to be only slightly faster than just counting every time (~170 ms instead of ~180 ms). And my point here isn't my solution but that "reservoir sampling" seems to be an inefficient idea here and that the problem should maybe be modified to make it a good one. So I went with counting every time, as that is shorter/simpler and still makes my point.
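
    The lazy variant mentioned above might look like this (a sketch with made-up names LazySolution and ListNode, not the exact code that was benchmarked): the length is computed on the first getRandom call and cached for later calls.

```python
import random

class ListNode(object):
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

class LazySolution(object):
    def __init__(self, head):
        self.head = head
        self.n = None  # length not computed yet

    def getRandom(self):
        if self.n is None:  # first call: one counting pass, then cached
            n = 0
            node = self.head
            while node:
                n += 1
                node = node.next
            self.n = n
        node = self.head
        for _ in range(random.randrange(self.n)):
            node = node.next
        return node.val
```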

  • 1

    @StefanPochmann I thought about the same thing, but in the end I still think reservoir sampling is very useful. It's just impossible to test using the OJ. In real life, you may not love the idea of doing a complete second pass because of the remote, database, or hard-disk overhead.

  • 0

    @o_sharp Great, I was hoping someone would bring that up. While it does "seem" inefficient when looking at the runtimes here, it could indeed be better in other situations. But I hadn't seen anybody talk about that other than @wei88 in a comment:

    Maybe it is easier to paraphrase it as "its length may be varying". In this way the fixed-length solution (using extra space) won't work well any more.

    Though I think the current format, with __init__ and a fixed list, was intentional, to allow exploiting the fact that the list doesn't change.

    I think it's possible for the OJ to make reservoir sampling look good by artificially making it slow to access the "next" node, simulating slow disk access. This could be done by having the OJ provide a special Node class with a next method. It might not be worth the trouble, though.
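
    One way such an OJ node could look, sketched with hypothetical names (SlowNode, delay, link): every link traversal goes through a method that can charge an artificial cost, so each extra pass over the list shows up in the runtime.

```python
import time

class SlowNode(object):
    """Hypothetical OJ-provided node: traversing a link costs real time."""
    def __init__(self, val, delay=0.0):
        self.val = val
        self.delay = delay
        self._next = None

    def next(self):
        time.sleep(self.delay)  # simulate slow disk / remote access
        return self._next

def link(values, delay=0.0):
    # Build a SlowNode list from an iterable of values; returns the head.
    head = prev = None
    for v in values:
        node = SlowNode(v, delay)
        if prev is None:
            head = node
        else:
            prev._next = node
        prev = node
    return head
```

    With a nonzero delay, a two-pass counting solution would pay roughly twice the traversal cost of a one-pass reservoir solution.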

    Another challenge: I think all solutions so far need either O(n) time or O(n) space. Can you come up with a solution that's better than O(n) time and better than O(n) space?

  • 3

    @StefanPochmann How about this? Choose sqrt(n) as the bucket size, where each bucket is a linked list. To get the kth element, go to bucket number k/sqrt(n) and walk to its (k%sqrt(n))th element.

    The data structure takes O(sqrt(n)) space. It takes O(sqrt(n)) time to get a random element.
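
    A sketch of that idea (the names BucketSolution, anchors, and step are my own): one O(n) setup pass stores a reference to every step-th node, where step is about sqrt(n). Each query then jumps to the nearest stored anchor and walks at most sqrt(n) links, so getRandom is O(sqrt(n)) time with only O(sqrt(n)) extra space.

```python
import math
import random

class ListNode(object):
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

class BucketSolution(object):
    def __init__(self, head):
        # One O(n) setup pass: count the nodes...
        n = 0
        node = head
        while node:
            n += 1
            node = node.next
        self.n = n
        self.step = max(1, int(math.sqrt(n)))  # bucket size ~ sqrt(n)
        # ...and remember every step-th node: O(sqrt(n)) extra space.
        self.anchors = []
        node, i = head, 0
        while node:
            if i % self.step == 0:
                self.anchors.append(node)
            node = node.next
            i += 1

    def getRandom(self):
        # Jump to the anchor of bucket k // step, then walk k % step links:
        # at most ~sqrt(n) steps per query.
        k = random.randrange(self.n)
        node = self.anchors[k // self.step]
        for _ in range(k % self.step):
            node = node.next
        return node.val
```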

  • 0

    @o_sharp Yep, that's good.

  • 0

    @StefanPochmann It is an amazing design~ Lazy initialization~cool

  • 3

    @StefanPochmann to make reservoir sampling look good, I think it would be better to change the question to getting k random elements from the list, not just one.
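
    For that variant, reservoir sampling earns its keep. Algorithm R keeps the first k items, then replaces a random reservoir slot with item i (0-indexed) with probability k/(i+1), all in one pass. A sketch, with a made-up function name:

```python
import random

def reservoir_sample(iterable, k):
    """Return k items chosen uniformly at random from iterable in one pass
    (Algorithm R). Works even when the total length is unknown upfront."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # j is uniform over 0..i
            if j < k:                    # i.e. with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```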

  • 0

    I still can't understand why it costs much less time to count n in getRandom() compared to counting in __init__. @StefanPochmann said it would be time-consuming if getRandom() never gets called. However, in the actual situation, __init__ is called only once, while getRandom may be called thousands of times. Thus, you count the whole list thousands of times, which is a complete waste of time. Am I right?

  • 0

    GeeksforGeeks has a nice analysis of getting k random elements.

    The fixed-size solution will cost O(k * n) to generate k random elements, but reservoir sampling picks k random elements in O(n).

  • 0

    Sure. But for this case, k = 1, so O(k*n) == O(n).


  • 0

    @parvez.h.shaikh said in "Reservoir sampling" seems inefficient:

    "I compute length"

    Where? I don't see it.

    "Also why dwell in specific random number generation when they're present in libraries :-)"

    Huh? I am using a library function, and a more appropriate one than you are. You're the one making it complicated.

  • 0

    Follow-up: if the list has a cycle, how could we do this?
    Simply use a hashset, or are there other ideas?
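
    A sketch of the hashset idea (get_random_with_cycle and the ListNode helper are made-up names): record each visited node's identity, stop at the first repeat, then pick uniformly from the values collected.

```python
import random

class ListNode(object):
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

def get_random_with_cycle(head):
    # Walk until we hit None or a node we've already seen,
    # collecting each distinct node's value along the way.
    seen = set()
    vals = []
    node = head
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        vals.append(node.val)
        node = node.next
    return random.choice(vals)
```

    This costs O(n) extra space for the set; an O(1)-space alternative would first find the cycle length with Floyd's cycle detection and count from there.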
