Subject: More reservoir sampling
Path: you​!your-host​!ultron​!uunet​!batcomputer​!plovergw​!shitpost​!mjd
Date: 2018-02-16T16:30:05
Newsgroup: talk.mjd.reservoir-sampling
Message-ID: <11f9d0d7d8212b12@shitpost.plover.com>
Content-Type: text/shitpost

Regarding my recent discussion of algorithms to select exactly half the items from a finite but unbounded stream: Although I am persuaded by the argument I gave that there is no algorithm better than loading the whole thing into a buffer first, I still find this quite counterintuitive, and I feel that if there is not some wrinkle that none of us have appreciated yet, then my intuition here is in need of a significant correction.

In the original problem, we had a stream of !!2N!! items, but we didn't know what !!N!! was. We were required to select exactly !!N!! of them for output, with each possible subset appearing with probability !!{2N\choose N}^{-1}!!. The question was essentially “is there a method for doing this that requires less memory than just sucking the whole stream into a buffer”, and the answer was “no”.
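
For concreteness, the buffer-everything method is something like the following minimal Python sketch. The name select_exact_half is mine, and it assumes the stream really does contain an even number of items:

    import random

    def select_exact_half(stream):
        # Buffer-everything baseline: read all 2N items into memory, then
        # choose N of them uniformly at random, so that every N-subset
        # comes out with probability 1/C(2N, N).
        buffer = list(stream)            # memory for the entire stream
        n = len(buffer) // 2             # assumes the stream has even length 2N
        return random.sample(buffer, n)  # uniform over all N-element subsets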

Let's take what appears to be a much easier version of the same problem. First, instead of requiring the outputs be equiprobable, we will only require that each possible output set appear with some positive probability.

Let's also say that instead of selecting exactly half the items, we are only required to select !!\lfloor\log_{10} N\rfloor!! of them. (We can even constrain things further by requiring !!N!! to be a power of 10, but I don't think it makes a difference.) So:

  • We are only required to emit !!\lfloor\log_{10} N\rfloor!! of the !!N!! inputs; say, if there are ten billion, we only need to select and emit ten.

  • We are not required to select the possible outputs with any particular probability, as long as there is some possibility of producing each possible selection.

  • The only memory constraint is that we are not allowed to read the entire input into a buffer in every case. There must be some !!N!! for which, when the input contains !!N!! items, we have a chance of using memory for at most !!N-1!! items.

It seems incredible to me that there is no way to solve this problem, but I think that is so.

One possible strategy is to start producing output before reading all the input. I am pretty confident that this can never work. Say that after reading the first !!k!! items, we emit some item !!R!!. We have foreclosed the possibility of producing an output that omits !!R!!, but that itself is no problem, as long as we don't do it every time. The real problem is: we have also foreclosed the possibility of producing an output that omits all of the first !!k!! items, and that possibility has been foreclosed on every run. If the input contains fewer than !!10^k!! items we can escape detection, but if it has more than that we have screwed the pooch. So it can never work to emit items before reading the end of the stream.

Now say we have just read the first item. We can store it, emit it, or discard it. Emitting it immediately would be a failure, as I just discussed. And discarding it is also a failure, because if we discard it immediately we might later discover that !!N\ge 10!!, and that we were required to have a positive probability of emitting item 1. So we must store it, at least temporarily.

Now we read the second item. Again, emitting it and discarding it are losers, for the same reasons as before. Emitting 1 now is just as bad as doing it before would have been. Overwriting item 1 with item 2 is a little better, as long as we do it stochastically. (This is what makes ordinary reservoir sampling work.) But it forecloses the possibility of emitting both 1 and 2, and if !!N\ge100!! we are required to have had a chance of doing that. So the only remaining choice is to store 2 along with 1.
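
For comparison, when the number of items to select is a fixed !!k!! known in advance, this stochastic-overwrite idea is exactly what ordinary reservoir sampling (Algorithm R) does, using memory for only !!k!! items. A minimal Python sketch, with names of my own choosing:

    import random

    def reservoir_sample(stream, k):
        # Classic reservoir sampling (Algorithm R): maintains a uniform
        # random k-subset of the items seen so far, in O(k) memory.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)       # store the first k items outright
            else:
                j = random.randrange(i + 1)  # overwrite stochastically:
                if j < k:                    # item i replaces a stored item
                    reservoir[j] = item      # with probability k/(i+1)
        return reservoir

The essential difference is that !!k!! is fixed independently of !!N!!; as soon as the required output size has to grow with !!N!!, the foreclosure argument applies.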

Say we have stored the first 9 items, and we have just read the 10th. Now the situation is a little different. We know, for the first time, that the output must include at least one item. But we still can't select one of these ten items to emit immediately, for the same reasons as before: an algorithm that does that has zero chance to emit, say, just !!\{s_{43}, s_{57}\}!!, which it is required to do with positive probability if !!N=100!!. And we can't select an item and discard it, because an algorithm that does that has zero chance to emit exactly !!\{s_1, \ldots, s_{10}\}!!, which it is required to do with positive probability if !!N=10^{10}!!.

How long do we have to hold onto those first ten items? We certainly can't do anything about them until we resolve the question of whether !!N\ge10^{10}!!. If we hit the end of the stream, we know what to do, but the whole point was to not hold onto them until we got all the way to the end. And the situation isn't any better once we get out to the !!10^{10}!!th item, just as the situation with item 1 was no better once we got to the 10th item. We have to store all of the first !!10^{10}!! items, against the possibility that the input might contain at least !!10^{10^{10}}!! items so that our output has a chance to contain all of the first !!10^{10}!!.

If we are required to select, say, one million items from an unbounded stream, we can do it with a constant amount of memory, regardless of the size of the stream. But if we are required to select !!\frac N2!! or !!\sqrt N!! or !!\log N!! items, even if that is a tiny tiny fraction of the whole, even an inverse Ackermann function or something, there is no way to do it without storing the entire stream in an array. I am amazed.