Will Murphy's personal home page

Closing the Loop on Go Iterators

The alternatives to Go iterators (AKA the Go range-over-func experiment) are all worse.

The situation I have is: I have a struct that represents a large collection of large structs, and I have a few reasonable requirements:

  1. Encapsulation: clients of this package should not be able to mess with the collection; its members are validated and sorted in some way
  2. Iteration: clients of this package need to be able to make a subset of the collection, hunt through the collection for a specific item, etc. In general, clients need to be able to loop over the collection.
  3. Performance: the collection and its members are large, but the application is time and memory sensitive, so I don’t want to copy large slices
  4. Simplicity to consume: clients of this package should not hate me.

We’ll look at some things I could try to meet these requirements, and see why, without Go iterators, all of them fall short in some way.

Option 1: Make the backing slice public

I could just say:

type BigCollection struct {
    BackingSlice []BigStruct
}

Calling this couldn’t be easier. The client can just say:

for _, v := range collection.BackingSlice {
    // make filtered collection
}

This is silly because it breaks encapsulation. The client can just do:

collection.BackingSlice = someSliceThatBreaksInvariants // oh no!

This method is out because it hands clients of the package the ability to mutate a collection that’s not theirs. Let’s try again!

Option 1 fails encapsulation. Can we do better?

Option 2: Return a reference to a private slice

If I can’t make the slice public, can I just make it private?

type BigCollection struct {
    backingSlice []BigStruct
}

func (b BigCollection) All() []BigStruct {
    return b.backingSlice
}

This looks better, because the client can’t assign to BackingSlice any more, but it still has a pretty big problem: the client can mess with elements in the backing slice.

toMessWith := collection.All()
toMessWith[18] = someValueThatShouldntBeThere // oh no!

(Don’t believe me? Check out the playground.)

So option 2 still fails encapsulation. Can we do better?

Option 3: Copy the backing slice

What if we give the client a copy of the backing slice?

type BigCollection struct {
    backingSlice []BigStruct
}

func (b BigCollection) All() []BigStruct {
    result := make([]BigStruct, len(b.backingSlice)) // holy allocation batman!
    copy(result, b.backingSlice)
    return result
}

This keeps encapsulation - each time you call All() you get your own, brand new chunk of memory and if you want to break my package’s invariants there, well, you’re not in my package any more so you do you.

So option 3 preserves encapsulation, but fails the performance requirement.

Option 4: Return a channel

Go already has a construct that’s sort of like a stream: a channel! A channel is really for moving data between goroutines, but it can yield one item at a time. Performance!

type BigCollection struct {
    backingSlice []BigStruct
}

func (b BigCollection) All() <-chan BigStruct {
    result := make(chan BigStruct)
    go func() {
        defer close(result) // this will always get called, right? :)
        for _, v := range b.backingSlice {
            result <- v
        }
    }()

    return result
}

This looks fine, until a client exits early.

for v := range collection.All() {
    if youreTheOneForMe(v) {
        break // oops! Leaked a goroutine
    }
}

So now instead of allocating a giant slice, we’ll have a goroutine blocking forever.

Option 4 fails the performance criterion, because in the very common case of looping over a collection until you find the item you want, it leaks a goroutine.
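
The leak above is easy to demonstrate. Here’s a minimal, self-contained sketch (with a concrete value check standing in for the hypothetical youreTheOneForMe): after breaking out of the loop, runtime.NumGoroutine shows the sender is still alive, parked on its next send.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type BigStruct struct{ Value int }

type BigCollection struct{ backingSlice []BigStruct }

func (b BigCollection) All() <-chan BigStruct {
	result := make(chan BigStruct)
	go func() {
		defer close(result)
		for _, v := range b.backingSlice {
			result <- v
		}
	}()
	return result
}

// leaks reports whether abandoning the channel early leaves the sender goroutine alive.
func leaks() bool {
	collection := BigCollection{backingSlice: []BigStruct{{1}, {2}, {3}}}
	before := runtime.NumGoroutine()
	for v := range collection.All() {
		if v.Value == 1 {
			break // stand-in for youreTheOneForMe: bail on the first match
		}
	}
	time.Sleep(10 * time.Millisecond) // give the sender time to finish (it can't)
	return runtime.NumGoroutine() > before
}

func main() {
	fmt.Println(leaks()) // true: the sender is blocked forever on result <- v
}
```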

Option 5: Take a context, return a channel

Give the client a way to tell us that they’re done, then we won’t leak the channel!

import "context"

type BigCollection struct {
	backingSlice []BigStruct
}

func (b BigCollection) All(ctx context.Context) <-chan BigStruct {
	result := make(chan BigStruct)
	go func() {
		defer close(result) // this will always get called, right? :)
		for _, v := range b.backingSlice {
			select {
			case <-ctx.Done():
				return
			case result <- v:
                // successful send, go around again
			}
		}
	}()

	return result
}

This is the best option so far, but callers have to remember to cancel the context.

Correct caller:

func getTheBigStructForMe(collection BigCollection) BigStruct {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel() // caller must remember this; incorrect caller still compiles
    for v := range collection.All(ctx) {
        if youreTheOneForMe(v) {
            return v
        }
    }
    return BigStruct{}
}

You can definitely have an incorrect caller, though:

	for v := range collection.All(context.TODO()) {
		if v.Value == 2 {
			found = v
			break // goroutine leaks forever; no one called cancel!
		}
	}

This option isn’t terrible; it more or less meets the criteria:

  1. It has encapsulation - caller can’t mess up collection
  2. Client can iterate through some or all of the items
  3. No one has to re-allocate a giant slice just to improve encapsulation
  4. It’s not bad to consume

I think we could almost stop here, but there are still two drawbacks:

  1. We’re using a concurrency primitive (chan) on the API boundary even though the problem isn’t concurrent. The goroutine that sends on the returned channel accesses the collection concurrently with the caller, and now I have to worry about thread safety.
  2. It’s still possible to consume the package wrong and leak a goroutine. Yes, the client of the library is wrong in this case. But I don’t want them to be wrong. I want to write my package so that, as far as possible, erroneous use of it doesn’t compile.

Option 6: Build my own iterator!

Fine! I’ll do it myself:

func (b BigCollection) GetIterator() MyIterator {
    return newIterator(b)
}

type MyIterator interface {
    Next() (bool, BigStruct)
}

type iterator struct {
    b     *BigCollection
    index int
}

func newIterator(b BigCollection) MyIterator {
    return &iterator{
        b:     &b,
        index: -1,
    }
}

func (i *iterator) Next() (bool, BigStruct) {
    if i.index >= len(i.b.backingSlice)-1 {
        return false, BigStruct{}
    }
    i.index++
    return true, i.b.backingSlice[i.index]
}

Then, to call it, I can’t do this:

for done, next := range collection.GetIterator() // does not compile!
// only maps, slices, arrays, strings, channels, and integers can be ranged over (and soon functions)

So what I have to do is this:

i := collection.GetIterator()
for {
    notDone, current := i.Next()
    if !notDone { // really? We're not notDone? What's a better name here?
        break
    }
    if youreTheOneForMe(current) {
        return current
    }
}

This is a lot of code to get right, in my opinion. To illustrate, here are some questions:

  1. If Next() returns false, should it be safe to use the value we just got back? In other words, does break go before or after I use the current value?
  2. Should a false in the first position mean we’re done, or mean we still have stuff left?
  3. What should happen if I ask for the iterator on an empty collection?
  4. Is the index handling in the custom iterator correct? Is it clear? Would you do it that way?

Option 7: Visitor Pattern

So building my own iterator is annoying. Let’s just take a callback! Then we can call the callback once for each item in the collection, and never have to copy or expose the collection:

func (b BigCollection) Visit(visitor func(BigStruct) error) error {
    for _, v := range b.backingSlice {
        err := visitor(v)
        if err != nil {
            return err
        }
    }
    return nil
}

Now I want to go through this collection and look for the one I want (playground):

func findInCollection(collection BigCollection) BigStruct {
    var theOneIWant BigStruct
    visitor := func(v BigStruct) error {
        if youreTheOneForMe(v) {
            theOneIWant = v
        }
        return nil
    }
    collection.Visit(visitor)
    return theOneIWant
}

This is not terrible, but I still have some thoughts:

  1. Should my visitor return an error? If so, does returning an error halt iteration? (sometimes it depends on which error)
  2. It requires closures to get a value out of the visitor. The caller ends up declaring a func literal that closes over its result variable. Not a huge deal, just not my favorite. Since Go is supposed to be a simple language, shouldn’t we be able to just use a for loop?
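
On question 1, one conventional answer is a sentinel error: the visitor returns a package-defined “stop” value, and Visit treats it as a clean early exit rather than a failure. Here’s a sketch, where errStop is a made-up sentinel name (not from the original package):

```go
package main

import (
	"errors"
	"fmt"
)

type BigStruct struct{ Value int }

type BigCollection struct{ backingSlice []BigStruct }

// errStop is a hypothetical sentinel: visitors return it to halt iteration early.
var errStop = errors.New("stop iteration")

// Visit calls visitor for each element, stopping cleanly on errStop
// and propagating any other error.
func (b BigCollection) Visit(visitor func(BigStruct) error) error {
	for _, v := range b.backingSlice {
		if err := visitor(v); err != nil {
			if errors.Is(err, errStop) {
				return nil // early stop is not a failure
			}
			return err
		}
	}
	return nil
}

func main() {
	collection := BigCollection{backingSlice: []BigStruct{{1}, {2}, {3}}}
	var visited []int
	collection.Visit(func(v BigStruct) error {
		visited = append(visited, v.Value)
		if v.Value == 2 {
			return errStop // found it; stop visiting
		}
		return nil
	})
	fmt.Println(visited) // [1 2]
}
```

This mirrors how some standard-library walkers (e.g. fs.SkipAll for fs.WalkDir) signal early termination.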

Ok fine, how would an iterator look?

We’ve tried a lot of gymnastics to avoid using the range-over-func feature of Go. Let’s see how bad it would look to just use the new feature.

import "iter"

func (b BigCollection) All() iter.Seq[BigStruct] {
    return func(yield func(BigStruct) bool) {
        for _, v := range b.backingSlice {
            if !yield(v) {
                break
            }
        }
    }
}

And using it:

for v := range collection.All() {
    if youreTheOneForMe(v) {
        return v
    }
}

I think this is my favorite solution on the list:

  1. Encapsulation - client never has to know about backingSlice at all
  2. Iteration - client gets to use a simple for loop
  3. Performance - no extra allocations, no concurrency primitives, no goroutine leaks
  4. Simple to consume - the client just writes a for loop. No “you have to look up the calling convention for this homemade Next() implementation.” No “you leaked a goroutine because you’re holding it wrong.” It keeps the client code simple.

Conclusion

Is Go evolving in the wrong direction because of this feature?

That post laments:

It is sad that Go started evolving in the direction of increased complexity and implicit code execution.

I find this sentiment a little surprising, for a few reasons:

  1. The iterator solution is as simple as the simplest solutions described above for clients of BigCollection
  2. The implementation is a little more complex on the BigCollection implementer side, but I’d rather have complexity as a library maintainer than as a library user, and the amount of complexity is low.
  3. The other choices above include channel sends, which are hardly free of implicit code and complexity.

I think the places where the iterator is beneficial are numerous and important:

  1. The collection is large and we don’t want to copy it or expose it for modification (the examples above)
  2. The iterator exposes a non-terminating series (a probability distribution, a mathematical sequence, etc.)
  3. The iterator exposes a non-array-like data structure, e.g. a tree, that should still be enumerable
  4. The same collection is exposed for different methods of iteration, e.g. iterating by priority, by classification, or by some filtered subset

On the other hand, I think Valialkin (the author of the post linked above) has one very important point: all of these things feel kind of object oriented. They push Go from a language where you have to write for every time you want to add another factor of N to your time complexity, toward a language with convenient ways to write down business rules that hide that complexity.

That said, I think all the workarounds are worse.

Anyway, that was a long one. Till next time, happy learning!
– Will

Discussion

Love this post? Hate it? Is someone wrong on the internet? Is it me? Please feel free to @ me on Mastodon

I used to host comments on the blog, but have recently stopped. Previously posted comments will still appear, but for new ones please use the mastodon link above.

Join the conversation!