Generators in Python

Posted on Thu 30 July 2020 in Data Science • 4 min read

Generators are a special type of function in Python that let you 'lazy load' data; a function becomes a generator when its body contains a yield statement. Lazy loading means accessing only the portion of a data set that you are interested in (eg, the part you are working with), as opposed to loading the entire data set. While this gives only marginal gains in efficiency/speed on small data sets, it can provide dramatic improvements on larger data sets, especially if you are using a system with limited memory.
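As a minimal sketch of what 'lazy' means here (count_up_to and the events list are hypothetical names used only for illustration), the body of a generator function does not run when you call it, only when a value is requested:

```python
events = []  # records when the generator body actually runs

def count_up_to(limit):
    """Yield 1, 2, ..., limit one value at a time."""
    events.append("body started")
    current = 1
    while current <= limit:
        yield current  # pause here until the next value is requested
        current += 1

numbers = count_up_to(3)        # creates a generator object; the body has not run yet
ran_before_next = bool(events)  # False: nothing has executed so far
first = next(numbers)           # runs the body up to the first yield
second = next(numbers)          # resumes right after the yield
print(ran_before_next, first, second)
```

Calling count_up_to(3) only builds the generator object; each next call then does just enough work to produce one value.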

Before we dive into some examples of where you could use generators, let's look at where you should not use generators:

  • the data needs to be accessed multiple times (eg, join)
  • you need to access the data randomly (or in any other order that's not strictly forward)
  • using a different compiler
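The first two points can be sketched as follows (squares is a hypothetical generator used only for illustration): a generator can be consumed once, in forward order, and does not support indexing.

```python
def squares(n):
    """Yield the squares of 0 .. n-1."""
    for i in range(n):
        yield i * i

gen = squares(4)
first_pass = list(gen)   # consumes every value
second_pass = list(gen)  # the generator is now exhausted, so this is empty

indexing_failed = False
try:
    squares(4)[0]        # generators cannot be indexed
except TypeError:
    indexing_failed = True

# If repeated or random access is needed, materialise the values once
reusable = list(squares(4))
print(first_pass, second_pass, indexing_failed, reusable[2])
```

If you find yourself converting a generator to a list straight away, a plain list-returning function is probably the simpler choice.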

Some examples where you should use generators are:

  • you don't know if you'll need all the results
  • you don't want to allocate all the results into memory
  • you don't know how many results there may be (eg, user interaction)
  • a potentially infinite series

A great example use is a search. It could be implemented to gather all the matching elements and return them only after the search is complete; with a generator, you can instead return the results as they are found!
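A rough sketch of such a search, assuming the data is simply a list of text lines (the search function and log_lines data are made up for illustration):

```python
def search(lines, term):
    """Yield (line_number, line) for each line containing term, as soon as it is found."""
    for number, line in enumerate(lines, start=1):
        if term in line:
            yield number, line

log_lines = [
    "start job",
    "error: disk full",
    "retry job",
    "error: timeout",
]

matches = search(log_lines, "error")
first_match = next(matches)  # scanning stops as soon as the first hit is found
remaining = list(matches)    # continue the same search from where it paused
print(first_match, remaining)
```

The caller decides how many results to pull; if the first match is enough, the rest of the data is never scanned.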

First off, let's start with the hello world of generators, the Fibonacci sequence. If you don't know what the Fibonacci sequence is, each number is the sum of the two numbers that precede it (eg, 1, 1, 2, 3, 5, 8, etc); this is an infinite series.

In [1]:
def fibon(n):
    "Generator version"
    a = b = 1
    for i in range(n):
        yield a
        a, b = b, a + b

for x in fibon(10):
    print(x)
1
1
2
3
5
8
13
21
34
55

You can also generate the entire sequence into memory by calling the list function on the generator!

In [2]:
print(list(fibon(10)))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
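Since the series itself is infinite, one possible variation is to drop the n argument entirely and let the caller decide when to stop. This sketch uses a hypothetical fibon_forever generator together with itertools.islice from the standard library:

```python
from itertools import islice

def fibon_forever():
    """Yield Fibonacci numbers without an upper bound."""
    a = b = 1
    while True:
        yield a
        a, b = b, a + b

# Take only the first ten values of the infinite series
first_ten = list(islice(fibon_forever(), 10))

# Or stop at the first value that satisfies a condition
first_over_1000 = next(x for x in fibon_forever() if x > 1000)
print(first_ten, first_over_1000)
```

Calling list directly on fibon_forever() would never finish, so an infinite generator always needs something like islice or a break to bound it.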

If we were to use a regular function for this instead, one which builds the entire list of the Fibonacci sequence before returning, every one of the number_of_elements values of fibon(number_of_elements) would need to be computed and stored in memory. By using a generator we can pass fibon(number_of_elements) any number of elements and only the latest value and its preceding value are stored in memory.
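To make the comparison concrete, here is a sketch of what the list-building version might look like (fibon_list is a hypothetical name), with sys.getsizeof used as a rough indicator. Note that getsizeof only measures the container itself, not the integers it refers to, so treat the numbers as indicative only:

```python
import sys

def fibon_list(n):
    """List version: builds all n values before returning."""
    result = []
    a = b = 1
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result

def fibon(n):
    """Generator version: produces one value at a time."""
    a = b = 1
    for _ in range(n):
        yield a
        a, b = b, a + b

as_list = fibon_list(1000)
as_generator = fibon(1000)

list_size = sys.getsizeof(as_list)            # grows with the number of elements
generator_size = sys.getsizeof(as_generator)  # stays small regardless of n
print(list_size, generator_size)
```

The exact byte counts depend on the Python version and platform, but the list grows with n while the generator object does not.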

Now imagine you had a database with 100 million records, and you needed to calculate something for every single row. If you tried to retrieve the entire database and store it in memory, your program could run out of memory and fall over. Instead, by using a generator we can iterate through the database in smaller chunks. Another handy feature is that you can have separate generator instances, which you can interweave throughout your project if you need to maintain multiple states in a simple manner.
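As a sketch of chunked database access, assuming an SQLite database reached through the standard library sqlite3 module (the readings table and the iter_rows helper are invented for illustration, and a small in-memory table stands in for the 100 million records):

```python
import sqlite3

def iter_rows(connection, query, chunksize=1000):
    """Yield rows one at a time while fetching them from the database in chunks."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(chunksize)  # at most chunksize rows per fetch
        if not rows:
            break
        yield from rows

# Build a small in-memory table to stand in for a large database
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (value INTEGER)")
connection.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    ((i,) for i in range(10_000)),
)

# Only one chunk of rows is held by this code at any moment
total = sum(value for (value,) in iter_rows(connection, "SELECT value FROM readings", chunksize=500))
connection.close()
print(total)
```

Other database drivers offer similar chunked or streaming fetches, but the details vary, so check the documentation for the one you use.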

In the following example, the entire 'data set' is stored in memory, but in real use we'd request each small chunk from a database.

The generator function returns a generator object, so to get each 'iteration' we use the next function.

In [3]:
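from itertools import zip_longest

def load_big_data():
    return range(100)

def batch_process(chunksize):
    def grouper(n, iterable, fillvalue=None):
        # Repeat the same iterator n times so each group takes n consecutive items
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    # While big_data is currently stored entirely in memory, if you are requesting from an external database you could just fetch n results
    data = load_big_data()

    for chunk in grouper(chunksize, data):
        # Drop the None padding zip_longest adds to a short final chunk
        # (assumes None is never a real value in the data)
        yield [item for item in chunk if item is not None]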

batch_processor_1 = batch_process(5)
batch_processor_2 = batch_process(3)

print(next(batch_processor_1))
print(next(batch_processor_1))
print(next(batch_processor_2))
print(next(batch_processor_1))
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[0, 1, 2]
[10, 11, 12, 13, 14]
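Calling next by hand eventually runs off the end of the data. The following sketch (using a simplified, hypothetical batches generator rather than the batch_process function above) shows the two usual ways to handle that: catching StopIteration, or passing a default value to next.

```python
def batches(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

gen = batches(list(range(5)), 2)
first = next(gen)
second = next(gen)
third = next(gen)  # the final batch is shorter than size

exhausted = False
try:
    next(gen)      # no batches left
except StopIteration:
    exhausted = True

fallback = next(gen, None)  # a default avoids the exception
print(first, second, third, exhausted, fallback)
```

A for loop handles StopIteration for you, so manual next calls are mostly useful when you want to interleave several generators, as in the example above.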

Another good example is to use generators for user interaction (UI), where the program is waiting on the user to input something, but you don't know how long the program will have to wait for interaction. This enables the program to run until the user decides to close it.

In [4]:
def user_input():
    while True:
        cmd = input()
        yield cmd

for command in user_input():
    # Do something with command
    print(command)
hello computer
i am user

KeyboardInterrupt: Interrupted by user
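The session above only ends when the user interrupts the kernel. One possible refinement, sketched here with a hypothetical user_commands generator, a made-up "quit" command and an injectable read function (so the example can run without a keyboard), is to let the generator finish cleanly:

```python
def user_commands(read=input, quit_word="quit"):
    """Yield commands until the user types quit_word or interrupts the program."""
    while True:
        try:
            cmd = read()
        except (EOFError, KeyboardInterrupt):
            return  # end the generator quietly instead of showing a traceback
        if cmd.strip().lower() == quit_word:
            return
        yield cmd

# Scripted input stands in for a real user typing at the prompt
scripted = iter(["hello computer", "i am user", "quit", "never read"])
received = list(user_commands(read=lambda: next(scripted)))
leftover = next(scripted)  # input after "quit" was never requested
print(received, leftover)
```

In an interactive program you would call user_commands() with no arguments so that it falls back to the built-in input function.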