Shallow vs Deep Copy in Python

Posted on Fri 09 October 2020 in Data Science • 4 min read

One of the most important parts of any programming language is managing variables. We create, modify, compare and delete variables to build the more complex systems that eventually make up the software we use. Assignment is typically done with the = operator (e.g. x = 5), but it doesn't always do what we expect. This post is a deep dive into the different types of copy in Python.

When we write x = 5, we're not copying anything or creating a fresh object for x (as in object oriented software); we're creating a binding between a target (the name x) and an object (the integer 5). We can see this in action by calling the id() function on our variables to see the 'identity' of an object.

In [1]:
x = 5
print(id(x))
print(id(5))
140711684720416
140711684720416

As we can see, x and 5 share an 'identity', meaning they are the same object; the variable is merely a 'pointer' to that object. (CPython caches small integers, which is why the literal 5 resolves to the same object here.)
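
The is operator compares identities directly, which is often easier to read than comparing id() values by eye. Here is a minimal sketch, using lists rather than integers so the result doesn't depend on how the interpreter reuses integer objects. The variable names are made up for illustration.

```python
# Two separately built lists with equal contents
first = [1, 2, 3]
second = [1, 2, 3]

# Plain assignment binds another name to the same object
alias = first

print(first == second)         # True: equal values
print(first is second)         # False: two distinct objects
print(alias is first)          # True: one object under two names
print(id(alias) == id(first))  # True: "is" compares identities
```

So == asks "do these have the same value?", while is asks "are these the same object?".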

But sometimes we actually want to create a new object, which is where the copy module comes in. Even then, there are two types of copy:

  • shallow copy
  • deep copy

Let's take a closer look. We'll start by creating a list (an object itself) with some integer elements and an embedded list.

In [2]:
A = [1,2,[3,4],5]

print("A contents: ",A)
A contents:  [1, 2, [3, 4], 5]

Now similar to our x = 5 example before, let's assign a new variable B and set it to A to see what happens to the identities.

In [3]:
B = A

print(f"A's object id is {id(A)}")

print(f"B's object id is {id(B)}")
A's object id is 2392779886784
B's object id is 2392779886784

Funnily enough, the ids are the same, meaning A and B are the same object. So if we modify the contents of A, the same modification shows up in B, which is not obvious.

In [4]:
print("Let's modify A[2][0] = 100")

A[2][0] = 100

print("A contents: ",A)

print("B contents: ",B)

print("Is A == B? ", A==B)
Let's modify A[2][0] = 100
A contents:  [1, 2, [100, 4], 5]
B contents:  [1, 2, [100, 4], 5]
Is A == B?  True

Now if we were trying to use B as a separate entity to A this could cause all sorts of grief, and be very difficult to track down.
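
A common place this bites is function calls, since passing a list to a function binds the parameter name to the same object in just the same way. The sketch below uses a hypothetical helper, append_total, purely to illustrate the point.

```python
def append_total(values):
    # Hypothetical helper: mutates the list it is given, in place
    values.append(sum(values))
    return values


scores = [10, 20, 30]
result = append_total(scores)

print(result is scores)  # True: the parameter was bound to the same list
print(scores)            # [10, 20, 30, 60]: the caller's list changed too
```

If the caller didn't expect scores to change, the cause is a long way from the symptom, which is what makes this kind of bug hard to track down.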

Let's reset our variable back to its original state so we can see how shallow and deep copies change this behaviour.

In [5]:
print("Let's reset A[2][0] = 3")
A[2][0] = 3
Let's reset A[2][0] = 3

Shallow Copy

In [6]:
import copy

C = copy.copy(A)

print(f"A's object id is {id(A)}")

print(f"C's object id is {id(C)}")
A's object id is 2392779886784
C's object id is 2392779896256

Fantastic! Now we can see that A and C have separate identities, so we'd expect them to behave like separate entities, right? Unfortunately not. While this does have its use cases, the elements inside the list still have matching identities, so if we modify the embedded list through C, the change is reflected in A, which is again not obvious. This is known as a shallow copy: a new outer object is created, but it still references the original elements.

Let's demonstrate this by modifying one of the elements, and seeing if it's reflected in both variables.

In [7]:
print(f"A[2][0]'s object id is {id(A[2][0])}")

print(f"C[2][0]'s object id is {id(C[2][0])}")

print("Let's modify C[2][0] = 100 (note: if the target were a top-level element such as C[0] rather than an element of the embedded list, the assignment would only rebind that slot in C and A would not change)")

C[2][0] = 100

print("A contents: ",A)

print("C contents: ",C)

print("Is A == C? ", A==C)

print("Is A[2][0] == C[2][0]? ", A[2][0]==C[2][0])
A[2][0]'s object id is 140711684720352
C[2][0]'s object id is 140711684720352
Let's modify C[2][0] = 100 (note: if the target were a top-level element such as C[0] rather than an element of the embedded list, the assignment would only rebind that slot in C and A would not change)
A contents:  [1, 2, [100, 4], 5]
C contents:  [1, 2, [100, 4], 5]
Is A == C?  True
Is A[2][0] == C[2][0]?  True
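
copy.copy() isn't the only way to get a shallow copy of a list. As a rough sketch, the idioms below all build a new outer list while sharing the embedded list with the original; the names original and shallow_copies are just for illustration.

```python
import copy

original = [1, 2, [3, 4], 5]

# Four common ways to make a shallow copy of a list
shallow_copies = {
    "copy.copy": copy.copy(original),
    "list.copy": original.copy(),
    "slice": original[:],
    "list()": list(original),
}

for name, clone in shallow_copies.items():
    # First check: is the outer list the same object? (False for all four)
    # Second check: is the embedded list the same object? (True for all four)
    print(name, clone is original, clone[2] is original[2])
```

Each line prints False followed by True: a new outer list, but the same embedded list.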

But why are we using an embedded list specifically? This is one peculiarity that, like most things in this blog post, isn't obvious. If we assign to a top-level element of the shallow copy, such as one of the integers, the change isn't reflected in both variables. Let's try this out.

In [8]:
print(f"A[1]'s object id is {id(A[1])}")

print(f"C[1]'s object id is {id(C[1])}")

print("Let's modify C[1] = 100")

C[1] = 100

print("A contents: ",A)

print("C contents: ",C)

print("Is A == C? ", A==C)

print("Is A[1] == C[1]? ", A[1]==C[1])
A[1]'s object id is 140711684720320
C[1]'s object id is 140711684720320
Let's modify C[1] = 100
A contents:  [1, 2, [100, 4], 5]
C contents:  [1, 100, [100, 4], 5]
Is A == C?  False
Is A[1] == C[1]?  False

This is because C[1] = 100 rebinds a slot in C's own outer list, which A does not share. The difference between shallow and deep copies only matters for compound objects (objects that contain other objects, like lists within lists).
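
The distinction that matters here is between rebinding a slot and mutating a shared object. A small sketch of both cases side by side, using made-up names outer and clone:

```python
import copy

outer = [1, 2, [3, 4], 5]
clone = copy.copy(outer)

# Rebinding: slot 1 of clone now refers to a different integer object
clone[1] = 100
print(outer[1], clone[1])  # 2 100

# Mutating: the inner list is shared, so both outer lists see the change
clone[2].append(99)
print(outer[2], clone[2])  # [3, 4, 99] [3, 4, 99]

# Rebinding the slot that holds the inner list breaks the sharing
clone[2] = [7, 8]
print(outer[2], clone[2])  # [3, 4, 99] [7, 8]
```

Integers can't be mutated in place, so the only thing we can do to an integer element is rebind its slot, which is why the surprise only appears with embedded mutable objects.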

Next let's reset our list, and take a look at deep copy.

In [9]:
print("Let's reset A[2][0] = 3 and A[1] = 2")

A[1] = 2
A[2][0] = 3
Let's reset A[2][0] = 3 and A[1] = 2

Deep Copy

Now we're at the deep copy. As we'd expect, it creates a completely new object and recursively creates new objects for any embedded objects. This means that when we edit anything inside one of those embedded objects, the change isn't reflected in the copy. Let's demonstrate this.

In [10]:
print("Let's do a deep copy")

D = copy.deepcopy(A)

print(f"A's object id is {id(A)}")

print(f"D's object id is {id(D)}")

print("Let's modify A[2][0] = 100")

A[2][0] = 100

print(A)

print(D)
Let's do a deep copy
A's object id is 2392779886784
D's object id is 2392779846656
Let's modify A[2][0] = 100
[1, 2, [100, 4], 5]
[1, 2, [3, 4], 5]
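
The same holds for deeper nesting and for other containers. The sketch below uses a made-up dictionary, config, to check that copy.deepcopy() gives equal values but distinct objects at every level.

```python
import copy

# Made-up nested data: a dict holding a list of lists
config = {"name": "run-1", "layers": [[64, 32], [16]]}
clone = copy.deepcopy(config)

print(clone == config)                            # True: same values
print(clone is config)                            # False: new outer dict
print(clone["layers"] is config["layers"])        # False: new list
print(clone["layers"][0] is config["layers"][0])  # False: new inner list

# Changing the copy at any depth leaves the original alone
clone["layers"][0].append(8)
print(config["layers"])  # [[64, 32], [16]]
```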

Hopefully, being aware of how the default behaviour works, and of the potential solutions, will help when debugging strange behaviour with variables in Python!