Shallow vs Deep Copy in Python
Posted on Fri 09 October 2020 in Data Science • 4 min read
One of the most crucial parts of any programming language is managing variables. We create, modify, compare, and delete variables to build the more complex systems that eventually make up the software we use. Assignment is typically done with the = operator (e.g. x = 5), but this doesn't always do what we expect. This post is a deep dive into the different types of copy in Python.
When we write x = 5, we're not actually creating a new object (in the object-oriented sense); we're creating a binding between a target (the name x) and an object. We can see this in action by calling the id() function on our variables to see the 'identity' of an object.
x = 5
print(id(x))
print(id(5))
As we can see, both x and 5 share an 'identity', meaning they are the same object and the variable is merely a 'pointer' to that object.
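Matching ids for small integers partly reflect a CPython implementation detail (small integers are cached), so a list makes for a cleaner illustration of the same idea. The sketch below uses made-up names (first, second, lookalike) and the 'is' operator, which compares identities in the same way as comparing id() values, to contrast "same object" with "equal value".

```python
# Two names bound to one list object: `is` compares identity, `==` compares value.
first = [1, 2, 3]
second = first            # no copy is made; both names refer to the same list
lookalike = [1, 2, 3]     # a separate list that happens to hold equal values

same_object = first is second
equal_but_distinct = (first == lookalike) and (first is not lookalike)

print(same_object, id(first) == id(second))  # True True
print(equal_but_distinct)                    # True
```

Equality says the contents match; identity says there is only one object behind both names.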
But sometimes we actually want to create a new object, and this is where the copy module comes in. Even with copy, though, there are still two types of copy:
- shallow copy
- deep copy
Let's take a closer look. We'll start by creating a list (an object itself) with some integer elements and an embedded list.
A = [1,2,[3,4],5]
print("A contents: ",A)
Now, similar to our x = 5 example before, let's assign a new variable B and set it to A to see what happens to the identities.
B = A
print(f"A's object id is {id(A)}")
print(f"B's object id is {id(B)}")
Funnily enough, the ids are the same, meaning they are both the same object! This means that if we modify the contents of A, the same modifications show up in B, which is not obvious.
print("Let's modify A[2][0] = 100")
A[2][0] = 100
print("A contents: ",A)
print("B contents: ",B)
print("Is A == B? ", A==B)
Now if we were trying to use B as a separate entity from A, this could cause all sorts of grief and be very difficult to track down.
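One place this kind of aliasing tends to bite is function arguments, since passing a list to a function binds the parameter to the same object rather than copying it. The following is only an illustrative sketch; add_bonus and the score values are hypothetical and not part of the example above.

```python
def add_bonus(scores):
    # Hypothetical helper: appends to the list it receives, in place.
    scores.append(100)
    return scores

original = [70, 80]
updated = add_bonus(original)   # `updated` and `original` are the same list

print(original)                 # [70, 80, 100] -- the caller's list changed too
print(updated is original)      # True
```

If the caller needed to keep its own list untouched, it would have to pass a copy instead.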
Let's reset our variable(s) back to their original state so we can see how shallow and deep copies change this behaviour.
print("Let's reset A[2][0] = 3")
A[2][0] = 3
Shallow Copy
import copy
C = copy.copy(A)
print(f"A's object id is {id(A)}")
print(f"C's object id is {id(C)}")
Fantastic! Now we can see that A and C have separate identities, so we would expect them to behave like separate entities, right? Unfortunately not. While this does have its use cases, the elements inside the two lists still have matching identities, meaning that if we modify the embedded list through C, the change will be reflected in A, which is again not obvious behaviour. This is known as a shallow copy: a new outer object is created, but its elements still reference the original objects.
Let's demonstrate this by modifying one of the elements, and seeing if it's reflected in both variables.
print(f"A[2][0]'s object id is {id(A[2][0])}")
print(f"C[2][0]'s object id is {id(C[2][0])}")
print("Let's modify C[2][0] = 100 (note: if this were not an embedded list, the assignment would rebind the element in C only and would not update the original list)")
C[2][0] = 100
print("A contents: ",A)
print("C contents: ",C)
print("Is A == C? ", A==C)
print("Is A[2][0] == C[2][0]? ", A[2][0]==C[2][0])
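The same sharing applies to the other common ways of copying a list, not just copy.copy. As a quick sketch (the dictionary name copies and the flag names are just for illustration), each of the following produces a new outer list whose embedded list is still the original object:

```python
import copy

A = [1, 2, [3, 4], 5]

# Four common ways to make a shallow copy of a list.
copies = {
    "copy.copy": copy.copy(A),
    "list()": list(A),
    "slice": A[:],
    "list.copy": A.copy(),
}

for name, C in copies.items():
    # The outer list is new, but the nested list is the very same object.
    print(name, C is A, C[2] is A[2])   # each line ends with: False True

outer_is_new = all(C is not A for C in copies.values())
inner_is_shared = all(C[2] is A[2] for C in copies.values())
```

Checking C[2] is A[2] is a quick way to confirm whether a copy is shallow before relying on it.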
But why are we using an embedded list specifically? This is one peculiarity that, like most things in this blog post, isn't obvious. If we replace an element of the shallow copy that is an integer, the change isn't reflected in both variables. Let's try this out.
print(f"A[1]'s object id is {id(A[1])}")
print(f"C[1]'s object id is {id(C[1])}")
print("Let's modify C[1] = 100")
C[1] = 100
print("A contents: ",A)
print("C contents: ",C)
print("Is A == C? ", A==C)
print("Is A[1] == C[1]? ", A[1]==C[1])
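This is because assigning C[1] = 100 rebinds that slot in C to a different object rather than changing the integer that A[1] refers to (integers are immutable). The difference between shallow and deep copies only matters for compound objects (objects that contain other objects, like lists within lists).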
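Put differently, what matters is whether we assign to a slot of the copy or change a shared object in place. A minimal sketch of the three cases side by side, using the same list as above:

```python
import copy

A = [1, 2, [3, 4], 5]
C = copy.copy(A)

C[1] = 100        # rebinds slot 1 of C; A's slot 1 still refers to 2
C[2].append(99)   # mutates the shared inner list in place; A sees it
C[2] = [0, 0]     # rebinds slot 2 of C; A keeps the original inner list

print(A)  # [1, 2, [3, 4, 99], 5]
print(C)  # [1, 100, [0, 0], 5]
```

Only the in-place change (the append) leaks through to A; both slot assignments stay local to C.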
Next let's reset our list, and take a look at deep copy.
print("Let's reset A[2][0] = 3 and A[1] = 2")
A[1] = 2
A[2][0] = 3
Deep Copy
Now we're at the deep copy. As we'd expect, it creates a completely new object and recursively creates new objects for any embedded objects (compound objects). This means that when we edit anything inside one of these compound objects, the change won't be reflected in the other object, which is the behaviour we originally expected from a copy. Let's demonstrate this.
print("Let's do a deep copy")
D = copy.deepcopy(A)
print(f"A's object id is {id(A)}")
print(f"D's object id is {id(D)}")
print("Let's modify A[2][0] = 100")
A[2][0] = 100
print(A)
print(D)
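To confirm that the embedded list really is a separate object this time, we can compare identities at both levels. This is a small self-contained sketch that rebuilds A from scratch rather than reusing the state above:

```python
import copy

A = [1, 2, [3, 4], 5]
D = copy.deepcopy(A)

print(D == A)          # True  -- equal values
print(D is A)          # False -- different outer list
print(D[2] is A[2])    # False -- different inner list as well

A[2][0] = 100          # change the original's inner list in place
print(A)               # [1, 2, [100, 4], 5]
print(D)               # [1, 2, [3, 4], 5]
```

Unlike the shallow copy, the check on the inner list now fails the identity test, which is why the edit to A never reaches D.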
Hopefully, being aware of how the default behaviour works, and of the potential solutions, will help when debugging strange behaviour with variables in Python!