Scrapy deepcopy
Why write this post
The code:
    def parse_category(self, response):
I am using scrapy.Request and filling in two attributes of the same item from two different pages, passing the item along in the meta argument, and the expected result when I execute
    scrapy crawl mySpider -o res.json
should look like:
    {"name":"Alpha", "quote":"A"}
but what actually ends up in res.json looks like:
    {"name":"Alpha", "quote":"A"}
This is weird, because my code looks logical on the surface, and I couldn't figure it out until I searched the Internet… It turns out to be a very common question in high-level programming languages like Python: the difference between value types and reference types.
What’s Copy in Python
Think about this Python code. What's the output?
    # Python
Surely in Python it's [1, 2, 3, 4]. But what you changed is b, so why was a changed too?
That's because in Python a and b are two names that refer to the same object, much like two pointers in C pointing to the same memory address. Python exposes an object's identity through id(), which in CPython is its memory address.
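A minimal sketch of the behaviour described above, assuming a starts as the list [1, 2, 3]. The names a and b follow the text; the exact statements and the extra name c are assumptions chosen to match the output quoted there.

```python
# Two names bound to one list object (assumed starting values).
a = [1, 2, 3]
b = a            # no copy is made; b refers to the same list as a

b.append(4)      # mutate the list through b

print(a)                 # [1, 2, 3, 4]
print(a is b)            # True: both names refer to one object
print(id(a) == id(b))    # True: same identity

# An explicit copy creates a separate list object.
c = list(a)
c.append(5)
print(a)                 # [1, 2, 3, 4] (unchanged by the change to c)
```

Assignment never copies in Python; a new object only appears when you ask for one explicitly, for example with list(a) or the copy module.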
Copying items
To copy an item, you must first decide whether you want a shallow copy or a deep copy.
If your item contains mutable values like lists or dictionaries, a shallow copy will keep references to the same mutable values across all different copies.
For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as well.
If that is not the desired behavior, use a deep copy instead.
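The difference can be sketched with the standard copy module. This is only an illustration: a plain dict stands in for an item, and the name and tags fields are hypothetical.

```python
import copy

# A plain dict standing in for an item (fields are illustrative).
item = {"name": "Alpha", "tags": ["red"]}

shallow = copy.copy(item)      # new dict, but the same inner list
deep = copy.deepcopy(item)     # new dict and a new inner list

item["tags"].append("tall")

print(shallow["tags"])                   # ['red', 'tall'] (shared with the original)
print(deep["tags"])                      # ['red'] (independent)
print(shallow["tags"] is item["tags"])   # True
print(deep["tags"] is item["tags"])      # False
```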
Back to the start
From the Scrapy documentation, we can learn about meta:
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
The same holds when the item is passed like this: the meta dict keeps a reference to flower rather than a copy of it.
    yield scrapy.Request(meta={"item": flower}, callback=self.parse_detail)
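Scrapy is asynchronous by default: a callback runs later, once its response arrives, and by then the method that yielded the request may have gone on to modify the same item. So code that looks logical will probably not work the way you expect.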
So an easy way to avoid that is to deepcopy the item (deepcopy comes from the standard copy module), that is, to duplicate it in memory rather than point to the same object:
    yield scrapy.Request(meta={"item": deepcopy(flower)}, callback=self.parse_detail)
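The effect can be reproduced without Scrapy. The sketch below is a simulation under stated assumptions, not Scrapy's actual scheduling: FakeRequest, crawl, the copy_item flag and the bodies of parse_category and parse_detail are hypothetical stand-ins, and the callbacks are simply run after the generator that yields the requests has finished.

```python
from copy import deepcopy


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: stores meta and a callback."""

    def __init__(self, meta, callback):
        self.meta = dict(meta)  # shallow copy of the dict: values stay shared
        self.callback = callback


def parse_category(names, copy_item):
    flower = {}  # one item object reused on every iteration
    for name in names:
        flower["name"] = name
        item = deepcopy(flower) if copy_item else flower
        yield FakeRequest(meta={"item": item}, callback=parse_detail)


def parse_detail(meta):
    item = meta["item"]
    item["quote"] = item["name"][0]  # pretend the quote came from a detail page
    return item


def crawl(copy_item):
    # Collect every request first, then run the callbacks (deferred execution).
    requests = list(parse_category(["Alpha", "Beta"], copy_item))
    return [dict(r.callback(r.meta)) for r in requests]


print(crawl(copy_item=False))  # both results describe "Beta"
print(crawl(copy_item=True))   # one result for "Alpha", one for "Beta"
```

Without the copy, every request carries a reference to the same flower dict, so each callback sees whatever was written to it last. With deepcopy, each request gets its own snapshot taken at the moment it was yielded.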
Reference given by @Gallecio:
Scrapy deepcopy