Scrapy deepcopy

Why write this post

The code:

def parse_category(self, response):
    flower = FlowerItem()
    root = response.css("div.zhiwuImg")
    for li in root.css("li"):
        flower["name"] = li.css("img").attrib["title"]
        yield scrapy.Request(
            url=self.webroot + li.css("a:first-child::attr(href)").get(),
            meta={"item": flower},
            callback=self.parse_detail,
        )

def parse_detail(self, response):
    flower = response.meta["item"]
    root = response.css("div.contentDiv")
    flower["firstletter"] = root.css("blockquote p::text").get()
    yield flower

I am using scrapy.Request to set two attributes from two different pages by passing the item through the meta argument,

and the expected result when I execute

scrapy crawl mySpider -o res.json

should be like:

{"name":"Alpha", "quote":"A"}
{"name":"Beta", "quote":"B"}
{"name":"Zeta", "quote":"Z"}

but in fact what’s in res.json is like:

{"name":"Alpha", "quote":"A"}
{"name":"Alpha", "quote":"B"}
{"name":"Alpha", "quote":"Z"}

This is weird, because my code looks logical on the surface, and I couldn't figure it out

until I searched the Internet… It turns out to be a very common question when it comes to high-level programming languages like Python:

The difference between Value Type and Reference Type

What’s Copy in Python

Think about this Python code. What's the output?

# Python
a = [1,2,3]
b = a
b.append(4)
print(a)

Surely in Python it's [1, 2, 3, 4]. But what you changed is b, so why did a change too?

That's because in Python a and b are two references (much like pointers in C) to the same object in memory (its identity is what id() returns in Python).
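You can check this in the interpreter by comparing identities; here is a small sketch:

a = [1, 2, 3]
b = a                   # b is just another name for the same list object
print(id(a) == id(b))   # True: same id, same object
print(a is b)           # True: "is" compares identity, not contents
b.append(4)
print(a)                # [1, 2, 3, 4]: mutating through b is visible through a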

Copying items

To copy an item, you must first decide whether you want a shallow copy or a deep copy.

If your item contains mutable values like lists or dictionaries, a shallow copy will keep references to the same mutable values across all different copies.

For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as well.

If that is not the desired behavior, use a deep copy instead.
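For instance, here is a minimal sketch using a plain dict and the standard-library copy module to stand in for an item:

import copy

original = {"name": "rose", "tags": ["red"]}

shallow = copy.copy(original)    # new dict, but "tags" still refers to the same list
deep = copy.deepcopy(original)   # nested objects are copied too

original["tags"].append("thorny")
print(shallow["tags"])   # ['red', 'thorny']: the shallow copy shares the list
print(deep["tags"])      # ['red']: the deep copy is unaffected

Scrapy items also provide copy() and deepcopy() methods for the same purpose.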

Back to the start

From the Scrapy documentation we can learn this about meta:

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

See Request.meta special keys for a list of special meta keys recognized by Scrapy.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

And the same shallow copy happens when it's passed like this:

yield scrapy.Request(meta={"item":flower}, callback=self.parse_detail)
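Only the meta dict itself gets shallow-copied; the item object stored inside it does not. Here is a quick sketch to see the shared reference (the URLs are made up, and a plain dict stands in for the item):

import scrapy

flower = {"name": "Alpha"}
r1 = scrapy.Request("https://example.com/1", meta={"item": flower})
r2 = scrapy.Request("https://example.com/2", meta={"item": flower})

print(r1.meta["item"] is r2.meta["item"])  # True: both requests hold the very same object
flower["name"] = "Beta"
print(r1.meta["item"]["name"])             # "Beta": a change through any reference shows up everywhere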

Scrapy is asynchronous by default, which means that even when your code looks logical, it may not work as you wish: every request yielded in the loop keeps a reference to the same flower object, and by the time a parse_detail callback runs, that shared object has already been overwritten by later iterations.

So an easy way to avoid that is to deepcopy the item, that is, to duplicate it in memory rather than pass a reference to it:

from copy import deepcopy

yield scrapy.Request(meta={"item": deepcopy(flower)}, callback=self.parse_detail)
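Applied to the spider at the top of this post, only the request in parse_category changes (parse_detail stays the same); roughly like this, with the import at the top of the spider module:

from copy import deepcopy  # at the top of the spider module

def parse_category(self, response):
    flower = FlowerItem()
    root = response.css("div.zhiwuImg")
    for li in root.css("li"):
        flower["name"] = li.css("img").attrib["title"]
        # deepcopy gives every request its own snapshot of the item,
        # so later loop iterations can't overwrite what this request carries
        yield scrapy.Request(
            url=self.webroot + li.css("a:first-child::attr(href)").get(),
            meta={"item": deepcopy(flower)},
            callback=self.parse_detail,
        )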

Reference given by @Gallecio:

https://github.com/scrapy/scrapy/issues/3194

Author: BakaFT

Posted on: 2021-04-20

Updated on: 2023-12-28
