How to Check Which Object Causes TypeError: cannot pickle '_thread.RLock' object?

Hey,

I am trying to build a distributed spider with the Scrapy and Ray frameworks. I got the following error after putting my spider into a different process. Is there a way to check which object causes this error? Any ideas about debugging it? Thanks!

  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1432, in ray._raylet.CoreWorker.store_task_outputs
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 406, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 386, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 346, in _serialize_to_pickle5
    raise e
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 342, in _serialize_to_pickle5
    inband = pickle.dumps(
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 574, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object

Hey @ChihweiLHBird, great question!

I think starting from Ray 1.2.0, we provide an inspection tool to test this: https://docs.ray.io/en/master/serialization.html#troubleshooting
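For reference, here is a minimal sketch of how the tool can be used; the lock-holding process function is a hypothetical stand-in for your spider code:

import threading

from ray.util import inspect_serializability

lock = threading.RLock()

def process():
    # The RLock is captured in this function's closure, so the
    # function itself cannot be pickled.
    return lock

# Walks the object graph and prints which member fails to serialize.
inspect_serializability(process, name="process")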

Please let us know after you try this out!

Also, just note that the error indicates a lock object in your remote method is being serialized. Lock objects are famously unserializable. A common cause looks like this:

object_a = create_object_a()  # Let's say this object contains a lock

@ray.remote
def f():
    # object_a is captured by this remote function's closure; if it
    # holds a lock, the function cannot be serialized.
    return object_a

Instead, you can pass it explicitly:

object_a = create_object_a()

@ray.remote
def f(object_a):  # Passed explicitly, so it doesn't need to be captured and serialized with the function
    return object_a

f.remote(object_a)

Thank you, Sang. I finally figured out what the unserializable object is: it is a function inside the twisted package. For some reason, maybe because of the complicated machinery inside the twisted and scrapy packages, the Ray inspection tool was not able to find the unserializable object directly, lol. If you would like the detailed information, please let me know.

Hmm, that’s unfortunate! Could you post the message output by inspect_serializability?

Source code:

import ray
import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess


ray.init()


@ray.remote
class DistributedCrawlerProcess(CrawlerProcess):
    pass


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass


if __name__ == "__main__":
    from ray.util import inspect_serializability
    inspect_serializability(DistributedCrawlerProcess, name="test")
    inspect_serializability(DistributedCrawlerProcess.crawl, name="test")
    inspect_serializability(DistributedCrawlerProcess.start, name="test")
    inspect_serializability(QuotesSpider, name="test")

    num_parallel_processes = 1 # Can be changed to more.

    ray_obj_list = []
    for i in range(num_parallel_processes):
        settings = get_project_settings()
        process = DistributedCrawlerProcess.remote(settings)
        process.crawl.remote(QuotesSpider)
        ray_obj_list.append(process.start.remote())

    ray.get(ray_obj_list)

Output:

2021-02-28 14:04:19,617 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265
================================================================================
Checking Serializability of <__main__.ActorClass(DistributedCrawlerProcess) object at 0x7f92b0493040>
================================================================================
================================================================================
Checking Serializability of <bound method CrawlerRunner.crawl of <__main__.ActorClass(DistributedCrawlerProcess) object at 0x7f92b0493040>>
================================================================================
================================================================================
Checking Serializability of <bound method CrawlerProcess.start of <__main__.ActorClass(DistributedCrawlerProcess) object at 0x7f92b0493040>>
================================================================================
===========================================================
Checking Serializability of <class '__main__.QuotesSpider'>
===========================================================
(pid=26064) 2021-02-28 14:04:23 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: product_spiders)
(pid=26064) 2021-02-28 14:04:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
(pid=26064) 2021-02-28 14:04:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
(pid=26064) 2021-02-28 14:04:23 [scrapy.crawler] INFO: Overridden settings:
(pid=26064) {'AUTOTHROTTLE_ENABLED': True,
(pid=26064)  'BOT_NAME': 'product_spiders',
(pid=26064)  'COOKIES_ENABLED': False,
(pid=26064)  'DOWNLOAD_DELAY': 0.5,
(pid=26064)  'NEWSPIDER_MODULE': 'product_spiders.spiders',
(pid=26064)  'ROBOTSTXT_OBEY': True,
(pid=26064)  'SPIDER_MODULES': ['product_spiders.spiders']}
(pid=26064) 2021-02-28 14:04:23 [scrapy.extensions.telnet] INFO: Telnet Password: 356900ad59191af5
(pid=26064) 2021-02-28 14:04:23 [scrapy.middleware] INFO: Enabled extensions:
(pid=26064) ['scrapy.extensions.corestats.CoreStats',
(pid=26064)  'scrapy.extensions.telnet.TelnetConsole',
(pid=26064)  'scrapy.extensions.memusage.MemoryUsage',
(pid=26064)  'scrapy.extensions.logstats.LogStats',
(pid=26064)  'scrapy.extensions.throttle.AutoThrottle']
(pid=26064) 2021-02-28 14:04:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
(pid=26064) ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
(pid=26064)  'scrapy.downloadermiddlewares.stats.DownloaderStats']
(pid=26064) 2021-02-28 14:04:23 [scrapy.middleware] INFO: Enabled spider middlewares:
(pid=26064) ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
(pid=26064)  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
(pid=26064)  'scrapy.spidermiddlewares.referer.RefererMiddleware',
(pid=26064)  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
(pid=26064)  'scrapy.spidermiddlewares.depth.DepthMiddleware']
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from before-call.apigateway to before-call.api-gateway
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
(pid=26064) 2021-02-28 14:04:23 [botocore.loaders] DEBUG: Loading JSON file: /home/chihwei/.local/lib/python3.8/site-packages/boto3/data/dynamodb/2012-08-10/resources-1.json
(pid=26064) 2021-02-28 14:04:23 [botocore.loaders] DEBUG: Loading JSON file: /home/chihwei/.local/lib/python3.8/site-packages/botocore/data/endpoints.json
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f5932d0e040>
(pid=26064) 2021-02-28 14:04:23 [botocore.loaders] DEBUG: Loading JSON file: /home/chihwei/.local/lib/python3.8/site-packages/botocore/data/dynamodb/2012-08-10/service-2.json
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-client-class.dynamodb: calling handler <function add_generate_presigned_url at 0x7f5932d3fdc0>
(pid=26064) 2021-02-28 14:04:23 [botocore.endpoint] DEBUG: Setting dynamodb timeout as (60, 60)
(pid=26064) 2021-02-28 14:04:23 [botocore.loaders] DEBUG: Loading JSON file: /home/chihwei/.local/lib/python3.8/site-packages/botocore/data/_retry.json
(pid=26064) 2021-02-28 14:04:23 [botocore.client] DEBUG: Registering retry handlers for service: dynamodb
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:dynamodb
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.ServiceResource: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:Table
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020f70>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f5932d0e040>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-client-class.dynamodb: calling handler <function add_generate_presigned_url at 0x7f5932d3fdc0>
(pid=26064) 2021-02-28 14:04:23 [botocore.endpoint] DEBUG: Setting dynamodb timeout as (60, 60)
(pid=26064) 2021-02-28 14:04:23 [botocore.client] DEBUG: Registering retry handlers for service: dynamodb
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:dynamodb
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.ServiceResource: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:Table
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020f70>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x7f5932d0e040>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-client-class.dynamodb: calling handler <function add_generate_presigned_url at 0x7f5932d3fdc0>
(pid=26064) 2021-02-28 14:04:23 [botocore.endpoint] DEBUG: Setting dynamodb timeout as (60, 60)
(pid=26064) 2021-02-28 14:04:23 [botocore.client] DEBUG: Registering retry handlers for service: dynamodb
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:dynamodb
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.ServiceResource: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [boto3.resources.factory] DEBUG: Loading dynamodb:Table
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020f70>
(pid=26064) 2021-02-28 14:04:23 [botocore.hooks] DEBUG: Event creating-resource-class.dynamodb.Table: calling handler <function lazy_call.<locals>._handler at 0x7f5931020ee0>
(pid=26064) 2021-02-28 14:04:23 [scrapy.middleware] INFO: Enabled item pipelines:
(pid=26064) ['product_spiders.pipelines.UpdateDatabasePipeline']
(pid=26064) 2021-02-28 14:04:23 [scrapy.core.engine] INFO: Spider opened
(pid=26064) 2021-02-28 14:04:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
(pid=26064) 2021-02-28 14:04:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
(pid=26064) 2021-02-28 14:04:23 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2021-02-28 14:04:28,238 ERROR worker.py:1053 -- Possible unhandled error from worker: ray::DistributedCrawlerProcess.crawl() (pid=26064, ip=172.31.171.50)
  File "python/ray/_raylet.pyx", line 509, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 510, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1466, in ray._raylet.CoreWorker.store_task_outputs
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 319, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 299, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 259, in _serialize_to_pickle5
    raise e
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/serialization.py", line 255, in _serialize_to_pickle5
    inband = pickle.dumps(
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/chihwei/.local/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 574, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
(pid=26064) 2021-02-28 14:04:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
(pid=26064) 2021-02-28 14:04:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
(pid=26064) 2021-02-28 14:04:36 [scrapy.core.engine] INFO: Closing spider (finished)
(pid=26064) 2021-02-28 14:04:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
(pid=26064) {'downloader/request_bytes': 681,
(pid=26064)  'downloader/request_count': 3,
(pid=26064)  'downloader/request_method_count/GET': 3,
(pid=26064)  'downloader/response_bytes': 5642,
(pid=26064)  'downloader/response_count': 3,
(pid=26064)  'downloader/response_status_count/200': 2,
(pid=26064)  'downloader/response_status_count/404': 1,
(pid=26064)  'elapsed_time_seconds': 13.422068,
(pid=26064)  'finish_reason': 'finished',
(pid=26064)  'finish_time': datetime.datetime(2021, 2, 28, 21, 4, 36, 641320),
(pid=26064)  'log_count/DEBUG': 45,
(pid=26064)  'log_count/INFO': 10,
(pid=26064)  'memusage/max': 126164992,
(pid=26064)  'memusage/startup': 126164992,
(pid=26064)  'response_received_count': 3,
(pid=26064)  'robotstxt/request_count': 1,
(pid=26064)  'robotstxt/response_count': 1,
(pid=26064)  'robotstxt/response_status_count/404': 1,
(pid=26064)  'scheduler/dequeued': 2,
(pid=26064)  'scheduler/dequeued/memory': 2,
(pid=26064)  'scheduler/enqueued': 2,
(pid=26064)  'scheduler/enqueued/memory': 2,
(pid=26064)  'start_time': datetime.datetime(2021, 2, 28, 21, 4, 23, 219252)}
(pid=26064) 2021-02-28 14:04:36 [scrapy.core.engine] INFO: Spider closed (finished)

Oh, got it; it looks like the request object (scrapy.Request(url=url, callback=self.parse)) is non-serializable…

It seems inspect_serializability still fails to flag the Request object as unserializable…

inspect_serializability(scrapy.Request, name="test")
inspect_serializability(scrapy.Request(url="https://www.google.com/"), name="test")
=================================================================
Checking Serializability of <class 'scrapy.http.request.Request'>
=================================================================
=========================================================
Checking Serializability of <GET https://www.google.com/>
=========================================================

I may explore ways to detect this kind of non-serializable object, and I can share the details here if I figure it out.


Hmm, well, one possibility is to just call pickle.dumps(object), then check all the attributes and enclosing scope of the object, and narrow down the culprit manually and iteratively.
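For example, here is a rough sketch of that manual narrowing; find_unpicklable is a hypothetical helper, not a Ray API:

import pickle

def find_unpicklable(obj, path="obj", seen=None):
    # Try to pickle obj; on failure, report it and recurse into its
    # attributes to narrow down which member is the real culprit.
    if seen is None:
        seen = set()
    if id(obj) in seen:  # guard against reference cycles
        return
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return  # obj itself pickles fine
    except Exception as exc:
        print(f"{path}: {type(exc).__name__}: {exc}")
    for name, attr in list(getattr(obj, "__dict__", {}).items()):
        find_unpicklable(attr, f"{path}.{name}", seen)

For instance, find_unpicklable(QuotesSpider()) would print a dotted attribute path for each nested member that fails to pickle.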
