python – scrapy – handling multiple types of items – multiple and related Django
Published: 2020-12-20 13:08:27 | Category: Python | Source: compiled from the web
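I have the following Django models. I am not sure of the best way to save these interrelated objects when a Scrapy pipeline writes the spider's output to the database through Django. Scrapy pipelines seem to be designed to handle only one "kind" of item.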
models.py

    from django.db import models

    # max_length=80 is a placeholder; on_delete is required since Django 2.0
    class Parent(models.Model):
        field1 = models.CharField(max_length=80)

    class ParentX(models.Model):
        field2 = models.CharField(max_length=80)
        parent = models.OneToOneField(Parent, related_name='extra_properties',
                                      on_delete=models.CASCADE)

    class Child(models.Model):
        field3 = models.CharField(max_length=80)
        parent = models.ForeignKey(Parent, related_name='childs',
                                   on_delete=models.CASCADE)

items.py

    # uses DjangoItem https://github.com/scrapy-plugins/scrapy-djangoitem
    from scrapy_djangoitem import DjangoItem

    class ParentItem(DjangoItem):
        django_model = Parent

    class ParentXItem(DjangoItem):
        django_model = ParentX

    class ChildItem(DjangoItem):
        django_model = Child

spiders.py

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["example.com"]
        start_urls = [
            # this page has ids of several Parent objects whose full
            # details are in their individual pages
            "http://www.example.com",
        ]

        def parse(self, response):
            parent_object_ids = []  # list from scraping the ids of the parent objects
            for parent_id in parent_object_ids:
                url = "http://www.example.com/%s" % parent_id
                yield scrapy.Request(url, callback=self.parse_detail)

        def parse_detail(self, response):
            p = ParentItem()
            px = ParentXItem()
            c1 = ChildItem()
            c2 = ChildItem()
            # populate p, px and c1, c2 with various data from the response.body
            yield p
            yield px
            yield c1
            yield c2
            # ... etc c3, c4

pipelines.py (this is the part I don't know how to write)

    class ScrapytestPipeline(object):
        def process_item(self, item, spider):
            # This is where storage to the database typically happens.
            # Now, I don't know whether the item is a ParentItem, a
            # ParentXItem or a ChildItem.
            # Ideally, I want to first create the Parent obj, then the
            # ParentX obj (and point p.extra_properties = px), and then
            # the child objects: c1.parent = p, c2.parent = p
            # But I am not sure how to have the pipeline do this in a
            # sequential way from any order of items received.
            return item

Solution
If you want to process the items sequentially, I suppose that storing one item inside another and unpacking it in the pipeline might work.
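The nesting idea above can be sketched without Scrapy or Django installed. This is only an illustration under assumptions: `ParentBundle`, `InMemoryStore` and `BundlePipeline` are hypothetical names that do not appear in the original answer, and the store is a stand-in for the Django ORM so the save order can be inspected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParentBundle:
    """Hypothetical container item: the spider yields one per detail page."""
    parent: dict
    extra: dict
    children: List[dict] = field(default_factory=list)


class InMemoryStore:
    """Stand-in for the Django ORM (an assumption); keeps rows in save order."""

    def __init__(self):
        self.rows = []

    def save(self, kind, data, parent_id=None):
        row_id = len(self.rows) + 1
        self.rows.append({"id": row_id, "kind": kind,
                          "parent_id": parent_id, **data})
        return row_id


class BundlePipeline:
    """Unpacks a bundle and saves its parts in dependency order."""

    def __init__(self, store):
        self.store = store

    def process_item(self, item, spider):
        if not isinstance(item, ParentBundle):
            return item  # let unrelated item types pass through untouched
        # The parent goes first so the related rows can point at its id.
        parent_id = self.store.save("Parent", item.parent)
        self.store.save("ParentX", item.extra, parent_id=parent_id)
        for child in item.children:
            self.store.save("Child", child, parent_id=parent_id)
        return item
```

With this shape, `parse_detail` would yield a single bundle instead of separate `p`, `px`, `c1`, `c2` items, so the pipeline never has to guess the arrival order. In a real project the three `save` calls would become Django model saves, ideally inside one transaction.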
I think it is easier to relate the objects before saving them to the db. In spiders.py, when you "populate p, px and c1, c2 with various data from the response.body", you can fill in a "fake" primary key constructed from the object's data. Then you can save the data, and update the model if the item has already been scraped, all in a single pipeline:

    from scrapy.exceptions import DropItem

    class ItemPersistencePipeline(object):
        def process_item(self, item, spider):
            try:
                item_model = item_to_model(item)
            except TypeError:
                return item  # not a DjangoItem; pass it through

            model, created = get_or_create(item_model)
            try:
                update_model(model, item_model)
            except Exception as e:
                # Drop the item rather than passing an exception object
                # down the pipeline.
                raise DropItem("Could not save item: %s" % e)
            return item

And of course the methods:

    from django.forms.models import model_to_dict

    def item_to_model(item):
        model_class = getattr(item, 'django_model', None)
        if not model_class:
            raise TypeError("Item is not a `DjangoItem` or is misconfigured")
        return item.instance

    def get_or_create(model):
        model_class = type(model)
        created = False
        try:
            # We have no unique identifier at the moment;
            # use model.primary for now.
            obj = model_class.objects.get(primary=model.primary)
        except model_class.DoesNotExist:
            created = True
            obj = model  # DjangoItem created a model for us.
        return (obj, created)

    def update_model(destination, source, commit=True):
        pk = destination.pk
        # model_to_dict yields raw pk values for relation fields, so assign
        # those through the `<name>_id` attribute if the model has any.
        source_dict = model_to_dict(source)
        for (key, value) in source_dict.items():
            setattr(destination, key, value)
        setattr(destination, 'pk', pk)
        if commit:
            destination.save()
        return destination

From: How to update DjangoItem in Scrapy

You should also define the `primary` field in the Django models, so the pipeline can check whether a newly scraped item is already stored:

models.py

    class Parent(models.Model):
        field1 = models.CharField(max_length=80)
        # primary_key=True
        primary = models.CharField(max_length=80)

    class ParentX(models.Model):
        field2 = models.CharField(max_length=80)
        parent = models.OneToOneField(Parent, related_name='extra_properties',
                                      on_delete=models.CASCADE)
        primary = models.CharField(max_length=80)

    class Child(models.Model):
        field3 = models.CharField(max_length=80)
        parent = models.ForeignKey(Parent, related_name='childs',
                                   on_delete=models.CASCADE)
        primary = models.CharField(max_length=80)

(Editor: 李大同)
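The answer does not show how the "fake" primary key is built, so the following is one possible sketch rather than the author's method. It assumes each detail page exposes a stable identifier (called `source_id` here, a hypothetical name) and hashes it into a key short enough for the 80-character `primary` column. The `parent_primary` entry is likewise a hypothetical way of letting the pipeline look up the parent row later.

```python
import hashlib


def make_primary(kind, *parts):
    """Build a deterministic key from the item kind and scraped values."""
    normalized = [str(part).strip().lower() for part in parts]
    raw = "|".join([kind, *normalized])
    # A SHA-1 hex digest is 40 characters, which fits in max_length=80.
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def link_items(parent, extra, children):
    """Attach keys so rows can be matched again on later crawls.

    `parent` must carry a hypothetical "source_id" entry, e.g. the id
    scraped from the listing page.
    """
    source_id = parent["source_id"]
    parent["primary"] = make_primary("parent", source_id)
    extra["primary"] = make_primary("parentx", source_id)
    extra["parent_primary"] = parent["primary"]
    for index, child in enumerate(children):
        # The position on the page identifies the child here; a real
        # spider should prefer a stable scraped value if one exists.
        child["primary"] = make_primary("child", source_id, index)
        child["parent_primary"] = parent["primary"]
    return parent, extra, children
```

Because the key depends only on scraped data, the same page produces the same `primary` on every crawl, which is what lets `get_or_create` find the existing row instead of inserting a duplicate.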