Scrapy with MongoDB

In Scrapy, the component that processes scraped data (Items) is called a Pipeline.

When a spider such as xx_spider.py yields an item, the data is processed and saved by the pipelines listed in settings.ITEM_PIPELINES, in order:

ITEM_PIPELINES = {
    # 'xx.pipelines.FirstPipeline': 1,
    'xx.pipelines.DuplicatesPipeline': 2,
    'xx.pipelines.MongoPipeline': 3,
}

The number after each pipeline is its priority, ranging from 0 to 1000; pipelines are executed from the smallest value to the largest.
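For reference, a minimal item definition and spider that feed these pipelines might look like the sketch below. The project name xx, the item class XxItem, and the start URL are hypothetical placeholders; only the url field is assumed, since the DuplicatesPipeline example below keys on item['url'].

# xx/items.py -- hypothetical item; only the url field is assumed
import scrapy

class XxItem(scrapy.Item):
    url = scrapy.Field()

# xx/spiders/xx_spider.py -- hypothetical spider yielding the item
import scrapy
from xx.items import XxItem

class XxSpider(scrapy.Spider):
    name = 'xx'
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = XxItem()
        item['url'] = response.url
        yield item  # the item now flows through ITEM_PIPELINES in priority order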

The key file is pipelines.py; here is an example:

# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem


class FirstPipeline(object):
    def process_item(self, item, spider):
        # Placeholder pipeline (disabled in ITEM_PIPELINES above)
        return item


class DuplicatesPipeline(object):
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        # item['url'] must match a Field declared in items.py
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item
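MongoPipeline reads its connection settings through crawler.settings.get('MONGO_URI') and crawler.settings.get('MONGO_DATABASE'), so settings.py also needs these two values. A minimal sketch, assuming a local MongoDB instance; the host and database name below are placeholders:

# settings.py -- read by MongoPipeline.from_crawler
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'xx'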