scrapy with mongodb

In Scrapy, the component that processes scraped data (Items) is called a Pipeline.

When xx_spider.py yields an item, the data is saved by passing the item through the pipelines configured in settings.ITEM_PIPELINES, in order:

ITEM_PIPELINES = {
    # 'xx.pipelines.FirstPipeline': 1,
    'xx.pipelines.DuplicatesPipeline': 2,
    'xx.pipelines.MongoPipeline': 3,
}

The trailing numbers are priorities in the range 0-1000; pipelines run in ascending order, smallest first.
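For reference, a minimal sketch of the item and spider side (the names XxItem, url, title, and the quotes.toscrape.com URL are assumptions for illustration; the DuplicatesPipeline shown further below relies on the item having a url Field):

# items.py
import scrapy

class XxItem(scrapy.Item):
    # 'url' must match the key the pipelines read via item['url']
    url = scrapy.Field()
    title = scrapy.Field()

# xx_spider.py
import scrapy
from xx.items import XxItem

class XxSpider(scrapy.Spider):
    name = 'xx'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Every yielded item travels through the enabled pipelines in priority order
        for href in response.css('a::attr(href)').getall():
            yield XxItem(url=response.urljoin(href),
                         title=response.css('title::text').get())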

The key lies in the pipelines.py file. Here is an example:

# -*- coding: utf-8 -*-

import pymongo
from scrapy.exceptions import DropItem


class FirstPipeline(object):

    def process_item(self, item, spider):
        # Placeholder pipeline; process_item must return the item
        # (or raise DropItem) so that later pipelines receive it
        return item


class DuplicatesPipeline(object):

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        # item['url'] must match the Field defined in items.py
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        collection_name = item.__class__.__name__
        # insert_one replaces the deprecated Collection.insert in PyMongo 3+
        self.db[collection_name].insert_one(dict(item))
        return item
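MongoPipeline pulls its connection parameters from the crawler settings via from_crawler, so settings.py needs the matching keys. A minimal sketch, assuming a local MongoDB on the default port:

# settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'  # MongoPipeline falls back to 'items' if this is omitted

After the crawl finishes, the saved documents can be inspected in the mongo shell; since process_item uses the item class name as the collection name, items of class XxItem end up in db.XxItem.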