Metadata-Version: 1.2
Name: requests-spider
Version: 0.0.8
Summary: Web crawling framework like flask.
Home-page: https://github.com/Tooooomy/requests_spider
Author: Tommy
Author-email: tooooommy@163.com
License: MIT
Description: <div align=center>
        <img src="./spider.png">
        </div>
        
        ## 描述
        requests_spider 是一个轻量级的异步爬虫框架，基于requests_html进行二次开发，类似flask
        
        ## 安装
        pip install requests_spider
        
        ## 依赖
        python: > 3.6
        uvloop
        requests_html
        
        ## 用法
        ##### 基础例子
        ```python3
        import json
        from spider import XField, Spider, Model, Response, Request
        
        
        class Proxy(Model):
            ip = XField(rule='//tr[contains(@class, "odd")]/td[2]', first=False)
            port = XField(rule='//tr[contains(@class, "odd")]/td[3]', first=False)
        
            async def process(self, response: Response):
                with open('proxy1.txt', 'a+') as file:
                    for result in self.merge():
                        file.write(json.dumps(result) + '\n')
        
        
        spider = Spider('proxy', workers=15)
        spider.domains = ['www.xicidaili.com']
        spider.init_requests = [
            Request(url='http://www.xicidaili.com/nn/{}'.format(x), model=Proxy) for x in range(1, 10)
        ]
        
        spider.async_limit = 5
        
        if __name__ == '__main__':
            spider.run()
        
        ```
        爬取代理网站ip
        
        ##### 中间组件
        ```python3
        import random
        import re
        
        from spider import Spider, Request, XRequest, Model, XField, RField, Response, Field, asyncio
        
        # 获取某个用户的所有的视频信息 ===> 获取aid / page
        videos_url = "https://space.bilibili.com/ajax/member/getSubmitVideos?mid={mid}" \
                     "&pagesize=30&tid=0&page={page}&keyword=&order=pubdate"
        
        # 某个视频推荐的视频  ===> 获取aid
        recommend_url = "https://comment.bilibili.com/playtag,{cid}-{aid}?html5=1"
        
        # 用户信息 post csrf/mid
        user_url = "https://space.bilibili.com/ajax/member/GetInfo"
        
        # av页面，====> 获取下载视频的url、cid-aid, mid
        av_url = "https://www.bilibili.com/video/av{aid}"
        
        
        class AV(Model):
            urls = RField(rule='"url":"(.*?)","backup_url"', first=False)
            cid = RField(rule='cid=(.*?)&aid=')
            aid = RField(rule='&aid=(.*?)&pre_ad=')
            mid = RField(rule='"owner":{"mid":(.*?),')
        
            async def process(self, response: Response):
                print(self['urls'])
                print(self['cid'])
                print(self['aid'])
                print(self['mid'])
                print(self.json())
                if self['mid'] and self['aid'] and self['urls'] and self['cid']:
                    # 推荐视频
                    yield Request(url=recommend_url.format(cid=self['cid'], aid=self['aid']), model=Recommend)
        
                    # 用户信息
                    yield Request(url=user_url, method='POST', data={'csrf': '', 'mid': self['mid']},
                                  model=UserInfo, not_filter=True)
        
                    # 下载视频
                    for order, url in enumerate(self['urls']):
                        yield Request(url=url.replace('http', 'https'),
                                      meta={'name': self['aid'] + '_' + str(order)}, model=Video)
        
        
        class UserInfo(Model):
            mid = Field()
            name = Field()
            sex = Field()
            rank = Field()
            face = Field()
            regtime = Field()
            birthday = Field()
            sign = Field()
            level_info = Field()
        
            async def process(self, response: Response):
                status = response.json().get('status')
                if status:
                    data = response.json().get('data')
                    for k in self.keys():
                        if k in data:
                            self[k] = data[k]
                    with open('user_' + str(self['mid']) + '.txt', 'w') as f:
                        f.write(self.dumps() + '\n')
        
        
        class Recommend(Model):
        
            async def process(self, response: Response):
                for data in response.json():
                    yield Request(av_url.format(aid=data[1]), model=AV)
        
        
        class VideoInfo(Model):
        
            async def process(self, response: Response):
                status = response.json().get('status')
                if status:
                    data = response.json().get('data')
                    pattern = 'mid=(\d+?)&pagesize=30&tid=0&page=(\d+?)&keyword=&order=pubdate'
                    patn = re.findall(pattern, response.url)[0]
                    print(patn)
                    yield Request(url=videos_url.format(mid=patn[0], page=int(patn[1]) + 1), model=VideoInfo),
                    for v in data['vlist']:
                        yield Request(url=av_url.format(aid=v.get('aid')), model=AV)
        
        
        class Video(Model):
        
            async def process(self, response: Response):
                file_name = response.current_request.meta.get('name')
                if file_name and response.status_code == 200:
                    with open(file_name + '.mp4', 'wb') as f:
                        for content in response.iter_content(chunk_size=512):
                            f.write(content)
                            f.flush()
        
        
        spider = Spider('bilibili', workers=5)
        
        spider.init_requests = [
            Request(url=videos_url.format(mid='35789774', page=1), model=VideoInfo),
        ]
        spider.async_limit = 5
        
        
        @spider.Middleware('request')
        async def test(request):
            print(request.url)
            if request.url.startswith('https://space.bilibili.com/'):
                request.info.update({'headers': {'Referer': 'https://space.bilibili.com/'}})
            else:
                request.info.update({'headers': {'Referer': 'https://bilibili.com/'}})
        
            asyncio.sleep(round(random.random() * 5))
            return request
        
        
        if __name__ == '__main__':
            spider.run()
        ```
        爬取bilibili用户视频，用户资料，视频资料，利用中间组件进行切换headers
        
        ## API
        #### Spider
        继承requests_html的HTMLSession
        
        - **Spider.async_limit**
        
            利用asyncio.Semaphore限制并发数量
        
        
        - **Spider.queue_timeout**
        
            从队列获取数据时候超时设置
        
        - **Spider.request_depth**
        
            请求的深度
        
        - **Spider.init_requests**
        
            初始化请求
        
        - **Spider.domains**
        
            爬取域名设置
        
        - **Spider.rules**
        
            从响应的数据中获取下次请求的信息，并加入队列
        
        - **Spider.Middleware**
        
            中间组件
            Middleware('request'), request入队之前执行，返回request, response, None
            Middleware('response'), response入队之前，返回request, response, None
        
        
        ##### Model
        Model类似一个字典的数据模型
        
        - **Model.keys**
        
            类似字典的keys
        
        - **Model.values**
        
            类似字典的values
        
        - **Model.items**
        
            类似字典的items
        
        - **Model.json**
        
            获取所有Field的字典形式
        
        - **Model.dumps**
        
            获取所有的Field的字符串
        
        - **Model.merge**
        
            当所有的Field从响应数据获取的数据是列表的时候，将获取的列表合并成为json数据
        
        - **Model.process**
        
            处理响应数据
        
        
        ##### Field
        
        - **Field**
        
            不处理或待处理数据项
        
        - **XField**
        
            利用xpath从响应数据中获取数据
        
        - **CField**
        
            利用css获取数据
        
        - **RField**
        
            利用正则获取数据
        
        ##### Request
        
        - **Request**
            正常的请求
        
        - **XRequest**
        
            利用xpath, 用于Spider.rules
        
        - **RRequest**
        
            利用正则，用于Spider.rules
        
        
        
        ## 例子
        examples目录下
        bilibili.py 爬取哔哩哔哩用户信息、视频信息和视频
        qidian.py 爬取起点小说月票排行包括评分
        proxy.py 爬取代理ip网站代理
        pearvideo.py 爬取梨视频网站的视频
        
        # License
        MIT
Platform: all
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.6.0
