[Computers][Python] Scraping the URLs on a page, crawling



I needed to scrape all the URLs on an HTML page, so I went looking for a good tool and found Scrapy, which is written in Python. I am going to give it a try.


Install Scrapy

http://doc.scrapy.org/en/latest/intro/install.html#intro-install





C:\Users\namhadmin>"c:\Python27\Scripts\pip.exe" install Scrapy
Requirement already satisfied (use --upgrade to upgrade): Scrapy in c:\python27\lib\site-packages
Collecting Twisted>=10.0.0 (from Scrapy)
  Downloading Twisted-15.0.0-cp27-none-win32.whl (3.1MB)
    100% |################################| 3.1MB 126kB/s
Collecting w3lib>=1.8.0 (from Scrapy)
  Downloading w3lib-1.11.0-py2.py3-none-any.whl
Collecting queuelib (from Scrapy)
  Downloading queuelib-1.2.2-py2.py3-none-any.whl
Collecting lxml (from Scrapy)
  Downloading lxml-3.4.1-cp27-none-win32.whl (3.0MB)
    100% |################################| 3.0MB 62kB/s
Collecting pyOpenSSL (from Scrapy)
  Downloading pyOpenSSL-0.14.tar.gz (128kB)
    100% |################################| 131kB 148kB/s
Collecting cssselect>=0.9 (from Scrapy)
  Downloading cssselect-0.9.1.tar.gz
Collecting six>=1.5.2 (from Scrapy)
  Downloading six-1.9.0-py2.py3-none-any.whl
Collecting zope.interface>=3.6.0 (from Twisted>=10.0.0->Scrapy)
  Downloading zope.interface-4.1.2.tar.gz (919kB)
    100% |################################| 921kB 199kB/s
Collecting cryptography>=0.2.1 (from pyOpenSSL->Scrapy)
  Downloading cryptography-0.7.2-cp27-none-win32.whl (909kB)
    100% |################################| 913kB 265kB/s
Requirement already satisfied (use --upgrade to upgrade): setuptools in c:\python27\lib\site-packages\setuptools-12.0.6.dev0-py2.7.egg (from zope.interface>=3.6.0->Twisted>=10.0.0->Scrapy)
Collecting enum34 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading enum34-1.0.4.tar.gz
Collecting cffi>=0.8 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading cffi-0.8.6-cp27-none-win32.whl (77kB)
    100% |################################| 77kB 481kB/s
Collecting pyasn1 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading pyasn1-0.1.7.tar.gz (68kB)
    100% |################################| 69kB 660kB/s
Collecting pycparser (from cffi>=0.8->cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading pycparser-2.10.tar.gz (206kB)
    100% |################################| 208kB 505kB/s
Installing collected packages: pycparser, pyasn1, cffi, enum34, cryptography, zope.interface, six, cssselect, pyOpenSSL, lxml, queuelib, w3lib, Twisted
  Running setup.py install for pycparser
  Running setup.py install for pyasn1

  Running setup.py install for enum34

  Running setup.py install for zope.interface
    building 'zope.interface._zope_interface_coptimizations' extension
    ********************************************************************************
    WARNING:
            An optional code optimization (C extension) could not be compiled.
            Optimizations for this package will not be available!
    ()
    Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27
    ********************************************************************************
    Skipping installation of c:\Python27\Lib\site-packages\zope\__init__.py (namespace package)
    Installing c:\Python27\Lib\site-packages\zope.interface-4.1.2-py2.7-nspkg.pth

  Running setup.py install for cssselect
  Running setup.py install for pyOpenSSL




Successfully installed Twisted-15.0.0 cffi-0.8.6 cryptography-0.7.2 cssselect-0.9.1 enum34-1.0.4 lxml-3.4.1 pyOpenSSL-0.14 pyasn1-0.1.7 pycparser-2.10 queuelib-1.2.2 six-1.9.0 w3lib-1.11.0 zope.interface-4.1.2
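

As a quick sanity check, scrapy has a version subcommand; it should report the installed version (the same 0.24.4 that shows up in the crawl log below):

C:\Users\namhadmin>c:\Python27\Scripts\scrapy.exe version
Scrapy 0.24.4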


Using Scrapy


Scrapy Tutorial

  1. c:\Python27\Scripts\scrapy.exe startproject <project_name>
  2. Create a <scrapy.Spider>.py file (a minimal sketch follows this list).
  3. c:\Python27\Scripts\scrapy.exe crawl <scrapy.Spider.name>
  4. Extract what you need with selectors, e.g. response.css('img::attr(src)').extract(); see the tutorial for details.
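

Here is a minimal sketch of step 2 for my use case (collecting every URL on a page). The file name, spider name, and start URL are placeholders; the project layout matches the crawl project from the log below.

# crawl/spiders/urlspider.py -- placeholder file and spider names
import scrapy

class UrlSpider(scrapy.Spider):
    name = "urlspider"    # used in step 3: scrapy.exe crawl urlspider
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # grab the href attribute of every <a> tag on the page
        for url in response.css('a::attr(href)').extract():
            self.log(url)

Running c:\Python27\Scripts\scrapy.exe crawl urlspider then prints each extracted URL into the log output.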



c:\Users\namhadmin\Documents\mine\programming\python\parsingDns\crawl>c:\Python27\Scripts\scrapy.exe crawl dmoz
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2015-02-06 14:50:04+0900 [scrapy] INFO: Scrapy 0.24.4 started (bot: crawl)
2015-02-06 14:50:04+0900 [scrapy] INFO: Optional features available: ssl, http11

2015-02-06 14:50:04+0900 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawl.spiders', 'SPIDER_MODULES': ['crawl.spiders'], 'BOT_NAME': 'crawl'}
2015-02-06 14:50:04+0900 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled item pipelines:
2015-02-06 14:50:05+0900 [dmoz] INFO: Spider opened
2015-02-06 14:50:05+0900 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-06 14:50:05+0900 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-06 14:50:05+0900 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080

2015-02-06 14:50:06+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2015-02-06 14:50:06+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2015-02-06 14:50:06+0900 [dmoz] INFO: Closing spider (finished)
2015-02-06 14:50:06+0900 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 516,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 16342,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 6, 5, 50, 6, 441000),
         'log_count/DEBUG': 4,
         'log_count/INFO': 7,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2015, 2, 6, 5, 50, 5, 490000)}
2015-02-06 14:50:06+0900 [dmoz] INFO: Spider closed (finished)
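

The UserWarning at the top of the crawl output did not stop the crawl; per the message itself, it should go away after installing the module it points to:

c:\Python27\Scripts\pip.exe install service_identity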



HttpFox

For capturing the URLs a web browser requests, Firefox's HttpFox extension is said to be decent.




