I needed to scrape every URL on an HTML page, so I looked for a good tool and found Scrapy, which is written in Python. These are my notes from trying it out.
Installing Scrapy
- Install pip: Installation — pip 6.0.8 documentation
- Upgrade pip: python -m pip install -U pip
- Install setuptools
- pip install Scrapy
- Install pywin32:
C:\Users\namhadmin>"c:\Python27\Scripts\pip.exe" install Scrapy
Requirement already satisfied (use --upgrade to upgrade): Scrapy in c:\python27\lib\site-packages
Collecting Twisted>=10.0.0 (from Scrapy)
  Downloading Twisted-15.0.0-cp27-none-win32.whl (3.1MB)
    100% |################################| 3.1MB 126kB/s
Collecting w3lib>=1.8.0 (from Scrapy)
  Downloading w3lib-1.11.0-py2.py3-none-any.whl
Collecting queuelib (from Scrapy)
  Downloading queuelib-1.2.2-py2.py3-none-any.whl
Collecting lxml (from Scrapy)
  Downloading lxml-3.4.1-cp27-none-win32.whl (3.0MB)
    100% |################################| 3.0MB 62kB/s
Collecting pyOpenSSL (from Scrapy)
  Downloading pyOpenSSL-0.14.tar.gz (128kB)
    100% |################################| 131kB 148kB/s
Collecting cssselect>=0.9 (from Scrapy)
  Downloading cssselect-0.9.1.tar.gz
Collecting six>=1.5.2 (from Scrapy)
  Downloading six-1.9.0-py2.py3-none-any.whl
Collecting zope.interface>=3.6.0 (from Twisted>=10.0.0->Scrapy)
  Downloading zope.interface-4.1.2.tar.gz (919kB)
    100% |################################| 921kB 199kB/s
Collecting cryptography>=0.2.1 (from pyOpenSSL->Scrapy)
  Downloading cryptography-0.7.2-cp27-none-win32.whl (909kB)
    100% |################################| 913kB 265kB/s
Requirement already satisfied (use --upgrade to upgrade): setuptools in c:\python27\lib\site-packages\setuptools-12.0.6.dev0-py2.7.egg (from zope.interface>=3.6.0->Twisted>=10.0.0->Scrapy)
Collecting enum34 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading enum34-1.0.4.tar.gz
Collecting cffi>=0.8 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading cffi-0.8.6-cp27-none-win32.whl (77kB)
    100% |################################| 77kB 481kB/s
Collecting pyasn1 (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading pyasn1-0.1.7.tar.gz (68kB)
    100% |################################| 69kB 660kB/s
Collecting pycparser (from cffi>=0.8->cryptography>=0.2.1->pyOpenSSL->Scrapy)
  Downloading pycparser-2.10.tar.gz (206kB)
    100% |################################| 208kB 505kB/s
Installing collected packages: pycparser, pyasn1, cffi, enum34, cryptography, zope.interface, six, cssselect, pyOpenSSL, lxml, queuelib, w3lib, Twisted
  Running install for pycparser
  Running install for pyasn1
  Running install for enum34
  Running install for zope.interface
    building 'zope.interface._zope_interface_coptimizations' extension
    ********************************************************************************
    WARNING: An optional code optimization (C extension) could not be compiled. Optimizations for this package will not be available!
    ()
    Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from
    ********************************************************************************
    Skipping installation of c:\Python27\Lib\site-packages\zope\ (namespace package)
    Installing c:\Python27\Lib\site-packages\zope.interface-4.1.2-py2.7-nspkg.pth
  Running install for cssselect
  Running install for pyOpenSSL
Successfully installed Twisted-15.0.0 cffi-0.8.6 cryptography-0.7.2 cssselect-0.9.1 enum34-1.0.4 lxml-3.4.1 pyOpenSSL-0.14 pyasn1-0.1.7 pycparser-2.10 queuelib-1.2.2 six-1.9.0 w3lib-1.11.0 zope.interface-4.1.2
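After an install like this, it can help to check which packages Python can actually import. The following is only a rough sketch, assuming Python 3.4 or later (the log above is from Python 2.7). The MODULES list and the check_modules helper are my own illustration, not part of Scrapy or pip. The names are import names, which sometimes differ from the pip package names (pyOpenSSL is imported as OpenSSL, and pywin32 provides win32api).

```python
import importlib.util

# Import names to look for. This list is an assumption for illustration:
# pyOpenSSL is imported as "OpenSSL", and pywin32 provides "win32api".
MODULES = ["scrapy", "twisted", "lxml", "OpenSSL", "win32api", "service_identity"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be located."""
    status = {}
    for name in names:
        try:
            # find_spec locates a top-level module without importing it.
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in check_modules(MODULES).items():
        print("%-18s %s" % (name, "ok" if found else "MISSING"))
```

A module reported as MISSING can usually be installed with pip and the check run again.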
Using Scrapy
Scrapy Tutorial
- c:\Python27\Scripts\scrapy.exe startproject <project_name>
- Create <scrapy.Spider>.py, a file that defines a scrapy.Spider subclass
- c:\Python27\Scripts\scrapy.exe crawl <>
- Extract values with an expression such as response.css('img::attr(src)').extract(). See the tutorial for details.
c:\Users\namhadmin\Documents\mine\programming\python\parsingDns\crawl>c:\Python27\Scripts\scrapy.exe crawl dmoz
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-02-06 14:50:04+0900 [scrapy] INFO: Scrapy 0.24.4 started (bot: crawl)
2015-02-06 14:50:04+0900 [scrapy] INFO: Optional features available: ssl, http11
2015-02-06 14:50:04+0900 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawl.spiders', 'SPIDER_MODULES': ['crawl.spiders'], 'BOT_NAME': 'crawl'}
2015-02-06 14:50:04+0900 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-06 14:50:05+0900 [scrapy] INFO: Enabled item pipelines:
2015-02-06 14:50:05+0900 [dmoz] INFO: Spider opened
2015-02-06 14:50:05+0900 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-06 14:50:05+0900 [scrapy] DEBUG: Telnet console listening on
2015-02-06 14:50:05+0900 [scrapy] DEBUG: Web service listening on
2015-02-06 14:50:06+0900 [dmoz] DEBUG: Crawled (200) <GET> (referer: None)
2015-02-06 14:50:06+0900 [dmoz] DEBUG: Crawled (200) <GET> (referer: None)
2015-02-06 14:50:06+0900 [dmoz] INFO: Closing spider (finished)
2015-02-06 14:50:06+0900 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16342,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 2, 6, 5, 50, 6, 441000),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2015, 2, 6, 5, 50, 5, 490000)}
2015-02-06 14:50:06+0900 [dmoz] INFO: Spider closed (finished)
To capture the URLs that a web browser itself requests, the Firefox add-on HttpFox is reportedly a good choice.
- Can I copy the list of HTTP requests made by a web page out of Firebug’s Net panel?
- HttpFox: