Tag Archive for 'pycurl'

A problem fetching web pages with pyCurl

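I finally solved this problem. It turned out that the HTTP headers my code built also advertised support for gzip compression, so any page from a server that supports gzip was downloaded compressed and BeautifulSoup could not parse it. It turns out 1ting.com now supports gzip compression, and it has also switched to an nProxy server, which is most likely the nginx source with a modified configuration, recompiled~ Really quite~~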

# Use PycURL
def buildHeaders(browser, referer=""):
    """
    Build the HTTP headers needed to download wma files.
    Arguments:
    - `browser`: User-Agent string to send
    - `referer`: Referer URL (left out of the headers when empty)
    """
    # Accept-Encoding deliberately leaves out gzip so servers will not gzip the response
    headers = [
        'User-Agent: ' + browser,
        'Accept: text/html, application/xml;q=0.9, audio/x-ms-wma, application/xhtml+xml, image/png, gzip, x-gzip, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1',
        'Accept-Language: en-us',
        'Accept-Encoding: deflate, identity, *;q=0',
        'Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1',
        'Cookie: PIN=G39J3kmH2AU0SBieDgavAg==',
    ]
    if referer != "":
        headers.append('Referer: ' + referer)
    return headers
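
Another way to deal with the gzip problem described in this post is to keep accepting compressed responses and decompress the body before handing it to BeautifulSoup. The following is only a minimal sketch using the standard library; the helper name `decode_body` is hypothetical, and it assumes the caller has already captured the response's Content-Encoding value (for example through a header callback).

```python
import gzip
import zlib


def decode_body(raw, content_encoding=""):
    """Return the decompressed body for a given Content-Encoding value.

    `raw` is the response body as bytes; `content_encoding` is the value of
    the Content-Encoding response header, or "" if the header was absent.
    """
    encoding = content_encoding.strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Most servers send a zlib-wrapped deflate stream
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # identity or unknown encoding: hand the bytes back untouched
    return raw
```

If I remember correctly, libcurl can also decompress responses by itself when the `pycurl.ACCEPT_ENCODING` option is set, which would make a helper like this unnecessary; check the option list of the installed version before relying on it.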

PycURL example

Here’s a little sample of Python code demonstrating the use of PycURL, the Python interface to libcurl. It does the same thing as my cURL example. Refer to the libcurl documentation for a detailed list of libcurl options.

import pycurl
from io import BytesIO  # replaces the Python 2 StringIO module

# Constants
DOWNLOADED_FILE = r'C:\temp\downloaded_file.txt'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0)'
LOGIN_URL = 'http://interesting.website.com/LogIn.asp'
LOGIN_POST_DATA = 'FormField=URL%20Encoded%20Value'
DOWNLOAD_URL = 'http://interesting.website.com/do_it.asp?do=0&something=0&interesting=0'
DOWNLOAD_REFERER = 'http://interesting.website.com/referer.asp'
FILE_MODE = 'wb'

# Set up objects
dev_null = BytesIO()  # sink for response bodies we do not need to keep
slurpp = pycurl.Curl()

# Request login page
slurpp.setopt(pycurl.USERAGENT, USER_AGENT)
slurpp.setopt(pycurl.FOLLOWLOCATION, 1)
#slurpp.setopt(pycurl.AUTOREFERER, 1) # not yet implemented in pycURL
slurpp.setopt(pycurl.WRITEFUNCTION, dev_null.write)
slurpp.setopt(pycurl.COOKIEFILE, '')
slurpp.setopt(pycurl.URL, LOGIN_URL)
slurpp.perform()

# Log in to site
slurpp.setopt(pycurl.POSTFIELDS, LOGIN_POST_DATA)
slurpp.setopt(pycurl.POST, 1)
slurpp.perform()

# Download relevant data
slurpp.setopt(pycurl.HTTPGET, 1)
slurpp.setopt(pycurl.URL, DOWNLOAD_URL)
slurpp.setopt(pycurl.REFERER, DOWNLOAD_REFERER)
outfile = open(DOWNLOADED_FILE, FILE_MODE)  # file() no longer exists in Python 3
slurpp.setopt(pycurl.WRITEFUNCTION, outfile.write)
slurpp.perform()

# Clean up and close out
outfile.close()
dev_null.close()
slurpp.close()
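
The `LOGIN_POST_DATA` constant above is already URL-encoded by hand. As a small sketch of how such a string could be generated instead, the standard library can do the encoding; the helper name `build_post_data` is hypothetical, and the field name and value are simply the placeholders from the sample.

```python
from urllib.parse import quote, urlencode


def build_post_data(fields):
    """URL-encode a dict of form fields for use with pycurl.POSTFIELDS.

    quote_via=quote encodes spaces as %20 (the default would use '+'),
    which matches the hand-written constant in the sample.
    """
    return urlencode(fields, quote_via=quote)


LOGIN_POST_DATA = build_post_data({"FormField": "URL Encoded Value"})
```

The resulting string can be passed to `slurpp.setopt(pycurl.POSTFIELDS, LOGIN_POST_DATA)` exactly as in the sample.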