Daily Archive for August 11th, 2009

Problem fetching web pages with pyCurl

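I finally solved this problem. It turned out that when my code built the HTTP headers, it also announced that gzip compression was acceptable, so any page served with gzip compression downloaded fine but could no longer be parsed with BeautifulSoup. As it happens, 1ting.com now supports gzip compression and has also switched to something called nProxy, most likely nginx with the code modified, reconfigured, and recompiled~ Really something~~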

# Use Pycurl
def buildHeaders(browser, referer=""):
    """
    Build HTTP headers so we can download wma files.
    Arguments:
    - `browser`: User-Agent string of the browser to imitate
    - `referer`: Referer URL (left out of the headers when empty)
    """
    # The headers are the same in both cases; only Referer is optional.
    headers = [
        'User-Agent: ' + browser,
        'Accept: text/html, application/xml;q=0.9, audio/x-ms-wma, application/xhtml+xml, image/png, gzip, x-gzip, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1',
        'Accept-Language: en-us',
        # gzip is not listed, and *;q=0 refuses every coding not listed
        'Accept-Encoding: deflate, identity, *;q=0',
        'Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1',
        'Cookie: PIN=G39J3kmH2AU0SBieDgavAg==',
    ]
    if referer != "":
        headers.append('Referer: ' + referer)
    return headers
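
If you would rather keep accepting gzip, the other way out is to decompress the body before handing it to BeautifulSoup. Here is a minimal sketch using only the standard library. `decode_body` is a hypothetical helper, not part of my original script, and it assumes you have already pulled the `Content-Encoding` value out of the response headers yourself.

```python
import gzip
import zlib


def decode_body(body, content_encoding=""):
    """Return the decompressed response body as bytes.

    `body` is the raw payload; `content_encoding` is the value of the
    response's Content-Encoding header (empty if the header is absent).
    """
    encoding = content_encoding.strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # identity or unknown coding: hand the bytes back unchanged


# Simulate a gzip-compressed page and recover the original markup.
html = b"<html><body>1ting</body></html>"
compressed = gzip.compress(html)
assert compressed != html
assert decode_body(compressed, "gzip") == html
```

As far as I know, pycurl can also ask libcurl to do the decompression for you through its encoding option (`pycurl.ENCODING`, named `pycurl.ACCEPT_ENCODING` in newer releases), but I have not verified that against this site, so treat it as a pointer rather than a tested fix.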