for loop - Crawl again from the page that raised an exception using Python

I use a for loop to crawl web pages, but I encounter an IP request limit error when crawling some of them. I have tried making Python sleep for a few seconds after every 20 pages, but the error persists. I can start crawling again after Python sleeps for 60 seconds.

The problem is that each time there is an exception, I lose a page of information. Python seems to skip over the page that raised the exception because of the try-except block.

I think the best approach would be to restart crawling from the page that raised the exception.

My question is: how can I resume crawling from that page?

import time

pageNum = 0

for page in range(1, 200):
    pageNum += 1
    if pageNum % 20 == 0:  # sleep 180 secs after every 20 pages
        print 'sleep 180 secs'
        time.sleep(180)  # to stay under the IP request limit
    try:
        for item in api.repost_timeline(id=id, count=200, page=page):
            mid = item.id
            # my code here to store data
    except Exception:
        print "IP request limit", page
        time.sleep(60)

2 Answers

  1. Lawrence

    2019-11-15

    Use a stack of pages: pop a page, and if it fails, append it back onto the stack.

    from collections import deque
    
    page_stack = deque()
    for page in range(199, 0, -1):  # push pages so that page 1 is popped first
        page_stack.append(page)
    
    while page_stack:
        page = page_stack.pop()
    
        try:
            pass  # do something with the page here
        except IPLimitException:  # placeholder for your rate-limit exception
            page_stack.append(page)  # push the failed page back to retry it
    

    This code can run into an infinite loop. Depending on your needs, you can keep a threshold on the number of attempts per page: keep a counter and do not append the page back onto the stack once that threshold is exhausted, as in the sketch below.
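
    A minimal sketch of that threshold, assuming a hypothetical fetch_page() crawl function and the same placeholder IPLimitException from the snippet above (neither is defined in the original code):

    from collections import deque
    
    MAX_RETRIES = 3  # give up on a page after this many failed attempts
    
    page_stack = deque()
    for page in range(199, 0, -1):
        page_stack.append(page)
    
    retries = {}  # page number -> failed attempts so far
    
    while page_stack:
        page = page_stack.pop()
        try:
            fetch_page(page)  # hypothetical crawl function
        except IPLimitException:
            retries[page] = retries.get(page, 0) + 1
            if retries[page] < MAX_RETRIES:
                page_stack.append(page)  # retry this page later
            else:
                print 'giving up on page', page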

  2. Leander

    2019-11-15

    To keep the code as close as possible to yours, you could do something like:

    import time
    
    pageNum = 0
    
    for page in range(1, 200):
        pageNum += 1
        if pageNum % 20 == 0:  # sleep 180 secs after every 20 pages
            print 'sleep 180 secs'
            time.sleep(180)  # to stay under the IP request limit
        succeeded = False
        while not succeeded:  # keep retrying the same page until it works
            try:
                for item in api.repost_timeline(id=id, count=200, page=page):
                    mid = item.id
                    # my code here to store data
                succeeded = True
            except Exception:
                print "IP request limit", page
                time.sleep(60)
    

    Of course, you may want to include some sort of retry limit instead of risking an endless loop. By the way, you can also get rid of pageNum (just use page); both changes are sketched below.
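
    For illustration, a minimal sketch of that bounded variant, reusing the api.repost_timeline() call from the question; the cap of 5 retries is an arbitrary choice, not something from the original answer:

    import time
    
    MAX_RETRIES = 5  # arbitrary cap on retries per page
    
    for page in range(1, 200):
        if page % 20 == 0:  # page itself replaces the old pageNum counter
            print 'sleep 180 secs'
            time.sleep(180)
        for attempt in range(MAX_RETRIES):
            try:
                for item in api.repost_timeline(id=id, count=200, page=page):
                    mid = item.id
                    # store data here
                break  # the page succeeded, move on to the next one
            except Exception:
                print "IP request limit", page, "attempt", attempt + 1
                time.sleep(60)
        else:
            print 'giving up on page', page  # all retries failed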
