I use a for loop to crawl web pages. However, I hit an IP request limit error when crawling some pages. I have tried making Python sleep for a few seconds after every 20 pages, but the error persists. Crawling works again after Python sleeps for 60 seconds.
The problem is that each time an exception occurs, I lose that page's information: with the try-except approach, Python simply skips over the failed page.
I suspect the best approach is to resume crawling from the page that raised the exception. My question is: how do I retry crawling from the failed page?
import time

pageNum = 0
for page in range(1, 200):
    pageNum += 1
    if pageNum % 20 == 0:  # every 20 pages, sleep 180 secs
        print 'sleep 180 secs'
        time.sleep(180)  # to avoid the IP request limit
    try:
        for object in api.repost_timeline(id=id, count=200, page=page):
            mid = object.__getattribute__("id")
            # my code here to store data
    except:
        print "Ip request limit", page
        time.sleep(60)  # was sleep.time(60), which raises a NameError
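To illustrate, here is a minimal sketch of the retry pattern I have in mind, where the failed page is attempted again after sleeping instead of being skipped. The names api.repost_timeline and id are from my code above; the while loop is only my assumption about how this could work, not tested code:

import time

for page in range(1, 200):
    if page % 20 == 0:  # every 20 pages, sleep 180 secs
        print 'sleep 180 secs'
        time.sleep(180)
    while True:  # keep retrying the same page until it succeeds
        try:
            for object in api.repost_timeline(id=id, count=200, page=page):
                mid = object.__getattribute__("id")
                # my code here to store data
            break  # page succeeded, move on to the next page
        except:
            print "Ip request limit", page
            time.sleep(60)  # wait, then retry the same page

Is a retry loop like this the right way to do it, or is there a better approach?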