Many web server administrators actively monitor incoming HTTP traffic for patterns indicating that an automated tool may be extracting data from the web server at a very high rate. If a site administrator identifies an IP address range from which these traffic patterns originate, they can block any incoming requests from that IP address range.
- A robot can perform activities much faster than a person performing the same clicks/steps. Some website administrators review their web server logs to measure the time between incoming requests from the same IP address. They are looking for requests that arrive too quickly to have come from a person, which suggests that an automated tool sent them.
- If a robot is able to return data very quickly as it iterates over the data on a site, it is often a good idea to insert one or more WAIT commands into the "command stream" so that the rate of activity more closely resembles a pattern that could be generated by a person. Having said that, these remote sites don't publish what criteria they use to identify automated tools interacting with their site, so there is no one size fits all answer as to how to make a robot seem less like a robot.
- Another common trigger is volume: a site administrator may be alerted when the total number of incoming HTTP requests, or the volume of data sent and received, from a given IP address or IP network range exceeds the "normal" usage pattern (defined strictly by the owner of the remote site). That alert can then lead to an IP block.
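The pacing idea behind inserting WAIT commands can be illustrated outside of a robot. The following is a minimal Python sketch, not Kapow Katalyst syntax; the function name `run_paced` and the wait bounds are hypothetical and chosen only for illustration. It pauses for a randomized interval between iterations so that requests do not arrive at a fixed, machine-like rate.

```python
import random
import time


def run_paced(items, action, min_wait=2.0, max_wait=6.0,
              sleep=time.sleep, rng=None):
    """Apply `action` to each item, pausing a random interval between items.

    The wait bounds are illustrative assumptions; remote sites do not
    publish the thresholds they use, so no value is guaranteed to work.
    """
    rng = rng or random.Random()
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Wait before every item except the first one.
            sleep(rng.uniform(min_wait, max_wait))
        results.append(action(item))
    return results
```

For example, `run_paced(urls, fetch_page)` would call a (hypothetical) `fetch_page` function for each URL with a pause of two to six seconds between calls.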
In either case, the only real solution would involve some means of distributing the incoming HTTP requests to the remote server across multiple, unique IP addresses so that they are originating from different sources than those that have previously been marked for IP address blocking. This normally involves utilizing an HTTP Proxy Service of some kind. There are multiple vendor options available for such services which can be found by performing a web search.
RoboServer can be configured to route outbound robot traffic to one or more defined proxy servers, which spreads the outbound HTTP requests across one or more different IP addresses. A video tutorial on how to configure RoboServer to communicate across multiple proxy servers is available on the Technical Support Portal. Within the flow of a robot, it is possible to use a CHANGE PROXY command so that different iterations of a loop send outbound traffic to different proxy servers.
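The effect of switching proxies between loop iterations can be sketched in general terms. The Python below is an illustration of round-robin proxy assignment only; it is not how RoboServer or the CHANGE PROXY command is implemented, and the function names and proxy addresses are hypothetical.

```python
import itertools
import urllib.request


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the proxy list in order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = itertools.cycle(proxies)
    return [(url, next(rotation)) for url in urls]


def opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic via one proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


# Hypothetical proxy addresses, for illustration only.
EXAMPLE_PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]
```

Each `(url, proxy)` pair returned by `assign_proxies` could then be fetched with `opener_for(proxy).open(url)`, so that consecutive requests leave through different IP addresses.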
- The other thing we have encountered in the field is that many sites are now employing a "robots.txt" file to define the specific resources on a site that the site administrators will allow automated tools to traverse and scrape.
- More to the point, the file defines those directories and resources which should NOT be processed by an automated tool.
- While Kapow Katalyst cannot automatically consume a robots.txt file, it is possible to review the contents of a site's robots.txt file, identify the URLs that should NOT be traversed, and add them under "File > Configure Robot > click BASIC tab > click [CONFIGURE] > URL Filter > Blocked URL Patterns". This prevents the robot from following links to those URLs.
Not every website uses robots.txt, but for those that do, it can be helpful for identifying resources that the robot should avoid, which can keep the client from being blacklisted on the site. Site owners look at the IP addresses hitting those restricted URLs, and if they see the same IP address hitting them very quickly, that can get the client (robot, browser, etc.) blocked.
More information about robots.txt can be found online:
URL to article: http://en.wikipedia.org/wiki/Robots_exclusion_standard
Example robots.txt file: http://en.wikipedia.org/robots.txt
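When reviewing a robots.txt file by hand, it can help to check a list of candidate URLs against its rules first. The sketch below uses Python's standard `urllib.robotparser` module for that purpose; it is a review aid only and does not configure Kapow Katalyst. The sample rules, the function name `blocked_urls`, and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def blocked_urls(robots_txt, candidate_urls, user_agent="*"):
    """Return the candidate URLs that the robots.txt rules disallow."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in candidate_urls
            if not parser.can_fetch(user_agent, url)]
```

The URLs returned by `blocked_urls` are the ones to consider adding, after review, to the robot's Blocked URL Patterns; translating them into the pattern syntax that setting expects is not covered here.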
This issue of having an IP address blocked by a remote site is not specific to Kapow Katalyst robots at all. Any tool that can be used to automate the interaction with a remote site can potentially trigger an IP address block by the remote site administrators. Remember, they aren't looking for a specific tool, but rather a specific pattern of usage behavior.
Kapow Software has had clients that scraped a site successfully for years, but if the site's administrators change their monitoring policy and/or the type of activity they define as a violation of their usage policy, a robot can suddenly be blocked even though it had been working without problems.
Clients that pull very large data sets from a remote site (such as those performing competitive price analysis) are among the most likely to have a remote site or source initiate an IP address block against a Kapow robot.
Keywords: Kapow Web Data Server version 7.x, Kapow 8.0, 8.1, 8.2, robot blocked, proxy