Web scraping tips and tricks
I've listed various web scraping tips and tricks below. I have collected these over the years; hopefully you will find them useful.
Tip #1 : don't do it
Scraping should only be considered as a last resort. If the website from which you intend to extract information offers an API, use the API instead. It is easier to parse a nicely formatted JSON response than it is to download an entire web page and wade through verbose and sometimes malformed HTML just to extract a small piece of information.
Tip #2 : check for mobile versions
Tip #3 : avoid regular expressions if possible
While regular expressions are capable of getting the job done, it is recommended that you use more suitable tools like XPath queries or CSS selectors. These will prove easier to write and to maintain in the long run. Besides, you wouldn't want to [accidentally summon a malevolent entity that brings about the end of the world], would you?
Most languages offer some form of HTTP client and, if you're lucky, will also include an XML parser out of the box. PHP natively offers DOMDocument and DOMXPath. While they do not support CSS selectors, it's worth noting that they cope well with malformed HTML. For Java, you can use [JSoup]. Despite being a Java library, it's very pleasant to use as it doesn't adhere to the language's verbose philosophy. In fact, I'd even go so far as to say that it's comparable to [BeautifulSoup], the popular Python library.
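To illustrate the query-based approach, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree. The markup, the class names and the extract_products function are made up for the example. ElementTree only understands a subset of XPath and requires well-formed input, so a real page will often need a more lenient parser such as BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="product">
      <a class="title" href="/items/1">Blue widget</a>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <a class="title" href="/items/2">Red widget</a>
      <span class="price">12.50</span>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return (title, href, price) tuples using XPath-style queries."""
    root = ET.fromstring(markup.strip())
    products = []
    # ElementTree supports a limited XPath subset, including attribute predicates.
    for node in root.findall(".//div[@class='product']"):
        link = node.find("a[@class='title']")
        price = node.find("span[@class='price']")
        products.append((link.text, link.get("href"), float(price.text)))
    return products


print(extract_products(PAGE))
```

Each query names the element it is after, so a change in the page layout usually means adjusting one path rather than untangling a pattern.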
Tip #4 : be polite
This means that you shouldn't be aggressively sending one HTTP request after another. During development, work on a local version until you get the parsing right. Avoid multithreading if possible. Ignoring these steps might trigger the website's immune system, thus forcing you to solve captchas and whatnot.
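One way to space requests out is a small throttle that enforces a minimum delay between calls. The sketch below is only an assumption about how you might structure it: the Throttle class and the two-second delay are illustrative, and the clock and sleep parameters exist solely so that the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_delay seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use (hypothetical download function):
#   throttle = Throttle(2.0)
#   for url in urls:
#       throttle.wait()
#       download(url)
```

Calling wait() right before each request keeps a single-threaded robot at one request per delay period at most, no matter how fast the parsing is.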
Tip #5 : do not ignore robots.txt
This goes hand in hand with the previous tip. Be sure to inspect the robots.txt file and exclude the paths it tells you to skip. This can be done either manually or by periodically parsing the file. Failure to do so might cause your bot to get stuck in a [spider trap].
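Parsing the file does not have to be done by hand: Python ships with urllib.robotparser. The sketch below feeds it a made-up robots.txt body so that it runs offline. The rules, the URLs and the "my-scraper" user agent are all assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /calendar/
Crawl-delay: 5
"""


def build_parser(robots_body):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/admin",
    "https://example.com/calendar/2031/01",
]
print(allowed_urls(parser, "my-scraper", candidates))
print(parser.crawl_delay("my-scraper"))
```

The Crawl-delay value, when the site provides one, is a natural choice for the pause between requests.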
Tip #6 : hide behind seven proxies
This is somewhat unethical, but as a precaution, you might want to make your robot go through a proxy instead of accessing the website directly. On the off chance that your robot gets banned, all you'll need to do after identifying the cause and fixing it is switch to another proxy.
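As a rough sketch of how switching proxies could look in Python, the snippet below wraps urllib.request.ProxyHandler in a small pool. The ProxyPool class is hypothetical and the addresses are placeholders from a documentation-only range, so substitute proxies you are actually allowed to use.

```python
import urllib.request


class ProxyPool:
    """Hold a list of proxies and hand out an opener for the current one."""

    def __init__(self, proxy_urls):
        self._proxies = list(proxy_urls)
        self._index = 0

    @property
    def current(self):
        return self._proxies[self._index]

    def rotate(self):
        """Move on to the next proxy, wrapping around at the end of the list."""
        self._index = (self._index + 1) % len(self._proxies)
        return self.current

    def opener(self):
        """Build an opener that sends HTTP and HTTPS traffic through the current proxy."""
        handler = urllib.request.ProxyHandler(
            {"http": self.current, "https": self.current}
        )
        return urllib.request.build_opener(handler)


# Placeholder addresses (assumption): replace with real proxies.
pool = ProxyPool(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])
opener = pool.opener()
# opener.open("https://example.com/") would now go through the first proxy.
```

After a ban, and once the cause has been fixed, calling pool.rotate() followed by pool.opener() yields an opener that uses the next address.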
Tip #7 : pretend to be human
Some websites discriminate against robots because of bad experiences they might have had with them in the past. But just because a few bots are terrorists doesn't mean all robots are. In fact, the majority of robots are peaceful. Besides, robots are not inherently bad; they are only as bad as their programmer.
But until robots have rights, it is easier for everyone if they swallow their pride and pretend to be human. This can be achieved by mimicking human behavior to a certain degree. Fire up your browser's development tools and study the HTTP requests your browser sends. Take note of the user agent, the referrers, the cookies, the hidden form fields, and the HTTP headers. Make your robot send similar information.
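Here is a minimal Python sketch of two of those ingredients: sending browser-like headers and picking up hidden form fields so they can be posted back. The header values are examples only and should be copied from what your own browser sends. The names BROWSER_HEADERS, HiddenFieldCollector, hidden_fields and build_request are invented for the illustration.

```python
import urllib.request
from html.parser import HTMLParser

# Example values (assumption): copy the real ones from your browser's network tab.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


class HiddenFieldCollector(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields found in a page as a dict."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


def build_request(url, referer=None):
    """Build a request that carries browser-like headers and an optional referrer."""
    headers = dict(BROWSER_HEADERS)
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)
```

Cookies can be handled in the same spirit with http.cookiejar.CookieJar and urllib.request.HTTPCookieProcessor, which store the cookies a site sets and send them back on later requests.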
Tip #8 : the browser's development tools are your best friends
Development tools are a great ally when it comes to web scraping. I'll be referring to Firefox as it is my browser of choice, but most browsers offer some form or another of a development suite.
One of my favorite features is that they allow you to inspect the source and automatically extract CSS selectors. Although they might need some tweaking before becoming generalized, it's already a step in the right direction. The console will let you test them in an interactive manner. For instance, you can run a selector with document.querySelectorAll and inspect the nodes it matches or reason about them programmatically.
Development tools also let you study the HTTP requests that your browser sends in great detail, a feature that comes in handy when dealing with AJAX-heavy websites. Certain websites already retrieve JSON-formatted responses from the backend before displaying them, a behavior that is difficult to detect from just reading the source code of a given page. It quickly becomes apparent, however, once you study the HTTP requests that go back and forth between the browser and the website in question, an operation that development tools allow you to witness in real time. In such a scenario, you can make your robot interact with the backend directly and skip the middleman altogether.
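Talking to such a backend directly could look roughly like the Python sketch below. The payload shape (an "items" list with "id" and "title" keys), the request headers and the function names are assumptions. The real ones are whatever the network tab shows for the site you are studying.

```python
import json
import urllib.request


def build_json_request(url):
    """Build a request resembling the one the page's own scripts send."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Many AJAX libraries add this header; check whether your target expects it.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_items(body):
    """Extract (id, title) pairs from a JSON body of the assumed shape."""
    payload = json.loads(body)
    return [(item["id"], item["title"]) for item in payload["items"]]


def fetch_items(url, opener=None):
    """Download the JSON document at url and return its items."""
    opener = opener or urllib.request.build_opener()
    with opener.open(build_json_request(url), timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))


# Hypothetical response body, as it might appear in the network tab.
SAMPLE_BODY = '{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}], "total": 2}'
print(parse_items(SAMPLE_BODY))
```

No HTML parsing is involved at all, which brings the robot back to the situation described in the first tip.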