{"id":156998,"date":"2019-10-07T10:01:18","date_gmt":"2019-10-07T14:01:18","guid":{"rendered":"https:\/\/www.countingpips.com\/?p=156998"},"modified":"2019-10-07T10:01:18","modified_gmt":"2019-10-07T14:01:18","slug":"what-is-web-scraping","status":"publish","type":"post","link":"https:\/\/www.investmacro.com\/forex\/2019\/10\/what-is-web-scraping\/","title":{"rendered":"What is Web Scraping?"},"content":{"rendered":"<div id=\"inves-581161121\" class=\"inves-below-title-posts inves-entity-placement\"><div id =\"posts_date_custom\"><div align=\"left\">October 7, 2019<\/div><hr style=\"border: none; border-bottom: 3px solid black;\">\r\n<\/div><\/div><p><strong>By Zac Clancy for <a href=\"https:\/\/kite.com\" target=\"_blank\" rel=\"noopener noreferrer\">Kite.com<\/a><\/strong><\/p>\n<div class=\"homepage__section\">\n<div class=\"homepage__section__content blog__content\">\n<div class=\"content-block\">\n<h3>Table of Contents<\/h3>\n<ul>\n<li>Introducing web scraping<\/li>\n<li>Some use cases of web scraping<\/li>\n<li>How does it work?<\/li>\n<li>Robots.txt<\/li>\n<li>A simple example<\/li>\n<li>Working with HTML<\/li>\n<li>Data processing<\/li>\n<li>Next steps<\/li>\n<\/ul>\n<h2>Introducing web scraping<\/h2>\n<p>Simply put, web scraping is one of the tools developers use to gather and analyze information from the Internet.<\/p>\n<p>Some websites and platforms offer application programming interfaces (APIs) which we can use to access information in a structured way, but others might not. While APIs are certainly becoming the standard way of interacting with today\u2019s popular platforms, we don\u2019t always have this luxury when interacting with most of the websites on the internet.<\/p>\n<p>Rather than reading data from standard API responses, we\u2019ll need to find the data ourselves by reading the website\u2019s pages and feeds.<\/p>\n<h2>Some use cases of web scraping<\/h2>\n<p>The World Wide Web was born in 1989 and\u00a0<i>web scraping<\/i>\u00a0and\u00a0<i>crawling <\/i>entered the conversation not long after in 1993.<\/p>\n<p>Before scraping, search engines were compiled lists of links collected by the website administrator, and arranged into a long list of links somewhere on their website. The first web scraper and crawler, the <i>World Wide Web Wanderer<\/i>, were created to follow all these indexes and links to try and determine how big the internet was.<\/p>\n<p>It wasn\u2019t long after this that developers started using crawlers and scrapers to create\u00a0<i>crawler-based search engines<\/i>\u00a0that didn\u2019t require human assistance. These crawlers would simply follow links that would come across each page and save information about the page. Since the web is a collaborative effort, the crawler could easily and infinitely follow embedded links on websites to other platforms, and the process would continue forever.<\/p>\n<p>Nowadays, web scraping has its place in nearly every industry. In newsrooms, web scrapers are used to pull in information and trends from thousands of different internet platforms in real time.<\/p>\n<p>Spending a little too much on Amazon this month? 
There are many heavily used, well-built libraries for reading and working with the downloaded HTML response. In the Ruby ecosystem, [Nokogiri](https://nokogiri.org/) is the standard for parsing HTML. For Python, [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) has been the standard for 15 years. These libraries provide simple ways for us to interact with the HTML from our own programs.

These code libraries accept the page source as text, along with a parser for handling the content of the text. They return helper functions and attributes which we can use to navigate through the HTML structure in predictable ways and find the values we're looking to extract.

Scraping projects involve a good amount of time spent analyzing a website's HTML for classes or identifiers which we can use to find information on the page. Using the HTML below, we can begin to imagine a strategy to extract product information from the table using the HTML elements with the classes `products` and `product`.

```html
<table class="products">
  <tr class="product">...</tr>
  <tr class="product">...</tr>
</table>
```
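As a sketch of that strategy (assuming the page source is already in an `html` string; BeautifulSoup itself is introduced properly below), those class names become CSS selectors:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# The "products" and "product" classes are our hooks into the document
for product in soup.select('.products .product'):
    print(product.get_text(strip=True))
```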
In the wild, HTML isn't always as pretty and predictable. Part of the web scraping process is learning about your data and where it lives on the pages as you go along. Some websites go to great lengths to prevent web scraping, some aren't built with scraping in mind, and others just have complicated user interfaces which our crawlers will need to navigate through.

## Robots.txt

While not an enforced standard, it's been common since the early days of web scraping to check for the existence and contents of a robots.txt file on each site before scraping its content. This file can be used to define inclusion and exclusion rules that web scrapers and crawlers should follow while crawling the site. You can check out [Facebook's robots.txt](https://facebook.com/robots.txt) file for a robust example: this file is always located at /robots.txt, so scrapers and crawlers can always look for it in the same spot. Additionally, [GitHub's robots.txt](https://github.com/robots.txt) and [Twitter's](https://twitter.com/robots.txt) are good examples.

An example robots.txt file that prohibits all web scraping and crawling would look like this:

```
User-agent: *
Disallow: /
```

The `User-agent: *` section applies to all web scrapers and crawlers. In Facebook's file, we see that they set `User-agent` to be more explicit, with sections for *Googlebot*, *Applebot*, and others.

The `Disallow: /` line informs web scrapers and crawlers who observe the robots.txt file that they aren't permitted to visit any pages on this site. Conversely, if this line read `Allow: /`, web scrapers and crawlers would be allowed to visit any page on the website.

The robots.txt file can also be a good place to learn about the website's architecture and structure. Reading where our scraping tools are allowed to go (and where they aren't) can point us to sections of the website we didn't know existed, or may not have thought to look at.

If you're running a website or platform, it's important to know that this file isn't always respected by **every** web crawler and scraper. Larger properties like Google, Facebook, and Twitter respect these guidelines with their crawlers and information scrapers, but since robots.txt is considered a best practice rather than an enforceable standard, you may see different results from different parties. It's also important not to disclose private information which you wouldn't want to become public knowledge, like an admin panel on `/admin` or something like that.
## A simple example

To illustrate this, we'll use Python plus the `BeautifulSoup` and [Requests](http://docs.python-requests.org/en/master/) libraries.

```python
import requests
from bs4 import BeautifulSoup

page = requests.get('https://google.com')
soup = BeautifulSoup(page.text, 'html.parser')
```

We'll go through this line by line:

```python
page = requests.get('https://google.com')
```

This uses the `requests` library to make a request to `https://google.com` and return the response.

```python
soup = BeautifulSoup(page.text, 'html.parser')
```

The `requests` library assigns the text of our response to an attribute called `text`, which we use to give `BeautifulSoup` our HTML content. We also tell `BeautifulSoup` to use Python 3's built-in HTML parser, `html.parser`.
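One detail the snippet above glosses over: before parsing, it can be worth confirming the request actually succeeded, so we don't hand `BeautifulSoup` an error page. Requests offers a couple of ways to do this (a hedged addition, not part of the original example):

```python
page = requests.get('https://google.com')

# Raise an exception on 4xx/5xx responses instead of parsing an error page
page.raise_for_status()

# Or inspect the status code directly
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
```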
Now that `BeautifulSoup` has parsed our HTML text into an object that we can interact with, we can begin to see how information may be extracted:

```python
paragraphs = soup.find_all('p')
```

Using `find_all`, we can tell `BeautifulSoup` to return only the HTML paragraph (`<p>`) elements from the document.

If we were looking for a div with a specific ID (`#content`) in the HTML, we could do that in a few different ways:

```python
element = soup.select('#content')
# or
element = soup.find_all('div', id='content')
# or
element = soup.find(id='content')
```

In the Google scenario from above, we can imagine that they have a function that does something similar to grab all the links off of the page for further processing:

```python
links = soup.find_all('a', href=True)
```

The above snippet will return all of the `<a>` elements in the HTML which are acting as links to other pages or websites. Most large-scale web scraping implementations will use a function like this to capture local links on the page and outbound links off of it, and then determine some priority for the links' further processing.
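Those links often come back relative (`/about` rather than a full address), so one common follow-up step (a sketch under our own naming, not any particular crawler's code) is to resolve each one against the page's URL and split local links from outbound ones:

```python
from urllib.parse import urljoin, urlparse

page_url = 'https://example.com/products/'  # hypothetical page we just scraped
site = urlparse(page_url).netloc

local, outbound = [], []
for a in soup.find_all('a', href=True):
    url = urljoin(page_url, a['href'])  # resolves '/about' to 'https://example.com/about'
    if urlparse(url).netloc == site:
        local.append(url)    # same site: likely crawled next
    else:
        outbound.append(url) # different site: queued with its own priority
```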
## Working with HTML

The most difficult aspect of web scraping is analyzing and learning the underlying HTML of the sites you'll be scraping. If an HTML element *has* a consistent ID or set of classes, we should be able to work with it fairly easily: we can just select it using our HTML parsing library (Nokogiri, `BeautifulSoup`, etc.). If the element on the page *doesn't have consistent classes or identifiers*, we'll need to access it using a different selector.

Imagine our HTML page contains the following table, which we'd like to extract product information from:

| NAME   | CATEGORY | PRICE   |
|--------|----------|---------|
| Shirt  | Athletic | $19.99  |
| Jacket | Outdoor  | $124.99 |

`BeautifulSoup` allows us to parse tables and other complex elements fairly simply. Let's look at how we'd read the table's rows in Python:

```python
# Find all the HTML tables on the page
tables = soup.find_all('table')

# Loop through all of the tables
for table in tables:
    # Access the table's body
    table_body = table.find('tbody')
    # Grab the rows from the table body
    rows = table_body.find_all('tr')

    # Loop through the rows
    for row in rows:
        # Extract each HTML column from the row
        columns = row.find_all('td')

        # Loop through the columns
        for column in columns:
            # Print the column value
            print(column.text)
```

The above code snippet would print `Shirt`, followed by `Athletic`, and then `$19.99`, before continuing on to the next table row. While simple, this example illustrates one of the many strategies a developer might take for retrieving data from different HTML elements on a page.

## Data processing

Researching and inspecting the websites you'll be scraping for data is a crucial component of each project. We'll generally have a model that we're trying to fill with data for each page. If we were scraping restaurant websites, we'd probably want to make sure we were collecting the name, address, and hours of operation at least, with other fields added as we're able to find the information. You'll begin to notice that some websites are much easier to scrape for data than others; some are even defensive against it!
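As a sketch of what filling that model can look like (the restaurant fields are just our hypothetical example from above, and `restaurants.csv` is an arbitrary output name), each scraped page might become one record appended to a CSV file with Python's standard `csv` module:

```python
import csv

# One hypothetical record scraped from a restaurant page
restaurant = {
    'name': 'Example Diner',
    'address': '123 Main St',
    'hours': 'Mon-Fri 9am-9pm',
}

# Append the record for later processing; we assume the header row exists already
with open('restaurants.csv', 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'address', 'hours'])
    writer.writerow(restaurant)
```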
Once you've got your data in hand, there are a number of different options for handling, presenting, and accessing it. In many cases you'll probably want to handle the data yourself, but there's a slew of services offered for many use cases by various platforms and companies:

- **Search indexing:** Looking to store the text contents of websites and search them easily? [Algolia](https://www.algolia.com/) and [Elasticsearch](https://github.com/elastic/elasticsearch) are good for that.
- **Text analysis:** Want to extract people, places, money, and other entities from the text? Maybe [spaCy](https://spacy.io/) or Google's [Natural Language API](https://cloud.google.com/natural-language/) are for you.
- **Maps and location data:** If you've collected some addresses or landmarks, you can use [OpenStreetMap](https://www.openstreetmap.org/) or [MapBox](https://www.mapbox.com/) to bring that location data to life.
- **Push notifications:** If you want to get a text message when your web crawler finds a specific result, check out [Twilio](https://www.twilio.com/sms) or [Pusher](https://pusher.com/beams).

## Next steps

In this post, we learned about the basics of web scraping and looked at some simplistic crawling examples which helped demonstrate how we can interact with HTML pages from our own code. Ruby's Nokogiri, Python's `BeautifulSoup`, and JavaScript's [Nightmare](http://www.nightmarejs.org/) are powerful tools to begin learning web scraping with. These libraries are relatively simple to start with, but offer powerful interfaces that extend to more advanced use cases.

Moving forward from this post, try to create a simple web scraper of your own! You could write a simple script that reads a tweet from a URL and prints the tweet text into your terminal. With some practice, you'll find yourself analyzing HTML on all the websites you visit, learning its structure, and understanding how you'd navigate its elements with a web scraper.

This post is a part of Kite's new series on Python. You can check out the code from this and other posts on our [GitHub repository](https://github.com/kiteco/kite-python-blog-post-code).

[This article](https://kite.com/blog/python/what-is-web-scraping/) originally appeared on [Kite.com](https://kite.com) (reprinted with permission).