Webscraping Using Selenium




In this post I’ll talk about the RSelenium package as a tool to navigate websites and how it can be combined with the rvest package to scrape dynamic web pages. To understand this post, you’ll need basic knowledge of rvest, HTML and CSS. You can download the full R script HERE!

Observation: Even if you are not familiar with them, I have explained everything I did in as much detail as possible. For that reason, those who know about this stuff might find some parts of the post redundant. Feel free to read what you need and skip what you already know!

Let’s compare the following websites:

On IMDb, if you search for a particular movie (for example, this one), you can see that the URL changes, and that URL is different from any other movie (for example, this one). The same behavior is shown if you search for different actors.

On the other hand, if you go to Premier League Player Stats, you will notice that modifying the filters or clicking the pagination button to access more data doesn’t produce changes on the URL.

As I understand it, the first website is an example of a static web page, while the second one is an example of a dynamic webpage.

The following definitions were taken from https://www.pcmag.com/.

  • Static Web Page: A Web page (HTML page) that contains the same information for all users. Although it may be periodically updated from time to time, it does not change with each user retrieval.

  • Dynamic Web Page: A Web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.

rvest is a great tool to scrape data from static web pages (check out Creating a Movies Dataset to see an example!).

But when it comes to dynamic web pages, rvest alone can’t get the job done. This is when RSelenium joins the party…

Java

You need to have Java installed. You can use Windows’ Command Prompt to check this. Just type java -version and press Enter. You should see something that looks like this:

If it throws an error, it might mean that you don’t have Java installed. You can download it from HERE.

R Packages

The following packages need to be installed and loaded in order to run the code written in this post.

Starting a Selenium server and browser is pretty straightforward using rsDriver().

However, when you run the code above it may produce the following error:

This error is addressed in this StackOverflow post. Basically, it means that there is a mismatch between the ChromeDriver and the Chrome Browser versions. As mentioned in the post, each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683.

The parameter chromever defined this way always uses the latest compatible ChromeDriver version (the code was edited from this StackOverflow post).

After you run rD <- RSelenium::rsDriver(..), if everything worked correctly, a new Chrome window will open. This window should look like this:

You can find more information about rsDriver() in the Basics Vignette.

In this section I’ll apply different methods to the remDr object created above. I’m only going to describe the methods that I think will be used most frequently. For a complete reference, check the package documentation.

  • navigate(url): Navigate to a given url.
  • goBack(): Equivalent to hitting the back button on the browser.
  • goForward(): Equivalent to hitting the forward button on the browser.
  • refresh(): Reload the current page.
  • getCurrentUrl(): Retrieve the url of the current page.
  • maxWindowSize(): Set the size of the browser window to maximum. By default, the browser window size is small, and some elements of the website you navigate to might not be available right away (I’ll talk more about this in the next section).
  • getPageSource()[[1]]: Get the current page source. This method combined with rvest is what makes it possible to scrape dynamic web pages. The XML document returned by the method can then be read using rvest::read_html(). This method returns a list object; that’s the reason behind [[1]].
  • open(silent = FALSE): Send a request to the remote server to instantiate the browser. I use this method when the browser closes for some reason (for example, inactivity). If you have already started the Selenium server, you should run this instead of rD <- RSelenium::rsDriver(..) to re-open the browser.
  • close(): Close the current session.

Working with Elements

  • findElement(using, value). Search for an element on the page, starting from the document root. The located element will be returned as an object of webElement class. To use this function you need some basic knowledge of HTML and CSS (or xpath, etc). This chrome extension, called SelectorGadget, might help.

  • highlightElement(): Utility function to highlight current Element. This helps to check that you selected the wanted element.

  • sendKeysToElement(): Send a sequence of key strokes to an element. The key strokes are sent as a list. Plain text is entered as an unnamed element of the list. Keyboard entries are defined in ‘selKeys’ and should be listed with the name ‘key’.

  • clearElement(): Clear a TEXTAREA or text INPUT element’s value.

  • clickElement(): Click the element. You can click links, check boxes, dropdown lists, etc.

Other Methods


Even though I have never used them, I believe these methods are worth mentioning. For more information, check the package documentation.

In this example, I’ll scrape data from Premier League Player Stats. This is what the website looks like:

You will notice that when you modify the Filters, the URL does not change. So you can’t use rvest alone to dynamically scrape this website. Also, if you scroll down to the end of the table you’ll see that there are pagination buttons. If you click them, you get more data, but again, the URL does not change. Here you can see what those pagination buttons look like:

Observation: Even though choosing a different stat does change the URL, I’ll work as if it didn’t.

Target Dataset

The dataset I want will have the following variables:

  • Player: Indicates the player name.
  • Nationality: Indicates the nationality of the player.
  • Season: Indicates the season the stats correspond to.
  • Club: Indicates the club the player belonged to in the season.
  • Position: Indicates the player position in the season.
  • Stats: One column for each Stat.

For simplicity, I’ll scrape data from seasons 2017/18 and 2018/19, and only from the Goals, Assists, Minutes Played, Passes, Shots and Fouls stats. This means that our dataset will have a total of 11 columns.

Before we start…

In order to run the code below, you have to start a Selenium server and browser, and create the remDr object. This procedure was described in the Start Selenium section.

First Steps

The code chunk below navigates to the website, increases the window size to find elements that might be hidden (for example, when the window is small I can’t see the Filters) and then clicks the “Accept Cookies” button.

You might notice two things:

  • The use of the Sys.sleep() function. Here, this function is used to give the website enough time to load. Sometimes, if the element you want to find isn’t loaded when you search for it, it will produce an error.

  • The use of CSS selectors. To select an element using CSS you can press F12 and inspect the page source (right clicking the element and selecting Inspect will show you which part of that code refers to the element) and/or use this Chrome extension, called SelectorGadget. I recommend learning a little about HTML and CSS and using these two approaches simultaneously. SelectorGadget helps, but sometimes you will need to inspect the source to get exactly what you want. In the next subsection I’ll show how I selected certain elements by inspecting the page source.

Getting Values to Iterate Over

I know that in order to get the data, I’ll have to iterate over different lists of values. In particular, I need a list of stats, seasons, and player positions.

We can use rvest to scrape the website and get these lists. To do so, we need to find the corresponding nodes. As an example, after the code I’ll show where I searched for the required information in the page source for the stats and seasons lists.

The code below uses rvest to create the lists we’ll use in the loops.

Observation: Even though in the source we don’t see each word with its first letter uppercased, when we check the dropdown list we see exactly that (for example, we have “Clean Sheets” instead of “Clean sheets”). I was getting an error when trying to scrape these types of stats, and making them look like the dropdown list solved the issue. That’s the reason behind str_to_title().

Stats

This is my view when I open the stats dropdown list and right click and inspect the Clean Sheets stat.

Taking a closer look at the source where that element is present we get:

Seasons

This is my view when I open the seasons dropdown list and right click and inspect the 2016/17 season.

Taking a closer look at the source where that element is present we get:

As you can see, we have an attribute named data-dropdown-list whose value is FOOTBALL_COMPSEASON and inside we have li tags where the attribute data-option-name changes for each season. This will be useful when defining how to iterate using RSelenium.

Positions

The logic behind getting the CSS for the positions is similar to the one described above, so I won’t be showing it.

Webscraping Loop

The code has comments on each step, so you can check it out! But before that, I’ll give an overview of the loop.

  1. Preallocate stats vector. This list will have a length equal to the number of stats to be scraped.

  2. For each stat:

    1. Click the stat dropdown list
    2. Click the corresponding stat
    3. Preallocate seasons vector. This list will have a length equal to the number of seasons to be scraped.
    4. For each season inside stat:
      1. Click the seasons dropdown list
      2. Click the corresponding season
      3. Preallocate positions vector. This list will have length = 4 (positions are fixed: GOALKEEPER, DEFENDER, MIDFIELDER and FORWARD).
      4. For each position inside season inside stat
        1. Click the position dropdown list
        2. Click the corresponding position
        3. Check that there is a table with data (if not, go to next position)
        4. Scrape the first table
        5. While “Next Page” button exists
          1. Click “Next Page” button
          2. Scrape new table
          3. Append new table to table
        6. Change stat colname and add position data
        7. Go to the top of the website
      5. Rowbind each position table
      6. Add season data
    5. Rowbind each season table
    6. Assign the table to the corresponding stat element.

The result of this loop is a populated list with a number of elements equal to the number of stats scraped. Each of these elements is a tibble.

This may take some time to run, so you can choose fewer stats to try it out.

As I mentioned, you can check the code!

Observation: Be careful when you add more stats to the loop. For example, Clean Sheets has the Position filter hidden, so the code should be modified (for example, by adding some “if” statement).

Data Wrangling

Finally, some data wrangling is needed to create our dataset. data_topStats is a list with 6 elements, each one of those elements is a tibble. The next code chunk removes the Rank column from each tibble, reorders the columns and then makes a full join by all the non-stat variables using reduce() (the reason behind this full join is that not all players have all stats). In the last line of code I replace NA values with zero in the stats variables.

This is what the data looks like.

Season | Position | Club | Player | Nationality | Goals | Assists | Minutes Played | Passes | Shots | Fouls
2018/19 | DEFENDER | Brighton and Hove Albion | Shane Duffy | Ireland | 5 | 1 | 3088 | 1305 | 37 | 22
2018/19 | DEFENDER | AFC Bournemouth | Nathan Aké | Netherlands | 4 | 0 | 3412 | 1696 | 25 | 28
2018/19 | DEFENDER | Cardiff City | Sol Bamba | Cote D’Ivoire | 4 | 1 | 2475 | 550 | 22 | 35
2018/19 | DEFENDER | Wolverhampton Wanderers | Willy Boly | France | 4 | 0 | 3168 | 1715 | 24 | 29
2018/19 | DEFENDER | Everton | Lucas Digne | France | 4 | 4 | 2966 | 1457 | 34 | 39
2018/19 | DEFENDER | Wolverhampton Wanderers | Matt Doherty | Ireland | 4 | 5 | 3147 | 1399 | 46 | 30

The framework described here is an approach to working in parallel with RSelenium.

First, we load the libraries we need.


The function defined below stops Selenium on each core.

We determine the number of cores we’ll use. In this example, I use four cores.

We have to list the ports that are going to be used to start Selenium.

We use clusterApply() to start Selenium on each core. Pay attention to the use of the superassignment operator (<<-). When you run this function, you will see that four Chrome windows are opened.

This is an example of pages that we will open in parallel. This list will change depending on the particular scenario.

Use parLapply() to work in parallel. When you run this, you will see that each browser opens one website, and one is still blank. This is a simple example; I haven’t defined any scraping, but of course you can!

When you are done, stop Selenium on each core and stop the cluster.

Observation: Sometimes, when working in parallel some of the browsers close for no apparent reason (or at least a reason that I don’t understand).

Workaround browser closing for no reason

Consider the following scenario: your loop navigates to a certain website, clicks some elements and then gets the page source to scrape using rvest. If in the middle of that loop the browser closes, you will get an error (for example, it won’t navigate to the website, or the element won’t be found). You can work around these errors using tryCatch(), but even if you skip the iteration where the error occurred, the next iteration will fail again when it tries to navigate to the website (because there is no browser open!).

You could, for example, use remDr$open() at the beginning of the loop, and remDr$close() at the end, but I think that will open and close many browsers and make the process slower.

So I created this function that handles part of the problem (even though the iteration where the browser closed will not finish, the next one will and the process won’t stop).

It basically tries to get the current URL using remDr$getCurrentUrl(). If no browser is open, this will throw an error, and if we get an error, it will open a browser.

Closing Selenium

Sometimes, even if the browser window is closed, when you re-run rD <- RSelenium::rsDriver(..) you might encounter an error like:

This means that the connection was not completely closed. You can execute the lines of code below to stop Selenium.

You can check this StackOverflow post for more information.

Wrapper Functions

You can create functions in order to type less. Suppose that you navigate to a certain website where you have to click one link that sends you to a site with different tabs. You can use something like this:

Observation: this function is theoretical; it won’t work if you run it.

I won’t show it here, but you can create functions to find elements, check if an element exists on the DOM (Document Object Model), try to click an element if it exists, parse the data table you are interested in, etc. You can check this StackOverflow for examples.

The following list contains different videos, posts and StackOverflow posts that I found useful when learning and working with RSelenium.

  • The ultimate online collection toolbox: Combining RSelenium and Rvest ( Part I and Part II ). If you know about rvest and just want to learn about RSelenium, I’d recommend watching Part II. It gives an overview of what you can do when combining RSelenium and rvest. It has nice and practical examples. As a final comment regarding these videos, I wouldn’t pay too much attention to setting up Docker because at least I didn’t need to work that way in order to get RSelenium going. In fact, at least now, getting it going is pretty straightforward.

  • RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium. I found this post really useful when trying to set up RSelenium. The solution given in this StackOverflow post, which is mentioned in the article, seems to be enough.

  • Dungeons and Dragons Web Scraping with rvest and RSelenium. This is a great post! It starts with a general tutorial for scraping with rvest and then dives into RSelenium. If you are not familiar with rvest, you can start here.

  • RSelenium Tutorial. This post might be helpful too.

  • RSelenium Package Website. It has more advanced and detailed content. I just took a look at the Basics Vignette.

  • These StackOverflow posts helped me when working with dropdown lists:

  • RSelenium: server signals port is already in use. This post gives a solution to the “port already in use” problem. Even though it is not marked as best, the last line of code of the second answer is useful.

  • Data Scraping in R. Thanks to this post I found the Premier League Stats website, which was exactly what I was looking for to write a post about RSelenium. Also, I took some hints from the answer marked as best.

  • CSS Tutorials:

In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️ ) in a step-by-step tutorial.

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot

It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application. Plus you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine:

  • selenium package


To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then install it with pip (pip install selenium).

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
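The listing for this step is not part of this copy. A minimal sketch of it, assuming Selenium 4 and a ChromeDriver that Selenium can find, might look like the following. The function names start_chrome and open_page are made up for illustration.

```python
def start_chrome():
    """Launch a regular (headful) Chrome window controlled by Selenium."""
    # Imported here so this sketch loads even where Selenium is not installed.
    from selenium import webdriver

    return webdriver.Chrome()


def open_page(driver, url):
    """Navigate to `url` and return the title of the loaded page."""
    driver.get(url)
    return driver.title


# Typical use (opens a real browser window):
#   driver = start_chrome()
#   print(open_page(driver, "https://news.ycombinator.com/"))
#   driver.quit()
```

Calling driver.quit() at the end closes the browser and ends the WebDriver session.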

This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

You can also run Chrome in headless mode (without any graphical user interface), which is useful when you run it on a server. See the following example:
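As a rough sketch of headless mode, assuming Selenium 4's Options class and Chrome's --headless and --window-size switches, something like this could work. The helper names and the default window size are illustrative choices, not taken from the original post.

```python
def headless_arguments(width=1920, height=1080):
    """Command-line switches for a headless Chrome with a fixed window size."""
    return ["--headless", f"--window-size={width},{height}"]


def start_headless_chrome():
    """Launch Chrome without a visible window."""
    # Imported here so this sketch loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Setting an explicit window size matters more in headless mode, because there is no desktop window whose size you can check by eye.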

The driver.page_source will return the full page HTML code.

Here are two other interesting WebDriver properties:

  • driver.title gets the page's title
  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
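To make these three properties concrete, here is a small sketch that collects them into a dictionary. The helper name describe_page is hypothetical; driver is assumed to be a WebDriver that has already loaded a page.

```python
def describe_page(driver):
    """Collect the title, final URL and HTML size of the current page."""
    return {
        "title": driver.title,
        "url": driver.current_url,  # the URL after any redirections
        "html_length": len(driver.page_source),
    }
```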

Locating Elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name
  • Class name
  • IDs
  • XPath
  • CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS) instead of having to right click + inspect each time:


find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
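The HTML listing is missing from this copy, so the sketch below assumes a page whose only heading is an h1 with class "someclass" and id "greatID" (made-up values). It shows the same element being located with each strategy through find_element(by, value).

```python
# Locator strategies are plain strings. selenium.webdriver.common.by.By exposes
# the same values as By.TAG_NAME, By.CLASS_NAME, By.ID, By.XPATH and
# By.CSS_SELECTOR; they are spelled out here to keep the sketch import-free.
H1_LOCATORS = [
    ("tag name", "h1"),
    ("class name", "someclass"),   # assumed class of the heading
    ("id", "greatID"),             # assumed id of the heading
    ("xpath", "//h1"),
    ("css selector", "h1.someclass"),
]


def find_heading_text(driver):
    """Return the heading text found with each locator strategy in turn."""
    return [driver.find_element(by, value).text for by, value in H1_LOCATORS]
```

In real code you would normally write By.ID and friends rather than the raw strings, which protects you from typos.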

All these methods also have find_elements (note the plural) to return a list of elements.

For example, to get all anchors on a page, use the following:
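A possible version of that call, wrapped in a hypothetical helper that also reads each link's href attribute:

```python
def all_link_targets(driver):
    """Return the href of every anchor on the current page."""
    # "tag name" is the value of By.TAG_NAME in selenium.webdriver.common.by.
    anchors = driver.find_elements("tag name", "a")
    return [anchor.get_attribute("href") for anchor in anchors]
```

Unlike find_element, find_elements returns an empty list when nothing matches instead of raising an exception.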


Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements, here are the most useful:

  • Accessing the text of the element with the property element.text
  • Clicking on the element with element.click()
  • Accessing an attribute with element.get_attribute('class')
  • Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

This can be useful for avoiding honeypots (like filling hidden inputs).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.
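One way to stay clear of such traps, sketched under the assumption that form is a WebElement for the form you want to fill, is to check is_displayed() before typing anything. The helper name is hypothetical.

```python
def fill_visible_inputs(form, value):
    """Type `value` into each input a human could see; skip hidden ones."""
    filled = []
    # "tag name" is the value of By.TAG_NAME in selenium.webdriver.common.by.
    for field in form.find_elements("tag name", "input"):
        if not field.is_displayed():
            continue  # likely a honeypot, e.g. an input with type=hidden
        field.send_keys(value)
        filled.append(field.get_attribute("name"))
    return filled
```

The function returns the names of the inputs it actually filled, which makes it easy to log what the bot touched.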

Full example

Here is a full example using Selenium API methods we just covered.

We are going to log into Hacker News:


In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

  • Go to the login page using driver.get()
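  • Select the username input using driver.find_element_by_* (or driver.find_element(By.…, …) in Selenium 4.3 and later, where the find_element_by_* helpers were removed) and then element.send_keys() to send text to the input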
  • Follow the same process with the password input
  • Click on the login button using element.click()

Should be easy right? Let's see the code:
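The original listing is missing here. A sketch of the four steps could look like this; the login URL and the three XPath expressions are assumptions about the Hacker News markup, so inspect the page before relying on them.

```python
LOGIN_URL = "https://news.ycombinator.com/login"  # assumed login page


def log_in(driver, username, password):
    """Fill in and submit the first login form on the page."""
    driver.get(LOGIN_URL)
    # "xpath" is the value of By.XPATH; the expressions are guesses at the markup.
    driver.find_element("xpath", "//input[@type='text']").send_keys(username)
    driver.find_element("xpath", "//input[@type='password']").send_keys(password)
    driver.find_element("xpath", "//input[@value='login']").click()
```

find_element returns the first match in document order, so if the page contains several forms, the first one is the one that gets filled.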

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

  • Check for an error message (like “Wrong password”)
  • Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None because all of the find_element methods (find_element_by_* in older Selenium versions) raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
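A minimal sketch of that check, assuming the logout element really has the ID "logout" as described above:

```python
def is_logged_in(driver):
    """Return True when the logout element is present on the current page."""
    # Imported here so this sketch loads even where Selenium is not installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        driver.find_element(By.ID, "logout")
    except NoSuchElementException:
        return False
    return True
```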


Taking a screenshot

We could easily take a screenshot using:
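For instance, a small helper along these lines (the name, default path and default window size are illustrative) sets the window size first and then saves a PNG:

```python
def take_screenshot(driver, path="screenshot.png", width=1280, height=800):
    """Resize the window, save a PNG of it, and report whether it succeeded."""
    driver.set_window_size(width, height)
    return driver.save_screenshot(path)
```

save_screenshot returns True on success and False if the file could not be written.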


Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

  • Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
  • Use a WebDriverWait object.

If you use time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting for too long or not enough. Also, the website can load slowly on your local Wi-Fi connection but be 10 times faster on your cloud server. With the WebDriverWait method you wait only as long as it takes for your element/data to be loaded, up to a timeout you choose.
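A sketch of the WebDriverWait approach, using the ID "mySuperId" and a five second timeout as defaults (the function name is hypothetical):

```python
def wait_for_element(driver, element_id="mySuperId", timeout=5):
    """Return the element once it is in the DOM, or raise TimeoutException."""
    # Imported here so this sketch loads even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
```

WebDriverWait polls the condition repeatedly and returns as soon as it holds, so the timeout is an upper bound rather than a fixed delay.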

This will wait up to five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:

  • element_to_be_clickable
  • text_to_be_present_in_element

You can find more information about this in the Selenium documentation.

Executing Javascript

Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
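As an illustration, scrolling to the bottom of the page can be done with execute_script. The snippet of Javascript and the helper name below are just one possible choice.

```python
SCROLL_TO_BOTTOM = "window.scrollTo(0, document.body.scrollHeight);"


def scroll_to_bottom(driver):
    """Run a small piece of Javascript that scrolls to the end of the page."""
    driver.execute_script(SCROLL_TO_BOTTOM)
```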

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate it with Selenium. Just don't forget this xkcd: