In this post I’ll talk about the RSelenium package as a tool to navigate websites, and how it can be combined with the rvest package to scrape dynamic web pages. To understand this post, you’ll need basic knowledge of rvest, HTML and CSS. You can download the full R script HERE!
Observation: Even if you are not familiar with them, I explain everything I did in as much detail as possible. For that reason, those who know about this stuff might find some parts of the post redundant. Feel free to read what you need and skip what you already know!
Let’s compare the following websites:
On IMDb, if you search for a particular movie (for example, this one), you can see that the URL changes, and that URL is different from any other movie (for example, this one). The same behavior is shown if you search for different actors.
On the other hand, if you go to Premier League Player Stats, you will notice that modifying the filters or clicking the pagination button to access more data doesn’t produce changes on the URL.
As I understand it, the first website is an example of a static web page, while the second one is an example of a dynamic webpage.
The following definitions were taken from https://www.pcmag.com/.
Static Web Page: A Web page (HTML page) that contains the same information for all users. Although it may be periodically updated from time to time, it does not change with each user retrieval.
Dynamic Web Page: A Web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.
rvest is a great tool to scrape data from static web pages (check out Creating a Movies Dataset to see an example!). But when it comes to dynamic web pages, rvest alone can’t get the job done. This is when RSelenium joins the party…
Java
You need to have Java installed. You can use Windows’ Command Prompt to check this. Just type java -version and press Enter. You should see something that looks like this:
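For instance, on a machine with Java 8 the output is roughly the following (the exact version numbers will differ on your machine):

```
java version "1.8.0_271"
Java(TM) SE Runtime Environment (build 1.8.0_271-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)
```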
If it throws an error, it might mean that you don’t have Java installed. You can download it from HERE.
R Packages
The following packages need to be installed and loaded in order to run the code written in this post.
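As a sketch, these are the packages the code in this post relies on (the exact list is an assumption based on the functions used later):

```r
library(RSelenium)  # start the Selenium server/browser and drive it from R
library(rvest)      # parse and scrape the page source
library(dplyr)      # data wrangling
library(purrr)      # iteration helpers (map, reduce)
library(stringr)    # string helpers (e.g. str_to_title)
library(tidyr)      # replace_na
```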
Starting a Selenium server and browser is pretty straightforward using rsDriver().
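A minimal sketch of that call (the port number is arbitrary):

```r
rD <- RSelenium::rsDriver(browser = "chrome", port = 4444L)
remDr <- rD$client
```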
However, when you run the code above it may fail with a ChromeDriver version error.
This error is addressed in this StackOverflow post. Basically, it means that there is a mismatch between the ChromeDriver and the Chrome Browser versions. As mentioned in the post, each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683.
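A sketch of the fix, adapted from the idea in that post: read the installed Chrome version and hand a matching ChromeDriver version to rsDriver(). The Chrome path below is an assumption for a default Windows install; adjust it to your machine.

```r
library(stringr)

# Installed Chrome build (Windows; the path is an assumption)
chrome_raw <- system2(
  "wmic",
  args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
  stdout = TRUE
)
chrome_build <- na.omit(str_extract(chrome_raw, "(?<=Version=)\\d+\\.\\d+\\.\\d+"))[1]

# Newest ChromeDriver already downloaded by RSelenium that matches that build
available <- binman::list_versions("chromedriver")[[1]]
matching  <- available[startsWith(available, chrome_build)]
chromever <- as.character(max(as.numeric_version(matching)))

rD    <- RSelenium::rsDriver(browser = "chrome", chromever = chromever)
remDr <- rD$client
```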
The parameter chromever defined this way always uses the latest compatible ChromeDriver version (the code was edited from this StackOverflow post).
After you run rD <- RSelenium::rsDriver(..), if everything worked correctly, a new Chrome window will open. This window should look like this:
You can find more information about rsDriver() in the Basics Vignette.
In this section I’ll apply different methods to the remDr object created above. I’m only going to describe the methods that I think will be used most frequently. For a complete reference, check the package documentation.
- navigate(url): Navigate to a given URL.
- goBack(): Equivalent to hitting the back button on the browser.
- goForward(): Equivalent to hitting the forward button on the browser.
- refresh(): Reload the current page.
- getCurrentUrl(): Retrieve the URL of the current page.
- maxWindowSize(): Set the size of the browser window to maximum. By default, the browser window size is small, and some elements of the website you navigate to might not be available right away (I’ll talk more about this in the next section).
- getPageSource()[[1]]: Get the current page source. This method combined with rvest is what makes it possible to scrape dynamic web pages. The XML document returned by the method can then be read using rvest::read_html(). The method returns a list object; that’s the reason behind the [[1]] (see the short sketch after this list of methods).
- open(silent = FALSE): Send a request to the remote server to instantiate the browser. I use this method when the browser closes for some reason (for example, inactivity). If you have already started the Selenium server, you should run this instead of rD <- RSelenium::rsDriver(..) to re-open the browser.
- close(): Close the current session.
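A short sketch of how these methods combine with rvest (the URL is just an example):

```r
remDr$navigate("https://www.premierleague.com/stats/top/players/goals")
remDr$maxWindowSize()

# Hand the rendered page over to rvest
page <- rvest::read_html(remDr$getPageSource()[[1]])
```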
Working with Elements
- findElement(using, value): Search for an element on the page, starting from the document root. The located element is returned as an object of the webElement class. To use this function you need some basic knowledge of HTML and CSS (or XPath, etc.). The Chrome extension SelectorGadget might help.
- highlightElement(): Utility function to highlight the current element. This helps to check that you selected the element you wanted.
- sendKeysToElement(): Send a sequence of key strokes to an element. The key strokes are sent as a list. Plain text is entered as an unnamed element of the list. Keyboard entries are defined in ‘selKeys‘ and should be listed with the name ‘key‘.
- clearElement(): Clear a TEXTAREA or text INPUT element’s value.
- clickElement(): Click the element. You can click links, check boxes, dropdown lists, etc. (See the short sketch after this list.)
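A sketch of the element workflow; the CSS selectors are made-up examples:

```r
# Find, highlight and click a button
webElem <- remDr$findElement(using = "css selector", value = "button.accept-cookies")
webElem$highlightElement()   # visual check that the right element was selected
webElem$clickElement()

# Type into a search box and submit with the Enter key
searchBox <- remDr$findElement(using = "css selector", value = "input#search")
searchBox$clearElement()
searchBox$sendKeysToElement(list("Premier League", key = "enter"))
```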
Other Methods
Even though I have never used them, I believe these methods are worth mentioning. For more information, check the package documentation.
In this example, I’ll scrape data from Premier League Player Stats. This is what the website looks like:
You will notice that when you modify the Filters, the URL does not change, so you can’t use rvest alone to dynamically scrape this website. Also, if you scroll down to the end of the table you’ll see that there are pagination buttons. If you click them, you get more data, but again, the URL does not change. Here you can see what those pagination buttons look like:
Observation: Even though choosing a different stat does change the URL, I’ll work as if it didn’t.
Target Dataset
The dataset I want will have the following variables:
- Player: Indicates the player name.
- Nationality: Indicates the nationality of the player.
- Season: Indicates the season the stats correspond to.
- Club: Indicates the club the player belonged to in the season.
- Position: Indicates the player position in the season.
- Stats: One column for each Stat.
For simplicity, I’ll scrape data from seasons 2017/18 and 2018/19, and only from the Goals, Assists, Minutes Played, Passes, Shots and Fouls stats. This means that our dataset will have a total of 11 columns.
Before we start…
In order to run the code below, you have to start a Selenium server and browser, and create the remDr object. This procedure was described in the Start Selenium section.
First Steps
The code chunk below navigates to the website, increases the window size to find elements that might be hidden (for example, when the window is small I can’t see the Filters) and then clicks the “Accept Cookies” button.
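A sketch of that chunk; the CSS selector for the cookies button is an assumption and may differ from the one used in the original script:

```r
url <- "https://www.premierleague.com/stats/top/players/goals"
remDr$navigate(url)
Sys.sleep(3)            # give the page time to load
remDr$maxWindowSize()   # enlarge the window so hidden elements (e.g. the Filters) show up

acceptCookies <- remDr$findElement(using = "css selector", value = "button.js-accept-all-close")
acceptCookies$clickElement()
Sys.sleep(2)
```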
You might notice two things:
- The use of the Sys.sleep() function. Here, this function is used to give the website enough time to load. Sometimes, if the element you want to find isn’t loaded when you search for it, an error will be produced.
- The use of CSS selectors. To select an element using CSS you can press F12 and inspect the page source (right-clicking the element and selecting Inspect will show you which part of that code refers to the element) and/or use the Chrome extension SelectorGadget. I recommend learning a little about HTML and CSS and using these two approaches simultaneously. SelectorGadget helps, but sometimes you will need to inspect the source to get exactly what you want. In the next subsection I’ll show how I selected certain elements by inspecting the page source.
Getting Values to Iterate Over
I know that in order to get the data, I’ll have to iterate over different lists of values. In particular, I need a list of stats, seasons, and player positions.
We can use rvest to scrape the website and get these lists. To do so, we need to find the corresponding nodes. As an example, after the code I’ll show where I searched for the required information in the page source for the stats and seasons lists.
The code below uses rvest to create the lists we’ll use in the loops.
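A sketch of those lists. The seasons selector follows the data-dropdown-list attribute shown in the inspected source below; the value FOOTBALL_STATS for the stats dropdown is an assumption:

```r
library(rvest)
library(stringr)

page <- read_html(remDr$getPageSource()[[1]])

# Stats: make them look like the dropdown entries (see the Observation below)
stats_list <- page %>%
  html_nodes("ul[data-dropdown-list='FOOTBALL_STATS'] li") %>%
  html_text() %>%
  str_to_title()

# Seasons: li tags inside the FOOTBALL_COMPSEASON dropdown
seasons_list <- page %>%
  html_nodes("ul[data-dropdown-list='FOOTBALL_COMPSEASON'] li") %>%
  html_attr("data-option-name")
```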
Observation: Even though in the source we don’t see that each word has its first letter uppercased, the dropdown list shows exactly that (for example, we have “Clean Sheets” instead of “Clean sheets”). I was getting an error when trying to scrape these types of stats, and making them look like the dropdown list solved the issue. That’s the reason behind str_to_title().
Stats
This is my view when I open the stats dropdown list and right-click and inspect the Clean Sheets stat.
Taking a closer look at the source where that element is present, we get:
Seasons
This is my view when I open the seasons dropdown list and right-click and inspect the 2016/17 season.
Taking a closer look at the source where that element is present, we get:
As you can see, we have an attribute named data-dropdown-list whose value is FOOTBALL_COMPSEASON, and inside we have li tags where the attribute data-option-name changes for each season. This will be useful when defining how to iterate using RSelenium.
Positions
The logic behind getting the CSS for the positions is similar to the one described above, so I won’t be showing it.
Webscraping Loop
The code has comments on each step, so you can check it out! But before that, I’ll give an overview of the loop.
Preallocate stats vector. This list will have a length equal to the number of stats to be scraped.
For each stat:
- Click the stat dropdown list
- Click the corresponding stat
- Preallocate seasons vector. This list will have a length equal to the number of seasons to be scraped.
- For each season inside stat:
- Click the seasons dropdown list
- Click the corresponding season
- Preallocate positions vector. This list will have length = 4 (positions are fixed: GOALKEEPER, DEFENDER, MIDFIELDER and FORWARD).
- For each position inside season inside stat:
- Click the position dropdown list
- Click the corresponding position
- Check that there is a table with data (if not, go to next position)
- Scrape the first table
- While “Next Page” button exists
- Click “Next Page” button
- Scrape new table
- Append new table to table
- Change stat colname and add position data
- Go to the top of the website
- Rowbind each position table
- Add season data
- Rowbind each season table
- Assign the table to the corresponding stat element.
The result of this loop is a populated list with a number of elements equal to the number of stats scraped. Each of these elements is a tibble.
This may take some time to run, so you can choose fewer stats to try it out.
As I mentioned, you can check the code!
Observation: Be careful when you add more stats to the loop. For example, Clean Sheets has the Position filter hidden, so the code should be modified (for example, by adding some “if” statement).
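As a taste of the key step, here is a sketch of the pagination part of the loop: scrape the visible table, then keep clicking the “Next Page” button while it can still be found. The button selector is an assumption, and on the real site the button may stay present but become inactive, so the original code checks for that; the sketch also stops when the table no longer changes.

```r
library(rvest)
library(dplyr)

scrape_visible_table <- function(remDr) {
  read_html(remDr$getPageSource()[[1]]) %>%
    html_node("table") %>%
    html_table()
}

tab  <- scrape_visible_table(remDr)
prev <- tab

max_pages <- 50   # safety cap so the sketch cannot loop forever
for (i in seq_len(max_pages)) {
  nextBtn <- try(remDr$findElement(using = "css selector", value = "div.paginationNextContainer"),
                 silent = TRUE)
  if (inherits(nextBtn, "try-error")) break   # button not found: stop
  nextBtn$clickElement()
  Sys.sleep(2)

  new_tab <- scrape_visible_table(remDr)
  if (identical(new_tab, prev)) break         # table did not change: last page reached
  tab  <- bind_rows(tab, new_tab)
  prev <- new_tab
}
```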
Data Wrangling
Finally, some data wrangling is needed to create our dataset. data_topStats is a list with 6 elements, and each one of those elements is a tibble. The next code chunk removes the Rank column from each tibble, reorders the columns and then makes a full join by all the non-stat variables using reduce() (the reason behind this full join is that not all players have all stats). In the last line of code I replace NA values with zero in the stats variables.
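A sketch of that wrangling step; the exact column names (e.g. Rank) are taken from the description above and may need adjusting:

```r
library(dplyr)
library(purrr)
library(tidyr)

id_cols <- c("Season", "Position", "Club", "Player", "Nationality")

data_wide <- data_topStats %>%
  map(~ .x %>% select(-Rank) %>% relocate(all_of(id_cols))) %>%   # drop Rank, reorder columns
  reduce(full_join, by = id_cols) %>%                             # full join: not all players have all stats
  mutate(across(-all_of(id_cols), ~ replace_na(.x, 0)))           # stats columns: NA -> 0
```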
This is what the data looks like.
Season | Position | Club | Player | Nationality | Goals | Assists | Minutes Played | Passes | Shots | Fouls |
---|---|---|---|---|---|---|---|---|---|---|
2018/19 | DEFENDER | Brighton and Hove Albion | Shane Duffy | Ireland | 5 | 1 | 3088 | 1305 | 37 | 22 |
2018/19 | DEFENDER | AFC Bournemouth | Nathan Aké | Netherlands | 4 | 0 | 3412 | 1696 | 25 | 28 |
2018/19 | DEFENDER | Cardiff City | Sol Bamba | Cote D’Ivoire | 4 | 1 | 2475 | 550 | 22 | 35 |
2018/19 | DEFENDER | Wolverhampton Wanderers | Willy Boly | France | 4 | 0 | 3168 | 1715 | 24 | 29 |
2018/19 | DEFENDER | Everton | Lucas Digne | France | 4 | 4 | 2966 | 1457 | 34 | 39 |
2018/19 | DEFENDER | Wolverhampton Wanderers | Matt Doherty | Ireland | 4 | 5 | 3147 | 1399 | 46 | 30 |
The framework described here is an approach to working in parallel with RSelenium.
First, we load the libraries we need.
The function defined below stops Selenium on each core.
We determine the number of cores we’ll use. In this example, I use four cores.
We have to list the ports that are going to be used to start Selenium.
We use clusterApply() to start Selenium on each core. Pay attention to the use of the superassignment operator. When you run this function, you will see that four Chrome windows are opened.
This is an example of pages that we will open in parallel. This list will change depending on the particular scenario.
Use parLapply() to work in parallel. When you run this, you will see that each browser opens one website, and one is still blank. This is a simple example; I haven’t defined any scraping, but of course you can!
When you are done, stop Selenium on each core and stop the cluster.
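Putting those steps together, here is a sketch of the whole parallel workflow (ports, number of cores and URLs are illustrative):

```r
library(parallel)

# Function to stop Selenium on each core
stop_selenium <- function(cl) {
  clusterEvalQ(cl, {
    remDr$close()
    rD$server$stop()
  })
}

n_cores <- 4
cl      <- makeCluster(n_cores)
ports   <- list(4444L, 4445L, 4446L, 4447L)

# Start Selenium on each core; note the superassignment (<<-) so that
# rD and remDr live in each worker's global environment
clusterApply(cl, ports, function(port) {
  library(RSelenium)
  rD    <<- rsDriver(browser = "chrome", port = port)
  remDr <<- rD$client
})

# Pages to open in parallel (this list will change depending on your scenario)
pages <- list("https://www.google.com",
              "https://www.premierleague.com",
              "https://stackoverflow.com")

# Each worker navigates to one page; the scraping itself would go here
parLapply(cl, pages, function(page) {
  remDr$navigate(page)
  remDr$getCurrentUrl()
})

# When you are done, stop Selenium on each core and stop the cluster
stop_selenium(cl)
stopCluster(cl)
```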
Observation: Sometimes, when working in parallel some of the browsers close for no apparent reason (or at least a reason that I don’t understand).
Working around the browser closing for no reason
Consider the following scenario: your loop navigates to a certain website, clicks some elements and then gets the page source to scrape using rvest. If in the middle of that loop the browser closes, you will get an error (for example, it won’t navigate to the website, or the element won’t be found). You can work around these errors using tryCatch(), but even if you skip the iteration where the error occurred, the next iteration will fail again when it tries to navigate to the website (because there is no browser open!).
You could, for example, use remDr$open() at the beginning of the loop and remDr$close() at the end, but I think that would open and close many browsers and make the process slower.
So I created this function that handles part of the problem (even though the iteration where the browser closed will not finish, the next one will and the process won’t stop).
It basically tries to get the current URL using remDr$getCurrentUrl(). If no browser is open, this will throw an error, and if we get an error, it will open a browser.
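A sketch of that function (the original may differ in details):

```r
check_browser_open <- function(remDr) {
  tryCatch(
    remDr$getCurrentUrl(),
    error = function(e) {
      # No browser answered, so open a new one
      remDr$open(silent = TRUE)
    }
  )
}
```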
Closing Selenium
Sometimes, even if the browser window is closed, when you re-run rD <- RSelenium::rsDriver(..) you might encounter an error telling you that the port is already in use.
This means that the connection was not completely closed. You can execute the lines of code below to stop Selenium.
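A sketch based on the StackOverflow post mentioned below; the last line kills the Java process running the Selenium server on Windows:

```r
rD$server$stop()
rm(rD)
gc()
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
```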
You can check this StackOverflow post for more information.
Wrapper Functions
You can create functions in order to type less. Suppose that you navigate to a certain website where you have to click one link that sends you to a site with different tabs. You can use something like this:
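Something along these lines; every name and selector here is made up:

```r
navigate_to_tab <- function(remDr, tab_name) {
  remDr$navigate("https://www.some-website.com")
  Sys.sleep(2)

  mainLink <- remDr$findElement(using = "css selector", value = "a.main-link")
  mainLink$clickElement()
  Sys.sleep(2)

  tab <- remDr$findElement(using = "link text", value = tab_name)
  tab$clickElement()
}
```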
Observation: this function is theoretical; it won’t work if you run it.
I won’t show it here, but you can create functions to find elements, check if an element exists in the DOM (Document Object Model), try to click an element if it exists, parse the data table you are interested in, etc. You can check this StackOverflow post for examples.
The following list contains different videos, posts and StackOverflow posts that I found useful when learning and working with RSelenium.
- The ultimate online collection toolbox: Combining RSelenium and Rvest (Part I and Part II). If you know about rvest and just want to learn about RSelenium, I’d recommend watching Part II. It gives an overview of what you can do when combining RSelenium and rvest, and it has nice and practical examples. As a final comment regarding these videos, I wouldn’t pay too much attention to setting up Docker, because at least I didn’t need to work that way in order to get RSelenium going. In fact, at least now, getting it going is pretty straightforward.
- RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium. I found this post really useful when trying to set up RSelenium. The solution given in this StackOverflow post, which is mentioned in the article, seems to be enough.
- Dungeons and Dragons Web Scraping with rvest and RSelenium. This is a great post! It starts with a general tutorial for scraping with rvest and then dives into RSelenium. If you are not familiar with rvest, you can start here.
- RSelenium Tutorial. This post might be helpful too.
- RSelenium Package Website. It has more advanced and detailed content. I just took a look at the Basics Vignette.
These StackOverflow posts helped me when working with dropdown lists:
- RSelenium: server signals port is already in use. This post gives a solution to the “port already in use” problem. Even though it is not marked as best, the last line of code of the second answer is useful.
- Data Scraping in R. Thanks to this post I found the Premier League Stats website, which was exactly what I was looking for to write a post about RSelenium. Also, I took some hints from the answer marked as best.

CSS Tutorials:
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application. Plus you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine, along with the matching Chromedriver executable.
The selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
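For example (assuming a Unix-like shell):

```bash
virtualenv venv
source venv/bin/activate
pip install selenium
```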
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
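A minimal sketch, using the Selenium 3 style API that the rest of this post refers to (the Chromedriver path is a placeholder; adjust it to your machine):

```python
from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'   # assumption: wherever you saved Chromedriver
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.google.com')
```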
This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code). You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), for example on a server, see the following example:
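A sketch of the same thing in headless mode (again with the Selenium 3 style API and a placeholder Chromedriver path):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True                      # run Chrome without a visible window
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(executable_path='/path/to/chromedriver', options=options)
driver.get('https://news.ycombinator.com')
print(driver.page_source)
```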
The driver.page_source will return the full page HTML code.
Here are two other interesting WebDriver properties:
- driver.title gets the page's title
- driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
- Tag name
- Class name
- IDs
- XPath
- CSS selectors
We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS), instead of having to right click + inspect each time:
find_element
There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
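For instance, imagine a made-up snippet like this:

```html
<h1 class="page-title" id="main-title">Super title</h1>
```

Any of the following locators would find it (Selenium 3 style method names; the class and ID come from the snippet above):

```python
h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('page-title')
h1 = driver.find_element_by_id('main-title')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_css_selector('h1.page-title')
```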
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
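For instance (again with the Selenium 3 style names):

```python
all_links = driver.find_elements_by_tag_name('a')   # every <a> tag on the page
for link in all_links:
    print(link.get_attribute('href'))
```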
Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements, here are the most useful:
- Accessing the text of the element with the property element.text
- Clicking on the element with element.click()
- Accessing an attribute with element.get_attribute('class')
- Sending text to an input with element.send_keys('mypassword')
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
This can help you avoid honeypots (for example, filling in hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, an HTML input can have the attribute type=hidden, like this:
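A made-up example of such an input:

```html
<input type="hidden" id="custId" name="custId" value="">
```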
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random value, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That's a classic honeypot.
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
- Go to the login page using driver.get()
- Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
- Follow the same process with the password input
- Click on the login button using element.click()
Should be easy right? Let's see the code:
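A sketch following those steps (Selenium 3 style; the XPath expressions are assumptions based on the Hacker News login form, and USERNAME / PASSWORD are placeholders):

```python
from selenium import webdriver

USERNAME = 'your_username'   # placeholder
PASSWORD = 'your_password'   # placeholder

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://news.ycombinator.com/login')

login = driver.find_element_by_xpath("//input[@type='text']")
login.send_keys(USERNAME)

password = driver.find_element_by_xpath("//input[@type='password']")
password.send_keys(PASSWORD)

submit = driver.find_element_by_xpath("//input[@value='login']")
submit.click()
```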
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
- Check for an error message (like “Wrong password”)
- Check for one element on the page that is only displayed once logged in.
So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check if the element is None, because all of the find_element_by_* methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
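A sketch of that check:

```python
from selenium.common.exceptions import NoSuchElementException

try:
    logout_button = driver.find_element_by_id('logout')
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
```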
Taking a screenshot
We could easily take a screenshot using:
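For example (the file name is arbitrary):

```python
driver.save_screenshot('screenshot.png')
```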
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it's simple and we don't have to worry about these issues.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
- Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
- Use a WebDriverWait object.
If you use time.sleep() you will probably use an arbitrary value. The problem is that you're either waiting too long or not long enough. Also, the website can load slowly on your local Wi-Fi connection but be 10 times faster on your cloud server. With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
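A sketch of such a wait:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.ID, 'mySuperId'))
)
```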
This will wait five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:
- element_to_be_clickable
- text_to_be_present_in_element
You can find more information about this in the Selenium documentation.
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
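For example, scrolling down before the screenshot:

```python
javascript = "window.scrollBy(0, 1000)"   # scroll 1000 pixels down; adjust as needed
driver.execute_script(javascript)
```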
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate them with Selenium. Just don't forget this xkcd: