Scrape Data from a Lazy Loading Website with Selenium Python

By admin

Introduction


A few months ago, my friend wanted me to write a program to collect the data of one of the NFT collections on the NFTrade site, compute the current price of each NFT in US dollars based on the current market price of the BNB cryptocurrency it was listed for sale in, and compile all of the NFTs for sale into a CSV file that he could sort and manipulate.

Unfortunately, the NFTrade website does not have a public API, so writing a Node.js script to fetch the data from an API and format it as required was not an option. Instead, I needed to make a site scraper to actually go to the website page and "scrape" the data off of it.

Having not written a web scraper before (and also wanting to make the script easier for my friend to update and run on his own machine), I decided to write the program in Python (it seems to be a very popular programming language choice for a task such as this). Along the way, my little web scraper’s requirements evolved and got more complex, and I learned a bunch of useful new techniques about using Python for my project, which I intend to share in a series of posts over the coming months.

My first attempt to scrape the data from NFTrade was unsuccessful beyond locating the first 75 NFTs on the page. I figured out this was because NFTrade (as many other websites do) lazy loads NFTs onto the page 75 at a time: once the user’s scrolled down far enough to reach the end of the currently visible items, the site loads the next batch of elements onto the page (essentially a fancier version of pagination). So I needed a way to have my web scraper program collect whatever data was available on the page, then scroll down far enough to trigger more data to load and collect that, and rinse and repeat.

After some trial and error, I finally found a working solution with the help of a Python package named Selenium Python, and I’ll share with you today how to write your own Python script to scrape data from a lazy loading website with Selenium WebDriver.

NOTE: I am not normally a Python developer, so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.

Selenium with Python package

There are a few different popular Python packages available for web scraping which I tried before reaching for Selenium, but they only worked for static websites that were generated at build time, not for sites that are generated on the client side via JavaScript, like NFTrade is.

To that end, I had to do a little digging to find a package that could work with scraping sites with dynamically loaded data, and I ran across the Selenium Python package during my investigation.


Selenium Python is a Python-based API that allows users to write scripts or automated tests using Selenium WebDriver in an intuitive, Python-flavored way. And Selenium WebDriver is software that can drive a browser natively, as a user would, either locally or on a remote machine. Originally created back in 2004, some version of Selenium has been around for years and is considered one of the earliest forms of automated testing that emulates user actions on a web page (commonly known today as end-to-end testing).

The cool thing about WebDriver, though, is that its uses span beyond automation testing, as scripts can actually be written to scrape data off of live web pages, and that’s just what I ended up doing in my Python script, so let’s get started.

Install Selenium Python in the Python project

As with most projects, the first thing to do is add the Selenium Python package to the Python project. The easiest way is to use pip to install the Selenium package.

Assuming you have pip on your machine, at the root of your Python project folder, run the following command from a terminal.

pip install selenium

Then, add the selenium package to your requirements.txt file so anyone downloading the repo in the future can install all the necessary project dependencies.

requirements.txt

selenium
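As a side note, once selenium is listed in requirements.txt, anyone cloning the project can install every dependency in one go with the standard pip command (nothing project-specific here):

pip install -r requirements.txt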

And that’s all it takes to be ready to use WebDriver in your Python script. Simple enough.

Import Selenium WebDriver into Python script

After adding the Selenium Python bindings to the project, it’s time to import Selenium’s WebDriver and some of its helpful configuration options into the actual Python script that does the website scraping. I named my file for_sale_scraper.py since I was specifically looking for NFTs that are for sale (not all of the NFTs listed on NFTrade are – some are just visible but not actually available to purchase), but you can choose any sort of file name that makes sense for you.

Below are the imports I added to my file. I’ll break down what each one is doing below.

for_sale_scraper.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By

# Standard library imports used further down in the script:
import time                 # time.sleep() pauses between scrolls in get_cards()
from pprint import pprint   # pretty-prints the scraped card data in __main__

The very first import line brings in the selenium.webdriver module and provides all the WebDriver implementations.

from selenium import webdriver


Next, as I chose to use Chrome as the browser I wanted WebDriver to interact with (Selenium supports Firefox, Chrome, Edge, and Safari browsers), I imported the Options class from the selenium.webdriver.chrome.options module. This allowed me to add specific config details about how I want the Chrome browser to be set up when the Python script runs against it: things like headless mode, disabled extensions, etc.

from selenium.webdriver.chrome.options import Options

I’ll cover the arguments I passed here in detail in the next section.

WebDriverWait, added in the third line of imports, is part of the special sauce that makes WebDriver a good solution for sites like NFTrade that dynamically fetch data on the client side: it allows for implicit and explicit wait times before trying to locate an element on the page, which gives the browser time for data to come back from the server and populate in the DOM.

from selenium.webdriver.support.wait import WebDriverWait

This type of wait is an "explicit wait", meaning I manually set a period during which the code will wait before continuing to try and execute.
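To make that concrete, here is a minimal sketch of an explicit wait (not code from my scraper – it assumes a webdriver instance named driver already exists, and the XPath is just a placeholder). It leans on Selenium's expected_conditions helpers together with the By locator I cover next:

from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one matching element to show up in the DOM;
# a TimeoutException is raised if nothing appears within that window.
first_card = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "some-card-class")]'))
)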

And finally, there is the import for By. By is what allows me to locate elements on the page – it is immeasurably useful and powerful.

The By class accepts element IDs, names, attributes, XPaths, link text, tag names, class names, and CSS selectors, just to name a few, and once again, it is a key player when it comes to scraping data off of the web page, as I’ll demonstrate soon.

from selenium.webdriver.common.by import By
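For reference, here are a few illustrations of the locator strategies By supports – the IDs, classes, selectors, and link text below are invented purely for the example (and assume a webdriver instance named driver), so they don't correspond to anything on NFTrade:

driver.find_element(By.ID, "main-content")                  # locate one element by its id
driver.find_element(By.CSS_SELECTOR, "div.card > a")        # locate one element by CSS selector
driver.find_elements(By.CLASS_NAME, "price")                # locate all elements with a class
driver.find_element(By.LINK_TEXT, "Next page")              # locate a link by its exact text
driver.find_element(By.XPATH, '//button[@type="submit"]')   # locate one element by XPath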

Right, all the Selenium WebDriver imports are now present in the Python file; time to initialize them and get to work.

Add methods to scrape data and lazy load more data

Before WebDriver can begin scraping the data from NFTrade, an instance of the browser that WebDriver will interact with must be instantiated and the proper options supplied to it.


1. Initialize the Selenium WebDriver instance

In my attempt to try to follow good Python coding practices (again, disclaimer: I don’t write Python as my primary coding language), I created a class for the file named ForSaleNFTScraper, and created an __init__() method immediately inside of it where I created the Chrome WebDriver instance that the whole script will be able to reference in the remainder of its methods.

class ForSaleNFTScraper:
    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--start-maximized')
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 5)

The first thing I did inside of the __init__() method was to add a couple of Chrome browser configs via the Options import from the last section by declaring a new options variable.
options = Options()

Since I wanted this script to run without actually opening a browser window on the user’s local machine, I added the config argument of --headless and the argument of --start-maximized, so the (unseen) window would take up as much screen size as was available (and hopefully load as many NFTs as quickly as possible by doing so).

options.add_argument('--headless')
options.add_argument('--start-maximized')
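For what it's worth, add_argument() accepts any Chrome command-line flag, so the browser setup can be tuned further. A couple of other flags that are commonly useful (I didn't end up needing them for this script):

options.add_argument('--window-size=1920,1080')  # give the headless window an explicit size
options.add_argument('--disable-extensions')     # skip loading any installed browser extensions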

Then I passed the new options object to the instance of webdriver.Chrome, which was set to the variable self.driver (self is a variable accessible throughout the rest of the methods within this ForSaleNFTScraper class), and created a WebDriverWait instance for the new driver with a 5-second timeout, stored as self.wait (which gives the rest of the script a way to wait up to 5 seconds for the specified NFTrade web URL to load data onto the page before attempting to scrape it).

self.driver = webdriver.Chrome(options=options)
self.wait = WebDriverWait(self.driver, 5)
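One clarification on self.wait: WebDriverWait isn't a blanket delay – it only pauses when its until() method is called with a condition to poll for. Here's a hypothetical example of putting it to use (not something my script does as written):

# Block for up to 5 seconds until the document reports it has finished loading.
self.wait.until(lambda driver: driver.execute_script("return document.readyState") == "complete")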

There’s plenty happening in that first method, but it’s all pretty straightforward once you go through the code line by line and understand what the arguments mean to the Chrome WebDriver instance, and why it’s doing what it’s doing. Now that the WebDriver instance was configured and ready to go, I could write the code fetching the NFT card data, and lazy loading more data once the end of the currently visible info was reached.


2. Write the get_cards() and get_current_card_count() methods

This is where the code really starts to get interesting in my opinion, because it’s where I learned to collect whatever data was currently visible in a (headless) browser and then load more data to add to the list. Pay close attention, because this is where the lazy loading code resides that gets more and more data onto the page.

def get_current_card_count(self):
    """Get the count of cards loaded into the list of cards."""
    return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

def get_cards(self, max_card_count=500):
    """Extract and return card ID and price."""
    URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
    self.driver.get(URL)

    last_card_count = 0
    while last_card_count < max_card_count:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
        last_card_count = self.get_current_card_count()

    cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
    return cards

Ok, here we go.

For starters, there are two methods that I’m displaying in the code snippet here. The first method, get_current_card_count(), is how I keep track of how many NFT cards in a collection are currently visible on the screen.

As I’ve said, NFTrade lazy loads its NFT collections onto the site to make the initial page load quicker, and when a user scrolls down to the end of the currently loaded batch of elements, the NFTrade page then triggers loading more cards into the DOM at that point in time.

The second method is get_cards(), which handles going to the NFTrade collection URL and scraping all the available card data. It relies on get_current_card_count() to know whether it needs to keep loading more NFT cards until the desired number of cards has been loaded in the browser to scrape data from.

get_cards() method

I’ll talk about get_cards() first as it’s the more complicated of the two methods. The first thing the method does is declare a new variable named URL – this variable is set to the URL of the NFTrade collection page I want WebDriver to navigate to and scrape the data from. I used the Selenium WebDriver driver.get() method to navigate to the page given by the URL.

URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]" self . driver . get ( URL )

After navigating to the proper URL, I created a variable called last_card_count and set it equal to 0: this variable will be used to track how many NFTs are currently visible on the page and compare it to the max_card_count variable passed to the get_cards() method (if a number isn’t passed for max_card_count, it defaults to 500).


Below is the key code for lazy loading more and more data in the browser.

Inside of get_cards(), there’s a while loop set up to compare the last_card_count and max_card_count variables. As long as last_card_count is less than max_card_count, the loop will run; each time it executes, WebDriver uses the driver.execute_script() method to scroll down the page, waits for 3 seconds (allowing more cards to load onscreen), and then updates the last_card_count variable to the new number of cards on the page using the get_current_card_count() method.

NOTE: The window.scrollTo() method is critical. driver.execute_script() allows for the synchronous execution of JavaScript in the current window, so when you see the code self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), what’s happening is that WebDriver is using the JavaScript window.scrollTo() method to scroll the browser all the way to the bottom of the page (that’s why document.body.scrollHeight is present – it’s a measurement of the height of the whole document.body page element), which triggers the page to load more NFT cards into view.

last_card_count = 0
while last_card_count < max_card_count:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    last_card_count = self.get_current_card_count()
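One caveat with the loop as written: if a collection has fewer NFTs than max_card_count, the page eventually stops growing and the loop would keep scrolling forever because last_card_count never reaches max_card_count. Here's a sketch of a variation I could have used (not what my script actually does) that also bails out once the page height stops changing between scrolls:

last_card_count = 0
last_height = self.driver.execute_script("return document.body.scrollHeight")
while last_card_count < max_card_count:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    last_card_count = self.get_current_card_count()
    new_height = self.driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded after the scroll, so stop early
    last_height = new_height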

And this is a perfect time to segue into discussing the get_current_card_count() method, which is short and sweet.

get_current_card_count() method

This method exists simply to find the count of the current elements loaded in the browser, and it does so by combining the WebDriver find_elements() method with the By.XPATH element locator method.

Due to how the NFTrade site is built, there are no easily identifiable classes, IDs, or other consistent ways to identify all the cards on the page, so I had to resort to XPath expressions to identify each element and include it in my count to update the last_card_count variable. I cobbled together the XPath below by using Chrome DevTools to inspect the elements on the page and construct the XPath from there through trial and error.

NOTE: What is XPath? If you’re unfamiliar like I was, XPath is a syntax that can be used to navigate through elements and attributes in a standard XML document (or webpage). The link I provided to W3Schools has some good examples of what typical XPath expressions look like and how to interpret them.
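For a quick flavor of the syntax, here are a few find_elements() calls using invented XPath expressions – none of these selectors correspond to the NFTrade page, they just show common patterns:

self.driver.find_elements(By.XPATH, '//div')                             # every <div> in the document
self.driver.find_elements(By.XPATH, '//div[@id="main"]')                 # <div> elements whose id is exactly "main"
self.driver.find_elements(By.XPATH, '//div[contains(@class, "card")]')   # <div> elements whose class contains "card"
self.driver.find_elements(By.XPATH, '//a[text()="Next"]')                # <a> elements whose text is exactly "Next"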

So the code inside of the get_current_card_count() method is just the one line of code:

return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

In the code snippet, I’m getting the count (using the built-in Python function len()) of all the elements on the page that match the XPath of a <div> containing the class "Item_itemContent__1XIcH", because each NFT on the page is wrapped by a <div> with that class. It’s not the prettiest thing to read and understand, but it gets the job done.

And finally, jumping back to the get_cards() method again, once the last_card_count variable has been updated and has surpassed the max_card_count variable (i.e. enough NFT cards are loaded into the browser), the while loop ends, and all the cards on the screen are targeted (using the very same XPath used in the get_current_card_count() method, I might add) and set equal to the cards variable near the end of the get_cards() method. That variable then gets returned to the __main__ method running the whole script, which I’ll cover next.

cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
return cards
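For context, find_elements() returns a list of Selenium WebElement objects, so each entry in cards can be inspected further. Here's a hypothetical peek at the first one (the actual extraction of price and ID is the subject of a future post):

if cards:
    print(cards[0].text)                    # all the visible text inside the card <div>
    print(cards[0].get_attribute("class"))  # the element's full class attribute string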

There’s quite a bit going on here, but hopefully it makes more sense now what these methods are doing. Time to test out this lazy loading script functionality and see how WebDriver does.


3. Run the Python script

All right, now that all the code and logic to load multiple sets of NFTrade cards into the browser and collect the data has been written, it’s time to run the code.

To do that, I declared a __main__ method at the bottom of the file which can be started from the terminal with the following command.

python for_sale_scraper.py

Here is what the __main__ method includes.

if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    pprint(cards)
    print("Total cards collected:", len(cards))

The first thing the method does is create a new instance variable named scraper by calling the ForSaleNFTScraper() class. It then proceeds to fetch all the card data and set it equal to a variable named cards by calling the method scraper.get_cards(max_card_count=200) and supplying a max_card_count of 200.

After this step, as a sanity check, I used the Python pprint() and print() functions to print out all the card data and a count of the total cards fetched by the get_cards() method, and ensure all the info I needed to include in the CSV (NFT price, NFT ID, etc.) was available to me. Here’s a screenshot of some of the data printed out in my console helping me know my code is doing what I expect.

Here is what the raw NFT card data gathered from the `get_cards()` method looks like printed in the terminal.

Since I set my `max_card_count` to 200, but NFTrade loads NFTs in batches of 75 at a time, it makes sense that the total count of NFTs scraped off the page equals 225: after two batches (150 cards), last_card_count is still below 200, so the loop scrolls one more time, a third batch brings the count to 225, and the while loop exits.

And after verifying the right data’s there (and the right amount of data as well), I continued on extracting the data, calculating the current price in USD for each NFT, and assembling a CSV of all the data. But I’ll save those steps for future blog posts.

Conclusion

Building a Python-based website scraper to create a CSV of NFTs available for sale on NFTrade was a unique challenge I learned a lot of new things from.

After my first attempt failed due to NFTrade dynamically lazy loading NFTs in batches of 75 onto the page as a user scrolled further down, I had to come up with a more creative solution that would allow me to trigger the site to load more cards on the page first, then grab the data on the cards for sale.

I found the solution I was looking for with the help of a Python package called Selenium Python. Selenium Python is a powerful Python-based API that allows users to write scripts or automated tests leveraging Selenium WebDriver. And it was up to the task at hand: with just a few methods I was able to specify as many NFTs as I wanted loaded on the page before scraping and collecting all their data at once.

Check back in a few weeks – I’ll be writing more blogs about the problems I had to solve while building this Python website scraper, in addition to other topics on JavaScript or something else related to web development.

Thanks for reading. I hope seeing how to make a Python Selenium WebDriver load data onto a dynamic webpage before scraping it comes in handy for you in the future.

Further References & Resources