Scrape Data from a Lazy Loading Website with Selenium Python
A few months ago, my friend wanted me to write a program to collect the data of one of the NFT collections on the NFTrade site, compute the current price of each NFT in US dollars based on the current market price of the BNB cryptocurrency it was listed for sale in, and compile all of the NFTs for sale into a CSV file that he could sort and manipulate.
Unfortunately, the NFTrade website does not have a public API, so writing a Node.js script to fetch the data from an API and format it as required was not an option. Instead, I needed to write a site scraper that actually visits the web page and "scrapes" the data off of it.
Having not written a web scraper before (and also wanting to make the script easier for my friend to update and run on his own machine), I decided to write the program in Python (it seems to be a very popular language for a task like this). Along the way, my little web scraper’s requirements evolved and got more complex, and I learned a bunch of useful new techniques for using Python in my project, which I intend to share in a series of posts over the coming months.
My first attempt to scrape the data from NFTrade was unsuccessful beyond locating the first 75 NFTs on the page. I figured out this was because NFTrade (as many other websites do) lazy loads NFTs onto the page 75 at a time: once the user has scrolled down far enough to reach the end of the currently visible items, the site loads the next batch of elements onto the page (essentially a fancier version of pagination). So I needed a way to have my web scraper collect whatever data was available on the page, scroll down far enough to trigger more data to load, collect that, and rinse and repeat.
After some trial and error, I finally found a working solution with the help of a Python package named Selenium Python, and today I’ll share how to write your own Python script to scrape data from a lazy loading website with Selenium WebDriver.
NOTE: I am not normally a Python developer, so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.
Selenium with Python package
I had to do a little digging to find a package that could scrape sites with dynamically loaded data, and I ran across the Selenium Python package during my investigation.
Selenium Python is a Python-based API that allows users to write scripts or automated tests using Selenium WebDriver in an intuitive, Python-flavored way. Selenium WebDriver is software that drives a browser natively, as a user would, either locally or on a remote machine. Originally created back in 2004, Selenium has been around in some form for years and is considered one of the earliest tools for automated testing that emulates user actions on a web page (commonly known today as end-to-end testing).
The cool thing about WebDriver, though, is that its uses span beyond automated testing: scripts can also be written to scrape data off of live web pages, and that’s just what I ended up doing in my Python script. So let’s get started.
Install Selenium Python in the Python project
As with most projects, the first thing to do is add the Selenium Python package to the Python project. The easiest way is to use pip to install the Selenium package.
Assuming you have pip on your machine, at the root of your Python project folder, run the following command from a terminal.
pip install selenium
Then, add the selenium package to your requirements.txt file so anyone downloading the repo in the future can install all the necessary project dependencies.
And that’s all it takes to be ready to use WebDriver in your Python script. Simple enough.
Import Selenium WebDriver into Python script
After adding the Selenium Python bindings to the project, it’s time to import Selenium’s WebDriver and some of its helpful configuration options into the actual Python script that does the website scraping. I named my file for_sale_scraper.py since I was specifically looking for NFTs that are for sale (not all of the NFTs listed on NFTrade are; some are just visible but not actually available to purchase), but you can choose any file name that makes sense for you.
Below are the imports I added to my file. I’ll break down what each one is doing below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By

# Standard-library imports that the methods later in the script rely on
import time
from pprint import pprint
The very first import line brings in the selenium.webdriver module, which provides all the WebDriver implementations.
from selenium import webdriver
Next, as I chose Chrome as the browser I wanted WebDriver to interact with (Selenium supports Firefox, Chrome, Edge, and Safari), I imported the Options class from the selenium.webdriver.chrome.options module. This allowed me to add specific config details about how I want the Chrome browser to be set up when the Python script runs against it: things like running in headless mode or disabling extensions.
from selenium.webdriver.chrome.options import Options
I’ll cover the arguments I passed here in detail in the next section.
WebDriverWait, added in the third line of imports, is part of the special sauce that makes WebDriver a good solution for sites like NFTrade that dynamically fetch data on the client side: it lets the script wait, up to a timeout, for a condition to become true before trying to locate an element on the page, which gives the browser time for data to come back from the server and populate the DOM.
from selenium.webdriver.support.wait import WebDriverWait
This type of wait is an "explicit wait": I set a maximum period, and the code polls for a condition during that period before either continuing or timing out. (An "implicit wait" is a separate, driver-wide setting configured with driver.implicitly_wait().)
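To make the explicit wait idea more concrete, here is a minimal sketch that is not part of my original script. An explicit wait only pauses when its until() method is called with a condition, and that condition is just a callable that receives the driver. The helper name card_count_exceeds is hypothetical, and the plain string "xpath" is used on the assumption that it is the value behind By.XPATH, so the snippet can be read without Selenium installed.

```python
def card_count_exceeds(previous_count, xpath):
    """Build a condition for WebDriverWait.until() (hypothetical helper).

    The returned callable receives the driver and returns the new element
    count once it is larger than previous_count, or False to keep polling.
    """
    def condition(driver):
        # "xpath" is assumed to be the string value of By.XPATH
        count = len(driver.find_elements("xpath", xpath))
        return count if count > previous_count else False
    return condition


# Intended use with a real driver (5-second timeout assumed):
#   wait = WebDriverWait(driver, 5)
#   new_count = wait.until(card_count_exceeds(75, '//div[contains(@class, "Item_itemContent__1XIcH")]'))
# until() raises a TimeoutException if the condition never becomes truthy
# before the timeout expires.
```

The appeal of this approach over a fixed pause is that the script can continue as soon as the new batch shows up, instead of always waiting the full period.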
And finally, there is the import for By. By is what allows me to specify how to locate elements on the page, which makes it extremely useful.
The By class defines the supported locator strategies: ID, name, XPath, link text, partial link text, tag name, class name, and CSS selector. Once again, it is a key player when it comes to scraping data off of the web page, as I’ll demonstrate soon.
from selenium.webdriver.common.by import By
Right, all the Selenium WebDriver imports are now present in the Python file. Time to initialize them and get to work.
Add methods to scrape data and lazy load more data
Before WebDriver can begin scraping the data from NFTrade, an instance of the browser that WebDriver will interact with must be instantiated and the proper options supplied to it.
1. Initialize the Selenium WebDriver instance
In my attempt to follow good Python coding practices (again, disclaimer: I don’t write Python as my primary coding language), I created a class for the file named ForSaleNFTScraper, and created an __init__() method immediately inside of it where I created the Chrome WebDriver instance that the rest of the script’s methods can reference.
class ForSaleNFTScraper:
    def __init__(self):
        # Configure Chrome before the browser starts
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--start-maximized')
        self.driver = webdriver.Chrome(options=options)
        # Reusable explicit wait with a 5-second timeout
        self.wait = WebDriverWait(self.driver, 5)
The first thing I did inside of the __init__() method was to add a couple of Chrome browser configs via the Options import from the last section by declaring a new options variable.
options = Options()
Since I wanted this script to run without actually opening a browser window on the user’s local machine, I added the config argument of --headless and the argument of --start-maximized, so the (unseen) window would take up as much screen size as was available (and hopefully load as many NFTs as quickly as possible by doing so).
options.add_argument('--headless')
options.add_argument('--start-maximized')
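One caveat worth sketching here: a headless browser has no physical screen to maximize against, so --start-maximized may not have the effect I was hoping for. A common alternative is Chrome's --window-size switch. The helper below is a hypothetical illustration, not code from my script, and the 1920x1080 default is simply an assumed size.

```python
def chrome_arguments(headless=True, width=1920, height=1080):
    """Return a list of Chrome command-line switches (hypothetical helper)."""
    args = []
    if headless:
        args.append("--headless")
        # No screen to maximize against in headless mode, so an explicit
        # window size (assumed default: 1920x1080) is requested instead.
        args.append(f"--window-size={width},{height}")
    else:
        args.append("--start-maximized")
    return args


# Applying the switches to Selenium's Options object:
#   options = Options()
#   for arg in chrome_arguments():
#       options.add_argument(arg)
```

Keeping the switches in a plain list also makes it easy to flip headless mode off while debugging and watch what the browser is doing.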
Then I passed the new options object to webdriver.Chrome and stored the resulting driver in self.driver (self refers to the class instance, so anything attached to it is accessible from the other methods of ForSaleNFTScraper). I also created a WebDriverWait tied to that driver with a 5-second timeout and stored it in self.wait. Creating the wait object does not pause anything by itself; the timeout only applies when its until() method is called with a condition to wait for.
self.driver = webdriver.Chrome(options=options)
self.wait = WebDriverWait(self.driver, 5)
There’s plenty happening in that first method, but it’s all pretty straightforward once you go through the code line by line and understand what the arguments mean to the Chrome WebDriver instance, and why it’s doing what it’s doing. Now that the WebDriver instance was configured and ready to go, I could write the code fetching the NFT card data, and lazy loading more data once the end of the currently visible info was reached.
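One thing the code shown in this post never does is shut the browser down again, which can leave headless Chrome processes running after the script ends. Here is a small sketch of one way to guarantee cleanup. The managed_driver name is hypothetical; the only Selenium call it relies on is the driver's quit() method.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver and always call quit() afterwards (hypothetical helper)."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # quit() ends the WebDriver session and closes the browser,
        # even if the scraping code raised an exception.
        driver.quit()


# Intended use with a real browser:
#   with managed_driver(lambda: webdriver.Chrome(options=options)) as driver:
#       driver.get("https://nftrade.com/collection/[NFT_COLLECTION_NAME]")
```

An equivalent option inside a class like mine would be a close() method that calls self.driver.quit(), invoked from a try/finally around the scraping.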
2. Write the get_cards() and get_current_card_count() methods
This is where the code really starts to get interesting in my opinion, because it’s where I learned to collect whatever data was currently visible in a (headless) browser and then load more data to add to the list. Pay close attention, because this is where the lazy loading code resides that gets more and more data onto the page.
# Both methods are defined inside the ForSaleNFTScraper class
def get_current_card_count(self):
    """Get the count of cards currently loaded on the page."""
    return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

def get_cards(self, max_card_count=500):
    """Load at least max_card_count cards, then return the card elements."""
    URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
    self.driver.get(URL)
    last_card_count = 0
    # Keep scrolling to the bottom until enough cards have been lazy loaded
    while last_card_count < max_card_count:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # give the next batch time to load
        last_card_count = self.get_current_card_count()
    cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
    return cards
Ok, here we go.
For starters, there are two methods in the code snippet here. The first method, get_current_card_count(), is how I keep track of how many NFT cards in a collection are currently loaded on the page.
As I’ve said, NFTrade lazy loads its NFT collections onto the page to make the initial page load quicker, and when a user scrolls down to the end of the currently loaded batch of elements, the NFTrade page loads more cards into the DOM at that point.
The second method is get_cards(), which handles going to the NFTrade collection URL and scraping all the available card data. It relies on get_current_card_count() to know whether to keep loading more NFT cards, until the desired number of cards has been loaded in the browser to scrape data from.
I’ll talk about get_cards() first as it’s the more complicated of the two methods. The first thing the method does is declare a new variable named URL: this variable is set to the URL of the NFTrade collection page I want WebDriver to navigate to and scrape the data from. I used the Selenium WebDriver driver.get() method to navigate to the page given by the URL.
URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
self.driver.get(URL)
After navigating to the proper URL, I created a variable called last_card_count and set it equal to 0: this variable tracks how many NFTs are currently loaded on the page so it can be compared to the max_card_count argument passed to the get_cards() method (if a number isn’t passed for max_card_count, it defaults to 500).
Below is the key code for lazy loading more and more data in the browser.
Inside of get_cards(), there’s a while loop that compares the last_card_count and max_card_count variables. As long as last_card_count is less than max_card_count, the loop runs. On each pass, WebDriver uses the driver.execute_script() method to scroll to the bottom of the page, the script pauses for 3 seconds with time.sleep() (allowing more cards to load onscreen), and then last_card_count is updated to the new number of cards on the page using the get_current_card_count() method.
last_card_count = 0
while last_card_count < max_card_count:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    last_card_count = self.get_current_card_count()
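One weakness of a loop like this is that it never ends if the collection contains fewer cards than max_card_count, because the count stops growing before the target is reached. The sketch below is a variation rather than the code from my script: it gives up after a few scrolls in a row that load nothing new. The function name and the max_stalled_scrolls parameter are hypothetical, and count_cards stands in for get_current_card_count().

```python
import time


def scroll_until_loaded(driver, count_cards, max_card_count=500,
                        pause_seconds=3, max_stalled_scrolls=3):
    """Scroll until enough cards are loaded or the count stops growing."""
    last_card_count = 0
    stalled_scrolls = 0
    while last_card_count < max_card_count and stalled_scrolls < max_stalled_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the next batch time to load
        current_count = count_cards()
        if current_count > last_card_count:
            stalled_scrolls = 0   # progress was made, reset the counter
        else:
            stalled_scrolls += 1  # nothing new appeared after this scroll
        last_card_count = current_count
    return last_card_count


# Intended use inside get_cards():
#   scroll_until_loaded(self.driver, self.get_current_card_count, max_card_count)
```

Three stalled scrolls is an arbitrary default; a slow connection might call for a higher number or a longer pause.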
And this is a perfect time to segue into discussing the get_current_card_count() method, which is short and sweet.
This method exists simply to find the count of the current elements loaded in the browser, and it does so by combining the WebDriver find_elements() method with the By.XPATH element locator.
Due to how the NFTrade site is built, there are no easily identifiable classes, IDs, or other consistent ways to identify all the cards on the page, so I had to resort to XPath expressions to identify each element and include it in my count to update the last_card_count variable. I cobbled together the XPath below by using Chrome DevTools to inspect the elements on the page and construct the XPath from there through trial and error.
NOTE: What is XPath? If you’re unfamiliar like I was, XPath is a syntax that can be used to navigate through elements and attributes in a standard XML document (or webpage). The link I provided to W3Schools has some good examples of what typical XPath expressions look like and how to interpret them.
So the code inside of the get_current_card_count() method is just one line:
return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))
In the code snippet, I’m getting the count (using the built-in Python function len()) of all the elements on the page that match the XPath of a <div> whose class attribute contains "Item_itemContent__1XIcH", because each NFT on the page is wrapped by a <div> with that class. It’s not the prettiest thing to read and understand, but it gets the job done.
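If the contains(@class, ...) part is unclear, it is a plain substring test against the element's whole class attribute. The sketch below mimics that logic offline with Python's standard-library HTML parser. It is not XPath and not part of my script, and the sample HTML (including the Item_item__3abc class) is made up for illustration.

```python
from html.parser import HTMLParser


class ClassSubstringCounter(HTMLParser):
    """Count <div> tags whose class attribute contains a given substring."""

    def __init__(self, substring):
        super().__init__()
        self.substring = substring
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        class_value = dict(attrs).get("class") or ""
        if tag == "div" and self.substring in class_value:
            self.count += 1


def count_matching_divs(html, substring):
    parser = ClassSubstringCounter(substring)
    parser.feed(html)
    return parser.count


# Made-up markup, loosely shaped like a page of cards
sample_html = """
<div class="Item_item__3abc">
  <div class="Item_itemContent__1XIcH extra">card one</div>
</div>
<div class="Item_itemContent__1XIcH">card two</div>
<span class="Item_itemContent__1XIcH">not a div</span>
"""

print(count_matching_divs(sample_html, "Item_itemContent__1XIcH"))  # prints 2
```

Because the match is a substring test, a shorter fragment such as "Item_itemContent" would match too, which may be handy if the suffix at the end of the class name turns out to change between site builds. That is an assumption on my part about how the site generates its class names.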
And finally, jumping back to the get_cards() method again, once the last_card_count variable has reached or surpassed the max_card_count variable (i.e. enough NFT cards are loaded into the browser), the while loop ends, and all the cards on the page are targeted (using the very same XPath used in the get_current_card_count() method, I might add) and assigned to the cards variable. That variable then gets returned to the __main__ block running the whole script, which I’ll cover next.
cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
return cards
There’s quite a bit going on here, but hopefully it makes more sense now what these methods are doing. Time to test out this lazy loading script functionality and see how WebDriver does.
3. Run the Python script
All right, now that all the code and logic to load multiple sets of NFTrade cards into the browser and collect the data has been written, it’s time to run the code.
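To do that, I added an if __name__ == '__main__' block at the bottom of the file, so the script can be started from the terminal with the following command (assuming the for_sale_scraper.py file name from earlier).
python for_sale_scraper.py
Here is what the __main__ block includes.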
if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    pprint(cards)
    print("Total cards collected:", len(cards))
The first thing the block does is create a new instance of the ForSaleNFTScraper class and assign it to a variable named scraper. It then fetches all the card data and assigns it to a variable named cards by calling scraper.get_cards(max_card_count=200), supplying a max_card_count of 200.
After this step, as a sanity check, I used the Python pprint() and print() functions to print out all the card data and a count of the total cards fetched by the get_cards() method, to ensure all the info I needed to include in the CSV (NFT price, NFT ID, etc.) was available to me. Here’s a screenshot of some of the data printed out in my console, helping me know my code is doing what I expect.
Here is what the raw NFT card data gathered from the `get_cards()` method looks like printed in the terminal.
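Since I set my `max_card_count` to 200, but NFTrade loads NFTs in batches of 75 at a time, it makes sense that the total count of NFTs scraped off the page equals 225: the loop only stops once the count reaches or passes 200, and the third batch is what brings it to 225.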
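The 225 figure can be checked with a tiny simulation of the scroll loop. This is only a sketch under two assumptions: the first batch of 75 is already on the page when it loads, and every scroll loads exactly one more full batch. The function name is hypothetical.

```python
def simulate_scroll_loop(max_card_count, batch_size=75):
    """Mirror the while loop in get_cards() without a browser."""
    on_page = batch_size       # first batch is present on initial page load
    last_card_count = 0        # the real loop also starts counting from 0
    while last_card_count < max_card_count:
        on_page += batch_size  # each scroll is assumed to load one full batch
        last_card_count = on_page
    return last_card_count


print(simulate_scroll_loop(200))  # prints 225
```

The same reasoning explains why the default max_card_count of 500 would end up collecting 525 cards on a collection that has at least that many.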
And after verifying the right data’s there (and the right amount of data as well), I continued on extracting the data, calculating the current price in USD for each NFT, and assembling a CSV of all the data. But I’ll save those steps for future blog posts.
Building a Python-based website scraper to create a CSV of NFTs available for sale on NFTrade was a unique challenge that taught me a lot of new things.
After my first attempt failed due to NFTrade dynamically lazy loading NFTs in batches of 75 onto the page as a user scrolled further down, I had to come up with a more creative solution that would allow me to trigger the site to load more cards on the page first, then grab the data on the cards for sale.
I found the solution I was looking for with the help of a Python package called Selenium Python. Selenium Python is a powerful Python-based API that allows users to write scripts or automated tests leveraging Selenium WebDriver. And it was up to the task at hand: with just a few methods I was able to specify as many NFTs as I wanted loaded on the page before scraping and collecting all their data at once.
Thanks for reading. I hope seeing how to make Python and Selenium WebDriver load data onto a dynamic webpage before scraping it comes in handy for you in the future.
Further References & Resources