Scrape Data from a Lazy Loading Website with Selenium Python

Created on November 12, 2023 at 11:19 am


Introduction

A few months ago, my friend wanted me to write a program to collect the data of one of the NFT collections on the NFTrade site, compute the current price of each NFT in US dollars based on the current market price of the BNB cryptocurrency it was listed for sale in, and compile all of the NFTs for sale into a CSV file that he could sort and manipulate.

Unfortunately, the NFTrade website does not have a public API, so writing a Node.js script to fetch the data from an API and format it as required was not an option. Instead, I needed to make a site scraper to actually go to the website page and "scrape" the data off of it.

Having not written a web scraper before (and also wanting to make the script easier for my friend to update and run on his own machine), I decided to write the program in Python (it seems to be a very popular programming language choice for a task such as this). Along the way, my little web scraper’s requirements evolved and got more complex, and I learned a bunch of useful new techniques about using Python for my project, which I intend to share in a series of posts over the coming months.

My first attempt to scrape the data from NFTrade was unsuccessful beyond locating the first 75 NFTs on the page. I figured out this was because NFTrade (as many other websites do) lazy loads NFTs onto the page 75 at a time: once the user’s scrolled down far enough to reach the end of the currently visible items, the site loads the next batch of elements onto the page (essentially a fancier version of pagination). So I needed a way to have my web scraper program collect whatever data was available on the page, then scroll down far enough to trigger more data to load and collect that, and rinse and repeat.

After some trial and error, I finally found a working solution with the help of a Python package named Selenium Python, and I’ll share with you today how to write your own Python script to scrape data from a lazy loading website with Selenium WebDriver.

NOTE: I am not normally a Python developer so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.

Selenium with Python package

There are a few different popular Python packages available for web scraping which I tried before reaching for Selenium, but they only worked for static websites that were generated at build time, not for sites that are rendered on the client side via JavaScript, as NFTrade is.

To that end, I had to do a little digging to find a package that could work with scraping sites with dynamically loaded data, and I ran across the Selenium Python package during my investigation.

Selenium Python is a Python-based API that allows users to write scripts or automated tests using Selenium WebDriver in an intuitive, Python-flavored way. Selenium WebDriver, in turn, is software that can drive a browser natively, as a user would, either locally or on a remote machine. Originally created back in 2004, some version of Selenium has been around for years and is considered one of the earliest automated testing tools that emulates user actions on a web page (commonly known today as end-to-end testing).

The cool thing about WebDriver, though, is that its uses span beyond automation testing, as scripts can actually be written to scrape data off of live web pages, and that’s just what I ended up doing in my Python script, so let’s get started.

Install Selenium Python in the Python project

As with most projects, the first thing to do is add the Selenium Python package to the Python project. The easiest way is to use pip to install the Selenium package.

Assuming you have pip on your machine, at the root of your Python project folder, run the following command from a terminal.

pip install selenium

Then, add the selenium package to your requirements.txt file so anyone downloading the repo in the future can install all the necessary project dependencies.

requirements.txt

selenium

And that’s all it takes to be ready to use WebDriver in your Python script. Simple enough.
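As a quick sanity check after installing, a small sketch like the one below can confirm that the package is visible to the interpreter you plan to run the scraper with. The helper name installed_version is hypothetical; it only relies on the standard library's importlib.metadata (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report whether pip actually installed selenium into this environment.
version = installed_version("selenium")
print("selenium version:", version if version else "not installed")
```

If this prints "not installed", pip most likely installed the package into a different Python environment than the one running the script.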

Import Selenium WebDriver into Python script

After adding the Selenium Python bindings to the project, it’s time to import Selenium’s WebDriver and some of its helpful configuration options to the actual Python script that does the website scraping. I named my file for_sale_scraper.py since I was specifically looking for NFTs that are for sale (not all of the NFTs listed on NFTrade are – some are just visible but not actually available to purchase), but you can choose any sort of file name that makes sense for you.

Below are the imports I added to my file. I’ll break down what each one is doing below.

for_sale_scraper.py

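from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By

# Standard-library imports used later in the script
# (time.sleep() in get_cards() and pprint() in the __main__ block)
import time
from pprint import pprint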

The very first import line brings in the selenium.webdriver module and provides all the WebDriver implementations.

from selenium import webdriver

Next, as I chose to use Chrome as the browser I wanted WebDriver to interact with (Selenium supports Firefox, Chrome, Edge, and Safari browsers), I imported the Options class from the selenium.webdriver.chrome.options module. This allowed me to add specific config details about how I want the Chrome browser to be set up when the Python script runs against it: things like headless mode, disabling extensions, etc.

from selenium.webdriver.chrome.options import Options

I’ll cover the arguments I passed here in detail in the next section.

WebDriverWait, added in the third line of imports, is part of the special sauce that makes WebDriver a good solution for sites like NFTrade that dynamically fetch data on the client side: it provides explicit waits, which let the script wait (up to a timeout) for a condition to be met before trying to locate an element on the page. That gives the browser time for data to come back from the server and populate the DOM.

from selenium.webdriver.support.wait import WebDriverWait

This type of wait is an "explicit wait", meaning I manually set a maximum period during which the code keeps checking for a condition before it either continues or gives up with a timeout error.
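To make the idea concrete, here is a minimal pure-Python sketch of what an explicit wait does conceptually: poll a condition until it returns something truthy or a timeout elapses. The wait_until function and its parameters are hypothetical stand-ins written for illustration; Selenium's own WebDriverWait(driver, timeout).until(condition) follows the same pattern and raises its own TimeoutException when time runs out.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose data only shows up on the third check.
checks = {"count": 0}


def data_has_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(data_has_loaded, timeout=2.0, poll_interval=0.01)
```

The benefit over a fixed sleep is that the script continues as soon as the condition holds, and fails loudly if it never does.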

And finally, there is the import for By. By is what allows me to locate elements on the page – it is immeasurably useful and powerful.

The By class provides locator strategies for element IDs, names, XPath expressions, link text, partial link text, tag names, class names, and CSS selectors, and once again, it is a key player when it comes to scraping data off of the web page, as I’ll demonstrate soon.

from selenium.webdriver.common.by import By

Right, all the Selenium WebDriver imports are now present in the Python file, time to initialize them and get to work.

Add methods to scrape data and lazy load more data

Before WebDriver can begin scraping the data from NFTrade, an instance of the browser that WebDriver will interact with must be instantiated and the proper options supplied to it.

1. Initialize the Selenium WebDriver instance

In my attempt to follow good Python coding practices (again, disclaimer: I don’t write Python as my primary coding language), I created a class for the file named ForSaleNFTScraper, and created an __init__() method immediately inside of it where I created the Chrome WebDriver instance that the whole script will be able to reference in the remainder of its methods.

class ForSaleNFTScraper:
    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--start-maximized')
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 5)

The first thing I did inside of the __init__() method was to add a couple of Chrome browser configs via the Options import from the last section by declaring a new options variable.

options = Options()

Since I wanted this script to run without actually opening a browser window on the user’s local machine, I added the config argument of --headless and the argument of --start-maximized, so the (unseen) window would take up as much screen size as was available (and hopefully load as many NFTs as quickly as possible by doing so).

options.add_argument('--headless')
options.add_argument('--start-maximized')
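If the list of browser flags grows, it can help to build it in one place before handing it to Options. The sketch below is one possible arrangement, not the script's actual structure: build_chrome_arguments and make_driver are hypothetical helper names, and the --window-size flag is a commonly used Chrome switch included here as an assumption for cases where a fixed viewport is preferred in headless mode.

```python
def build_chrome_arguments(headless=True, window_size=None):
    """Return the list of Chrome command-line flags to pass to Options."""
    arguments = []
    if headless:
        arguments.append("--headless")
    if window_size is None:
        # Same flag the scraper uses: ask for the largest available window.
        arguments.append("--start-maximized")
    else:
        # Assumed alternative: request an explicit viewport size.
        width, height = window_size
        arguments.append(f"--window-size={width},{height}")
    return arguments


def make_driver(arguments):
    """Create a Chrome WebDriver configured with the given flags."""
    # Imported here so build_chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(build_chrome_arguments())
```

Calling make_driver(build_chrome_arguments()) would then produce a driver configured the same way as in the __init__() method above.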

Then I passed the new options object to the instance of webdriver.Chrome, which was set to the variable of self.driver (self gives every other method in the ForSaleNFTScraper class access to it), and created a WebDriverWait object with a 5-second timeout, stored as self.wait. Note that creating this object does not pause the script by itself; it only waits when its until() method is called with a condition, giving the page up to 5 seconds to load the data before the script attempts to scrape it.

self.driver = webdriver.Chrome(options=options)
self.wait = WebDriverWait(self.driver, 5)

There’s plenty happening in that first method, but it’s all pretty straightforward once you go through the code line by line and understand what the arguments mean to the Chrome WebDriver instance, and why it’s doing what it’s doing. Now that the WebDriver instance was configured and ready to go, I could write the code fetching the NFT card data, and lazy loading more data once the end of the currently visible info was reached.

2. Write the get_cards() and get_current_card_count() methods

This is where the code really starts to get interesting in my opinion, because it’s where I learned to collect whatever data was currently visible in a (headless) browser and then load more data to add to the list. Pay close attention, because this is where the lazy loading code resides that gets more and more data onto the page.

    # Both methods live inside the ForSaleNFTScraper class
    def get_current_card_count(self):
        """Get the count of cards loaded into list of cards."""
        return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

    def get_cards(self, max_card_count=500):
        """Extract and returns card ID and price."""
        URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
        self.driver.get(URL)
        last_card_count = 0
        while last_card_count < max_card_count:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)
            last_card_count = self.get_current_card_count()
        cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
        return cards

Ok, here we go.

For starters, there are two methods that I’m displaying in the code snippet here. The first method, get_current_card_count(), is how I keep track of how many NFT cards in a collection are currently visible on the screen.

As I’ve said, NFTrade lazy loads its NFT collections onto the page to make the initial page load quicker, and when a user scrolls down to the end of the currently loaded batch of elements, the NFTrade page is triggered to load more cards into the DOM at that point in time.

The second method is get_cards(), which handles going to the NFTrade collection URL and scraping all the available card data. It relies on get_current_card_count() to help it know to load more NFT cards until the desired number of cards has been loaded in the browser to scrape data from.

get_cards() method

I’ll talk about get_cards() first as it’s the more complicated of the two methods. The first thing the method does is declare a new variable named URL – this variable is set to the URL of the NFTrade collection page I want WebDriver to navigate to and scrape the data from. I used the Selenium WebDriver driver.get() method to navigate to the page given by the URL.

URL = "https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
self.driver.get(URL)

After navigating to the proper URL, I created a variable called last_card_count and set it equal to 0: this variable will be used to track how many NFTs are currently visible on the page and compare it to the max_card_count variable passed to the get_cards() method (if a number isn’t passed for max_card_count it defaults to 500).

Below is the key code for lazy loading more and more data in the browser

Inside of get_cards(), there’s a while loop set up to compare the last_card_count and max_card_count variables. As long as last_card_count is less than max_card_count, the loop will run, and each time it executes, WebDriver uses the driver.execute_script() method to scroll down the page, waits for 3 seconds with time.sleep() (allowing more cards to load onscreen), and then updates the last_card_count variable to the new number of cards on the page using the get_current_card_count() method.

NOTE: The window.scrollTo() method is critical. driver.execute_script() allows for the synchronous execution of JavaScript in the current window, so when you see the code self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), what’s happening is that WebDriver is using the JavaScript window.scrollTo() method to scroll the browser all the way to the bottom of the page (that’s why document.body.scrollHeight is present – it’s a measurement of the height of the whole document.body page element), which triggers the page to load more NFT cards into view.

last_card_count = 0
while last_card_count < max_card_count:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    last_card_count = self.get_current_card_count()
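One caveat with the loop above: if a collection holds fewer cards than max_card_count, the count never reaches the target and the loop never exits. A possible guard, sketched below under that assumption, is to stop once the card count has not grown for a couple of scrolls in a row. The function name scroll_until_loaded and the max_stalled_rounds parameter are hypothetical; the only WebDriver call it relies on is driver.execute_script(), and count_cards is any callable that returns the current card count (for example, the scraper's get_current_card_count method).

```python
import time


def scroll_until_loaded(driver, count_cards, max_card_count=500,
                        pause_seconds=3.0, max_stalled_rounds=2):
    """Scroll to the bottom until enough cards load or the count stops growing."""
    last_count = count_cards()
    stalled_rounds = 0
    while last_count < max_card_count and stalled_rounds < max_stalled_rounds:
        # Jump to the bottom of the page to trigger the next lazy-loaded batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)
        new_count = count_cards()
        # Count consecutive scrolls that added nothing; reset on any growth.
        stalled_rounds = stalled_rounds + 1 if new_count == last_count else 0
        last_count = new_count
    return last_count
```

Inside get_cards(), this could be called as scroll_until_loaded(self.driver, self.get_current_card_count, max_card_count). The trade-off is that a slow network response could be mistaken for the end of the list, so pause_seconds and max_stalled_rounds may need tuning.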

And this is a perfect time to segue into discussing the get_current_card_count() method, which is short and sweet.

get_current_card_count() method

This method exists simply to find the count of the current elements loaded in the browser, and it does so by combining the WebDriver find_elements() method with the By.XPATH element locator strategy.

Due to how the NFTrade site is built, there are no easily identifiable classes, IDs, or other consistent ways to identify all the cards on the page, so I had to resort to XPath expressions to identify each element and include it in my count to update the last_card_count variable. I cobbled together the XPath below by using Chrome DevTools to inspect the elements on the page and construct the XPath from there through trial and error.

NOTE: What is XPath? If you’re unfamiliar like I was, XPath is a syntax that can be used to navigate through elements and attributes in a standard XML document (or webpage). The W3Schools XPath tutorial has some good examples of what typical XPath expressions look like and how to interpret them.

So the code inside of the get_current_card_count() method is just the one line of code:

return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

In the code snippet, I’m getting the count (using the built-in Python function len()) of all the elements on the page that match the XPath of a <div> containing the class of "Item_itemContent__1XIcH", because each NFT on the page is wrapped by that <div> with that class. It’s not the prettiest thing to read and understand, but it gets the job done.
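Because the same XPath string appears in both methods, one option is to build it once and reuse it. The sketch below uses a hypothetical helper, xpath_for_class_fragment, to do that. It also shows a looser variant that matches only the "Item_itemContent" prefix; this rests on the assumption that the "__1XIcH" suffix is a build-generated hash that could change when the site is redeployed, which I have not verified for NFTrade.

```python
def xpath_for_class_fragment(tag, class_fragment):
    """Build an XPath matching any <tag> whose class attribute contains the fragment."""
    return f'//{tag}[contains(@class, "{class_fragment}")]'


# The exact expression used in the scraper.
EXACT_CARD_XPATH = xpath_for_class_fragment("div", "Item_itemContent__1XIcH")

# Assumed to be more resilient if the hashed suffix ever changes.
LOOSE_CARD_XPATH = xpath_for_class_fragment("div", "Item_itemContent")

print(EXACT_CARD_XPATH)
print(LOOSE_CARD_XPATH)
```

Either constant can be passed as the second argument to find_elements(By.XPATH, ...). The looser match may also pick up unrelated elements whose class names share the prefix, so it is worth checking the count against what the page shows.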

And finally, jumping back to the get_cards() method again, once the last_card_count variable has been updated and has reached or surpassed the max_card_count variable (i.e. enough NFT cards are loaded into the browser), the while loop ends, and all the cards on the screen are targeted (using the very same XPath used in the get_current_card_count() method, I might add) and assigned to the cards variable. That variable then gets returned to the __main__ block running the whole script, which I’ll cover next.

cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
return cards

There’s quite a bit going on here, but hopefully it makes more sense now what these methods are doing. Time to test out this lazy loading script functionality and see how WebDriver does.

3. Run the Python script

All right, now that all the code and logic to load multiple sets of NFTrade cards into the browser and collect the data has been written, it’s time to run the code.

To do that, I added an if __name__ == '__main__': block at the bottom of the file, which runs when the script is started from the terminal with the following command.

python for_sale_scraper.py

Here is what the __main__ block includes.

if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    pprint(cards)
    print("Total cards collected:", len(cards))

The first thing the block does is create a new instance named scraper by calling the ForSaleNFTScraper class. It then proceeds to fetch all the card data and set it equal to a variable named cards by calling the method scraper.get_cards(max_card_count=200), supplying a max_card_count argument of 200.

After this step, as a sanity check, I used the Python pprint() and print() functions to print out all the card data and a count of the total cards fetched by the get_cards() method, and ensure all the info I needed to include in the CSV (NFT price, NFT ID, etc.) was available to me. Here’s a screenshot of some of the data printed out in my console helping me know my code is doing what I expect.

Here is what the raw NFT card data gathered from the `get_cards()` method looks like printed in the terminal.

Since I set my `max_card_count` to 200, but NFTrade loads NFTs in batches of 75 at a time, it makes sense that the total count of NFTs scraped off the page equals 225.
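The arithmetic behind that number can be sketched in a few lines. This is a simplified model, assuming every scroll adds exactly one full batch of 75 and the collection has enough NFTs to fill each batch; the function name cards_loaded is hypothetical. Under those assumptions, the loop stops at the smallest multiple of the batch size that is at least max_card_count.

```python
import math


def cards_loaded(max_card_count, batch_size=75):
    """Smallest multiple of batch_size that is at least max_card_count."""
    batches_needed = math.ceil(max_card_count / batch_size)
    return batches_needed * batch_size


# 200 requested cards need 3 batches of 75: 75 -> 150 -> 225.
print(cards_loaded(200))
```

This ignores edge cases such as a collection running out of cards or a batch not finishing loading within the 3-second pause.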

And after verifying the right data’s there (and the right amount of data as well), I continued on extracting the data, calculating the current price in USD for each NFT, and assembling a CSV of all the data. But I’ll save those steps for future blog posts.
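One housekeeping detail the snippets in this post leave out is closing the browser when the run finishes. A headless Chrome process can keep running in the background after the script ends, so a try/finally around the scraping call is a cheap safeguard. The sketch below assumes a scraper object shaped like ForSaleNFTScraper (a driver attribute and a get_cards() method); the function name collect_and_close is hypothetical, while driver.quit() is the standard WebDriver call for shutting down the browser session.

```python
def collect_and_close(scraper, max_card_count=200):
    """Collect cards, then always shut the browser down, even on errors."""
    try:
        return scraper.get_cards(max_card_count=max_card_count)
    finally:
        # quit() ends the WebDriver session and closes the browser process.
        scraper.driver.quit()
```

Note that the returned WebElement objects can no longer be queried once the session is closed, so in a real run any text or attributes needed for the CSV should be read from the cards before quit() is called, for instance inside get_cards() itself.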

Conclusion

Building a Python-based website scraper to create a CSV of NFTs available for sale on NFTrade was a unique challenge I learned a lot of new things from.

After my first attempt failed due to NFTrade dynamically lazy loading NFTs in batches of 75 onto the page as a user scrolled further down, I had to come up with a more creative solution that would allow me to trigger the site to load more cards on the page first, then grab the data on the cards for sale.

I found the solution I was looking for with the help of a Python package called Selenium Python. Selenium Python is a powerful Python-based API that allows users to write scripts or automated tests leveraging Selenium WebDriver. And it was up to the task at hand: with just a few methods I was able to specify as many NFTs as I wanted loaded on the page before scraping and collecting all their data all at once.

Check back in a few weeks. I’ll be writing more blogs about the problems I had to solve while building this Python website scraper in addition to other topics on JavaScript or something else related to web development.

Thanks for reading. I hope seeing how to make a Python Selenium WebDriver load data onto a dynamic webpage before scraping it comes in handy for you in the future.
