First off, full disclosure: I am not an expert in Python or Scrapy, but I find them highly useful in my everyday work. So if you see something that could be done better, please feel free to comment. I love to learn!

I have worked with eCommerce websites and apps a lot over the years, and today many vendors use modern product information management (PIM) systems. A PIM can integrate directly with other PIMs to push product information to distributors, or act as a centralized data store for an eCommerce platform or in-store information system.

Software such as Akeneo allows for the rapid creation of large, curated product catalogs, but software still needs data. Although large vendors seem to be adopting new software faster than ever before, smaller companies still rely on PDFs, web catalogs, and Excel sheets.

Recently, I came across several suppliers for an eCommerce site I am working on that only provided a simple price list. However, they did have a web catalog containing images, descriptions, product brochures, and additional information. More importantly, it was information my client wanted to add to the product listings.

With a little Python, extracting useful product information was quick and easy. That information could then be used to supplement the product descriptions and additional resources through the PIM and, ultimately, the eCommerce platform.

If you are interested in learning more about using Akeneo for product information management, feel free to reach out to me or follow me on LinkedIn for future articles on Akeneo and on using PIM software with a modern eCommerce infrastructure. Today, however, we are going to discuss how to create a spreadsheet like the one below from a vendor's website.

Specifically, we are going to look at how we can scrape data from a series of web pages using Anaconda, Python, and the Scrapy library.

*Be forewarned: I mostly work on Windows, so instructions refer to Windows unless explicitly noted otherwise.

Anaconda is a free package and environment manager for Python. Used primarily in data science, Anaconda is simple to use and makes it easy to set up separate Python environments with access to 1,000+ open source libraries. If you have not installed it before, you can find the setup instructions here: https://docs.anaconda.com/anaconda/install/windows/

The Scrapy library for Python 2 and 3 lets you crawl a website and export the extracted information in various formats such as JSON and CSV. With it you create a "spider" that is programmed to extract data based on the DOM (Document Object Model) of each web page, which gives you access to both the visible and the hidden information in the page markup.

To use Scrapy we are going to create a web scraping project and set up a spider to extract the product information from the web pages found in the project's GitHub repository: https://github.com/jdberkowitz/sample-python-scrapy-web-crawler (sorry for the horrible picture of my teeth in the GitHub link!)

Once you have Anaconda installed, you will want to set up a new environment. Use the Create button at the bottom of the Environments window to create a new Python 3.7 environment and install the pre-selected packages. Then open the terminal window as shown below and run the following command to install Scrapy:

conda install -c conda-forge scrapy
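If you prefer to skip the Navigator GUI, the environment can also be created entirely from the Anaconda Prompt. This is just a minimal sketch; the environment name scrapy-env is arbitrary.

conda create -n scrapy-env python=3.7
conda activate scrapy-env
conda install -c conda-forge scrapy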

You can also install the Scrapy library from the selection box at the top by choosing All in the dropdown and searching for "Scrapy". However, we will use the terminal. After opening the terminal, run the command below to set up a new Scrapy project.

scrapy startproject myproject

Where “myproject” is your project name.

The directory tree and setup files that are automatically generated will look something like this:

myproject/
    scrapy.cfg            # deploy configuration file

    myproject/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # the project items definition file

        middlewares.py    # the project middlewares file

        pipelines.py      # the project pipelines file

        settings.py       # the project settings file

        spiders/          # this is where we put our spider python scripts
            __init__.py

For the sake of this tutorial we only need to worry about creating the spider script. Please keep in mind this is a basic tutorial; for more advanced topics such as custom configurations and settings, automatically crawling links, delaying page requests in batches, or passing extraction parameters as input, please feel free to contact me via LinkedIn or comment!

Inside the GitHub project you will find a couple of HTML files, product-41.html and product-43.html. These are the reference HTML for the web pages at the time this project was written. The CSS selectors may change if the pages are updated, and the vendor appears to update them every 3-4 years.

The CSV file nsl_all_product_familys.csv is the result of the project. It contains all the information extracted with the script, exactly as output and without any additional formatting.

To get started, go into the "myproject/myproject/spiders" folder and create a new file for your spider. I called mine product_data.py.

Next we build a simple spider script by importing scrapy and creating a subclass of scrapy.Spider.

import scrapy

class ProductsSpider(scrapy.Spider):

This gives us access to two variables to set for the spider. First is the "name", which we assign to the spider and reference on the command line when running the scraper.

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"

Additionally, we have a start_urls list which holds the URLs we want our spider to crawl. I used a manually entered list, but you could also generate the URLs or follow links to additional pages (links to the next page, for example); a sketch of that approach appears after the listing below.

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]
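As mentioned above, you could follow links instead of listing the URLs by hand. Here is a minimal sketch of that approach; the listing URL and the a.product-link and a.next-page selectors are hypothetical placeholders, not selectors taken from this vendor's site.

import scrapy

class ProductsFollowSpider(scrapy.Spider):
    name = "products_follow"
    # Hypothetical listing page that links out to the individual product pages
    start_urls = ['https://www.nslusa.com/products/']

    def parse(self, response):
        # Follow each product link on the listing page to its detail page
        for href in response.css('a.product-link::attr(href)').extract():
            yield response.follow(href, callback=self.parse_product)
        # Follow the "next page" link, if the listing is paginated
        next_page = response.css('a.next-page::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # The same extraction logic we build below would go here
        yield {'name': response.css('h5::text').extract_first()}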

Now we get to the good stuff and set up our parse() function, which will be called with the response for each page in the start_urls list.

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]

    # Our parse function is called with the response for each page in start_urls; we handle the response in the body of the function
    def parse(self, response): 

Next we will set up a couple of variables and dictionaries to hold the data and concatenate it into the shape I needed. This way we can extract lists of URLs and then combine them back into a single row of fully qualified domain names (FQDNs) on the CSV.

The response object holds the page content, from which we extract data via CSS selectors. Create a loop over the element or elements you want to extract from. In our case we will take the .content-body class, which contains one node, and within that node we will extract the product data.

Using a Chrome extension such as SelectorGadget will allow you to quickly and easily find the CSS selector for the data you want: https://selectorgadget.com/
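Once you have a candidate selector, you can test it interactively with Scrapy's shell before wiring it into the spider. The selectors below are the same ones used later in this article.

scrapy shell "https://www.nslusa.com/products/43/"

At the prompt that opens, try:

response.css('h5::text').extract_first()        # product family name
response.css('.h-100::attr(href)').extract()    # relative hrefs for the additional images
response.css('.btn-sm::attr(href)').extract()   # relative hrefs for the pdf and ies files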

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]

    # Our parse function is called with the response for each page in start_urls; we handle the response in the body of the function
    def parse(self, response): 
            for product in response.css('div.content-body'):  
                ##Setup the base url for the image and file paths
                base_url = "https://www.nslusa.com"             
                ##Grab a list of the image urls 
                images_list = product.css('.h-100::attr(href)').extract()
                ##Grab a list of the pdf and eps files
                pdf_link_list = product.css('.btn-sm::attr(href)').extract()

So let’s see what we have going on here. First off, we are defining our parse generator function as mentioned above. A loop using the .css() method on the response allows us to select one element per page, which translates into one row on our CSV. If we selected an element that matched more than one node, we would get multiple rows on our CSV. For this particular website each page is one product, so the information should be contained in one row. Keep this in mind when we get a list of images and files!

Our base_url variable will be concatenated onto the beginning of the relative file paths to create FQDNs.

The images_list variable stores the additional images (all except the main one) in a list by selecting the href attribute of the elements with class .h-100. This way we retrieve a list of relative file paths for the images. The .extract() method pulls out the selector's data as a list; in our case we are selecting the "href" content.

Similarly, in pdf_link_list we create a list of the product information PDFs.

Now we create a few dictionaries to hold the content of our link lists and, while we are at it, create FQDNs for the links that will go on our CSV.

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]

    # Our parse function is called with the response for each page in start_urls; we handle the response in the body of the function
    def parse(self, response): 
            for product in response.css('div.content-body'):  
                ##Setup the base url for the image and file paths
                base_url = "https://www.nslusa.com"             
                ##Grab a list of the image urls 
                images_list = product.css('.h-100::attr(href)').extract()
                ##Grab a list of the pdf and eps files
                pdf_link_list = product.css('.btn-sm::attr(href)').extract()
                ##Create a dictionary of the images using a base string and the index for the key
                imageDict = {
                    ##Here we are simply converting the index i to a string so we can concatenate it into a column header
                    ##We also strip the hash off the url and concatenate the base url to get fully qualified domain name (FQDN)
                    ##We use enumerate to create the index of the list in the for in loop
                   'additional_image_url_' + str(i): base_url + key.strip('#') for i, key in enumerate(images_list)
                }
                ##We do the same thing as above for the spec files.
                ##TO-DO: Separate the files by file type into separate columns
                pdfLinkDict = {
                    'spec_or_ies_file_url_' + str(i): base_url + key for i, key in enumerate(pdf_link_list)
                }


We create an imageDict dictionary whose keys concatenate our text string with the index number of each item in the list (our image paths).

We use a dictionary comprehension to loop through the list, and the built-in enumerate() function on images_list generates the index we use for the dictionary key.

For the value, we build a URL by prepending base_url to the images_list item, ending up with a full URL for the image or PDF file.

The str() function converts the index number into a string we can concatenate onto the name string for the key.

The relative path starts with a hash character, so we remove it with the .strip('#') method and concatenate base_url onto "key". Here key is the item in the list, a.k.a. our image path.
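As a standalone illustration of what the comprehension produces (the relative paths here are made up, not real values from the site):

base_url = "https://www.nslusa.com"
# Hypothetical relative paths, as they might come back from .extract()
images_list = ['#/media/image-a.jpg', '#/media/image-b.jpg']

imageDict = {
    'additional_image_url_' + str(i): base_url + key.strip('#') for i, key in enumerate(images_list)
}

# imageDict is now:
# {'additional_image_url_0': 'https://www.nslusa.com/media/image-a.jpg',
#  'additional_image_url_1': 'https://www.nslusa.com/media/image-b.jpg'}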

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]

    # Our parse function is called with the response for each page in start_urls; we handle the response in the body of the function
    def parse(self, response): 
            for product in response.css('div.content-body'):  
                ##Setup the base url for the image and file paths
                base_url = "https://www.nslusa.com"             
                ##Grab a list of the image urls 
                images_list = product.css('.h-100::attr(href)').extract()
                ##Grab a list of the pdf and eps files
                pdf_link_list = product.css('.btn-sm::attr(href)').extract()
                ##Create a dictionary of the images using a base string and the index for the key
                imageDict = {
                    ##Here we are simply converting the index i to a string so we can concatenate it into a column header
                    ##We also strip the hash off the url and concatenate the base url to get fully qualified domain name (FQDN)
                    ##We use enumerate to create the index of the list in the for in loop
                   'additional_image_url_' + str(i): base_url + key.strip('#') for i, key in enumerate(images_list)
                }
                ##We do the same thing as above for the spec files.
                ##TO-DO: Separate the files by file type into separate columns
                pdfLinkDict = {
                    'spec_or_ies_file_url_' + str(i): base_url + key for i, key in enumerate(pdf_link_list)
                }
                ##Now we create our main product dictionary by taking elements within the main body
                productsDict = {
                    ##Using the css selector we can grab the text 
                    'name': product.css('h5::text').extract_first(),
                    ##We can get pretty granular with our selectors. Check out the SelectorGadget extension for Chrome to quickly identify CSS selectors
                    'description': product.css('p:nth-child(1)').extract_first(),
                    ##This is our main product picture and we create a FQDN
                    'main_image_url': base_url + product.css('.flex-md-nowrap .mx-auto img::attr(src)').extract_first(),
                    ##Here we are just putting our image list into a column just for reference
                    'additional_images_list': images_list,
                    ##Same with the remaining file types
                    'all_pdf_ies_links_list': pdf_link_list,
                    ##I wanted the HTML of the description and the application so I can reuse it on my project, but you may just want text, so modify by using the ::text pseudo-element
                    'applications_html': product.css('#product-applications').extract_first(),
                    ##For the brochure link we follow the same as the main image
                    'product_brochure_pdf_url': base_url + product.css('.mt-md-0 a::attr(href)').extract_first()
                }

Now we create our products dictionary, productsDict, which will contain the columns for our spreadsheet. productsDict is the dictionary we return via the yield statement of this generator, and Scrapy will use its built-in exporters to output it as JSON, CSV, etc.

Let’s break down what we are adding to each column.

First off, we grab the name of the product family by selecting the text of the h5 tag. The .extract_first() method pulls this value in as a string rather than a list like the .extract() method.
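For instance, inside the loop (or in the Scrapy shell, substituting response for product) the difference looks like this, with a made-up heading value; newer Scrapy releases also expose .getall() and .get() as equivalent aliases.

product.css('h5::text').extract()                  # ['Some Product Family']  - always a list
product.css('h5::text').extract_first()            # 'Some Product Family'    - a string, or None if nothing matched
product.css('h5::text').extract_first(default='')  # '' instead of None when nothing matched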

The description potentially contains HTML data tables that I wanted to preserve, so in the description column we grab the first child paragraph, which is the description for each product in this page layout. Again, .extract_first() returns a string of the HTML and text.

For the additional_images_list and all_pdf_ies_links_list columns I simply wanted a reference to all the links in a single column, so passing in the lists gives us a comma-separated list of the relative file URLs.

The applications_html column holds some additional product application data as HTML, which may be useful as well.

There is a product PDF brochure on each page, so we want a full link to that file in a column. We do that by selecting the "href" attribute of the a element inside the .mt-md-0 class, found using SelectorGadget.

import scrapy

class ProductsSpider(scrapy.Spider):
    # We name our spider with the name variable, this is what we reference on the cmd line to run our crawl
    name = "products"
    # Here we can add a list of URLs to crawl. 
    # I chose to add them manually because I wanted to see each page and there aren't too many on this website.
    # This script was written on Jan 3 2019, so if you are reading this in the future
    # (or even spookier ... in the past) you may want to reference the Wayback Machine archive to see the page markup as it was.
    start_urls = [
        'https://www.nslusa.com/products/43/',
        'https://www.nslusa.com/products/41/',
    ]

    # Our parse function is called with the response for each page in start_urls; we handle the response in the body of the function
    def parse(self, response): 
            for product in response.css('div.content-body'):  
                ##Setup the base url for the image and file paths
                base_url = "https://www.nslusa.com"             
                ##Grab a list of the image urls 
                images_list = product.css('.h-100::attr(href)').extract()
                ##Grab a list of the pdf and eps files
                pdf_link_list = product.css('.btn-sm::attr(href)').extract()
                ##Create a dictionary of the images using a base string and the index for the key
                imageDict = {
                    ##Here we are simply converting the index i to a string so we can concatenate it into a column header
                    ##We also strip the hash off the url and concatenate the base url to get fully qualified domain name (FQDN)
                    ##We use enumerate to create the index of the list in the for in loop
                   'additional_image_url_' + str(i): base_url + key.strip('#') for i, key in enumerate(images_list)
                }
                ##We do the same thing as above for the spec files.
                ##TO-DO: Separate the files by file type into separate columns
                pdfLinkDict = {
                    'spec_or_ies_file_url_' + str(i): base_url + key for i, key in enumerate(pdf_link_list)
                }
                ##Now we create our main product dictionary by taking elements within the main body
                productsDict = {
                    ##Using the css selector we can grab the text 
                    'name': product.css('h5::text').extract_first(),
                    ##We can get pretty granular with our selectors. Check out the SelectorGadget extension for Chrome to quickly identify CSS selectors
                    'description': product.css('p:nth-child(1)').extract_first(),
                    ##This is our main product picture and we create a FQDN
                    'main_image_url': base_url + product.css('.flex-md-nowrap .mx-auto img::attr(src)').extract_first(),
                    ##Here we are just putting our image list into a column just for reference
                    'additional_images_list': images_list,
                    ##Same with the remaining file types
                    'all_pdf_ies_links_list': pdf_link_list,
                    ##I wanted the HTML of the description and the application so I can reuse it on my project, but you may just want text, so modify by using the ::text pseudo-element
                    'applications_html': product.css('#product-applications').extract_first(),
                    ##For the brochure link we follow the same as the main image
                    'product_brochure_pdf_url': base_url + product.css('.mt-md-0 a::attr(href)').extract_first()
                }

                ##Finally we use update to add the image and file columns with the FQDNs back to the product dictionary
                productsDict.update(pdfLinkDict)
                productsDict.update(imageDict)


                ##Yield completes our generator and returns the output with nice column names and FQDNs to use
                yield productsDict

Finally, we add our imageDict and pdfLinkDict to the productsDict to create a single row on our spreadsheet, with all the files and images uniquely numbered and linked for easy batch download. We do this simply by using the .update() method on productsDict.

Last, we use yield to emit the updated productsDict as a scraped item.
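To picture what the merge produces before it is yielded, here is a tiny standalone example with made-up values:

productsDict = {'name': 'Example Fixture', 'main_image_url': 'https://www.nslusa.com/media/main.jpg'}
imageDict = {'additional_image_url_0': 'https://www.nslusa.com/media/alt-0.jpg'}

productsDict.update(imageDict)
# productsDict now holds the keys from both dictionaries, and each key
# becomes its own column in the exported csv row:
# {'name': 'Example Fixture',
#  'main_image_url': 'https://www.nslusa.com/media/main.jpg',
#  'additional_image_url_0': 'https://www.nslusa.com/media/alt-0.jpg'}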

To run the script, head back over to Anaconda, open the environment's terminal, use the cd command to move into your project directory, and then run the command below.

scrapy crawl products -o products.csv

This tells Scrapy to crawl the URLs in the products spider's list and output the results to a CSV in the project directory. Any errors will show up in the stack traces returned in the terminal. It's also worth noting that if you are crawling lots of pages, which may affect the site's performance, you will want to notify the site admin and get permission first. You can also set request headers that identify who is performing the crawl, so that if issues arise the admin can pick out your traffic quickly.
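One way to keep the crawl polite and identifiable is to adjust a few values in the project's settings.py. This is only a sketch; the contact address is a placeholder and the delay values are just reasonable starting points.

# myproject/settings.py (excerpt)

# Identify your crawl so the site admin knows who to contact if issues arise
USER_AGENT = 'myproject product crawler (contact: you@example.com)'

# Respect the site's robots.txt rules
ROBOTSTXT_OBEY = True

# Wait a couple of seconds between requests so we don't hammer the server
DOWNLOAD_DELAY = 2

# Limit how many requests run against the site at once
CONCURRENT_REQUESTS_PER_DOMAIN = 1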

Running the above command without the output flag prints the results to the terminal window, which is very useful while developing your script.

This article is a work in progress so please feel free to reach out to me with any comments!