
How to Build a Custom Crawler Using Colab and advertools



In this video, we’re going to talk about how to create our own custom crawler using Google Colab, Advertools, and a few other Python libraries. Before you get scared or wigged out, don’t. Take a deep breath. It’s going to be okay. I don’t even claim to be a coder, but what I am pretty good at is copying and pasting. I know that you can use Ctrl+C just as well as I can. Let’s take a look at how we can create our own custom SEO crawler and put it to work to help us solve some unique problems.

 


Video Transcript: 

Let’s look at how we can build a custom crawler using Google Colab and the Advertools Python library.

What is an SEO Crawler (Spider)?

Crawlers are tools that crawl website pages much like a search engine would, and they help us gain valuable SEO information. They let us see the website the way Google, Bing, or the other search engines would see it. There are a number of tools out there that do this. The most popular is probably the Screaming Frog SEO Spider, a tool that we love and use all the time, but sometimes we need a custom solution.

Why Would You Create Your Own Crawler?

Most ‘off-the-shelf’ crawlers do amazing things, but sometimes you have a specific question that needs to be answered and you can create a custom crawler in order to control all the outputs. You only get the data that you want or need. This way you don’t have to be constrained by a tool setup, but you can run a quick crawl of a website or pull only one piece of information or pull a whole lot of information and organize it in a different way, using Colab and Python.

What is Advertools?

Advertools is a Python library that allows you to do a lot of things. You can crawl websites, generate keywords for your search engine marketing campaigns, create text ads, analyze SERPs, gain insights on social media posts, and do a whole lot more. It’s an extremely flexible library. It does a lot of cool things and is fairly simple to use.

I wouldn’t call myself a programmer. I would just say I’m decent at copying and pasting. Even though I’m not an in-depth Python programmer, I’ve been able to get a lot of benefits from using a tool like Advertools.

What We are Going To Do

This is what we’re going to do in this video.

  1. Create a new Colab file and install Advertools
  2. Create a custom crawler using advertools
  3. Crawl and analyze the website and the data
  4. Visualize some of those results using another Python library called Plotly
  5. Export the data

Step 1: Create a Colab File and Install Advertools

Google Colab is a tool that’s going to allow you to do a number of cool things. It allows you to run code within cells to build tables, build your own programs, do custom things, anything from machine learning to SEO, and more. If you’ve never used it before, it’s free to use, and it allows you to leverage Google’s computational power free of charge. It’s very cool so I highly recommend you go check this out.

If you’re not already using Colab, there are a lot of great resources here. To use a library that doesn’t ship with Python, you first need to install it. Most of the time you use a tool called pip to pull in the new library. It’s a fairly simple process.

One thing the people who build these libraries do is show you exactly how to set them up in their docs. Always read the docs, because they answer the question, “How do I import these tools and get them working for myself?”

In order to install Advertools, we’re going to use this line of code right here:

!pip install advertools

Once you’ve put this code into a cell in Colab, go ahead and hit the play button. It’ll execute this block of code. You should see something like this, where it’s installing the entire package so that we can use the library to build our crawler. Once you see the green checkmark, you know that it’s done.

Step 2: Create a Custom Crawler Using Advertools

Next, we’re going to want to execute a new line of code.

import advertools as adv

from advertools import crawl

import pandas as pd

 

You can go ahead and hit the code button here and it’ll create a new cell. We’re going to import some specific parts of the Advertools library. We’re importing advertools itself and its crawl function. We’re also importing something called pandas. For those of you who are not familiar with Python, pandas allows us to work with our data inside of data frames, basically tables within Python.

Once you’ve set all of this up, you go ahead and run your code again. This is going to import all of this information. If we’re building a crawl, you’ll notice over here, that it’s talking about how we can do this, how we can import these crawls. There are a few approaches; you can import Advertools like we did and run this command line, which will do what we’re doing.

I like to make Colab a little bit easier to use in case somebody on my team wants to leverage it as well. We’re going to do something a little bit different than what they show here. But if you follow this approach, you will get it right and it will work as well.

site = "https://simplifiedsearch.net/" #@param {type:"string"}

crawl(site, 'simp.jl', follow_links=True)

crawl_df = pd.read_json('simp.jl', lines=True)

crawl_df.head()

 

This is the code we’re going to use. The first thing we’re doing is defining a variable, and the variable is the website that we want to crawl. The #@param comment with type string gives me a form box over here, where I can type in the website I want to crawl. I can put my website here, or any website, and it’ll set that variable for me. This way I don’t have to edit the code. I can just type into a form, and somebody who’s not as comfortable clicking inside of the code cell can go over here and type a site in.

In this case, we’re going to use our Simplified Search site, just because we use it all the time. We’ll go ahead and paste it over here. Right below that, we’re following the same pattern the docs show over here. We’re calling the crawl function from Advertools with site as the URL to crawl. We give it an output file, simp.jl. Then we tell it to follow the links within the website.

Next, we set up the crawl data frame by telling pandas to open our output file. The crawl writes its output as JSON lines, one JSON record per line, so pandas reads that file and creates a data frame for us. At the end, I’m just showing the head of this data frame to make sure that everything’s working as intended. Once we run this cell, it’s going to crawl the website and do a data dump below, and we’ll be able to see all the different columns within this crawl.

I’m going to go ahead and run this cell. It may take a few minutes just because it’s running a crawl of the entire website. Once we’re done, we’ll talk about how we can leverage the crawl data to pull out specific pieces of information.
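To make the read_json step less of a black box, here is a minimal sketch of what a JSON lines file looks like and how pandas turns it into a data frame. The rows, the file name sample.jl, and the variable names are made up for illustration; a real crawl file from Advertools has many more columns.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for what a crawl might write to disk.
sample_rows = [
    {"url": "https://example.com/", "title": "Home", "status": 200},
    {"url": "https://example.com/about", "title": "About", "status": 200},
]

# Write one JSON object per line, which is the layout of a .jl file.
jl_path = os.path.join(tempfile.mkdtemp(), "sample.jl")
with open(jl_path, "w", encoding="utf-8") as fh:
    for row in sample_rows:
        fh.write(json.dumps(row) + "\n")

# lines=True tells pandas to parse each line as its own JSON record.
sample_df = pd.read_json(jl_path, lines=True)
print(sample_df.shape)
print(sample_df.head())
```

Each line in the file becomes one row, and each key becomes a column, which is why a full crawl can end up with hundreds of columns.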

Step 3: Crawl and Analyze the Website and Data

Now the site has been crawled and you can see I have a list of URLs, titles, meta descriptions, viewport, charset, H1s, H2s, and H3s. All of this information is being pulled into this data frame. If you want to see it a little bit cleaner, you can hit this magic button right here and Google will convert the data into a table that’s a little easier to work with. I have a total of 266 columns right here. That’s a lot of columns that I can work with.

You might be asking yourself what is in all of these columns. We can go back over here to the advertools and you can see all the different elements. There’s quite a bit of export data that we can look at and pull lots of cool information.

If we want to see a list of all the different columns that we have available, we need to take the columns and create a list out of them. We call list() with crawl_df, which is the name of our data frame, inside the parentheses, and assign the result to a new variable called columns, so the line reads columns = list(crawl_df). Then we put columns on its own line, run that cell, and you can see all of the different possible columns. As you can see, the crawl collects a whole lot of information.
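The code for this step did not survive on the page, so here is a minimal sketch reconstructed from the description above. The small demo_df data frame is a hypothetical stand-in so the snippet runs on its own; in the notebook you would pass crawl_df instead. The filter for JSON-LD columns is an extra illustration, not something shown in the video.

```python
import pandas as pd

# Hypothetical stand-in for crawl_df; a real crawl has far more columns.
demo_df = pd.DataFrame(
    {
        "url": ["https://example.com/", "https://example.com/about"],
        "title": ["Home", "About"],
        "h1": ["Welcome", "About us"],
        "jsonld_@type": ["WebSite", None],
    }
)

# Calling list() on a data frame returns its column names.
columns = list(demo_df)
print(len(columns))

# Narrow the list down to the JSON-LD columns only.
jsonld_columns = [name for name in columns if name.startswith("jsonld_")]
print(jsonld_columns)
```

With a couple of hundred columns, filtering by prefix like this can be quicker than scrolling through the full list.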

What if you want to see just a piece of that information? What if you just wanted to get all the titles or all the meta descriptions or some of the H tag information, or maybe you wanted to see all the pages and the type of schema.org markup that you might have on them. This is where having something like Advertools comes in handy.

Let’s say we wanted to look at the JSON-LD types across our pages.

json_df = crawl_df[['url', 'jsonld_@type']]

json_df

 

We can start with some new code. Let’s go ahead and create a new data frame called json_df. We want to get some information from our original data frame, crawl_df. Let me just go down here a little bit to make it easier on everybody. After crawl_df, we’re going to use a bracket and another bracket, because we’re passing in a list of column names.

The first thing we want to pull is the URL. We know that URL is important because we need to know all the pages within our site. The next thing we want is the JSON-LD type. We can go back to the list of columns, find jsonld_@type, copy it, and add it here. I’m going to keep this consistent, that way we follow best practices. What did we do in this little line here? We said, “create a new data frame,” use the data from our original data frame, and pull back only the URLs and the JSON-LD types.

If I run this, it’s going to create a new data frame with just that information. In order to see this data, I can put json_df on its own line or in a new cell and run it. It gives me a list of all of my pages and the type of markup that’s associated with those specific pages. This can be very helpful if you want to look quickly and find all the JSON-LD on your website, what types you have, and what markup you have.

Furthermore, do you have some pages that are missing markup? You can quickly identify those. We have this new data where we have all of our URLs and we have all of our JSON-LD types that we know exist on that page.
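One way to pull out the pages that are missing markup is to filter for rows where the type column is empty. This is a sketch under the assumption that pages without JSON-LD show up as missing values (NaN) in the jsonld_@type column; demo_json_df and its URLs are hypothetical stand-ins for json_df.

```python
import pandas as pd

# Hypothetical stand-in for json_df with one page lacking markup.
demo_json_df = pd.DataFrame(
    {
        "url": [
            "https://example.com/",
            "https://example.com/about",
            "https://example.com/blog/post",
        ],
        "jsonld_@type": ["WebSite", None, "Article"],
    }
)

# isna() flags the rows where no JSON-LD type was found.
missing_markup = demo_json_df[demo_json_df["jsonld_@type"].isna()]
print(missing_markup["url"].tolist())
```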

Step 4: Visualize the Results

Let’s say we want to create a quick report or graph to show a client, or somebody else, the amount and the different types of markup that we’ve added to their site.

The first thing I need to do is count all the different types of markup that have been added, and then I can visualize it. Let’s start by counting this and creating a new data frame. I’ve already created this code and I’m going to walk you through it:

json_counts = json_df['jsonld_@type'].value_counts()

json_counts = json_counts.reset_index()

json_counts

 

It’s called JSON counts. This is a new data frame. We are taking the data from the JSON-LD column right here. We’re having it count the unique values that are in this column. When I run this code and then I tell it to output it, you’re going to see that we have all of that information counted.

It’s giving me this message because it’s finding some zeros or some NAs in the list. That’s okay, because you’ll see in just a second that we still got the information. Here are all the different markup types, all laid out for us.

You’ll also notice that it doesn’t quite look like a data frame yet. We have to reset the index to turn this result into a proper data frame. That’s what the reset_index() line in the cell above does:

json_counts = json_counts.reset_index()

 

When we run this, you’ll see we have a data frame. We have the index, which is the term. Then we have the JSON-LD type and the count of that. We still don’t have a graph. We still just have another data frame. What do we need to do in order to turn this data frame into a visualization, or a graph? We’re going to use something called Plotly.
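A caveat worth knowing: the column names that value_counts() and reset_index() produce differ between pandas versions. As far as I know, pandas 2.0 and later name the result columns after the original column plus 'count', rather than 'index' plus the original column. The sketch below names both columns explicitly so it does not depend on the version; schema_type, count, and the sample values are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical JSON-LD types, including one page with no markup.
demo_types = pd.Series(
    ["Article", "WebSite", "Article", None, "Article"],
    name="jsonld_@type",
)

type_counts = (
    demo_types.value_counts()      # missing values are skipped by default
    .rename_axis("schema_type")    # name the index that holds the types
    .reset_index(name="count")     # turn it into a two-column data frame
)
print(type_counts)
```

Naming the columns up front means the later charting step can refer to schema_type and count without checking which pandas version Colab is running.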

Plotly is another library, like Advertools, that we can use, in this case to create visualizations. Specifically, we’re going to use Plotly Express. The first thing we need to do is install Plotly, so we go ahead and run !pip install plotly. I’m going to run this cell. It’s already been installed in this notebook, but that’s okay. As you can see, it tells us the requirement is already satisfied. We’re good to go.

Take that code we just copied from here and paste it back into our crawler. We don’t need this middle line because it’s data that we’re not using. We’re using our own data. We do need to import Plotly Express as px, and we need to connect our new data frame here in order to get the right information into our chart.

!pip install plotly

 

import plotly.express as px

fig = px.bar(json_counts, x='index', y='jsonld_@type')

fig.show()

Our data frame was called JSON counts. On our X we're going to use index and on the Y we're going to use JSON type. Why did I choose those? The index is where the words are. We want to have those on the X, and then the count is on JSON-LD @type, and that's going to be our Y, that's going to tell us how many are in each of those columns. We'll go ahead and put that here. Pretty simple. And then fig.show will show that graph. So now, we have all of our different types down here, and here, we have the different amounts of each type in a nice graph.

If you want to share this, you can download it as a PNG, and Plotly will download it to your computer. You can take this and say, "We've put this much markup on this many pages." A pretty cool way to quickly visualize it.

Step 5: Export the Data

However, what if we want to download all of this data and work with it, maybe in Google Sheets or something else? Well, you can do that too. We just need a little more code and we should be good to go. Let's say we're going to download this table here with all of our website pages and the JSON-LD types. We can go to this cell, or any one that you want, and create a new line of code.

We need to import something from Google Colab called files, using from google.colab import files. That's the first thing that we're going to do. Next, we're going to take this data frame, which is json_df. We're going to add it below with .to_csv, and then we're going to give the file a name. We can call it json_df.csv, so the line reads json_df.to_csv('json_df.csv'). Once you've run this code, you've created your CSV file. If I look over here in my folder, you're going to see the file right here.

From here, I could just go ahead and download it, or I could add a line of code that downloads it even quicker. I can say files.download, pass it the name of the file I just created, and ask Colab to download it for me directly. When I run this cell, it downloads the file, and here I have it. I can click open, and now I have this CSV file that I can work with in any spreadsheet tool I have. I can also see the pages that are possibly missing some markup.
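Put together, the export step might look like the sketch below. The google.colab module only exists inside Colab, so the import is wrapped in a try block here to let the snippet run elsewhere too; that guard, the demo_json_df stand-in, and the index=False option are additions for illustration rather than part of the video.

```python
import os

import pandas as pd

# Hypothetical stand-in for json_df.
demo_json_df = pd.DataFrame(
    {
        "url": ["https://example.com/", "https://example.com/about"],
        "jsonld_@type": ["WebSite", None],
    }
)

csv_name = "json_df.csv"
# index=False leaves the row numbers out of the spreadsheet.
demo_json_df.to_csv(csv_name, index=False)

try:
    from google.colab import files  # only available inside Colab
except ImportError:
    files = None

if files is not None:
    files.download(csv_name)  # triggers a browser download in Colab
else:
    print(f"Saved {os.path.abspath(csv_name)}")
```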

There you have it. We've gone ahead and we've created our own custom crawler. We've pulled some custom data. We've visualized that data and we've downloaded that data for use in other programs. We did all this, and I am not a computer programmer, I don't even try to pretend to be one. Like I said before, I'm just good at copying and pasting. You guys can figure these things out too.

When you have questions, there are always cool solutions. If you're willing to try something new and different, I highly recommend you play around in Colab. There are lots of great resources out there. There are lots of people who are way smarter than me doing much more amazing things that I've learned a ton from, and have helped me in my marketing game, research, crawling, and so much more.

If you have any questions about what we did today, please comment below. I'm also going to give access to this specific CoLab file and I'll even share step by step the code that we used along the way. Thanks so much for watching. Don't forget to subscribe and until next time, happy marketing.


 




