External Domain and Link Extractor

What is External Domain Extractor?

External Domain Extractor is a desktop application that lets users crawl websites and extract the external (outbound) domains and links from each page of the target website(s).

How does External Domain Extractor work?

External Domain Extractor has two methods of finding external domains and links.

Crawl a list of websites for external domains (outbound domains)

If you have a list of websites that you would like to extract external links or domains from, simply enter them into this search type and External Domain Extractor will crawl each page of each website in the list.

Crawl search results by niche related keywords for external domains

Enter a list of niche-related keywords and External Domain Extractor will search Google for each of them, crawling every domain returned and extracting all outbound links and domains.

Why should I use this software?

Most people use Google to find websites and people to market to. This might be fine for larger companies, but many small businesses and individuals just don't appear in Google. For example, if you want to find webmasters in the internet marketing niche, you'd have much more luck crawling Digital Point or Black Hat World for outbound domains than searching Google for them by keyword. Plus you wouldn't need hundreds of proxies.

How does this tool work?

Once the crawler has a domain (or domains) to crawl, it grabs all the links on each page. Outbound (external) domains or links are added to the results section of the software, while internal links are followed and crawled in turn. This process repeats until the entire website has been crawled.
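The crawl loop described above can be sketched in Python. This is an illustration only, not the tool's actual implementation; `fetch_links` is a hypothetical helper that would fetch a page and return the `href` values of its anchors.

```python
from urllib.parse import urljoin, urlparse
from collections import deque

def crawl_external_domains(start_url, fetch_links):
    """Breadth-first crawl: follow internal links, collect external domains.

    `fetch_links(url)` is a stand-in for fetching a page and parsing
    out its anchor hrefs.
    """
    site = urlparse(start_url).netloc
    seen_pages = {start_url}     # pages already queued or crawled
    external = set()             # outbound domains found so far
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for href in fetch_links(page):
            link = urljoin(page, href)           # resolve relative links
            domain = urlparse(link).netloc
            if not domain or domain == site:
                if link not in seen_pages:       # internal: follow it
                    seen_pages.add(link)
                    queue.append(link)
            else:
                external.add(domain)             # external: record the domain
    return external
```

The crawl terminates once every reachable internal page has been visited, which matches the "repeated until the entire website has been crawled" behaviour above.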

What does External Domain Extractor look like?

[Screenshot: the External Domain Extractor interface]

Watch External Domain Extractor in action / Tutorial

Change log

  • 31/07/2017 - Initial release
  • 07/01/2018 - Added anchor text and fixed saving some settings bug

Video transcript

This is the transcript from the video above, lightly tidied up from the YouTube subtitles.

Hi guys, it's Jamie from SupaGrowth.com, and in this video I'm going to show you how to use my External Domain Extractor software tool. Basically, what this does is crawl your specified websites and extract all the external domains, that is, domains the website has linked out to. It can also scrape all the external links as well, not just the domains. There are two main options. You can crawl from a website list, as you can see here. Personally, I use this for scraping websites from internet marketing forums, because that's my niche. It will crawl every web page on every website in this list, extracting all the external domains or links, whichever you've got it set to.

The second option is to crawl from the results of a search query. In this example, you could put "dog toys" in, and it'll go off to Google, search for dog toys, and then crawl every website Google returns for external domains. That saves you having to gather a list of websites yourself: you just put your niche in, let it search Google for that query, and use the domains it returns instead. We'll go through some of the more advanced options in a second. First, we'll go through the first crawling method here and some of its options. These stats are pretty self-explanatory: if I were to click start now, it would say five websites to crawl, and as it progresses you'll see the pages it's crawled. Once it's finished a website it'll populate there, and for every external domain found it'll add one there and show it in the grid.

Here, if websites are blocking you, you'll see how many websites are blocking the crawl; this is pages crawled per minute, and then how long the crawl has been running. Now let's look at some of the general settings. Again, these are pretty self-explanatory. The delay between each request to a website: if you're crawling a website on shared hosting or something like that, you might not want to crawl it too fast; you might want to give it a bit of a delay so you don't knock it offline or get your IP address blocked. This is how many websites you want to crawl at once, so if we were to click start it would do all five, because we've got that set high enough. And this is how many connections you want per website, so we'd have two connections open to Warrior Forum, Wicked Fire, and so on. Keep in mind that with a full list of domains, your total number of connections is basically those two settings multiplied together, e.g. 10 websites at 2 connections each is 20 connections. As we said, it will detect when a website is blocking you.

That's not always immediately obvious. Sometimes you're crawling a site and you'll get loads of normal 200 HTTP response codes back, which is expected, and then a few 404 not-found errors, or a few timeouts if you don't have a decent connection. So what we do is say that 30 consecutive errors means you've probably been blocked. You can lower or raise that, but 30 is generally about right. Next is domains only, or links and domains. I'll show you the difference in a minute, but basically, if you only care about getting a list of the domains on a website, leave it at that. If you also want the page the external domain was found on and the full link it's linking out with, tick the links option.
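The block-detection heuristic described above (a streak of consecutive errors, reset by any success) might look like this in Python. This is a minimal sketch of the idea, not the tool's code; the default threshold of 30 mirrors the one mentioned in the transcript.

```python
class BlockDetector:
    """Flags a site as blocking the crawl after `threshold` consecutive
    failed requests; any successful response resets the streak."""

    def __init__(self, threshold=30):
        self.threshold = threshold
        self.consecutive_errors = 0

    def record(self, status_code):
        """Record one response. 2xx/3xx resets the streak; 4xx/5xx or a
        timeout (represented here as 0) counts toward it. Returns True
        once the site is considered to be blocking the crawl."""
        if status_code and 200 <= status_code < 400:
            self.consecutive_errors = 0
        else:
            self.consecutive_errors += 1
        return self.consecutive_errors >= self.threshold

    @property
    def blocked(self):
        return self.consecutive_errors >= self.threshold
```

Because successes reset the counter, the occasional 404 or timeout on an otherwise healthy crawl never trips the detector; only an unbroken run of failures does.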

Now, if you want to limit the number of pages crawled per website, say because a website has a strange structure and you think the crawl might get stuck in an infinite loop, you can set this to however many pages you think the website has. There's also an option to write results to a text file as you go, so if you have a power cut after leaving the crawler running for 48 hours, you don't lose all your results; they're at least safe in that text file. Okay, I'm quickly going to show the website list crawl in action, like that. There we go, you can see it's racing through these websites extracting all the external domains.

The pages-per-minute stat takes a minute to populate, since it's just how many pages were crawled in the last minute, so you'll see it update once that reaches 60 seconds. It will slow down as you build up more and more external domains, because every domain found has to be checked against this list, and that list gets pretty big after a while. Okay, we'll stop that for now and I'll show you the other option. As you can see, since we've got domains and links ticked, you get the normal domain, but you also get the URL it was found on and the full URL it's linking out to. So let's move on to the crawl from search query feature, covering the settings first. If you only want to crawl, say, the first 10 results from the search query, you can tick that and enter the amount there; you might only want two hundred, or whatever you want to put in.

If you untick that, it will keep searching Google and go through all the results right to the end. You might not want to crawl big, well-known domains, so if a domain is in the Majestic Million you can skip it. And there might be some domains like YouTube that you don't want to bother with; you can add those in there as line-separated values. The final option: sometimes when you search for a particular niche on Google, because of Google's shift towards authority websites, you'll get things like Amazon when you search for dog toys, so instead of dog toy websites you'll get dog toys listed on Amazon. A way to avoid that is to put your niche or keyword in here; then, for every domain that comes back from Google for, in this example, dog toys, it will go and check its homepage and the website's metadata.
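The line-separated skip list mentioned above could be handled along these lines; a small sketch under the assumption that the list is plain text, one domain per line, compared case-insensitively (the tool's exact matching rules aren't stated).

```python
def load_skip_list(text):
    """Parse line-separated domains into a set, ignoring blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def should_crawl(domain, skip_domains):
    """True if the domain is not on the user's skip list."""
    return domain.lower() not in skip_domains

# e.g. the contents of the skip-list text box:
skip = load_skip_list("youtube.com\namazon.com\n\nwikipedia.org")
```

Using a set here keeps the per-domain check cheap even with a long skip list.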

That's the description and title, to see whether they contain this keyword, and you can put multiple line-separated keywords in. Now we'll quickly run through some of these stats. Here is queries left to search: you don't have to limit yourself to one query; you can have multiple line-separated ones here. Here is how many IP blocks you've had. You shouldn't get blocked by Google, because the tool grabs a page of results, crawls the domains on it, and only then does the next search for the next ten results, 20, 30, and so on, which means there's quite a long pause between searches. But if you do get blocked, it'll tell you there. If you've been hammering Google with some other search tool, you might see that go up, and then you'll need to either run it on a different server, or, if you've got a dynamic IP address, reset your router so you're not blocked any more, or just wait; if you can't get a new IP address, Google normally unblocks it eventually.
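The metadata check described above (does the homepage's title or meta description contain a niche keyword?) can be sketched with Python's standard-library HTML parser. This is an assumed implementation for illustration; the tool may match metadata differently.

```python
from html.parser import HTMLParser

class MetaTitleParser(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def matches_niche(html, keywords):
    """True if any keyword appears in the page's title or meta description."""
    parser = MetaTitleParser()
    parser.feed(html)
    haystack = (parser.title + " " + parser.description).lower()
    return any(kw.lower() in haystack for kw in keywords)
```

A domain whose homepage fails this check would be dropped from the crawl, which is how an Amazon listing page gets filtered out of a "dog toys" search.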

And there we go: you can see planetdog.com came up in the Google search results for dog toys, and as you can see we've begun extracting domains out of those search result domains. Because I had the link and domain options set, it's showing me the URL each link was found on and all the outbound links. Okay, that's it, thanks guys.