Continuous Run ...

Started by drgeorgep, January 03, 2017, 04:07:43 PM

drgeorgep

Hi ... some time ago, I reported that A1 Sitemap Generator ran continuously, reporting as many as 500,000 URLs for a site with fewer than 2100 URLs. At that time, I was advised to have a tech check my database. After many unrelated delays, I managed to find a competent database tech. Here's what he reported: "I ran A1 for over 6 hours today. I ran into the same situation as you, where it detected too many links. It also incorrectly formed the links -- some of them went on and on with "/article/1012/article/1030/article..." From my experience that would lead me to believe that their tool is crawling your website, following links, and continuing to find links back to articles it's already visited, causing the long, incorrect link chains." I should add that this problem originated in version 7.x and the tech tested it with 8.x. Can someone please help? Thanks so much. drgeorgep

Webhelpforums

From your description I can make a qualified guess at the problem:

1)
Your website does not return a 404 error response for URLs that do not exist

2)
Your website uses relative links


So suppose A1 finds a link to example.com/doesnotexist/

but your website returns a normal "200 OK" response, and the returned page links to "newfolder/"

A1 will then test example.com/doesnotexist/newfolder/

but your website again returns a normal "200 OK" response, and the returned page links to "newfolder/"

A1 will then test example.com/doesnotexist/newfolder/newfolder/

etc.
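
The chain above can be reproduced offline. This is only a minimal sketch in Python using the standard `urllib.parse.urljoin`; the helper name `simulate_crawl` is made up for illustration, and it assumes that every URL answers 200 and serves a page containing the same relative link "newfolder/". It is not how A1 itself is implemented.

```python
from urllib.parse import urljoin


def simulate_crawl(start_url, relative_link, max_steps):
    """Resolve the same relative link against each newly found URL,
    as a crawler would if every URL returned 200 with that link in it."""
    urls = [start_url]
    for _ in range(max_steps):
        # Standard relative URL resolution: the link is appended to the
        # directory of the page it was found on.
        urls.append(urljoin(urls[-1], relative_link))
    return urls


chain = simulate_crawl("http://example.com/doesnotexist/", "newfolder/", 3)
for url in chain:
    print(url)
```

Each step produces a URL one folder deeper than the last, so without a 404 somewhere in the chain the crawl never runs out of "new" URLs.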

Please bear in mind that the above is a guess based on my experience with people reporting similar problems - the specifics of your case may be different. Please email your website address + a few example URLs.



TechSEO360 | MicrosysTools.com  | A1 Sitemap Generator, A1 Website Analyzer etc.

drgeorgep

Hi ... website is grubstreet.ca. Here are five URLs: http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc; http://grubstreet.ca/articles/index/1174/as-i-hear-it-summers-here; http://grubstreet.ca/articles/index/1896/as-i-hear-it-hot-cars-in-summer; http://grubstreet.ca/articles/index/1080/as-i-hear-it-the-dreaded-event; http://grubstreet.ca/articles/index/1258/as-i-hear-it-radio-reductions. I have asked my host to confirm that 404 error messages are generated for URLs that do not exist. I am sure such messages are generated, but checked to be sure. Will report as soon as I hear from host support. Thanks. dgp

drgeorgep

Hi ... to update my previous response: I confirmed with the grubstreet.ca host that 404 error responses are indeed sent when someone attempts to load a non-existent URL. Thanks so much. dgp

Webhelpforums

Hi,

Okay - I just tested
http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc

and created
http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc/abdjgysdjgfsjg

The second URL should return 404 - but instead it returns response code 200. I just verified this with the Live HTTP Headers plugin for Firefox:


http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc/abdjgysdjgfsjg

GET /articles/index/2327/as-i-hear-it-rainy-day-in-nyc/abdjgysdjgfsjg HTTP/1.1
Host: grubstreet.ca
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: da,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
Cookie: PHPSESSID=dkr25hfdsra0gnbkjntegkpfv7; _ga=GA1.2.543169924.1483635724; _gat=1
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Thu, 05 Jan 2017 17:03:44 GMT
Server: Apache/2.4.23 (cPanel) OpenSSL/1.0.1e-fips mod_bwlimited/1.4
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8


You can test as well using e.g. https://httpstatus.io or http://web-sniffer.net with the above mentioned URL
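
The same check can be scripted. The following is a rough sketch in Python using only the standard library; the function names `fetch_status` and `looks_like_soft_404` are hypothetical, and the approach (append a random path segment and see whether the server still answers 200) simply mirrors the manual test above. Note that `urllib.request.urlopen` follows redirects, so the status reported is that of the final response.

```python
import urllib.error
import urllib.request
import uuid


def fetch_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx; the code is what we want.
        return error.code


def looks_like_soft_404(page_url, fetch=fetch_status):
    """Append a random, made-up path segment to page_url and report
    whether the server still answers 200 for that address."""
    bogus_url = page_url.rstrip("/") + "/" + uuid.uuid4().hex
    return fetch(bogus_url) == 200


# Offline demonstration with a stand-in fetch function that behaves like
# a server answering 200 to everything (an assumption for illustration).
print(looks_like_soft_404("http://example.com/page", fetch=lambda url: 200))
```

To run it against a live site, call `looks_like_soft_404` with a real page URL and leave out the `fetch` argument. A result of `True` means the made-up address was served as a normal page instead of a 404.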

So my initial diagnosis, based on your description of the symptoms so far, seems correct :)

You can either

1)
Fix the problem

2)
Exclude unwanted URL patterns in A1 Sitemap Generator output filters and analysis filters
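
To get a feel for which URLs option 2 would need to exclude, the runaway URLs can be recognized by their repeating path segments. The snippet below is only an illustrative sketch in Python, not A1 filter syntax; the function name `has_repeated_segment` and the example.com host are assumptions, and a site that legitimately repeats a folder name in its paths would need a higher `max_repeats`.

```python
from urllib.parse import urlparse


def has_repeated_segment(url, max_repeats=1):
    """Return True if any path segment occurs more than max_repeats times."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))


# A chain like the one the tech described: "article" occurs three times.
print(has_repeated_segment("http://example.com/article/1012/article/1030/article"))

# One of the real article URLs from this thread: every segment is unique.
print(has_repeated_segment(
    "http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc"))
```

The first call prints `True` and the second prints `False`, so a rule of this kind would separate the runaway chains from the genuine article URLs listed earlier.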

I consider solution #1 best since it will work with all crawlers and search engines, but otherwise see:


If you are ever in doubt about how A1 found a specific URL, you can check the "linked-by", "redirected-by" and "sourced-by" tabs. You can then follow the trail to see where a problem originates:
http://www.microsystools.com/products/sitemap-generator/help/sitemaps-generator-analyze-links/