Webmaster Forums - Website and SEO Help

Microsys Products and Webmaster Tools => A1 Sitemap Generator => Topic started by: drgeorgep on January 03, 2017, 04:07:43 PM

Title: Continuous Run ...
Post by: drgeorgep on January 03, 2017, 04:07:43 PM
Hi ... some time ago, I reported that A1 Sitemap Generator ran continuously, reporting as many as 500,000 URLs for a site with fewer than 2100 URLs. At that time, I was advised to have a tech check my database. After many unrelated delays, I managed to find a competent database tech. Here's what he reported: "I ran A1 for over 6 hours today. I ran into the same situation as you, where it detected too many links. It also incorrectly formed the links -- some of them went on and on with "/article/1012/article/1030/article..." From my experience that would lead me to believe that their tool is crawling your website, following links, and continuing to find links back to articles it's already visited, causing the long, incorrect link chains." I should add that this problem originated in version 7.x and the tech tested it with 8.x. Can someone please help? Thanks so much. drgeorgep
Title: Re: Continuous Run ...
Post by: Webhelpforums on January 03, 2017, 04:51:46 PM
From your description I can make a qualified guess at the problem:

Your website does not return a 404 error response for URLs that do not exist

Your website uses relative links

So suppose A1 finds a link to example.com/doesnotexist/

but your website returns a normal "200 OK" response, and the page links to "newfolder/"

A1 will then test example.com/doesnotexist/newfolder/

but your website again returns a normal "200 OK" response, and the page again links to "newfolder/"

A1 will then test example.com/doesnotexist/newfolder/newfolder/
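
The loop described above can be reproduced with a few lines of Python. This is only a sketch of how standard relative-URL resolution behaves, not of how A1 works internally; the function name simulate_crawl and the max_steps limit are made up for illustration, and it assumes every generated URL answers 200 and links to the same relative path.

```python
from urllib.parse import urljoin


def simulate_crawl(start_url, relative_link, max_steps=3):
    """Follow the same relative link repeatedly.

    Assumes (hypothetically) that every URL returns 200 and that each
    returned page contains the same relative link.
    """
    visited = [start_url]
    current = start_url
    for _ in range(max_steps):
        # A relative link is resolved against the URL of the page it is on,
        # so each step appends one more path segment.
        current = urljoin(current, relative_link)
        visited.append(current)
    return visited


for url in simulate_crawl("http://example.com/doesnotexist/", "newfolder/"):
    print(url)
# http://example.com/doesnotexist/
# http://example.com/doesnotexist/newfolder/
# http://example.com/doesnotexist/newfolder/newfolder/
# http://example.com/doesnotexist/newfolder/newfolder/newfolder/
```

With a real 404 on the first non-existent URL, the chain would stop at the first step, because a crawler has no page content to extract further links from.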


Please bear in mind that the above is a guess based on my experience with people reporting similar problems - the specifics of your case may be different. Please email your website address and a few example URLs.

Title: Re: Continuous Run ...
Post by: drgeorgep on January 03, 2017, 05:36:30 PM
Hi ... website is grubstreet.ca. Here are five URLs: http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc; http://grubstreet.ca/articles/index/1174/as-i-hear-it-summers-here; http://grubstreet.ca/articles/index/1896/as-i-hear-it-hot-cars-in-summer; http://grubstreet.ca/articles/index/1080/as-i-hear-it-the-dreaded-event; http://grubstreet.ca/articles/index/1258/as-i-hear-it-radio-reductions. I have asked my host to confirm that 404 error messages are generated for URLs that do not exist. I am sure such messages are generated, but I am checking to be sure. Will report as soon as I hear from host support. Thanks. dgp
Title: Re: Continuous Run ...
Post by: drgeorgep on January 03, 2017, 06:43:15 PM
Hi ... to update my previous response: I confirmed with the grubstreet.ca host that 404 error responses are indeed sent when someone attempts to load a non-existent URL. Thanks so much. dgp
Title: Re: Continuous Run ...
Post by: Webhelpforums on January 05, 2017, 12:02:52 PM

Okay - I just tested
http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc

and created
http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc/abdjgysdjgfsjg

The second URL should return 404, but instead it returns response code 200. I just verified this with the Firefox Live HTTP Headers plugin:

Code:

GET /articles/index/2327/as-i-hear-it-rainy-day-in-nyc/abdjgysdjgfsjg HTTP/1.1
Host: grubstreet.ca
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: da,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
Cookie: PHPSESSID=dkr25hfdsra0gnbkjntegkpfv7; _ga=GA1.2.543169924.1483635724; _gat=1
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Thu, 05 Jan 2017 17:03:44 GMT
Server: Apache/2.4.23 (cPanel) OpenSSL/1.0.1e-fips mod_bwlimited/1.4
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

You can also test this yourself with the above-mentioned URL using e.g. https://httpstatus.io or http://web-sniffer.net
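
The same check can also be scripted. The following is a minimal sketch using only the Python standard library; the function names fetch_status and looks_like_soft_404 are made up for illustration, and the probe simply appends a random segment to a known page URL, as was done by hand above. Note that urlopen follows redirects, so a redirect to a normal page is also reported as 200.

```python
import urllib.error
import urllib.request
import uuid


def fetch_status(url, timeout=10.0):
    """Return the final HTTP status code for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; the code is on the error.
        return error.code


def looks_like_soft_404(page_url, get_status=fetch_status):
    """Probe a URL that should not exist and report whether it answers 200."""
    probe_url = page_url.rstrip("/") + "/" + uuid.uuid4().hex
    return get_status(probe_url) == 200
```

Calling looks_like_soft_404 with one of the article URLs would request a random sub-path of that article; a result of True means the server answered 200 for a URL that should not exist.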

So my initial diagnosis, based on your description of the symptoms so far, seems correct :)

You can either

1. Fix the problem

2. Exclude unwanted URL patterns in A1 Sitemap Generator output filters and analysis filters

I consider solution #1 best since it will work with all crawlers and search engines, but otherwise see:

If you are ever in doubt about how A1 found a specific URL, you can check the "linked-by", "redirected-by" and "sourced-by" tabs. You can then follow the trail to see where a problem originates:
http://www.microsystools.com/products/sitemap-generator/help/sitemaps-generator-analyze-links/
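
If you go the exclusion route, it can help to first see which path segments repeat in the runaway URLs, so you know what pattern to exclude. The sketch below is plain Python run against a list of URLs and is not A1 filter syntax; the function name repeated_segments is made up, and ignoring purely numeric segments is an assumption based on the "/article/1012/article/1030/article..." chains reported earlier in this thread.

```python
from collections import Counter
from urllib.parse import urlsplit


def repeated_segments(url):
    """Return the non-numeric path segments that occur more than once in url."""
    path = urlsplit(url).path
    # Skip empty segments (from leading or trailing slashes) and numeric IDs.
    segments = [part for part in path.split("/") if part and not part.isdigit()]
    counts = Counter(segments)
    return sorted(name for name, count in counts.items() if count > 1)


print(repeated_segments("http://example.com/article/1012/article/1030/article"))
# ['article']
print(repeated_segments("http://grubstreet.ca/articles/index/2327/as-i-hear-it-rainy-day-in-nyc"))
# []
```

A URL with a non-empty result is a likely product of the relative-link loop; a normal article URL returns an empty list.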