Webmaster Forums - Website and SEO Help

Microsys Products and Webmaster Tools => A1 Website Download => Topic started by: ericgourmet on July 09, 2017, 06:47:12 AM

Title: A1 Website download exclude pages
Post by: ericgourmet on July 09, 2017, 06:47:12 AM
Hi,
I am trying to exclude web pages whose names begin with the characters MS_
The pages are downloaded into different directories, each with a name beginning with MS_ and an .html extension. How do I exclude these pages from both analysis and output?

Example:
dir1/MS_987.html should be excluded
dir2/MS_48732.html should be excluded
dir1/boxes.html should be included

Thanks!
Title: Re: A1 Website download exclude pages
Post by: Webhelpforums on July 09, 2017, 06:58:34 AM
Please see:
https://www.microsystools.com/products/website-download/help/website-download-convert-links/

It sounds like those files (MS_ + .html) are the ones created when two different URLs would map to the same file name on disk (this happens because file names on disk do not allow the same characters as URLs do).

I have considered a different approach, but a problem I encountered with huge websites was that a "simple" method, such as just removing illegal characters, produced cases where URLs collided when saved to disk.
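
To illustrate the collision problem described above, here is a minimal Python sketch. It is not how A1 Website Download generates its names: the set of illegal characters, the two example paths, and the hash-based `safe_name` scheme are all assumptions made for illustration.

```python
import hashlib
import re

# Characters that Windows does not allow in file names.
ILLEGAL = re.compile(r'[<>:"/\\|?*]')


def naive_name(url_path: str) -> str:
    """The "simple" method: just drop the illegal characters."""
    return ILLEGAL.sub("", url_path)


def safe_name(url_path: str) -> str:
    """Hypothetical alternative: derive the name from a hash of the full path."""
    digest = hashlib.sha1(url_path.encode("utf-8")).hexdigest()[:8]
    return f"MS_{digest}.html"


# Two different (made-up) URL paths that differ only by a "?".
urls = ["page?id=1", "pageid=1"]

naive = [naive_name(u) for u in urls]  # both become the same file name
safe = [safe_name(u) for u in urls]    # each gets its own file name

print(naive)
print(safe)
```

With the naive method both paths end up as the same file on disk, so one page would overwrite the other. Generating a separate name per URL avoids that, at the cost of less readable file names.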




General way of excluding pages from crawl:
https://www.microsystools.com/products/website-download/help/website-crawler-scanner-filters/

General way of excluding pages from final output:
https://www.microsystools.com/products/website-download/help/website-crawler-output-filters/
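
For reference, the kind of name pattern asked about can be sketched in Python as below. This only shows which saved file paths would match, for example when post-processing the downloaded output. It is not the filter syntax of A1 Website Download (see the help pages above for that), and the regular expression itself is an assumption based on the three example paths in the question.

```python
import re

# Matches a path whose last segment starts with "MS_" and ends with ".html".
MS_PAGE = re.compile(r"(^|/)MS_[^/]*\.html$")

# The example paths from the original question.
paths = ["dir1/MS_987.html", "dir2/MS_48732.html", "dir1/boxes.html"]

excluded = [p for p in paths if MS_PAGE.search(p)]
included = [p for p in paths if not MS_PAGE.search(p)]

print("excluded:", excluded)
print("included:", included)
```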
Title: Re: A1 Website download exclude pages
Post by: ericgourmet on July 29, 2017, 02:50:36 AM
OK, that solved my problem. I had to go into the Crawler Options and check the Cutout "?" (GET parameters) in internal links option.
Thanks!
Title: Re: A1 Website download exclude pages
Post by: Webhelpforums on July 29, 2017, 04:50:31 AM
That will not be a favourable solution for websites where ? parameters are an important part of URLs,

e.g. fetchpage?page=about and fetchpage?page=contact

The way A1WD handles it by default ensures that both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and that internal linking works.
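
A small Python sketch of why cutting out the "?" part loses pages on such sites. The `cut_query` helper and the `example.com` host are assumptions for illustration only, and this is not A1WD's own code.

```python
from urllib.parse import urlsplit, urlunsplit


def cut_query(url: str) -> str:
    """Hypothetical helper: drop everything after "?" (the GET parameters)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# example.com is a placeholder host; the paths are the two from this post.
urls = [
    "http://example.com/fetchpage?page=about",
    "http://example.com/fetchpage?page=contact",
]

# Both URLs collapse into one, so only a single page would be crawled.
distinct_after_cut = {cut_query(u) for u in urls}
print(distinct_after_cut)
```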

It would be possible to keep more of the original names and simply append the "MS_xx" part to them. Would that be better? If so, I will add it to the wishlist / create an option for it.
Title: Re: A1 Website download exclude pages
Post by: RichardBaker on September 13, 2017, 09:14:28 AM
Were you able to figure out how to exclude these pages from both analysis and output?
Title: Re: A1 Website download exclude pages
Post by: Webhelpforums on September 17, 2017, 11:48:53 AM
Quote from: Webhelpforums on July 29, 2017, 04:50:31 AM
That will not be a favourable solution for websites where ? parameters are an important part of URLs,

e.g. fetchpage?page=about and fetchpage?page=contact

The way A1WD handles it by default ensures that both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and that internal linking works.

It would be possible to keep more of the original names and simply append the "MS_xx" part to them. Would that be better? If so, I will add it to the wishlist / create an option for it.

Update: The option now exists and can be found under "Scan website | Download options".