Hi,
I am trying to exclude web pages based on their names beginning with the characters MS_
The pages are downloaded in different directories, each page name beginning with MS_ and html extension. How do I exclude these pages from both analysis and output?
Example:
dir1/MS_987.html should be excluded
dir2/MS_48732.html should be excluded
dir1/boxes.html should be included
Thanks!
Please see:
https://www.microsystools.com/products/website-download/help/website-download-convert-links/
Sounds like those files (MS_ + .html) are the ones created when two different URLs map to the same file name on disk (this happens because file names on disk do not allow all the characters that URLs do).
I have contemplated a different approach, but a problem I encountered with huge websites was that if I just removed illegal characters or used a similarly "simple" method, I would run into cases where URLs collided when saved to disk.
General way of excluding pages from crawl:
https://www.microsystools.com/products/website-download/help/website-crawler-scanner-filters/
General way of excluding pages from final output:
https://www.microsystools.com/products/website-download/help/website-crawler-output-filters/
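To make the matching rule concrete, here is a small Python sketch of the exclusion logic the filters would need to express: match any file whose name begins with "MS_" and ends in ".html", regardless of which directory it sits in. (This is only an illustration of the pattern; the actual filter syntax is whatever the A1WD filter fields accept.)

```python
import re

# Match paths like dir1/MS_987.html or dir2/MS_48732.html,
# but not dir1/boxes.html: the file NAME must start with "MS_".
EXCLUDE = re.compile(r"(^|/)MS_[^/]*\.html$", re.IGNORECASE)

def should_exclude(path: str) -> bool:
    """Return True when the path's file name begins with MS_ and ends in .html."""
    return bool(EXCLUDE.search(path))

print(should_exclude("dir1/MS_987.html"))    # True
print(should_exclude("dir2/MS_48732.html"))  # True
print(should_exclude("dir1/boxes.html"))     # False
```

The `(^|/)` anchor ensures "MS_" is tested against the start of the file name itself, so a directory name containing "MS_" elsewhere would not accidentally match.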
Ok, that solved my problem. I had to go into the Crawler Options and check the "Cutout "?" (GET parameters) in internal links" option.
Thanks!
That will not be a favourable solution for websites where "?" parameters are an important part of URLs,
e.g. fetchpage?page=about and fetchpage?page=contact
The way A1WD handles it by default ensures both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and internal linking still works.
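A small sketch of why cutting out the "?" part causes trouble (this is a hypothetical simplification, not A1WD's actual renaming scheme): stripping the query string makes distinct URLs collapse to the same file, while appending a unique suffix in the spirit of "MS_xx" keeps them apart.

```python
def naive_filename(url: str) -> str:
    # Drop the query string entirely -> distinct URLs can collide on disk.
    return url.split("?")[0] + ".html"

def suffixed_filename(url: str, n: int) -> str:
    # Keep the base name but append a unique per-URL suffix,
    # similar in spirit to the "MS_xx" renaming discussed here.
    return url.split("?")[0] + f"_MS_{n}.html"

a = naive_filename("fetchpage?page=about")
b = naive_filename("fetchpage?page=contact")
print(a == b)  # True: both URLs collapse to "fetchpage.html" -> collision

print(suffixed_filename("fetchpage?page=about", 1))    # fetchpage_MS_1.html
print(suffixed_filename("fetchpage?page=contact", 2))  # fetchpage_MS_2.html
```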
It is possible one could keep more of the old names and simply append the "MS_xx" part to them - would that be better? If so, I will add it to the wishlist / create an option for it.
Were you able to figure out how to exclude these pages from both analysis and output?
Quote from: Webhelpforums on July 29, 2017, 04:50:31 AM
That will not be a favourable solution for websites where "?" parameters are an important part of URLs,
e.g. fetchpage?page=about and fetchpage?page=contact
The way A1WD handles it by default ensures both pages are downloaded (but renamed, because e.g. "?" cannot be part of a file name on Windows) and internal linking still works.
It is possible one could keep more of the old names and simply append the "MS_xx" part to them - would that be better? If so, I will add it to the wishlist / create an option for it.
Update: This option exists now and is found under "Scan website | Download options".