Cannot crawl this site...

cgeyma · April 01, 2011, 01:46:35 PM

Trying to crawl the following site but Web Analyzer returns immediately without any data.

http://www.npsmm2011.com

I see that this site does some redirects before the site home page is delivered.
Wouldn't WA follow the redirects?

Then, I tried putting in the redirected URL in WA and WA crawled a few pages then stopped.
Tried turning on a few crawl engine options but no help.

Can someone help explain why WA can't crawl the site completely?

Thanks.

Webhelpforums · April 01, 2011, 08:16:32 PM

In http://www.npsmm2011.com/robots.txt you disallow crawling.

With default settings A1 Website Analyzer obeys robots.txt, noindex, nofollow etc.
That is also why, with *default settings*, you see no URLs listed after the website scan: robots.txt blocks them.
(Possibly I should change default settings in A1 Website Analyzer to show this...)

Read about robots.txt in A1 Website Analyzer
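
To see the effect of a blocking robots.txt outside the tool, here is a minimal Python sketch using the standard library robots.txt parser. The robots.txt text below is an assumption (a blanket disallow); the real file on the site may contain different rules. The `is_allowed` helper and the "ExampleCrawler" name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# ASSUMED robots.txt content that blocks every crawler; the real file
# at http://www.npsmm2011.com/robots.txt may differ.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleCrawler" is a hypothetical user agent name.
    print(is_allowed(ASSUMED_ROBOTS_TXT, "ExampleCrawler", "http://www.npsmm2011.com/"))
```

A crawler that obeys such a file stops before fetching a single page, which matches a scan that returns immediately without data.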

The other reason is a bit more tricky, but has to do with default analysis and output filters.

First make sure you have disabled easy mode

Then go into
analysis filters and output filters

and remove *all* file extensions listed in both.
(A1 Website Analyzer will then depend entirely on MIME types for figuring out what URLs to analyze)
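
As a rough illustration of why an extension list can get in the way on this site, the Python sketch below checks URLs against a file extension list. It is a simplified stand-in, not A1 Website Analyzer's actual filter logic, and the extension list is an assumption rather than the program's defaults.

```python
import posixpath
from urllib.parse import urlparse

# HYPOTHETICAL extension list, not the real A1 Website Analyzer defaults.
ASSUMED_EXTENSIONS = {".html", ".htm", ".php", ".asp"}


def url_extension(url):
    """Return the lower-cased file extension of the URL path ('' if none)."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


def passes_extension_filter(url, extensions=ASSUMED_EXTENSIONS):
    """Return True if the URL path ends in one of the listed extensions."""
    return url_extension(url) in extensions


if __name__ == "__main__":
    for url in (
        "http://www.npsmm2011.com/portal/site/NPSMM",
        "http://example.com/index.html",
    ):
        print(url, passes_extension_filter(url))
```

Portal style URLs such as /portal/site/NPSMM carry no file extension at all, so a filter of this kind has nothing to match on and only the MIME type returned by the server can tell what the URL contains.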

If that is not enough: it also appears your website checks the user agent string. At least I get better results when changing the user agent to MSIE8...

I will take a better look tomorrow. The website scan results are still not satisfactory, but I am sure I will solve the rest tomorrow.
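
For anyone who wants to test the user agent theory outside the tool, here is a small Python sketch that builds a request with an MSIE8 style user agent. The exact string is an assumption (a typical Internet Explorer 8 value), and `build_request` is a made-up helper name.

```python
from urllib.request import Request

# ASSUMED user agent: a typical Internet Explorer 8 string.
MSIE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"


def build_request(url, user_agent=MSIE8_USER_AGENT):
    """Build a request that identifies itself with the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://www.npsmm2011.com/portal/site/NPSMM")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` and comparing the response with one fetched under the default Python user agent would show whether the server really answers differently.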

cgeyma · April 01, 2011, 08:42:01 PM

Thanks very much for your response.
Yes, when I had the very limited success, I had turned on the option to ignore robots.txt.

I have just now followed your directions and was only able to get the home page loaded.
None of the links from the home page were crawled.

Looking forward to your further analysis.

Webhelpforums · April 02, 2011, 09:29:22 AM

Okay. The website URL:
http://www.npsmm2011.com/portal/site/NPSMM

contains (snipping out lots of code)

<script type="text/javascript">
window.location.replace(rpstring + '/portal/site/NPSMM/menuitem.blablablablabla/?vgnextoid=blablablablablablablablabla');

While A1WA (when Javascript parsing is enabled) does pick up the string with the URL, the Javascript handling code judges it to be an invalid URL. I will try to fix that. The check exists to avoid too many "false positives": strings extracted from Javascript functions include lots of data that may look like URLs without being actual URLs.

cgeyma · April 02, 2011, 09:47:38 AM

Yes, the Javascript.
I did take the URL in the Javascript and put it in as the scan path.
WA picked up a few more links and, strangely, they were picked up as external URLs.

If I look into the source of the home page, there are a few <a> tags.
I was surprised when WA did not pick those up because I thought the hrefs in <a> tags are simple.

Thanks.

Webhelpforums · April 02, 2011, 10:56:09 AM

When I look at the raw content returned to A1WA for that URL, here is what is returned.

<html>
<head>
<title>redirecting...</title>
</head>
<body>
	<script type="text/javascript">
		//alert('/portal/site/NPSMM/menuitem.d05e0bdbfb3f07a50f160f16d6813453/?vgnextoid=0c2b85a28895e210VgnVCM2000007d184335RCRD');
		var rpstring = "";
		if (location.hostname.indexOf("mbnetstar.com") > 0) {
			rpstring= "/rp_connectionprod";
		}
		window.location.replace(rpstring + '/portal/site/NPSMM/menuitem.d05e0bdbfb3f07a50f160f16d6813453/?vgnextoid=0c2b85a28895e210VgnVCM2000007d184335RCRD');
		// do not use location.href for one less entry in browser history
	</script>
</body>
</html>

When a browser sees that, it will probably load the real content containing the <a> URLs.
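
To make the redirect concrete, here is a minimal Python sketch that pulls the target out of a `window.location.replace(...)` call in a page like the one above. It is only an illustration, not how A1WA parses Javascript: it assumes the optional variable prefix (here `rpstring`) is empty, which holds for this page unless the hostname contains "mbnetstar.com", and `find_js_redirect` is a made-up function name.

```python
import re
from urllib.parse import urljoin

# Matches window.location.replace(<optional variable> + 'literal') and
# captures the quoted literal. The variable prefix is ignored (ASSUMED empty).
REDIRECT_RE = re.compile(
    r"""window\.location\.replace\(\s*(?:\w+\s*\+\s*)?(['"])(?P<url>[^'"]+)\1\s*\)"""
)


def find_js_redirect(html, base_url):
    """Return the absolute redirect target found in html, or None."""
    for line in html.splitlines():
        code = line.strip()
        if code.startswith("//"):
            continue  # skip lines that are entirely a Javascript comment
        match = REDIRECT_RE.search(code)
        if match:
            return urljoin(base_url, match.group("url"))
    return None


if __name__ == "__main__":
    sample = "window.location.replace(rpstring + '/portal/site/NPSMM/menuitem.x/?vgnextoid=y');"
    print(find_js_redirect(sample, "http://www.npsmm2011.com/portal/site/NPSMM"))
```

A pattern this narrow only recognises one redirect idiom. That is the trade-off with any Javascript string extraction: looser matching finds more URLs but also more strings that are not URLs at all.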

Anyway, I am getting it to work in A1WA. I will have a new version out soon.

cgeyma · April 02, 2011, 09:31:38 PM

Thanks.

Please do let me know when the new version is available.
I just updated to 3.1.4 from 3.1.3 that was purchased a couple of weeks ago.
I am not sure if 3.1.4 is the version you just mentioned you will be releasing soon.

Webhelpforums · April 02, 2011, 11:32:23 PM

3.1.5 will be the new version.

You may also want to drop me an email, so I can email you a project file

Webhelpforums · April 04, 2011, 08:44:18 PM

I will be releasing the new version soon and will email you the project file!

As I wrote earlier, for your site the analysis/output filters cannot depend on file extensions.
(By default they use file extensions in addition to MIME types.)

That would normally be fine. But it also appears that
A1WA is getting wrong MIME types returned by the server for some URLs.

This problem can be solved with these options:
Enable: Scan website | Crawler options | Fix URL "mime" types when webserver is returning obvious wrong data
Disable: Scan website | Crawler engine | Default to GET for page requests (instead of HEAD followed by GET)
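
To give an idea of what correcting an obviously wrong MIME type can mean, here is a small Python sketch that looks at the first bytes of a response body. It is a guess at the general idea only, not the actual logic behind the A1WA option, and `fix_mime_type` is a made-up name. Note that a check like this needs the response body, which a HEAD request does not deliver.

```python
def fix_mime_type(declared, body_start):
    """Return 'text/html' when the body clearly is HTML, else the declared type.

    declared:   Content-Type header value sent by the server
    body_start: first bytes of the response body
    """
    head = body_start.lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    # Otherwise trust the server, dropping parameters such as charset.
    return declared.split(";")[0].strip().lower()


if __name__ == "__main__":
    print(fix_mime_type("application/octet-stream", b"<html>\n<head>"))
    print(fix_mime_type("text/css; charset=utf-8", b"body { margin: 0; }"))
```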

After doing all this, scan speed and results became satisfactory.
