Cannot crawl this site...

Started by cgeyma, April 01, 2011, 01:46:35 PM

cgeyma

Trying to crawl the following site but Web Analyzer returns immediately without any data.

http://www.npsmm2011.com

I see that this site does some redirects before the site home page is delivered.
Wouldn't WA follow the redirects?

Then I tried entering the redirected URL in WA; it crawled a few pages and then stopped.
I tried turning on a few crawl engine options, but that did not help.

Can someone help explain why WA can't crawl the site completely?

Thanks.

Webhelpforums

In
http://www.npsmm2011.com/robots.txt
you disallow crawling :)

With default settings, A1 Website Analyzer obeys robots.txt, noindex, nofollow etc.
That is also why, with *default settings*, you see no URLs after the website scan: they are excluded because of robots.txt.
(Possibly I should change the default settings in A1 Website Analyzer to show this...)
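
As a rough illustration of what "obeying robots.txt" means for a crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt body and the `ExampleCrawler` user agent name are assumptions made for the example; the actual file on the site may contain different rules, and this is not how A1 Website Analyzer itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that disallows everything for all crawlers.
# The real file on the site may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.npsmm2011.com/portal/site/NPSMM"
    # "ExampleCrawler" is a made-up user agent name for illustration.
    print(is_allowed(ROBOTS_TXT, "ExampleCrawler", url))
```

A crawler that respects such a file would skip every URL on the site, which matches the empty result seen with default settings.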

Read about robots.txt in A1 Website Analyzer


The other reason is a bit trickier and has to do with the default analysis and output filters.

First, make sure you have disabled easy mode.

Then go into
analysis filters and output filters

and remove *all* file extensions listed in both.
(A1 Website Analyzer will then depend entirely on MIME types for figuring out which URLs to analyze.)


If that was not enough... It also appears that your website checks the user agent string. At least I get better results when changing the user agent to MSIE8...
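
For anyone who wants to test the user agent theory outside the tool, a minimal Python sketch follows. It only builds the request object and does not send it. The exact MSIE 8 string below is an assumption (a typical Internet Explorer 8 identifier), and the string A1 Website Analyzer sends may differ.

```python
from urllib.request import Request

# Assumed MSIE 8 style user agent string; the tool may send a different one.
MSIE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"


def build_request(url: str, user_agent: str = MSIE8_USER_AGENT) -> Request:
    """Create a request that identifies itself with the given user agent string."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("http://www.npsmm2011.com/portal/site/NPSMM")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` with and without the custom header would show whether the server answers differently for the two cases.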

I will take a better look tomorrow. The website scan results are still not satisfactory. I am sure I will solve the rest tomorrow :)
TechSEO360 | MicrosysTools.com  | A1 Sitemap Generator, A1 Website Analyzer etc.

cgeyma

Thanks very much for your response.
Yes, when I had the very limited success, I had turned on the option to ignore robots.txt.

I have just now followed your directions and was only able to get the home page loaded.
None of the links from the home page were crawled.

Looking forward to your further analysis.

Webhelpforums

Okay. The website URL:
http://www.npsmm2011.com/portal/site/NPSMM

contains (snipping out lots of code)


   <script type="text/javascript">
      window.location.replace(rpstring + '/portal/site/NPSMM/menuitem.blablablablabla/?vgnextoid=blablablablablablablablabla');

A1WA (when JavaScript parsing is enabled) does pick up the string with the URL, but the JavaScript handling code judges it to be an invalid URL. I will try to fix that. The check exists to avoid too many "false positives": strings extracted from JavaScript functions include lots of data that may look like URLs without being actual URLs.
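
To make the false-positive problem concrete, here is a small Python sketch of one possible heuristic. It is an assumption for illustration only, not the rule A1WA uses: it pulls quoted string literals out of a script and keeps those that start like an absolute URL or a root-relative path and contain no whitespace. The names `looks_like_url` and `extract_url_candidates` are hypothetical.

```python
import re

# Matches single- or double-quoted string literals.
# Escaped quotes are not handled; this is a deliberate simplification.
STRING_LITERAL = re.compile(r"""(['"])(.*?)\1""")


def looks_like_url(candidate: str) -> bool:
    """Hypothetical heuristic: absolute URLs or root-relative paths, no whitespace."""
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    return candidate.startswith(("http://", "https://", "/"))


def extract_url_candidates(script: str) -> list[str]:
    """Return the string literals in script that pass the heuristic."""
    literals = (match.group(2) for match in STRING_LITERAL.finditer(script))
    return [text for text in literals if looks_like_url(text)]
```

Even a simple rule like this accepts strings that are not complete URLs on their own, such as a path prefix stored in a variable, which is why a stricter check can end up rejecting a valid redirect target.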

cgeyma

Yes, the JavaScript.
I did take the URL from the JavaScript and put it in as the scan path.
WA picked up a few more links but, strangely, classified them as external URLs.

If I look into the source of the home page, there are a few <a> tags.
I was surprised when WA did not pick those up because I thought the hrefs in <a> tags are simple.

Thanks.

Webhelpforums

When I look at the raw content returned to A1WA for that URL, this is what I see:



<html>
<head>
<title>redirecting...</title>
</head>
<body>
<script type="text/javascript">
//alert('/portal/site/NPSMM/menuitem.d05e0bdbfb3f07a50f160f16d6813453/?vgnextoid=0c2b85a28895e210VgnVCM2000007d184335RCRD');
var rpstring = "";
if (location.hostname.indexOf("mbnetstar.com") > 0) {
rpstring= "/rp_connectionprod";
}
window.location.replace(rpstring + '/portal/site/NPSMM/menuitem.d05e0bdbfb3f07a50f160f16d6813453/?vgnextoid=0c2b85a28895e210VgnVCM2000007d184335RCRD');
// do not use location.href for one less entry in browser history
</script>
</body>
</html>



When a browser sees that, it will follow the redirect and load the real content containing the <a> URLs.
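
A crawler that does not execute JavaScript has to recover the redirect target from the page text. The Python sketch below shows one way this could be done for the page above. It is an illustration under stated assumptions, not the fix that went into A1WA: it assumes the variable prefix (`rpstring` here) evaluates to an empty string, which is what the script does for hosts outside mbnetstar.com, and the function name `js_redirect_target` is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Captures the quoted literal inside window.location.replace(...),
# skipping an optional "identifier +" prefix such as rpstring.
REDIRECT = re.compile(
    r"""window\.location\.replace\(\s*(?:\w+\s*\+\s*)?(['"])(.+?)\1\s*\)"""
)


def js_redirect_target(page_url: str, html: str) -> Optional[str]:
    """Return the absolute URL of a JavaScript redirect in html, or None if absent."""
    match = REDIRECT.search(html)
    if match is None:
        return None
    # Assumption: the skipped variable prefix is empty, so the literal is the full path.
    return urljoin(page_url, match.group(2))
```

Calling it with the page URL `http://www.npsmm2011.com/portal/site/NPSMM` and the HTML above resolves the root-relative path against the site host, giving a URL that can be queued for crawling like any other link.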

Anyway, I am getting it to work in A1WA. I will have a new version out soon.

cgeyma

Thanks.

Please do let me know when the new version is available.
I just updated to 3.1.4 from 3.1.3, which I purchased a couple of weeks ago.
I am not sure whether 3.1.4 is the version you said you would be releasing soon.

Webhelpforums

3.1.5 will be the new version.

You may also want to drop me an email, so I can email you a project file :)

Webhelpforums

I will be releasing the new version soon and will email you a project file! :)

As I wrote earlier, for your site the analysis/output filters cannot depend on file extensions.
(By default, file extensions are used in addition to MIME types.)

That would normally be fine, but it also appears that the server returns wrong MIME types to A1WA for some URLs.

This problem can be solved with these options:
Enable: Scan website | Crawler options | Fix URL "mime" types when webserver is returning obvious wrong data
Disable: Scan website | Crawler engine | Default to GET for page requests (instead of HEAD followed by GET)

After doing all this, scan speed and results became satisfactory :D
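
For readers curious what fixing an obviously wrong MIME type could look like, here is a minimal Python sketch. It is an assumption about one plausible approach, not a description of what the A1WA option does internally: if the start of the response body clearly looks like HTML but the declared type says otherwise, the declared type is overridden. The function name `effective_mime_type` is hypothetical.

```python
# Declared types that already mean "HTML page" and need no correction.
HTML_TYPES = ("text/html", "application/xhtml+xml")


def effective_mime_type(declared: str, body_start: bytes) -> str:
    """Return the declared MIME type, corrected to text/html if the body is clearly HTML."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    declared = declared.split(";", 1)[0].strip().lower()
    # Inspect only the first bytes, ignoring leading whitespace and letter case.
    sniff = body_start[:512].lstrip().lower()
    looks_like_html = sniff.startswith((b"<!doctype html", b"<html"))
    if looks_like_html and declared not in HTML_TYPES:
        return "text/html"
    return declared
```

For example, a page served as `text/plain` whose body starts with `<html>` would be treated as `text/html`, so its links would still be analyzed when the filters rely on MIME types alone.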