

Previously on "Tool to inspect a website structure?"


  • Sysman
    replied
    Originally posted by NickFitz View Post
    For example, it's not unknown for server logs to accidentally be made available at an unsecured URL...
    I've come across a few folks who willingly publish their web stats without massaging them first. It doesn't take much imagination to realise that a supposedly hidden URL could pop up in those stats.



  • northernladuk
    replied
    Not completely related, but has anyone read the investigation into how they caught the creators of the Facebook worm Koobface? Much of that came from info left on servers.

    Very interesting... if you like that type of thing...

    The Koobface malware gang – exposed! | Naked Security



  • TheFaQQer
    replied
    Originally posted by NickFitz View Post
    However don't thereby start to believe that putting a page/file on a server and not linking to it is a good way of keeping it secure from prying eyes. There are a number of ways things can end up being accidentally linked to. For example, it's not unknown for server logs to accidentally be made available at an unsecured URL...
    There was a great article in the most recent 2600 magazine about how people edit files on the server, and the text editor automatically creates a backup by appending a ~ to the extension.

    So, if you do a search by filetype on Google, you can easily find (for example) sites which have *.php~ files. Those won't get executed as PHP, and will expose their contents to anyone who looks.

    If you do a search for "wp-config.php~" I reckon you could quite easily find the database connection details and password for quite a few WordPress blogs out there...
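    A minimal sketch of the trick being described, assuming a handful of common editor-backup suffixes (the suffix list and the example URL are illustrative, not exhaustive):

```python
# Given a known script URL, list the editor-backup variants worth checking.
# Servers generally won't execute these as PHP, so if one exists it gets
# served as plain text, source code and all.
def backup_candidates(url):
    suffixes = ["~", ".bak", ".orig", ".swp"]  # assumed common suffixes
    return [url + s for s in suffixes]

for candidate in backup_candidates("https://example.com/wp-config.php"):
    print(candidate)  # e.g. https://example.com/wp-config.php~
```

    Feeding each candidate to an HTTP client and checking for a 200 response would complete the picture; that part is deliberately left out here.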



  • NickFitz
    replied
    Originally posted by d000hg View Post
    That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

    I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?
    Assuming the site is at a newly-registered domain, Google's nameservers will find out about the new domain and tell the spider to go and have a look.

    Other than that, as PAH and NLUK have said, it's just a question of following links.

    However don't thereby start to believe that putting a page/file on a server and not linking to it is a good way of keeping it secure from prying eyes. There are a number of ways things can end up being accidentally linked to. For example, it's not unknown for server logs to accidentally be made available at an unsecured URL...



  • northernladuk
    replied
    Originally posted by d000hg View Post
    Exactly

    That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

    I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?
    Google can only find it if it has been told where it is. It does this by...

    1) Having the page submitted manually to Google, which you can do here: Overview - Submit your content.
    Make it a page with either a sitemap XML or a lot of links through your pages (like a sitemap page). Google then crawls all the links. Submit a single page with no links in or out and it will take that page alone once, bugger off and never return.

    2) Having links from other pages that Google rates (for faster and more frequent crawling), and its spider will come and visit you at some point. Paid links or relevant content links. The more relevant Google deems them, the more likely you are to rate higher.

    3) Submitting to user-generated directories like DMOZ, but because it is user-moderated it can take forever.

    Google AFAIK does not index new pages that appear out of the blue. A page has to be linked for the spiders to find it. No linkey no likey....
    Last edited by northernladuk; 18 January 2012, 13:29.
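    For what it's worth, the sitemap XML mentioned in 1) is simple enough to generate yourself; a minimal sketch (the URLs are placeholders):

```python
# Build a minimal sitemap.xml body listing the pages you want crawled.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        # each page gets a <url><loc>...</loc></url> entry
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
    return ET.tostring(urlset, encoding="unicode")

print(build_sitemap(["http://mysite.com/", "http://mysite.com/some_random_page.html"]))
```

    Save the output as sitemap.xml at the site root and point Google's submission form at it.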



  • PAH
    replied
    Originally posted by d000hg View Post
    I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?

    Nope. Google uses links to find pages. A new site needs to be linked to from another site for Google to find it, or you can manually submit a site or page to Google for adding to their index. There's a special page on Google somewhere to do that.

    The only way a page that's not linked to may be found is if it uses dynamic URLs, where there's something on the querystring to identify the page content to return, such as 'page=1'. Then some search engines might use an incrementer to find all possible entries, but I wouldn't rely on it.
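    The incrementer idea in code, purely to illustrate what such an enumeration would look like (the parameter name and range are made up, and nothing guarantees any search engine actually does this):

```python
# Step a querystring parameter to enumerate candidate dynamic URLs.
def enumerate_pages(base, param="page", start=1, count=5):
    return ["%s?%s=%d" % (base, param, n) for n in range(start, start + count)]

for url in enumerate_pages("http://mysite.com/view.php"):
    print(url)  # http://mysite.com/view.php?page=1 ... ?page=5
```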



  • d000hg
    replied
    Originally posted by PAH View Post
    Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not reachable via normal link navigation, so a website spidering tool won't work.
    Exactly

    Originally posted by TheFaQQer View Post
    If the pages aren't public, then what is going to know that they are there?
    That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

    I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?



  • TheFaQQer
    replied
    If the pages aren't public, then what is going to know that they are there?

    Search engines aren't going to find them, since there's nothing for them to crawl through.

    If you use something that will download the entire site, then it will follow links to find the pages, so that's not going to be any use.
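    To illustrate the point: a spider only ever reaches pages by following links from the start page, so an unlinked page never turns up. A toy crawl over a made-up in-memory link graph:

```python
# Breadth-first crawl of a link graph; pages not reachable from the
# start page are simply never visited.
from collections import deque

def crawl(links, start):
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

site = {"/": ["/about", "/contact"], "/about": ["/"], "/orphan": []}
print(crawl(site, "/"))  # "/orphan" is never reached
```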

    If you own the site, there are tools you can use to find the orphaned pages, but for a site you have nothing to do with, where directories are secured in any way, you aren't going to get anything from there.



  • PAH
    replied
    If it's not your site and they've got security blocking folder/directory browsing then it doesn't appear to be a simple task.

    You could compare older versions of the site via the Wayback Machine.

    Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not reachable via normal link navigation, so a website spidering tool won't work.
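    One concrete way to use the Wayback Machine for this: its CDX API (publicly documented, though treat the exact parameters here as an assumption) can list every URL it has archived for a domain, which you can then diff against what the live site currently links to. A sketch that just builds the query:

```python
# Build a Wayback Machine CDX query listing archived URLs for a domain.
from urllib.parse import urlencode

def cdx_query(domain):
    params = {"url": domain + "/*", "output": "json", "fl": "original", "collapse": "urlkey"}
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(cdx_query("example.com"))
# Fetch the result with urllib.request.urlopen() and compare the returned
# URL list against a crawl of the live site.
```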



  • Joeman
    replied
    Originally posted by d000hg View Post
    Are there any tools which will generate a nice report of pages on a specific site/domain... i.e. finding pages which are publicly accessible but not linked from the main site?
    Look to see if the site has a robots.txt file. Often in there you can find references to parts of the site the owner doesn't want crawled by search engines.

    Besides clues like this, unless directory browsing is enabled with no default page, I'm not sure how you can find pages that aren't linked from the site.
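    A quick sketch of mining robots.txt for those clues (the sample file content is made up):

```python
# Pull the Disallow paths out of a robots.txt body; these often name
# the directories the owner doesn't want search engines to crawl.
def disallowed_paths(robots_txt):
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # "Disallow:" with no path means allow everything
                paths.append(path)
    return paths

sample = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/  # scratch\n"
print(disallowed_paths(sample))  # ['/private/', '/tmp/']
```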



  • d000hg
    replied
    Originally posted by TheFaQQer View Post
    Have you tried this?
    Any particular reason you felt justified in using LMGTFY when the search phrase contains a technical term?

    Those tools are also NOT what I asked for; they seem to work by crawling recursively from the homepage... meaning they'd miss pages that aren't reachable by following links?
    Last edited by d000hg; 18 January 2012, 08:51.



  • northernladuk
    replied
    Yeah, Marillionfan... but he is expensive.

    Sorted.



  • TheFaQQer
    replied
    Have you tried this?



  • d000hg
    started a topic Tool to inspect a website structure?


    Are there any tools which will generate a nice report of pages on a specific site/domain... i.e. finding pages which are publicly accessible but not linked from the main site?
