Help wanted - extracting data from the web using Excel 2007

**Spacecadet** · 19 November 2011, 18:52

Originally posted by Zippy

Tabular info should appear in tables. Unfortunately the 'thou shalt not use tables for layout' rule has been taken to mean that you shouldn't use tables at all.

WZS

I blame Nick Fitz

**NickFitz** · 19 November 2011, 19:45

Originally posted by Spacecadet

WZS

I blame Nick Fitz

OI! I've always insisted that tabular data be presented in tables - you should have seen the beauty that was the 5-day weather forecast on GCap Media's local radio sites

Ooh - you can, at the Wayback Machine

Anyway, getting back to the original question: have a look at YQL, which is a SQL-like way of extracting arbitrary data from the web and returning it in a structured form. I've previously posted an example of its use to extract useful data from CUK (bit of a first, that), and Yahoo! have copious documentation.

**TimberWolf** · 19 November 2011, 19:54

For manual methods that put the entire HTML page text into a string and then use string functions: is there a function that would retrieve only the text strings that would be displayed on the web page, rather than the entire (and rather larger) quantity of HTML background gubbins?
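One way to get only the displayed text, sketched in Python rather than Excel/VBA (an assumption, since the thread names no language for this step). It uses the standard library's `html.parser` and skips the contents of `<script>` and `<style>`; the class and function names are made up for illustration, and it makes no attempt to honour CSS that hides elements.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring tags and script/style contents."""

    SKIP = {"script", "style"}  # contents of these are never displayed

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the list of non-empty text strings found in the markup."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.chunks


if __name__ == "__main__":
    sample = "<h1>KEY FACTS</h1><script>var x = 1;</script><p>Current Bank Rate</p>"
    print(visible_text(sample))
```

The result is a list of strings in document order, which is usually easier to search than the raw page source.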

**mudskipper** · 19 November 2011, 20:05

Not wishing to be picky but...

BOE are using tables for their interest rate.

Code:

				<h1>KEY FACTS</h1>
				<table width="245" border="0" cellspacing="0" cellpadding="0" id="keyfacts">
					<tr>

						<td width="145" valign="top" class="kflbold">Current Bank Rate </td>
						<td width="100" valign="top" class="kfrbold"><img src="/images/kfarrow.gif" width="13" height="13" border="0" />0.5%</td>
					</tr>

Data that should, arguably, be non-tabular.

Not to mention the proliferation of <h1> tags...

**NickFitz** · 19 November 2011, 20:16

Originally posted by k2p2

Not wishing to be picky but...

BOE are using tables for their interest rate.

Ah, so the following YQL query will do the needful:

Code:

select * from html 
    where url="http://www.bankofengland.co.uk/" 
    and xpath="//table[@id='keyfacts']/tr[1]/td[2]/p/text()"

returning the following XML:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2011-11-19T20:13:13Z" yahoo:lang="en-US">
    <results>0.5%</results>
</query>

from which the result can easily be extracted using the MSXML DOM, or whatever similar facilities Excel might offer for importing XML over HTTP.

Edit: that YQL URL in full:

Code:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.bankofengland.co.uk%2F%22%20and%20xpath%3D%22%2F%2Ftable%5B%40id%3D'keyfacts'%5D%2Ftr%5B1%5D%2Ftd%5B2%5D%2Fp%2Ftext()%22
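A rough sketch of both halves in Python (swapped in for MSXML/VBA, purely as an assumption): building the encoded URL from the query, and pulling the value out of the XML. It parses the sample response quoted above instead of calling the service, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Endpoint taken from the full URL quoted in the post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"

QUERY = (
    'select * from html where url="http://www.bankofengland.co.uk/" '
    "and xpath=\"//table[@id='keyfacts']/tr[1]/td[2]/p/text()\""
)

# The response shown in the post, used here as offline sample data.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2011-11-19T20:13:13Z" yahoo:lang="en-US">
    <results>0.5%</results>
</query>"""


def build_url(query):
    """Percent-encode the query, leaving * ' ( ) bare as in the posted URL."""
    return YQL_ENDPOINT + "?q=" + quote(query, safe="*'()")


def parse_rate(xml_bytes):
    """Read <results> and convert a string like '0.5%' to the float 0.5."""
    root = ET.fromstring(xml_bytes)
    text = root.findtext("results")
    return float(text.strip().rstrip("%"))


if __name__ == "__main__":
    print(build_url(QUERY))
    print(parse_rate(SAMPLE_RESPONSE))
```

The `<query>` and `<results>` elements carry no namespace (only the `yahoo:` attributes do), so a plain `findtext("results")` is enough.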

**MarillionFan** · 19 November 2011, 20:23

Originally posted by NickFitz

Ah, so the following YQL query will do the needful:

Code:

select * from html 
    where url="http://www.bankofengland.co.uk/" 
    and xpath="//table[@id='keyfacts']/tr[1]/td[2]/p/text()"

returning the following XML:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2011-11-19T20:13:13Z" yahoo:lang="en-US">
    <results>0.5%</results>
</query>

from which the result can easily be extracted using the MSXML DOM, or whatever similar facilities Excel might offer for importing XML over HTTP.

Ah! You've done it. I was just looking at that as well, but I am afraid X-Factor beckons.

Now I was trying to see how you could format that, or return it in a tabular SQL-style format, with a view to UNIONing a few different queries together. I'm not sure YQL output is quite the same as SQL output without reading up on it all.

So the results would be

| Source | Description | Date | Value |
|---|---|---|---|
| BankofEngland.co.uk | Bank Base Rate | 19/11/2011 20:20 | 0.5% |
| MotleyFool.co.uk/Lloyds | Lloyds Share Price | 19/11/2011 20:20 | 25.5 |

Pull together some queries like that and you have something very useful.
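If each query ends up yielding one value, the UNION-style result is just rows with the same four columns appended together. A minimal sketch in Python (an assumption; the thread itself targets Excel), writing CSV that a spreadsheet can open. The rows reuse the illustrative figures from the table above, not live data, and `make_row` and `to_csv` are made-up helper names.

```python
import csv
import io
from datetime import datetime

COLUMNS = ["Source", "Description", "Date", "Value"]


def make_row(source, description, value, when):
    """Build one row in the Source/Description/Date/Value layout."""
    return {
        "Source": source,
        "Description": description,
        "Date": when.strftime("%d/%m/%Y %H:%M"),
        "Value": value,
    }


def to_csv(rows):
    """Serialise rows that share the same columns into CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative values copied from the example table, not fetched results.
when = datetime(2011, 11, 19, 20, 20)
rows = [
    make_row("BankofEngland.co.uk", "Bank Base Rate", "0.5%", when),
    make_row("MotleyFool.co.uk/Lloyds", "Lloyds Share Price", "25.5", when),
]

if __name__ == "__main__":
    print(to_csv(rows))
```

In a real version each `make_row` call would take its value from whatever scraper or query produced it.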

**NickFitz** · 19 November 2011, 20:56

Originally posted by MarillionFan

Ah! You've done it. I was just looking at that as well, but I am afraid X-Factor beckons.

Now I was trying to see how you could format that, or return it in a tabular SQL-style format, with a view to UNIONing a few different queries together. I'm not sure YQL output is quite the same as SQL output without reading up on it all.

So the results would be

| Source | Description | Date | Value |
|---|---|---|---|
| BankofEngland.co.uk | Bank Base Rate | 19/11/2011 20:20 | 0.5% |
| MotleyFool.co.uk/Lloyds | Lloyds Share Price | 19/11/2011 20:20 | 25.5 |

Pull together some queries like that and you have something very useful.

I think the appropriate approach to aggregating data from multiple sources like that is to create a YQL Open Data Table and add it to their repository; then Yahoo! will do all the heavy lifting of grabbing the data and aggregating it on their backend, and send you just the bits you need.

Your share price query doesn't specify bid or ask: you presumably want something like select Bid, Ask from yahoo.finance.quotes where symbol='lloy.l'

**pacharan** · 20 November 2011, 03:43

Scroon screeping?

**aussielong** · 20 November 2011, 05:48

I've done a bit of this in the past, in Java. This is how I did it, and I found it quite maintainable.

Use TagSoup to turn the HTML into well-formed XML.

Use XPath expressions to address the parts of the XML document that have the data I'm after.

When the page is updated, your XPaths might need updating. I used an XPath generator plugin in Eclipse, so I just reload the page, point and click on the bit I'm interested in, and it tells me the XPath I need. I then update my screen scraper with the new XPath.

I've been doing this to scrape some data from a vendor website for a few years now. They have updated the site over time but my stuff still works. This would take an hour or two to knock up and then no more coding to maintain it.

(You just need a thin JNI/COM wrapper to call this from your spready.)
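The same shape can be sketched in Python instead of Java (a swapped-in language, not the setup described above). It assumes the tidy-up step has already produced well-formed XML, so there is no TagSoup equivalent here. The sample markup is the Bank of England snippet posted earlier in the thread, with a wrapping `<div>` and a closing `</table>` added so it parses. `XPATHS` and `scrape` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Named XPaths kept in one place, so a site redesign means editing one line.
XPATHS = {
    "bank_rate": ".//table[@id='keyfacts']/tr[1]/td[2]",
}

# Snippet from earlier in the thread, wrapped and closed to be well-formed.
PAGE = """<div>
  <h1>KEY FACTS</h1>
  <table width="245" border="0" cellspacing="0" cellpadding="0" id="keyfacts">
    <tr>
      <td width="145" valign="top" class="kflbold">Current Bank Rate </td>
      <td width="100" valign="top" class="kfrbold"><img src="/images/kfarrow.gif" width="13" height="13" border="0" />0.5%</td>
    </tr>
  </table>
</div>"""


def scrape(xml_text, xpaths):
    """Evaluate each named XPath and return the text of the first match."""
    root = ET.fromstring(xml_text)
    results = {}
    for name, path in xpaths.items():
        node = root.find(path)
        # itertext() also picks up text that follows a child such as <img/>.
        results[name] = None if node is None else "".join(node.itertext()).strip()
    return results


if __name__ == "__main__":
    print(scrape(PAGE, XPATHS))
```

ElementTree supports only a subset of XPath, so expressions copied from a generator plugin may need simplifying. A missing node comes back as `None`, which makes a changed page easy to spot.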

**suityou01** · 20 November 2011, 12:24

Originally posted by NickFitz

Ah, so the following YQL query will do the needful:

Code:

select * from html 
    where url="http://www.bankofengland.co.uk/" 
    and xpath="//table[@id='keyfacts']/tr[1]/td[2]/p/text()"

returning the following XML:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2011-11-19T20:13:13Z" yahoo:lang="en-US">
    <results>0.5%</results>
</query>

from which the result can easily be extracted using the MSXML DOM, or whatever similar facilities Excel might offer for importing XML over HTTP.

Edit: that YQL URL in full:

Code:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.bankofengland.co.uk%2F%22%20and%20xpath%3D%22%2F%2Ftable%5B%40id%3D'keyfacts'%5D%2Ftr%5B1%5D%2Ftd%5B2%5D%2Fp%2Ftext()%22

Wow that is art!!! You can query anything these days

I also notice that

Code:

select * from html 
    where url="http://www.contractoruk.com/forums/" 
    and xpath="//table[@id='poster_type'='Bellend']/tr[1]/td[2]/p/text()"

returns

Code:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="1" yahoo:created="2011-11-19T20:13:13Z" yahoo:lang="en-US">
    <results>MarillionFan</results>
</query>
