Joho the Blog » spreadsheets

July 27, 2012

Importing HTML into Google Docs spreadsheets

Rick Klau points [g+] to a feature of Google Docs spreadsheets I didn’t know about (although I’m far from a spreadsheet maven): It can automatically include a table from any HTML document accessible on the Web. It turns out it can also include the contents of lists.

It’s not the most intuitive feature. Into a cell you type:

=ImportHTML(“[URL]”,”[query]”,”[index]”)

Except you put in the HTML page’s url instead of [URL], “table” or “list” instead of [query], and which the number of the tables or list you have in mind instead of [index]. For example:

=ImportHTML(“http://www.hyperorg.com/blogger/index.html”,”list”,1)

gets the first list (ul or ol) on Joho The Blog (this page you’re reading), which turns out to be the one on the left called “Other Stuff.” If you ask for 2 instead of 1, you’ll get my blogroll.

Or, to use Rick’s more useful example:

=ImportHtml(“http://www.accuweather.com/en/us/anchorage-ak/99501/august-weather/346835?view=table”,”table”,1)

That imports AccuWeather’s table of weather for Anchorage (where Rick is headed for vacation.)

The data updates every time you open the spreadsheet.

ImportCVS does the same for CVS data. And Kingsley Idehen explains how you can update your spreadsheet with Linked Open Data by going through SPARQL. (SPARQL lets you query a database for linked data.) (Yes, it’s over my head.)

Wouldn’t it be useful to be able to import a single element into a Google spreadsheet, even if it’s not in a list or a table? For example, suppose I want to get the headline of the first posting at, say, DailyKos.com. That element has an id of “article-1″. (I know this because I looked at the source.) So, why not let me specify the url and the id, and plop the contents into a cell in the spreadsheet? Or suppose I want the content of one particular cell of a table?

No, we’re never satisfied.

 


Two seconds after I pressed the “Publish” button, Rick Klau responded to my questions on the Google Plus thread where he talks about this feature. He suggests importXML for grabbing an item by its id. And to get a frozen copy of the data, copy and paste it. He also points to a post from 2007 about these features. (Oh, yeah, you can trust Joho to stay on top of the news!) In fact, that post gives an example of how to obtain the latest headline from the NYT:

=GoogleReader (“http://graphics8.nytimes.com/services/xml/rss/nyt/HomePage.xml”, “items title”, “false”, 1)

It still works. Cool!!

8 Comments »