Joho the Blog » TED, scraped
EverydayChaos
Everyday Chaos
Too Big to Know
Too Big to Know
Cluetrain 10th Anniversary edition
Cluetrain 10th Anniversary
Everything Is Miscellaneous
Everything Is Miscellaneous
Small Pieces cover
Small Pieces Loosely Joined
Cluetrain cover
Cluetrain Manifesto
My face
Speaker info
Who am I? (Blog Disclosure Form) Copy this link as RSS address Atom Feed

TED, scraped

TED used to have an open API. TED no longer supports its open API. I want to do a little exploring of what the world looks like to TED, so I scraped the data from 2,228 TED Talk pages. This includes the title, author, tags, description, link to the transcript, number of times shared, and year. You can get it from here. (I named it tedTalksMetadata.txt, but it’s really a JSON file.)

“Scraping” means having a computer program look at the HTML underneath a Web page and try to figure out which elements refer to what. Scraping is always a chancy enterprise because the cues indicating which text is,say, the date and which is the title may be inconsistent across pages, and may be changed by the owners at any time. So I did the best I could, which is not very good. (Sometimes page owners aren’t happy about being scraped, but in this case it only meant one visit for each page, which is not a lot of burden for a site that has pages that get hundreds of thousands and sometimes millions of visits. If they really don’t want to be scraped, they could re-open their API, which provides far more reliable info far more efficiently.)

I’ve also posted at GitHub the php scripts I wrote to do the scraping. Please don’t laugh.

If you use the JSON to explore TED metadata, please let me know if you come up with anything interesting that you’re willing to share. Thanks!

Previous: « || Next: »

Leave a Reply

Comments (RSS).  RSS icon