Joho the Blog

Help create shareable syllabi

Every course has a syllabus. In it are an expert’s ideas about the topics essential to the course of study and the works that explain those topics. And that’s at a bare minimum. Syllabi rock.

Yet, all that wisdom and goodness is locked in the syllabi, of use to the handful of students taking or considering taking the course. What a shame! Think of all we could do if the information in syllabi were made available for open access by humans and machines:

  • Teachers could discover new ideas for how to teach a course.

  • Students could browse among courses to see how other teachers teach them.

  • Researchers could be guided by this canon-in-practice — both to the expected works and away from works that are too expected.

  • Researchers studying disciplines would have a rich source of data to analyze.

To unlock these riches, several things have to happen.

First, the syllabi have to be collected and put into an open access repository. I’d love to see universities adopt open syllabi policies that require (ask? suggest?) that faculty submit a copy of their syllabi for each course they teach.

Then the syllabi have to be scraped so that the data within them is searchable by humans and parsable by computers.

Then the data should be put into some standard data format so that it can be more easily found, reused, and mined.

It’s this last step that I’m looking for help with. I’ve started a little project with Joseph Cohen to develop an XML schema for syllabi. (Joseph has a commercial project underway that could help with some of the other elements required to turn dead syllabi into a living beast at our command.)

If you’d like to jump in, go to the SylliXml wiki. (You have to register to edit.) We’re just at the kicking it around stage, and your contributions will be very helpful.

There are lots of questions to resolve. At the moment, we’re aiming at producing the most minimal schema we can, because syllabi are unstructured documents and trying to accommodate everything that might ever be put into one is a mug’s game. So, what is the minimum set of data and metadata that would make the information in syllabi amazingly useful?
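To make the question concrete, here is a rough sketch of what a very small syllabus record might look like when built with Python's standard library. The schema is still undecided, so every element name below (`syllabus`, `courseTitle`, `instructor`, `readings`, `reading`, `author`, `work`) is a hypothetical placeholder, and the sample values are invented.

```python
import xml.etree.ElementTree as ET


def build_syllabus(title, instructor, readings):
    """Build a tiny syllabus record. All element names are hypothetical."""
    root = ET.Element("syllabus")
    ET.SubElement(root, "courseTitle").text = title
    ET.SubElement(root, "instructor").text = instructor
    readings_el = ET.SubElement(root, "readings")
    for author, work in readings:
        reading = ET.SubElement(readings_el, "reading")
        ET.SubElement(reading, "author").text = author
        ET.SubElement(reading, "work").text = work
    return root


# Invented sample data, for illustration only.
record = build_syllabus(
    "Example Course 101",
    "A. Teacher",
    [("Example Author", "Example Book")],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Even a record this small would support the uses listed above: it can be searched by course title and mined for which works are assigned.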

Come play!

There is tremendous value hidden in the syllabi diaspora. Let’s unite and conquer!

27 Responses to “Help create shareable syllabi”

  1. I was talking last night to Jeff Verkoeyen who has built a similar project using University of Waterloo data.

    You should talk to Jeff.

    suggested search string: CS 135

  2. Sounds like a good use case for using semantic technologies such as publicly hosted ontologies and semantic queries. I guess you could easily define the model (ontology) using RDF and expose it as a public ontology to be used by syllabus-aware sites (schools/institutions) and syllabus-aware tools (search engines, student calendars/course managers, etc).

  3. Have you considered using HTML? Titles, paragraphs, lists. Seems like it contains most of what you need already, as well as being an output format you can view in a browser. Double win.

  4. Usually, only the most assiduous student would be interested in reading a syllabus in choosing courses. In those situations, the student should just contact the professor, who is usually fine with providing it.

    If you’re looking to create Open Syllabi as a gateway drug to OCW, well… then you are more than obliged :-)

    Open Syllabi unto itself seems like a waste of good time. Open Syllabi should be a given as a product of OCW.

  5. @Noah Slater, not sure if your sarky remark was aimed at me or the author but I think the idea here is a data modelling exercise as a starting point so that users of syllabus knowledge can benefit via sharing of the data across different applications. I merely suggested tools that could help with making the definition of the schema more open (to future work) rather than imply that the work was already done.

  6. It wasn’t intended as a snarky remark at all. It’s a very serious suggestion. I’ve worked extensively with Semantic Web technologies, as well as alternative XML document formats such as DocBook. I would strongly suggest modelling this data as HTML if you can.

    HTML is the input format, and the output format. Along with its extreme flexibility and ability to mark-up real-world documents of the type you’d expect this project is interested in, you can use class names and ids to extend the basic model.

    cf. Microformats.

  7. Putting aside any qualms I have about formatting data in a presentation markup like HTML, how would you define the format itself in HTML? You still need a schema that defines the data formats and the relationships it may have with other entities right? Not sure HTML was designed for that. I might be wrong but I’m definitely curious to see how a presentation layer technology is used to do both data definition and marshalling/serialisation in rather domain specific application development. I’ve just never seen HTML used that way. I have been proved wrong before so who knows?

  8. HTML isn’t a presentation language. It is a structural document language with built-in extensibility. Defining a set of extension guidelines is as easy as writing them down for people to read.

    You could say, for example, that all links pointing to the course’s homepage should have rel=”homepage” on them. Specifying that is as easy as publishing the previous sentence somewhere.

    It was good enough for the W3C:

    So good in fact, it spawned the Microformats project:

    If this was my project, I would probably set out to:

    * Collect a sample corpus and try to convert these documents to HTML by hand. I would pay attention to common elements, and use class attributes to typify the HTML elements in a way that seemed to make sense.

    * I would document the results of that process on a public wiki. I would list the different structural elements and how they should be used to compose a document, as well as listing the recommended class names to use for different uses of those structural elements.

    * I would then invite people to come to the wiki, to check out the corpora, and to get involved. I would ask them to discuss the idea on a mailing list, and to edit the wiki with new ideas or corrections. I would attempt to build consensus around the recommendations.

    * I would then build a reference implementation of a client for extracting this information. I would write it in a language like Python. I would serialise the HTML to an XML infoset using something like lxml (that uses libxml) and extract the data using xpath. I would publish this tool to Github, and invite people to fork me and improve the software.

    As an example for that last step: want to get the homepage of the current document? //a[@rel=’homepage’] should do it. Easy as pie.
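The extraction step described in this comment can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation the comment proposes: it swaps lxml for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so real-world HTML would still call for a forgiving parser such as lxml. The sample page and its URLs are invented.

```python
import xml.etree.ElementTree as ET

# A hand-written, well-formed XHTML fragment standing in for a converted
# syllabus. The URLs are invented placeholders.
PAGE = """<html><body>
  <h1>Example Course 101</h1>
  <p>See the <a rel="homepage" href="http://example.edu/course101">course page</a>
  and the <a href="http://example.edu/library">library</a>.</p>
</body></html>"""


def homepage_links(markup):
    """Return the href of every link marked rel="homepage"."""
    tree = ET.fromstring(markup)
    # ElementTree wants a relative path, hence the leading "." before "//".
    return [a.get("href") for a in tree.findall(".//a[@rel='homepage']")]


print(homepage_links(PAGE))
```

Only the link carrying the agreed `rel` value is returned; the unmarked library link is ignored.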

  9. I was talking to my friend about this, and he had the following to say:

  10. MIT has about courses on line. Here is the link:

    Each one has the syllabus and lots of other info about the course. I think that they are doing, at least in part, what you are asking for.

    These courses are not for credit and you do not have the option to contact the instructor.

  11. Noah, if you’d like to take your msg out of the conditional and do it, I’d support you entirely. (No snark in this msg.)

    There are some elements that are predictable in syllabi, and that are particularly valuable to particular use cases. Since we want to keep the spec simple, it’s focused on those; we’re trying to treat syllabi as sets of data, rather than as the documents that they are.

    It’s not clear to me what the advantage of doing it as an html microformat would be since we are very much NOT expecting teachers to alter how they write syllabi. They’re going to keep writing in Word or whatever, or if they write them in html, we don’t want to ask them to use particular classes or id’s.

  12. Ari, I love OCW, but we want to see if we can aggregate syllabi from places that don’t (yet) support OCW.

  13. @David: if you’re not expecting people to alter the data they are producing – how are you planning to generate the resultant data? Whatever process you use, surely it will be orthogonal to the final format? If you can get away with encoding the information you’re producing as HTML, it doubles up as a display format too. You could shove the files up on a webserver, and people can browse them! Being part of the browseable web is a huge boon.

  14. Noah, I appreciate your patience. Thank you.

    The current “plan” is based on the assumption that if we ask faculty to take a single extra step, we will lose, say, 95% of the syllabi. Further, most faculty that I’m aware of are not writing their syllabi in html. They’re using a word processor. (I have no evidence except for an informal survey asking a few folks at one university.)

    So, we’re assuming that we’re going to have to hire a legion of workers to scrape the contents of syllabi and enter them into a form that will then create database records with the fields and relations specified in an xml spec. To make them browsable will obviously then require some type of front end to generate html from the data. But there are, of course, many uses of aggregated syllabus data, not all of which aim at browsing them. We therefore thought (and are open to re-thinking) that xml would be the right format for making the data accessible to humans (via a front-end app) and to computer queries.

    BTW, Joseph Cohen, who is one of the originators of this idea, is working on both a front-end app and a back-end form as a commercial app. I personally am interested in non-commercial uses of this data. We have a shared interest in making the data accessible in open, standard ways.

  15. Hi David,

    If you were already thinking about implementing a backend store for this data, I was going to suggest a JCR (Java Content Repository) implementation such as Apache Jackrabbit, which is a tree-based repository and does not require you to have a schema set in stone before you start storing data. The data is stored as nodes which have properties. You may or may not implement strict content types, but that’s up to you. The data being stored can be served up in any format out of the box via xml, json, etc. This link might be useful in the initial content modelling stage:


  16. Another link to introduce JCR:

  17. David, Noah is spot on. If you’re creating them in a word-processing program, they can readily be converted into HTML (and likely will be for students to read them, unless your faculty really love pdf).

    So, augmenting these already semi-structured documents with classes is an easier step than recreating them from scratch in a new format.

    In any case, if you are embarking on this, I strongly recommend reading (and if convinced, following) the microformats process for deriving structure, even if you end up with a non-HTML format at the end of it.

    In particular, be rigorously empirical about which elements to include using the 80:20 rule, not the ‘we must support every possibility’ quagmire.

  18. The unlocked potential knowledge in the mass of distributed syllabi was the point behind my Syllabus Finder, which used the Google Search API and some statistical profiling to find roughly a million syllabi online. (Unfortunately, the code was based on the original, more powerful Google Search API, which was deprecated in 2008, thus eventually shutting down the Syllabus Finder.) You can see the intellectual results of the Syllabus Finder in this paper on how American History is taught with textbooks.

    Unlike other projects that have tried to aggregate syllabi, the Syllabus Finder made the (probably commonsensical) assumption that most syllabi would never be put into semantically marked-up formats. It took syllabi as they are—they do vary widely—and used text mining instead. However, we just received an NEH grant at the Center for History and New Media to create ScholarPress, a WordPress plugin that will (in part) allow teachers to create a syllabus within WordPress in a way that is structurally organized.

  19. It’s tempting to take this offline, but I think working through it in public might be useful. So, once again thanking you all for your help and patience…

    First, Kevin, to the easy part: We are very much aiming at a minimal spec or model, especially since syllabi are documents, and thus are infinitely variable. The wiki gives a sense, I hope, of the level of complexity we’re hoping to capture, which is in fact way less than 80%.

    Second, yes, many profs love pdf. What they really love is writing in whatever tool they usually use, and handing them off to an admin aide to print them out, and to magically put them into any online classware they happen to be using.

    Third, I’m not understanding why adding classes to an htmlized version of a syllabus is either easier or more desirable than scraping the data into a database. The hypothesis of this semi-project is that we cannot rely upon the faculty taking any extra steps. We can’t ask them to write in HTML and we certainly can’t ask them to create POSH pages. Well, we can ask, but we will drop submissions into the single digits.

    Perhaps it would be useful to note that I personally want some type of standard because I also would like to encourage schools to adopt an open syllabus policy that requires profs to submit their syllabi for open public access. Faculty assemblies are far far more likely to pass this policy if it requires teachers to do nothing more than hand in a copy of their syllabus in whatever form they originally created it in. Requiring profs to convert to html and add designated html classes is a non-starter, unfortunately.

    So, we convert their docs into html. We deal with the inevitable problems in so doing. We create a tool for adding classes to html pages and saving them. We hire folks (Mech Turk?) to use that tool to add the classes. We create a tool for parsing the html and putting the classed/fielded data into a database. And, of course, we create a data model for the database that takes account of the nested and relational nature of the data. Why is this preferable to writing an xml schema that is consonant with the database model and then hiring people to scrape the data straight from the docs the profs have authored?

  20. Well, if you want to be ruthlessly pragmatic about it, then coming up with a standard is putting the cart before the horse. A pragmatic discussion of how to get to your noble goal would have to begin by talking to BlackBoard, which (alas) has the vast majority of the market for content management systems that contain syllabi (often making them inaccessible to the public). Something like the workflow of ScholarPress (which makes it trivial to structure a syllabus into semantic pieces) would have to be implemented there—i.e., a process that makes it easy for professors to set up a syllabus but is also marking up that syllabus as it’s created, rather than in a free-form doc. Second best is to have your proposed standard implemented in Sakai and Moodle (the main FOSS competitors to Bb). But of course you may encounter the resisting comment that the nice thing about standards is that there are so many of them. I’m all for this effort, but I still think this is a social problem as much as a technical one. (And I also think to get a good aggregate database you will have to use regular expressions on the raw text of extant syllabi.)
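As a rough idea of what running regular expressions over raw syllabus text could look like, here is a small Python sketch. The two patterns are assumptions made up for this example (a department-plus-number course code such as the "CS 135" search string suggested earlier in the thread, and "Week N:" headings); real syllabi vary far more widely, and the sample text is invented.

```python
import re

# Hypothetical patterns, chosen only for illustration.
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s?(\d{3})\b")
WEEK_HEADING = re.compile(r"^Week\s+(\d+)\s*[:.-]\s*(.+)$", re.MULTILINE)

# Invented sample of raw syllabus text.
SAMPLE = """CS 135: Example Course
Week 1: Introduction
Week 2: Lists and recursion
"""


def scan(text):
    """Pull course codes and weekly topics out of unstructured text."""
    codes = [f"{dept} {num}" for dept, num in COURSE_CODE.findall(text)]
    weeks = [(int(n), topic.strip()) for n, topic in WEEK_HEADING.findall(text)]
    return {"course_codes": codes, "weeks": weeks}


result = scan(SAMPLE)
print(result)
```

Patterns like these are brittle, which is part of the point being made: text mining takes syllabi as they are, at the cost of missing anything the patterns did not anticipate.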

  21. I side with Dan that munging existing documents is better than recreating them in an ‘ideal’ form.
    The point of microformats is to set up a way to make future munging more straightforward.
    If there are existing tools that generate such syllabi, do indeed see what structure they create, but also encourage them to add your classes to the HTML. This is how we got to billions of hCards on the web.

  22. It might interest you to know that in Norway the curriculum used in primary and secondary education is actually defined by the government. And, in fact, it’s published in XML form (specifically, Topic Maps (XTM 1.0)), and all parts of the syllabus have globally unique identifiers.

    So there are actually content providers in Norway which connect their content to parts of the syllabus and allow users to navigate the content via the syllabus structure.

    Unfortunately, all of this is in Norwegian only, as far as I can recall.

  23. I’m working with Joseph Cohen on this and the commercial project, and wanted to toss my two cents in about XML vs. HTML:

    First, I think this discussion may be a bit premature. We still need to iron out the data that would be covered by this format, and use cases before determining the best way to go with it. This echoes Sean Palmer’s comment linked by Noah Slater.

    Also, there’s no reason we can not have both formats (see: vCard and hCard coexisting).

    That said, having a way to collect/share the data in a more easily mined XML format seems to be a huge advantage, particularly for researchers.

  24. I would just add that XML is no more easily mined than HTML. Providing you can encode your data with both of them easily enough (that is, all else being equal) then both formats can be serialised into XML infosets. Of course, that lets you use the same tools for both jobs.
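The "same tools for both jobs" point can be illustrated with a short Python sketch. It assumes well-formed markup so that the standard library's `xml.etree.ElementTree` can parse both documents; the `reading` element and the `reading` class name are hypothetical, and the sample data is invented.

```python
import xml.etree.ElementTree as ET

# The same invented reading list, once as custom XML and once as
# class-annotated XHTML.
XML_DOC = (
    "<syllabus><reading>Example Book</reading>"
    "<reading>Another Book</reading></syllabus>"
)
HTML_DOC = """<html><body><ul>
  <li class="reading">Example Book</li>
  <li class="reading">Another Book</li>
  <li>Office hours by appointment</li>
</ul></body></html>"""


def texts(markup, path):
    """Parse markup into a tree and return the text of each node matching path."""
    return [node.text for node in ET.fromstring(markup).findall(path)]


from_xml = texts(XML_DOC, ".//reading")
from_html = texts(HTML_DOC, ".//*[@class='reading']")
print(from_xml == from_html)
```

Only the path expression differs between the two calls; the parser and the query function are shared.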

  25. Hello David —

    the open access repository is already alive and well in some institutions like MIT, but one site is attempting to collate knowledge from anywhere — “Connexions”.

    I hope you have seen Connexions, started at Rice University by Richard Baraniuk….

    Connexions is a kind of wiki — but here’s the description:

    Connexions is: a place to view and share educational material made of small knowledge chunks called modules that can be organized as courses, books, reports, etc. Anyone may view or contribute:

    * authors create and collaborate
    * instructors rapidly build and share custom collections
    * learners find and explore content

    You see this opens it up from simple syllabi to also include all the supporting course materials — and also student responses / arguments / reports, as well as random users who can contribute to the topic….

    If you haven’t seen it, check it out, explore it for a bit. Perhaps your project could really just feed into his, or maybe you can work together. Or at the very least, might be a model to work from when building your own.

    Brian Laakso
    Canton City Schools

  26. here’s a syllabus I was delighted to stumble across online:

    I love the idea of this being the norm

  27. […] David Weinberger from the Harvard Library Innovation Lab and Dan Cohen, the Co-Director of the Roy Rosenzweig Center for History and New Media (CHNM) at George Mason University discuss Zotero, the free bibliographic reference manager program that is being developed by the CHNM along with a number of other research tools. David and Dan also discuss open syllabi. […]
