Usage.

For complete details, please see Updating the Infrastructure web site and Reference Manual. A video tutorial is available at http://s.apache.org/cms-tutorial.

If you just want to get started editing a page:

  • Install the bookmarklet from the cms page. You only have to do this once.
  • Navigate to the page you wish to edit (on the live site, not in the cms).
  • Click the bookmarklet. There will be a short pause while the CMS is initialised for you.
  • Click on Edit. (To skip this step, modify the bookmarklet to add an 'action=edit' parameter to its query string.)
  • The page editor should then be displayed.
  • Click Submit to save your edit to the workarea.
  • Click Commit to save the updated file to SVN and trigger a staged build. (To skip this step, tick the "Quick Commit" checkbox in the Edit form.)
  • The results should appear shortly on the staging site. (You may have to force the page to refresh in order to see the updated content.)
  • Once you are happy with the updated page, click on Publish Site to deploy.

Rationale.

This section describes the current conditions of the ASF website publishing system and its deficiencies. It also discusses options the Infrastructure Team considered in addressing these problems with an eye towards our future needs.

Problems with the Current Website Management Tools.

Scheduled find + sync Doesn't Scale.

The existing publishing system at Apache has evolved from the case where the organization's hardware consisted of a single machine. Websites have always been limited to using a combination of static content and cgi scripts in order to not overtax a machine simultaneously responsible for delivering (circa 2000-2003) over 1M hits and serving committers as our CVS master host.

The organization has since grown to encompass about three full cabinets worth of hardware and a pair of machines dedicated to serving mainly www.apache.org and project websites. The machines, eos and aurora, are some of our most expensive equipment and are located in two different datacenters to provide redundancy and failover capabilities. The current traffic load is roughly 20M hits a day for those machines.

However, the publishing system involves running hourly find jobs on people.apache.org and pushing that content out to eos and aurora with rsync. With roughly 300GB worth of content to scan, it is no longer possible to do this with a single find job, so we now run them in parallel: one find job per website. This puts a heavy load on people.apache.org's ZFS array, as there are roughly 100 sites to scan. As good as ZFS is, the filesystem will not be able to keep up with this load as the organization continues to promote new top-level projects.

Limitations of Confluence's Shared Plugin Architecture.

Several years ago during the wiki craze at Apache, the Infrastructure Team was tasked with setting up a Confluence installation for our projects to use. Apache member Pier Fumagalli developed and offered the autoexport plugin as a way to provide Confluence-backed project websites, which was quickly adopted by several projects. The process involves rsyncing the autoexported pages from the machine hosting Confluence over to people.apache.org, where the standard publication system described above would push those pages out to eos and aurora to be served live.

Over time we began to experience chronic problems with this particular setup. First off, different projects often wanted to use different and occasionally conflicting plugins for their sites. Secondly, plugins would often break during Confluence upgrades. The biggest offender was in fact the autoexport plugin and its reliance on Confluence internals. Virtually every upgrade was guaranteed to break it, and after a while Pier and other Java developers at Apache lost interest in supporting it. We asked around for people to support it, and were even willing to compensate folks for their time, but there were no takers. Confluence-backed websites were fully dependent on the autoexport plugin to have any chance of working, and the organization was caught between a rock and a hard place in deciding when it was possible to upgrade Confluence.

The other main problem with this configuration is that it makes url deletions a nightmare. The autoexport plugin doesn't support url deletions, and that is carried through to the live sites via rsync.

Currently Apache's Confluence installation is hosted on thor, which is a Sun T5220 Sparc. It's by far our beefiest machine with 8 cores and 8 threads per core, and yet our Confluence service is dog slow. Our installation is simply out-scaling the software, and to keep it performing acceptably will require even more significant equipment investments going forward.

Anakia Is Outdated.

Anakia was a great tool 10 years ago. It is a competing technology to XSLT for dealing with raw XML content. Many projects still rely on anakia to generate their webpages but most of the web has moved on. It's time the ASF caught up with the times.

Not Every Content Author Is a Geek.

While Apache is still primarily a place for software developers to collaborate, some of the people who provide support for our press and legal efforts need to be able to contribute to www.apache.org. Expecting them to deal with tools like Anakia to roll their own builds of XML-based content is a non-starter.

Publishing Delays Suck.

Obviously with hourly crons pushing content out to our webservers there will be delays as long as 2 hours between the time someone commits a change and logs onto people.apache.org to svn up the website, and the time it actually gets synced to the live site. That has been the status quo at Apache for several years and it simply isn't good enough any longer.

Problems with Existing CMS's.

While there is a zoo of available Open Source CMS's to choose from, only a handful of them actually support exports of static content. Even fewer of them offer support for staging. Apache's project websites aren't like Twitter: they don't have rapidly changing content that needs to be updated and delivered in real-time. The sites are meant to provide stable resources for the public to gain necessary information about the software we develop.

Day's CQ5.

While not an open source offering, Roy T. Fielding pursued a CQ5 installation for the organization's use. Roy demoed the featureset at ApacheCon US 2009 and the members of the Infrastructure Team who saw it were thoroughly impressed. It seemingly met all of our core requirements.

However conditions changed in 2010 for Roy, and he simply lost any free time he could have put to this effort. We had to eliminate this as an option going forward, but thank Roy and Day for their time and consideration.

Adoption and Diversity.

Lenya had most of the features we were looking for, but ultimately was rejected as being insufficiently flexible for use as a foundation-wide CMS. Allowing projects the flexibility of deploying per-project site build technologies which were only limited by the software installation on the build host was the Infrastructure Team's preferred strategy.


Custom Solution.

In September 2010 Philip Gollucci, VP Infrastructure, gave the green light to a custom-built CMS for the ASF, to be developed primarily by one of the contracted System Administrators. After collecting feedback on the goals and requirements of several interested parties, the development work was undertaken with a goal of completing the work in 60 days or less, just in time for ApacheCon 2010 NA. Fortunately the goals were kept simple enough that the actual development time only spanned about 30 days.

Unix Paradigm.

The software follows the Unix development mantra of separate executables for independent activities. The key separation was to ensure content presentation was kept independent from content editing, using the addressability of the web to sew things together. The main advantage of this approach is that it imposes relatively few constraints on the content generation software: different projects may adopt different tools to build their websites, without any of the conflicts inherent in single-process plugin architectures like Confluence.

Flexible Templating and Site Generation.

While Dotiac::DTL, a perl port of django's templating library, was chosen for use with www.apache.org, it is not a requirement that projects adopt it. Any templating system that runs on FreeBSD may be used, provided the necessary (perl) glue code is written that makes the system compatible with the CMS's build system.

Automated Parallel Builds.

The CMS relies on buildbot to provide automated builds and checkins of a project's staging site. Such builds are triggered instantly on commits to the project's site source material and are an essential component of the system.

The build system executes builds in parallel, so it is quite fast, even for a full site build.

Markdown Recommended.

Markdown was chosen as the format for the www.apache.org source content. Editing the source in the CMS's webgui relies on the wmd-editor to provide a WYSIWYM look and feel to the CMS.

Although it is strongly recommended that projects migrating to the CMS adopt markdown, it is not a hard requirement. In fact, the codemirror editor is also provided as an option for those who prefer to store their source content in raw html.

Django Influences.

The CMS's overall design was influenced heavily by django's architecture. From the build system to the preferred template system to the webgui, the influences are clear and obvious to anyone familiar with django.

Subversion as Data Store.

Instead of developing versioning support and a notification scheme into a database driven CMS, Apache's subversion infrastructure was chosen as the central data store for everything. The fact that the web interface to the CMS interacts with the subversion repository in a LAN environment, combined with the lightning-fast SSDs that serve as l2arc cache for the underlying FreeBSD ZFS filesystem, eliminates virtually all subversion network/disk latency. Subversion continues to scale past 1M commits to deliver high performance to Apache developers, as well as to our internal programs that rely on it.

mod_perl Based Webgui.

The mod_perl based webgui is under 3500 LOC and takes full advantage of the httpd module API. Being an in-process application it is respectably fast and will scale well even on the limited hardware (a FreeBSD jail) that it runs on.

The application embraces the REST architectural style while making appropriate use of cookies solely to enhance the user experience. It is also LDAP enabled, not another auth silo to deal with, so your svn committer credentials will instantly grant access to the site.

It was also designed for humans already familiar with the featureset of the svn command-line tool, taking cues from the Emacs svn.el module. However, it is accessible even to those without any familiarity with svn: a simple javascript bookmarklet allows users to go from a live webpage to a WYSIWYM editor session in 2 clicks. Submitting, committing, and publishing those changes is just as simple and straightforward. You may access the CMS anonymously if you are not currently an Apache committer.

Because the webgui revolves around providing users with a temporary server-side working copy, the urls it generates are not meant to be bookmarked, and are forbidden from being shared with others. The fulcrum for sharing changes is the staging site, and the "commits are easy and cheap" concept is built into the webgui.

However the url for publishing a website may be considered appropriate for writing a basic web service client app. Since the site is based in subversion developers may check-out the site and commit directly from their workstations instead of through the webgui, so it may be convenient for project members to have a simple site publication script. This choice is entirely up to each project, and a reference implementation is available at http://s.apache.org/cms-cli. Virtually every resource on the site may be directed to be served as application/json simply by adding as_json=1 to the query string, or by setting application/json as being preferable to text/html in the "Accept" request header.
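As a rough illustration of the two ways of asking for JSON described above, the following Python sketch builds (but does not send) such requests. The host name and resource path are hypothetical placeholders; only the as_json=1 parameter and the "Accept" header come from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request


def with_as_json(url: str) -> str:
    """Return url with as_json=1 appended to its existing query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("as_json", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def json_request(url: str) -> Request:
    """Build a request whose Accept header prefers JSON over HTML."""
    return Request(url, headers={"Accept": "application/json, text/html;q=0.5"})


# Hypothetical resource URL, used only for illustration.
example = "https://cms.example.invalid/some/resource?action=edit"
print(with_as_json(example))
print(json_request(example).get_header("Accept"))
```

Either form could be handed to urllib.request.urlopen by a small client script; which resources honour the request is up to the CMS itself.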

ZFS.

In order to scale effectively to handling multi-gigabyte size websites, the webgui relies on zfs clones to create per-user working copies. The alternative algorithm would be to physically copy (with say rsync or cp -R) working copy trees, but such algorithms are O(N) whereas a zfs clone (essentially a copy-on-write version of the original) is O(1).
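The difference between the two approaches can be modelled with a toy copy-on-write tree. This Python sketch is not how ZFS is implemented; the class and file names are invented for illustration, and the origin is assumed not to change after cloning (as with the snapshot a ZFS clone is created from).

```python
class CowTree:
    """Toy copy-on-write file tree: a clone shares its origin's data."""

    def __init__(self, files=None, origin=None):
        self._own = dict(files or {})  # only the paths written in this tree
        self._origin = origin          # tree this one was cloned from, if any

    def clone(self):
        # O(1): no file data is copied, only a reference to the origin is kept.
        return CowTree(origin=self)

    def full_copy(self):
        # O(N): every entry is duplicated, as cp -R or rsync would do.
        return CowTree(files={path: self.read(path) for path in self.paths()})

    def paths(self):
        inherited = self._origin.paths() if self._origin else set()
        return inherited | set(self._own)

    def read(self, path):
        tree = self
        while tree is not None:
            if path in tree._own:
                return tree._own[path]
            tree = tree._origin
        raise KeyError(path)

    def write(self, path, data):
        # Only the modified path consumes new space in the clone.
        self._own[path] = data


site = CowTree({"index.mdtext": "v1", "about.mdtext": "about"})
working_copy = site.clone()
working_copy.write("index.mdtext", "v2")
print(working_copy.read("index.mdtext"), site.read("index.mdtext"))
```

The clone stores only the one file that was edited, which is why the cost of creating a per-user working copy does not grow with the size of the site.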

Svnpubsub.

Svnpubsub was developed by Paul Querna to provide an infrastructure for distributing change notifications to our frontline webservers (eos and aurora). This system is used by the CMS to convert site publication requests into live publications, and will eventually supplant the existing find + rsync architecture for site publication. It is a key component of Apache's infrastructure and will continue to be promoted going forward, even for those projects that elect not to use the CMS.

Scheduled Deployments of Dynamic Content.

Despite the above remarks, there is still room for supporting the generation of "dynamic" content, in the same fashion that Planet Apache works. Namely, buildbot may be set up to run periodic builds of select urls that have dynamic content, and to subsequently publish the results of those builds. While it is possible to run these jobs more frequently than once an hour, it is not recommended because of the email notification traffic each build generates.

Separate ACLs for Committing Source Versus Publication.

Since the CMS relies on separate sections of svn for original content and staging versus publication, it is possible to configure more relaxed ACLs for content authors versus those capable of publication. The Infrastructure Team recommends that the content on www.apache.org be editable by the full committership, while publication remains restricted to members, committers with apsite karma, and members of the Infrastructure Team.


Adoption Constraints.

This section lists the requirements for projects electing to adopt the CMS.

Layout.

The original source tree MUST have the following layout:

trunk/
   content/                (location of actual site content)
   lib/                    (only required for projects using the standard perl build system)
      path.pm              (the analog of django's url.py)
      view.pm              (the corresponding views)
   cgi-bin/                (optional cgi directory)
   templates/              (location of site-wide templates)
branches/                  (optional branches, currently unused)

Content.

The source content MUST have a unique file extension for each generated file. I.e. you cannot generate foo.pdf and foo.html from the same source file living in the same directory. You must disambiguate the paths to these resources using copies or svn externals (symlinks are not supported, sorry).

There is a further restriction in that the webgui and build system treat foo.page/ directories as attachment directories. This convention prevents any files contained therein from being built, but they may be used as content components (e.g. html snippets and images) for an individual webpage.
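The attachment-directory convention amounts to a simple path test. The Python sketch below shows one way such a check could look. The real build system is written in perl and may apply the rule differently, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def in_attachment_dir(path: str) -> bool:
    """True if any directory above the file is named like foo.page/."""
    parents = PurePosixPath(path).parent.parts
    return any(part.endswith(".page") for part in parents)


# Paths are hypothetical and rooted in content/.
for candidate in ("/docs/foo.page/snippet.html", "/docs/foo.mdtext"):
    action = "skip (attachment)" if in_attachment_dir(candidate) else "build"
    print(candidate, "->", action)
```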

Moreover, the source files MUST be UTF-8 encoded, no exceptions.

Content source files with .mdtext or .md extensions are typically expected to contain optional RFC-compliant (mail or http) headers at the top of the file, or YAML headers as is customary in comparable, modern static site generation tools.
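For the RFC-style variant, a source file is a block of "Name: value" lines, a blank line, and then the body. The Python sketch below separates the two using the standard library's mail parser. The header names and page text are invented for illustration, the YAML variant is not covered, and the CMS itself does this in perl.

```python
from email import message_from_string

# Hypothetical .mdtext source: mail-style headers, a blank line, then the body.
SOURCE = """\
Title: Hypothetical Page Title
Notice: Hypothetical licence notice

# Heading

Body text in Markdown.
"""


def split_headers(text: str):
    """Return (headers dict, body string) for a mail-style source file."""
    message = message_from_string(text)
    return dict(message.items()), message.get_payload()


headers, body = split_headers(SOURCE)
print(headers["Title"])
print(body)
```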

Build.

The build system is under 2000 LOC and relies on lib/path.pm to provide a specially formatted @patterns array to give the build system hints on which view to run for a given source file. The patterns are checked in order, and if none of the patterns match, the source file will simply be copied over to the build tree. Each element of the @patterns array is an arrayref which consists of 3 items: the pattern to test, the name of the view function to call, and a hashref of named parameters to pass (by value) to the view function. The patterns are tested against files based on their location rooted within the content/ subdirectory.
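The lookup described above is a first-match-wins scan over an ordered list. The real @patterns array lives in perl in lib/path.pm. The Python sketch below only models the control flow. The regular expressions, view names, parameters, and the leading-slash path convention are all assumptions made for illustration.

```python
import re

# Each entry mirrors the three items described above:
# (pattern to test, name of the view function, named parameters).
PATTERNS = [
    (r"^/sitemap\.html$", "render_sitemap", {"title": "Sitemap"}),
    (r"\.mdtext$", "render_markdown", {"template": "page.html"}),
]


def select_view(path, patterns=PATTERNS):
    """Return (view_name, args) for the first matching pattern, else None."""
    for pattern, view_name, args in patterns:
        if re.search(pattern, path):
            # Hand back a copy so the caller cannot mutate the shared entry.
            return view_name, dict(args)
    return None  # no match: the file is simply copied to the build tree


for path in ("/docs/index.mdtext", "/images/logo.png"):
    print(path, "->", select_view(path) or "copy as-is")
```

Because the patterns are checked in order, more specific entries belong before general ones.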

lib/path.pm may also provide a hash %dependencies mapping paths to array refs. The keys list the names of files which will also be rebuilt whenever a file matching one of the corresponding values has changed. (This is typically used for sitemaps.) The filenames in both the keys and the values are rooted in the content/ subdirectory. The dependency calculation is transitive.
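To make the transitive rule concrete, here is a small Python model of the calculation. It assumes that the values are shell-style glob patterns, which is a guess, and every path in it is hypothetical. The actual logic is part of the perl build system.

```python
from fnmatch import fnmatchcase

# Key: file to rebuild. Value: patterns for the files it depends on.
DEPENDENCIES = {
    "/sitemap.html": ["/docs/*.mdtext"],
    "/index.html": ["/sitemap.html"],
}


def files_to_rebuild(changed, dependencies=DEPENDENCIES):
    """Return the changed files plus everything that transitively depends on them."""
    rebuild = set(changed)
    queue = list(changed)
    while queue:
        current = queue.pop()
        for target, sources in dependencies.items():
            if target in rebuild:
                continue  # already scheduled; this also guards against cycles
            if any(fnmatchcase(current, source) for source in sources):
                rebuild.add(target)
                queue.append(target)  # the target may trigger further rebuilds
    return rebuild


print(sorted(files_to_rebuild(["/docs/a.mdtext"])))
```

In this model a change to one page under /docs/ schedules the sitemap, and the sitemap in turn schedules the index page.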

The build system also requires the view functions in lib/view.pm to return 2 values, the first being the generated content, and the second being the new file extension.
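The two-value contract can be sketched as follows. This is a Python model with a made-up view and driver function. Whether the real extension value includes the leading dot is not stated above, so the sketch assumes it does not.

```python
import html
import posixpath


def render_text(source_text, **args):
    """Hypothetical view: returns (generated content, new file extension)."""
    content = "<pre>" + html.escape(source_text) + "</pre>"
    return content, "html"


def build_one(source_path, source_text, view, args):
    """Run a view and derive the output path from the extension it returns."""
    content, extension = view(source_text, **args)
    stem, _old_extension = posixpath.splitext(source_path)
    return stem + "." + extension, content


out_path, out_content = build_one("/docs/index.mdtext", "a < b", render_text, {})
print(out_path)
print(out_content)
```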

The build system will always take the local path to trunk/ as the current working directory for the build (branches are currently unsupported).

Changes to either the templates/ or lib/ subdirs will trigger a full site build.

A detailed walkthrough is available for folks working on site design. Note that the typical ASF::View based views now support template preprocessing of source content by passing a preprocess => 1 argument to the configured view in path.pm.

External Builds.

With the introduction of svn 1.7+ working copies, it becomes possible to plug in a wide variety of functionally similar build systems to the standard perl system described above, for example maven, ant, or forrest. If this interests you, please discuss the matter further on the infrastructure@ mailing list. It is not unfair to describe this CMS as simply a CI tool with a basic web browser interface.


Future Plans.

This section describes the future plans for Apache Infrastructure as it relates to website publication.

The Incubator.

After going live with www.apache.org, the next project we would like to tackle is the incubator website. It too is based on anakia, but thanks to Sam Ruby there is an xslt file available to help automate the conversion from xdoc to markdown sources. We would like to complete this migration by March 1, 2011.

Anakia Based Sites.

After migrating the incubator site we will branch out to approach any Apache project still using anakia to convert to the CMS. This will of course be a project decision, but we hope the advantages of migration will be clear and well appreciated by pmc members. We hope to complete this process during the summer of 2011. Update: see ant adoption for new options for projects still stuck on Anakia.

Phasing Out Confluence as a CMS.

The next long-term project to tackle is the eventual phaseout of Confluence backed websites. This will be an extensive project which will require development of content conversion tools, but the clock is ticking on how long we can continue to run Confluence without any support for the autoexport plugin. Update: see confluence adoption for new options for projects still stuck on Confluence.

Phasing Out people.apache.org as a Publication Hub.

The final long-term objective is to completely eliminate people.apache.org as the publication hub for Apache websites. Security considerations alone make this a worthwhile goal, and to make this happen we would like to mandate the adoption of at least svnpubsub for all projects by the end of 2012.

View the ASF CMS code.

As of 1 Nov 2010, this ASF CMS system is now running the main www.apache.org site.

The code for the CMS itself is being developed by the Infrastructure Team, and you can follow its Subversion repository.

We are considering turning the CMS into a proper Apache project starting with an incubator podling. If this interests you, please contact infrastructure-dev@apache.org and sign up!