Collaborative Content Production



Obsolete Pages

The official documentation is at:



Alfresco supports parallel development for web content (and documents).  Users are able to work in private 'sandboxes' that provide virtualization, templates, folders for normal and transient content, default workflows, and many other useful features.  Sandboxes are customizable, and three default configurations will be provided:


Author, Reviewer, and Staging sandboxes will form the basis of Alfresco's default production model.   As depicted below, the figure on the far left is a representation of a generic 'sandbox', the middle figure shows two folders placed inside of it, and the figure on the right shows these folders being used as separate 'docroots' in a servlet engine (Tomcat is used within this example, but other servlet containers and httpds will be supported too):


Using transparent folders and virtual subdomains, each user will be able to obtain a private view of their own content overlaid on the master version shared by all.  The relationship between a 'docroot' folder and its corresponding virtual subdomain will be maintained by a chain of servlet/httpd adapters that implement a command pattern interface.  These adapters will be invoked by a workflow, so that every aspect of the system remains highly configurable.


Building and maintaining a website collaboratively is difficult; the content is highly interdependent, is reused in modified form, changes rapidly, and there's a lot of it.  Further, a small group of administrators is often responsible for managing a much larger pool of contributors.   In order to scale, flexible content management capabilities are needed, along with high-level policies and techniques for using them effectively.

Let's first review some of the problems that must be solved.  The most fundamental issue is how to deal with multiple people writing to the same file at the same time.  Content management solutions usually pick one of the following approaches:

  1. Ignore the problem
  2. Prevent parallel modification via 'locking'
  3. Give each party a separate copy of the original

For example, suppose Alice wants to change the text to 'Hi Jon', but Bob wants to change it to 'Hello John'.  If there's only one file, then you have:


Unless Alice and Bob are using tools that force one person to wait until the other is completely done, then the final outcome might be a corrupted file (as denoted by the red question mark).  Assuming some form of 'locking' is used, other problems still remain:

  • Alice or Bob might spend a lot of time waiting ('locked out')
  • There's no easy way to arbitrate lock contention
  • Non-merged updates can still be lost

A more sophisticated solution is to give each user a separate copy.  This improves parallel development, but now there's a new challenge:  if Alice and Bob each begin modifying their own copy of the original file, how are their conflicting changes resolved?  Here's a graphic illustration of the problem:


Many systems that enable parallel development require whoever is slower to bear the burden of merging changes together in the event of a conflict (e.g.: 'Hello Jon').  Sometimes this is acceptable, but sometimes it's not;  stitching things back together might require different skills than authors possess, and so it may be preferable to let a reviewer/editor handle such tasks.  Another problem is that if Alice and Bob each have separate copies, the following issues must be addressed:

  • Keeping files synchronized with the master version
  • Seeing local modifications as they would appear in the context of the master version
  • Scalability  (in terms of content and users)
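The conflict between Alice's and Bob's copies can be made concrete with a small sketch.  It compares each edited copy against the common original at whole-file granularity; the function name `classify` and its return labels are illustrative assumptions, not part of Alfresco.

```python
def classify(base, alice, bob):
    """Compare two edited copies against their common original."""
    if alice == bob:
        return "identical"      # both made the same change (or none)
    if alice == base:
        return "take-bob"       # only Bob changed the file
    if bob == base:
        return "take-alice"     # only Alice changed the file
    return "conflict"           # both changed it, in different ways


original = "Hi John"
outcome = classify(original, "Hi Jon", "Hello John")
```

Only the last case needs a human: someone has to decide whether the merged text should read 'Hello Jon' or something else entirely.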

Alfresco's approach is a variation on the 'separate copies' idea.  Rather than duplicate each file, transparent layers are used to create the illusion that each user is working directly with the master copy. 

A transparent directory (or file) in Alfresco acts like a 'sheet of glass' on top of some other directory (or file). You can either look through this transparent barrier and see the object on the other side, or you can write on the barrier itself, and see your own data instead.  Thus, if directory Y is a transparent overlay on directory Z you could visualize it like this:


Until you make modifications, a transparency is similar to a UNIX-style symbolic link; after making modifications, you're manipulating a completely detached, private copy of the original. 
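The 'sheet of glass' behavior can be approximated with Python's standard `collections.ChainMap`, which reads through a stack of mappings but sends every write to the first one.  This is only an analogy under simplifying assumptions (directories modeled as plain dictionaries, hypothetical file names); it is not how Alfresco implements transparency.

```python
from collections import ChainMap

# Directory Z: the underlying content (path -> file contents).
z = {"index.html": "Hi John", "logo.png": "<binary>"}

# Directory Y: a transparent overlay on Z.  Lookups try Y's private
# layer first and fall through to Z when the path is absent there.
y_private = {}
y = ChainMap(y_private, z)

# Before any modification, reading through Y shows Z's file.
before = y["index.html"]

# Writing through Y lands on the 'glass' only; Z is untouched.
y["index.html"] = "Hi Jon"
after = y["index.html"]
```

After the write, `y` shows the private version of `index.html` while `logo.png` is still read through from `z`.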


In the detailed description of transparent layers, when Y is an overlay on Z, it's depicted like this:


However, because the focus here is on how these layers are composed to form useful structures for collaborative development, the 'sheet of glass' notation will be used instead:


It amounts to the same thing. 

How it works

Suppose an author develops content within a particular folder.   Because directories are versioned, private 'checkpoints' can be used to return to a known state, or recover an old version of a file (even if nobody else has seen it yet).   Let's depict the top-level folder that an author uses for content development like this:


For collaboration purposes, some central location must be agreed upon as a common point of reference for the latest agreed-upon version of the website's contents.  Authors will base their work on it and/or generate new files and directories with this 'master version' in mind.  In Alfresco, the term used for such data is 'staging content';  here's a graphical depiction of a top-level folder containing staging content:


Often it's useful to see 'author content' in the context of 'staging content'.   For example, if an author has created a web page with embedded images and links to articles written by others, the images will be broken, and the links won't work unless those images and articles are also present. 

The obvious, but flawed solution is to make each of the authors copy the entire contents of the 'staging' folder to their own local 'author content' folder:


Parallel development is now possible, but three major problems remain:

  1. Scalability
  2. Staleness
  3. Brittleness

Achieving scalability

Subversion does not solve the scalability problem.  While 'virtual copies' of folders can be created quite efficiently within a Subversion repository, users must extract 'working copies' into their local file system in order to make substantial changes.   Because web content tends to be highly intertwined, the only practical solution is to extract everything in the repository in order to see things in-context.  Unfortunately, the more data there is in the repository, the longer this process takes, and the more disk space it consumes.   Subversion works well enough for less demanding tasks like source code management; however, websites tend to have more files, more data (e.g.: large images), and many more 'casual contributors'.


Linear-time operation
An activity that requires more time in direct proportion to the size of the task at hand.

TeamSite makes lightweight 'virtual copies' internally, and then exposes its repository as an NFS-mountable file system.  Therefore, it is highly scalable.  You don't need to perform a lengthy extraction process like you do with Subversion; new users can get their own private 'copies' almost instantly, no matter how much data is involved.


Constant-time operation
An activity that does not consume more time when the size of the task at hand increases.

Alfresco's repository can also be exposed directly as a file system (via CIFS); however, it will usually be accessed via a browser-based interface.  Making virtual copies of data in Alfresco is also a constant-time operation.
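The difference between a linear-time copy and a constant-time virtual copy can be sketched as follows.  The dictionary-based model and the function names `full_copy` and `virtual_copy` are assumptions made for illustration; they say nothing about the internals of Subversion, TeamSite, or Alfresco.

```python
from collections import ChainMap


def full_copy(staging):
    # Linear time: every entry is duplicated, so cost grows with size.
    return dict(staging)


def virtual_copy(staging):
    # Constant time: only an empty private layer is allocated; the
    # existing content is referenced, not duplicated.
    return ChainMap({}, staging)


staging = {f"page{i}.html": f"content {i}" for i in range(10000)}
copied = full_copy(staging)
virtual = virtual_copy(staging)
```

Both results let a user read `page42.html`, but only the virtual copy stays the same size no matter how much staging content exists.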

Avoiding staleness

Authors of web content almost always want to see how their proposed submissions would appear if added to the current version of the website.  Thus, taking a fixed 'snapshot' of the staging content, and modifying files within this 'static picture' is undesirable.  Typically, they'd prefer to have the files they're not touching be updated automatically within their private content folder.  That way, they can immediately see whether or not the hyperlinks they're creating are stale.  If someone adds a new image to a gallery, they'd like it to be accessible immediately too.  In Alfresco, transparent layers allow them to do exactly that:


Unless a file has been modified within an author's content folder, it automatically appears the same as it does within the staging content folder.
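A minimal sketch of this behavior, again using `collections.ChainMap` as a stand-in for a transparent layer (the paths and contents are hypothetical):

```python
from collections import ChainMap

staging = {"gallery/a.png": "img-a", "news.html": "v1"}

# The author's folder is an overlay on the live staging folder.
author = ChainMap({}, staging)
author["news.html"] = "draft"          # the author's own modification

# Later, other people's approved work lands in staging.
staging["gallery/b.png"] = "img-b"
staging["news.html"] = "v2"

new_image = author["gallery/b.png"]    # visible at once, no update step
own_page = author["news.html"]         # the modified file stays private
```

Files the author never touched track staging automatically, while the file the author did modify is shielded from the newer staging version.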

Reducing brittleness

In the area of source code management, users typically want the exact opposite of what web content developers prefer.  Because of how source code build tools work, software developers don't want anything in their private content folder to change unless they specifically request it.  Alfresco is capable of handling this too, again using the transparency mechanism.  For this task, instead of overlaying the folder containing the latest 'staging content', an 'author content' folder can overlay a non-changing/archived version of it.  Recall that all directories are versioned in Alfresco, so this can be done quite efficiently.  Here's a graphical representation of a non-changing 'staging archive' folder:


The combination of versioned directories and transparency is quite powerful.  As you've just seen, overlaying the author content folder on a staging content folder accommodates web developers, and overlaying it on an archived version of the staging content provides software developers with the more rigid environment they need to build source code properly.
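The two arrangements differ only in what sits behind the author's layer, as this sketch suggests.  The read-only snapshot is simulated here by copying the dictionary, which is an illustration-only shortcut: in a repository with versioned directories the archive would already exist and no copying would be needed.

```python
from collections import ChainMap
from types import MappingProxyType

staging = {"build.xml": "v1", "Main.java": "v1"}

# Stand-in for an archived version of staging (read-only).
archive_x = MappingProxyType(dict(staging))

web_author = ChainMap({}, staging)      # overlay on live staging
code_author = ChainMap({}, archive_x)   # overlay on the frozen archive

staging["Main.java"] = "v2"             # staging moves on

live_view = web_author["Main.java"]
pinned_view = code_author["Main.java"]
```

The web author sees the change immediately; the software developer keeps building against exactly what was archived.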

There's another benefit to making 'author content' a transparent layer:   the ability to substitute one background layer with another in constant time.  For example, suppose version X of the content in the staging folder was able to build properly.  Authors doing source code development could base their work on this by overlaying 'staging archive x', like this:


Eventually, there comes a point where it's time for an author to contribute work back to the main 'staging folder', but a considerable amount of time may have passed.  A daunting choice arises:

  1. Update the 'author folder' with a recent snapshot and try to build before submitting
  2. Submit changes 'blind', and hope for the best

Why would anybody ever submit 'blind'?    With most systems, there are actually some valid practical reasons.  What if this 'recent snapshot' of the staging content is broken?  To help visualize this, let's suppose 'staging archive x' was good, as was 'staging archive y',  but the current data in the 'staging content' folder is bad, and won't build (e.g.: someone didn't do QA before checking in, and their reviewer failed to notice).  Now you'd be in a situation something like this:


Therefore, if you were to create a new 'snapshot' of the staging folder (i.e.: 'staging archive z') and try to build using that, you'd be in a 'broken' state too:


While 'staging' is insulated from your mistakes, you're at the mercy of its latest 'snapshot' with TeamSite (and most other systems).  When your 'author content' folder won't build anymore, there could be problems with the staged content... or your own stuff... or some combination of the two.  You'd like to revert, but what version should you revert back to?  One bad checkin can prevent an entire team from continuing.  In order to escape this dilemma, sometimes a subset of the users will create a special branch that has nothing to do with the actual feature set;  they just want a bit more stability.  The problem is, if they submit their new work to the 'sane branch', they're creating yet another job for themselves later in the form of merging everything back.

Alfresco will address this issue by providing ways to annotate versioned objects.  Eventually, we hope to make the browser-based interface (and any command-line tools) we supply aware of them.  By providing a low-level hook like this, one could do all sorts of things to make the 'update' operation more intelligent.  For example, a continuous build system like CruiseControl could run in the background, and supply annotations that make the 'update' operation mean 'update to latest source code that produced a valid build'  rather than 'give me the absolute latest content'.  Thus, instead of updating to the broken 'staging archive z', you could back off to 'staging archive y':


Now you can re-validate the work you did earlier when building against 'staging archive x', and know that if something is broken, it's your fault... not someone else's.   When you submit your changes, they can still be sent to the 'staging content' folder as usual, so no additional merging will be required.  This way, you've done the best possible thing under the circumstances, and can continue working while the person who broke the build in the staging area fixes their problem. 
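Here is one way such an annotation-aware 'update' could look.  The record layout, the `build` annotation key, and the function `latest_good` are all hypothetical; the text above only says that versioned objects will be annotatable, not what the annotations will look like.

```python
from collections import ChainMap

# Hypothetical archive records, oldest first: (name, contents, annotations).
archives = [
    ("x", {"Main.java": "v1"}, {"build": "ok"}),
    ("y", {"Main.java": "v2"}, {"build": "ok"}),
    ("z", {"Main.java": "v3-broken"}, {"build": "failed"}),
]


def latest_good(archives):
    """Return the newest archive whose 'build' annotation is 'ok'."""
    for name, contents, notes in reversed(archives):
        if notes.get("build") == "ok":
            return name, contents
    raise LookupError("no archive has a passing build")


# The author's work was originally based on 'staging archive x'.
author = ChainMap({"Feature.java": "mine"}, archives[0][1])

# 'Update' means: swap the background layer for the newest good archive.
chosen, contents = latest_good(archives)
author.maps[1] = contents
```

The swap replaces one reference, so it costs the same regardless of how much content the archive holds, and the author's private files are untouched by it.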

Handling transient content

Many websites are template-driven.  There are many nice tools out there that take raw data from different sources (e.g.: XML files) and combine it with presentation logic to build HTML files.  Besides enabling you to reuse look-and-feel logic, templates also give you the freedom to experiment with alternatives.  Unfortunately, you don't always know in advance what files a template engine might emit, because the file names used might be programmatically generated (e.g.: paginated output).  Because of this, it's hard to preview the results of a template compilation without either over-writing files you actually care about, or leaving 'garbage' preview files strewn everywhere.  What you'd really like to be able to do is isolate 'preview' output in its own folder:


However, notice that this 'scratch space' for template previews might reference other files, such as those in the author's content folder.  Therefore, you'd like to see the preview content as an overlay on top of the author content, like this:


Author Sandboxes

As stated earlier, an author's content folder is itself seen against the backdrop of staging;  therefore, the full overlay diagram looks like this:


Rather than force administrators to assemble these low-level structures together 'by hand', Alfresco will create a higher-level object called an 'Author Sandbox'.
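To give a feel for what such an object bundles together, here is a rough sketch of the preview-over-author-over-staging stack.  The class, its field names, and its methods are invented for illustration; the actual Author Sandbox design is not specified in this document.

```python
from collections import ChainMap
from dataclasses import dataclass, field


@dataclass
class AuthorSandbox:
    """Hypothetical bundle of the three layers described above."""
    staging: dict
    content: dict = field(default_factory=dict)   # the author's own edits
    preview: dict = field(default_factory=dict)   # transient template output

    def author_view(self):
        # Author content seen against the backdrop of staging.
        return ChainMap(self.content, self.staging)

    def preview_view(self):
        # Preview output on top of author content on top of staging.
        return ChainMap(self.preview, self.content, self.staging)

    def discard_preview(self):
        # Throw away generated files without touching real content.
        self.preview.clear()


staging = {"index.html": "live", "logo.png": "img"}
box = AuthorSandbox(staging)
box.author_view()["index.html"] = "draft"
box.preview_view()["page-2.html"] = "generated"
```

Generated files such as `page-2.html` never leak into the author's content folder, and clearing the preview layer removes all of them at once.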


Sandboxes will have other properties as well (and be parameterizable), but many of these details are still too much in flux to document.  One of the more interesting ones is that they'll be associated with virtual domains;  this will give authors their own 'docroot' in web servers and application servers.

Suppose an HTML file in an author's content folder contains an 'absolute' reference:

<a href='/moo/cow.html'>...</a>

Because all references to '/' are rooted within the author's content folder, all files previewed from the browser will be drawn from it  (or the background 'staging content' folder) when that author-specific virtual domain is used. 
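The effect can be sketched as a lookup keyed by the request's host name.  The host names below are placeholders and `resolve` is a hypothetical helper; the real virtual domain naming scheme and adapter chain are not described here.

```python
from collections import ChainMap

staging = {"moo/cow.html": "staged cow", "index.html": "home"}

# Each author's docroot is an overlay on staging.
alice = ChainMap({"moo/cow.html": "alice's cow"}, staging)
bob = ChainMap({}, staging)

# Placeholder virtual domains, one per author sandbox.
docroots = {
    "alice.www.example.test": alice,
    "bob.www.example.test": bob,
}


def resolve(host, href):
    """Look up an absolute href in the docroot bound to the request's host."""
    return docroots[host][href.lstrip("/")]


alice_sees = resolve("alice.www.example.test", "/moo/cow.html")
bob_sees = resolve("bob.www.example.test", "/moo/cow.html")
```

The same absolute link yields Alice's modified page on her virtual domain and the staged page on Bob's, without either HTML file being rewritten.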

Reviewer Sandboxes

In most organizations, authors don't simply copy work up to the master staging folder.  Typically, there is some sort of review process.  Alfresco places a virtual copy of the changes that an author wishes to submit into a fresh folder that is itself a transparent overlay of the staging area.  Thus, a 'reviewer sandbox' is created;  it is very much like the author's original sandbox, except that it only contains the subset of changes that the author wanted to submit for approval:


As with an Author Sandbox, it will be possible to configure a separate virtual domain for a Reviewer Sandbox, so that reviewers can see all proposed changes in-context.


Typically, reviewer sandboxes are rather short-lived.  They are created automatically in the course of a workflow process; when they're no longer needed, they are destroyed.  If the content they contain is approved, the submit workflow makes a virtual copy of their contents into the staging area.
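The life cycle of a reviewer sandbox can be sketched in a few lines.  The function names `open_review` and `approve` are assumptions, and plain dictionary copies stand in for the 'virtual copies' a real repository would make.

```python
from collections import ChainMap

staging = {"index.html": "v1", "about.html": "v1"}
author_edits = {"index.html": "v2", "notes.txt": "unfinished"}


def open_review(author_edits, submitted, staging):
    """Build a fresh overlay on staging holding only the submitted paths."""
    subset = {path: author_edits[path] for path in submitted}
    return ChainMap(subset, staging)


def approve(review, staging):
    """Promote the reviewed changes into the staging area."""
    staging.update(review.maps[0])


# The author submits only index.html; notes.txt stays private.
review = open_review(author_edits, ["index.html"], staging)
reviewer_sees = review["index.html"]
unsubmitted_hidden = "notes.txt" not in review

approve(review, staging)
```

Once the changes are promoted, the review overlay has served its purpose and can simply be discarded.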

Staging Sandboxes

By default, when a reviewer approves of an author's submission, a virtual copy of it is promoted to a 'staging content' folder:


As was mentioned earlier, this 'staging content' represents the latest approved version that has been agreed-upon by everybody involved in the collaborative effort.  Sometimes though, it might be undesirable for content that has been approved to go into the main staging folder right away.  For example, certain kinds of press announcements, holiday banners, etc.   It's still useful to see such changes in-context, but we'd like to have a clean separation between content that's part of a 'timed deployment', and everything else.  Again, transparent layers can help:


Normal authors and reviewers don't have to look through the 'timed deployment' overlay at all.  In fact, if the administrator wishes to deny them access, they can be prevented from seeing it entirely.
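As a rough illustration, the view a user gets could depend on whether they are allowed to look through the 'timed deployment' overlay.  The role name `qa` and the function `staging_view` are hypothetical; how Alfresco will express this permission is not stated here.

```python
from collections import ChainMap

staging = {"index.html": "regular home page", "about.html": "about us"}

# Approved content that must not go live yet.
timed = {"index.html": "holiday banner home page"}


def staging_view(roles):
    """Return the staging content as seen by a user with the given roles."""
    if "qa" in roles:
        # Timed deployment seen in-context, on top of staging.
        return ChainMap(timed, staging)
    # Everyone else sees staging without the overlay.
    return ChainMap(staging)


qa_page = staging_view({"qa"})["index.html"]
author_page = staging_view({"author"})["index.html"]
```

The pending content stays cleanly separated from the main staging folder until its deployment time arrives.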

Alfresco will preconfigure a high-level structure called a 'Staging Sandbox' that will also have the same 'virtual domain' features available to the other kinds of sandboxes:


There comes a point when a major site reorganization really requires its own branch, but for incremental pending data, the ability to see timed deployments in-context gives the QA team that much more time to spot trouble.