Refresh your cache - Best In Class has been baked

2010-05-17 08:07:41

Best In Class has just been overhauled, from a slowly cooked PHP/Wordpress solution to the hip, blazing, 250% faster Clojure-driven version. In this post I'll outline the major strategies used in this rewrite.

 

 

Preface

When I launched Best In Class last October I knew that selling Clojure as my primary service was going to take more than words, so I launched a blog together with the company to demonstrate why I think Clojure is ideal for use in the industry. The blog was initially hosted on Wordpress.com, but since I was not seeing a spillover from that domain onto BestInClass.dk I decided to fuse the blog with the main site on the .dk domain, running both on a single Wordpress installation. There were some quick wins that I needed at the time, primarily a very short time-to-launch, since setting up Wordpress and importing your old posts only takes a day or so, but the drawbacks finally caught up with me.

 

Cooking vs Baking

When a site is cooked it does some kind of processing on every request, as PHP does for instance. When a site is baked, it's pre-rendered into static files which are then served by a webserver. The site you're looking at is fully baked, with no dynamic content. Definitions borrowed from here & here.

Baking is the way to go for a number of reasons. First off, I can back up and deploy the entire site using nothing but rsync; there is no database to update, back up or maintain. Secondly, it's about 250% faster, if not more, because even though PHP is quite fast it doesn't beat serving static files. And finally, good luck cracking a .html file: a lot of security concerns disappear with the removal of code evaluation in the frontend.

 

Transition

So making the transition from cooking to baking began with me mentally running over the elements of each page, thinking of what needed to be dynamic and what didn't. The old index.php was dynamic, in that it got a list of all the blogposts from the SQL database, then proceeded to render the excerpts in some paginated style, but why not just paginate using JS and hardcode the excerpts? I'll go through the elements one by one, showing how I made them static.

 

index.clj

The Clojure-driven index (or blog.html) differs from index.php in that it doesn't load anything on request; instead it is updated every time I publish a blogpost. I have a fancy WYSIWYG backend where I can write my posts, and as soon as I hit publish, a file is generated with that post and the excerpt is stripped out and prepended to the blog.html file. This process is very simple thanks to Enlive's clever templating system, with which I define what an excerpt looks like:

(defsnippet teaser "teaser.html" [:body :> any-node]
  [title link thumb excerpt]
  [:a.title-link]                       (do-> (set-attr :href link)
                                              (content title))
  [:a.thumb-link]                       (set-attr :href link)
  [:div.link-float-right :a.perma-link] (set-attr :href link)
  [:img.avatar]                         (set-attr :src (str thumb-prefix thumb))
  [:div.excerpt]                        (content (html-snippet excerpt)))

For that to work, you just have to provide an html file which contains elements to match the selectors.

Simple enough? It's just plain CSS selectors, working sort of like Pure. Prepending an excerpt to the main index is then as simple as:

(->> ((template (File. "site/blog.html") [title link thumb excerpt]
                [:ul.content] (prepend (select (teaser title link (str "/" thumb) excerpt)
                                               [:ul :> any-node])))
      title url avatar (-> (.split (slurp "draft") "") first))
     (apply str)
     (spit "site/blog.html"))
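The excerpt step at the end of that form is easier to see in isolation. What follows is a minimal sketch, not the site's actual code: it assumes the draft marks the end of the excerpt with Wordpress's "more" tag (the delimiter is not visible in the snippet above), and 'more-marker' and 'excerpt' are hypothetical names. It cuts at the first marker only, so a body containing several markers still yields a clean excerpt.

```clojure
(require '[clojure.string :as str])

;; ASSUMPTION: the excerpt delimiter is Wordpress's "more" tag.
(def more-marker "<!--more-->")

(defn excerpt
  "Returns everything in body before the first more-marker,
   or the whole body when no marker is present."
  [body]
  (if-let [idx (str/index-of body more-marker)]
    (subs body 0 idx)
    body))
```

Called on a draft such as "Intro. <!--more--> The rest", this keeps only the text in front of the marker.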

 

A Single Post

So generating a blogpost is a two-stage process. Every page on the site shares certain elements, i.e. the header, the menu and the footer. To avoid having to edit a ton of files every time I change something, these are all abstracted away in a template appropriately named 'page':

; Raw template for all pages, include header/footer
(deftemplate page "template.html" [title scripts styles body]
  [:title]             (content title)
  [:div#pages :a]      (clone-for [[href src] [["/index.html"     "/images/forside-lnk.png"]
                                               ["/services.html"  "/images/services-lnk.png"]
                                               ["/produkter.html" "/images/produkter-lnk.png"]
                                               ["/blog.html"      "/images/blog-lnk.png"]
                                               ["/kontakt.html"   "/images/kontakt-lnk.png"]]]
                                  this-node (set-attr :href href)
                                  [:img]    (set-attr :src src))
  [:script.header]     (clone-for [src scripts] (set-attr :src src))
  [:link]              (clone-for [href styles] (set-attr :href href))
  [:div#content]       (substitute body))

You'll notice that this relies heavily on Enlive's 'clone-for', which behaves just like a for-loop except that it's spitting out html. In the case of the menu, I supply hrefs and link-icons, and these get destructured and transformed into the menu you see at the very top of the page. Finally, div#content is replaced with whatever is passed in as body. Four fns which will be your best friends when using Enlive are append, prepend, substitute and content. The final line is important, because it allows me to pass other transformations, i.e. snippets, as the content. For instance, to render my frontpage I'll make the html first:

<body>
    <div class="scrollable">
      <div class="items">
        <div>
          <img src="/images/slider/webudvikling.png" class="thumb"/>
          <img src="/images/slider/webudvikling-quote.png"/>
          <a class="clink" href="/services.html">Læs mere</a>
        </div>
        <div>
          <img src="/images/slider/appudvikling.png" class="thumb"/>
          <img src="/images/slider/appudvikling-quote.png"/>
          <a class="clink" href="/services.html#2">Læs mere</a>
        </div>
        <div>
          <img src="/images/slider/cljudvikling.png" class="thumb"/>
          <img src="/images/slider/cljudvikling-quote.png"/>
          <a class="clink" href="/services.html#2">Læs mere</a>
        </div>
      </div>
    </div>
</body>

And this is then loaded into a snippet:

(defsnippet frontpage "index.html"     [:body :> any-node] [])

So generating the frontpage can now be done like so:

(page "Best In Class" scripts css-files (frontpage))

It couldn't be much simpler, and it makes for highly reusable and maintainable code. It's also a case of 'optimize once, win everywhere'.

Comments

Ah, but there is one gotcha. Comments are dynamic, right? Well, half of it is. As you can probably deduce from the snippets above, appending a comment to a blogpost is trivial, but receiving it and moderating posts are still dynamic tasks. For that reason, the backend of the site is driven by Moustache, Christophe's micro web-framework. Moustache has a few simple tasks.

Since we are serving multiple users, we are risking race conditions in several of these tasks, so it's a good thing I decided to write the site in Clojure. When you submit a comment to the site (and I hope you do), that comment is sent to an in-memory queue, which, skipping the urlencoding/decoding, looks like so:

(if (= captcha answer)
  (dosync
   (alter comment-queue conj
          {:url     url, :name name, :email email
           :captcha (format "Answered %s to question #%s (%s)" captcha cid question)
           :date    (.toString date)
           :comment comment})
   {:body "OK"})
  {:body "NOT OK"})

Every minute an agent is checking that queue and persisting it to a file on disk, in case of a server crash:

(defn backup-comments [a]
  (doseq [comment (dosync
                   (let [comments @comment-queue]
                     (ref-set comment-queue [])
                     comments))]
    (append-spit "comment-queue"
                 (with-out-str
                   (prn comment))))
  (Thread/sleep 60000)
  (send-off *agent* backup-comments))
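The important part of backup-comments is that reading the queue and emptying it happen inside one transaction, so a comment that arrives mid-backup ends up either in this batch or in the next one, never lost. A minimal sketch of just that step, with hypothetical helper names 'enqueue!' and 'drain!' that are not part of the site's code:

```clojure
;; The queue is a ref holding a vector, as in the snippets above.
(def comment-queue (ref []))

(defn enqueue!
  "Adds one comment map to the end of the queue."
  [c]
  (dosync (alter comment-queue conj c)))

(defn drain!
  "Atomically returns the queued comments and resets the queue to empty."
  []
  (dosync
   (let [comments @comment-queue]
     (ref-set comment-queue [])
     comments)))
```

Because the transaction body has no side effects, it is safe for the STM to retry it; the file write stays outside the dosync, just as in backup-comments.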

So the moderation panel is just a matter of checking which comments are in the queue and either deleting them or appending them to a post. Simple, right? But there's another gotcha: StreamWriters aren't atomic. That means that while Enlive is busy printing the new blogpost, some poor reader may come by and see a half-written html file. Luckily Unix systems provide a number of atomic filesystem (fs) operations, like 'mv':

(defn append-to-post [{:keys [url name date comment]}]
  (let [url     (-> (str "site/" url) (.replaceAll "//" "/"))
        url2    (str url (hash url))
        c-class (if (= name "Lau") "comment-lau" "comment")]
    (->> ((template (-> url File. html-resource) [new-comment]
                    [:div#debate] (append new-comment))
          (a-comment c-class name date comment))
         (apply str)
         (spit url2))
    (sh "mv" url2 url)))

So there you have it, atomicity on the filesystem. All this does is sanitize the url and then make a second filename with the url's hash value appended. Then it checks if I'm (Lau) posting and if so changes the class of the comment's div, prints the html to disk and swaps the files.
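The same write-then-rename idea can also be done without shelling out. This is a minimal sketch under stated assumptions: a JVM with java.nio.file (Java 7 or later), a Unix-like filesystem, and a temp file in the same directory as the target. 'atomic-spit' is a hypothetical name, not a function from the site.

```clojure
(import '[java.nio.file CopyOption Files Paths StandardCopyOption])

(defn atomic-spit
  "Writes content to path + \".tmp\", then renames it over path.
   Readers see either the old file or the new one, never a partial write."
  [path content]
  (let [no-more (into-array String [])
        target  (Paths/get path no-more)
        tmp     (Paths/get (str path ".tmp") no-more)]
    (spit (.toFile tmp) content)
    ;; ATOMIC_MOVE maps to rename(2) on Unix; whether an existing target is
    ;; replaced is implementation specific on other platforms.
    (Files/move tmp target
                (into-array CopyOption [StandardCopyOption/ATOMIC_MOVE]))
    path))
```

The temp file has to live on the same filesystem as the target, otherwise the move cannot be atomic and the call throws instead.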

The combination of Enlive, Moustache and Clojure is extremely powerful, so powerful that you can generate almost anything with it.....

Atom Feed

Yes, even an atom feed becomes trivial to generate, so whenever I publish from the backend, not only is the post produced and the index modified, but the atom feed is also updated. All this takes is a simple atom.xml file, which has the basic structure required by the RFC. Then you make a template which spews the elements, ie. the posts and calling it is as simple as:

(atom-feed (.format (SimpleDateFormat. "yyyy-MM-dd'T'HH:mm:ss'+08:00'") (java.util.Date.))
           (take 10 (sort-by :updated #(compare %2 %1) data)))

The 'data' variable is just a collection of hash-maps, one per post, built from a file-seq. This data is then sorted in descending order (thanks Chris) and the top 10 posts are passed to the atom-feed template. If you're not seeing the full picture, wait until I put it on Github :)
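The "newest ten" selection is worth seeing on its own. A small sketch, assuming each post is a map whose :updated value is an ISO-style timestamp string with a shared offset (so string order equals time order); 'latest' and 'sample-posts' are made-up names for illustration:

```clojure
(defn latest
  "Returns the n most recently updated posts, newest first."
  [n posts]
  ;; Swapping the comparator arguments turns ascending into descending.
  (take n (sort-by :updated #(compare %2 %1) posts)))

(def sample-posts
  [{:title "A" :updated "2010-04-01T10:00:00+02:00"}
   {:title "B" :updated "2010-05-17T08:07:41+02:00"}
   {:title "C" :updated "2010-03-09T09:30:00+02:00"}])

(map :title (latest 2 sample-posts)) ; => ("B" "A")
```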

 

Bringing the luggage!

So there's just one thing missing: what about all of my old posts? 42 to be exact. Well, the evil twin of Enlive's templating is selectors, which are perfect for scraping, so I've written a small lib which swallows a Wordpress.xml export file and converts it to whatever you like. There are 3 main stages.

If you are the proud owner of a Wordpress blog, try exporting your site and looking at the comments and you'll see that they are neatly organized in tags, with elements explaining if they are approved, pingbacks or whatever. I want all approved comments minus pingbacks:

(defn extract-comments [post]
  (let [comments (select post [[:wp:comment (has [[:wp:comment_approved (pred #(= "1" (text %)))]])
                                       (but (has [[:wp:comment_type (pred #(= "pingback" (text %)))]]))]])]
    (sort-by :date compare
             (for [c comments]
               (loot c [:author :email :date :comment]
                     [:wp:comment_author :wp:comment_author_email
                      :wp:comment_date :wp:comment_content])))))

Christophe, not wanting to shadow "not", named that operator "but", which makes for some unclear reading. The loot function wasn't meant for primetime, but it kept coming in handy! It simply takes a collection of names and a collection of selectors, and maps each name to the content picked out by the corresponding selector:

(defn loot
  [chunk names selectors]
  (knit names (map (fn [selector]
                     (pick chunk (if (coll? selector)
                                     selector
                                     [selector content])))
                           selectors)))
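'knit' and 'pick' are not shown in this post. To make the shape of loot concrete, here is a sketch under loud assumptions: 'knit' is taken to pair names with values the way zipmap does, and 'pick' is replaced by a plain map lookup so the example runs without Enlive. Neither is the real definition, and 'loot-sketch' is a hypothetical name.

```clojure
;; ASSUMPTION: knit pairs each name with the value at the same position.
(defn knit [names values]
  (zipmap names values))

;; ASSUMPTION: stand-in for pick; the real one would run an Enlive selector
;; against a parsed XML node instead of looking up a key in a map.
(defn pick [chunk selector]
  (get chunk selector))

(defn loot-sketch
  "Maps each name to whatever the selector at the same position picks out."
  [chunk names selectors]
  (knit names (map #(pick chunk %) selectors)))

(loot-sketch {:wp:comment_author "Joschi" :wp:comment_date "2010-05-18"}
             [:author :date]
             [:wp:comment_author :wp:comment_date])
;; => {:author "Joschi", :date "2010-05-18"}
```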

So with all the comments tucked away in a collection, we can now grab the main content:

(defn get-posts
  " Takes an Wordpress backup file as its first argument and a function of 1-args as its second.

    The Wordpress file is parsed for post data, which is returned in hash-maps containing keys
    [:title :link :body :date :thumb :comments]

    Thumb is specific to users of the post-avatar plugin. After the data is retrieved the
    post-capture-hook is applied to each item. Use this to sanitize, modify, etc."
  [file post-capture-hook]
  (let [posts  (-> file xml-resource
                   (select [[:item (has [[:wp:post_type (pred #(= "post" (text %)))]])]]))]
    (map post-capture-hook
            (for [{i :content} posts]
              (-> (loot i [:title :link :body :date :thumb]
                        [:title :link :content:encoded :wp:post_date
                         [[:wp:postmeta (has [[:wp:meta_key (pred #(= "postuserpic" (text %)))]])]
                          [:wp:meta_value] content]])
                  (assoc :comments (extract-comments i)))))))

There are no surprises, and it's great to see how little code you have to write in order to import from an entirely different CMS. You see me picking out the "postuserpic", which is a property unique to users of the 'post avatar' plugin. If you don't use that on your blog, it'll return nil. The function first loops over all the posts, extracting the interesting details, then it associates the :comments with each entry, and finally it maps a post-capture-hook onto all the elements. The hook allows you to do arbitrary post-capture formatting, like fixing dates et al. My post-capture hook pulls out the excerpts and fixes links; yours might do something else.
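As an illustration of what a post-capture hook might look like, here is a sketch that fixes dates. It assumes the :date value has already been reduced to a plain string in the "yyyy-MM-dd HH:mm:ss" shape that Wordpress exports use, and it relies on java.time (Java 8 or later). 'fix-date' and 'wp-date-format' are hypothetical names, not the hook used on this site.

```clojure
(import '[java.time LocalDateTime]
        '[java.time.format DateTimeFormatter])

;; ASSUMPTION: post dates look like "2010-05-17 08:07:41".
(def wp-date-format (DateTimeFormatter/ofPattern "yyyy-MM-dd HH:mm:ss"))

(defn fix-date
  "Rewrites the :date of a post map into ISO local date-time form."
  [post]
  (update post :date
          (fn [d]
            (.format (LocalDateTime/parse d wp-date-format)
                     DateTimeFormatter/ISO_LOCAL_DATE_TIME))))
```

A hook like this would be passed as the second argument to get-posts, and several hooks can be combined with comp since each takes and returns a post map.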

 

NGINX Fu!

So the last step is simply to launch the site on a webserver. The old links looked like so:

http://www.bestinclass.dk/index.php/2010/04/prototurtle-the-tale-of-the-bleeding-turtle/

And the new links like so:

http://www.bestinclass.dk/index.clj/2010/04/prototurtle-the-tale-of-the-bleeding-turtle.html

Nginx (Engine X) provides some fancy rewriting with regexes, so if you click the old link, you'll actually see it transform into the new one before your very eyes. This is how I do it:

if ($request_method ~* GET) {
    rewrite ^/(.*)/$ /$1;
}

if ($uri ~* /index.php) {
    rewrite ^(.+)(\.php)(.+)$ http://www.bestinclass.dk$1.clj$3.html last;
    break;
}

The first is a common rule of all GET requests, removing a possible trailing slash. The second captures 3 groups and then knits them together around the domain address, works perfectly! So hopefully all of the old links still work.
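A quick way to convince yourself that the two rules compose correctly is to replay them at the REPL. This sketch mirrors the regexes with clojure.string/replace; it works on the full URL rather than on Nginx's $uri, and 'old->new' is a hypothetical helper, not part of the server setup.

```clojure
(require '[clojure.string :as str])

(defn old->new
  "Applies both rewrites to a link: strip a trailing slash, then
   swap .php for .clj and add .html at the end."
  [link]
  (-> link
      (str/replace #"/$" "")
      (str/replace #"^(.+)\.php(.+)$" "$1.clj$2.html")))

(old->new "http://www.bestinclass.dk/index.php/2010/04/prototurtle-the-tale-of-the-bleeding-turtle/")
;; => "http://www.bestinclass.dk/index.clj/2010/04/prototurtle-the-tale-of-the-bleeding-turtle.html"
```

Links that contain no ".php" pass through untouched apart from the trailing slash.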

Another thing which you need to keep in mind, is that some of the typical caching rules for Nginx also cache html files, which would make for a very boring blog, so it makes sense to disable this and only cache the truly static files.

 

Conclusion

Converting an entire Wordpress blog to a slick new baked solution is a piece of cake, so to speak. The awesome expressive power of Clojure means it takes just a few lines to produce a thread-safe web application. The new version of Best In Class is much (MUCH!) easier to manage, maintain and update. Time is scarce these days, but I'll try to bundle the code and open-source most of it, in case anybody else is looking to get off Wordpress and into the baking business :)

If you're in Europe in late June and would like to learn how to deploy Clojure in the industry, be sure to check out Conj Labs.

Baishampayan Ghose
2010-05-18 06:08:27
Awesome new blog Lau! Congrats.
joschi
2010-05-18 06:53:18
There are still some errors, especially in the Atom feed.

The content for example contains:

") first)) (apply str) (spit "site/blog.html")) 

in the first line of every entry. Also the title tag has sometimes strange content like "Best In Class: net.cgrand.enlive_html$content__627@16bbeaf".

And on http://www.bestinclass.dk/index.clj/2010/05/best-in-class--now-baked-with-clojure.html Firefox links the RSS icon in the address bar to the CSS files referred to in the HTML page (which could also be a bug in Firefox).
Lau
2010-05-18 08:09:14
Hey Joschi,

Thanks for checking in. I've been fortunate enough to have many keen eyes watching the site.

1) The leaked s-exps in the Atom feed were due to the way I extracted excerpts. If more than one -more- tag was present in the body it would break; that's fixed now.

2) The link to main.css instead of atom.xml was due to the Enlive selector which injects the css. It wasn't looking at the rel attribute. That's fixed now with [[:link (attr= :rel "stylesheet")]].

3) The title looks like an emitter missing apply-str somewhere, I'll go check it out asap.

Thanks again to everybody, Lau 
metacagoule
2010-05-18 08:31:32
delightful resurrection.
Chouser
2010-05-18 05:26:04
You might consider having your agent pull from a BlockingQueue instead of polling a ref.
John
2010-05-18 07:06:47
I love it. Having spent a lot of time in Clojure and writing various bits and pieces, I'm most interested in the big picture. I'm curious about how *exactly* you have set up the server since it seems like everybody does it differently with Clojure web apps.
Viksit
2010-05-19 12:08:26
Excellent Job Lau! I'm looking forward to the open sourced version of the blogging code.

From what I could see, you haven't used Compojure here. Any specific reasons not to?
Lau
2010-05-19 08:11:14
@John: Then stay tuned for the next few blogposts where I hope to release most of the code :)

@Viksit: There are many considerations that helped move me away from Compojure. First and foremost, Compojure 0.4 and onwards is a very slimmed-down version which is under heavy development. Moustache on the other hand is like the younger cousin of Enlive, which only handles routing and does so in a very simple and well implemented manner. Thirdly, Moustache is made by Christophe Grand, whom I trust a lot when it comes to writing solid functional code, so that in itself was enough for me to try it out and I'm quite happy with it.

Though with that said, Compojure still has a lot to offer, so for every project I recommend checking out both libs and seeing where they are at.

More on this in future blogposts.
Viksit
2010-05-19 12:12:27
BTW, what criteria did you measure the 250% increase in performance with?

Cheers.
Lau
2010-05-19 08:15:46
@Viksit: Performance was measured with Apache's 'ab' utility, testing identical blogposts with a concurrency level of 10. The 250% was in the number of requests served per second. In my preliminary tests, all requests were answered in less than 1 second with the Clojure generated site, whereas all requests took > 1 second on the PHP site. Still, we're down to differences of a couple of hundred milliseconds, and the internet isn't the most accurate medium for testing speeds. The 2 blogposts had almost identical byte-sizes.
semperos
2010-06-08 08:21:29
Beautiful site, and beautifully coded.  Thanks for open-sourcing the code.

The "cooked" versus "baked" distinction is important, but could you speak to it a little more (either in a comment or in another post)?

The Drupal project, for example, has a contributed module called Boost that auto-caches HTML file versions for all "pages" that would otherwise be served dynamically, so you bypass standard PHP-DB-PHP loops and simply serve up a file.  Could you speak to some of the differences between your "baking" process and common caching schemes like this?