Web application/Caching
Web application caching is the process of storing dynamically generated data for reuse and moving that data closer to the end user. Caching can be applied at a variety of levels to improve web application performance. Generally, caching reduces workload and delivers content to users faster. Implementing a cache can be complex; often the simplest forms of caching produce the quickest and most cost-effective performance payoffs.
Benefits
- Even small performance improvements from caching can drastically improve the user experience.
- Reduced server workload can translate to reduced hardware and support costs.
- There are very low-cost forms of caching. Most web browsers and some Internet networking components, such as proxies, have common caching options. Leveraging these caches only requires a simple entry in the HTTP response header.
- The HTML5 standard incorporates web application caching through manifest files, letting developers more easily incorporate application-specific caching techniques.
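For example, a minimal cache manifest might look like the following (the file names and version comment are illustrative), with the page referencing it via <html manifest="site.appcache">:

CACHE MANIFEST
# v1 - bump this comment to force clients to re-download
CACHE:
/css/style.css
/js/app.js
/images/logo.png
NETWORK:
*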
Drawbacks
- Implementation can be expensive in some scenarios. It's sometimes difficult to retrofit caching into an existing web application, and the extra logic requires additional testing and debugging time.
- The best results may come from special-purpose software or hardware, which can be costly and harder to support. For example, with distributed systems a dedicated proxy cache sitting above the web application might be most practical.
- System administrators might need to be specially trained or experienced in particular caching scenarios to properly configure and support them.
Design Patterns
A wide variety of design patterns exist for caching within web applications. Each has its own set of benefits and limitations. All can be used in combination with each other. Pages that rarely change don't need to be regenerated on every request. Consider caching as early as possible in the request handling process.
Caching Proxy Servers
In the simplest form of a web application, the system responds directly to every request from a web browser. It can be beneficial to introduce a dedicated caching server between the client and the web application server, avoiding dynamic processing completely when possible. A proxy server can watch the HTTP headers for cache hints and act appropriately. This leaves the web server free to only process the requests it actually needs to.
A proxy can often be introduced with no changes to the web application, assuming the web application already responds with appropriate cache headers. Some proxies also inject their own HTTP header content for tracking purposes, or to pass back information to the web application. Some proxy rules must be carefully set, though, to allow dynamic generation where required, such as for user accounts and expired data.
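For example, a web application might emit headers like the following so a proxy knows it may store and reuse the response (a sketch; the five-minute lifetime and the $page_updated_at timestamp are illustrative):

<?php
// Allow any shared cache (such as a proxy) to reuse this response for 5 minutes
header('Cache-Control: public, max-age=300');
// Help caches revalidate; $page_updated_at is a hypothetical Unix timestamp
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $page_updated_at) . ' GMT');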
In addition to caching, proxy servers may perform additional performance tasks, such as compressing data and optimizing documents. Systems that assist in this way are referred to as web accelerators. For example, this intermediary may take multiple JavaScript files, minify and combine them into one, and change the HTML of the main document to reference the new, smaller JS file. This allows web developers to focus on building page functionality while another system handles performance characteristics.
Caching proxy servers are often most useful in distributed environments, where the proxy server can act as both a load balancer and a cache. Each request requiring dynamic content is passed to the next available web server.
Content Delivery Networks
Content delivery networks, or CDNs, are geographically distributed networks which can be very useful for storing static content close to the end user. Edge servers can reside around the world, closer to end users, while the primary storage resides in one or several central locations. A web application can push any static content to the central server. Upon the first request in a local region, the edge server requests the file from the central server. Subsequent requests near the edge server return the cached content until the Time-To-Live, or TTL, expires. If this is set to 30 minutes, for example, then the next request after the 30-minute mark is refreshed from the central server.
CDNs therefore can act as large distributed caches for web applications. A web application can push content to the CDN repository via FTP, rsync, or some other file upload protocol.
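For example, static files might be pushed to a CDN's upload server with rsync (the host and paths here are hypothetical):

rsync -avz ./static/ user@upload.cdn-provider.example:/webapp/static/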
A Content Delivery Network might serve an entire web site, having the domain name point directly to the CDN. The content can be pushed to the CDN's upload server whenever the web application is updated. (See Case study/High traffic multimedia web site.)
Alternatively a web application might push just large static content, such as videos or images, to the CDN, while handling the primary web page requests dynamically.
Whole Page Content
After a request makes it to the web server / web application, that server might employ its own form of page caching.
Often the difficult part in caching whole page content is knowing when the content is out-of-date and needs refreshing, particularly in large systems where a single page is made of many smaller components.
To completely avoid dynamically generating a requested web page more than once, the entire page may be cached to disk and served directly by the web server. When the page is first created, or later updated, the disk cache is written to a path accessible to the web server and subsequently handled without any dynamic code. This is perhaps the most significant web application performance improvement because on many (maybe most) requests the application does no work at all. This method has several important consequences (a minimal sketch follows the list below):
- All URLs from other pages must point directly to this file, or a path easily processed by mod_rewrite rules. Therefore it's best to use a predetermined naming convention for all cached files so it's relatively trivial for other pages to generate the correct path.
- Page cache updates must be explicitly triggered when necessary. If requests for missing pages are routed to the web application, the first request for a page can serve as one trigger; all subsequent updates by a content manager must also regenerate or invalidate the cached file.
- No content on the page can be generated dynamically during the page request. This can be overcome with AJAX, whereby the client makes requests after page load to update any additional dynamic content. This might, however, offset the benefits of caching the entire page. (See Partial Page Content for alternatives.)
- Clean URLs might be lost without the use of mod_rewrite. The URL structure should be considered in the context of the entire web application.
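A minimal sketch of the disk-cache approach, assuming a hypothetical render_page() helper and a web root under /var/www/html:

<?php
// Write the finished HTML where the web server can serve it directly.
// Called when a page is first created and again on every update.
function cache_page($page_id)
{
    $html = render_page($page_id);                 // dynamic generation, done only once
    $path = "/var/www/html/pages/{$page_id}.html"; // predetermined naming convention
    file_put_contents($path, $html, LOCK_EX);
}

// A content-management action would call this after each update:
cache_page(42);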
Another method of caching complete page content is for the web application to handle it directly on demand. All page requests still go to the web application, which checks for a cached page. If the cached page exists and is not out of date ("dirty"), it's sent immediately back to the client and no further processing is necessary. The cache might be updated at timed intervals or as the data behind it becomes out of date. This method also has several important consequences (again, a sketch follows the list):
- URLs don't need to change for cached or dynamic content. It's seamless to the user and the web application.
- The cache store can be anywhere accessible to the web application, such as remote flat file storage or databases.
- The core of the web application might need to query several components to determine if the page as a whole is out of date. Alternatively, those other components might notify a core caching component when it's time to refresh the page. For example, an administrative interface might need to clear the cache or notify a caching interface when a database record is inserted or updated.
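A sketch of the on-demand approach, assuming hypothetical render_page() and is_page_dirty() helpers:

<?php
// Serve a cached copy when it exists and is still fresh;
// otherwise regenerate, refresh the cache, and send the result.
function serve_page($page_id)
{
    $cache_file = "/var/cache/webapp/pages/{$page_id}.html";

    if (file_exists($cache_file) && !is_page_dirty($page_id, filemtime($cache_file))) {
        readfile($cache_file); // no further processing necessary
        return;
    }

    $html = render_page($page_id);
    file_put_contents($cache_file, $html, LOCK_EX);
    echo $html;
}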
Partial Page Content
It's often beneficial to cache segments of dynamically generated web page content when possible. If a page is built from various components, and only some of those components are expensive to generate, then just those parts can be cached to improve overall performance.
If a web application is properly modularized, then any individual component should be cacheable. Let's take a simplified shopping cart as an example. The products and categories might be easy to generate, while the user's cart summary on each page can be more expensive due to user management and cart calculations. On each page but the cart itself, the summary might be cached and updated as needed.
// This PHP code is for demonstration only.
// It's a very simplified example of caching data across
// web pages and is not a complete system.

// An extremely simple web page template
<html>
<head>
  <title>For Sale</title>
</head>
<body>
  <div id="header"> ... </div>
  <div id="cart_summary">
    You have <?php echo $cart->get_item_count(); ?> items,
    totaling <?php echo $cart->get_value(); ?>, in your cart.
  </div>
  <div id="categories">
    <ul>
      <?php
      foreach ($shop->get_categories() as $category) {
          echo "<li><a href=\"/category/{$category->id}\">{$category->name}</a></li>";
      }
      ?>
    </ul>
  </div>
  <div id="products">
    <ul>
      <?php
      foreach ($shop->get_products() as $product) {
          echo "<li><a href=\"/product/{$product->id}\">{$product->name} (\${$product->cost})</a></li>";
      }
      ?>
    </ul>
  </div>
</body>
</html>

// A few of the cart class methods
class Cart
{
    ...

    public function add_product($product, $count = 1)
    {
        // Clear any related cached values
        unset($_SESSION['cart_item_count']);
        unset($_SESSION['cart_value']);

        $database = Database::get_current();
        $database->query("INSERT INTO user_cart ... ");
    }

    // Get the total number of items in the cart.
    // Let the cache be bypassed on pages where the real value
    // is most important, such as on checkout pages.
    public function get_item_count($allow_cache = true)
    {
        if ($allow_cache && isset($_SESSION['cart_item_count'])) {
            // Use the cached value
            return $_SESSION['cart_item_count'];
        }

        $database = Database::get_current();
        $count = $database->query("SELECT COUNT(1) FROM user_cart ... ");
        $_SESSION['cart_item_count'] = $count;
        return $count;
    }

    // Get the total value of the items in the cart.
    // Let the cache be bypassed on pages where the real value
    // is most important, such as on checkout pages.
    public function get_value($allow_cache = true)
    {
        if ($allow_cache && isset($_SESSION['cart_value'])) {
            // Use the cached value
            return $_SESSION['cart_value'];
        }

        $database = Database::get_current();
        $value = $database->query("SELECT SUM(value) FROM user_cart ... ");
        // Add tax
        $value = $value + Shop::calculate_tax($value, ...
        // Add shipping
        $value = $value + Shop::calculate_shipping($value, ...
        $_SESSION['cart_value'] = $value;
        return $value;
    }
}
This caching method can be implemented in various ways. Using the session to store the values might have its own drawbacks. The general principle is to prevent recalculating the product total plus tax and shipping on every page load, where many times the user will be browsing and not updating their cart. Other parts of the application might also benefit from their own caching, such as the category list.
For caching large subsets of pages, the session is often not the most useful or efficient store. Expensive result sets, for example, might be better cached directly on disk. If progressively loading a large page with AJAX, query results are written once to disk and subsets are returned as needed. This might also be useful for pagination using queries or databases which can't efficiently return subsets of their results. In PHP, writing and reading data on disk is most easily achieved with the serialize and unserialize functions. As with any system, using these functions has its own overhead which must be balanced against the expense of querying the database.
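A minimal sketch of disk caching with serialize and unserialize, assuming a hypothetical report_query() function and a one-hour cache lifetime:

<?php
$cache_file = '/var/cache/webapp/report_results.dat';

if (file_exists($cache_file) && filemtime($cache_file) > time() - 3600) {
    // Reuse results cached within the last hour
    $results = unserialize(file_get_contents($cache_file));
} else {
    $results = report_query(); // the expensive database work
    file_put_contents($cache_file, serialize($results), LOCK_EX);
}

// Pagination or AJAX subsets can then slice the cached results:
$page_rows = array_slice($results, $offset, $per_page);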
In our shopping cart example, another way to improve performance might be to cache the body of the page containing categories and products. These can be served statically until the products or categories are updated by administrators. The user's cart summary can then be updated after page load with an AJAX request. In this scenario the bulk of the content which rarely changes is served directly (as with whole page content), while only the part which is specific to the user is generated dynamically. This effectively reduces the work load of the web application to an absolute minimum.
<!-- A cached web page, generated when categories or products are updated -->
<html>
<head>
  <title>For Sale</title>
  <script type="text/javascript" src="/js/jquery.js"></script>
  <script type="text/javascript">
  $(function () {
      // /cart/summary is a web application page which
      // returns user cart data in a JSON object
      $.post('/cart/summary', {}, function (data) {
          $('#cart_item_count').text(data.item_count);
          $('#cart_value').text(data.value);
      }, 'json');
  });
  </script>
</head>
<body>
  <div id="header"> ... </div>
  <div id="cart_summary">
    You have <span id="cart_item_count">0</span> items,
    totaling $<span id="cart_value">0</span>, in your cart.
  </div>
  <div id="categories">
    <ul>
      <li><a href="/category/1">One set of products</a></li>
      <li><a href="/category/2">Another set of products</a></li>
      ...
    </ul>
  </div>
  <div id="products">
    <ul>
      <li><a href="/product/1">First product ($123)</a></li>
      <li><a href="/product/2">Second product ($456)</a></li>
    </ul>
  </div>
</body>
</html>
// The /cart/summary page simply returns our (often cached) cart values
// as JSON in response to the AJAX request.
$cart = new Cart();
// Load the cart associated with the user's session or user account
$cart->load_current();

// Generate page data
$data = array(
    'item_count' => $cart->get_item_count(),
    'value'      => $cart->get_value()
);
header('Content-Type: application/json');
echo json_encode($data);
In this case our page is basically static. Only the dynamic part is requested from the server, and even that is cached when possible.
HTTP Headers
Cache-related values in the HTTP response headers can be set to inform web browsers and intermediate servers what they may cache and for how long.
Generally, a web application should allow a web server to directly serve static content to end users unless that content needs to be restricted for security reasons. Typically files like template images, CSS, and JavaScript are served by the web server, skipping dynamic runtimes (e.g. PHP). In these cases the web server should be configured to send appropriate cache headers. Using Apache's httpd, for example, mod_expires can be enabled and configured:
ExpiresActive On
# By default cache all files for 2 weeks after access (A).
ExpiresDefault A1209600
# Do not cache dynamically generated pages.
ExpiresByType text/html A1
ExpiresByType text/xml A1
When served by the web application, these headers can be overridden. For example, if the web application generates CAPTCHA images, caching should be completely disabled. If a dynamically generated HTML page can be cached for longer, then the expiration can be extended.
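For example, a CAPTCHA script might disable caching entirely before outputting its image (a sketch; the image-generation code is omitted):

<?php
header('Content-Type: image/png');
header('Cache-Control: no-store, no-cache, must-revalidate');
header('Expires: 0');
header('Pragma: no-cache');
// ... generate and output the CAPTCHA image ...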
For generating downloadable files, a web application should set the following headers:
Content-Disposition: attachment; filename=[...]
Expires: 0
Cache-Control: private
Pragma: cache
See also: RFC 2616
Shared Memory
Web applications built on platforms which handle each request completely independently of any other, such as PHP, can benefit from sharing memory among multiple requests. By leveraging extensions such as APC or memcached, requests can use data which was generated during prior page requests. Unlike session data, which is persisted per individual user, shared memory can be used by all users simultaneously. This potentially reduces overall memory consumption and processing requirements. Memory is also a much faster storage mechanism than disk.
Let's say, for example, that a shopping cart has a complex hierarchical product category structure. Generating this data set might require multiple database queries or recursion. There's no need to generate this data unless it's been updated by site administrators. Therefore a page can get this structured data from memory if available, or generate it if not found. The related administrative pages of the site can mark this data expired as needed.
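A minimal sketch using the APC extension's apc_* functions; the Shop::build_category_tree() helper and one-hour TTL are illustrative:

<?php
function get_category_tree()
{
    $tree = apc_fetch('category_tree', $success);
    if ($success) {
        // Served from shared memory, populated by a prior request
        return $tree;
    }

    // Not cached yet (or expired): rebuild from the database
    $tree = Shop::build_category_tree();
    apc_store('category_tree', $tree, 3600);
    return $tree;
}

// Administrative pages can mark the data expired after an update:
apc_delete('category_tree');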
Database
Often a database is the biggest bottleneck for the server side of a web application. Gathering data, particularly from a relational database, can be expensive for large or complex data sets. In addition to directly improving database server and query performance, it can be possible to cache database data to further improve performance and avoid unnecessary processing.
Generally, databases which support query caching, such as MySQL, should have this feature turned on. Recent query results are retained in memory and resent directly if the same query is run again. The database server will automatically clear the cache when any related table data is updated, and the caching should therefore be completely seamless to the web application. Even with database query caching enabled, repeatable queries should generally be avoided so the request to the database server isn't even made unless required.
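For example, on MySQL versions that include the query cache (it was removed in MySQL 8.0), the feature can be enabled in my.cnf; the size shown is illustrative:

# my.cnf
query_cache_type = 1    # cache eligible SELECT results
query_cache_size = 64M  # memory reserved for cached results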
"Expensive" or large queries which require much database processing or bandwidth can be cached by a web application in various ways:
- Memory or disk caching can be used directly to avoid running duplicate queries. A complex data report, for example, can have its result sets serialized to disk for reuse in multiple page templates. Caching has its own overhead, such as the expense of reading and parsing data on disk, so simple and inexpensive queries are most often best left for the database to process.
- ORM modules can be built to retain objects in memory. For the first request, the data object is loaded into memory. Any updates or further requests can be managed in memory, with data written back to the database immediately or later, such as during system shutdown. In this way the ORM is effectively a buffer between the web application logic and the database server (see the sketch after this list).
- Caching partial page content can help avoid rerunning queries when that content is required multiple times.
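A minimal sketch of the ORM-style buffering described above, using an identity map so each object is loaded at most once per request (the class and method names here are hypothetical, not part of a particular ORM library):

<?php
class ProductMapper
{
    private static $identity_map = array();

    public static function find($id)
    {
        if (isset(self::$identity_map[$id])) {
            // Already loaded during this request; no second query
            return self::$identity_map[$id];
        }

        $database = Database::get_current();
        $row = $database->query("SELECT * FROM products WHERE id = " . (int) $id);
        $product = Product::from_row($row);
        self::$identity_map[$id] = $product;
        return $product;
    }
}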
Conclusions and Further Reading
Caching in various forms is an important part of any web application performance strategy. Run benchmarks and analyze which methods will provide the most improvements proportionate to the app's bottlenecks. Start with the caching strategies that will give the biggest boost with the least amount of effort and complexity.
More on caching:
- Using the application cache. Mozilla Developer Network. “HTML5 provides an application caching mechanism that lets web-based applications run offline. Developers can use the Application Cache (AppCache) interface to specify resources that the browser should cache and make available to offline users.”
- Kleppmann, Martin. Rethinking caching in web apps. “My hope is that we can develop ways of better managing scale (in terms of complexity, volume of data and volume of traffic) while keeping our applications nimble, easy and safe to modify, test and iterate.”