Apache Week
   
   Issue 73, 11th July 1997:  

Copyright ©2020 Red Hat, Inc



Apache Status

Release: 1.2.1 (Released 6th July 1997) (local download sites)
Beta: None

Bugs in 1.2.1:

  • Solaris systems can fail to restart on a SIGHUP. This appears to be a bug in Solaris which should be fixed in Solaris 2.6. For more details, workarounds and a patch see known bugs. This will be fixed in 1.3.
  • Content negotiation may fail to pick the smallest of equally acceptable variants. This will be fixed in 1.3.

Patches to Apache 1.2 bugs will be made available in the "apply to 1.2.1" directory on the Apache site. Some new features and other unofficial patches are available in the "1.2 patches" directory. For details of all previously reported bugs, see the Apache bug database and Known Bugs page. Many common configuration questions are answered in the Apache FAQ.


Unless otherwise noted, all the new features discussed here are planned for Apache 1.3 and not Apache 1.2.1.

More Use of Regular Expressions

The Alias, ScriptAlias and Redirect directives map incoming URLs onto a file or another URL. The incoming URL is given as a simple partial match, so for example,

  Alias /icons/ /usr/web/icons/

maps /icons/banner.gif onto /usr/web/icons/banner.gif. But it is difficult to do things like map (say) all images onto a different server. Although this can be done with the optional rewrite module, the syntax for this module is quite complex. A new simpler way of matching URLs will be implemented in Apache 1.3.

This will use Unix "regular expressions" to match the incoming URL. This gives a lot of flexibility, especially since parts of the incoming URL can be included in the resulting filename or URL (instead of just the trailing part). Three new directives implement this: AliasMatch, ScriptAliasMatch and RedirectMatch. They perform the same function as their counterparts without the "Match", but use regular expressions for the first argument and can include replacement tokens in the second argument.

For example, to map all requests for .gif files onto a different server you could use

  RedirectMatch (.*)\.gif$ http://www.img_server.com$1.gif

The first argument is the regular expression to match against the incoming URL. The .* means match any number of characters, while the \.gif$ matches the text ".gif" at the end of the URL only. Because the expression tries to match the longest part it can, the .* bit will match the whole initial part of the request, from the initial / onwards. Finally the brackets ( ) mark the text that matches for use in the second argument.

The second argument gives the replacement URL. The $1 part is replaced by the text that matched within the brackets in the first argument. So, for example, if the incoming URL was

  /about/head.gif

then the first argument would match (because it ends in .gif), and the bracketed part would match the text "/about/head" and call that match $1. In the second argument the $1 will be replaced with this text, giving a redirected URL of

  http://www.img_server.com/about/head.gif
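
The substitution can be traced outside Apache with an ordinary regular-expression library. The following is a minimal Python sketch, not Apache's own code: the function name redirect_match is hypothetical, Python writes the back-reference as \1 where the directive uses $1, and Python's expression syntax is only assumed to agree with Apache's for this simple pattern.

```python
import re

# Pattern and target taken from the RedirectMatch example above.
PATTERN = re.compile(r"(.*)\.gif$")
TARGET = r"http://www.img_server.com\1.gif"  # \1 plays the role of $1


def redirect_match(url):
    """Return the redirect URL for `url`, or None if the pattern does not match."""
    match = PATTERN.search(url)
    if match is None:
        return None
    # expand() substitutes the bracketed group into the target template.
    return match.expand(TARGET)


print(redirect_match("/about/head.gif"))    # matches, so a redirect URL is built
print(redirect_match("/about/index.html"))  # does not end in .gif, so None
```

A request that does not match is left alone, which is why the second call yields None rather than a rewritten URL.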

The directives <Directory>, <Location> and <Files> can already use regular expressions, indicated by a ~ (tilde) as the first argument, followed by the expression. For consistency there are now additional directives <DirectoryMatch>, <LocationMatch> and <FilesMatch> which take just a regular expression argument.

Directory Indexing Split into Two Modules

At present, the mod_dir module handles directory indexes. It actually does two very different things, each individually controllable:

  • It can return an automatic index of a directory as HTML, configured by the Indexes option.
  • It can map an incoming request for a directory onto a filename (typically index.html or index.cgi).

Most of the code in mod_dir deals with the first action, which is quite complex. The second part is much simpler. Many sites require the second part but do not need the first (in fact, the first can expose files which should not be displayed to the user, so it is more secure to not use directory indexes). In Apache 1.3 these two functions have been split into two separate modules. This means people who need the index.html functionality but not the auto-indexing can reduce the size of their executable by removing the auto-indexing module.

The auto-indexing code has been removed from mod_dir (which now just handles index.html style functionality), and placed into the new module mod_autoindex.

Turning off Hostname Lookups

Two weeks ago we reported that 1.3 will ship with a configuration file containing the HostnameLookups off directive. Currently the hostname lookups default to on. The main effect of this - besides better performance - will be that the log file will contain IP numbers instead of hostnames. At the moment, setting HostnameLookups off in 1.2.1 or earlier will also affect access restrictions based on hostname (such as allow from .nasa.gov). In 1.3 this will work even if hostname lookups are set to off. If Apache sees a hostname in an allow or deny directive it will convert the browser's IP address into the corresponding hostname. This means it is quite safe to set hostname lookups to off in 1.3 without affecting existing access restrictions.
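
The 1.3 behaviour described above amounts to resolving the client's name lazily, only when a rule actually needs it. A rough Python sketch of that idea follows. It is not Apache's implementation: the function is_allowed, the rule format and the stub resolver are all hypothetical, and a real server would perform a DNS reverse lookup where the stub is called.

```python
def is_allowed(client_ip, allow_rules, reverse_lookup):
    """Check client_ip against 'allow from' style rules.

    Rules starting with a digit are treated as IP prefixes; anything else
    is treated as a hostname suffix such as ".nasa.gov" (a simplification).
    reverse_lookup is only called if a hostname rule is reached.
    """
    hostname = None
    for rule in allow_rules:
        if rule[0].isdigit():
            if client_ip.startswith(rule):
                return True
        else:
            if hostname is None:
                hostname = reverse_lookup(client_ip)  # lazy, at most once
            if hostname.endswith(rule):
                return True
    return False


# Stub resolver with made-up data, standing in for a real DNS lookup.
calls = []


def stub_lookup(ip):
    calls.append(ip)
    return {"192.0.2.7": "host.nasa.gov"}.get(ip, "")


print(is_allowed("192.0.2.7", [".nasa.gov"], stub_lookup))  # hostname rule
print(is_allowed("192.0.2.7", ["192.0.2."], stub_lookup))   # IP rule, no lookup
print(len(calls))  # number of lookups actually performed
```

Sites whose rules are all IP based never pay for a lookup in this scheme, while hostname rules keep working with lookups otherwise switched off.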

Better Support for 64 Bit Systems

At some places in the code, Apache uses variables or arguments which can take either an integer value or a pointer value. These are actually stored as pointers, then cast to the correct type when used. On most systems this is not a problem, since both ints and pointers are stored in the same sized locations (32 bit). However newer systems may use 64 bits for one or more of these types. There is a risk that if the size of an integer is larger than the size of a pointer, data will be lost, and the code will often cause compilation warnings about data type sizes. From 1.3 onwards, Apache's internal code will use a special "generic" data type which is defined to be large enough for whatever data is stored within it. Although a typical way of doing this would be to use a union of all the data types, this would slow down function calls, so Apache uses a type which can be passed by value to functions. This may affect the module API for 1.3.

Unbuffered CGI

Normally the output of a CGI script is "buffered". That is, Apache reads the output and sends it out when it has got enough, or when the CGI program exits. This is good for performance of the server and the network, but might be undesirable in some situations. For example, if you have a long running CGI program you cannot currently send back a line or two to the user telling them to "please wait....", and a search engine cannot display results as it finds them.

Actually there is a way to do both of those, called "nph" scripts. This is an old system where the CGI output is sent straight back to the client without buffering. NPH actually stands for "Non-parsed Headers", because NPH scripts must also send back all the required HTTP response headers. Given that there are now three different versions of HTTP, and that HTTP/1.1 adds a lot of new requirements, writing a compliant NPH script is very difficult. So using NPH is not recommended.

Recent changes to the 1.3 code will make it possible to have unbuffered scripts without having to use NPH.
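
On the script side, an ordinary (non-NPH) CGI program that wants its output delivered in stages only has to flush after each piece. The Python sketch below illustrates this under stated assumptions: the function run and its parameters are hypothetical, the sleep stands in for real work, and whether the client sees each piece promptly depends on the server passing output through as it arrives.

```python
import sys
import time


def run(out=sys.stdout, steps=3, delay=0.2):
    """Emit a CGI response in stages, flushing after each one."""
    # Ordinary CGI header block: the server supplies the HTTP status line
    # and the remaining headers, unlike an NPH script.
    out.write("Content-Type: text/plain\r\n\r\n")
    out.write("Please wait....\n")
    out.flush()  # hand the first line to the server immediately

    for step in range(1, steps + 1):
        time.sleep(delay)  # stands in for slow work such as a search
        out.write("result %d\n" % step)
        out.flush()  # push each result out as soon as it is ready


if __name__ == "__main__":
    run()
```

Because the script sends only a Content-Type header, it does not need to know which version of HTTP the client speaks, which is the main burden of writing an NPH script.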


Using Apache is not Rocket Science

But rocket scientists at JPL use Apache. The latest news about the PathFinder mission to Mars is being made available by JPL on their website. As might be expected for such a high-profile site, it generated a lot of traffic. Since the touchdown last weekend, the web servers used by JPL have changed quite a bit. To offload their servers there are many mirrors of the PathFinder site around the world, including some high-capacity sites run by SGI, Sun and others. These tend to use the corporate vendor's own server, or one they have a commercial relationship with (so, for example, SGI's PathFinder site runs a Netscape server).

However, internally JPL uses Apache servers for its web sites. In fact the main JPL site at www.jpl.nasa.gov is running Apache. This is actually handled by Sun Ultra 1s running Solaris and Apache. These servers were initially handling the PathFinder site as well, but when they became overloaded another server was set up by SGI (mpfwww.jpl.nasa.gov, using Netscape Enterprise server).


Hints for High Hit-Rate Sites

Like many other popular sites, the JPL site at www.jpl.nasa.gov gets a lot of hits. With several million hits per day, they need a server which can cope with more than 50 hits per second. With suitable hardware and some configuration, Apache can easily handle this sort of load on a single system. Combined with multiple servers, Apache can also scale to huge numbers of hits. The JPL main site is currently getting about 6,000,000 hits per day, split across two servers (3 million hits per day per server). The servers are Sun Ultra 1s with 256Mb of memory each, and Apache on this hardware has no problems with up to 5 million hits per day.
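
To relate the daily figures to a per-second rate, divide by the 86,400 seconds in a day. The short Python sketch below does this for the numbers quoted above; the function name hits_per_second is our own, and the results are averages, so real traffic will peak above them.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def hits_per_second(hits_per_day):
    """Average request rate implied by a daily hit count."""
    return hits_per_day / SECONDS_PER_DAY


# Daily figures quoted in the text above.
for label, per_day in [
    ("whole site", 6_000_000),
    ("per server", 3_000_000),
    ("5 million figure", 5_000_000),
]:
    rate = hits_per_second(per_day)
    print("%-17s %9d hits/day = %5.1f hits/sec" % (label, per_day, rate))
```

This gives roughly 69 hits per second for the whole site, 35 per server, and 58 for the 5 million hits per day that a single machine is reported to handle.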

The key to handling high hit-rates with Apache is to ensure that there is enough memory to run the concurrent child processes in RAM without swapping. In this case, 256Mb per server allows for well over 500 concurrent servers (i.e. 500 concurrent clients). Besides memory, the configuration files and operating system should be adjusted for maximum performance (although this is significantly less important than the amount of physical memory). Adjustments should include:

  • Remove all modules you do not use from the running executable
  • Turn off looking for .htaccess files with AllowOverride None
  • Reduce timeouts
  • Increase the listen queue size if necessary
  • Do not read from or write to any NFS mounted disks (especially not for log files)
  • Configure the operating system for large numbers of file descriptors
  • Increase the number of requests per child with MaxRequestsPerChild
  • Increase the number of children started and ensure that there are enough spare children to handle sudden bursts of requests
  • Increase the maximum number of servers with MaxClients (this will also require recompilation with a larger HARD_SERVER_LIMIT)
  • Turn off DNS lookups with HostnameLookups Off. Ensure that all host-based restrictions are done by IP number (until 1.3 comes out)
  • Use Apache 1.2.0 or 1.2.1 which are much more efficient (both on the server and on the network) than early betas of 1.2.

A future release of Apache will be multithreaded (this might be in version 1.4 or 2.0, depending on how development goes). The use of multithreading rather than multiple processes may reduce the amount of memory needed to run Apache efficiently, but probably not by a huge amount. Although each Apache executable is often around 700kb to 1Mb in size, most of this is executable code which is shared between all the processes.
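
Because the executable code is shared, what limits the number of children is the private (unshared) memory each one needs. The Python sketch below shows the back-of-envelope sum. Only the 256Mb and 500 children figures come from this article; the 32Mb reserved for the operating system and the 400kb of private memory per child are hypothetical values chosen purely for illustration, and should be replaced by measurements from your own system.

```python
def budget_per_child_kb(ram_mb, children):
    """Memory available per child if RAM were divided evenly, in kb."""
    return ram_mb * 1024 / children


def max_children(ram_mb, reserved_mb, private_kb_per_child):
    """Children that fit in RAM without swapping, counting private memory only."""
    usable_kb = (ram_mb - reserved_mb) * 1024
    return int(usable_kb // private_kb_per_child)


# Figures from the article: 256Mb per server, 500 concurrent children.
print(budget_per_child_kb(256, 500))  # about 524kb available per child

# Hypothetical inputs: 32Mb kept for the OS, 400kb private memory per child.
print(max_children(256, 32, 400))
```

Under those assumed inputs the limit comes out a little above 500 children, which is in line with the figure quoted for the JPL servers, but the result is very sensitive to the per-child value.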


Apache in the News

InfoWorld reviewed four web server "solutions", including Apache running on a dual Pentium P6/200 system. The review, "Web platform solutions - Big Blue deja vu", compared Apache on RedHat Linux, Microsoft IIS on NT, Netscape Enterprise on NT and IBM's Internet Connection Secure Server on AIX. They gave Apache last place, and recommended the IBM solution.

The most important part of this review is a performance test which showed Apache having serious problems coping with high loads. This should not have been the case, since Apache can cope with very high load given correct configuration, hardware and software. There are several reasons why they might have had problems at high loads:

  • The hardware used was well under a third the price of the hardware used for the other solutions
  • They used Apache 1.1.3, whereas Apache 1.2.1 includes many optimisations for efficient use of server and network resources
  • The operating system used a single CPU rather than the two available in the hardware
  • The amount of memory was not specified, but may not have been enough for the number of concurrent clients. Much more memory could have been installed in the system and still kept the server cost under a third of the other solutions.
  • The configuration may not have been optimised. They listed the extensive optimisations applied to the other servers (including things like disabling SNMP management and web publishing of Enterprise, despite criticising Apache for not having SNMP management or content management).
  • The tests were (presumably) carried out under laboratory conditions on a fast local area network. Apache is designed for use on real-world internet connections with long latency times, badly behaved clients, etc.

It is ironic that Apache loses out for not having the bells-and-whistles of some other servers, but when it comes to performance they had to disable these features! And of course, as the JPL site shows, in the real world Apache can easily cope with hit rates of 5 million a day (57 per second) with some minor tuning and adequate hardware.