
Writing Apache's Logs to MySQL
by 
02/10/2005


In the first article of this series, I discussed using Apache's built-in logging directives to record server performance metrics. By recording performance metrics, web server administrators gain a historical record of how the server handled incoming HTTP requests. This article expands on that concept by adding a couple of logging directives and recording the logging data directly in a MySQL database.

Quick Review of Blackbox

The Blackbox format is a custom log format put together with some of the lesser used Apache logging directives. It records data to give server administrators a better idea of how well their web servers are performing.

LogFormat "%a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B" blackbox

In order, this format records the remote IP address, the remote port, the connection status, the time, the request line, the return statuses (before and after redirection), the process ID, the thread ID, the length of time to handle the request (in seconds and microseconds), and the number of bytes in, bytes out, and bytes of content.
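As an illustration, a single record in this format might look like the following (the address, timestamp, and byte counts here are invented, not taken from a real server):

```
192.168.1.5/34517 + [27/Sep/2008:14:33:53 +0800] "GET /index.html HTTP/1.1" 200/200 1432/0 0/5123 451/2342/2198
```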

Extensions to the Current Blackbox Format

There are a couple of additional fields to add to the Blackbox format to satisfy larger web server environments.

Virtual Host (%v)

The Blackbox format records the performance metrics for a given server instance. If your server handles virtual hosts, you can record the Blackbox data for every virtual host.

Here's an example of a configuration file using three virtual hosts.


<VirtualHost *>
    ServerName app.example.com
    CustomLog /var/opt/apache2/logs/appserver common
    CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>

<VirtualHost *>
    ServerName images.example.com
    CustomLog /var/opt/apache2/logs/imageserver common
    CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>

<VirtualHost *>
    ServerName internal.example.com
    CustomLog /var/opt/apache2/logs/intranet common
    CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>


Every host writes out two logs: one in the custom log file format and another in the Blackbox format. The Blackbox log for each host is actually the same file. This makes sense for centralizing performance metrics, but it's difficult to identify which hits come from which virtual host. The simplest way to fix this is to add the %v directive to the format.

Unique ID (%{UNIQUE_ID}e or %{forensic-id}n)

Apache has a module known as mod_unique_id. It creates a unique, 19-character, Base64-encoded string to identify every hit made on the server. The string comes from the source IP, the process ID, the timestamp, and a subsecond counter. You can access the value using the environment variable %{UNIQUE_ID}e.

Adding this directive to the Blackbox format ensures that each hit record is unique. It also provides the ability to match hits in the Blackbox log file to specific hits in another log file.

For example, Apache 2.0.52 introduced a new logging module known as mod_log_forensic. When a request comes in, the module immediately writes out the incoming client request headers. Then at the logging phase, the module writes another entry to show that the request has ended. The module prefixes each entry with the UNIQUE_ID environment variable.

By combining the mod_log_forensic module with the Blackbox format, you can record the client input and the server performance metrics. The UNIQUE_ID gives you a unique key so you can trace the log entries back to their forensic log input records.

If you decide to use mod_log_forensic, you have two options for recording the Base64 id string: %{UNIQUE_ID}e or %{forensic-id}n. The mod_log_forensic module generates the latter. They will both be the same value, but the Apache documentation makes the vague suggestion that they might be different in the Apache 2.1 code branch. For now, I recommend %{UNIQUE_ID}e, because it's usable with or without the mod_log_forensic module.

Updated format

The updated configuration for the Blackbox format looks like:

LogFormat "%v %a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P \
	%T/%D %I/%O/%B %{UNIQUE_ID}e" blackbox

This may look a little strange, but it's an attempt to make the format similar to the established conventions. The Apache common+virtual host logging format puts the %v directive first, and all examples showing the UNIQUE_ID variable put it in the last field of the format.

The Perl regular expression for extracting records from this format (applied with the /x modifier, so literal spaces in the pattern must be escaped) is:

$Fmt=<<EOFMT;
^(\\S+)\\ ([^/]+)/(\\d+)\\ (\\S)\\ \\[([^:]+):(\\d+:\\d+:\\d+)\\ ([^\\]]+)\\]
\\s"(\\S+)\\ (.*?)\\ (\\S+)"\\ (\\d+)/(\\d+)\\ (\\d+)/(\\d+)\\s
(\\d+)/(\\d+)\\ (\\d+)/(\\d+)/(\\d+)\\ (\\S+)\$
EOFMT
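If you prefer to prototype the extraction outside Perl, the same verbose-mode pattern carries over to Python's re module. The log line below is an invented sample (the hostname, address, and unique ID are made up for illustration):

```python
import re

# The Blackbox pattern transcribed into Python's re.VERBOSE mode:
# escaped spaces match literal spaces, other whitespace is ignored.
BLACKBOX = re.compile(r"""
^(\S+)\ ([^/]+)/(\d+)\ (\S)\ \[([^:]+):(\d+:\d+:\d+)\ ([^\]]+)\]
\s"(\S+)\ (.*?)\ (\S+)"\ (\d+)/(\d+)\ (\d+)/(\d+)\s
(\d+)/(\d+)\ (\d+)/(\d+)/(\d+)\ (\S+)$
""", re.VERBOSE)

# An invented sample record in the updated format (%v first, UNIQUE_ID last).
line = ('app.example.com 192.168.1.5/34517 + '
        '[27/Sep/2008:14:33:53 +0800] "GET /index.html HTTP/1.1" '
        '200/200 1432/0 0/5123 451/2342/2198 R2qKeH8AAAEAAFMhBGwA')

m = BLACKBOX.match(line)
fields = m.groups()   # 20 captured fields, vhost through unique ID
```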

Recording Logs in a Database

Flat files are the simplest way to record hits, but they're impractical for long-term analysis:

  • Processing programs must read the records sequentially.
  • Log writing affects web server performance due to the high number of disk I/O operations.
  • Log rotation must occur regularly.
  • Log retention increases the storage requirements.

The completed log files feed into programs designed to track hits on the web server. These programs parse the log files and create reports so that administrators can track their visitors. Hit-tracking programs usually fall within one of two categories.

Programs in the first category read the log files whenever someone requests a report. These programs are usually small-scale reporting programs or custom scripts that don't deal with a large number of hits. Report generation can take a long time, but the programs can run immediately against one or more log files--even currently active ones.

In the second category, programs import the log files into a database and generate reports from the imported data. A number of high-end commercial tracking programs fall into this category. Report generation is usually much quicker, but it can run only on imported data.

In both cases, the software needs to read the log file in order to build a report or populate a database. If we were to log our hits directly to a database, we could avoid the initial overhead. Third-party modules are available that allow Apache to send logging data directly to a database.

mod_log_mysql is an Apache 2.0 module that writes logging data directly to a MySQL database. Another module also sends data to a MySQL database. Both do essentially the same thing in different ways. In checking out both modules, I found that mod_log_mysql more ably handles all of the Blackbox logging directives.

Writing to a MySQL database is a pretty radical shift from the tried-and-true method of just hammering out text log files. Some people may welcome the change, but others may not be so accepting.

Here's a quick rundown on some of the pros and cons. I don't mean this as a critique or praise for mod_log_mysql, but it brings up issues to consider when recording logs to a database. Some of these issues affect any remote log file writing solution, not just mod_log_mysql.

Pros

  1. mod_log_mysql can write logs directly to a networked database server. This provides considerable savings on local disk I/O operations. Also, one central database server can handle multiple Apache servers.
  2. If writing to the local disk concerns you, this module moves one step closer to avoiding disk writes. Combining mod_log_mysql with a syslog facility in your ErrorLog directive will drastically reduce the amount of disk writing your server will perform. Technically, you still have the PID file and the scoreboard file, but you can always write that data to the /tmp filesystem.
  3. You need no more parsing scripts. You won't have to read in records sequentially, and you won't need to extract fields using regular expressions. Instead, you can use SQL commands to extract only the fields and records that you want. This makes ad hoc reporting much easier and faster.
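To expand on point 2 above, sending the error log to syslog takes a single directive in httpd.conf (local7 here is an arbitrary facility choice, not one prescribed by this article):

```
ErrorLog syslog:local7
```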

Cons

  1. Databases may not be 100 percent trustworthy. It's hard to dismiss this claim, because database logging changes almost ten years of established practice. There's always the chance that the database could fail or become unreachable. mod_log_mysql provides a backup mechanism for database failures, but those backup log files can grow very large very quickly. You need to make sure you catch database failures and address them right away.
  2. mod_log_mysql requires a rebuild of Apache, with the entire logging architecture of mod_log_config rewritten to satisfy the needs of this module. Not many administrators or developers will be too eager to rewrite a core Apache module.
  3. This module works only with a MySQL database. Some high-end web reporting software packages rely on other database platforms (such as Oracle), so importing records will require an export and import operation.
  4. The database isn't anywhere close to fourth normal form. Don't bother optimizing it and don't bother indexing it, because it just won't be worth it in the long run. It's still possible to write a program to grab the records and insert them into a better database structure.

Maintaining a logging database isn't that different from maintaining a collection of text log files. You will still run into the same management and administration issues, but you'll have different procedures to follow.

Run periodic tests to make sure the server is sending logging data to the database. Also, check the size of the database and set up a rotation schedule. Export, aggregate, or back up older data depending on what kind of database environment you work with.
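A rotation pass might look like the following SQL sketch, assuming a hypothetical blackbox_archive table with the same column layout as blackbox (the 30-day window is arbitrary):

```sql
-- Move records older than 30 days out of the live table.
INSERT INTO blackbox_archive
  SELECT * FROM blackbox
  WHERE datetime < date_sub(curdate(), interval 30 day);

DELETE FROM blackbox
  WHERE datetime < date_sub(curdate(), interval 30 day);
```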

Setting up mod_log_mysql

In order to support mod_log_mysql, you need to do the following:

  1. Rebuild your Apache server with the supplied code changes.
  2. Build the database into your MySQL environment.
  3. Configure the logging in your web server configuration.

The mod_log_mysql web page includes the code for the module as well as the replacement mod_log_config.h, mod_log_config.c, mod_logio.c, and apr_reslist.c source files.

If you read Part 1 of this series, you'll remember I introduced a code change to mod_log_config.c to support logging the client source port. If you want to continue to support that directive, you will need to make the following code changes:

Insert the following subroutine in the mod_log_config.c file, anywhere near the other logging functions:

static void log_remote_port(request_rec *r, char *a, ap_log_ehandler_data *d)
{
     d->data = apr_palloc(r->pool, sizeof(ap_log_unumber_t));
     *((ap_log_unumber_t *) d->data) = r->connection->remote_addr->port;
     d->type = AP_LOG_EHANDLER_RETURN_UNUMBER;
}

Insert the following log_pfn_eregister call into log_pre_config:

log_pfn_eregister(p, "S", log_remote_port, 0);

Once the new source code files are in place, do a make clean and restart the build process at the top-level code directory. Once that build is complete, build the mod_log_mysql DSO using the apxs tool.

$ apxs -a -i -c -L/opt/mysql/lib/mysql -I/opt/mysql/include/mysql \
   -lmysqlclient_r -lz mod_log_mysql.c

Building the database

Next, build the database the Apache server will write to. The module expects that you have already created the database, defined all of the fields, and configured each field with the correct data type.

Logging directly to a database is a big change. The limitations of using flat text files no longer apply. I already mentioned that you don't need to worry about log file rotation or local disk issues, but I also want to bring up centralization.

The previous article on Blackbox logs demonstrated how to split the logs across two different files. One was the common log file format for hit tracking; the other was the Blackbox log for auditing server performance. This gave the opportunity to see only the interesting data.

Unfortunately, both log formats wrote down some of the same fields. That approach wasted space for the sake of convenience because there was no easy way to correlate data between the two files.

Writing logs to a database ameliorates that problem. The database table will have all of the fields from the combined log file format and the Blackbox log file format. SQL queries can pull out only the desired records and fields. Here are the SQL commands to build the database structure. They build a database called apachelogs and define one table named blackbox:

CREATE DATABASE apachelogs;

CREATE TABLE blackbox (
  virtualhost text,
  remoteip text,
  remoteport int unsigned,
  serverid text,
  connectionstatus char(1),
  remoteuser text,
  identuser text,
  datetime datetime default NULL,
  requestmethod text,
  url text,
  protocol text,
  statusbeforeredir int unsigned,
  statusafterredir int unsigned,
  processid int unsigned,
  threadid int unsigned,
  seconds int unsigned,
  microseconds int unsigned,
  bytesin int unsigned,
  bytesout int unsigned,
  bytescontent int unsigned,
  useragent text,
  referer text,
  uniqueid text
);

The mod_log_mysql module attempts to send every numeric value to the database as an integer, which makes mathematical operations easier. Pure text fields, such as the URL or the user agent, are variable-length text strings.

After you create the database, create all of the access rights in the MySQL database so the Apache server can write to the database.
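The exact grants depend on your site policy. A minimal sketch for the logwriter/logpass credentials used later in this article, in the MySQL 4.x GRANT syntax of the period, might be:

```sql
GRANT INSERT, SELECT ON apachelogs.* TO 'logwriter'@'localhost'
  IDENTIFIED BY 'logpass';
FLUSH PRIVILEGES;
```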

Be sure to log in to the database using the mysql client so you can test the access credentials you created. Once the database is ready, you need to create the logging format.

Add the following lines to your Apache configuration. Remember, you can split a directive across lines as long as they end with a backslash.

LogFormat "INSERT INTO blackbox SET virtualhost=%v,remoteip=%a,\
        remoteport=%S,serverid=%{SERVERID}e,connectionstatus=%X,\
        remoteuser=%u,identuser=%l,datetime=%t,requestmethod=%m,\
        url=%R,protocol=%H,statusbeforeredir=%s,statusafterredir=%>s,\
        processid=%{pid}P,threadid=%{tid}P,seconds=%T,microseconds=%D,\
        bytesin=%I,bytesout=%O,bytescontent=%B,useragent=%{User-Agent}i,\
        referer=%{Referer}i,uniqueid=%{UNIQUE_ID}e" blackbox

CustomLog mysql:logwriter!logpass@localhost/apachelogs blackbox

The updated mod_log_config file adds a new logging directive, %R. This is a shortcut for the %U%q convention, for people who want to record the URL and the query string.

The mod_log_mysql module connects as a client to the database server and passes along the insert statement exactly as specified in the configuration file. The code modifies the statement to escape strings properly, escaping certain characters to minimize SQL injection attacks.
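As an illustration of the kind of escaping involved (this is not mod_log_mysql's actual code, just a sketch of what a MySQL client library typically does to string values before they go into a statement):

```python
# Backslash-escape the characters that would otherwise terminate or
# alter a quoted MySQL string literal.
_SPECIAL = {"\0": "\\0", "\n": "\\n", "\r": "\\r",
            "\\": "\\\\", "'": "\\'", '"': '\\"'}

def escape_mysql(value):
    return "".join(_SPECIAL.get(ch, ch) for ch in value)

# A hostile URL is neutralized before it reaches the INSERT statement.
stmt = "INSERT INTO blackbox SET url='%s'" % escape_mysql("/a'; DROP TABLE--")
```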

If there are errors in the statement, inspect the MySQL server log to check for bad input. You can also check the Apache error logs for SQL statement errors.

Identifying individual web servers

If you look at the CREATE and INSERT statements in the above examples, you'll notice an additional field. The serverid field is a string used to identify a running server instance.

Centralizing the web server logs creates a tracking problem. There must be a foolproof way to determine which server has handled each entry in the database. Remember, it's possible for the Apache server to run in a wide variety of host environments:

  • Multiple Apache servers on different hosts serving the same web site
  • Multiple Apache servers running on the same host, each listening on a unique IP and serving different web sites
  • One Apache server running on the same host, listening on multiple IPs and serving different web sites

The best way to identify these servers is to create an environment variable for each Apache process pool. Each web server's configuration should create an environment variable that will identify the pool. Here's the relevant snippet from httpd.conf:

SetEnv SERVERID apache1

That one variable needs to identify this server for every possible hit it will answer, whether that hit comes across different IPs in the pool or different virtual hosts.

The other way to address this issue is to have each web server write to a separate table in the same database. That's always an option, but you should still use the serverid field for comparative operations. That way you can merge or split tables later based on the serverid value.

Querying the Database

Queries can run on the database immediately. The examples below are in straight SQL syntax. Some use the nested query syntax available only in MySQL 4.1 or above.

If you intend to write automated scripts in Perl or Python or whatever, it may be easier to run multiple SELECT statements or to compute derived values inside the scripting language itself. Whenever you request reports on web server performance, you'll need to specify the date range you want to cover.

Query: total bytes in and out

You want to find out how much network traffic your web server sends and receives. Keep in mind that this reflects traffic levels only at the application layer of the TCP/IP stack. It won't record the size of TCP or IP packets. That's a question SNMP polling tools can answer better.

select sum(bytesin), sum(bytesout) from blackbox
	where date(datetime) = curdate();

select sum(bytesin),sum(bytesout) from blackbox where datetime >= 
        date_sub(curdate(),interval 7 day);

The first example gives the total bytes in and bytes out for the given day. The second example gives the total for the past seven days.

Query: what percentage of hits goes to RSS syndication pages compared with HTML pages?

This example compares the number of hits on RSS pages with the number of hits on HTML pages.

select (select count(url) from blackbox where url regexp '\\.rss$')/
        (select count(url) from blackbox where url regexp '\\.html$');

RSS hits are typically automated. Because of this, a lot of sites have a higher percentage of pure robot/agent traffic on their site than they may have had two years ago. By checking the RSS hits, you can determine whether the agent traffic is overwhelming your site.

Query: how often do users skip the Flash animation?

If you see a web site with one of those Flash animations on the main page, it usually has a "Skip" link at the bottom. If you're going to include one, throw in a meaningless query string at the end of the link, so you can determine which hits are redirects after the animation has finished and which ones come from people clicking on the link.

select (select count(url) from blackbox where url="/home.html?skipped"),
        (select count(url) from blackbox where url="/home.html");

Verifying that the database is working

Here's a two-part script you can run to verify that the database really is logging hits from the servers. First, add this script to the cgi-bin directory of every web site that uses mod_log_mysql.

#!/usr/local/bin/perl

# returnid.pl

use strict;
use warnings;

print("X-ServerId: $ENV{'SERVERID'}\n");
print("X-UniqueId: $ENV{'UNIQUE_ID'}\n");
print "Content-type: text/plain\n\n";

print("Server id is $ENV{'SERVERID'}\n");
print("Unique id is $ENV{'UNIQUE_ID'}\n");

exit 0;

The server will return its serverid and unique ID for that particular hit. The next program will act as an HTTP client. It will retrieve this URL, then connect to the database and search the logs for the unique ID.

#!/usr/local/bin/perl

# checkid.pl

#
# Pragmas
#

use strict;
use warnings;

#
# Includes
#

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

use DBI;

#
# Global Variables
#

use vars qw/$Url $Agent $Db $ServerId $UniqueId /;

use constant {
  DSN         => "DBI:mysql:database=apachelogs",
  DBUSER      => "logwriter",
  DBPASS      => "logpass",
  QUERYSTRING => 
   "select datetime,uniqueid from blackbox where uniqueid=? and serverid=?",
  DEFAULTURL  => "",
};

#
# Subroutines
#

sub getId
{
    my ($agent,$url) = @_;
    my $request      = HTTP::Request->new(GET => $url);
    my $response     = $agent->request($request);

    if ($response->is_success)
    {
        my $serverid=$response->header("X-ServerId");
        my $uniqueid=$response->header("X-UniqueId");
        print("Unique ID from server is $uniqueid\n");
        print("Server ID is $serverid\n");
        return ($serverid,$uniqueid);
    }

    return;
}

sub findId
{
    my ($db,$serverid,$uniqueid) = @_;

    my $query                    = $db->prepare(QUERYSTRING);

    $query->execute($uniqueid,$serverid);

    if (my ($dt,$uid)=$query->fetchrow_array())
    {
        print("Database recorded $uid at $dt\n");
    } else {
        print("Can't find a database record for unique-id $uniqueid\n");
    }

    return;
}

#
# Main Block
#

# Initialize the user agent
$Agent = LWP::UserAgent->new();

# Initialize the database client
$Db = DBI->connect(DSN,DBUSER,DBPASS);

# URL
$Url = shift || DEFAULTURL;

if (($ServerId,$UniqueId) = getId($Agent,$Url))
{
        findId($Db,$ServerId,$UniqueId);
} else {
        print("Unable to get data from webserver\n");
        exit 1;
}

If you run this program at a regular polling interval, it will warn you when the remote database is not responding or if the Blackbox table is not recording hits from the web servers.
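For example, a crontab entry could run the check every ten minutes (the URL is a placeholder for whichever virtual host you want to poll):

```
*/10 * * * * /usr/local/bin/checkid.pl http://app.example.com/cgi-bin/returnid.pl
```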

Final Thoughts

If you've read the first article, you should already understand why you want to log your server performance data. The core concepts are still the same; I'm just introducing a few variations on improving the process.

The two new logging directives provide more flexibility in virtual hosting environments. They also make it possible to keep just one Blackbox log file for each running server.

If you want to take the really big step, consider the option of writing your logs straight to a database. The initial setup process may be complex, but after that there is a huge administrative benefit. It's an ideal solution for dealing with large server farms.

 works as a system administrator for Internet Broadcasting.



 Database logging modules very bad idea
2005-02-21 22:14:31  Coyote-Moon

It is a very bad idea to use an Apache module that logs directly to the database. I know there are countless implementations, but most are toys that don't handle even the most trivial real world issues such as connection pooling and temporarily inaccessible databases. Screw it up and cycling your database (e.g., to apply a security upgrade) could cause your web server to hang. Oops.

A far better design is to use "piped logs" that write log messages to a separate process's stdin. This gives you clean separation between Apache and the database-enabled logger, and there are ways to ensure the pipe is never blocked. (E.g., one thread copies stdin to an internal buffer, while a second thread consumes that buffer to feed the database. This allows you to support bursts that far exceed your ability to write to the database.)
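The two-thread buffer this comment describes can be sketched in a few lines (Python here for brevity; the sink append is a stand-in for the database INSERT, and nothing here is mod_log_mysql's actual code):

```python
import io
import queue
import threading

def pump(stream, sink):
    """Copy lines from stream to sink through an internal buffer:
    a reader thread drains the input so the pipe never blocks, and
    a writer thread consumes the buffer at the database's pace."""
    buf = queue.Queue()

    def reader():
        for line in stream:
            buf.put(line)
        buf.put(None)                  # end-of-input marker

    def writer():
        while True:
            line = buf.get()
            if line is None:
                break
            sink.append(line)          # stand-in for an INSERT statement

    threads = [threading.Thread(target=reader),
               threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Demonstrate with an in-memory stream instead of a real log pipe.
log = []
pump(io.StringIO("GET /a HTTP/1.1\nGET /b HTTP/1.1\n"), log)
```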

Could these tricks be done within an Apache module? Probably, but you would have to get it working reliably within a far more complex environment. Why not simplify your life and use a separate process?

P.S., it shouldn't be hard to normalize your data before writing it to the logs, it's not like much needs to be done. In my implementation the only thing I needed to handle is a separate IP table that has the canonical host name, RBL information, etc. It's trivial to support with a placeholder and separate thread every time a previously unknown IP address is seen.

