
Tuesday, January 17, 2006

Optimizing PHP - a first look (part one)

By Leon Atkinson

One reason I like PHP is that it gives you the freedom to create Web applications quickly, without worrying about following all the rules of proper design. When it comes to rapid prototyping, PHP shines. With this power, though, comes the responsibility to write clean code when it's time to build something longer-lasting. Sticking to a style guide helps you write understandable programs, but eventually you will write code that doesn't execute fast enough.

Optimization is the process of fine-tuning a program to increase speed or reduce memory usage. Memory usage is not as important as it once was, because memory is relatively inexpensive. However, shorter execution times are always desirable.

There are many tips for writing efficient programs, and I can hardly discuss them all here. And anyway, that would be like giving you a fish instead of teaching you to catch your own. This month, I will review the techniques I use for speeding up my PHP scripts.

When to optimize


Before you write a program, commit yourself to writing clearly at the expense of performance. Follow coding conventions, such as using mysql_fetch_row instead of mysql_result. But keep in mind that programming time is expensive, especially when programmers must struggle to understand code. The simplest solution is usually best.

When you finish a program, consider whether its performance is adequate. If your project benefits from a formal requirements specification, refer to any performance constraints. It's not unusual to include maximum page load times for Web applications. Many factors affect the time between clicking a link and viewing a complete page. Be sure to eliminate factors you cannot control, such as the speed of the network.

If you determine that your program needs optimization, consider upgrading the hardware first. This may be the least expensive alternative. In 1965, Gordon Moore observed that the number of transistors that could fit on a chip doubled roughly every year, a trend now known as Moore's Law and since revised to about every two years. Despite this steep increase in power, the cost of computing drops over time: clock speeds keep climbing while CPU prices remain relatively stable. Upgrading your server is likely less expensive than hiring programmers to optimize the code.

After upgrading hardware, consider upgrading the software supporting your program. Start with the operating system. Linux and BSD Unix have the reputation of squeezing more performance out of older hardware, and they may outperform commercial operating systems, especially if you factor in server crashes.

If your program uses a database, consider the differences between relational databases. If you can do without stored procedures and sub-queries, MySQL may offer a significant performance enhancement over other database servers. Check out the benchmarks provided on their Web site. Also, consider giving your database server more memory.

Two Zend products can help speed execution of PHP programs. The first is the Zend Optimizer, which optimizes PHP code as it passes through the Zend Engine and can run PHP programs 40% to 100% faster than without it. Like PHP, the Zend Optimizer is free. The next product to consider is the Zend Cache, which provides a further gain over the optimizer by keeping compiled code in memory; some users have experienced 300% improvements. Contact Zend to purchase the Zend Cache.

Measuring performance

Before you can begin optimizing, you must be able to measure performance. The two tools I'll discuss are inserting HTML comments and using Apache's ApacheBench utility. PHP applications run on a Web server, but the overhead added by serving HTML documents over a network should be factored out of your measurements.

You need to isolate the server from other activity, perhaps by barring other users or even disconnecting it from the network. Running tests on a server that's providing a public site may give varying results, as traffic changes during the day. Run your tests on a dedicated server even if the hardware doesn't match the production server. Optimizations made on slower hardware should translate into relative gains when put into production.

The easiest method is inserting HTML comments into your script's output. This adds to the overall weight of the page, but it doesn't disturb the display. I usually print the output of the microtime function with a line like this:
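<?php printf("<!-- %s -->\n", microtime()); ?>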

I place these calls to microtime at the beginning, at the end, and at key points inside my script. To measure performance, I request the page in a Web browser and view the source.
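The page source might then contain comments like these (the values below are invented for illustration, chosen to match the timings discussed next):

<!-- 0.25540200 1137512345 -->
<!-- 0.26042300 1137512345 -->
<!-- 0.29045100 1137512345 -->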

The microtime function returns a string containing two numbers separated by a space: the first is the fraction of the current second, and the second is the number of seconds elapsed since January 1, 1970. You can add the two numbers together and collect the results in an array, but I prefer to minimize the effect on performance by doing the calculation outside of the script. In the example above, the first part of the script takes approximately 0.005 seconds, and the second part takes 0.03.

If you decide to calculate time differences inside the script, consider the method used in the example below. Entries in the clock array contain a one-word description followed by the output of microtime. The explode function breaks each entry into its three values so the script can display a table of timing figures. The first column of the table holds the number of seconds elapsed since the previous entry.
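A minimal sketch of this approach (the section labels and the work between readings are invented for illustration):

<?php
$clock = array();
$clock[] = "start " . microtime();

// ... first part of the script ...
$clock[] = "queries " . microtime();

// ... second part of the script ...
$clock[] = "display " . microtime();

print("<table>\n");
$lastTime = 0;
foreach ($clock as $entry) {
    // each entry holds a label, the fractional seconds, and the whole seconds
    list($label, $usec, $sec) = explode(" ", $entry);
    $time = (float)$sec + (float)$usec;
    $elapsed = ($lastTime > 0) ? ($time - $lastTime) : 0;
    printf("<tr><td>%.4f</td><td>%s</td></tr>\n", $elapsed, $label);
    $lastTime = $time;
}
print("</table>\n");
?>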

Inserting HTML comments is my favorite method, because it takes no preparation. But its big weakness is a small sample size. I always try three or four page loads to eliminate any variances due to caching or periodic server tasks.

The Apache Web server includes a program that addresses this problem by measuring the number of requests your server can handle. It's called ApacheBench, but the executable is "ab". ApacheBench makes a number of requests to a given URL and reports on how long it took. Here's an example of running 1000 requests for a plain HTML document:

~> /usr/local/apache/bin/ab -n 1000 http://localhost/test.html

This is ApacheBench, Version 1.3c <$Revision: 1.1.2.6 $> apache-1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2000 The Apache Group, http://www.apache.org/

Server Software:        Apache/1.3.19
Server Hostname:        localhost
Server Port:            80

Document Path:          /test.html
Document Length:        6 bytes

Concurrency Level:      1
Time taken for tests:   5.817 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      262000 bytes
HTML transferred:       6000 bytes
Requests per second:    171.91
Transfer rate:          45.04 kb/s received

Connnection Times (ms)
              min   avg   max
Connect:        1     1    11
Processing:     3     3    16
Total:          4     4    27

I requested an HTML document to get an idea of the baseline performance of my server. Any PHP script ought to be slower than an HTML document. Comparing the figures gives me an idea of the room for improvement. If I found my server could serve a PHP script at 10 requests per second, I'd have a lot of room for improvement.

Keep in mind that I'm running ApacheBench on the server. This eliminates the effects of moving data over the network, but ApacheBench uses some CPU time. I could test from another machine to let the Web server use all the system resources.

By default, ApacheBench makes one connection at a time. If you use 100 for the -n option, it connects to the server one hundred times sequentially. In reality, Web servers handle many requests at once, so use the -c option to set the concurrency level. For example, -n 1000 -c 10 makes one thousand requests with 10 active at all times. Raising the concurrency level usually reduces the number of requests per second the server can handle, although at low concurrency levels it can actually increase throughput, because a server handling one request at a time spends much of its time waiting on hardware, such as the hard disk.
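For instance, the earlier test could be rerun with ten concurrent requests:

~> /usr/local/apache/bin/ab -n 1000 -c 10 http://localhost/test.html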

The ApacheBench program is a good way to measure overall change without inconsistencies, but it can't tell you which parts of a script are slower than others. It also includes the overhead involved with connecting to the server and negotiating for the document using HTTP. You can get around this limitation by altering your script. If you comment out parts and compare performance, you can gain an understanding of which parts are slowest. Alternatively, you may use ApacheBench together with microtime comments.

Whichever method you use, be sure to test with a range of values. If your program uses input from the user, try both the easy cases and the difficult ones, but concentrate on the common cases. For example, when testing a program that analyzes text from a textarea tag, don't limit yourself to typing a few words into the form. Enter realistic data, including large values, but don't bother with values so large they fall out of normal usage. People rarely type a megabyte of text into a textarea, so if performance drops off sharply, it's probably not worth worrying about.

Remember to measure again after each change to your program, and stop when you achieve your goal. If a change reduces performance, return to an earlier version. Let your measurements justify your changes.

Attacking the slowest parts

Although there are other motivations, such as personal satisfaction, most people optimize a program to save money. Don't lose sight of this as you spend time increasing the performance of your programs. There's no sense in spending more time optimizing than the optimization itself saves. Optimizing an application used by many people is usually worth the time, especially if you benefit from licensing fees. It's hard to judge the value of an open-source application you optimize, but I find work on open-source projects satisfying as recreation.

To make the most of your time, try to optimize the slowest parts of your program, where you stand to gain the most. Generally, you should try to improve algorithms by finding faster alternatives. Computer scientists use a special notation called big-O notation to describe the relative efficiency of an algorithm. An algorithm that must examine each input datum once is O(n). An algorithm that must examine each element twice is still O(n), because constant factors are ignored. A really slow algorithm might be O(n^2), or O of n-squared; a relatively fast one might be O(n log n), n times the logarithm of n. This subject is far too complex to cover here -- you will find plenty of information on the Internet and in university courses -- but understanding it may help you choose faster algorithms.
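As an illustration (not from the original article), here are two ways to check whether an array contains a duplicate value. The first compares every pair of elements and is O(n^2); the second uses an associative array as a set of values already seen and is roughly O(n), assuming the values are strings or integers:

<?php
// O(n^2): compare every pair of elements
function has_duplicates_slow($items)
{
    $n = count($items);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            if ($items[$i] == $items[$j]) {
                return true;
            }
        }
    }
    return false;
}

// O(n): remember each value as an array key and check for repeats
function has_duplicates_fast($items)
{
    $seen = array();
    foreach ($items as $item) {
        if (isset($seen[$item])) {
            return true;
        }
        $seen[$item] = true;
    }
    return false;
}
?>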

Permanent storage, such as a hard disk, is much slower to use than volatile storage, such as RAM. Operating systems compensate somewhat by caching disk blocks to system memory, but you can't keep your entire system in RAM. Parts of your program that use permanent storage are good candidates for optimization.

If you are using data stored in files, consider using a relational database instead. Database servers can do a better job of caching data than the operating system, because they view the data with a finer granularity. Database servers may also keep files open between requests, saving you the overhead of opening and closing files.

Alternatively, you can try caching data within your own program, but consider the life cycle of a PHP script: at the end of each request, PHP frees all memory. If your program needs to refer to the same file many times during a request, you may improve performance by reading the file into a variable once.
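A minimal sketch of this idea, using a hypothetical get_template helper that caches file contents in a static variable:

<?php
function get_template($filename)
{
    // read each file from disk only once per request
    static $cache = array();
    if (!isset($cache[$filename])) {
        $cache[$filename] = implode("", file($filename));
    }
    return $cache[$filename];
}
?>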

Consider optimizing your database queries, too. MySQL includes the EXPLAIN statement, which reports how the server will execute a query, including which indexes it will use for a join. MySQL's online manual includes a section on optimizing queries.
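For example, you might run EXPLAIN from PHP like this (the table and column names are invented for illustration):

<?php
$result = mysql_query("EXPLAIN SELECT * FROM users WHERE email='leon@example.com'");
while ($row = mysql_fetch_row($result)) {
    // each row describes how MySQL will read one table in the query
    print(implode(" | ", $row) . "<br>\n");
}
?>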

Here are two tips for loops. If the number of iterations in a loop is low, you might get some performance gain from replacing the loop with a number of statements. For example, consider a for loop that sets 10 values in an array. You can replace the loop with 10 statements, which is a duplication of code, but may execute slightly faster.
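Sketched in code, the tradeoff looks like this:

<?php
// the loop version: sets ten values, paying loop overhead on each iteration
$square = array();
for ($i = 0; $i < 10; $i++) {
    $square[$i] = $i * $i;
}

// the unrolled version: duplicated code, but no loop overhead
$square = array();
$square[0] = 0;
$square[1] = 1;
$square[2] = 4;
$square[3] = 9;
$square[4] = 16;
$square[5] = 25;
$square[6] = 36;
$square[7] = 49;
$square[8] = 64;
$square[9] = 81;
?>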

Also, don't recompute values inside a loop. Before the foreach statement appeared in PHP, it was common to walk an array with a for loop whose test called the count function, recomputing the array's size on every iteration. Hoist calculations like that out of the loop.
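For example (the process function and $rows array are hypothetical):

<?php
// recomputing inside the loop: count runs on every iteration
for ($i = 0; $i < count($rows); $i++) {
    process($rows[$i]);
}

// better: compute the count once, before the loop
$n = count($rows);
for ($i = 0; $i < $n; $i++) {
    process($rows[$i]);
}
?>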

Function calls carry a high overhead, so you can get a bump in performance by eliminating one. Compiled languages, such as C and Java, have the luxury of replacing function calls with inline code; PHP does not. Consider inlining functions that you call only once. Using functions to hide details is a worthwhile technique for readable code, but it is relatively expensive in PHP.

If all else fails, you have the option of moving part of your code into C and wrapping it in a PHP function. This technique is not for the novice, but many of PHP's functions began as optimizations like this. Consider the in_array function: you can test for the presence of a value in an array by looping through it yourself, but the function written in C is much faster.
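A sketch of the comparison:

<?php
// searching with interpreted PHP code: one comparison per iteration
function my_in_array($needle, $haystack)
{
    foreach ($haystack as $value) {
        if ($value == $needle) {
            return true;
        }
    }
    return false;
}

$colors = array("red", "green", "blue");

// both calls find the value, but the built-in, written in C, is much faster
$found = my_in_array("green", $colors);
$found = in_array("green", $colors);
?>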
