Tuesday, February 07, 2006

php vs. java

by Jack Herrington

PHP scales. There, I said it. The word on the street is that "Java scales and PHP doesn't." The word on the street is wrong, and PHP needs someone to stand up and tell the truth: that it does scale.

Those with a closed mind can head straight to the inevitable flame war located at the end of this article. Those with an open mind who are interested in taking their web development skills and putting them to use building applications in the cross-platform, easy-to-write, easy-to-maintain, scalable, and robust PHP platform, but were hesitant because of the scalability myth, should read on. It starts by looking at the term scalability.

What is Scalability?

There are a number of different aspects of scalability. It always starts with performance, which is what we will cover in this article. But it also covers issues such as code maintainability, fault tolerance, and the availability of programming staff.

These are all reasonable issues, and should be covered whenever you are choosing the development platform for any large project. But in order to convey a convincing argument in this small space, I need to reduce the term scalability to its core concern: performance.

Language and Database Performance

Both Java and PHP run in virtual machines, which means that neither performs as well as compiled C or C++. In the great language shootout, Java beat PHP on most of the performance benchmarks, even substantially on some. However, overall the two languages were not an order of magnitude different. In addition, an older version of PHP was used in the test, and substantial performance improvements have been made since and are continuing to be made.

Another area of performance concern is in the connection to the database. This is a misconception, however, as the majority of the time spent in a database query is spent on the database server, processing the query, and in transmitting the data between the server and the client. PHP's connectivity to the database consists of either a thin layer on top of the C data access functions, or a database abstraction layer called PEAR::DB. There is nothing to suggest that there is any PHP-specific database access performance penalty.

Yet another area of efficiency concern is in the connection between the language and the web server. In the CGI model, the program or interpreter is booted on each request. In the in-process model, the interpreter stays around after each request. One of the original Java-versus-scripting-languages (e.g. PHP) benchmarks pitted in-process Java against CGI invocation on the server. In the CGI model, each page incurred the overhead of the startup and shutdown of the interpreter. Even at the time, the comparison was unfair, as production machines used server-scripting extensions (such as PHP), which run in-process and stay loaded between page fetches. There is no performance penalty for loading the interpreter, and compiled pages remain in memory.
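The difference between the two invocation models can be sketched as a toy cost model. The class name, method names, and millisecond figures below are assumptions chosen for illustration, not measurements of any real server.

```java
// Toy cost model for CGI versus in-process invocation.
// All names and numbers here are hypothetical, not benchmark results.
final class InvocationCost {
    private InvocationCost() {}

    // CGI model: the interpreter is started and shut down for every request.
    static long cgiTotalMillis(long requests, long startupMillis, long workMillis) {
        return requests * (startupMillis + workMillis);
    }

    // In-process model: the interpreter is loaded once and stays resident.
    static long inProcessTotalMillis(long requests, long startupMillis, long workMillis) {
        return startupMillis + requests * workMillis;
    }

    public static void main(String[] args) {
        long requests = 1_000, startup = 50, work = 5; // illustrative values only
        System.out.println("CGI:        " + cgiTotalMillis(requests, startup, work) + " ms");
        System.out.println("In-process: " + inProcessTotalMillis(requests, startup, work) + " ms");
    }
}
```

With these made-up figures the startup cost dominates the CGI total, which is why a benchmark that puts one platform in CGI mode and the other in-process says little about the languages themselves.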

With these basic efficiency questions out of the way, it's time to look at the overall architecture of the web application.

Comparing Architectures

There are three basic web architectures in common use today: two-tier, logical three-tier, and physical three-tier. Engineers give them different names and slightly different mechanics, so to be clear about what I mean, I will illustrate the three architectures.

J2EE Web Server Architecture

Perhaps the second most contentious part of this article is my definition of a J2EE web server application architecture. Outside the Java community, the application structure looks clear: JSPs talk to EJBs, which talk to the database. Within the Java community, the standard J2EE topology is anything but clear. A comparison is only valid between two well-defined things, so to decide whether "Java scales and PHP doesn't," I need to be clear about what a Java web application server is.

I'll take the two most common interpretations of J2EE architecture. The first is Sun's EJB 1.0 architecture, and the second is the EJB 2.0 architecture. Shown below is Sun's EJB 1.0 architecture for web application servers:

This is classic physical three-tier architecture, and it pays the performance price. I've highlighted the portions of the architecture that involve network traffic, either via database connection, an overhead shared by PHP, or by Remote Method Invocation (RMI), an overhead not shared by PHP.

To be fair, the connection between the web server and the servlet engine can be avoided with modern application servers and web servers, such as Tomcat and Apache 2.0. At the time when the first versions of the JSP and EJB standards were released, the prevalent web server was (and still is) Apache 1.x, which had a process model that was not compatible with Java's threading model. This meant that a small stub was required on the web server side to communicate with the servlet engine. This remains a non-trivial performance overhead for those who decide to pay it, and was a significant performance overhead when the first scalability comparisons were made.

A much more significant source of overhead was in the RMI connection between the servlet engine and the EJB layer. A page showing ten fields from twenty objects would make two hundred RMI calls before it was completed. This overhead was removed with the EJB 2.0 standard, which introduced local interfaces. This topology is shown below:

This is logical three-tier architecture. The web server box has been removed because more recent web servers are not separated from the servlet code (e.g. Tomcat, Apache 2.0, etc.). As you will see when we compare this model to the PHP model, EJB 2.0 moved Java web application server development closer to the successful, and scalable, PHP model.
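To make the two-hundred-call arithmetic concrete, here is a rough sketch. It does not use any real RMI or EJB API: the `Bean` class and the `remote` flag are hypothetical stand-ins, and a counter simply records how many calls would have crossed the network.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simulates field access on beans reached through a remote or a local interface.
// Hypothetical names throughout; a counter stands in for real network traffic.
final class BeanCallSketch {
    static final class Bean {
        private final boolean remote;
        private final AtomicInteger roundTrips;

        Bean(boolean remote, AtomicInteger roundTrips) {
            this.remote = remote;
            this.roundTrips = roundTrips;
        }

        // With a remote interface every getter is one network round trip;
        // with a local interface it is an ordinary in-process call.
        String getField(int index) {
            if (remote) {
                roundTrips.incrementAndGet();
            }
            return "field" + index;
        }
    }

    // Renders a "page" that reads every field of every object and
    // returns how many network round trips that took.
    static int roundTrips(int objects, int fieldsPerObject, boolean remote) {
        AtomicInteger counter = new AtomicInteger();
        for (int o = 0; o < objects; o++) {
            Bean bean = new Bean(remote, counter);
            for (int f = 0; f < fieldsPerObject; f++) {
                bean.getField(f);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Remote interfaces: " + roundTrips(20, 10, true) + " round trips");
        System.out.println("Local interfaces:  " + roundTrips(20, 10, false) + " round trips");
    }
}
```

Twenty objects with ten fields each give 200 round trips in the remote case and none in the local case. The page does the same work either way; only the number of network crossings changes.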

PHP Web Server Architecture

PHP has always been capable of running the gamut between a two-tier architecture and a logical three-tier architecture. Early versions of PHP could abstract the business and database access logic into a functional second tier. More recent versions can abstract the business logic with objects, and with PHP 5, these objects can use public, protected, and private access control.

A modern PHP architecture, strikingly similar to the EJB 2.0 model, is shown below:

This is logical three-tier architecture, and this is how modern PHP applications are written. As with Java web servers, the PHP code is in-process with the web server, so there is no overhead in the server talking to the PHP code.

The PHP page acts as a broker between second-tier business objects and Smarty templates, which format the page for presentation. As with the JSP "best practice," the Smarty templates are only capable of displaying the data presented, with rudimentary looping and conditional control structures.
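The same broker arrangement can be sketched in Java, used here only for illustration, since the stack described above is PHP with Smarty. Every name below is hypothetical, and the hard-coded product list stands in for a real database tier.

```java
import java.util.List;
import java.util.Map;

// Logical three-tier sketch: page -> business object -> (stubbed) data.
final class ThreeTierSketch {
    // Second tier: a business object that hides where the data comes from.
    static final class ProductCatalog {
        List<String> productNames() {
            return List.of("Widget", "Gadget"); // made-up data standing in for a query
        }
    }

    // Presentation: the template only loops over and prints what it is handed.
    static String renderTemplate(Map<String, List<String>> model) {
        StringBuilder html = new StringBuilder("<ul>");
        for (String name : model.get("products")) {
            html.append("<li>").append(name).append("</li>");
        }
        return html.append("</ul>").toString();
    }

    // The "page": brokers between the business object and the template.
    static String page(ProductCatalog catalog) {
        return renderTemplate(Map.of("products", catalog.productNames()));
    }

    public static void main(String[] args) {
        System.out.println(page(new ProductCatalog()));
    }
}
```

The template never touches the catalog directly, and the catalog knows nothing about HTML. All three pieces run in one process, which is what makes the tiers logical rather than physical.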

But this is all about the design of the server. What about the architecture of the application itself?

Stateful and Stateless Architecture

The lack of an external, stateful object store, where the application can hold session state, is often voiced as a scalability concern. PHP can use the database as the back-end session store. There is little performance difference, because a network access is required in both cases. An argument can be made that the external object store allows for any arbitrary data to be stored conveniently; however, this is easily offset by the fact that the object store itself is a single point of failure. If the object store is replicated across multiple web servers, that becomes an issue of data replication and cache coherency, which is a very complex problem.

Another familiar Java pattern is the use of a local persistent object store on each web server. The user is limited to a single server by use of sticky sessions on the router. The same could be done in PHP: a local, persistent data store. But this is an anti-pattern anyway, because a sticky session-based server pool is prone to overloading a single web server. And should that server go down, the result is a denial of service to a group of customers.

The ideal multi-server model is a pod architecture, where the router round-robins each of the machines and there is only a minimal session store in the database. Transient user interface information is stored in hidden variables on the web page. This allows for the user to run multiple web sessions against the server simultaneously, and alleviates the "back button issue" in web user interfaces.
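A minimal sketch of the pod model, under stated assumptions: an in-memory map stands in for the session table in the database, a string parameter stands in for the hidden form variable, and all class names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Pod architecture sketch: round-robin routing over servers that hold no user state.
final class PodSketch {
    // Stands in for the minimal session table in the database, shared by all servers.
    static final class SharedSessionStore {
        private final Map<String, String> userBySession = new HashMap<>();

        void put(String sessionId, String user) {
            userBySession.put(sessionId, user);
        }

        String userFor(String sessionId) {
            return userBySession.get(sessionId);
        }
    }

    // A web server that keeps no per-user state of its own.
    static final class StatelessServer {
        private final String name;
        private final SharedSessionStore store;

        StatelessServer(String name, SharedSessionStore store) {
            this.name = name;
            this.store = store;
        }

        // hiddenField carries transient UI state that arrived with the page itself.
        String handle(String sessionId, String hiddenField) {
            return name + " served " + store.userFor(sessionId) + " at step " + hiddenField;
        }
    }

    // Hands each request to the next server in turn, with no stickiness.
    static final class RoundRobinRouter {
        private final StatelessServer[] servers;
        private int next = 0;

        RoundRobinRouter(StatelessServer... servers) {
            this.servers = servers;
        }

        String route(String sessionId, String hiddenField) {
            StatelessServer server = servers[next];
            next = (next + 1) % servers.length;
            return server.handle(sessionId, hiddenField);
        }
    }

    public static void main(String[] args) {
        SharedSessionStore store = new SharedSessionStore();
        store.put("abc", "alice");
        RoundRobinRouter router = new RoundRobinRouter(
                new StatelessServer("web1", store), new StatelessServer("web2", store));
        System.out.println(router.route("abc", "2"));
        System.out.println(router.route("abc", "3"));
    }
}
```

Because every request carries its own session id and hidden state, consecutive requests from the same user can land on different machines without anything being lost, and any one server can be removed from the pool.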

This section has covered some very complex issues in web application server design. Scalability is mainly about the architecture of the application layer, and there is no single architecture that works as a panacea for all applications. The key to success is not in any particular technology, but in simplifying your server model and understanding all of the components of the application layer, from the HTML and HTTP on the front end to the SQL in the back end. Both PHP and Java are flexible enough to create scalable applications for those who spend the time to optimize their application architecture.

The Convergence of Web Application Architecture

This article started by asserting that PHP scales. When the tag-line "Java scales and scripting languages don't" was born, it was based on EJB 1.0, an architecture that most Java architects would consider absurd, based on its high overhead. Based on EJB 1.0, Java's performance was much worse than that of scripting languages. It is only the addition of local interfaces in EJB 2.0 that makes the J2EE architecture perform well.

The argument for PHP scalability is further simplified, however, by the fact that both PHP and J2EE architecture (as well as others) are converging on the same design. And if "J2EE scales" given this simpler, logical three-tier architecture, then it follows that PHP does as well.

The performance principle for scalability is simple: if you want to scale, then you have to serve web pages quickly. To serve web pages quickly, you either have to do less, or do what you do faster. Faster is a non-starter, because Java is not so much faster than PHP that it makes much of a difference. Doing less is the key. By using a logical three-tier architecture and by reducing the number of queries and commands sent to the database, we can get web applications that scale, both in Java and in PHP.
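One common way of "doing less" is to fetch a page's rows in a single query instead of one query per row. The sketch below only counts queries; `CountingDatabase` is a hypothetical in-memory stand-in, not a real database driver.

```java
import java.util.ArrayList;
import java.util.List;

// Compares the number of queries sent when fetching rows one by one versus in a batch.
final class QueryCountSketch {
    // Hypothetical stand-in for a database connection that counts queries sent.
    static final class CountingDatabase {
        private int count = 0;

        // One query per row.
        String fetchOne(int id) {
            count++;
            return "row" + id;
        }

        // One query for the whole list of ids.
        List<String> fetchMany(List<Integer> ids) {
            count++;
            List<String> rows = new ArrayList<>();
            for (int id : ids) {
                rows.add("row" + id);
            }
            return rows;
        }

        int queriesSent() {
            return count;
        }
    }

    static int queriesWhenFetchingOneByOne(List<Integer> ids) {
        CountingDatabase db = new CountingDatabase();
        for (int id : ids) {
            db.fetchOne(id);
        }
        return db.queriesSent();
    }

    static int queriesWhenFetchingInOneBatch(List<Integer> ids) {
        CountingDatabase db = new CountingDatabase();
        db.fetchMany(ids);
        return db.queriesSent();
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5);
        System.out.println("One by one: " + queriesWhenFetchingOneByOne(ids) + " queries");
        System.out.println("Batched:    " + queriesWhenFetchingInOneBatch(ids) + " queries");
    }
}
```

The saving comes from the shape of the data access, not from the language, so the same restructuring applies equally to a PHP page.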

For the open-minded developer, there is a world of applications that can be built quickly, cheaply, robustly, and scalably with PHP. Services such as Amazon, Yahoo, Google, and Slashdot have known about scripting languages for years and used them effectively in production. Yahoo even adopted PHP as its language of choice for development. Don't believe the hype in the white papers that says that PHP isn't for real applications or doesn't scale.

I'm sure that what I have said in this article will be picked to death and ridiculed by some. I stand by what I have said. The idea that PHP does not scale is clearly false at the performance level. In fact, we should have never even gotten to the point where this article was necessary, because as engineers, we should recognize that the argument that one language clearly "scales better" than another is, on its face, ridiculous. As engineers and architects, we need to look objectively at technologies and use a factual and rational basis to make technology decisions.


  • This is the only post about php vs java that I have read from the first word to the last one. Congratulations, a very good article.

    By Anonymous, at 2:26 PM

  • "the argument that one language clearly 'scales better' than another is, on its face, ridiculous"
    of course

    "language shootout... overall the two languages were not an order of magnitude different"
    The Win32 Computer Language Shootout hasn't been updated for years.
    In contrast, The Computer Language Shootout has frequent language updates and program updates.

    The Computer Language Shootout shows order of magnitude differences against Java 1.5 and Java 1.4

    The simplest way to change that is to contribute better PHP programs ;-)

    By Isaac Gouy, at 12:02 PM

  • Good article.
    For sure, scalability is not at all linked to the language, but to the design principles.
    On the Web, the only thing that guarantees linear and massive scalability is being stateless wherever we have to scale, which means everywhere except the two end points: the browser and the database. As a consequence, any state information must be stored in one of the two points. And it is a best practice to consider the browser first, as it scales as easily. If the data should be secured, then put it in the database.

    By François Tricot, at 8:25 AM
