Template Test


Three things made the Web possible: HTML for encoding documents, HTTP for transferring them, and URLs for identifying them. To fetch and extract information from web pages, you must know all three—you construct a URL for the page you wish to fetch, make an HTTP request for it and decode the HTTP response, then parse the HTML to extract information. This chapter covers the construction of URLs and the concepts behind HTTP. HTML parsing is tricky and gets its own chapters later, as does the module that lets you manipulate URLs.

You'll also learn how to automate the most basic web tasks with the LWP::Simple module. As its name suggests, this module has a very simple interface. You'll learn the limitations of that interface and see how to use other LWP modules to fetch web pages without the limitations of LWP::Simple.



URLs

Components of a URL

Figure 4.1: Components of a URL

Note:
Don't actually go to that URL.

A Uniform Resource Locator (URL) is the address of something on the Web. For example:

http://www.oreilly.com/news/bikeweek_day1.html

URLs have a structure, given in RFC 2396. That RFC runs to 40 pages, largely because of the wide variety of things for which you can construct URLs. Because we are interested only in HTTP and FTP URLs, the components of a URL, with the delimiters that separate them, are:


''scheme''://''username''@''server'':''port''/''path''?''query''

In the case of our example URL, the scheme is http, the server is www.oreilly.com, and the path is /news/bikeweek_day1.html.

This is an FTP URL:

ftp://ftp.is.co.za/rfc/rfc1808.txt

The scheme is ftp, the host is ftp.is.co.za, and the path is /rfc/rfc1808.txt. The scheme and the hostname are not case sensitive, but the rest is. That is, ftp://ftp.is.co.za/rfc/rfc1808.txt and fTp://ftp.Is.cO.ZA/rfc/rfc1808.txt are the same, but ftp://ftp.is.co.za/rfc/rfc1808.txt and ftp://ftp.is.co.za/rfc/RFC1808.txt are not, unless that server happens to forgive case differences in requests.

We're ignoring the URLs that don't designate things that a web client can retrieve. For example, telnet://melvyl.ucop.edu/ designates a host with which you can start a Telnet session, and mailto:mojo@jojo.int designates an email address to which you can send.

The only characters allowed in the path portions of a URL are the US-ASCII characters A through Z, a through z, and 0-9 (but excluding extended ASCII characters such as ü and Unicode characters such as Ω or ⊆), and these permitted punctuation characters:


-     _     .     !     ~     *     '     ,
:     @     &     +     $     (     )     /

For a query component, the same rule holds, except that the only punctuation characters allowed are these:


-     _     .     !     ~     *     '     (     )

Any other characters must be URL encoded, i.e., expressed as a percent sign followed by the two hexadecimal digits for that character. So if you wanted to use a space in a URL, it would have to be expressed as %20, because space is character 32 in ASCII, and the number 32 expressed in hexadecimal is 20.

Incidentally, sometimes you might also see some of these characters in a URL:


{     }    |    \    ^    [    ]    `

But the document that defines URLs, RFC 2396, refers to the use of these as unreliable and "unwise." When in doubt, encode it!
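The encoding rule described above can be sketched in a few lines of core Perl. This is only an illustration, not the code of any particular module: the helper name percent_encode is made up here, it uses the stricter query-component character list, and it assumes its input is a string of bytes rather than wide Unicode characters.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every byte that is not a letter,
# a digit, or one of the punctuation characters allowed in a query.
# Assumes $raw is a string of bytes (not wide Unicode characters).
sub percent_encode {
    my ($raw) = @_;
    (my $encoded = $raw) =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $encoded;
}

print percent_encode("boy toy"), "\n";   # the space is byte 0x20
print percent_encode("M\xFCller"), "\n"; # Latin-1 u-umlaut is byte 0xFC
print percent_encode("{curly}"), "\n";   # the "unwise" characters get encoded too
```

Encoding everything outside the short allowed list is the conservative choice: a character that did not strictly need encoding still decodes to the same value.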

The query portion of a URL assigns values to parameters:


name=Hiram%20Veeblefeetzer&age=35&country=Madagascar

There are three parameters in that query string: name, with the value "Hiram Veeblefeetzer" (the space has been encoded); age, with the value 35; and country, with the value "Madagascar".
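Going the other way, a query string can be taken apart into its parameters. The sketch below uses only core Perl and a made-up helper name, parse_query. It assumes the pairs are separated by "&" and that the only encoding present is the %XX form described earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string into [name, value] pairs,
# decoding %XX escapes. Assumes pairs are separated by "&".
sub parse_query {
    my ($query) = @_;
    my @pairs;
    foreach my $piece (split /&/, $query) {
        next unless length $piece;            # skip empty pieces such as "a=1&&b=2"
        my ($name, $value) = split /=/, $piece, 2;
        $value = '' unless defined $value;    # "flag" with no "=value"
        for ($name, $value) {
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        }
        push @pairs, [$name, $value];
    }
    return @pairs;
}

foreach my $pair (parse_query("name=Hiram%20Veeblefeetzer&age=35&country=Madagascar")) {
    print "$pair->[0] => $pair->[1]\n";
}
```

Returning a list of pairs rather than a hash keeps the parameters in order and allows a name to appear more than once.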

The URI::Escape module provides the uri_escape( ) function to help you build URLs:


use URI::Escape;
''encoded_string'' = uri_escape(''raw_string'');

For example, to build the name, age, and country query string:


$n = uri_escape("Hiram Veeblefeetzer");
$a = uri_escape(35);
$c = uri_escape("Madagascar");
$query = "name=$n&age=$a&country=$c";
print $query;
''name=Hiram%20Veeblefeetzer&age=35&country=Madagascar''

An HTTP Transaction

The Hypertext Transfer Protocol (HTTP) is used to fetch most documents on the Web. It is formally specified in RFC 2616, but this section explains everything you need to know to use LWP.

HTTP is a server/client protocol: the server has the file, and the client wants it. In regular web surfing, the client is a web browser such as Mozilla or Internet Explorer. The URL for a document identifies the server, which the browser contacts and requests the document from. The server responds with either an error ("file not found") or success (in which case the document is attached).

Example 2-1 contains a sample request from a client.


Example 2-1. An HTTP request

GET /daily/2001/01/05/1.html HTTP/1.1
Host: www.suck.com
User-Agent: SuperDuperBrowser/14.6
blank line

A successful response is given in Example 2-2.


Example 2-2. A successful HTTP response

HTTP/1.1 200 OK
Content-type: text/html
Content-length: 24204
blank line
and then 24,204 bytes of HTML code

A response indicating failure is given in Example 2-3.


Example 2-3. An unsuccessful HTTP response

HTTP/1.1 404 Not Found
Content-type: text/html
Content-length: 135
  
<html><head><title>Not Found</title></head><body>
Sorry, the object you requested was not found.
</body></html>
and then the server closes the connection


Request

An HTTP request has three parts: the request line, the headers, and the body of the request (normally used to pass form parameters).

The request line says what the client wants to do (the method), what it wants to do it to (the path), and what protocol it's speaking. Although the HTTP standard defines several methods, the most common are GET and POST. The path is part of the URL being requested (in Example 2-1 the path is /daily/2001/01/05/1.html). The protocol version is generally HTTP/1.1.

Each header line consists of a key and a value (for example, User-Agent: SuperDuperBrowser/14.6). In versions of HTTP previous to 1.1, header lines were optional. In HTTP 1.1, the Host: header must be present, to name the server to which the browser is talking. This is the "server" part of the URL being requested (e.g., www.suck.com). The headers are terminated with a blank line, which must be present regardless of whether there are any headers.

The optional message body can contain arbitrary data. If a body is sent, the request's Content-Type and Content-Length headers help the server decode the data. GET queries don't have any attached data, so this area is blank (that is, nothing is sent by the browser). For our purposes, only POST queries use this third part of the HTTP request.

The following are the most useful headers sent in an HTTP request.


Host: www.youthere.int
This mandatory header line tells the server the hostname from the URL being requested. It may sound odd to be telling a server its own name, but this header line was added in HTTP 1.1 to deal with cases where a single HTTP server answers requests for several different hostnames.

User-Agent: Thing/1.23 details...
This optional header line identifies the make and model of this browser (virtual or otherwise). For an interactive browser, it's usually something like Mozilla/4.76 [en] (Win98; U) or Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC). By default, LWP sends a User-Agent header of libwww-perl/5.64 (or whatever your exact LWP version is).

Referer: http://www.thingamabob.int/stuff.html
This optional header line tells the remote server the URL of the page that contained a link to the page being requested.

Accept-Language: en-US, en, es, de
This optional header line tells the remote server the natural languages in which the user would prefer to see content, using language tags. For example, the above list means the user would prefer content in U.S. English, or (in order of decreasing preference) any kind of English, Spanish, or German. (Appendix D lists the most common language tags.) Many browsers do not send this header, and those that do usually send the default header appropriate to the version of the browser that the user installed. For example, if the browser is Netscape with a Spanish-language interface, it would probably send Accept-Language: es, unless the user has dutifully gone through the browser's preferences menus to specify other languages.
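To make the layout of a request concrete, here is a rough sketch that assembles the text of a GET request from its parts. It is core Perl only and is not how LWP builds requests internally. The helper name format_request is invented for this illustration, and the sketch relies on the fact that HTTP lines end with a carriage return and line feed.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble the text of an HTTP/1.1 request.
# Header lines are given as a flat (key, value, key, value) list.
sub format_request {
    my ($method, $host, $path, @headers) = @_;
    my $text = "$method $path HTTP/1.1\r\n";     # the request line
    $text .= "Host: $host\r\n";                  # mandatory in HTTP 1.1
    while (my ($key, $value) = splice(@headers, 0, 2)) {
        $text .= "$key: $value\r\n";
    }
    $text .= "\r\n";                             # blank line ends the headers
    return $text;
}

print format_request(
    'GET', 'www.suck.com', '/daily/2001/01/05/1.html',
    'User-Agent'      => 'SuperDuperBrowser/14.6',
    'Accept-Language' => 'en-US, en, es, de',
);
```

A GET request has no body, so the text simply stops after the blank line.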

Response

The server's response also has three parts: the status line, some headers, and an optional body.

The status line states which protocol the server is speaking, then gives a numeric status code and a short message. For example, "HTTP/1.1 404 Not Found." The numeric status codes are grouped—200-299 are success, 400-499 are permanent failures, and so on. A full list of HTTP status codes is given in Appendix B.

The header lines let the server send additional information about the response. For example, if authentication is required, the server uses headers to indicate the type of authentication. The most common header—almost always present for both successful and unsuccessful requests—is Content-Type, which helps the browser interpret the body. Headers are terminated with a blank line, which must be present even if no headers are sent.

Many responses contain a Content-Length line that specifies the length, in bytes, of the body. However, this line is rarely present on dynamically generated pages, and because you never know which pages are dynamically generated, you can't rely on that header line being there.

(Other, rarer header lines are used for specifying that the content has moved to a given URL, or that the server wants the browser to send HTTP cookies, and so on; however, these things are generally handled for you automatically by LWP.)

The body of the response follows the blank line and can be any arbitrary data. In the case of a typical web request, this is the HTML document to be displayed. If an error occurs, the message body doesn't contain the document that was requested but usually consists of a server-generated error message (generally in HTML, but sometimes not) explaining the error.
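The three parts of a response can be pulled apart with a few calls to split. The sketch below is a simplified illustration in core Perl, not what LWP does internally. The helper name parse_response is made up, the whole response is assumed to be in one string already, and repeated header lines are not handled (a later line overwrites an earlier one).

```perl
use strict;
use warnings;

# Hypothetical helper: split raw response text into status line parts,
# a hash of headers, and the body.
sub parse_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;   # first blank line ends the headers
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($protocol, $code, $message) = split / /, $status_line, 3;
    my %headers;
    foreach my $line (@header_lines) {
        my ($key, $value) = split /:\s*/, $line, 2;
        $headers{lc $key} = $value;                    # header names are case-insensitive
    }
    return ($protocol, $code, $message, \%headers, defined $body ? $body : '');
}

# A made-up response, in the style of the unsuccessful example earlier:
my $sample_body = "Sorry, the object you requested was not found.\n";
my $sample = "HTTP/1.1 404 Not Found\r\n"
           . "Content-type: text/plain\r\n"
           . "Content-length: " . length($sample_body) . "\r\n"
           . "\r\n"
           . $sample_body;

my ($protocol, $code, $message, $headers, $body) = parse_response($sample);
print "Status: $code ($message), speaking $protocol\n";
print "Type:   $headers->{'content-type'}\n";
print "Body:   $body";
```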


LWP::Simple

GET is the simplest and most common type of HTTP request. Form parameters may be supplied in the URL, but there is never a body to the request. The LWP::Simple module has several functions for quickly fetching a document with a GET request. Some functions return the document, others save or print the document.


Basic Document Fetch

The LWP::Simple module's get( ) function takes a URL and returns the body of the document:


$document = get("http://www.suck.com/daily/2001/01/05/1.html");

If the document can't be fetched, get( ) returns undef. Incidentally, if LWP requests that URL and the server replies that it has moved to some other URL, LWP requests that other URL and returns the document it finds there.

With LWP::Simple's get( ) function, there's no way to set headers to be sent with the GET request or get more information about the response, such as the status code. These are important things, because some web servers have copies of documents in different languages and use the HTTP language header to determine which document to return. Likewise, the HTTP response code can let us distinguish between permanent failures (e.g., "404 Not Found") and temporary failures ("503 Service [Temporarily] Unavailable").

Even the most common type of nontrivial web robot (a link checker) benefits from access to response codes. A 403 ("Forbidden," usually because of file permissions) could be automatically corrected, whereas a 404 ("Not Found") error implies an out-of-date link that requires fixing. But if you want access to these codes or other parts of the response besides just the main content, your task is no longer a simple one, and so you shouldn't use LWP::Simple for it. The "simple" in LWP::Simple refers not just to the style of its interface, but also to the kind of tasks for which it's meant.


Fetch and Store

One way to get the status code is to use LWP::Simple's getstore( ) function, which writes the document to a file and returns the status code from the response:


$status = getstore("http://www.suck.com/daily/2001/01/05/1.html",
                   "/tmp/web.html");

There are two problems with this. The first is that the document is now stored in a file instead of in a variable where you can process it (extract information, convert to another format, etc.). This is readily solved by reading the file using Perl's built-in open( ) and <FH> operators; see below for an example.

The other problem is that a status code by itself isn't very useful: how do you know whether it was successful? That is, does the file contain a document? LWP::Simple offers the is_success( ) and is_error( ) functions to answer that question:


$successful = is_success(''status'');
$failed     = is_error(''status'');

If the status code status indicates a successful request (is in the 200-299 range), is_success( ) returns true. If status is an error (400-599), is_error( ) returns true. For example, this bit of code saves the BookTV (CSPAN2) listings schedule and emits a message if Gore Vidal is mentioned:


use strict;
use warnings;
use LWP::Simple;
my $url  = 'http://www.booktv.org/schedule/';
my $file = 'booktv.html';
my $status = getstore($url, $file);
die "Error $status on $url" unless is_success($status);
open(my $in, '<', $file) || die "Can't open $file: $!";
while (<$in>) {
  if (m/Gore\s+Vidal/) {
    print "Look!  Gore Vidal!  $url\n";
    last;
  }
}
close($in);
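The range tests that is_success( ) and is_error( ) are described as performing come down to simple numeric comparisons. The stand-ins below are deliberately given different, made-up names so they are not mistaken for LWP::Simple's own functions. They only illustrate the ranges given in the text and need no network access.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the range checks described in the text
sub looks_successful { my ($code) = @_; return $code >= 200 && $code < 300; }
sub looks_like_error { my ($code) = @_; return $code >= 400 && $code < 600; }

foreach my $code (200, 301, 404, 503) {
    my $verdict = looks_successful($code) ? "success"
                : looks_like_error($code) ? "error"
                :                           "neither (e.g., a redirect)";
    print "$code: $verdict\n";
}
```

Note that some codes, such as the 300-399 redirects, are neither a success nor an error, so the two tests are not simply opposites of each other.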


Fetch and Print

LWP::Simple also exports the getprint( ) function:


$status = getprint(''url'');

The document is printed to the currently selected output filehandle (usually STDOUT). In other respects, it behaves like getstore( ). This can be very handy in one-liners such as:


% perl -MLWP::Simple -e "getprint('http://cpan.org/RECENT')||die" | grep Apache

That retrieves http://cpan.org/RECENT, which lists the past week's uploads in CPAN (it's a plain text file, not HTML), then sends it to STDOUT, where grep passes through the lines that contain "Apache."


Previewing with HEAD

LWP::Simple also exports the head( ) function, which asks the server, "If I were to request this item with GET, what headers would it have?" This is useful when you are checking links. Although not all servers support HEAD requests properly, if head( ) says the document is retrievable, then it almost definitely is. (However, if head( ) says it's not, that might just be because the server doesn't support HEAD requests.)

The return value of head( ) depends on whether you call it in scalar context or list context. In scalar context, it is simply:


$is_success = head(''url'');

If the server answers the HEAD request with a successful status code, this returns a true value. Otherwise, it returns a false value. You can use this like so:

die "I don't think I'll be able to get $url" unless head($url);

Regrettably, however, some old servers, and most CGIs running on newer servers, do not understand HEAD requests. In that case, they should reply with a "405 Method Not Allowed" message, but some actually respond as if you had performed a GET request. With the minimal interface that head( ) provides, you can't really deal with either of those cases, because you can't get the status code on unsuccessful requests, nor can you get the content (of which, in theory, there should never be any).

In list context, head( ) returns a list of five values, if the request is successful:


(''content_type, document_length, modified_time, expires, server'')
    = head(''url'');

The content_type value is the MIME type string of the form type/subtype; the most common MIME types are listed in Appendix C. The document_length value is whatever is in the Content-Length header, which, if present, should be the number of bytes in the document that you would have gotten if you'd performed a GET request. The modified_time value is the contents of the Last-Modified header converted to a number like you would get from Perl's time( ) function. For normal files (GIFs, HTML files, etc.), the Last-Modified value is just the modification time of that file, but dynamically generated content will not typically have a Last-Modified header.

The last two values are rarely useful; the expires value is a time (expressed as a number like you would get from Perl's time( ) function) from the seldom used Expires header, indicating when the data should no longer be considered valid. The server value is the contents of the Server header line that the server can send, to tell you what kind of software it's running. A typical value is Apache/1.3.22 (Unix).
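To show what "converted to a number like you would get from time( )" means for the modified_time and expires values, here is a rough sketch using the core Time::Local module. It is not LWP's own date parser. The helper name http_date_to_epoch is made up, and it understands only the one date layout shown in the comment, giving up on anything else.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %month_index = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
                   Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Hypothetical helper: convert a header value such as
# "Wed, 20 Jun 2001 11:14:33 GMT" to seconds since the epoch.
# Returns undef for anything not in exactly that layout.
sub http_date_to_epoch {
    my ($date) = @_;
    return unless defined $date;
    return unless $date =~ /^\w{3}, (\d{2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2}) GMT$/;
    my ($day, $month, $year, $hour, $min, $sec) = ($1, $2, $3, $4, $5, $6);
    return unless exists $month_index{$month};
    return timegm($sec, $min, $hour, $day, $month_index{$month}, $year);
}

my $modified = http_date_to_epoch("Wed, 20 Jun 2001 11:14:33 GMT");
print "Epoch seconds: $modified\n";
print "Back to text:  ", scalar(gmtime($modified)), "\n";
```

Once the value is a plain number, arithmetic such as time( ) minus the modification time gives an age in seconds.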

An unsuccessful request, in list context, returns an empty list. So when you're copying the return list into a bunch of scalars, they will each get assigned undef. Note also that you don't need to save all the values—you can save just the first few, as in Example 2-4.


Example 2-4. Link checking with HEAD

use strict;
use LWP::Simple;
foreach my $url (
  'http://us.a1.yimg.com/us.yimg.com/i/ww/m5v9.gif',
  'http://hooboy.no-such-host.int/',
  'http://www.yahoo.com',
  'http://www.ora.com/ask_tim/graphics/asktim_header_main.gif',
  'http://www.guardian.co.uk/',
  'http://www.pixunlimited.co.uk/siteheaders/Guardian.gif',
) {
  print "\n$url\n";

  my ($type, $length, $mod) = head($url);
  # so we don't even save the expires or server values!

  unless (defined $type) {
    print "Couldn't get $url\n";
    next;
  }
  print "That $type document is ", $length || "???", " bytes long.\n";
  if ($mod) {
    my $ago = time(  ) - $mod;
    print "It was modified $ago seconds ago; that's about ",
      int(.5 + $ago / (24 * 60 * 60)), " days ago, at ",
      scalar(localtime($mod)), "!\n";
  } else {
    print "I don't know when it was last modified.\n";
  }
}

Currently, that program prints the following, when run:


''http://us.a1.yimg.com/us.yimg.com/i/ww/m5v9.gif''
''That image/gif document is 5611 bytes long.''
''It was modified 251207569 seconds ago; that's about 2907 days ago, at Thu Apr 14 18:00:00 1994!''

''http://hooboy.no-such-host.int/''
''Couldn't get http://hooboy.no-such-host.int/''

''http://www.yahoo.com''
''That text/html document is ??? bytes long.''
''I don't know when it was last modified.''

''http://www.ora.com/ask_tim/graphics/asktim_header_main.gif''
''That image/gif document is 8588 bytes long.''
''It was modified 62185120 seconds ago; that's about 720 days ago, at Mon Apr 10 12:14:13 2000!''

''http://www.guardian.co.uk/''
''That text/html document is ??? bytes long.''
''I don't know when it was last modified.''

''http://www.pixunlimited.co.uk/siteheaders/Guardian.gif''
''That image/gif document is 4659 bytes long.''
''It was modified 24518302 seconds ago; that's about 284 days ago, at Wed Jun 20 11:14:33 2001!''

Incidentally, if you are using the very popular CGI.pm module, be aware that it exports a function called head( ) too. To avoid a clash, you can just tell LWP::Simple to export every function it normally would except for head( ):


use LWP::Simple qw(!head);
use CGI qw(:standard);

If not for that qw(!head), LWP::Simple would export head( ), then CGI would export head( ) (as it's in that module's :standard group), which would clash, producing a mildly cryptic warning such as "Prototype mismatch: sub main::head ($) vs none." Because any program using the CGI library is almost definitely a CGI script, any such warning (or, in fact, any message to STDERR) is usually enough to abort that CGI with a "500 Internal Server Error" message.


Fetching Documents Without LWP::Simple

LWP::Simple is convenient but not all-powerful. In particular, we can't make POST requests or set request headers or query response headers. To do these things, we need to go beyond LWP::Simple.

The general all-purpose way to do HTTP GET queries is by using the do_GET( ) subroutine shown in Example 2-5.


Example 2-5. The do_GET subroutine

use LWP;
my $browser;
sub do_GET {
  # Parameters: the URL,
  #  and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new(  ) unless $browser;
  my $resp = $browser->get(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}

A full explanation of the internals of do_GET( ) is given in Chapter 3. Until then, we'll be using it without fully understanding how it works.

You can call the do_GET( ) function in either scalar or list context:


''doc'' = do_GET(''URL ''[''header, value, ...'']);
(''doc'', ''status'', ''successful'', ''response'') = do_GET(''URL ''[''header, value, ...'']);

In scalar context, it returns the document or undef if there is an error. In list context, it returns the document (if any), the status line from the HTTP response, a Boolean value indicating whether the status code indicates a successful response, and an object we can interrogate to find out more about the response.
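The context-dependent behavior rests on Perl's built-in wantarray. The sketch below shows the same pattern without touching the network. The routine fake_GET, the example.int URLs, and the canned replies are all invented for this illustration, and it returns three values rather than four because there is no response object to hand back.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a fetch: no network, just canned results.
my %canned = (
    'http://example.int/ok'      => [ 'Hello!', '200 OK' ],
    'http://example.int/missing' => [ 'Sorry',  '404 Not Found' ],
);

sub fake_GET {
    my ($url) = @_;
    my ($content, $status_line) = @{ $canned{$url} || [ '', '500 No canned reply' ] };
    my $is_success = ($status_line =~ /^2\d\d /) ? 1 : 0;
    return ($content, $status_line, $is_success) if wantarray;  # list context
    return unless $is_success;                                  # scalar context: undef on failure
    return $content;                                            # scalar context: just the document
}

my $doc = fake_GET('http://example.int/missing');
print defined $doc ? "scalar context: $doc\n" : "scalar context: undef\n";

my ($body, $status, $ok) = fake_GET('http://example.int/missing');
print "list context: '$body', '$status', success=$ok\n";
```

In list context the caller still receives the error page's content along with the status, which is exactly the extra information that scalar context throws away.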

Recall that assigning to undef discards that value. For example, this is how you fetch a document into a string and learn whether it is successful:


($doc, undef, $successful, undef) = do_GET('http://www.suck.com/');

The optional header and value arguments to do_GET( ) let you add headers to the request. For example, to attempt to fetch the German language version of the European Union home page:


$body = do_GET("http://europa.eu.int/",
  "Accept-language" => "de",
);

The do_GET( ) function that we'll use in this chapter provides the same basic convenience as LWP::Simple's get( ) but without the limitations.


Example: AltaVista

Every so often, two people, somewhere, somehow, will come to argue over a point of English spelling—one of them will hold up a dictionary recommending one spelling, and the other will hold up a dictionary recommending something else. In olden times, such conflicts were tidily settled with a fight to the death, but in these days of overspecialization, it is common for one of the spelling combatants to say "Let's ask a linguist. He'll know I'm right and you're wrong!" And so I am contacted, and my supposedly expert opinion is requested. And if I happen to be answering mail that month, my response is often something like:

Dear Mr. Hing: I have read with intense interest your letter detailing your struggle with the question of whether your favorite savory spice should be spelled in English as "asafoetida" or whether you should heed your secretary's admonishment that all the kids today are spelling it "asafetida." I could note various factors potentially involved here; notably, the fact that in many cases, British/Commonwealth spelling retains many "ae"/"oe" digraphs whereas U.S./Canadian spelling strongly prefers an "e" ("foetus"/"fetus," etc.). But I will instead be (merely) democratic about this and note that if you use AltaVista (http://altavista.com, a well-known search engine) to run a search on "asafetida," it will say that across all the pages that AltaVista has indexed, there are "about 4,170" matched; whereas for "asafoetida" there are many more, "about 8,720." So you, with the "oe," are apparently in the majority.
To automate the task of producing such reports, I've written a small program called alta_count, which queries AltaVista for each term given and reports the count of documents matched:


% alta_count asafetida asafoetida
''asafetida: 4,170 matches''
''asafoetida: 8,720 matches''

At the time of this writing, going to http://altavista.com, putting a word or phrase in the search box, and hitting the Submit button yields a result page with a URL that looks like this:


http://www.altavista.com/sites/search/web?q=%22asafetida%22&kl=XX

Now, you could construct these URLs for any phrase with something like:


$url = 'http://www.altavista.com/sites/search/web?q=%22'
       . $phrase
       . '%22&kl=XX'  ;

But that doesn't take into account the need to encode characters such as spaces in URLs. If I want to run a search on the frequency of "boy toy" (as compared to the alternate spelling "boytoy"), the space in that phrase needs to be encoded as %20, and if I want to run a search on the frequency of "résumé," each "é" needs to be encoded as %E9.

The correct way to generate the query strings is to use the URI::Escape module:


use URI::Escape;    # That gives us the uri_escape function
$url = 'http://www.altavista.com/sites/search/web?q=%22'
       . uri_escape($phrase)
       . '%22&kl=XX'  ;

Now we just have to request that URL and skim the returned content for AltaVista's standard phrase "We found [number] results." (That's assuming the response comes with an okay status code, as we should get unless AltaVista is somehow down or inaccessible.)
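The skimming step can be tried offline against a canned string before pointing it at a live server. In this sketch the helper name extract_count is made up, and the page fragment is invented text that merely contains the phrase. It is not a real AltaVista page.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the match count out of a result page.
# In list context it also returns the count with commas removed.
sub extract_count {
    my ($content) = @_;
    return unless $content =~ />We found ([0-9,]+) results?/;
    my $count = $1;
    (my $number = $count) =~ tr/,//d;    # "4,170" becomes "4170" for arithmetic
    return wantarray ? ($count, $number) : $count;
}

# Made-up fragment, not a real result page:
my $page = '<b>We found 4,170 results</b>';
my ($shown, $numeric) = extract_count($page);
print "$shown matches (numerically $numeric)\n";
```

Matching on a fixed phrase like this is fragile: if the site rewords its result page, the pattern stops matching, which is why the full program reports "Page not processable" rather than assuming a match.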

Example 2-6 is the complete alta_count program.


Example 2-6. The alta_count program

#!/usr/bin/perl -w
use strict;
use URI::Escape;
foreach my $word (@ARGV) {
  next unless length $word; # sanity-checking
  my $url = 'http://www.altavista.com/sites/search/web?q=%22'
    . uri_escape($word) . '%22&kl=XX';
  my ($content, $status, $is_success) = do_GET($url);
  if (!$is_success) {
    print "Sorry, failed: $status\n";
  } elsif ($content =~ />We found ([0-9,]+) results?/) { # like "1,952"
    print "$word: $1 matches\n";
  } else {
    print "$word: Page not processable, at $url\n";
  }
  sleep 2; # Be nice to AltaVista's servers!!!
}

# And then my favorite do_GET routine:
use LWP; # loads lots of necessary classes.
my $browser;
sub do_GET {
  $browser = LWP::UserAgent->new unless $browser;
  my $resp = $browser->get(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}

With that, I can run:


% alta_count boytoy 'boy toy'
''boytoy: 6,290 matches''
''boy toy: 26,100 matches''

knowing that when it searches for the frequency of "boy toy," it is duly URL-encoding the space character.

This approach to HTTP GET query parameters, where we insert one or two values into an otherwise precooked URL, works fine for most cases. For a more general approach (where we produce the part after the ? completely from scratch in the URL), see Chapter 5.


HTTP POST

Some forms use GET to submit their parameters to the server, but many use POST. The difference is that POST requests pass the parameters in the body of the request, whereas GET requests encode the parameters into the URL being requested.

Babelfish (http://babelfish.altavista.com) is a service that lets you translate text from one human language into another. If you're accessing Babelfish from a browser, you see an HTML form where you paste in the text you want translated, specify the language you want it translated from and to, and hit Translate. After a few seconds, a new page appears, with your translation.

Behind the scenes, the browser takes the key/value pairs in the form:


urltext = I like pie
lp = en_fr
enc = utf8

and rolls them into an HTTP request:


POST /translate.dyn HTTP/1.1
Host: babelfish.altavista.com
User-Agent: SuperDuperBrowser/14.6
Content-Type: application/x-www-form-urlencoded
Content-Length: 40
  
urltext=I%20like%20pie&lp=en_fr&enc=utf8
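The body and its Content-Length can be reproduced with a few lines of core Perl, which is a handy way to check the arithmetic. This is a sketch only, not how LWP encodes forms. The helper names percent_encode and form_body are made up, and the input is assumed to be plain bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode anything outside the characters
# allowed in a query component. Assumes byte strings.
sub percent_encode {
    my ($raw) = @_;
    (my $encoded = $raw) =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $encoded;
}

# Hypothetical helper: turn a flat (key, value, key, value) list
# into the text of a form-encoded request body.
sub form_body {
    my @pairs = @_;
    my @pieces;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        push @pieces, percent_encode($key) . '=' . percent_encode($value);
    }
    return join '&', @pieces;
}

my $body = form_body(urltext => 'I like pie', lp => 'en_fr', enc => 'utf8');
print "Content-Length: ", length($body), "\n\n$body\n";
```

The Content-Length is simply the number of bytes in the encoded body, so it has to be computed after encoding, not before.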

Just as we used a do_GET( ) function to automate a GET query, Example 2-7 uses a do_POST( ) function to automate POST queries.


Example 2-7. The do_POST subroutine

use LWP;
my $browser;
sub do_POST {
  # Parameters:
  #  the URL,
  #  an arrayref or hashref for the key/value pairs,
  #  and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new(  ) unless $browser;
  my $resp = $browser->post(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}

Use do_POST( ) like this:


''doc'' = do_POST(''URL'', ''form_ref'' [, ''header'', ''value'', ...]);
(''doc'', ''status'', ''success'', ''resp'') = do_POST(''URL'', ''form_ref'' [, ''header'', ''value'', ...]);

The return values in scalar and list context are as for do_GET( ). The form_ref parameter is a reference to an array or a hash containing the form parameters. Any arguments after it are header names and values that you want sent in the request, passed as a flat list just as with do_GET( ).


Example: Babelfish

Submitting a POST query to Babelfish is as simple as:


my ($content, $message, $is_success) = do_POST(
  'http://babelfish.altavista.com/translate.dyn',
  [ 'urltext' => "I like pie", 'lp' => "en_fr", 'enc' => 'utf8' ],
);

If the request succeeded ($is_success will tell us this), $content will be an HTML page that contains the translation text. At the time of this writing, the translation is inside the only textarea element on the page, so it can be extracted with just this regexp:


$content =~ m{<textarea.*?>(.*?)</textarea>}is;

The translated text is now in $1, if the match succeeded.

Knowing this, it's easy to wrap this whole procedure up in a function that takes the text to translate and a specification of what language from and to, and returns the translation. Example 2-8 is such a function.


Example 2-8. Using Babelfish to translate

sub translate {
  my ($text, $language_path) = @_;

  my ($content, $message, $is_success) = do_POST(
    'http://babelfish.altavista.com/translate.dyn',
    [ 'urltext' => $text, 'lp' => $language_path, 'enc' => 'utf8' ],
  );
  die "Error in translation $language_path: $message\n"
   unless $is_success;

  if ($content =~ m{<textarea.*?>(.*?)</textarea>}is) {
    my $translation;
    $translation = $1;
    # Trim whitespace:
    $translation =~ s/\s+/ /g;
    $translation =~ s/^ //s;
    $translation =~ s/ $//s;
    return $translation;
  } else {
    die "Can't find translation in response to $language_path";
  }
}

The translate( ) subroutine constructs the request and extracts the translation from the response, cleaning up any whitespace that may surround it. If the request couldn't be completed, the subroutine throws an exception by calling die( ).

The translate( ) subroutine could be used to automate on-demand translation of important content from one language to another. But machine translation is still a fairly new technology, and the real value of it is to be found in translating from English into another language and then back into English, just for fun. (Incidentally, there's a CPAN module that takes care of all these details for you, called Lingua::Translate, but here we're interested in how to carry out the task, rather than whether someone's already figured it out and posted it to CPAN.)

The alienate program given in Example 2-9 does just this (the definitions of translate( ) and do_POST( ) have been omitted from the listing for brevity).


Example 2-9. The alienate program

#!/usr/bin/perl -w
# alienate - translate text
use strict;
my $lang;
if (@ARGV and $ARGV[0] =~ /^-(\w\w)$/s) {
  # If the language is specified as a switch like "-fr"
  $lang = lc $1;
  shift;
} else {
  # Otherwise just pick a language at random:
  my @languages = qw(it fr de es ja pt);
  # I.e.: Italian, French, German, Spanish, Japanese, Portuguese.
  $lang = $languages[rand @languages];
}

die "What to translate?\n" unless @ARGV;
my $in = join(' ', @ARGV);

print " => via $lang => ",
  translate(
    translate($in, 'en_' . $lang),
    $lang . '_en'
  ), "\n";
exit;

# definitions of do_POST() and translate(  ) go here

Call the alienate program like this:


% alienate [-''lang''] ''phrase''

Specify a language with -lang, for example -fr to translate via French. If you don't specify a language, one will be randomly chosen for you. The phrase to translate is taken from the command line following any switches.

Here are some runs of alienate:


% alienate -de "Pearls before swine!"
''=> via de => Beads before pigs!''

% alienate "Bond, James Bond"
''=> via fr => Link, Link Of James''

% alienate "Shaken, not stirred"
''=> via pt => Agitated, not agitated''

% alienate -it "Shaken, not stirred"
''=> via it => Mental patient, not stirred''

% alienate -it "Guess what! I'm a computer!"
''=> via it => Conjecture that what! They are a calculating!''

% alienate 'It was more fun than a barrel of monkeys'
''=> via de => It was more fun than a barrel drop hammer''

% alienate -ja 'It was more fun than a barrel of monkeys'
''=> via ja => That the barrel of monkey at times was many pleasures''