Grandma's Fridge

A TitanOfOld dev blog

Perl Performance: State Vs. My

I've seen several articles on Perl's state declaration, but not many on how it might help performance. Let's dig into that a bit.

This is my first post on Perl. I've been working with it for a couple years now since I first learned about it through O'Reilly School.

Man, I have learned a lot since then.

O'Reilly did a very good job getting me going with Perl, but they were several years behind modern Perl recommendations such as use strict and use warnings, never mind using my to limit variable scope.

Imagine my surprise when I discovered the little-discussed state.

state declares a lexically scoped variable just as my does, but the variable is initialized only once and keeps its value instead of being destroyed when it goes out of scope. (It needs use feature 'state', or use v5.10 or later.) This is particularly useful for variables you use for iterations:

#!/usr/bin/env perl
use v5.16;
use warnings;

while (1) {
    state $i = 5;
    say $i; # 5, 4, 3, 2, 1, 0
    last unless $i--;
}

Okay, so that's a bit useless and in general just bad. But, this snippet shows that $i is only accessible within its scope and preserves its value from the previous iteration.
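For a less contrived illustration, here is a minimal sketch of the same behavior inside subroutines, where state usually earns its keep. The sub names count_with_my and count_with_state are made up for this example.

```perl
#!/usr/bin/env perl
use v5.16;
use warnings;

# Hypothetical helpers, for illustration only.
sub count_with_my {
    my $n = 0;       # re-created and reset to 0 on every call
    return ++$n;
}

sub count_with_state {
    state $n = 0;    # initialized on the first call only
    return ++$n;
}

my @my_results    = map { count_with_my() } 1 .. 3;
my @state_results = map { count_with_state() } 1 .. 3;

say "my:    @my_results";       # my:    1 1 1
say "state: @state_results";    # state: 1 2 3
```

Both variables are invisible outside their subroutines; the only difference is whether the value survives between calls.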

Now for something a little more useful that can also improve the performance of your script. The following snippet is an example of something I have to do at work to import data from a horribly relaxed FoxPro database into a more sensible PostgreSQL database. There are somewhere around 120,000 records that I have to migrate from a useless database to a real database that can do real things with data, like store it properly. One thing that's really, really, really bad is date and time storage in FoxPro…there's no such thing. You just store a string. If you're a sensible developer, you may store the string in some logical format such as the ISO-style YYYY-MM-DD HH:mm:SS.ssss. But in the real world there are these things called "novice developers", a.k.a. fools. As a result, I can't just take the date string out of the FoxPro "database" and send it to PostgreSQL; I have to parse it first.

Perl has thousands of modules to help with a lot of tasks. For date and time handling, my favorite is DateTime. Yes, DateTime is a bit of a heavyweight, but it is accurate and correct, which is just what I want out of a module. The initialization costs are high, though. Wouldn't it be better if I only initialized the parser once instead of 120,000 times? You can bet your colloquial donkey it would!

#!/usr/bin/env perl
use v5.16;
use warnings;

use Benchmark qw(:hireswallclock :all);
use DateTime::Format::Strptime;
use DateTime::Format::Pg;

my $result = timethese(
    120_000,
    {
        'My'    => "using_my('20120927', '11:53:26')",
        'State' => "using_state('20120927', '11:53:26')"
    }
);

cmpthese($result);

# These subroutines are identical except for one using 'my' and the other
# using 'state'.
sub using_my {
    my $date = shift;
    my $time = shift;
    my $strp = DateTime::Format::Strptime->new(
        pattern   => '%Y%m%d %T',
        locale    => 'en_US',
        time_zone => 'America/New_York'
    );
    return DateTime::Format::Pg->format_datetime(
        $strp->parse_datetime("$date $time") );
}

sub using_state {
    my $date = shift;
    my $time = shift;
    state $strp = DateTime::Format::Strptime->new(
        pattern   => '%Y%m%d %T',
        locale    => 'en_US',
        time_zone => 'America/New_York'
    );
    return DateTime::Format::Pg->format_datetime(
        $strp->parse_datetime("$date $time") );
}

$ ./test.pl
Benchmark: timing 120000 iterations of My, State...
        My: 48.0714 wallclock secs (48.00 usr +  0.00 sys = 48.00 CPU) @ 2500.00/s (n=120000)
     State: 14.5937 wallclock secs (14.55 usr +  0.00 sys = 14.55 CPU) @ 8247.42/s (n=120000)
        Rate    My State
My    2500/s    --  -70%
State 8247/s  230%    --

That's a significant reduction in processing time: the state version handles about 3.3 times as many calls per second. Why? With my, the Strptime parser is constructed every time the subroutine is called. It has to be, because once the subroutine returns, $strp goes out of scope, the last reference to the object disappears, and Perl destroys it, so there's nothing to reuse on the next call. With state, the initializer runs only on the first call and the object survives between calls, while the variable is still visible only inside the subroutine.
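To see the "initialized once" part without installing anything from CPAN, here is a rough sketch that swaps the real parser for a stand-in class that counts how often its constructor runs. Expensive::Parser and both subs are hypothetical names invented for this example; they are not part of DateTime.

```perl
#!/usr/bin/env perl
use v5.16;
use warnings;

# Stand-in for an expensive constructor such as
# DateTime::Format::Strptime->new. Hypothetical, for illustration only.
package Expensive::Parser {
    my $constructed = 0;
    sub new         { $constructed++; return bless {}, shift }
    sub constructed { return $constructed }
    sub reset_count { $constructed = 0 }
}

sub parse_with_my {
    my $parser = Expensive::Parser->new;       # runs on every call
    return $parser;
}

sub parse_with_state {
    state $parser = Expensive::Parser->new;    # runs on the first call only
    return $parser;
}

parse_with_my() for 1 .. 1000;
say "my:    ", Expensive::Parser->constructed, " constructions";    # 1000

Expensive::Parser->reset_count;
parse_with_state() for 1 .. 1000;
say "state: ", Expensive::Parser->constructed, " constructions";    # 1
```

The benchmark numbers above are simply this difference multiplied by the real cost of building a Strptime object.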

In short, state is awesome, but it can be very sneaky. Check this out:

sub this_does_not_mean_what_you_think_it_means {
  my $date = shift;
  my $time = shift;
  state $strp = DateTime::Format::Strptime->new(
    pattern   => '%Y%m%d %T',
    locale    => 'en_US',
    time_zone => 'America/New_York',
    on_error  => sub { die "Couldn't parse $date $time"}
  );
}

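What do you think the error message will print? If you said the date and time it failed to parse, give yourself a smack on the forehead. It will print the first date and time pair you ever passed in. The state initializer runs only once, so the on_error closure is created during the first call and captures that call's $date and $time; later calls get fresh variables that the closure never sees. (If you got that right, have a cookie.)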
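Here is a small sketch of that trap and one way around it, using plain closures so it runs without the DateTime modules. The subs stale_message and fresh_message are hypothetical names for this example. The fix shown is simply to hand the values to the callback as arguments instead of capturing them.

```perl
#!/usr/bin/env perl
use v5.16;
use warnings;

# Hypothetical subs for illustration; no DateTime modules involved.
sub stale_message {
    my ($date, $time) = @_;
    # The anonymous sub is built on the first call only, so it keeps
    # that call's $date and $time forever.
    state $on_error = sub { return "Couldn't parse $date $time" };
    return $on_error->();
}

sub fresh_message {
    my ($date, $time) = @_;
    # Still built once, but the values arrive as arguments on every
    # call instead of being captured.
    state $on_error = sub {
        my ($d, $t) = @_;
        return "Couldn't parse $d $t";
    };
    return $on_error->($date, $time);
}

say stale_message('20120927', '11:53:26');    # Couldn't parse 20120927 11:53:26
say stale_message('20121001', '08:00:00');    # Couldn't parse 20120927 11:53:26
say fresh_message('20120927', '11:53:26');    # Couldn't parse 20120927 11:53:26
say fresh_message('20121001', '08:00:00');    # Couldn't parse 20121001 08:00:00
```

With the real parser you don't call the callback yourself, so the same idea means relying on whatever arguments the module passes to on_error rather than on captured variables. As far as I can tell from the DateTime::Format::Strptime documentation, that is the parser object and an error message, but check the docs for your installed version.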