Archive for Perl

Right Tool for the Right Job

On one of my projects the requirement was to take two CSV files, one with words and data and one with just words. The goal was to use the list of words to retrieve the data from the other file. Long-term storage for the datasets is not needed, so I didn't see a reason to put them in a database. Typically the language used at this company is PHP (though I have sneaked in a few Ruby scripts... shhhh), so my first try was to load each file into an array in memory and do a lookup. It worked fine for my sample files but croaked when I fed it the real data. My real data file was some 180k words and my word list was 400k words. So I tried using a temporary table. I started it when I left for the day and it was still going the next morning. This was unbearably slow and never actually completed. I could probably have tried to optimize or choose different table types, but I didn't... because I was thinking, hmmmmm... wonder if Perl will be faster? It was. Not only did it complete, it finished in minutes (when printing each word out to the display) or in seconds (in a more silent mode). I showed the people who would be using it and they were amazed. This sort of thing takes hours in Excel and crashes sometimes.

PERL:
use Parse::CSV;

# load data list into hash
my $parser = Parse::CSV->new( binary => 1, file => $infile_name );
my %DATA;    # not "= {}", which would add a stray stringified-reference key
my ( $line, $word );
print "\n\nNow indexing data file\n";
while ( $line = $parser->fetch ) {
    next if !$line;
    $word = shift @$line;    # first column is the lookup key
    next if !defined($word) or $word eq '';
    $DATA{$word} = $line;    # rest of the row, kept as an array ref
    print $verbose ? "Indexing $word\n" : '.';
}

The files sometimes have foreign characters in them, so I open them in binary mode. Parse::CSV is a wrapper for the Text::CSV_XS class and is much easier to use. It's written by Adam Kennedy; I had a question about it and was able to find him on IRC and ask him myself, and got some help from someone else in the room too. Pretty cool! Back to the code: I skip if the line is not defined, then shift off the first element of the array. The fetch method always returns an array reference, so I must prefix it with @ to access it as an array. If that first element is not defined or is blank, I skip the line; otherwise I use it as the key to store the reference to the rest of the array. If the verbose option was used, I display a longer notification to the user; otherwise I display just a period. That way the user knows it's working!

PERL:
use Spreadsheet::WriteExcel;

# set up Excel file
my $workbook  = Spreadsheet::WriteExcel->new($outfile_name);
my $worksheet = $workbook->add_worksheet();

# process word list
$parser = Parse::CSV->new( binary => 1, file => $wordfile_name );
my $row_count = 0;
my ( @row, $row_ref );
print "\nNow starting lookup...\n";
while ( $line = $parser->fetch ) {
    @row  = ();
    $word = shift @$line;
    next if !defined($word) or $word eq '';

    if ( exists $DATA{$word} ) {
        # word found
        $row_ref = $DATA{$word};    # get ref to array (the semicolon was missing)
        print $verbose ? "Found $word\n" : '+';
        push @row, $word;
        push @row, @$row_ref;
        $worksheet->write_row( $row_count, 0, \@row );
    } else {
        print $verbose ? "Not Found $word\n" : '-';
        $worksheet->write_row( $row_count, 0, [ $word, 'not found' ] );
    }
    $row_count++;
}
$workbook->close();    # make sure the spreadsheet is flushed to disk

Similar to the loading of the hash, I open the file with Parse::CSV as before. I initialize my variables and set a row_count used to write to the Excel file. This file is supposed to be a single-column list of words, so I could probably get away with a simple file read, but just in case, I treat it as a CSV file. Once I get the word variable, I see if it exists as a key in my %DATA hash. Funny thing about Perl regarding hashes and arrays: when referring to the whole thing you use % for a hash and @ for an array, but when referring to a particular element, you use $. Kinda funny, but in a way it makes sense... hehe. :P I simplified this code a bit so the concept comes through more clearly. I used @row as a temporary holder for the data I want to write to the Excel file; I took out some other code here that did more with @row. Finally I write the row array to the Excel file. The write_row method requires an array ref, and \@row gives me a reference.
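
For comparison, here is a minimal sketch of the same index-then-lookup idea in Ruby. It is not a port of the real script: the sample data, the build_index and lookup_rows names are made up for illustration, fields are split on plain commas (so quoted CSV fields are not handled), and the rows are returned as arrays instead of being written to a spreadsheet.

```ruby
# Build a word => data lookup table from CSV-like text.
# Simplification: fields are split on commas, so quoted fields are not handled.
def build_index(data_text)
  index = {}
  data_text.each_line do |line|
    word, *rest = line.chomp.split(',')
    next if word.nil? || word.empty?   # skip blank lines and empty keys
    index[word] = rest                 # remaining columns for this word
  end
  index
end

# For each word in the list, emit either the word plus its data
# or the word plus a 'not found' marker.
def lookup_rows(index, words_text)
  rows = []
  words_text.each_line do |line|
    word = line.chomp.split(',').first
    next if word.nil? || word.empty?
    rows << (index.key?(word) ? [word, *index[word]] : [word, 'not found'])
  end
  rows
end

DATA_TEXT = "apple,3,red\nbanana,7,yellow\n"   # made-up sample data
WORD_TEXT = "banana\ncherry\napple\n"          # made-up word list

INDEX = build_index(DATA_TEXT)
ROWS  = lookup_rows(INDEX, WORD_TEXT)
ROWS.each { |row| puts row.join("\t") }
```

The point is the same as in the Perl version: the data file is read once into a hash, so each word in the list costs a single hash lookup instead of a scan through the whole data set.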

That's basically it. The loading of the %DATA hash is, of course, dependent on how much memory you have. If you have problems you could try tying it to a DB file:

PERL:
use DB_File;

# do this before the indexing loop, so every $DATA{$word} write goes to the file
tie %DATA, 'DB_File', 'output.db' or die "cannot open output.db: $!";

Building a DB file also has the advantage of persistence. If you need to run a lot of lookups multiple times, you may be interested in using it. It will be slower to some extent, but it will work! :)

My script could probably be optimized some, and perhaps golfed some, which could make it faster. But I try to write Perl in English, at least for the first pass. Then perhaps I can use the shortcuts and see if it runs faster. This was a fun project and I got to learn more Perl. My friends Liz and Yaakov helped me and I thank them :)

Syntax .. Smintax ..

Last week I was called upon to write some Perl - something I haven't done since last fall. It was funny how as I was working on it, it started to come back to me. So I thought it'd be fun to compare my favorite languages a little bit:

Look at how arrays are defined and the way I like to loop through them:

PERL:
my @books = ('Learning Perl', 'Advanced Perl Programming', 'Perl Best Practices');

foreach my $book (@books) {
  print "* $book\n";
}

Foreach is actually a synonym for "for", and some prefer foreach in this use because it makes the loop more readable. It also looks like PHP...

Now, for the language that has consumed my time the past 6 months...

PHP:
$books = array('Pro PHP Security','PHP Cookbook','Pro PHP XML and Web Services');

foreach($books as $book) {
  print "* $book\n";
}

And... here's Ruby:

RUBY:
books = ['Programming Ruby','Ruby Cookbook','Mr. Neighborly\'s Humble Little Ruby Book']

for book in books
  print "* #{book}\n"
end

I never actually use the for loop like that with an array. I usually use this version, which is what the for loop calls behind the scenes anyway:

RUBY:
books.each do |book|
  print "* #{book}\n"
end
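
One small difference between the two Ruby forms, sketched here with made-up method names: both walk the array with each, but for leaves its loop variable defined in the surrounding scope, while a block parameter stays local to the block.

```ruby
BOOKS = ['Programming Ruby', 'Ruby Cookbook']

# `for` does not open a new scope: `book` is still set after the loop.
def last_seen_with_for(books)
  for book in books
    # loop body intentionally empty
  end
  book
end

# With `each`, the block parameter is local to the block,
# so `title` is not defined once the block has finished.
def visible_after_each?(books)
  books.each { |title| title }
  !defined?(title).nil?
end

FOR_RESULT  = last_seen_with_for(BOOKS)
EACH_RESULT = visible_after_each?(BOOKS)
puts "after for:  #{FOR_RESULT.inspect}"
puts "after each: title visible? #{EACH_RESULT}"
```

In everyday code it rarely matters, but it is one reason the each form is usually the safer habit.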

Fun stuff :)

Feb Ruby Meeting Report - Capistrano and Starfish

I came to the meeting knowing a bit about Capistrano and nothing about Starfish, and left with a firm grasp of the basic concepts of both!
In short:

  • Capistrano - A tool for deploying actions on multiple servers. Not necessarily for Rails and you don't need Ruby on the deployment servers! Presented by Michael H Buselli
  • Starfish - Distributed programming in Ruby. Presented by Peter Chan

In Long(er):

Capistrano
Like Rails, this tool relies on convention over configuration and makes some assumptions about your environment, such as Rails, Subversion, Apache 1.x and FastCGI. Of course you can override some of these assumptions and even use it with PHP and CVS (yikes). Future versions will be completely separate from Rails. I know people who stiffen at any mention of Rails, but really, this is how tools are born: out of a need. This one just so happened to be a need of Rails developers, so it makes sense that it is naturally easier to use for deploying a Rails site.

Commands:
The basic commands are run, sudo (run as root), put, delete, render (returns the output of an ERB template) and get. In addition, you can add your own commands.

Tasks:
You group commands into tasks, similar to a batch file or shell script. An interesting thing is that if your task is called "say_hello", you can also have tasks "before_say_hello" and "after_say_hello" that will run before and after it, respectively. This might be useful for making "changes" to some of the standard tasks, doing any preparation or cleanup without having to hack the code. The question was asked whether you could define "before_before_say_hello" and yes, nested hooks like that do work, though I think it could get pretty confusing!
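
To make the naming convention concrete, here is a toy task runner in plain Ruby. It is not Capistrano's code: the TinyRunner class and its methods are invented for illustration, and run only records the command instead of sending it to a server. It just shows how a runner could find before_ and after_ tasks by name.

```ruby
# Toy task runner (hypothetical, not Capistrano's implementation).
class TinyRunner
  attr_reader :log

  def initialize
    @tasks = {}
    @log = []
  end

  # Register a named task.
  def task(name, &body)
    @tasks[name.to_sym] = body
  end

  # Run before_<name>, then <name>, then after_<name>, when they exist.
  # Hooks go through invoke too, so before_before_<name> also works.
  def invoke(name)
    name = name.to_sym
    return unless @tasks.key?(name)
    invoke(:"before_#{name}")
    instance_exec(&@tasks[name])
    invoke(:"after_#{name}")
  end

  # Stand-in for a remote command: just record it.
  def run(command)
    @log << command
  end
end

RUNNER = TinyRunner.new
RUNNER.task(:say_hello)               { run 'echo hello' }
RUNNER.task(:before_say_hello)        { run 'echo getting ready' }
RUNNER.task(:before_before_say_hello) { run 'echo very first' }
RUNNER.task(:after_say_hello)         { run 'echo cleaning up' }
RUNNER.invoke(:say_hello)
puts RUNNER.log
```

Because each hook is looked up the same way as any other task, the nesting falls out for free, which is also why it could get confusing quickly.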

Roles:
Machines are grouped by roles, such as "web" and "db", and you can have multiple machines in each role. The db role is unique in that you specify one machine as primary, because that's where the migrations are run (I'm guessing the database is then just copied to the other database servers?).

Putting them together:
You can specify on a task which machine role it applies to, such as:

task :say_hello_to_webservers, :roles => :web do
  run 'echo "hello world"'
end
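
Here is a toy sketch of the role idea in plain Ruby (again, not Capistrano's own code): machines are grouped under role names, and a task tagged with a role only targets that group. The host names and the hosts_for helper are made up.

```ruby
# Hypothetical role table: role name => machines in that role.
ROLES = {
  web: ['web1.example.com', 'web2.example.com'],
  db:  ['db1.example.com']
}.freeze

# Return the machines a task should run on for the given role(s).
def hosts_for(roles)
  Array(roles).flat_map { |role| ROLES.fetch(role, []) }.uniq
end

WEB_HOSTS = hosts_for(:web)
ALL_HOSTS = hosts_for([:web, :db])
WEB_HOSTS.each { |host| puts "#{host}: echo \"hello world\"" }
```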

Anyway, those are the basics as I understood them. Please correct me if I am off base.
Link to presentation and resources: http://www.cosinewave.net/ruby/cap

Here's a blog posting that describes how to use Capistrano with Perl or PHP, which I bookmarked some time ago. It may be a little out of date but probably still has some good information.
---

Then we had a brief moment of fun as we watched this (which is no joke!)

Erlang - http://tinyurl.com/ytgp27

---

Starfish
This was interesting, as I have never done distributed programming or had a need to, but I'm always wondering how things work. The presenter said he thinks this approach is one of Google's secret weapons in making things load faster.
The starfish file consists of two sections -- the server and the client. The server section describes the process and the client section describes the output. Once that is set, you run the starfish program and the first time you run it, it starts a client and a server. To start another process, run the starfish command again and this time it sees there's a server already started and then just starts up another client.
ex:
starfish find_primes.rb # starts server, client
starfish find_primes.rb # starts another client
Pretty neat. I had to leave before the end of this talk since I have such a long commute home, but I got the gist of what Starfish is and know where to look if I need distributed programming in the future!

Links to his demo files: http://oaktop.com/go/starfish/

---

Live in Chicago? Join Chirb, the Chicago Ruby group. Can't make it downtown? There are some individuals starting meetings in the burbs; join the mailing list for details!

Perl, PHP and Ruby oh my!

This week, I attended the Perl and PHP meetings. Though they were late nights for me, they were good meetings and worth the loss of sleep.

Perl - Catalyst
Catalyst is an MVC framework for Perl. I like Rails a lot, but not just because it's Rails: because it's Ruby. I love Ruby. I really like Perl also, but I'm not so sure it's a fantastic language for developing for the web. After seeing the presentation I say, that's nice, but I will probably use Rails if I were to use an MVC framework. I'm wearing my Perl shirt today!

By the way, if you are south of Chicago there's a Perl meetup in Tinley Park on Jan 24, 7pm

Caribou Coffee
16205 Harlem Ave
Tinley Park, IL 60477
(708) 444-0478

PHP - Firebug
We didn't really have any topics planned out for this one, but we had some volunteers. Peter did a good overview of this indispensable tool for debugging JavaScript, CSS and HTML. Unfortunately, it only works in Firefox, but there is a JavaScript add-on you can use with IE to get a few features that make your life a bit more bearable there. Larry jumped in and showed some JavaScript debugging. Next month at PHP: a profiling fest with Xdebug and Valgrind. Should be interesting!

Ruby - no love!
I've been moping this week about Ruby... I haven't done any in about 4-5 months, since being back in PHP-Land. I'm having withdrawals. It all started after I read this article, Technologies of the Year 2006 (BTW, I have done all of them!). I was thinking... awwww... Rails... and I picked up the Rails book at work that I look at when I am fed up with PHP, just to cheer me up a bit. I skipped the last Ruby meeting: at the last minute the buddies that I thought would go with me backed out, and, well, I was pretty tired and the topic of "Environments" made me think it would be a Mac love fest. I used to want a Mac laptop, but I've decided to stick with the PC environment since I am happiest in Ubuntu. I am determined, buddies or no buddies, to go to the next meeting, which should prove more interesting. The Chicago Ruby list has been buzzing with topics and volunteers for presentations. There's even an outbreak of smaller meetups in other areas of the city... North, South... fun times.

To Generate or To Template

that is the question ...

read some of my ramblings on the subject here

and vote here!

Brand New Year

Happy 2006!! I can't believe it’s a new year already! Here's how I spent the last few days of 2005 and first day of 2006.

Background
When I was in college, around 1999, my co-worker did some Perl to process forms and email the data. I was more of their graphics person. He hated it. I'd say, "Oh Jayson! I have another form for you!" I don't know if he disliked Perl so much or if processing forms is just a boring task (I agree now). I dabbled a little in Perl but thought it was difficult and obscure. I thought Perl geeks wrote it so they could boast about how clever they were because their code did X function in ONLY 25 characters! NYAH! Then I happened to find PHP, learned it, did a site for my then boyfriend (now husband), and that got me a job as a PHP programmer. Like a horse with blinders, I pretty much just did PHP (and JavaScript) until the second half of this year, err, last year.

I've been learning Perl (re-learning? I don't know how much Perl I really knew at any one time). I've made some friends who are Perl programmers (Andy Lester and Liz) and they aren't like that at all, although they keep telling me that Perl is the "One True Language". Even the folks in the chatterbox at PerlMonks.org have been very nice in answering my noob questions.

I'm digging Perl, at least once you get through "The Gory Details" in Programming Perl, aka The Camel Book.

Goal
Maybe I'm a strange person, but I like testing. I've been working with Selenium to write tests for DotProject. Watch for an upcoming article on it at CodeSnipers.com this week. Andy Lester has a project called Phalanx, which has the goal of getting 100 or so Perl modules to have complete tests and documentation. Since I don't have any grand idea for a module that doesn't already exist among the 9283 modules on CPAN, I thought, hey, if I can help with testing a little bit then I can contribute in some tiny part.

Tools
Test::More - an uber-simple test mod which should work for tests in most cases. I asked Andy Lester if I should be using Test::Unit for testing a module (in PHP I would use PHPUnit), but he said that Test::More is all I need.

Pod::Coverage - a mod that checks how much of the module is covered by POD (plain old documentation). It's good to have each subroutine documented (called "covered", as opposed to "uncovered"), and it gives you a percentage; 100% is the goal. It will also list the subs that are uncovered. There is a way to set it to skip certain subs (private subs, for example). You can run Pod::Coverage as part of your Test::More suites as well. Nifty.
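
To show what "covered" and "uncovered" mean, here is a rough sketch in Ruby that does the same kind of bookkeeping on a made-up Perl snippet. It is only an illustration and not how the real module works internally: the pod_coverage function and the sample source are invented, a sub counts as documented when a =head or =item line names it, and subs starting with an underscore are treated as private and skipped.

```ruby
# Rough illustration only: report which public subs in a Perl source string
# have a matching "=headN name" or "=item name" POD entry.
def pod_coverage(perl_source)
  subs        = perl_source.scan(/^sub\s+(\w+)/).flatten
  documented  = perl_source.scan(/^=(?:head\d|item)\s+(\w+)/).flatten
  public_subs = subs.reject { |name| name.start_with?('_') }   # skip private subs
  uncovered   = public_subs - documented
  covered     = public_subs.size - uncovered.size
  percent     = public_subs.empty? ? 100.0 : 100.0 * covered / public_subs.size
  { percent: percent, uncovered: uncovered }
end

# Made-up Perl module body used as sample input.
SAMPLE = <<~'PERL'
  =head2 new

  Constructor.

  =cut

  sub new { return bless {}, shift }
  sub speak { print "hi\n" }
  sub _helper { 1 }
PERL

REPORT = pod_coverage(SAMPLE)
puts "coverage: #{REPORT[:percent]}%"
puts "uncovered: #{REPORT[:uncovered].join(', ')}"
```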

Devel::Cover - a mod that checks your code coverage. I couldn't get this to work, since it is not compiled for Windows and the Perl version on my server is a few point releases behind. Looking at the docs, it can report statement, branch, condition, path, subroutine, pod and time coverage. I was not able to find out too much about how it works, since I can't get it to run. This module is in alpha stage at this point, so I'm sure more information will be available soon.

I also made my first module, which was just a simple class I stole from an example. I couldn't get my module to run, so I went to PerlMonks.org and asked them. They said, OH, you have your code after the __END__; it should be before. Doh! Once I did that, it was fine.

Well, that’s about it for now.
