14.170: Programming for Economists

14.170: Programming for Economists

14.170: Programming for Economists 1/12/2009-1/16/2009 Melissa Dell Matt Notowidigdo Paul Schrimpf Perl (for economists) Perl overview slide This short lecture will go over what I feel are the primary uses of Perl (by economists) To use Perls built-in data structures to implement algorithms with asymptotically superior runtime (as compared to Stata/Matlab) Web crawlers to automatically download data. At MIT, I know Paul Schrimpf, Tal Gross, Tom Chang, Mar Reguant Rid and I have all used Perl for this purpose

Web crawlers also used in Ellison & Ellison, Shapiro & Gentzkow, Greg Lewis job market paper, Price and Wolfers). To parse structured text for the purposes of creating a dataset (oftentimes, after that dataset was downloaded by a web crawler) Where to learn Perl Todays goals Learn how to run Perl Learn basic Perl syntax Learn about hash tables See example code doing each of the

following: Preparing data Downloading data Parsing data How to run Perl In theory, Perl is cross-platform. You can write [it] once, run [it] anywhere. In practice, Perl is usually run on UNIX or Linux. In the econ computer cluster, you cant install Perl on Windows machines because they are a (perceived) security risk. So in econ cluster you will have to run on UNIX/ Linux using SecureCRT or some other terminal emulator. Alternatively, you can go to Athena cluster in basement of E51 and run Perl on the Athena computer

Perl is installed on every UNIX/Linux machine by default. How to run Perl, cont SSH into UNIX server blackmarket/shadydealings/etc. (open TWO windows, one window for writing code, one window for running the code) Use emacs (or some other text editor) to edit the Perl file. Make sure the suffix of the file is .pl and then you can run the file by typing perl myfile.pl at the command line To start emacs, type emacs myfile.pl and myfile.pl will be created (click tools on 14.170 course webpage where there is a nice emacs introduction). Its worth learning emacs if you will be writing a lot of Perl code How to run Perl, cont Basic Perl syntax

3 types of variables: scalars arrays hash tables They are created using different characters: scalars are created as $scalar arrays are created as @array hash tables are created as %hashtable So the $ @ % characters tell Perl what is the TYPE of the variable. This is obviously not very clear syntax. In Java, for example, here is how you create an array and a hash table: ArrayList myarray = new ArrayList(); Hashtable myhashtable = new Hashtable(); In Perl the same code is the following: @mylist

= (); %myhashtable = (); Hello World! #!/usr/bin/perl $hello1 = "Hello World!\n"; $econ = 14; @hello2 = ("Hello World!\n", "Hello World again!\n"); print $hello1; print $hello2[0]; print $hello2[1]; print $econ; Control structures #!/usr/bin/perl $top = $ARGV[0]; for ($i = 1; $i < $top; $i++) { if ( int($i / 7) == ($i / 7) ) {

print "$i is a multiple of 7!\n"; } } @ARGV #!/usr/bin/perl $i=1; foreach $arg (@ARGV) { print "Argument $i was $arg \n"; $i+=1; } Regular expressions #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^perl/) { print "The word $arg starts with perl!\n"; } }

Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^([a-zA-Z]+)$/) { print "The argument $arg contains only characters!\n"; } else { if ($arg =~ /^([a-zA-Z0-9]+)$/) { print "The argument $arg contains only numbers and characters!\n"; } else { print "The argument $arg contains non-alphanumeric characters!\n"; } } } Regular expressions, cont #!/usr/bin/perl

foreach $arg (@ARGV) { if ($arg =~ /^\d\d\d\-\d\d\d\-\d\d\d\d$/) { print "$arg is a valid phone number!\n"; } else { print "$arg is an invalid phone number!\n"; } } Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\d{3})-(\d{3})-(\d{4})$/) { print "$arg is a valid phone number!\n"; } else { print "$arg is an invalid phone number!\n"; } }

Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\d{3})-(\d{3})-(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: $1 \n"; print " number: $2-$3 \n"; } else { print "$arg is an invalid phone number!\n"; } } Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^\(?(\d{3})\)?-(\d{3})-(\d{4})$/) {

print "$arg is a valid phone number!\n"; print " area code: $1 \n"; print " number: $2-$3 \n"; } else { print "$arg is an invalid phone number!\n"; } } Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^\(?(\d{3})\)?-(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: $1 \n"; print " number: $2-$3 \n"; }

else { print "$arg is an invalid phone number!\n"; } } Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\(?(\d{3})\)?)?-?(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: " . ($2 eq "" ? "unknown" : $2) . " \n"; print " number: $3-$4 \n"; } else { print "$arg is an invalid phone number!\n"; } }

Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\(?(\d{3})\)?)?-?(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: " . ($2 eq "" ? "unknown" : $2) . " \n"; print " number: $3-$4 \n"; } else { print "$arg is an invalid phone number!\n"; } } QUIZ: What would happen to the following patterns? 5555555555 (666)666-6666 (777)-7777777

Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\(?(\d{3})\)?)?-?(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: " . ($2 eq "" ? "unknown" : $2) . " \n"; print " number: $3-$4 \n"; } else { print "$arg is an invalid phone number!\n"; } } QUIZ: What would happen to the following patterns? 5555555555 (666)666-6666 (777)-7777777

Regular expressions, cont #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\(?(\d{3})\)?)?-?(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: " . ($2 eq "" ? "unknown" : $2) . " \n"; print " number: $3-$4 \n"; } else { print "$arg is an invalid phone number!\n"; } } QUIZ: What would happen to the following patterns? (5555555555 666)-666-6666 Regular expressions, cont

#!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(\(?(\d{3})\)?)?-?(\d{3})-?(\d{4})$/) { print "$arg is a valid phone number!\n"; print " area code: " . ($2 eq "" ? "unknown" : $2) . " \n"; print " number: $3-$4 \n"; } else { print "$arg is an invalid phone number!\n"; } } QUIZ: What would happen to the following patterns? (5555555555 666)-666-6666 Parsing HTML #!/usr/bin/perl

foreach $arg (@ARGV) { if ($arg =~ /^

(.*)<\/td>

(.*)<\/td><\/tr>$/) { print "data: $1, $2\n"; } }

210 ROW 13
ROUND 3 HG 3 TICKETFAST
$85.00

223 ROW 04
ROUND 3 HG 3 TICKETFAST
$90.00

... ... ... ## header row in TAB-delimited file print "ticketId\tsection\tmaxAvailable\tprice\n"; ## fields that parser will try to detect $ticketId = "null"; $price = "null"; $maxAvailable = "null"; $section = "null"; Parsing HTML $on = 0; open(FILE, $ARGV[0]); while ($line = ) { if ($on eq 0 and $line =~ /

if ($on eq 1) { if ($line =~ /addToCart\(\'(.*?)\'\)/) { $ticketId = $1; } if ($line =~ /

} } } close(FILE); Parsing HTML Using control structures for data preparation EXAMPLE: Find all 1-city layover flights given data set of available flights SFO ORD

CMH RCA CHO origin SFO ORD ORD CMH ORD RCA CHO RCA dest ORD SFO CMH

ORD RCA ORD RCA CHO carrier Delta Delta Delta Delta Delta Delta Delta Delta Hash Tables Lets go back to Lecture 1 LAYOVER BUILDER ALGORITHM

In the raw data, observations are (O, D, C, . , . ) tuple where O = origin D = destination C = carrier string and last two arguments are missing (but will be the second carrier and layover city ) FOR each observation i from 1 to N FOR each observation j from i+1 to N IF D[i] == O[j] & O[i] != D[j] CREATE new tuple (O[i], D[j], C[i], C[j], D[i]) Hash Tables Lets loosely prove the runtime FOR each observation i from 1 to N FOR each observation j from i+1 to N IF D[i] == O[j] & O[i] != D[j] CREATE new tuple (O[i], D[j], C[i], C[j], D[i]) First line is done N times. Inside the first loop, there are N i iterations. Assume the last two lines take O(1) time (as they

would in Matlab/C). Then total runtime is (N-1 + N-2 + 2 + 1)*O(1) = O(0.5*N*(N 1) = )O(N2) Hash Tables Lets imagine augmenting the algorithm as follows: NEW(!) LAYOVER BUILDER ALGORITHM FOR each observation i from 1 to N LIST p = GET all flights that start with D[i] FOR each observation j in p IF O[i] != D[j] CREATE new tuple (O[i], D[j], C[i], C[j], D[i]) Hash Tables Whats the runtime here FOR each observation i from 1 to N LIST p = GET all flights that start with D[i] FOR each observation j in p IF O[i] != D[j]

CREATE new tuple (O[i], D[j], C[i], C[j], D[i]) (LOOSE proof) First line is done N times. Inside the first loop, there is a GET command. Assume that the GET command takes O(1) time. Then there are K iterations in the second FOR loop (where K is number of flights that start with D[i]; assume for simplicity this is constant across all observations). Assume, as before, that the last two lines take O(1) time (as they would in Matlab/C). Then total runtime is (N*K)*O(1) = O(K*N) NOTE 1: If K is constant (i.e. doesnt scale with N), then this algorithm is O(N). K being constant is not an unreasonable assumption. It means that as you add more origindestination pairs, the number of flights per airport is constant (i.e. the density of the O-D matrix is constant as N gets larger) NOTE 2: The magic is the O(1) line in the GET command. If that command took O(N) time instead (say, because it had to look through every observation), then the algorithm would be O(N2) as before. Thus we need a data structure that can return all flights that start with D[i] in constant time. Thats what a hash table is used for. Think of a hash table as DICTIONARY. When you want to look up a word in a dictionary, you dont naively look through all the pages, you sorta know where you want to start looking.

Hash table syntax #!/usr/bin/perl foreach $arg (@ARGV) { if ($arg =~ /^(.+)=(.+)$/) { $hashtable{$1} = $2; } } print $hashtable{"economics"} . "\n"; print $hashtable{"art history"} . "\n"; print $hashtable{"political science"} . "\n"; print $hashtable{"math"} . "\n"; dep_str 2:02 AM 7:06 PM 6:39 AM 2:54 PM 1:59 AM 7:39 AM

2:27 AM 2:57 PM 2:57 PM 11:12 AM 12:37 PM 12:29 AM 6:17 AM 7:41 AM 12:48 AM 2:27 PM 3:15 AM 5:36 PM 9:26 AM 9:43 AM 12:15 AM 7:19 PM 6:51 AM 3:11 AM 4:58 AM

9:19 AM 11:14 AM 9:30 AM arr_str 4:45 AM 9:43 PM 8:29 AM 5:01 PM 4:52 AM 10:21 AM 4:54 AM 5:46 PM 4:34 PM 12:38 PM 3:03 PM 2:42 AM 8:06 AM 9:02 AM

3:22 AM 4:07 PM 4:15 AM 7:11 PM 11:54 AM 12:09 PM 1:47 AM 9:46 PM 8:38 AM 5:46 AM 6:01 AM 10:33 AM 12:31 PM 12:22 PM origin GBG ORD BTR

LGA BTR GBG BBB CHO DDS LGA QDE QQE JJJ LAS CMH VFB ITH QDE ITH MYR VDZ GBG

YGR BBB QDE LAX JJJ LLL dest SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO

SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO SFO carrier

Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta Delta

Delta Delta Delta Delta Delta Delta Delta Delta Delta dep_mins 122 1146 399 894 119 459 147 897

897 672 757 29 377 461 48 867 195 1056 566 583 15 1159 411 191 298 559 674

570 arr_mins 285 1303 509 1021 292 621 294 1066 994 758 903 162 486 542 202 967

255 1151 714 729 107 1306 518 346 361 633 751 742 Old algorithm open(FILE, "air.txt"); $numobs= 0; $line = ; while($line = ) { my @data_line = split(/\t|\n|\r/, $line);

push(@data, [@data_line] ); $numobs++; } close(FILE); for ($i = 0; $i < $numobs; $i++) { for ($j = 0; $j < $numobs; $j++) { if ($data[$i][6] + 45 < $data[$j][5] && $data[$i][6] + 240 > $data[$j][5] && $data[$i][3] eq $data[$j][2] && $data[$i][2] ne $data[$j][3]) { print $data[$i][0]\t$data[$j][1]\t$data[$i][2]\t; print $data[$j][3]\t$data[$i][4]\t$data[$i][5]\t; print $data[$j][6]\t$data[$i][3]\n; } } } New algorithm open(FILE, "air.txt");

$numobs= 0; $line = ; while($line = ) { my @data_line = split(/\t|\n|\r/, $line); push(@data, [@data_line] ); $numobs++; } close(FILE); %originHash = (); for ($i = 0; $i < $numobs; $i++) { $originHash{$data[$i][2]} = $originHash{$data[$i][2]} . " " . $i; } for ($i = 0; $i < $numobs; $i++) { $str = $originHash{$data[$i][3]}; if ($str ne "") { @vals = split(" ", $str); for ($k = 0; $k <= $#vals; $k++) { $j = $vals[$k]; if ($data[$i][6] + 45 < $data[$j][5] &&

$data[$i][6] + 240 > $data[$j][5] && $data[$i][2] ne $data[$j][3]) { print $data[$i][0]\t$data[$j][1]\t$data[$i][2]\t; print $data[$j][3]\t$data[$i][4]\t$data[$i][5]\t; print $data[$j][6]\t$data[$i][3]\n; } } } } Runtime New algorithm runs in 9 seconds with a file of 9837 flights and 52 airport codes Old algorithm runs in 5 minutes and 32 seconds Differences becomes much worse as input file and number of airport codes grows For example, if the number of flights and airport codes increases by a factor of 10, then the new algorithm will run in ~90 seconds, while the old algorithm will

run in ~500 minutes Web crawler #!/usr/bin/perl $start = 1000; $end = 86000; for ( $i = $start; $i <= $end; $i++ ) { $folder = int($i / 1000); $url= "http://www.cricketarchive.com/Archive/Scorecards/$folder/$i.html"; print "$folder\t$i\t$url\n"; `mkdir -p $folder`; `wget -q '$url' --output-document=./$folder/$i.html`; sleep 1; } NOTE: Type man wget at command-line of UNIX prompt to learn more about how to download webpages programmatically. Web crawler with cookies

#!/usr/bin/perl $cookies = "/bbkinghome/noto/.mozilla/firefox/a5gqk1zd.default/cookies.txt"; $home = "/bbkinghome/noto/consoles"; $date = "20070115"; $filename = $ARGV[0]; open(FILE, $filename); $j = 0; while($line = ) { $item = $line; $item =~ s/\t|\r|\n//g; print STDERR "doing item=$item \t j=$j ...\n"; $url1 = "http://offer.ebay.com/ws/eBayISAPI.dll?ViewItem&item=$item"; `wget -q --load-cookies $cookies --output-document=$home/${date}_${j}.html '$url1'`; #http://offer.ebay.com/ws/eBayISAPI.dll?ViewBids&item=200029922634 $url2 = "http://offer.ebay.com/ws/eBayISAPI.dll?ViewBids&item=$item"; `wget -q --load-cookies $cookies --output-document=$home/${date}_${j}_bids.html '$url2'`; $j++;

} close(FILE); Chickenfoot Chickenfoot, cont go("http://fisher.lib.virginia.edu/collections/stats/cbp/county.html"); for(var f = find("listitem"); f.hasMatch; f = f.next) { var state = Chickenfoot.trim(f.text); output("STATE: " + state); pick(state); click("1st button"); pick("TOTAL FOR ALL INDUSTRIES"); pick("Week including March 12"); pick("Payroll() Annual"); pick("Total Number of Establishments"); for(var year = 1977; year < 1998; year++) { pick(year + " listitem"); }

pick("Prepare the Data for Downloading"); click("1st button"); click("data file link"); var body = find(document.body); write("cbp/" + state + ".csv", body.toString()); output("going to new page ..."); go("http://fisher.lib.virginia.edu/collections/stats/cbp/county.html"); output("done!"); } Where to learn more Chickenfoot: http://groups.csail.mit.edu/uid/chickenfoot/ Perl: ActivePerl, www.perl.com www.perl.org

Recently Viewed Presentations

  • International Partnership for Nuclear Disarmament Verification Looking toward

    International Partnership for Nuclear Disarmament Verification Looking toward

    Nuclear Disarmament . Verification. Looking toward Phase II - Facilitated Discussion. Take a "quick look ahead" at some of the questions and issues of the three new Working Groups ... not prejudge how they define their own program of work...
  • www.ncetm.org.uk

    www.ncetm.org.uk

    cosx and tanx intersect?? What are the implications for developing students' mathematical reasoning and problem skills?? ... What does the graph look like when pairs of values are plotted as co-ordinate pairs? E.g. (2, 16) and (8, 16) What is...
  • Literacy will be based on Talk for Writing,

    Literacy will be based on Talk for Writing,

    To create images in the style of Romero Britto, a Brazilian Pop artist. To explore the work of Matisse and Picasso. D.T. To understand healthy eating and the need for a balanced diet. To design, make and taste our own...
  • Blackpool Resort Development STARTER: Swap books and look

    Blackpool Resort Development STARTER: Swap books and look

    Plenaries on a plate. How does Blackpool fit with the Butler Model? Match up the key terms with definitions. Disposable income. When it can be maintained without harming the environment, culture or economy. Backpacking holidays. Low-cost, independent, international travel.
  • Fluid Mosaic Model - Mt. San Antonio College

    Fluid Mosaic Model - Mt. San Antonio College

    Fluid Mosaic Model Membrane Structure Phospholipid Bilayer Fluidity Membrane Structure Proteins Protein Functions Membrane Structure Carbohydrates Glycoproteins Glycolipids Diffusion Osmosis Water Balance Water Balance Facilitated Diffusion Active Transport Passive Transport Active Transport Large Molecule Movement Endocytosis Phagocytosis Pinocytosis Receptor-mediated endocytosis...
  • iGen Fuser Bearing Project P11511 - EDGE

    iGen Fuser Bearing Project P11511 - EDGE

    Heating the bearing - Conduction to outer surface. Driving shaft - Belt. Measuring vibration - Accelerometer. Measuring heat - Thermistor. Microcontroller. Labview. ... As described previously, the mica band heater was eliminated from the tester. A different type of heater...
  • Coastal deposition and landforms - Geography

    Coastal deposition and landforms - Geography

    Coastal deposition and landforms ... Backwash Backwash is always at right angles to the beach Longshore drift Landforms of coastal deposition Beaches Spits Sand dunes at Studland Beaches form in sheltered environments, such as bays. When the swash is stronger...
  • Urban Economics, Ninth Edition, Chapter 20

    Urban Economics, Ninth Edition, Chapter 20

    The Tiebout model is a formal model of interjurisdictional mobility. The simple version of the Tiebout model is based on five assumptions. 1.Municipal choice. A household chooses the municipality (or school district or other local jurisdiction) that provides the household's...