Integrating IAC Nationals data from Exploit ACRO
Nationals has been running for many years with the Exploit ACRO software. This means, in order to compute regional series, collegiate, and L. Paul Soucy results, we must somehow integrate the outputs from ACRO into the IAC data.
The approach taken here is to screen scrape the ACRO web outputs. (see note). With the data in hand, there's lots of work to do to match-up names and categories, and otherwise groom it. Having groomed the data, we can import it (process note).
All of this work is done from the command line. There are no admin UI interfaces for accomplishing it.
The 'cmd/acro' directory contains a number of utilities. The following assist with integrating the Nationals results, with brief descriptions:
- does the screen scraping and generates YAML (.yml) files with the scraped data.
- displays the scraped flight names and the detected attributes, as a check.
- has the query needed to generate a member list for name resolution Use it to produce the memberlist.txt file for the resolve_names step.
- is an interactive console driven program for identifying the member records from the names. It produces a file participants.yml that contains the mapping from each judge or pilot name in the ACRO files to a member record
- reads the pilot-flight raw scores into the database. It uses the participants.yml file, if present, to identify member records.
- produces a list of category and flight results, results_list.yml for edit
- reads category and flight results totals from multi_ yml files, together with participants.yml and results_list.yml to patch result computations into the contest data.
Of all of these, the first utility that touches the contest database is import_yml_extracts.rb. That writes sequences, judge pairs, pilots, the contest record, and scores. It kicks-off a delayed job to compute results from the raw scores.
The computed results will differ from the published results because they do not use the same computation used by ACRO. It is useful to run them, however, as they verify the completeness and integrity of the imported raw grading data. They also generate the figure ranks and judge metrics.
The second utility that touches the contest database is patch_results.rb. This overwrites the IAC computed flight and category results for each pilot with the ACRO computed results.
Using the utilities for 2017
Following are detailed steps used to integrate Nationals 2017 results into IACCDB.
Retrieve the results
Pull the results postings from the web. We make a directory for these and change to it before pulling them. This keeps them separate from other years, where they would otherwise overlap.
mkdir nationals2017 cd nationals2017 wget --mirror --convert-links https://www.iac.org/files/nationals-results/2017/indexpage.htm
Four Minute Free results
The import will miss collecting Four Minute Free results.
The Four Minute results are posted in a "single" file, not a "multi" file. The YAML extractor doesn't touch the single files because they normally contain individual flight results already found in the multi files. The simplest way to get the Four Minute results is to copy that single file to a multi file,
cp nationals2017/www.iac.org/files/nationals-results/2017/single_R042s30.htm \ nationals2017/www.iac.org/files/nationals-results/2017/multi_R042s30.htm
cat www.iac.org/files/nationals-results/2017/multi_*.htm |\ python ../cmd/acro/stripPilotsAt.py https://www.iac.org/files/nationals-results/2017/ cd ..
Scrape the results
Create the contest control file in the directory that contains the downloaded results postings. This contains the contest meta information needed, see lib/acro/control_file.rb for documentation of the format. We called this file contest.yml and its initial content is as follows:
--- contestName: 'U.S. National Aerobatic Championships' startDate: '2017-09-24' city: 'Oshkosh' state: 'WI' director: 'Gary DeBaun' region: 'National' chapter : 'IAC'
It is important to get `region: 'National'` in there, in order for the contest to be picked-up for the Regional Series computations. The start date is very important. We try to have the rest of the entries make sense as well.
Now we're ready to do the screen scraping that generates YAML (.yml) files from the results postings. For convenience, we move these files up to the nationals subdirectory. From the project directory,
rails runner cmd/acro/scrape_html_to_yaml.rb \ nationals2017/www.iac.org/files/nationals-results/2017/contest.yml mv nationals2017/www.iac.org/files/nationals-results/2017/*.yml nationals2017
Check the flight names
Run the flights list to ensure that the category and flight are correctly identified for the results.
rails runner cmd/acro/show_flights.rb nationals2017/contest.yml
The first item in each line of the output shows the title from the ACRO published results. The next items show the inferred category, class, and flight. Check these over and, if there's a problem, figure it out. The inference is accomplished with lib/acro/flight_identifier.rb.
Here is a complete sample output from 2016:
'Unlimited - Power : Programme 1: Free Known', Unlimited, P, Free 'Unlimited - Power : Programme 2: Free Unknown #1', Unlimited, P, Unknown 'Unlimited - Power : Free Unknown Sequence', Unlimited, P, Unknown 'Unlimited - Glider : Programme 1: Free Known', Unlimited, G, Free 'Unlimited - Glider : 1st Unknown Sequence', Unlimited, G, Unknown 'Unlimited - Glider : 2nd Unknown Sequence', Unlimited, G, Unknown 'Advanced - Glider : Free Known Sequence', Advanced, G, Free 'Advanced - Glider : 1st Unknown Sequence', Advanced, G, Unknown 'Advanced - Glider : Free Unknown Sequence', Advanced, G, Unknown 'Advanced - Power : Known Sequence', Advanced, P, Known 'Advanced - Power : Free Sequence', Advanced, P, Free 'Advanced - Power : Free Unknown Sequence', Advanced, P, Unknown 'Intermediate - Glider : Known Sequence', Intermediate, G, Known 'Intermediate - Glider : Free Sequence', Intermediate, G, Free 'Intermediate - Glider : Unknown Sequence', Intermediate, G, Unknown 'Sportsman - Glider : Known Sequence', Sportsman, G, Known 'Sportsman - Glider : Free Sequence', Sportsman, G, Free 'Primary : Programme 1: Free Known', Primary, P, Free 'Primary : 1st Known Sequence', Primary, P, Flight 1 'Primary : 2nd Known Sequence', Primary, P, Flight 2 'Intermediate - Power : Known Sequence', Intermediate, P, Known 'Intermediate - Power : Free Sequence', Intermediate, P, Free 'Intermediate - Power : Unknown Sequence', Intermediate, P, Unknown 'Sportsman - Power : Known Sequence', Sportsman, P, Known 'Sportsman - Power : Free Sequence', Sportsman, P, Free
Note that there are problems with the programs in Unlimited and Advanced, and with the Known program in Primary. This is due to the fact that we've introduced the word 'Free' into the names. There are some alternatives:
- update the code to recognize these properly, or
- change the names in the files such that the code makes the expected program identification.
- train a machine learning algorithm to get these right
Before changing the code, note that there is a test suite, spec/acro/flight_identifier_spec.rb that should have additional tests, and whose existing tests ought to remain passing.
Note also that the program names don't have to fit into IAC program names. We don't necessarily need "Known", "Free", "Unknown". "Free Known," "Free Unknown 1", and "Free Unknown 2" are perfectly acceptable program names. The main requirement is that the code uniquely identifies the program, correctly identifies the category, and correctly identifies the class (P = Power, G = Glider, F = Four Minute Free).
The program, even a trained classifier, doesn't have a chance with 'Unlimited - Power : Free Unknown Sequence' that is actually the third program, and second Free Unknown for Unlimited Power. We have to fix the files.
First we determine that ACRO identifies the program as "s15". We do that by simply looking for the description in the files:
grep -rH 'Unlimited - Power : Free Unknown Sequence' nationals2017
Next we run the stream editor to make the correction in those files:
for f in `ls -1 nationals2017/pilot_p*s15.htm.yml` ; \ do sed -i -e '3 s/Free Unknown Sequence/Programme 3: Free Unknown #2/' $f ; \ done
For safety, that command edits only that specific text, and only on the third line of the file.
Similarly, the program doesn't have a chance with "Advanced - Glider : Free Unknown Sequence" or "Advanced - Power : Free Unknown Sequence", that are both the second Free Unknown. We follow a similar procedure.
grep -rH 'Free Unknown Sequence' nationals2017
for f in `ls -1 nationals2017/pilot_p*s12.htm.yml` ; \ do sed -i -e '3 s/Free Unknown Sequence/2nd Unknown Sequence/' $f ; \ done
for f in `ls -1 nationals2017/pilot_p*s24.htm.yml` ; \ do sed -i -e '3 s/Free Unknown Sequence/2nd Unknown Sequence/' $f ; \ done
Both Sportsman Glider and Power have the same name on both their second and third flight programs. The program doesn't have a chance with this, either. We similarly edit some of those files:
grep -rH 'Sportsman' nationals2017 | grep 'Free Sequence'
for f in `ls -1 nationals2017/pilot_p*s17.htm.yml` ; \ do sed -i -e '3 s/Free Sequence/Programme 2/' $f ; \ done
for f in `ls -1 nationals2017/pilot_p*s18.htm.yml` ; \ do sed -i -e '3 s/Free Sequence/Programme 3/' $f ; \ done
for f in `ls -1 nationals2017/pilot_p*s05.htm.yml` ; \ do sed -i -e '3 s/Free Sequence/Programme 2/' $f ; \ done
for f in `ls -1 nationals2017/pilot_p*s06.htm.yml` ; \ do sed -i -e '3 s/Free Sequence/Programme 3/' $f ; \ done
Similarly for Primary
grep -rH 'Primary' nationals2017
for f in `ls -1 nationals2017/pilot_p*s02.htm.yml` ; \ do sed -i -e '3 s/1st Known Sequence/Programme 2/' $f ; \ done
for f in `ls -1 nationals2017/pilot_p*s03.htm.yml` ; \ do sed -i -e '3 s/2nd Known Sequence/Programme 3/' $f ; \ done
All of these changes are contained in the shell script, fix_cat_names.sh found in the nationals2016 (zipped) directory.
In 2017, we have some trouble with Sportsman. And in Advanced Power we're picking-up the team flight, which isn't part of the Nationals results. The other categories are cruising along flawlessly:
'Advanced - Power : Known Sequence', Advanced, P, Known 'Advanced - Power : Known Sequence', Advanced, P, Known 'Advanced - Power : Free Sequence', Advanced, P, Free 'Advanced - Power : Free Unknown #1', Advanced, P, Free Unknown 1 'Advanced - Power : Free Unknown #2', Advanced, P, Free Unknown 2 'Sports - Power : Known Sequence', Sportsman, P, Known 'Sports - Power : Free #1 Sequence', Sportsman, P, Known 'Sports - Power : Free #2 Sequence', Sportsman, P, Flight 2 'Sports - Glider : Known Sequence', Sportsman, G, Known 'Sports - Glider : Free #1 Sequence', Sportsman, G, Known 'Sports - Glider : Free #2 Sequence', Sportsman, G, Flight 2We handled the Sportsman difficulty by making updates to lib/acro/flight_identifier.rb, adding tests, and (always) making the prior and existing tests pass. We handled the Advanced category difficulty, as always, by removing the screen scrapes for the team flight. In this year we removed all of the files that ended with 's19.htm.yml'.
Match names to member records
Before we import this data into IACCDB, we have to ensure that the pilot names used in the ACRO files will match member names as used in IACCDB. Without this important step, we get duplicate pilots with different name spellings in IACCDB. Some pilots and judges don't get proper credit for the Nationals flights because they have two identities.
We can run pilot and judge name resolution, so that names match to the correct members on import.
First, generate the member file. We didn't write a program for this, rather, we simply run a SQL query against the database.
mysql --defaults-extra-file=./.my.cnf iaccdb <cmd/acro/memberlist.sql >nationals2017/memberlist.txt
Now we run the member name matching program. Because it is interactive, we run it as a ruby program. (The rails runner is non-interactive.)
ruby cmd/acro/resolve_names.rb nationals2017/contest.yml
This presents a dialog with the found name and suggested name matches for each pilot and judge in the ACRO results files. We select the correct member record for the name presented. The program outputs a name resolution file, participant_list.yml. that matches ACRO entered names to member records.
Here is an example entry for a name not in the IACCDB database:
Frank Borman (USA): !ruby/object:ACRO::ParticipantList::Participant db_id: given_name: Frank family_name: Borman iac_id: '939393'
The program assumes nothing; so, we must make a selection for every name found in the records. It's a little unforgiving. If we make a mistake we have to either start over, or make a note and do the correction with a manual edit. The program is used once a year; so, there's poor payoff in making it too fancy. It does show any prior selection as the first choice, with a star ('*') next to it.
Invariably there are names that do not match or miss matches in the database. For these, we do some manual spelunking in the member database (assuming we have access) and/or follow-up with queries. Thank you, Trish! Thank you, Lorrie!
With resolutions, we manually add entries to participant_list.yml. We do our best. Lacking found matches, the names will go into the database as new entries.
Import the pilot-flight grades
With the name resolution done, we can import the pilot flight data-- all of the grades. This is the first command that writes to the database, so we first make a back-up.
mysqldump --defaults-extra-file=./.my.cnf iaccdb >20171030pre-nats.sql rails runner cmd/acro/import_yml_extracts.rb nationals2017/contest.yml
The command creates the contest and all of the grades in the database. It leaves an asynchronous task to make the IAC results computations.
If the jobs worker is not already running on the server, Run
The job computes results from the imported data. We run this to create the result records and judge metrics.
Import the ACRO computed results
Match new participant names
After adding the grades, we now have member records for the members who were not yet present in the database. Run ruby cmd/acro/resolve_names.rb a second time, or edit the participant_list.yml file with id's of newly created members. Better to edit the file and search for nil, then fill-out the entry in the format of the others:
Jim Johnson: !ruby/object:ACRO::ParticipantList::Participant db_id: '617' given_name: Jim family_name: Johnson iad_id: '24521912'
Find the db_id, the most important value, by browsing the contest online. Find the pilot or judge and check their link URL:
The number after "pilots", 2552 in the example, is the db_id. These numbers will be consecutive, newly created id numbers.
It's extremely helpful to have an accurate IAC id in the records. Use all legal means necessary to find and provide it.
If you get to the end of this process and find that a pilot's overall scores don't match the published results, it's very possibly because you mistyped the database id of that pilot. (You know how we know this.)
Identify the contest
Edit the contest.yml file with the id of the contest. Find the id of the contest in the URL,
The number after "contests," 456 in the example, is the contest id. The contest.yml control file should now have an additional line,
--- contestName: 'U.S. National Aerobatic Championships' startDate: '2017-09-21' city: 'Sherman and Denison' state: 'TX' director: 'Gray Brandt' region: 'National' chapter : 'IAC' contest_id: 456
Once you have added the contest ID, any subsequent imports will overwrite the earlier ones, rather than creating a new contest.
Match category result files and flight headings to flight records
We can usually identify the category, and possibly line-up the flight names from the overall results postings with the flights in the individual pilot files. Rather than take a chance with heuristic matching, however, it's straightforward to manually identify the flights in the overall results files with flights now present in the iac database. To generate the list of flights, run
rails runner cmd/acro/results_list.rb nationals2017/contest.yml
This produces a file, results_list.yml that we edit. Wherever we find 'missing_id' we insert the correct iac database id for the flight result. Find the database id's in the url's for the flights, for example:
has database id 6653.
You can also open the rails console and run a query, as follows:
puts JSON.pretty_generate(Flight.where(contest_id:588).includes(\ :contest, :category).to_a)
Either way, editing results_list.yml, take care because the flights for a category do not necessarily display in the same order in the file as on the results display. Here is a portion of the file. The first entry has been edited. The second entry has not.
- !ruby/object:ACRO::ResultsList::ResultsFile file_name: multi_R001s06s15s24.htm.yml category: Advanced description: Advanced Power flights: - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Known P1 flight_id: 6462 sequence: 1 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: ! 'Free #1' flight_id: 6463 sequence: 2 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Free Unk flight_id: 6464 sequence: 3 - !ruby/object:ACRO::ResultsList::ResultsFile file_name: multi_R016s19s20s21.htm.yml category: Unlimited description: Unlimited Glider flights: - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Known flight_id: missing_id sequence: 1 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Free flight_id: missing_id sequence: 1 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Unkwn flight_id: missing_id sequence: 1
We edit the sequence values to get the correct sequence of columns in the results display.
In 2016, the Unlimited Glider and Power files were missing values and showed an extra flight. This was because these two published results included team identifiers. The extra columns had thrown-off the scraper. We edited the scraper to account for the extra columns when they are present and re-ran the data extraction.
- !ruby/object:ACRO::ResultsList::ResultsFile file_name: multi_R017s28s29s30.htm.yml category: description: flights: - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Known P1 flight_id: missing_id sequence: 1 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: ! 'Free #1' flight_id: missing_id sequence: 1 - !ruby/object:ACRO::ResultsList::ResultsFlight flight_name: Free Unk flight_id: missing_id sequence: 1
In 2016, the entry for the multi_R017s28s29s30.htm.yml example file shown above had no category and description. Looking at it, we saw that it lacked the "Contest Results:" prefix that all of the other files have, that the scraper uses to identify the line containing the category and class. Supplying the category and description in the results_list.yml file sufficed to get it properly identified.
Check and double-check the flight_id entries in results_list.yml. It is essential that these be accurate. One sanity check you can use is to see that the id's used are sequential and unique:
grep flight_id nationals2017/results_list.yml | sort
Patch the ACRO computed results
Now patch the IAC computed results with the results computed by ACRO. They are different.
rails runner cmd/acro/patch_results.rb nationals2017/contest.yml
Compare the posted overall category and flight result scores with those in the original ACRO posted results. If they match, all is well.
- Compare the order of the pilots.
- Compare the category total points and percentages.
- Spot check the flight total points and percentages.
- See that the number of judges matches.
- See that the judge names match.
With Nationals loaded, we recompute the regionals.
rails runner cmd/recompute_regionals.rb 2017
Take care not to queue a job to recompute the contest, else the patched results are overwritten.
We are done.
I go through this process on a development machine, with a copy of the database, as a dry run, before doing this on the IAC server against the live database.
- keep and use a version of participant_list.yml prior to adding the new member database id's, because they might be different.
- copy the yml files to the IAC server
- carefully remove the contest_id from contest.yml because it might change
- know that the database id's in results_list.yml might be different
- make a fresh backup of the database
- restart the process at "Import the pilot-flight grades"
Screen scraping note
ACRO does have some CSV format outputs; but, it's extra work to get them, the web screens have the same information, and it's almost as easy to scrape the web screens as to scrape the CSV. Possibly not really as easy, but it is reliable to scrape the data from the already published outputs versus chase-down someone in hope of getting special outputs. Scraping the data is only a small part of the task.