MRN

full title:

Municipal Register of Population of Yekaterinburg city

Муниципальный Регистр Населения г. Екатеринбурга

what:
  • Data processing (ETL) and storage system.
  • Local implementation of the regional layer of the new unified federal database.
  • Part of the Electronic Russia (Электронная Россия) program.
  • Integrates various sources into a unified municipal database on the fly. Uses heuristics to process incomplete or erroneous records and to identify the same real-world entities behind data from different sources (tens of them).
my effort:

Architecture (parts), algorithms, coding.

when:

2006–2007

technologies:
  • FreeBSD
  • Perl
  • MySQL
  • XHTML+CSS+JavaScript. Ajax using Prototype.
tools:

Subversion (kdesvn), Trac

status:

Implemented. Now it’s being used by the city as planned.

Retrospective (2009)

I really enjoyed the project; it was complex and took too much time, but I had never before learned so much in the course of a single project.

The first day

The team had already designed the database and written an import script before I joined them. The problem was that the script would run forever. My own script (written in my first couple of days) did the same job in about 15 minutes.

My script was just a prototype and should have been discarded, but it was decided to make it the new basis of the system. As development progressed, it became more and more evident that the initial script was poorly designed in terms of extensibility and that there was no architecture to speak of. The script got slower with every added feature; development stalled.

Lessons learned:

  • write no code at first; if you do, do not show it; just study the requirements, the problems, etc. Once you start coding, the chance to study the project properly is lost.
  • divide and conquer: split your work into smaller tasks and make incremental releases.

Management and Teamwork

There were organizational issues which were new to me. I’m glad I experienced them then so I can prevent most of them now. Sorry, no details.

Lessons learned:

  • hurry or succeed
  • never hire people who don’t argue with you: you’ll have to do their work
  • be ready to teach a lot (or work alone or just don’t start the project)

Modeling and Coding

I used to love Perl, but this project required many classes, and in that department Perl was extremely verbose, fragile and ugly. I would play with UML (drawing mostly class and sequence diagrams), attach the diagrams to Trac tickets or wiki pages for reference and then write the Perl code. Coding in object-oriented Perl felt like eating soup with a toothpick.

In late 2006 (or early 2007?) I finally tried Python and had to admit that not only was it clean, beautiful and robust, it also allowed me to skip a lot of the mess with UML and pseudocode. Python turned out to be “executable pseudocode”, very quick to write and rearrange. I even started writing all the classes in Python and then porting them to Perl!
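
For the flavor of it, here is a toy class of the kind I mean; the names are made up and nothing here comes from the actual codebase, but the Python version reads almost like the class diagram it replaces:

    class Person:
        """A hypothetical entity of the kind the register dealt with."""

        def __init__(self, last_name, first_name, sources=None):
            self.last_name = last_name
            self.first_name = first_name
            # which external databases mentioned this person
            self.sources = sources or []

        def merge(self, other):
            """Absorb a duplicate record found in another source."""
            self.sources.extend(other.sources)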

Today I would write the whole project in Python, but at that point it would have meant rewriting a whole lotta LOCs and, to make things worse, making my team members learn Python; that would be crazy (even crazier than prototyping in Python and “compiling” to Perl), given that it wasn’t easy to make everybody follow even the coding guidelines.

Lesson learned:

  • choose the appropriate tools and never stop learning.

The Locations Database

A real pain in the assets was the locations database and the related algos (heuristics for error correction and so on). It was so depressing that I can’t even talk about it. Sure, it worked in the end. But damn.

Lessons learned:

  • do not let your program be too smart. magic hurts its master.
  • take your time. if the boss is pressing, he’s wrong.

The RegCom

The most important process – registration and comparison, hence “RegCom” – was yet another pain.

Goal: find similar groups of records grabbed from various sources, compare these groups thoroughly and decide whether they describe the same real-world entity.

The naïve implementation was so time-consuming at first that we just stared at the logs and benchmarks and couldn’t believe the project could ever be completed: even the limited test was seemingly going to run forever... Then there were optimizations, then rethinking, then rewriting, then more optimizations... and finally we could launch the process (which could be stopped and resumed at any time) and leave it running for something like weeks (I really don’t remember now).

I can’t go into details, but imagine that you have several million records in each of about ten tables and you have to compare each with each, allowing some data to differ a bit (e.g. Foobar ~= Foobaz) and calculating an overall percentage of similarity; when it is above a certain threshold, the sets of records are considered to describe the same real-world entity (a person, a location, etc.) and are merged by re-linking all related records (with the ability to roll back later).
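
To make that concrete, here is a minimal Python sketch of the decision step. The field names, weights and threshold are invented for illustration; the real system compared far more fields and was written in Perl:

    from difflib import SequenceMatcher

    # Invented field weights and threshold; the real system compared far
    # more fields spread across ~ten tables.
    FIELD_WEIGHTS = {"last_name": 0.4, "first_name": 0.3, "birth_date": 0.3}
    THRESHOLD = 0.85

    def field_similarity(a, b):
        """Fuzzy similarity in [0, 1]; tolerates small typos (Foobar ~= Foobaz)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def records_match(rec_a, rec_b):
        """Weighted overall similarity; above the threshold the records are
        considered to describe the same real-world entity and get merged."""
        score = sum(
            weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
            for field, weight in FIELD_WEIGHTS.items()
        )
        return score >= THRESHOLD

    # A small spelling deviation still clears the threshold:
    a = {"last_name": "Foobar", "first_name": "Ivan", "birth_date": "1980-01-01"}
    b = {"last_name": "Foobaz", "first_name": "Ivan", "birth_date": "1980-01-01"}
    print(records_match(a, b))  # True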

We used a three-phase comparison (phases 1 and 2 are sketched below):

0. at dump level: diff’ing the current and previous dumps → lists of added and removed records (changed = removed + added); this enormously boosted the performance of the whole ETL pipeline;

1. at DB level: splitting each text field into triplets (e.g. foobar → foo, oob, oba, bar) and searching for “candidates” by shared triplets;

2. at Perl level:

  1. filtering out some of the fetched candidate sets using the Levenshtein distance, and
  2. calculating the percentage of similarity between the candidates that went through all the aspera to the astra.
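
And here is the promised sketch of phases 1 and 2, under the same caveats: plain Python dictionaries stand in for the MySQL triplet tables, the cut-off value is invented, and the real filtering lived in Perl:

    from collections import defaultdict

    def triplets(text):
        """Overlapping three-character chunks: foobar -> foo, oob, oba, bar."""
        text = text.lower()
        return {text[i:i + 3] for i in range(len(text) - 2)}

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    # Phase 1 (done with SQL in the real system): an inverted index of
    # already known values by their triplets.
    index = defaultdict(set)
    for value in ["Foobar", "Foobaz", "Quux"]:
        for t in triplets(value):
            index[t].add(value)

    def candidates(value):
        """Known values sharing at least one triplet with the input."""
        found = set()
        for t in triplets(value):
            found |= index[t]
        return found

    # Phase 2 (done in Perl in the real system): a cheap Levenshtein
    # cut-off before the thorough field-by-field comparison.
    MAX_DISTANCE = 2
    query = "Foobat"
    close = [c for c in candidates(query)
             if levenshtein(query.lower(), c.lower()) <= MAX_DISTANCE]
    print(close)  # ['Foobar', 'Foobaz'] (in arbitrary order)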

Lessons learned:

  • if stuck, step back and find another way.
  • premature optimization = evil.
  • if only I had known the “ETL” abbreviation earlier, it would have saved a lot of time.

Outcome

We enjoyed it and succeeded. =)