
Get Started on Data Science Projects

The workflow of a data science project, from idea to deployment. See the resulting Python package at https://pypi.org/project/ez-address-parser/

Idea

Let's make a Canadian postal address parser that can recognize the different parts of an address line, such as the street number, street name, or postal code. Some address parsers rely on pattern matching, but it's difficult for them to cover all the ways humans write address lines. For example, some addresses contain unit numbers and some do not, and when a unit number is present it may appear in different positions.

There must be a way to apply machine learning to learn the likely transitions between the different parts of an address and generate labels sequentially.
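To make the goal concrete, here is a hand-made example of the kind of token-by-token labeling the parser should produce (the address below is made up for illustration):

address = "Unit 201 123 Main St NW Calgary AB T2P 1J9"

expected = [
    ("Unit", "Unit"),
    ("201", "UnitNumber"),
    ("123", "StreetNumber"),
    ("Main", "StreetName"),
    ("St", "StreetType"),
    ("NW", "StreetDirection"),
    ("Calgary", "Municipality"),
    ("AB", "Province"),
    ("T2P", "PostalCode"),
    ("1J9", "PostalCode"),
]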

Data

I crawled some Canadian postal addresses from opendatacanada.com (the site is down now) a while back. To the best of my knowledge, these are corporate addresses from different provinces across Canada. The data is stored as raw text, one file per province. For this project, I take only 10 addresses from each province to generate a seed dataset for annotation.

Git init

First things first, I will create a Git repo for this project. I intend to use my cookiecutter template for Python packages as a starting point to avoid working from scratch. The template is minimal, but it does reflect my preferences in Python development: vscode as the go-to editor, Markdown for the package's long description, black for code formatting, pytest for testing, and setuptools_scm for versioning the package via Git tags.

python -m cookiecutter gh:zehengl/cookiecutter-py-package

Then, I can create a virtual environment for this project.

python -m venv .venv

Again, I prefer venv since it's already included in the standard library.
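On Linux or macOS the environment is activated with source .venv/bin/activate; on Windows, with .venv\Scripts\activate.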

Annotation

It's time to get my hands dirty labeling whatever data we have. I came across a new Python package called label-studio, which provides an easy-to-use labeling tool for many tasks, e.g., named entity recognition, bounding boxes, and classification. To learn more, check out the Label Studio documentation.

pip install label-studio

After installation, I initialized a labeling project named ez_address_annotator and configured the labels.

label-studio init ez_address_annotator

The labels are defined in an XML file, config.xml.

<View>
  <Labels name="ner" toName="address">
    <Label value="StreetNumber"></Label>
    <Label value="StreetName"></Label>
    <Label value="StreetType"></Label>
    <Label value="StreetDirection"></Label>
    <Label value="Municipality"></Label>
    <Label value="Province"></Label>
    <Label value="PostalCode"></Label>
    <Label value="GDIndicator"></Label>
    <Label value="AdditionalInfo"></Label>
    <Label value="Building"></Label>
    <Label value="BuildingNumber"></Label>
    <Label value="PostalBox"></Label>
    <Label value="PostalBoxNumber"></Label>
    <Label value="Station"></Label>
    <Label value="StationNumber"></Label>
    <Label value="RuralRoute"></Label>
    <Label value="RuralRouteNumber"></Label>
    <Label value="Unit"></Label>
    <Label value="UnitNumber"></Label>
  </Labels>
  <Text name="address" value="$address"></Text>
</View>
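The Text element's value="$address" tells label-studio to read the text to annotate from the address column of the imported data, so the seed CSV needs a column with that name.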

I run a couple of utility scripts to generate the seed data. Then I can start the web interface for labeling, import seed.csv from ez_address_annotator/data, and have fun labeling the seed dataset.

python ez_address_annotator/data/convert.py
python ez_address_annotator/data/create_seed.py
label-studio start ez_address_annotator
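For context, the seed-generation step boils down to sampling a few addresses per province and writing them to a CSV with an address column that matches the $address variable in config.xml. The snippet below is only a simplified sketch of that idea, not the actual create_seed.py; the raw file layout and paths are assumptions.

import csv
import random
from pathlib import Path

# Sketch only: sample 10 addresses per province from the raw text files and
# write them to seed.csv with an "address" column. Paths are assumptions.
raw_dir = Path("ez_address_annotator/data/raw")
seed_path = Path("ez_address_annotator/data/seed.csv")

addresses = []
for province_file in sorted(raw_dir.glob("*.txt")):
    lines = [line.strip() for line in province_file.read_text().splitlines() if line.strip()]
    addresses.extend(random.sample(lines, min(10, len(lines))))

with seed_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address"])
    writer.writerows([a] for a in addresses)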

This is how it looks when labeling. label-studio provides a friendly interface.

[Screenshot: labeling an address in the label-studio interface]

Pretrained Model

The Idea section strongly hints that address parsing can be treated as a named entity recognition task. Hence I choose a conditional random field (CRF) to learn from the annotated address data.
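I won't reproduce the full training code here, but with sklearn-crfsuite it looks roughly like the sketch below. The feature function is a deliberately simplified assumption; the feature set actually used in ez-address-parser may differ.

import sklearn_crfsuite

def token_features(tokens, i):
    # Simplified per-token features: the token itself plus a bit of context.
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "is_upper": token.isupper(),
        "length": len(token),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Each annotated address is a list of (token, label) pairs, e.g. one
# made-up sequence standing in for the label-studio export:
annotated_addresses = [
    [
        ("123", "StreetNumber"), ("Main", "StreetName"), ("St", "StreetType"),
        ("Calgary", "Municipality"), ("AB", "Province"),
        ("T2P", "PostalCode"), ("1J9", "PostalCode"),
    ],
]

X_train = [
    [token_features([t for t, _ in seq], i) for i in range(len(seq))]
    for seq in annotated_addresses
]
y_train = [[label for _, label in seq] for seq in annotated_addresses]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)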

Cross-validation achieves an F-score of about 90%, which looks good to me.
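The evaluation is along these lines: cross-validation scored with a token-level weighted F-score. This is a sketch assuming sklearn-crfsuite and scikit-learn; depending on library versions the scorer setup may need tweaking.

from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn_crfsuite import metrics

# Sketch: flat (token-level) weighted F-score, averaged over 5 folds.
f1_scorer = make_scorer(metrics.flat_f1_score, average="weighted")
scores = cross_val_score(crf, X_train, y_train, cv=5, scoring=f1_scorer)
print(scores.mean())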

Top likely transitions:

weight label_from label_to
7.552 StreetNumber StreetName
6.333 Building BuildingNumber
5.815 RuralRoute RuralRouteNumber
5.763 Municipality Province
5.481 StreetName StreetType
5.142 PostalBox PostalBoxNumber
4.839 Station StationNumber
4.815 Building Building
3.835 Unit UnitNumber

Top unlikely transitions:

weight label_from label_to
-0.270 PostalBoxNumber Building
-0.304 StreetName StreetNumber
-0.385 Unit Municipality
-0.505 AdditionalInfo Municipality
-0.592 StreetName PostalBox
-0.789 StreetName Province
-0.791 StreetType Province
-0.988 StreetType StreetType
-1.202 StreetNumber Municipality
-1.621 Province Municipality
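These transition weights come straight out of the learned model; assuming sklearn-crfsuite, they can be read from the fitted CRF's transition_features_ attribute, roughly like this:

from collections import Counter

# transition_features_ maps (label_from, label_to) to the learned weight;
# sorting by weight reproduces the lists above.
transitions = Counter(crf.transition_features_).most_common()
print("Top likely transitions:", transitions[:10])
print("Top unlikely transitions:", transitions[-10:])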

Deploy

The final step is to publish a reusable Python package, including the pretrained model, on PyPI.

To begin with, make sure the latest setuptools and wheel packages are available.

pip install -U setuptools wheel

Build the distribution archive and wheel files.

python setup.py sdist bdist_wheel
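One detail worth noting: for the pretrained model to actually ship inside the sdist and wheel, the serialized model file has to be declared as package data. Below is a minimal sketch of the relevant setup() arguments, assuming the model is stored as a file inside the package; the file name model.joblib is a hypothetical placeholder.

from setuptools import find_packages, setup

# Sketch only: bundle the serialized CRF model so that installing the package
# also installs the pretrained weights. "model.joblib" is a placeholder name.
setup(
    name="ez-address-parser",
    packages=find_packages(),
    include_package_data=True,
    package_data={"ez_address_parser": ["model.joblib"]},
    use_scm_version=True,
    setup_requires=["setuptools_scm"],
)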

Next, we install the latest twine package to upload the distribution files to PyPI.

pip install -U twine

For testing purposes, we can use TestPyPI.

python -m twine upload --repository-url https://test.pypi.org/legacy/ dist/*

To test the package from TestPyPI, specify the repository URL using pip.

pip install --index-url https://test.pypi.org/simple/ ez-address-parser

Everything works as expected. Now we can upload to the live PyPI.

python -m twine upload dist/*

Ta-da! The ez-address-parser package is now available on PyPI and can be installed like any other package.

pip install ez-address-parser
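A quick usage sketch is below; check the README on PyPI for the exact interface, since the class and method names here are from memory and may differ.

from ez_address_parser import AddressParser

ap = AddressParser()  # loads the pretrained model shipped with the package
result = ap.parse("10060 Jasper Ave NW Edmonton AB T5J 3R8")  # example address
for token, label in result:
    print(token, label)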

Final Words

The project is open-sourced on GitHub. Feel free to check out the code.