Get Started on Data Science Projects
The workflow of a data science project, from idea to deployment. See the Python package at https://pypi.org/project/ez-address-parser/
Idea
Let's build a Canadian postal address parser that can recognize the different parts of an address line, such as the street number, street name, or postal code. Some address parsers use pattern matching, but it is difficult for them to cover all the ways humans write address lines. For example, some addresses contain unit numbers and some do not, and when a unit number is present, it may appear in different places.
There must be a way to apply machine learning techniques to learn the likely transitions between the different parts of an address and generate labels sequentially.
Data
I crawled some Canadian postal addresses from opendatacanada.com (the site is down now) a while back. To the best of my knowledge, they are corporate addresses from different provinces across Canada. The data is stored as raw text per province. In this project, I take only 10 addresses from each province to generate a seed dataset for annotation.
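The sampling itself is handled by the utility scripts shown later; as a rough sketch, and assuming one raw text file per province with one address per line (the file names and paths here are hypothetical), the seeding step could look like this:
import csv
import glob
import random

seed_addresses = []
for path in glob.glob("raw/*.txt"):  # hypothetical layout: one raw text file per province
    with open(path) as f:
        addresses = [line.strip() for line in f if line.strip()]
    # take 10 addresses per province (or fewer if a province has less data)
    seed_addresses.extend(random.sample(addresses, k=min(10, len(addresses))))

with open("seed.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address"])
    writer.writerows([a] for a in seed_addresses)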
Git init
First things first, I create a Git repo for this project. I intend to use my cookiecutter template for Python packages as a starting point to avoid working from scratch. The template is minimal, but it does reflect my preferences in Python development, such as using vscode as the go-to editor, markdown for the package's long description, black for code formatting, pytest for testing, and setuptools_scm for versioning the package with Git tags.
Then, I can create a virtual environment for this project. Again, I prefer venv since it is already included in the standard library.
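For reference, creating and activating the environment on a Unix-like shell looks like this (the directory name .venv is just my habit):
python -m venv .venv        # create the virtual environment
source .venv/bin/activate   # activate it (use .venv\Scripts\activate on Windows)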
Annotation
It's time to get our hands dirty labeling whatever data we have. I came across a Python package called label-studio, which provides an easy-to-use labeling tool for many tasks, e.g., named entity recognition, bounding boxes, and classification. To learn more about it, check out the label-studio project page.
After installation, I initialized a labeling project named ez_address_annotator and configured the labels. The labels are defined in an XML file, config.xml:
<View>
<Labels name="ner" toName="address">
<Label value="StreetNumber"></Label>
<Label value="StreetName"></Label>
<Label value="StreetType"></Label>
<Label value="StreetDirection"></Label>
<Label value="Municipality"></Label>
<Label value="Province"></Label>
<Label value="PostalCode"></Label>
<Label value="GDIndicator"></Label>
<Label value="AdditionalInfo"></Label>
<Label value="Building"></Label>
<Label value="BuildingNumber"></Label>
<Label value="PostalBox"></Label>
<Label value="PostalBoxNumber"></Label>
<Label value="Station"></Label>
<Label value="StationNumber"></Label>
<Label value="RuralRoute"></Label>
<Label value="RuralRouteNumber"></Label>
<Label value="Unit"></Label>
<Label value="UnitNumber"></Label>
</Labels>
<Text name="address" value="$address"></Text>
</View>
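With config.xml in place, the labeling project can be initialized from the command line; if I recall the 0.x CLI correctly (the same one used by the start command below), the invocation is roughly:
label-studio init ez_address_annotator --label-config config.xml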
Run some utility scripts to generate the seed data. Then I can start the web interface for labeling, import seed.csv under ez_address_annotator/data, and have fun labeling the seed dataset.
python ez_address_annotator/data/convert.py
python ez_address_annotator/data/create_seed.py
label-studio start ez_address_annotator
This is how it looks when labeling; label-studio provides a friendly interface.
Pretrained Model
The Idea section strongly hints that address parsing can be treated as a named entity recognition task. Hence, I choose a conditional random field (CRF) to learn from the annotated address data.
Cross-validation achieves an F-score of about 90%, which looks good to me.
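The training code itself is not shown here, but a minimal sketch with sklearn-crfsuite could look like the following; the feature function, the toy annotated_data, and the hyperparameters are illustrative assumptions, not the project's actual implementation.
import sklearn_crfsuite
from collections import Counter

def token_features(tokens, i):
    # Hand-crafted features for the i-th token of a tokenized address (illustrative)
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "is_upper": token.isupper(),
        "length": len(token),
        "prefix2": token[:2].lower(),
        "suffix2": token[-2:].lower(),
        "position": i,
    }

# Toy stand-in for the (tokens, labels) pairs exported from the annotator
annotated_data = [
    (["123", "Main", "St", "Ottawa", "ON", "K1A", "0B1"],
     ["StreetNumber", "StreetName", "StreetType", "Municipality",
      "Province", "PostalCode", "PostalCode"]),
]

X = [[token_features(tokens, i) for i in range(len(tokens))] for tokens, _ in annotated_data]
y = [labels for _, labels in annotated_data]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X, y)

# Inspect the strongest label-to-label transitions learned by the CRF
for (label_from, label_to), weight in Counter(crf.transition_features_).most_common(10):
    print(f"{weight:.3f} | {label_from} | {label_to}")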
Top likely transitions:
weight | label_from | label_to |
---|---|---|
7.552 | StreetNumber | StreetName |
6.333 | Building | BuildingNumber |
5.815 | RuralRoute | RuralRouteNumber |
5.763 | Municipality | Province |
5.481 | StreetName | StreetType |
5.142 | PostalBox | PostalBoxNumber |
4.839 | Station | StationNumber |
4.815 | Building | Building |
3.835 | Unit | UnitNumber |
Top unlikely transitions:
weight | label_from | label_to |
---|---|---|
-0.270 | PostalBoxNumber | Building |
-0.304 | StreetName | StreetNumber |
-0.385 | Unit | Municipality |
-0.505 | AdditionalInfo | Municipality |
-0.592 | StreetName | PostalBox |
-0.789 | StreetName | Province |
-0.791 | StreetType | Province |
-0.988 | StreetType | StreetType |
-1.202 | StreetNumber | Municipality |
-1.621 | Province | Municipality |
Deploy
The final step is to publish a reusable Python package with the pretrained model on PyPI.
To begin with, make sure the latest setuptools and wheel packages are available.
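For example:
python -m pip install --upgrade setuptools wheel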
Build the distribution archive and wheel files.
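Assuming a setup.py-based layout, the standard invocation for this is:
python setup.py sdist bdist_wheel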
Next, we install the latest twine package to upload the distribution files to PyPI.
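Again with pip:
python -m pip install --upgrade twine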
For testing purposes, we can first upload to TestPyPI.
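twine takes the TestPyPI repository URL explicitly:
twine upload --repository-url https://test.pypi.org/legacy/ dist/*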
To test the package from TestPyPI, specify the repository URL when installing with pip:
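pip install --index-url https://test.pypi.org/simple/ ez-address-parser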
Everything works as expected. Now we can upload to the live PyPI.
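This time twine's default repository is the live index, so no extra URL is needed:
twine upload dist/*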
Ta-da! The ez-address-parser package is now available on PyPI and can be installed like any other package:
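pip install ez-address-parser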
Final Words
The project is open-sourced on GitHub. Feel free to check out the code.