This “post” is actually just the pyporter2 homepage, reformatted as a blog post. This is mainly so that the link doesn’t disappear on anybody.

about

This is an implementation of the Porter2 (English) stemming algorithm in Python. It was born out of some academic work I did on clustering algorithms in the spring of 2008. The original Porter stemming algorithm comes from Martin Porter’s paper “An algorithm for suffix stripping” (written in 1979, published in 1980), and it is now one of the most widely known and used stemming algorithms. An implementation of the original Porter stemmer already existed in Python, but not one of the updated Porter2 stemmer, so I decided to implement a Python version of Porter2 as an exercise.

Python bindings for the official C version of the Porter2 stemmer already exist (PyStemmer). If using those bindings is an option, they will probably be much faster than the pure Python implementation here. pyporter2 is useful when C bindings are not an option (as in Jython, IronPython, Babble or App Engine).

download

pyporter2 is open source software, released under an MIT-style license. The latest version of pyporter2 is available on GitHub. To check out the source, install Git and run:

$ git clone git://github.com/mdirolf/pyporter2.git

usage

The new API matches that of PyStemmer. Here is an example of how to use pyporter2:

>>> import Stemmer
>>> print Stemmer.algorithms()
['english']
>>> stemmer = Stemmer.Stemmer('english')
>>> print stemmer.stemWord('cycling')
cycl
>>> print stemmer.stemWords(['cycling', 'cyclist'])
['cycl', 'cyclist']
>>> print stemmer.stemWords(['cycling', u'cyclist'])
['cycl', u'cyclist']
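
Since a stemmer object only needs to expose stemWord, one typical use (and roughly what I wanted it for in clustering) is bucketing words by their stem. Here is a minimal sketch of that idea. The names group_by_stem and SuffixStripper are made up for illustration and are not part of pyporter2; SuffixStripper is a toy stand-in that only strips a trailing "ing" and is not the Porter2 algorithm. It is there so the snippet runs without pyporter2 installed.

```python
from collections import defaultdict


class SuffixStripper(object):
    """Hypothetical stand-in with the same stemWord method as the session above.

    It only strips a trailing 'ing' and is NOT Porter2. Pass
    Stemmer.Stemmer('english') instead when pyporter2 is importable.
    """

    def stemWord(self, word):
        # Leave short words alone so something like 'king' is not mangled.
        if word.endswith('ing') and len(word) > 5:
            return word[:-3]
        return word


def group_by_stem(words, stemmer):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stemWord(word)].append(word)
    return dict(groups)


groups = group_by_stem(['cycling', 'cycl', 'cyclist'], SuffixStripper())
print(sorted(groups.items()))
```

With a real stemmer, the only change should be the second argument, for example group_by_stem(words, Stemmer.Stemmer('english')).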

testing

pyporter2 includes a test suite written using unittest. To run the tests, do:

$ python Stemmer.py

questions

Feel free to leave a comment with any questions. It’d also be cool to let me know if you find pyporter2 useful for anything.