A New GridFS Implementation for PyMongo

The API (Application Programming Interface) for PyMongo, the Python driver for MongoDB, has been pretty stable for quite some time now. In general, I think that the driver does a great job of exposing all of MongoDB’s functionality in a way that is both MongoDB-ish and Pythonic (of course I’m biased — if you have suggestions for improvements please let me know). There have been some minor tweaks recently, like an improvement for the command API, but there haven’t been any huge changes in a while.

All that said, I think there is one place where PyMongo has been deficient for a long time: its support for GridFS. There are a few problems with the GridFS implementation in previous versions (<= 1.5.2) of PyMongo:

It is slower and less concurrency-friendly than it needs to be.
It could be much simpler and easier to work with.
It allows some operations (modifying existing files) that are incorrect according to the GridFS semantics.

I think that all of these deficiencies stem from one fatal flaw in the original API design: it was trying too hard to mimic Python’s filesystem API. Exposing file-like objects for writing and reading is great, but things like focusing on filename as a file-handle (filename is less important in GridFS) and allowing users to modify files (which is not a concept supported by GridFS) were bad decisions. This post introduces a new GridFS API for PyMongo which tries to address all of the deficiencies in the old implementation.

The New API

The new GridFS API is available in PyMongo versions >= 1.6. There are API docs available for the new version of GridFS, but this post walks through the API in a little more detail.

Almost all of the API is exposed through the GridFS class. Here’s an example instantiation:

>>> from pymongo import Connection
>>> from gridfs import GridFS
>>> db = Connection().test_database
>>> fs = GridFS(db)

You can also use an alternate root collection for GridFS by passing a collection name as the second argument to GridFS. Given an instance of GridFS, creating new files and getting data from existing ones is easy:

>>> file_id = fs.put("hello world")
>>> fs.get(file_id).read()
'hello world'

The put method takes a string or file-like object containing data to be written to a GridFS file, and returns the \_id of the newly created file. It also accepts keyword arguments for any of the fields available in the GridFS file spec. So to insert the contents of the local file myimage.jpg with the content type “image/jpeg” and the filename “myimage” we would do:

>>> with open("myimage.jpg", "rb") as myimage:
...   oid = fs.put(myimage, content_type="image/jpeg", filename="myimage")
...

The equivalent of doing this using the old API would be:

>>> from pymongo.objectid import ObjectId
>>> with open("myimage.jpg", "rb") as myimage:
...   with fs.open({"filename": "myimage",
...                 "contentType": "image/jpeg",
...                 "_id": ObjectId()}, "w") as grid_file:
...     grid_file.write(myimage.read())
...     oid = grid_file._id
...

The get method takes an \_id of a file in GridFS, and returns a file-like object (an instance of gridfs.grid\_file.GridOut) that can be used to read that file’s data. If there is no file with the given \_id an exception is raised:

>>> oid = fs.put("hello world")
>>> fs.get(oid)
<gridfs.grid_file.GridOut object at ...>
>>> fs.get("non-existent _id")
Traceback (most recent call last):
gridfs.errors.NoFile: ...

To delete a file in GridFS, pass its \_id to the delete method:

>>> fs.get(oid)
<gridfs.grid_file.GridOut object at ...>
>>> fs.delete(oid)
>>> fs.get(oid)
Traceback (most recent call last):
gridfs.errors.NoFile: ...

Advanced Usage

While put should cover most use cases that opening a file in write mode used to, there may be some cases where you don’t want to write all of the file’s data at once. In that case, you can use the new\_file method to get a new gridfs.grid\_file.GridIn instance which you can write to. When you’re done with the instance, call close (or just use Python 2.6’s with statement to handle it for you). Like put, new\_file takes keyword arguments for any of the values in the GridFS file spec. A nice feature of both methods is that any unrecognized keyword arguments (in this example, location) will automatically be set as attributes on the underlying file document:

>>> myfile = fs.new_file(location=[-74, 40.74])
>>> myfile.write("hello ")
>>> myfile.write("world")
>>> myfile.writelines([" and have a ", "good day!"])
>>> myfile.close()
>>> out = fs.get(myfile._id)
>>> out.read()
'hello world and have a good day!'
>>> out.location
[-74, 40.740000000000002]
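To make the keyword-argument behavior a bit more concrete, here is a minimal sketch of how keyword arguments could end up as keys on a file document. This is only an illustration, not PyMongo's actual code: the helper build_file_document and the KEY_ALIASES table are hypothetical, and the only mapping assumed is content_type to contentType, which matches the two put examples earlier in this post.

```python
# Hypothetical illustration only; not PyMongo's implementation.
# Assumed alias: the Python-style keyword content_type is stored under
# the GridFS file-document key contentType (as in the examples above).
KEY_ALIASES = {"content_type": "contentType"}


def build_file_document(**kwargs):
    """Turn keyword arguments into a GridFS-style file document (a dict)."""
    document = {}
    for key, value in kwargs.items():
        # Known keywords are renamed; anything unrecognized is kept as-is,
        # so custom fields such as location land on the document unchanged.
        document[KEY_ALIASES.get(key, key)] = value
    return document


doc = build_file_document(
    filename="myimage",
    content_type="image/jpeg",
    location=[-74, 40.74],
)
```

In this sketch, location is not a known keyword, so it is stored under its own name alongside filename and contentType.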

Most of the examples above have been referencing files in GridFS by their \_id. The old API made more of a point of emphasizing filename as the primary mode of access. While the new API encourages the better practice of referencing by \_id, there are still some cases where you might need to work with files based on filename, or where your application treats filename as a unique identifier.

The way to work with filenames in the new API is to create a new GridFS file with the same filename every time you need to modify a file. Since files in GridFS store their upload date, we can always get the most recent version of a file by filename. We can also reference by \_id to get any previous version — we can treat GridFS as a versioned filestore. The method to get the last version of a file by name is get\_last\_version:

>>> a = fs.put("foo", filename="test")
>>> fs.get_last_version("test").read()
'foo'
>>> b = fs.put("bar", filename="test")
>>> fs.get_last_version("test").read()
'bar'
>>> fs.delete(b)
>>> fs.get_last_version("test").read()
'foo'
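The idea behind this kind of lookup can be sketched without a database. The snippet below is an in-memory illustration under my own assumptions, not PyMongo's implementation: the function latest_by_filename, the list-of-dicts "collection", and the dates are all made up for the example. It picks the file document with the newest uploadDate among those sharing a filename.

```python
import datetime


def latest_by_filename(files, filename):
    """Return the most recently uploaded document with the given filename.

    files is a plain list of dicts standing in for the files collection.
    """
    matches = [f for f in files if f["filename"] == filename]
    if not matches:
        raise LookupError("no file named %r" % (filename,))
    # The newest upload date wins, so re-uploading under the same
    # filename effectively creates a new version.
    return max(matches, key=lambda f: f["uploadDate"])


# Illustrative data: two versions of "test", uploaded one second apart.
t0 = datetime.datetime(2010, 1, 1)
files = [
    {"_id": 1, "filename": "test", "uploadDate": t0, "data": b"foo"},
    {"_id": 2, "filename": "test",
     "uploadDate": t0 + datetime.timedelta(seconds=1), "data": b"bar"},
]

latest = latest_by_filename(files, "test")

# Simulate deleting the newest version: the previous one becomes "last".
remaining = [f for f in files if f["_id"] != latest["_id"]]
previous = latest_by_filename(remaining, "test")
```

Because older versions are never overwritten, every version stays reachable by its \_id until it is deleted.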

Performance

I’ve only done some very basic benchmarking of the new GridFS implementation. It doesn’t appear to be a drastic improvement when writing large files, but writing small files is about four times faster in a simple benchmark of mine. This is because the simplification of the API has reduced the per-file overhead dramatically. There are also huge performance improvements when uploading a large file from disk or some other file-like source (especially if it was being done naively using the old API), as the new API automatically handles streaming from a file-like source into chunk-sized buffers.
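As a rough illustration of what streaming from a file-like source into chunk-sized buffers means, here is a minimal sketch. It is not the driver's code: the function names iter_chunks and make_chunk_documents are hypothetical, the chunk size is deliberately tiny, and the sketch assumes a source whose read(n) returns n bytes until the data runs out (true for in-memory buffers and regular files). The files\_id, n, and data keys follow the GridFS chunk layout. It is written for Python 3, so the data is bytes.

```python
import io


def iter_chunks(source, chunk_size):
    """Yield pieces of at most chunk_size bytes from a file-like object."""
    while True:
        data = source.read(chunk_size)
        if not data:
            # An empty read means the source is exhausted.
            break
        yield data


def make_chunk_documents(file_id, source, chunk_size):
    """Build GridFS-style chunk documents without reading the whole source at once."""
    return [
        {"files_id": file_id, "n": n, "data": data}
        for n, data in enumerate(iter_chunks(source, chunk_size))
    ]


# A tiny chunk size makes the splitting easy to see.
chunks = make_chunk_documents("example-id", io.BytesIO(b"hello world"), chunk_size=4)
```

The point is that only one chunk-sized buffer needs to be held in memory at a time, instead of the result of a single read() over the entire file.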

Reading small files is about 10% faster with the new API, while there doesn’t seem to be much difference when reading large files. A big difference in performance for both reads and writes is that concurrency is vastly improved. Since the GridFS semantics are handled correctly, there is no longer a per-file lock for access: concurrent operations are now fully supported and safe (with the notable exception of deleting a file, which could cause concurrent readers to see partial data).

If anybody has any ideas for benchmarks, or existing benchmarks for the old API, I’d love to see how they compare — please leave a note in the comments.

Deprecation of the Old API

In the current state of the master branch I have removed the old GridFS API completely, raising an UnsupportedAPI exception when attempts are made to use it. Normally I would go through a deprecation window, but I’m considering just releasing like this. My reasoning is that it’s a big change and I want people to make it as soon as possible. There is also the problem that mixed use of the APIs could be unsafe (in general, any use of the old API can be unsafe if it’s used to overwrite existing files). Let me know your thoughts on this decision as well. I’m hoping to get this right, as I think it will be a big improvement to PyMongo!
