A New GridFS Implementation for PyMongo
29 March 2010
The API (Application Programming Interface) for PyMongo, the Python driver for MongoDB, has been pretty stable for quite some time now. In general, I think that the driver does a great job of exposing all of MongoDB’s functionality in a way that is both MongoDB-ish and Pythonic (of course I’m biased; if you have suggestions for improvements, please let me know). There have been some minor tweaks recently, such as an improvement to the command API, but there haven’t been any huge changes in a while.
All that said, I think there is one place where PyMongo has been deficient for a long time: its support for GridFS. There are a few problems with the GridFS implementation in previous versions (<= 1.5.2) of PyMongo:
- It is slower and less concurrency-friendly than it needs to be.
- It could be much simpler and easier to work with.
- It allows some operations (modifying existing files) that are incorrect according to the GridFS semantics.
I think that all of these deficiencies stem from one fatal flaw in the original API design: it was trying too hard to mimic Python’s filesystem API. Exposing file-like objects for writing and reading is great, but things like focusing on filename as a file-handle (filename is less important in GridFS) and allowing users to modify files (which is not a concept supported by GridFS) were bad decisions. This post introduces a new GridFS API for PyMongo which tries to address all of the deficiencies in the old implementation.
The New API
The new GridFS API is available in PyMongo versions >= 1.6. There are API docs available for the new version of GridFS, but this post walks through the API in a little more detail.
Almost all of the API is exposed through the `GridFS` class, which is instantiated with a database object.
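As a minimal sketch of instantiation: the helper name `open_gridfs` and the database name `my_db` are illustrative, and the code assumes a current PyMongo release, where `MongoClient` is the connection class (PyMongo 1.6 used `Connection` instead).

```python
def open_gridfs(db_name="my_db", root_collection="fs"):
    """Return a GridFS instance for a database on a local MongoDB server.

    Both default names are illustrative assumptions, not taken from the post.
    """
    # Imported lazily so this module still loads where PyMongo is not installed.
    from pymongo import MongoClient
    from gridfs import GridFS

    db = MongoClient()[db_name]         # no arguments: connect to localhost:27017
    return GridFS(db, root_collection)  # "fs" is GridFS's default root collection
```

Calling `open_gridfs()` needs a reachable MongoDB server; the later sketches take the resulting object as their `fs` argument.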
You can also use an alternate root collection for GridFS by passing a collection name as the second argument to `GridFS`. Given an instance of `GridFS`, creating new files and getting data from existing ones is easy.
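A rough sketch of that round trip, assuming `fs` is an existing `GridFS` instance (the helper name `store_and_read` is made up for illustration):

```python
def store_and_read(fs, data=b"hello world"):
    """Write bytes to GridFS, then read them back through the returned _id."""
    file_id = fs.put(data)         # put() returns the _id of the new file
    return fs.get(file_id).read()  # get() returns a file-like object
```

The data is passed as bytes because GridFS stores raw binary content.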
The `put` method takes a string or file-like object containing data to be written to a GridFS file, and returns the `_id` of the newly created file. It also accepts keyword arguments for any of the fields available in the GridFS file spec. So to insert the contents of the local file myimage.jpg with the content type “image/jpeg” and the filename “myimage”, we would pass the open file to `put` along with `content_type` and `filename` keyword arguments.
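That call might look something like the following sketch, where `fs` is assumed to be a `GridFS` instance and `store_image` is a hypothetical helper name:

```python
def store_image(fs, path="myimage.jpg"):
    """Stream a local JPEG into GridFS and return the new file's _id."""
    # Open in binary mode so the bytes are stored unchanged.
    with open(path, "rb") as image:
        return fs.put(image, content_type="image/jpeg", filename="myimage")
```

Passing the open file object rather than its contents lets the driver read it in pieces instead of loading the whole image into memory first.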
The equivalent of doing this using the old API would be:
The `get` method takes the `_id` of a file in GridFS and returns a file-like object (an instance of `gridfs.grid_file.GridOut`) that can be used to read that file’s data. If there is no file with the given `_id`, an exception is raised.
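A hedged sketch of handling the missing-file case: in current PyMongo releases the exception is `gridfs.errors.NoFile`; the helper name `read_file` and the choice to return `None` are illustrative only.

```python
try:
    from gridfs.errors import NoFile
except ImportError:
    # Stand-in so the sketch loads without PyMongo; the real class is gridfs.errors.NoFile.
    class NoFile(Exception):
        pass


def read_file(fs, file_id):
    """Return the file's bytes, or None if no file has the given _id."""
    try:
        return fs.get(file_id).read()
    except NoFile:
        return None
```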
To delete a file in GridFS, pass its `_id` to the `delete` method.
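For instance, a short sketch that stores a throwaway file and removes it again (again assuming `fs` is a `GridFS` instance; `put_then_delete` is a made-up name):

```python
def put_then_delete(fs, data=b"temporary"):
    """Store some bytes, delete them again, and return the _id that was used."""
    file_id = fs.put(data)
    fs.delete(file_id)  # removes the file document and all of its chunks
    return file_id
```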
Advanced Usage
While `put` should cover most use cases that opening a file in write mode used to, there may be some cases where you don’t want to write all of the file’s data at once. In that case you can use the `new_file` method to get a new `gridfs.grid_file.GridIn` instance which you can write to. When you’re done with the instance, call `close` (or just use Python 2.6’s `with` statement to handle it for you). Like `put`, `new_file` takes keyword arguments for any of the values in the GridFS file spec. A cool note about both methods is that any unrecognized keyword arguments (for example, `location`) will automatically be set as attributes on the underlying file document.
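A minimal sketch of incremental writing, assuming `fs` is a `GridFS` instance; the helper name `write_in_pieces` and the metadata values in the usage note are illustrative:

```python
def write_in_pieces(fs, pieces, **metadata):
    """Write an iterable of byte strings to one new GridFS file.

    Extra keyword arguments are passed straight through to new_file().
    """
    # The with statement closes the GridIn instance when the block exits.
    with fs.new_file(**metadata) as grid_in:
        for piece in pieces:
            grid_in.write(piece)
    return grid_in._id
```

A call such as `write_in_pieces(fs, [b"hello ", b"world"], filename="greeting", location="Boston")` would pass `location` through as one of those unrecognized keyword arguments.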
Most of the examples above have been referencing files in GridFS by their `_id`. The old API made more of a point of emphasizing `filename` as the primary mode of access. While the new API encourages the better practice of referencing by `_id`, there are still some cases where you might need to work with files based on `filename`, or where your application treats `filename` as a unique identifier.
The way to work with filenames in the new API is to create a new GridFS file with the same filename every time you need to modify a file. Since files in GridFS store their upload date, we can always get the most recent version of a file by filename. We can also reference by `_id` to get any previous version, so we can treat GridFS as a versioned filestore. The method to get the last version of a file by name is `get_last_version`.
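The versioning pattern could be sketched as follows, assuming `fs` is a `GridFS` instance (`save_version` and `read_latest` are hypothetical helper names):

```python
def save_version(fs, filename, data):
    """Store a new version by creating a fresh file under the same filename."""
    return fs.put(data, filename=filename)


def read_latest(fs, filename):
    """Return the contents of the most recently uploaded file with this name."""
    return fs.get_last_version(filename).read()
```

Keeping the `_id` values returned by `save_version` is what makes older versions reachable later through `get`.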
Performance
I’ve only done some very basic benchmarking of the new GridFS implementation. It doesn’t appear to be a drastic improvement when writing large files, but writing small files is about four times faster in a simple benchmark of mine. This is because the simplification of the API has reduced the per-file overhead dramatically. There are also huge performance improvements when uploading a large file from disk or some other file-like source (especially if it was being done naively using the old API), as the new API automatically handles streaming from a file-like source into chunk-sized buffers.
Reading small files is about 10% faster with the new API, while there doesn’t seem to be much difference with reading large files. A big difference in performance for both reads and writes is that concurrency is vastly improved. Since the GridFS semantics are handled correctly, there is no longer a per-file lock for access - concurrent operations are fully supported and safe now (with the notable exception of deleting a file, which could cause concurrent readers to see partial data).
If anybody has any ideas for benchmarks, or existing benchmarks for the old API, I’d love to see how they compare — please leave a note in the comments.
Deprecation of the Old API
In the current state of the master branch I have removed the old GridFS API completely, raising an `UnsupportedAPI` exception when attempts are made to use it. Normally I would go through a deprecation window, but I’m considering just releasing like this. My reasoning is that it’s a big change and I want people to be making it as soon as possible. There is also a problem that mixed use of the APIs could be unsafe (in general, any use of the old API can be unsafe if it’s used to overwrite existing files). Let me know what your thoughts on this decision are as well. I’m hoping to get this right, as I think it will be a big improvement to PyMongo!