authored by Wayne Witzel III

Hooking up Whoosh and SQLalchemy (sawhoosh)

On August 05, 2011, in python

I was talking to Tim VanSteenburgh (we both do work for Geek.net) and he had a proof of concept for automatically updating a search index when a change happens to a model, pairing that with a unified search results page to make an easy one-stop search solution for a website. I thought this was a pretty cool idea, so I decided to do an implementation of it on top of the Pyramid framework and share it with the readers of my blog. I chose Whoosh only so that I could give users an easy way to try the project out; since Whoosh is pure Python, it installs when you run your setup.py develop. But this approach could be applied to any search system. Tim was using Solr.

Demo Project

The purpose of this post is to briefly summarize the steps taken to automatically update the index information when models change. Please clone and run the project and browse the source; looking at the source as a whole will definitely give you a better start-to-finish understanding than just reading this blog post.

GitHub: sawhoosh

Whooshing Your Models

There are a few parts that make up the magic of all this, but each part is really simple. First you have the index schema itself. This is set up to hold the unique ID of an item, a pickled version of the uninstantiated class, and an aggregate of values (decided by you) that you want to be searchable for the item. The schema looks like this:

class SawhooshSchema(SchemaClass):
    value = TEXT
    id = ID(stored=True, unique=True)
    cls = ID(stored=True)

In Whoosh terms this basically says: store id and cls so they come back with the search results, but not value. value is what we will search over in order to find our results. We also need to tell our models which attributes are important for indexing; I've chosen to do this with a __whoosh_value__ property on the model itself.

class Document(Base):
    __tablename__ = 'document'
    __whoosh_value__ = 'id,title,content,author_id'
    id = Column(CHAR(32), primary_key=True, default=new_uuid)
    title = Column(String, nullable=False)
    content = Column(Text, nullable=False)
    author_id = Column(Integer, ForeignKey('author.id'), index=True)
    def __str__(self):
        return u'Document - Title: {0}'.format(self.title)

This says: make the id, title, content, and author_id of a document all searchable values. The indexing happens automatically for us because our Base class has a parent class that handles all of the Whoosh work. This parent class, which I've called SawhooshBase, handles all the indexing, reindexing, and deindexing of our models. The class itself looks like this:

class SawhooshBase(object):
    # The fields of the class you want to index (make searchable)
    __whoosh_value__ = 'attribute,attribute,...'

    def index(self, writer):
        id = u'{0}'.format(self.id)
        cls = u'{0}'.format(pickle.dumps(self.__class__))
        # Coerce each attribute to text so non-string columns can be joined too.
        value = u' '.join([u'{0}'.format(getattr(self, attr))
                           for attr in self.__whoosh_value__.split(',')])
        writer.add_document(id=id, cls=cls, value=value)

    def reindex(self, writer):
        id = u'{0}'.format(self.id)
        writer.delete_by_term('id', id)
        self.index(writer)

    def deindex(self, writer):
        id = u'{0}'.format(self.id)
        writer.delete_by_term('id', id)

Base = declarative_base(cls=SawhooshBase)

As you can see in the index method, we use __whoosh_value__ to build an aggregate string of searchable values and supply it to the value field we defined in our search schema. We also pickle the class and store the ID. This is what later lets us easily turn search results back into object instances.

The methods above are only ever run by our SQLalchemy after_flush session event, though that is not to say you couldn't run them manually. The callback and event hook look like this:

def update_indexes(session, flush_context):
    writer = WIX.writer()
    for i in session.new:
        i.index(writer)
    for i in session.dirty:
        i.reindex(writer)
    for i in session.deleted:
        i.deindex(writer)
    writer.commit()

event.listen(DBSession, 'after_flush', update_indexes)

The WIX object you see being used here is created much the same way you create your DBSession; you can view that in the search.py file of the project. Everything else is pretty straightforward: call the respective indexer method for new, dirty, and deleted items in the session.
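In case you don't want to dig through search.py right away, creating WIX boils down to something like the sketch below. The directory name here is a placeholder of my own choosing, not necessarily what the project uses.

import os
from whoosh import index

INDEX_DIR = 'whoosh_index'  # placeholder; the project keeps its index in a directory on disk

# Open the index if it already exists, otherwise create it with our schema.
if not os.path.exists(INDEX_DIR):
    os.mkdir(INDEX_DIR)
if index.exists_in(INDEX_DIR):
    WIX = index.open_dir(INDEX_DIR)
else:
    WIX = index.create_in(INDEX_DIR, SawhooshSchema())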

Using in the View

Now we have our models all whooshified (technical term) and we want our users to be able to search and see usable results. To start we have a simple search form that, when submitted, does a GET request to our search view with the keywords. Since the Whoosh query parser is already smart enough to pick up things like AND and OR, we don't have to do much.
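To give you an idea of how little we have to do, here is a quick illustration of handing the parser raw boolean syntax (a sketch reusing the WIX index from above, not code from the project):

from whoosh.qparser import QueryParser

# The default QueryParser already understands AND/OR/NOT and grouping,
# so the user's raw keyword string can be passed straight through.
parser = QueryParser('value', WIX.schema)
query = parser.parse(u'pyramid AND (whoosh OR solr)')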

The view code itself is where we use another helper, results_to_instances (also in search.py). This method takes our search results, unpickles each class, fetches the instance from the DB, and places the result into a list. It looks like this:

def results_to_instances(request, results):
    instances = []
    for r in results:
        cls = pickle.loads('{0}'.format(r.get('cls')))
        id = r.get('id')
        instance = request.db.query(cls).get(id)
        instances.append(instance)
    return instances

Now this is where you see the benefit of pickling the class type with the search result: we have ready-to-use instances of multiple types from a single search. We can pass them into a template to render the results, but since we have a mix of instance types, we need a unified way to render them without explicitly checking their type or using a series of attribute checks. I decided to use __str__ on the model as a search-results-friendly way to display the object to the user, and to define a route_name helper method on the model so that I can use request.route_url(object.route_name()) to generate the clickable links in the results. The combination of the two can be viewed below:

%if results:
    %for r in results:
        <li><a href="${request.route_url(r.route_name())}">${r}</a></li>
    %endfor
%else:
    <li>No results found</li>
%endif
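For completeness, the route_name helper on a model can be tiny. A hypothetical sketch (the real project may name its routes differently and may pass the id to route_url explicitly):

class Document(Base):
    # ... columns as shown earlier ...

    def route_name(self):
        # Name of the route that renders this object, e.g. one registered with
        # config.add_route('document_view', '/document/{id}')
        return 'document_view'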

The view itself uses some things we haven't talked about yet. QueryParser is part of Whoosh and is what turns your keywords into something useful. request.ix is just a shortcut to the WIX object we created; we've added it using a custom Request class (same as we've done with db), and you can see that in the security.py file of the project. From there we render the search results HTML and return it to our AJAX caller to inject into the page. You can see this is also where we use the results_to_instances method discussed earlier.

@view_config(route_name='search', renderer='json', xhr=True)
def search_ajax(request):
    query = request.params.get('keywords')
    parser = QueryParser('value', request.ix.schema)
    with request.ix.searcher() as searcher:
        query = parser.parse(query)
        results = searcher.search(query)
        search_results_html = render(
            'sawhoosh:templates/search/results.mako',
            dict(results=results_to_instances(request, results)),
            request=request)
    return dict(search_results_html=search_results_html)

Summary

There you have it: a search indexer using SQLalchemy on top of Pyramid. I highly recommend you clone the git repo for this project, or review the code on GitHub, if you are interested in getting a start-to-finish feel. The project lets you add, edit, and delete Authors and Documents, all while updating the local search index stored in a directory on the file system.

GitHub: sawhoosh


Pyramid and Traversal with a RESTful interface

On August 01, 2011, in python

UPDATE (2011-08-05)

Please use caution when reading this post. A lot of the approach and implementation here is flawed. I am keeping the post up for historical purposes, but I am currently working on a follow up post that has a much better and proper implementation of traversal for SQLalchemy models. The practice of not returning real instances as traversal expects and tightly coupling the models to the traversal method is something that is less than desirable and will lead to more pain than gain long term. That being said, some of the approaches here are a good way to learn about traversal and how one might want to use it with their data model.

Original Post

When Pyramid was first being developed I was intrigued by the idea that I could create context aware views and use a host of methods to check permissions on those contexts, generate URLs based off those contexts, and auto-magically call the view required based on the context and the requested resource path.

So one of my first experiments with Pyramid was to implement proper resource URLs for contexts in a RESTful fashion. Eventually I plan to do this for the entire collection as well, but for now all I need is the context-level RESTful interface. The goal is to have URLs that go something like this:

  • /resource/id (GET) - default view of the resource
  • /resource/id/edit (GET) - the form that allows you to edit the resource
  • /resource/id/create (GET) - the form that allows you to create the resource
  • /resource/id (PUT) - update the resource
  • /resource/id (POST) - create the resource
  • /resource/id (DELETE) - delete the resource

This ends up being pretty damn simple with Pyramid and traversal. For those of you new to traversal, or even those who aren't, I highly recommend reading the Much Ado About Traversal chapter in the Pyramid documentation. On a side note, all of the snippets in this post are part of a real project called Stockpot, and the code is freely available via SourceForge.

My Root

So the first step for me was to design my Root object. This is really the foundation for traversal: it determines what resources traversal will be able to find and how to interact with them once it finds them. My Root object is simple and looks like this.

def _owned(obj, name, parent):
    obj.__name__ = name
    obj.__parent__ = parent
    return obj

class Root(dict):
    __name__ = None
    __parent__ = None
    #
    def __init__(self, request):
        dict.__init__(self)
        self.request = request
        self['user'] = _owned(User, 'user', self)

This is pretty straightforward. We create a user entry point for the first call to __getitem__ and return the User model with a __name__ of 'user' and the Root object as the __parent__.

My Model

For my Root object to really do anything useful, our model class needs to do some work so that when the traversal algorithm calls __getitem__ on our User model it actually gets something useful back. I've done this using a base class for my declarative_base call.

class StockpotBase(object):
    @classmethod
    def __getitem__(cls, k):
        try:
            result =  DBSession().query(cls).filter_by(id=k).one()
            result.__parent__ = cls
            result.__name__ = str(k)
            return result
        except NoResultFound, e:
            raise KeyError
    @classmethod
    def __len__(cls):
        return DBSession().query(cls).count()    
    @classmethod
    def __iter__(cls):
        return (x for x in DBSession().query(cls))

Base = declarative_base(cls=StockpotBase)

class User(Base):
    __tablename__ = 'users'
    __name__ = 'user'
    #
    def __init__(self, email, password=None, display_name=None):
        self.email = email
        self.password = password
        self.display_name = display_name
    #
    id = Column(Integer, primary_key=True)
    email = Column(String, nullable=False, unique=True)
    password = Column(String, nullable=True)
    display_name = Column(String, nullable=True)
    user_groups = relation(Group, backref='user', secondary=groups)
    groups = association_proxy('user_groups', 'name', creator=Group.group_creator)
    recipes = relation(Recipe, backref='user')
    #
    def __str__(self):
        return 'User(id={0}, email={1}, groups={2})'.format(self.id, self.email, self.groups)

    def __repr__(self):
        return self.__str__()

That is a pretty big chunk of code, so let me go through what is happening; it is rather simple. I've created StockpotBase, which has the methods our traversal algorithm is going to want. I've used it as the cls for my declarative_base call, so any class I create that inherits from Base will have all of the methods traversal needs.

The __getitem__ itself ensures that the __parent__ is set to the generic User class and the __name__ of the instance is set to the primary key. This is important later when we start using resource_url() to generate links for us in our templates, since the URLs are generated with the pattern /parent.__name__/context.__name__.
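To make that concrete, here is roughly what resource_url gives you once __name__ and __parent__ are set this way (illustrative values, assuming the code runs inside a view or template):

# resource_url walks __parent__/__name__ up to the Root to build the URL.
request.resource_url(user)           # e.g. http://example.com/user/1/
request.resource_url(user, 'edit')   # e.g. http://example.com/user/1/edit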

My Views

With the Root object set up and our model "traversal enabled", we can look at how the views for this will be set up. I personally like to use the config.scan('stockpot.views') helper and the @view_config decorator for my views. I find it cleaner and easier to have the view_config right with the actual view function.

# RESOURCE_URL = /user/id
@view_config(context=User, renderer='user/view.mako')
def get(request):
    return dict(user=request.context)

# RESOURCE_URL = /user/id/edit
@view_config(name='edit', context=User, renderer='user/edit.mako')
def edit(request):
    return dict(user=request.context)

So here is the default GET view. It allows anyone to use it (I will have a blog post about permissions with ACL and traversal later) and it uses my user/view.mako template as the renderer. Then we have the edit view, which requires the User:edit permission and uses the edit.mako template. Pretty simple. Next we have the first of the JSON views (they don't have to be JSON).

@view_config(context=User, request_method='PUT', xhr=True, renderer='json')
def put(request):
    user = request.context
    return dict(method='PUT', user_id=user.id, email=user.email)

And the Mako template jQuery for this might look something like this:

$(document).ready(function() {
    $('#put').click(function() {
        $.ajax({
            url: '${request.resource_url(user)}',
            type: 'PUT',
            context: document.body,
            dataType: 'json',
            success: function(data) {
                console.log(data);
                alert('done');
            }
        });
    });
});

And that is it. You would repeat the same view pattern for request_method POST and request_method DELETE, and you would have a RESTful API into your resources/models in a very clean fashion.
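For instance, the POST and DELETE views might be sketched like this, mirroring the PUT view above rather than quoting the project's actual code (the DBSession used here is assumed to be the same one StockpotBase uses):

@view_config(context=User, request_method='POST', xhr=True, renderer='json')
def post(request):
    user = request.context
    # Apply the submitted fields to the user here before returning.
    return dict(method='POST', user_id=user.id, email=user.email)

@view_config(context=User, request_method='DELETE', xhr=True, renderer='json')
def delete(request):
    user = request.context
    DBSession().delete(user)
    return dict(method='DELETE', user_id=user.id)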

What Happens

When a user visits the resource URL, a simple series of calls to __getitem__ happens. The Root (/) object is called with 'user'. A User object with the __name__ of 'user' and the __parent__ of Root is returned. The User class then has its __getitem__ called and uses the DBSession to look up a user based on the key given. For example, /user/1 (Root / User / k) results in '1' being passed to the User object's __getitem__ as the key. If it locates the user, it returns the instance and sets the __name__ and __parent__. If you don't set the __name__, the URL generated when you call resource_url with the context would read /user instead of /user/1.

There is nothing after the 1, so Pyramid looks for a generic unnamed view that handles the User context; in our case, our get method. When you add on edit, /user/1/edit works in the same fashion, but when traversal tries to call __getitem__ a second time on the User instance it throws a KeyError, which tells Pyramid that I am looking for a view named edit with the context of User. This works the same way for the JSON calls as well.

Feedback

I don't like the fact that there are extra DB calls here, but it is a trade-off. Even /user/1/edit has to make two database calls to hit the KeyError and resolve the proper view, but as a side effect I can do something like /user/1/collection/1 and get the specific item of the collection owned by the user. That extends to edits as well ... /user/1/collection/1/edit. Overall I like how this pattern has evolved in my application, but I would appreciate any feedback or suggested improvements to what I've done so far.


PyCodeConf 2011

On July 28, 2011, in python

See You At PyCodeConf

I will see everyone at PyCodeConf 2011 in Miami, FL this year. I will be there from October 5th through the 9th. Look me up if you are going. I will be posting updates of the event on Twitter and Google+. Looking forward to seeing everyone there!


Pyramid and velruse for Google authentication

On July 24, 2011, in python

As I continue to work with Pyramid, I find myself really enjoying the web development I am doing. Recently I needed to integrate Google OAuth (and eventually Facebook and Twitter) as one of the options for the user signup experience. I knew Chris Davies had done something similar recently, based on some Google+ activity of his I had read, so while I was exploring options I spoke with him about his experience with python-openid and velruse. The feedback and examples he gave me are pretty much the basis for what you are about to read on how to get Google OAuth working with your Pyramid application.

This post only describes how to do Google authentication with Pyramid and velruse, but you can easily adapt the information here to work with Twitter and Facebook. All the code these snippets were taken from is available in its entirety in the stockpot git repository on SourceForge.

Setting it all up

When I first looked at velruse I was a little intimidated. The documentation isn't the greatest in the world and the code is very compact (but well done), so it took a bit for me to become comfortable with it. After some code reading, and submitting a small fix so it could use SQLite as a storage type, I was ready to rock and roll.

Once I had it pip installed and in my setup.py as a requirement, the next step was to set up the ini to create and serve an instance of velruse alongside my Pyramid application. Here are the app and pipeline sections of the ini file.

[app:stockpot]
use = egg:stockpot
default_locale_name = en

[app:velruse]
use = egg:velruse
config_file = %(here)s/CONFIG.yaml

[pipeline:pstockpot]
pipeline = exc tm stockpot

[pipeline:pvelruse]
pipeline = exc tm velruse

[composite:main]
use = egg:Paste#urlmap
/ = pstockpot
/velruse = pvelruse

Integrating the pieces

velruse is configured using a YAML file. At a minimum this file needs a Store, which will hold the key/value pairs your callback receives. It also needs the OpenID and OpenID Store sections, which hold the generic realm, endpoint regex, and store information for all the providers. For most providers it is important that your realm and endpoint match what you've set up in their system.

Tip: use your /etc/hosts file to point the realm's hostname at your local machine so that when you are redirected to the endpoint with the GET token, your application receives everything OK.
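For example, a typical /etc/hosts entry for a realm of http://example.com:6543 would be the line below (adjust the hostname to whatever you registered with the provider):

127.0.0.1    example.com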

Store:
    Type: SQL
    DB: sqlite:////path/to/data/stockpot.db
Google:
    OAuth Consumer Key: MYKEY
    OAuth Consumer Secret: MYSECRET
OpenID:
    Realm: http://example.com:6543
    Endpoint Regex: http://example.com
OpenID Store:
    Type: openid.store.memstore:MemoryStore

Now in your Pyramid application you will need to do some setup.

You will need to make sure that the KeyStorage table is created along with your application tables, and you will also need to create a callback method to handle the response from the OAuth endpoint. Somewhere you'll also need to add a simple view that contains the Google login form.

And here is the simple addition to the models. This assumes you are using the same DB for your application and velruse.

from velruse.store.sqlstore import SQLBase

## inside initialize_sql
SQLBase.metadata.bind = engine
SQLBase.metadata.create_all(engine)

I added the callback code to my existing login handler. It looks for the token and attempts to look up the values for that token in velruse's KeyStorage. On a successful lookup it loads the JSON string and extracts the verifiedEmail. The email is used to look up a pre-existing user, or create a new one if it doesn't exist. Then we call remember with the request and the user.id just like normal. Now we have a logged-in user.

if 'token' in request.params:
    try:
        token = request.params.get('token')
        storage = DBSession.query(KeyStorage).filter_by(key=token).one()
        values = json.loads(storage.value)
        if values.get('status') == 'ok':
            email = values.get('profile', dict()).get('verifiedEmail')
            try:
                user = DBSession.query(User).filter_by(email=email).one()
            except orm.exc.NoResultFound:
                user = User(email)
                request.db.add(user)
                request.db.flush()

            headers = remember(request, user.id)
            return HTTPFound(location=resource_url(request.next, request),
                             headers=headers)
    except orm.exc.NoResultFound:
        request.session.flash('Unable to Authenticate you using OpenID')

From here you can treat this user like any other. They have an entry in your user table, and you can start adding the user to groups or setting permanent properties on the user. All in all it is a pretty simple process when you put it down on paper, but I know for me it all felt a little overwhelming until I actually got it put together and working.

Feedback

If you know of ways to improve the code or see obvious glaring issues, please leave a comment or email me at my first name @ pieceofpy.com.


Quick SQLalchemy Shell and Blog Update

On April 27, 2011, in python

Update

So I haven't updated my blog in a long time. Some people apparently still check in and read it every now and again. A few even cared enough to ask me why I haven't updated it. Embarrassed and ashamed, I lied and said, "Oh, I've been busy." Ok, so not so much busy as lazy. I recently switched to Blogofile, and in the process I never set up my version control to re-build and auto-publish my commits to the site. Knowing that it wasn't set up, I've been too lazy to blog because I knew I would have to commit, login, build, and copy the files to the public folder. Yeah, seriously, my barrier to entry for some things is that low, sue me!

Anyway, I've finally got it all set up to do the build and deploy for me, and I even managed to set up SSH keys and have it mirror the changes up to Bitbucket for those interested in the source code of this Blogofile blog.

Quick SQLalchemy Shell

Ok, now for something I use a lot but took for granted until I used it in front of a friend the other day, who asked me: what is that? how did you do it? A lot of times when I am working with SQLalchemy I just want a quick shell I can jump into and start poking around. I use this a lot when people ask questions on IRC or the ML; it helps me play around with stuff if I don't know the answer right away. It is also a great resource when testing new features or trying to figure things out. The nice thing is you can also change it to do reflection and easily get a session into a pre-existing table structure, which is nice if you are like me and know SA better than SQL.

The structure of the code is pretty simple:

  • sqla (folder)
    • __main__.py
    • models.py
# __main__.py
import os
import sqlalchemy as sa

from models import *

os.environ['PYTHONINSPECT'] = 'True'
engine = sa.create_engine('sqlite:///:memory:', echo=True)
Base.metadata.create_all(engine)
Session = sa.orm.sessionmaker(bind=engine)
session = Session()

And then you have models.py:

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

__all__ = ['Base', 'Test']

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = sa.Column(sa.Integer, primary_key=True)

Now you can run something like this, changing your models to have whatever type of objects and relations you desire.

(sqla)mac-wwitzel:code wayne.witzel$ python sqla
2011-04-27 14:18:45,764 INFO sqlalchemy.engine.base.Engine.0x...d410 
PRAGMA table_info("test")
2011-04-27 14:18:45,764 INFO sqlalchemy.engine.base.Engine.0x...d410 ()
>>> test = Test()
>>> session.add(test)
>>> session.commit()
2011-04-27 14:19:12,011 INFO sqlalchemy.engine.base.Engine.0x...d410 BEGIN (implicit)
2011-04-27 14:19:12,012 INFO sqlalchemy.engine.base.Engine.0x...d410 INSERT INTO test 
DEFAULT VALUES
2011-04-27 14:19:12,012 INFO sqlalchemy.engine.base.Engine.0x...d410 ()
2011-04-27 14:19:12,015 INFO sqlalchemy.engine.base.Engine.0x...d410 COMMIT
>>> t = session.query(Test).first()
2011-04-27 14:19:20,571 INFO sqlalchemy.engine.base.Engine.0x...d410 BEGIN (implicit)
2011-04-27 14:19:20,571 INFO sqlalchemy.engine.base.Engine.0x...d410 SELECT test.id AS 
test_id 
FROM test 
 LIMIT 1 OFFSET 0
2011-04-27 14:19:20,571 INFO sqlalchemy.engine.base.Engine.0x...d410 ()
>>> t
<models.Test object at 0x...>
>>>
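And if you want the reflection trick mentioned at the top, pointing the shell at an existing database is only a few more lines. A sketch of that variant of __main__.py (the engine URL is a placeholder):

# __main__.py, reflection variant: build metadata from an existing database
# instead of importing models.py.
import os
import sqlalchemy as sa

os.environ['PYTHONINSPECT'] = 'True'
engine = sa.create_engine('sqlite:///existing.db', echo=True)  # placeholder path
meta = sa.MetaData(bind=engine)
meta.reflect()                     # load a Table object for every table in the DB
Session = sa.orm.sessionmaker(bind=engine)
session = Session()
# meta.tables maps table name -> Table, ready to query with the session.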