authored by Wayne Witzel III

SQLAlchemy Cleanup Challenge

On October 13, 2011 In python, sqlalchemy

Yesterday I found myself writing some very interesting SQLAlchemy. The problem is I have a date column in PostgreSQL that is stored as epoch time, so it is just an Integer column. I need to group by year and month and grab the total count of status='A' groups for each year/month combination.

Here is what I came up with. Can you make it cleaner? Faster? I am curious to see the different variations people come up with.

pg_date_part_month = sa.func.date_part('month',
        sa.func.to_timestamp(Group.register_time))
pg_date_part_year = sa.func.date_part('year',
        sa.func.to_timestamp(Group.register_time))

group_month_select = (
    db.query(
        sa.sql.label('year', pg_date_part_year),
        sa.sql.label('month', pg_date_part_month),
        sa.sql.label('total', sa.func.count(Group.status))
    )
    .filter_by(status='A')
    .group_by(pg_date_part_year)
    .group_by(pg_date_part_month)
    .group_by(Group.status)
    .order_by(pg_date_part_year)
    .order_by(pg_date_part_month)
)
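One possible cleanup sketch (untested against a live database): build the date_part expressions once, label them, and reuse the labels in GROUP BY and ORDER BY, dropping the redundant group_by on status since the filter already pins it. The Group model here is a stand-in I made up for the reflected table in the post; compiling against the PostgreSQL dialect lets us eyeball the SQL without connecting.

```python
import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.dialects import postgresql

Base = declarative_base()

# Illustrative stand-in for the reflected groups table.
class Group(Base):
    __tablename__ = 'groups'
    group_id = sa.Column(sa.Integer, primary_key=True)
    register_time = sa.Column(sa.Integer)  # epoch seconds
    status = sa.Column(sa.String)

ts = sa.func.to_timestamp(Group.register_time)
year = sa.func.date_part('year', ts).label('year')
month = sa.func.date_part('month', ts).label('month')

group_month_select = (
    Session().query(year, month, sa.func.count(Group.status).label('total'))
    .filter(Group.status == 'A')
    .group_by(year, month)
    .order_by(year, month)
)

# Inspect the SQL PostgreSQL would receive, without a live connection:
sql = str(group_month_select.statement.compile(dialect=postgresql.dialect()))
print(sql)
```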

Using SQLAlchemy Custom Types to Convert Integers to DateTime

On October 12, 2011 In python, sqlalchemy

Today I was working on fetching some data from an existing PostgreSQL server and generating some BSON output that would later be imported into MongoDB. One of the problems I ran into was that I needed to format the timestamps easily for each row of data.

Searching the internet I ran across this blog post by Ralph Bean, which does just that, but at a level well beyond what I needed. So, taking some inspiration from Ralph's blog post, I decided to just go with a custom type.

from time import mktime
from datetime import datetime

from sqlalchemy import types


class IntegerDateTime(types.TypeDecorator):
    """Used for working with epoch timestamps.

    Converts datetimes into epoch timestamps on the way in.
    Converts epoch timestamps into datetimes on the way out.
    """
    impl = types.INTEGER

    def process_bind_param(self, value, dialect):
        return mktime(value.timetuple())

    def process_result_value(self, value, dialect):
        return datetime.fromtimestamp(value)

Then in my reflected table, I just override the column that holds the integer representation of the datetime I want.

group_table = sa.Table('groups', metadata,
    sa.Column('register_time', IntegerDateTime),
    autoload=True,
    include_columns=[
        'group_id',
        'register_time',
        'type',
    ],
)

Now when we query and begin to use our results, register_time will be a datetime object, making it very easy to do any timedelta arithmetic or string formatting.
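A quick round trip of the custom type can be demonstrated against an in-memory SQLite database; the table here is illustrative, not the reflected one from the post, and the class is repeated so the sketch is self-contained.

```python
from time import mktime
from datetime import datetime

import sqlalchemy as sa
from sqlalchemy import types

class IntegerDateTime(types.TypeDecorator):
    impl = types.INTEGER
    cache_ok = True  # SQLAlchemy 1.4+; harmless on older versions

    def process_bind_param(self, value, dialect):
        # datetime -> epoch seconds on the way in
        return int(mktime(value.timetuple()))

    def process_result_value(self, value, dialect):
        # epoch seconds -> datetime on the way out
        return datetime.fromtimestamp(value)

engine = sa.create_engine('sqlite://')
metadata = sa.MetaData()
groups = sa.Table('groups', metadata,
    sa.Column('register_time', IntegerDateTime),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(groups.insert().values(
        register_time=datetime(2011, 10, 12, 9, 30)))
    stored = conn.execute(sa.select(groups.c.register_time)).scalar()

print(repr(stored))  # a datetime, ready for timedelta arithmetic
```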


Working for Geeknet

On October 05, 2011 In python

I joined Geeknet full time this week as a Senior Software Engineer on the SourceForge team. I am really looking forward to the challenge ahead and working with everyone on the team. As part of the SourceForge team I will be helping to make SourceForge better for the current users and working on improvements that will attract new projects to SourceForge. My personal goal is to make SourceForge part of the day-to-day vocabulary like it used to be. I want to see SourceForge back in the top three when people ask "Where is a good place to host my code?" or "If you were creating an OSS project, where would you put it?". I believe that this team can do that and I am really excited for the future.

I will be candid here: when I first thought of working for Geeknet on the SourceForge team, I was thinking probably the same thing you are ... isn't that only CVS and SVN? That is just for hosting downloads, right?

So, as part of my initial curiosity about what people currently think of SourceForge versus what is actually happening on the site, I did some googling and read up on anything that mentioned SourceForge and was less than six months old. I found that it wasn't so much that SourceForge wasn't keeping pace with the other options out there, but that it wasn't marketing the fact that it was keeping pace.

Here is a list of improvements that I didn't know about until I did some research.

  • SourceForge supports Git and Mercurial (and Bazaar).
  • You can fork and submit merge requests with Git and Mercurial.
  • You don't have to get approval for projects.
  • There is a user project space for creating miscellaneous repositories.
  • Integrated Tracker / Commit messages. [#TICKET]
  • 9 out of 10 Ninjas recommend it.

When you combine that with one of the best mirror networks out there, live IRC support, and a great community, you get a pretty nice resource. And at a price of FREE.99 (like beer), it was all I needed to know that this was where I wanted to work: contributing to a long-time open source project that paved the path for other project hosting sites, while helping to keep that open spirit alive and well.


Scheduling Posts with Blogofile

On August 28, 2011 In python

I wanted to be able to schedule posts with blogofile, and this was the quickest way I could think of to do it. If someone knows of a better way, then please comment, because I would love to read about it.

My change is pretty simple: if the date in the YAML header is later than now, throw a PostProcessException and continue on with the next post. I added a cronjob that runs blogofile build every hour, so this solution is poor for blogs that take a long time to build or if you want precision scheduling, but my site builds fast and I am OK with an hour delay.
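The heart of the change, pulled out as a standalone sketch: the real version compares against the blog's configured pytz timezone, so plain datetime.now() stands in for it here, and check_post_date is a name I made up for illustration.

```python
from datetime import datetime, timedelta

class PostProcessException(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return repr(self.value)

def check_post_date(post_date, now=None):
    """Raise so the build loop skips posts dated in the future."""
    now = now or datetime.now()
    if post_date > now:
        raise PostProcessException('Post date is in the future.')

now = datetime(2011, 8, 28, 19, 0)
check_post_date(datetime(2011, 8, 27, 12, 0), now)  # past post: no exception
try:
    check_post_date(now + timedelta(hours=2), now)   # future post
    skipped = False
except PostProcessException:
    skipped = True
print(skipped)
```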

diff -r e370cb5a903f blog/_controllers/blog/post.py
--- a/blog/_controllers/blog/post.py    Tue Aug 23 23:16:37 2011 -0400
+++ b/blog/_controllers/blog/post.py    Sun Aug 28 19:03:28 2011 -0400
@@ -70,7 +70,14 @@
     def __str__(self):
         return repr(self.value)

+class PostProcessException(Exception):
+
+    def __init__(self, value):
+        self.value = value
+
+    def __str__(self):
+        return repr(self.value)
+
 class Post(object):
     """
     Class to describe a blog post and associated metadata
@@ -179,7 +186,11 @@
             self.slug = slug

         if not self.date:
-            self.date = datetime.datetime.now(pytz.timezone(self.__timezone))
+            self.date       = datetime.datetime.now(pytz.timezone(self.__timezone))
+        else:
+            if self.date > datetime.datetime.now(pytz.timezone(self.__timezone)):
+                raise PostProcessException('Post date is in the future.')
+
         if not self.updated:
             self.updated = self.date

@@ -367,7 +378,7 @@
             raise
         try:
             p = Post(src, filename=post_fn)
-        except PostParseException as e:
+        except (PostParseException,PostProcessException) as e:
             logger.warning(u"{0} : Skipping this post.".format(e.value))
             continue
         #Exclude some posts


Hooking up Whoosh and SQLAlchemy (sawhoosh)

On August 05, 2011 In python

I was talking to Tim VanSteenburgh (we both do work for Geek.net) and he had a proof of concept for automatically updating a search index when a change happens to a model, paired with a unified search results page, to make an easy one-stop search solution for a website. I thought this was a pretty cool idea, so I decided to do an implementation of it on top of the Pyramid framework and share it with the readers of my blog. I chose Whoosh only so that I could have an easy way for users to try the project out. Since Whoosh is pure Python, it installs when you run your setup.py develop. But this approach could be applied to any search system; Tim was using Solr.

Demo Project

The purpose of this post is to briefly summarize the steps taken to automatically update the index information when models change. Please browse the source, and clone and run the project; looking at the source as a whole will definitely give you a better start-to-finish understanding than just reading this blog post.

GitHub: sawhoosh

Whooshing Your Models

There are a few parts that make up the magic of all this, but each part is really simple. First you have the index schema itself. This is set up to hold the unique ID of an item, a pickled version of the uninstantiated class, and an aggregate of values (decided by you) that you want to be searchable for the item. The schema looks like this:

from whoosh.fields import SchemaClass, TEXT, ID

class SawhooshSchema(SchemaClass):
    value = TEXT
    id = ID(stored=True, unique=True)
    cls = ID(stored=True)

In the Whoosh world this basically says: store id and cls so they come back with the search results, but not value. value is what we will search over in order to find our results. We need to tell our models which attributes are important for indexing; I've chosen to do this with a __whoosh_value__ property on the model itself.

class Document(Base):
    __tablename__ = 'document'
    __whoosh_value__ = 'id,title,content,author_id'

    id = Column(CHAR(32), primary_key=True, default=new_uuid)
    title = Column(String, nullable=False)
    content = Column(Text, nullable=False)
    author_id = Column(Integer, ForeignKey('author.id'), index=True)

    def __str__(self):
        return u'Document - Title: {0}'.format(self.title)

This says: make the id, title, content, and author_id of a document all searchable values. This happens automatically for us because our Base class has a parent class that handles all of the Whoosh work. This parent class, which I've called SawhooshBase, handles all the indexing, reindexing, and deindexing of our models. The class itself looks like this:

class SawhooshBase(object):
    # The fields of the class you want to index (make searchable)
    __whoosh_value__ = 'attribute,attribute,...'

    def index(self, writer):
        id = u'{0}'.format(self.id)
        cls = u'{0}'.format(pickle.dumps(self.__class__))
        value = u' '.join([getattr(self, attr) for attr in self.__whoosh_value__.split(',')])
        writer.add_document(id=id, cls=cls, value=value)

    def reindex(self, writer):
        writer.delete_by_term('id', u'{0}'.format(self.id))
        self.index(writer)

    def deindex(self, writer):
        writer.delete_by_term('id', u'{0}'.format(self.id))

Base = declarative_base(cls=SawhooshBase)

As you can see in the index method, we use __whoosh_value__ to create an aggregate of searchable strings and supply that to the value we defined in our search schema. We pickle the class and also store the ID. This is what later lets us easily turn search results into object instances.
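Why pickling the class works as a lookup key: pickle stores only the class's module and name, not its code, so loading it back hands you the very same class object, ready to pass to a query. A minimal illustration (on Python 2, str and bytes are the same type, which is why the u'{0}'.format trick in index() round-trips):

```python
import pickle

class Document(object):
    pass

blob = pickle.dumps(Document)     # serializes a reference, not the class body
restored = pickle.loads(blob)
print(restored is Document)       # → True
```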

Those methods are only ever run by our SQLAlchemy after_flush session event, though that is not to say you couldn't run them manually. The callback and event hook look like this:

def update_indexes(session, flush_context):
    writer = WIX.writer()
    for i in session.new:
        i.index(writer)
    for i in session.dirty:
        i.reindex(writer)
    for i in session.deleted:
        i.deindex(writer)
    writer.commit()

event.listen(DBSession, 'after_flush', update_indexes)

The WIX object you see being used here is created much the same way you create your DBSession; you can view that in the search.py file of the project. Everything else is pretty straightforward: call the respective indexer methods for new, dirty, and deleted items in the session.

Using in the View

Now we have our models all whooshified (technical term) and we want our users searching and displaying usable results. To start, we have a simple search form that, when submitted, does a GET request to our search method with keywords. Since the Whoosh query parser is already smart enough to pick up things like AND and OR, we don't have to do much.

The view code itself is where we use another helper method called results_to_instances (also in search.py). This method takes our search results, unpickles the class, fetches the instance from the DB, and places the result into a list. That method looks like this:

def results_to_instances(request, results):
    instances = []
    for r in results:
        cls = pickle.loads('{0}'.format(r.get('cls')))
        id = r.get('id')
        instance = request.db.query(cls).get(id)
        instances.append(instance)
    return instances

Now this is where you see the benefit of pickling the class type with the search result: we have ready-to-use instances of multiple types from a single search result. We can pass them into a template to render the results, but since we have a mix of instance types, we need a unified way to render them without explicitly checking their type or using a series of attribute checks. I decided to use __str__ on the model as a search-results-friendly way to display the object to the user, and to define a route_name helper method on the model so that I can use request.route_url(object.route_name()) to generate the clickable links in the results. The combination of the two can be viewed below:

%if results:
<ul>
% for r in results:
    <li><a href="${request.route_url(r.route_name())}">${r}</a></li>
% endfor
</ul>
%else:
<p>No results found</p>
%endif
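The __str__/route_name convention itself can be sketched without Pyramid or Whoosh; the class names and route names below are illustrative, not from the project:

```python
class Author(object):
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return u'Author - Name: {0}'.format(self.name)
    def route_name(self):
        return 'author_view'

class Document(object):
    def __init__(self, title):
        self.title = title
    def __str__(self):
        return u'Document - Title: {0}'.format(self.title)
    def route_name(self):
        return 'document_view'

# Mixed result types render uniformly, no isinstance checks needed.
results = [Document(u'Intro'), Author(u'Wayne')]
lines = [u'{0}'.format(r) for r in results]
routes = [r.route_name() for r in results]
print(lines)
```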

And the view itself uses some things we haven't talked about yet. QueryParser is just part of Whoosh; it is what turns your keywords into something useful. request.ix is just a shortcut to the WIX object we created; we've added it using a custom Request class (same as we've done with db), and you can see that in the security.py file of the project. From there we render the search results HTML and return it to our AJAX caller to inject into the page. You can see this is also where we use the results_to_instances method discussed earlier.

@view_config(route_name='search', renderer='json', xhr=True)
def search_ajax(request):
    query = request.params.get('keywords')
    parser = QueryParser('value', request.ix.schema)
    with request.ix.searcher() as searcher:
        query = parser.parse(query)
        results = searcher.search(query)
        search_results_html = render('sawhoosh:templates/search/results.mako',
                                     dict(results=results_to_instances(request, results)),
                                     request=request)
    return dict(search_results_html=search_results_html)

Summary

There you have it: a search indexer using SQLAlchemy on top of Pyramid. I highly recommend you clone the Git repo for this project, or review the code on GitHub, if you are interested in getting a start-to-finish feel. The project lets you add, edit, and delete Authors and Documents, all while updating the local search index stored in a directory on the file system.

GitHub: sawhoosh
