Re: [Snowball-discuss] Optimising among

From: Olly Betts (olly@survex.com)
Date: Fri Sep 15 2006 - 02:05:54 BST


On Fri, Sep 15, 2006 at 12:49:44AM +0100, richard@lemurconsulting.com wrote:
> On Thu, Sep 14, 2006 at 06:57:25PM +0100, Olly Betts wrote:
> > So for each among I look at the first character to be checked in each
> > string. If they are all the same mod 32, I add a simple bitmap check
> > which in many cases means I can give signal f before even calling
> > find_among or find_among_b.
>
> I assume you mean "all the same div 32" (not mod) here, reading the code.

Yes!

> This makes a lot of sense to me. I'll give Martin a couple of days to
> comment, since I've not looked at this bit of the code in detail, but if he
> has no objections, I'd be happy to commit this.

I suspect I'll have a second version of this patch - I mainly posted it
now for comment (and to avoid duplication of effort in case Martin also
got inspired by the conversation!)

> I wonder if the characters in the among structures used are as nicely
> chunked into groups of 32 characters in other languages (particularly
> Russian). Taking a quick look at the algorithms makes me think they are,
> but if not, the same kind of idea might be applied with a larger bitmap, or
> some such. Anyway, they probably are.

The bitmaps don't have to be aligned on 32 character blocks, but it
makes the checking code simpler (and the generating code simpler to,
which doesn't matter generally but did when I wasn't sure if this was
going to work out!) I've now added code to look at the cases I'm not
currently adding the check for, and I don't believe we'd actually be
able to encode any more with a non-aligned bitmap.

It might also be worth special casing when we have 2 or fewer candidates
since checking "z->p[z->c] == 'a' || z->p[z->v] == 'b'" is likely to be
faster than the bit shifting and and-ing. And this would also allow us to
add a check for the among with "'s'", "'s" or "'" in the English stemmer
which current fails to fit in a 32 character block.

The other case I don't deal with in the English stemmer is one which
includes an empty string - this just needs a slightly different variant
of the bitmap check (which I'm just trying to implement now).

Looking at what doesn't get encoded for other languages, it's pretty
common to have all lower case letters and one other character (typically
an upper case letter, accented character, or an apostrophe). Perhaps
that's worth checking for too.

Cheers,
    Olly



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:48 BST