RE: [Snowball-discuss] Stop word lists

From: Alex Murzaku (lists@lissus.com)
Date: Tue Oct 08 2002 - 14:15:01 BST


The sources for those lists are SMART, Oracle, or a Russian search
engine whose name I don't remember (maybe mnogosearch or mysql).

Anyway, I haven't spent time studying them: I needed a solution and this
was much faster than what I initially started doing (sieving through
high-frequency wordlists and removing or adding words depending on my
judgment).

I think that the content of a stopword list is really application
dependent. Apparently, the English stopwords I provided are derived from
some business text corpora and were intended for that audience, which
coincided with what I needed. Of course, depending on the intelligence
of the indexing (some kind of context-aware indexing, for example),
stopword lists might not be needed at all because, in that case, the
difference between A BOOK and THE BOOK would mean a lot.

In any case, the lists I sent were only offered as "better than nothing"
and as a possible starting point. I didn't know you had them already.
Maybe native speakers of the other languages also have preexisting
lists and/or suggestions that could be used for perfecting the word
sets. Sometimes these discussions help, but when judgments are too
subjective, they risk creating more confusion... :)

-----Original Message-----
From: snowball-discuss-admin@lists.tartarus.org
[mailto:snowball-discuss-admin@lists.tartarus.org] On Behalf Of Martin Porter
Sent: Tuesday, October 08, 2002 6:34 AM
To: Snowball discuss
Cc: lists@lissus.com
Subject: RE: [Snowball-discuss] Stop word lists

Alex,

I have now looked at the stopword lists you sent yesterday, and they
have increased my confidence in the quality of the Snowball ones. I have
looked at the English one very carefully, and can report on the
findings.

If x is the Snowball stopword list for English, and y is the English
stopword list you sent me, we can look at the various sets x, y, x-y,
y-x, x or y, x and y. Their sizes are as follows:

| x | = 119
| y | = 76
| x-y | = 59
| y-x | = 16
| x or y | = 135
| x and y | = 60

and the sets themselves are,

x = { a about above after again against all am an and any are as at be
because been before being below between both but by did do does doing
down during each few for from further had has have having he her here
hers herself him himself his how i if in into is it its itself me more
most my myself no nor not of off on once only or other our ours
ourselves out over own same she so some such than that the their theirs
them themselves then there these they this those through to too under
until up very was we were what when where which while who whom why with
you your yours yourself yourselves }

y = { a about after all also an and any are as at be because been but by
can co corp could for from had has have he her his if in inc into is it
its last more most mr mrs ms mz no not of on one only or other out over
s says she so some such than that the their there they this to up was we
were when which who will with would }

x-y = { above again against am before being below between both did do
does doing down during each few further having here hers herself him
himself how i itself me my myself nor off once our ours ourselves own
same theirs them themselves then these those through too under until
very what where while whom why you your yours yourself yourselves }

y-x = { also can co corp could inc last mr mrs ms mz one s says will
would }
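
The comparison above can be reproduced with ordinary set operations. A
minimal Python sketch, assuming each list is held as a whitespace-separated
string copied from this message (the variable names are illustrative only):

```python
# x: the Snowball English stopword list, as quoted in this message.
x = set("""
a about above after again against all am an and any are as at be
because been before being below between both but by did do does doing
down during each few for from further had has have having he her here
hers herself him himself his how i if in into is it its itself me more
most my myself no nor not of off on once only or other our ours
ourselves out over own same she so some such than that the their theirs
them themselves then there these they this those through to too under
until up very was we were what when where which while who whom why with
you your yours yourself yourselves
""".split())

# y: the English stopword list that was sent for comparison.
y = set("""
a about after all also an and any are as at be because been but by
can co corp could for from had has have he her his if in inc into is it
its last more most mr mrs ms mz no not of on one only or other out over
s says she so some such than that the their there they this to up was we
were when which who will with would
""".split())

only_x = x - y   # in x but not in y
only_y = y - x   # in y but not in x
either = x | y   # "x or y"
both = x & y     # "x and y"

# Print the sizes in the same order as the table in the message.
for label, s in [("x", x), ("y", y), ("x-y", only_x),
                 ("y-x", only_y), ("x or y", either), ("x and y", both)]:
    print(f"| {label} | = {len(s)}")

# The terms unique to y, sorted for easy reading.
print(sorted(only_y))
```

Sorting the difference sets before printing gives the same alphabetical
layout as the lists shown above.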

As you can see, x is substantially larger than y, and the terms in x-y
are plausible stopwords. But if you take the 16 terms in y-x, 6 are
mentioned in the comments in the source of x, and so could always be
picked up by users working from the source:

    auxiliaries: can could will would
    common words: also says

7 are components of names of people and organisations and should only be
treated as stopwords in rather special circumstances:

    co corp inc mr mrs ms mz

which leaves

    s one last.
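
The breakdown of the 16 terms in y-x can be checked mechanically. A
minimal Python sketch; the group names are illustrative labels chosen for
this example, not identifiers from the Snowball source:

```python
# The 16 terms that appear in y but not in x, as listed in this message.
only_y = set(
    "also can co corp could inc last mr mrs ms mz one s says will would".split()
)

# Terms mentioned in the comments of the Snowball source for x.
mentioned_in_comments = set("can could will would also says".split())

# Components of names of people and organisations.
name_components = set("co corp inc mr mrs ms mz".split())

# Whatever is not covered by the two groups above.
remainder = only_y - mentioned_in_comments - name_components

# The three groups should be disjoint and together cover all of y-x.
assert mentioned_in_comments <= only_y and name_components <= only_y
assert not (mentioned_in_comments & name_components)
assert (len(mentioned_in_comments) + len(name_components)
        + len(remainder)) == len(only_y)

print(sorted(remainder))
```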

's' is the second component in words like John's, boy's ... and is not
really a stopword, assuming the indexing is done intelligently. 'last' I
don't think should be a stopword ('The Last Detail', 'The cobbler's
last' ...). 'one' on the other hand is an omission from x, even if it
should only be mentioned in the notes. I will fix it up. (I can see how
'one' came to be omitted, but won't bore you with the details.)

I will look more closely at the other stopword lists in due course.

Where did they come from? I would like to put the Finnish one in place
in the interim.

------

Actually the English stopword list is the only one I did not make up
myself. It derives from a list which used to be used in IR experiments
in Cambridge and which I have modified over the years. An early form of
it can be found on pp.18-19 of van Rijsbergen's 'Information Retrieval',
Butterworths, 1975. Interestingly, that list contains 'co', which I
remember removing many years ago. I am still doubtful about some of the
entries: 'very' and 'further' for example.

Martin
