Phrase analysis and expansion with Ruby

The idea is to take a phrase and analyze it for use in Information Retrieval. We need to tokenize it into words, possibly transform some of the tokens, and possibly expand some tokens into variants or subphrases. This class lets you register blocks to perform tokenization, transformations, and expansions, plus plain string-to-string substitutions. Each expansion can carry a numerical cost for the operation; this is intended for raising or lowering the scores of matches in the theoretical IR application.

Given the phrase “joe’s sushi & bait-shop shack”, assume I want to tokenize on whitespace, replace the ampersand with the word “and”, and create word variants for the hyphenated words and the words containing apostrophes. See the last spec for an example of the Ruby data structure this class generates.

class Analyzer
  def initialize
    @expansions = []
    @transformations = []
    @substitutions = {}
    @tokenizer = lambda { |string| string.split }
  end

  def tokenizer(&proc)
    @tokenizer = proc
  end

  def expansion(cost=0.0, &proc)
    @expansions << [cost, proc]
  end

  def substitution(input, output)
    @substitutions[input] = output
  end
  alias_method :sub, :substitution

  def transformation(&proc)
    @transformations << proc
  end

  def tokenize(string)
    @tokenizer.call(string)
  end

  def process_token(token)
    @transformations.each do |proc|
      token = proc.call(token)
    end
    if out = @substitutions[token]
      token = out
    end
    variants = {}
    @expansions.each do |cost, proc|
      if variant = proc.call(token)
        variants[variant] = cost
      end
    end
    variants.size > 0 ? [token, variants] : token
  end

  def analyze(string)
    tokenize(string).map { |token| process_token(token) }
  end
end
describe "An Analyzer" do
  before do
    @analyzer = Analyzer.new
  end

  it "can take a custom tokenizer" do
    @analyzer.tokenizer { |string| string.split(/\s+/) }
    @analyzer.tokenize("three blind mice").should == %w{three blind mice}
    @analyzer.tokenizer { |string| string.scan(/[\w']+/) }
    @analyzer.tokenize("joe's bait-shop").should == %w{joe's bait shop}
  end

  it "can perform weighted term expansions" do
    @analyzer.expansion(0.5) { |word| word.tr( "'", "") if word =~ /'/ }
    @analyzer.expansion(0.5) { |word| word.chomp("'s") if word =~ /'s$/ }
    @analyzer.process_token("joe's").should == ["joe's", {"joe" => 0.5, "joes" => 0.5}]
    @analyzer.process_token("boring").should == "boring"
  end

  it "can transform terms" do
    @analyzer.transformation { |word| word.reverse }
    @analyzer.process_token("123").should == "321"
  end

  it "can substitute terms" do
    @analyzer.substitution("&", "and")
    @analyzer.process_token("&").should == "and"
  end

  it "expands terms after substitutions" do
    @analyzer.expansion { |word| "ampersand" if word == "and" }
    @analyzer.substitution("&", "and")
    @analyzer.process_token("&").should == ["and", {"ampersand" => 0.0}]
  end

  it "substitutes after transformations" do
    @analyzer.substitution("joe", "joseph")
    @analyzer.transformation { |word| word.tr('m', 'j') }
    @analyzer.process_token("moe").should == "joseph"
  end

  it "does phrases, if you know how to Enumerable#map" do
    @analyzer.sub("&", "and")
    @analyzer.expansion(0.5) { |word| word.tr( "'", "") if word =~ /'/ }
    @analyzer.expansion(0.5) { |word| word.chomp("'s") if word =~ /'s$/ }
    @analyzer.expansion(3.0) { |word| word.split('-') if word =~ /-/ }
    @analyzer.expansion(0.1) { |word| word.tr('-', '') if word =~ /-/ }
    orig = "joe's sushi & bait-shop shack"
    analyzed = [
      ["joe's", {"joe" => 0.5, "joes" => 0.5}],
      "sushi",
      "and",
      ["bait-shop", {"baitshop" => 0.1, ["bait", "shop"] => 3.0}],
      "shack"
    ]
    @analyzer.analyze(orig).should == analyzed
  end
end
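
The analyzed structure mixes bare strings with [token, variants] pairs, so a consumer has to normalize it before scoring. Here is a rough sketch of one way the theoretical IR application might do that. The weighted_terms helper and the choice to give the original token a cost of 0.0 are my assumptions for illustration; they are not part of the Analyzer class.

```ruby
# Output of Analyzer#analyze for "joe's sushi & bait-shop shack",
# copied from the last spec so this snippet runs on its own.
analyzed = [
  ["joe's", { "joe" => 0.5, "joes" => 0.5 }],
  "sushi",
  "and",
  ["bait-shop", { "baitshop" => 0.1, ["bait", "shop"] => 3.0 }],
  "shack"
]

# Hypothetical helper: for each position in the phrase, list every
# [term, cost] pair. The original token is assumed to cost 0.0.
def weighted_terms(analyzed)
  analyzed.map do |entry|
    # An entry is either a plain token or a [token, variants] pair.
    token, variants = entry.is_a?(Array) ? entry : [entry, {}]
    [[token, 0.0]] + variants.to_a
  end
end

terms = weighted_terms(analyzed)
terms.each { |alternatives| p alternatives }
```

Each element of terms is the list of alternatives for one position in the phrase, so a query builder could OR the alternatives together and penalize each by its cost. Note that a subphrase such as ["bait", "shop"] stays an array and would need to be handled as a phrase match.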


Deletes, Transposes, Replaces, Inserts

Very simplistic rudiments of a spell checker in Ruby. Based on Norvig’s article.

# useful for things like http://norvig.com/spell-correct.html
module Edits
  DICT = { "cap" => 1, "carp" => 1, "clap" => 1, "cramp" => 1 }

  def deletes
    map_transforms { |word, i| word.delete_at(i) }
  end

  def transposes
    map_transforms { |word, i| word[i], word[i+1] = word[i+1], word[i] }
  end

  def replaces
    ("a".."z").map do |c|
      map_transforms { |word, i| word[i] = c }
    end.flatten
  end

  def inserts
    ("a".."z").map do |c|
      r = map_transforms { |word, i| word.insert(i, c) }
      terminal = "#{self}#{c}"
      r << terminal if terminal.score
      r
    end.flatten
  end

  def map_transforms
    out = []
    chars = self.split('')
    self.size.times do |i|
      yield(word = chars.dup, i)
      word = word.join
      out << word if word.score && word != self
    end
    out
  end

  def score
    DICT[self]
  end
end

String.send(:include, Edits)

describe "a String, imbued with Edits" do
  it "works" do
    r = "crap".deletes
    r.should == %w{ cap }
    r = "crap".transposes
    r.should == %w{ carp }
    r = "crap".replaces
    r.should == %w{ clap }
    r = "crap".inserts
    r.should == %w{ cramp }
  end
end
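
The module above only generates candidates; the remaining step in a spell checker of this kind is to pick the most likely one. What follows is a standalone sketch of that step, not an extension of the Edits module. The COUNTS table, edits1, and correct are names I made up for illustration, and the counts are invented rather than taken from a corpus.

```ruby
# Hypothetical word counts; a real dictionary would be built from a corpus.
COUNTS = { "cap" => 3, "carp" => 1, "clap" => 2, "cramp" => 1 }
LETTERS = ("a".."z").to_a

# Every string exactly one delete, transpose, replace, or insert away.
def edits1(word)
  # Split the word at every position: "crap" => ["", "crap"], ["c", "rap"], ...
  splits = (0..word.size).map { |i| [word[0, i], word[i..-1]] }
  deletes    = splits.map { |a, b| a + b[1..-1] unless b.empty? }
  transposes = splits.map { |a, b| a + b[1] + b[0] + b[2..-1] if b.size > 1 }
  replaces   = splits.flat_map { |a, b| b.empty? ? [] : LETTERS.map { |c| a + c + b[1..-1] } }
  inserts    = splits.flat_map { |a, b| LETTERS.map { |c| a + c + b } }
  (deletes + transposes + replaces + inserts).compact.uniq
end

# Return the word itself if known, else the known candidate with the
# highest count, else the word unchanged.
def correct(word)
  return word if COUNTS.key?(word)
  known = edits1(word).select { |w| COUNTS.key?(w) }
  known.max_by { |w| COUNTS[w] } || word
end

puts correct("crap")
```

With these made-up counts, all four dictionary words are one edit away from "crap", and the highest count decides between them.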


Instant.rake: Compile and run individual Java classes using Rake

Sometimes, when forced to work with Java, you just want to copy and paste some code and fiddle with it. A real project build system is overkill. Try Instant.rake:

https://gist.github.com/231231
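
The gist is not reproduced here. To give a feel for the general idea, here is a small Ruby sketch that builds the two commands needed to compile and run a single Java source file. The java_commands helper is hypothetical and not taken from Instant.rake, and it assumes the class lives in the default package and has a main method.

```ruby
# Hypothetical helper, not taken from the gist: returns the commands
# that compile one Java source file and then run the resulting class.
# Assumes the class is in the default package.
def java_commands(source_path)
  dir = File.dirname(source_path)
  class_name = File.basename(source_path, ".java")
  [
    ["javac", "-d", dir, source_path], # write the .class file next to the source
    ["java", "-cp", dir, class_name]   # run it with that directory on the classpath
  ]
end

# Usage (requires a JDK on the PATH):
#   java_commands("scratch/Hello.java").each { |cmd| system(*cmd) or abort }
```

Wrapping the same two steps in a Rake task would give one command per class, which is presumably the convenience the gist is after.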

Improved object wrapper for JRuby Embed

New in JRuby 1.4 is JRuby Embed, which lets you eval Ruby from Java classes. It works, appears to be well-written, and needs some sugar. Here’s a class that limits your options in a helpful way.

https://gist.github.com/238940

King’s Third Rule of Software Development

Any software project not written in Java will clearly state on its homepage the implementation language.