Random ramblings

Thoughts from a .NET developer

Updated version of the QueryVisitor

January 11, 2012 21:22 by Simon Svensson

I recently revisited the task of providing a meaningful did-you-mean feature for an existing search implementation. It ended up requiring a rewrite of the previously published QueryVisitor: it now takes a single query as input and can generate any number of queries as a result.

Our application requires all terms to be present for a document to match, which means that even when each term is spelled correctly, the combination may still yield an empty result. The solution is to override the VisitTerm method and use any means available to determine whether the term can be replaced with something else that increases the likelihood of a match.

A very simple (and totally inappropriate) visitor could return each term both as-is and reversed, expanding the single query into several.

public class ReverseTermSuggester : QueryVisitor {
    protected override IEnumerable<Term> VisitTerm(Term term) {
        yield return term;

        var chars = term.Text().ToCharArray();
        Array.Reverse(chars);
        yield return new Term(term.Field(), new string(chars));
    }
}

public class Program {
    public static void Main() {
        var queryParser = new QueryParser("f", new StandardAnalyzer());
        var query = queryParser.Parse("first second^0.5 \"within phrase\"");

        foreach(var newQuery in new ReverseTermSuggester().Visit(query))
            Console.WriteLine(newQuery);
    }
}

Outputs...

f:first f:second^0.5 f:"within phrase"
f:first f:second^0.5 f:"within esarhp"
f:first f:second^0.5 f:"nihtiw phrase"
f:first f:second^0.5 f:"nihtiw esarhp"
f:first f:dnoces^0.5 f:"within phrase"
f:first f:dnoces^0.5 f:"within esarhp"
f:first f:dnoces^0.5 f:"nihtiw phrase"
f:first f:dnoces^0.5 f:"nihtiw esarhp"
f:tsrif f:second^0.5 f:"within phrase"
f:tsrif f:second^0.5 f:"within esarhp"
f:tsrif f:second^0.5 f:"nihtiw phrase"
f:tsrif f:second^0.5 f:"nihtiw esarhp"
f:tsrif f:dnoces^0.5 f:"within phrase"
f:tsrif f:dnoces^0.5 f:"within esarhp"
f:tsrif f:dnoces^0.5 f:"nihtiw phrase"
f:tsrif f:dnoces^0.5 f:"nihtiw esarhp"
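
The sixteen queries are the Cartesian product of two alternatives for each of the four terms. A minimal sketch of that expansion in plain JavaScript follows. It is not the Lucene-based QueryVisitor itself; `expand` and `reverse` are hypothetical names, and the queries are simplified to space-separated strings.

```javascript
// Lazily yields every combination, picking one alternative per term.
function* expand(alternatives, index = 0, prefix = []) {
    if (index === alternatives.length) {
        yield prefix.join(" ");
        return;
    }
    for (const alt of alternatives[index]) {
        yield* expand(alternatives, index + 1, [...prefix, alt]);
    }
}

// Hypothetical stand-in for VisitTerm: the term itself plus its reverse.
const reverse = (s) => s.split("").reverse().join("");

const terms = ["first", "second", "within", "phrase"];
const alternatives = terms.map((t) => [t, reverse(t)]);
const queries = [...expand(alternatives)];

console.log(queries.length); // 16
console.log(queries[1]);     // first second within esarhp
```

Because `expand` is a generator, a caller can stop consuming as soon as it has found a good enough suggestion, which matters once each term has more than a couple of alternatives.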

There's a Contrib/SpellChecker project that can be used for term expansion if a separate spellchecking index is acceptable.

Next comes the task of iterating over the generated queries and choosing which one to present to the user. The scoring is up to the implementer; it could take into account the number of results and how much the generated query differs from the original.
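
One possible scoring, sketched in plain JavaScript under loud assumptions: candidates without hits are dropped, the rest are ordered by Levenshtein distance to the original query, and ties go to the candidate with more hits. The function names, the choice of edit distance, and the hit counts are all made up for illustration; a real implementation would ask the index for the number of results.

```javascript
// Classic Levenshtein edit distance between two strings.
function editDistance(a, b) {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Returns the closest candidate that has hits, or null if none has any.
function bestSuggestion(original, candidates, countHits) {
    const ranked = candidates
        .filter((c) => c !== original)
        .map((c) => ({ query: c, hits: countHits(c), distance: editDistance(original, c) }))
        .filter((c) => c.hits > 0)
        .sort((x, y) => x.distance - y.distance || y.hits - x.hits);
    return ranked.length > 0 ? ranked[0] : null;
}

// Hypothetical hit counts standing in for a real index search.
const hitCounts = { "lucene search": 12, "lucene serach": 0, "lucent search": 3 };
const suggestion = bestSuggestion(
    "lucene serach",
    Object.keys(hitCounts),
    (q) => hitCounts[q] || 0
);

console.log(suggestion.query); // lucene search
```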

Download: QueryVisitor.zip (1.65 kb)



Updated version of the lucene-hunspell port

January 3, 2012 18:46 by Simon Svensson

[2012-01-07: Source moved to GitHub, updated example code]

I've recently updated my C# port of lucene-hunspell. This includes changing the signature of methods that previously accepted a Char[] so that they take a String instead. There are also some thread-safety issues, where a StringBuilder throws an exception about an invalid chunk length.

Here's a basic example of how the HunspellStemmer can be used.

public class SwedishHunspellAnalyzer : Analyzer {
    private static readonly HunspellDictionary Dictionary = GetDictionary(@"sv_SE");
    private static HunspellDictionary GetDictionary(String culture) {
        using (var affixStream = OpenStream(culture + @".aff"))
        using (var wordStream = OpenStream(culture + @".dic")) {
            if (affixStream != null && wordStream != null) {
                return new HunspellDictionary(affixStream, wordStream);
            }
        }

        throw new InvalidOperationException("Missing affix- or word stream.");
    }

    protected static Stream OpenStream(String fileName) {
        // TODO: Read from embedded resource, file system, ...
        throw new NotImplementedException();
    }

    public override TokenStream TokenStream(String fieldName, TextReader reader) {
        var stream = new StandardTokenizer(Version.LUCENE_29, reader);

        TokenFilter filter = new LowerCaseFilter(stream);
        filter = new HunspellStemFilter(filter, Dictionary);

        return filter;
    }
}

Source available at GitHub.



Modifying knockoutjs' visible-binding to add scrollIntoView

September 14, 2011 21:56 by Simon Svensson

I've got a container element bound using Knockout's visible binding, so that content is shown when my model says so. However, the default visible binding only toggles the element's display style, which means the container becomes visible but possibly outside the browser's viewport. This can be solved with the scrollIntoView function, which seems to be supported in every major browser. It has several limitations; one is that it always scrolls the window, even if the element is already in view. In my very specific scenario the container is the bottom-most element on the page, so scrolling the window to the bottom always works.

ko.bindingHandlers['visible'] = {
    'update': function (element, valueAccessor, allBindingsAccessor) {
        var value = ko.utils.unwrapObservable(valueAccessor());
        var isCurrentlyVisible = !(element.style.display == "none");
        if (value && !isCurrentlyVisible) {
            element.style.display = "";

            var allBindings = allBindingsAccessor();
            if (allBindings['scrollIntoView']) {
                // Use setTimeout to scroll after any observers have been
                // fired, perhaps modifying what's inside the element.
                // Scrolling before the observers have fired causes a
                // problem if an observer adds to the element's height,
                // leaving it partly hidden again.
                setTimeout(function () {
                    element.scrollIntoView(!!allBindings['alignWithTop']);
                }, 0);
            }

        } else if ((!value) && isCurrentlyVisible) {
            element.style.display = "none";
        }
    }
};

<div data-bind="visible: myModel.Visible, scrollIntoView: true, alignWithTop: false">
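
The always-scroll limitation can be worked around by checking the element's position before scrolling. A minimal sketch, not part of the binding above: `isFullyInViewport` and `scrollIntoViewIfNeeded` are hypothetical helper names, and the `view` parameter is assumed to be the browser's `window` (or anything exposing `innerWidth` and `innerHeight`).

```javascript
// True when the rectangle lies completely inside a viewport of the given size.
function isFullyInViewport(rect, viewportWidth, viewportHeight) {
    return rect.top >= 0 && rect.left >= 0 &&
        rect.bottom <= viewportHeight && rect.right <= viewportWidth;
}

// Scrolls only when the element is not already fully in view.
// Returns true if a scroll was requested, false otherwise.
function scrollIntoViewIfNeeded(element, view, alignWithTop) {
    const rect = element.getBoundingClientRect();
    if (isFullyInViewport(rect, view.innerWidth, view.innerHeight)) {
        return false;
    }
    element.scrollIntoView(alignWithTop);
    return true;
}
```

Inside the setTimeout callback of the binding, the call would then be something like `scrollIntoViewIfNeeded(element, window, !!allBindings['alignWithTop'])`.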


A QueryVisitor for Lucene

April 21, 2011 21:03 by Simon Svensson

[2012-01-11: Updated version available.]

There's often a need to modify queries in different ways, like changing every word in a search string into a prefix search (word => word*). A common thought is to do this using regular expressions or basic string mangling. This, however, is hard to get right once you take into consideration the many different query types and their syntax. An easier way to solve this is to rewrite the query generated by the QueryParser. This needs to be done recursively since queries can nest other queries ("(a b)" is a boolean query with two nested term queries). I've attached a QueryVisitor class which simplifies this.

An example: to change every TermQuery (word) into a PrefixQuery (word*), we just need to override the VisitTermQuery method and return a PrefixQuery instead. This won't match phrase queries ("a word"), but those often mean that the user wants the exact spelling/formulation specified.

public class PrefixRewriter : QueryVisitor {
    protected override Query VisitTermQuery(TermQuery query) {
        var term = query.GetTerm();
        var newQuery = new PrefixQuery(term);
        return CopyBoost(query, newQuery);
    }
}

Note that phrases with only one term ("word") are parsed as a TermQuery and thus rewritten. This is done after the analyzer has removed all stopwords, causing phrases like "with a word" to become a search for a single term, "word", which is rewritten in our example. 

public static class Program {
    public static void Main() {
        var queryParser = new QueryParser("f", new StandardAnalyzer());
        var query = queryParser.Parse("awesome rewrite^0.5 \"including one phrase\"");
        var rewritten = new PrefixRewriter().Visit(query);

        Console.WriteLine(query);
        Console.WriteLine(rewritten);
    }
}

Outputs...

f:awesome f:rewrite^0.5 f:"including one phrase"
f:awesome* f:rewrite*^0.5 f:"including one phrase"

Download: QueryVisitor.cs (5.52 kb)



Disabling the cache in BlogEngine for web farm compliance

April 20, 2011 19:57 by Simon Svensson

I'm using BlogEngine at a hosting company where several webservers serve the content from a shared filesystem storage behind the scenes. This is a web farm scenario I believe is very common, and also a known problem for BlogEngine.

A default installation of BlogEngine stores content in several XML files in App_Data. These are cached in memory in a static List<T>, which introduces problems when another server updates a file. Servers that have already read the file will have stale information, which may show up as newly created posts not appearing, deleted posts showing up again, or other scenarios where everything acts as if changes weren't persisted.

There's a Web Farm Extension which invalidates the cache of all the configured servers by calling handlers on the servers' internal IP addresses.

I instead opted to remove the cache completely. My solution is admittedly worse performance-wise, but it doesn't require internal knowledge about the hosting configuration, and it won't need to be kept up to date when my hosting company decides to change anything, like adding or removing machines.
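
The trade-off can be illustrated with a toy simulation in plain JavaScript. This is not BlogEngine code: the `storage` object stands in for the shared XML files, and `CachingServer` and `UncachedServer` are hypothetical names for the two strategies.

```javascript
// Shared storage, standing in for the XML files in App_Data.
const storage = { posts: ["first post"] };

// A server that reads storage once and keeps the result in memory.
class CachingServer {
    constructor() {
        this.cache = null;
    }
    getPosts() {
        if (this.cache === null) {
            this.cache = [...storage.posts];
        }
        return this.cache;
    }
}

// A server that reads storage on every request: slower, but never stale.
class UncachedServer {
    getPosts() {
        return [...storage.posts];
    }
}

const cached = new CachingServer();
const uncached = new UncachedServer();

cached.getPosts();                  // warms the cache
storage.posts.push("second post");  // another server saves a new post

console.log(cached.getPosts().length);   // 1 (stale)
console.log(uncached.getPosts().length); // 2
```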

The primary culprit is the Post class in the BlogEngine.Core project. You'll need to download the sources to make the changes. I was just after a quick fix, and removed functionality by disabling it using #if directives. First get the static fields posts and deletedposts out of the way, and then modify the code as needed. I've attached the code that powers the blog at the time of writing this entry.

Download: Post.cs (48.95 kb)



Using Knockout with <input type="search" />

April 19, 2011 09:10 by Simon Svensson

One of the new features in HTML5 is <input type="search" />. Chrome renders it as a textbox with an x inside, which clears the content when clicked. However, clicking it triggers the click event rather than the change event that Knockout listens for. We need to modify ko.bindingHandlers['value'] to listen for this event as well. While at it, we also hook into the parent form element's reset event, so that the value is updated when an <input type="reset" /> is triggered.

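// Keep a reference to Knockout's built-in value binding before replacing it.
var originalValueBinding = ko.bindingHandlers['value'];

ko.bindingHandlers['value'] = {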
    init: function (element, valueAccessor, allBindingsAccessor) {
        originalValueBinding.init(element, valueAccessor, allBindingsAccessor);

        // <input type="search" /> triggers onClick when clicking the X.
        ko.utils.registerEventHandler(element, "click", function (event) {
            var modelValue = valueAccessor();
            if (element.value === "" && ko.utils.unwrapObservable(modelValue) !== "") {
                if (ko.isWriteableObservable(modelValue)) {
                    modelValue("");
                } else {
                    // Not an observable: this only changes the local variable.
                    modelValue = "";
                }
            }
        });

        // listen for any <input type="reset" />
        var forms = $(element).parents('form');
        if (forms[0]) {
            ko.utils.registerEventHandler(forms[0], 'reset', function () {
                setTimeout(function () {
                    var modelValue = valueAccessor();
                    var elementValue = ko.selectExtensions.readValue(element);
                    if (ko.isWriteableObservable(modelValue)) {
                        modelValue(elementValue);
                    }
                }, 0);
            });
        }
    },
    update: function (element, valueAccessor, allBindings) {
        originalValueBinding.update(element, valueAccessor, allBindings);
    }
}

Code based on responses on the KnockoutJS Google group.



Model serializer for Knockout

April 10, 2011 20:30 by Simon Svensson

Knockout is a JavaScript library that enables two-way binding between HTML elements and a JavaScript-based view model. Their introduction contains a glimpse of the simplicity in their binding declarations.

I've built a KnockoutModelSerializer that takes any type and serializes it as a Knockout-compatible model. It's basically a JavaScript serializer that outputs calls to ko.observable and ko.observableArray (see Knockout: Observables) around values that are marked with a [KnockoutObservable] attribute.

<script type="text/javascript">
// <![CDATA[
    var Model = <%= new KnockoutModelSerializer().Serialize(Model) %>;
// ]]>
</script>

<asp:ContentPlaceHolder runat="server" ID="Main" />

<script type="text/javascript">
// <![CDATA[
    ko.applyBindings(Model);
// ]]>
</script>

public class AwesomeViewModel {
    [KnockoutObservable]
    public IList<FieldViewModel> Fields { get; set; }
}

public class FieldViewModel {
    private readonly Guid _id;
    private readonly String _header;

    public FieldViewModel(Guid id, String header) {
        _id = id;
        _header = header;
    }

    public Guid Id {
        get { return _id; }
    }

    [KnockoutObservable]
    public String Header {
        get { return _header; }
    }
}

This example model results in a very simple output which can easily be bound to using dropdown lists or jQuery tmpl.

<script type="text/javascript">
// <![CDATA[
    var Model = {
        'Fields': ko.observableArray([
            {
                'Header': ko.observable('Adress'),
                'Id': '2db5d4fe-c39e-4590-85b0-9e6c0150ae06'
            },
            {
                'Header': ko.observable('Namn'),
                'Id': 'b68a2a59-81fa-4a7b-a1eb-9e7400ccc585'
            },
            {
                'Header': ko.observable('Awesomeness in the house!'),
                'Id': '52b574a3-44c9-46df-a85a-9e9300dfe8ce'
            }
        ])
    };
// ]]>
</script>

Download: KnockoutModelSerializer.zip (3.02 kb)



Intercepting method invocations using RealProxy

April 9, 2011 00:00 by Simon Svensson

One rarely used type in the .NET Framework is RealProxy. It allows us to intercept any invocation made against the generated transparent proxy (retrieved from RealProxy.GetTransparentProxy()). We get complete control over the real invocation, including skipping it, logging it or retrying it. Oddly enough, I use this often enough to have written a base class that simplifies it.

using System;
using System.Globalization;
using System.Linq;
using System.Reflection;
using System.Runtime.Remoting.Messaging;
using System.Runtime.Remoting.Proxies;

namespace RealProxyDemo {
    public abstract class ProxyBase<T> : RealProxy where T : class {
        protected static readonly Type InstanceType = typeof(T);

        private readonly T _instance;

        protected T Instance {
            get { return _instance; }
        }

        protected ProxyBase(T instance) : base(InstanceType) {
            _instance = instance;
        }

        public override IMessage Invoke(IMessage msg) {
            var methodCallMessage = msg as IMethodCallMessage;
            if (methodCallMessage != null)
                return InvokeMethodCall(methodCallMessage);

            var ifaceList = msg.GetType().GetInterfaces().Select(i => i.Name).ToArray();
            var messageType = msg.GetType().Name;
            var messageInterfaces = String.Join(", ", ifaceList);

            throw new NotImplementedException(String.Format(
                CultureInfo.InvariantCulture,
                "Unknown message type '{0}' was passed to the ProxyBase. It implements {1}, of which none was supported.",
                messageType, messageInterfaces
            ));
        }

        protected virtual IMethodReturnMessage InvokeMethodCall(IMethodCallMessage msg) {
            var args = msg.Args;

            try {
                var result = InvokeMethodBase(msg.MethodBase, Instance, args);
                return new ReturnMessage(result, args, msg.ArgCount, msg.LogicalCallContext, msg);
            } catch (TargetInvocationException targetEx) {
                var realEx = targetEx.InnerException ?? targetEx;
                return new ReturnMessage(realEx, msg);
            } catch (Exception ex) {
                return new ReturnMessage(ex, msg);
            }
        }

        protected virtual object InvokeMethodBase(MethodBase methodBase, Object subject, Object[] args) {
            return methodBase.Invoke(subject, args);
        }
    }
}

 

Imagine a scenario where we need to log every method invocation on an IList<T>; we could just create a wrapping object that logs and forwards any calls to the wrapped list. However, a solution that works for more types than just IList<T> would be better. The RealProxy approach does come with a limitation: it only supports interfaces and types derived from MarshalByRefObject.

using System;
using System.Collections.Generic;
using System.Runtime.Remoting.Messaging;

namespace RealProxyDemo {
    public static class Program {
        public static void Main() {
            var someList = new List<String>();
            var loggingList = Log<IList<String>>(someList);
            loggingList.Add("test1");
            loggingList.Add("test2");

            foreach(var item in loggingList)
                Console.WriteLine("Item: {0}", item);
        }

        public static T Log<T>(T instance) where T : class {
            var loggingProxy = new LoggingProxy<T>(instance);
            return (T)loggingProxy.GetTransparentProxy();
        }
    }

    public class LoggingProxy<T> : ProxyBase<T> where T : class {
        public LoggingProxy(T instance) : base(instance) {
        }

        protected override IMethodReturnMessage InvokeMethodCall(IMethodCallMessage msg) {
            Console.Write("Calling {0}(", msg.MethodName);
            for (var i = 0; i < msg.InArgCount; ++i)
                Console.Write("{0}: '{1}'", msg.GetInArgName(i), msg.InArgs[i]);
            Console.WriteLine(")");
            
            // Forward the call to actually execute it.
            return base.InvokeMethodCall(msg);
        }
    }
}

 

Another scenario is when the invocation fails with a specific exception and it's possible to execute it again without introducing any unwanted side effects. This was originally built for a CMS that threw an exception when one application closed the login session while another was still using it. In that case it was possible to open a new session and retry the invocation that originally failed. A much simpler scenario is our FailAlot class, which fails about 80% of the time.

using System;
using System.Runtime.Remoting.Messaging;

namespace RealProxyDemo {
    public static class Program {
        public static void Main() {
            var failAlot = new FailAlot();
            var retryingProxy = new RetryingProxy<IFailAlot>(failAlot);
            var retryAlot = (IFailAlot)retryingProxy.GetTransparentProxy();

            Console.WriteLine("First:  {0}", retryAlot.Try());
            Console.WriteLine("Second: {0}", retryAlot.Try());
            Console.WriteLine("Third:  {0}", retryAlot.Try());
        }
    }

    public interface IFailAlot {
        Int32 Try();
    }

    public class RetryException : Exception { }

    public class FailAlot : IFailAlot {
        private readonly Random _random = new Random();
        private Int32 _tryCount;

        public Int32 Try() {
            _tryCount++;

            // Fail with 80% probability: Next(10) returns 0..9, and 0..7 fail.
            if (_random.Next(10) < 8)
                throw new RetryException();

            var oldTryCount = _tryCount;
            _tryCount = 0;
            return oldTryCount;
        }
    }

    public class RetryingProxy<T> : ProxyBase<T> where T : class {
        public RetryingProxy(T instance)
            : base(instance) {
        }

        protected override IMethodReturnMessage InvokeMethodCall(IMethodCallMessage msg) {
            IMethodReturnMessage result;
            do {
                result = base.InvokeMethodCall(msg);
            } while (result.Exception is RetryException);
            
            return result;
        }
    }
}

 

Other scenarios include...

  • ... a proxy that caches invocation results in a unit-of-work container, like HttpContext.Items (which is a per-request dictionary).
  • ... a proxy that serves old invocation results if the real invocation fails.
  • ... a proxy that validates, and optionally modifies, arguments and results.
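
The second scenario in the list can be sketched with JavaScript's built-in Proxy object, since the interception idea is the same. This is an analogy and not the RealProxy API; `serveStaleOnFailure` and the `service` object are hypothetical, and results are remembered per method name only, ignoring arguments.

```javascript
// Wraps target so each method returns its last good result when a call throws.
function serveStaleOnFailure(target) {
    const lastGood = new Map();
    return new Proxy(target, {
        get(obj, name) {
            const member = obj[name];
            if (typeof member !== "function") {
                return member;
            }
            return (...args) => {
                try {
                    const result = member.apply(obj, args);
                    lastGood.set(name, result);
                    return result;
                } catch (err) {
                    if (lastGood.has(name)) {
                        return lastGood.get(name);
                    }
                    throw err; // nothing remembered yet, so surface the failure
                }
            };
        }
    });
}

// Hypothetical service that fails on its second call only.
let calls = 0;
const service = {
    load() {
        calls++;
        if (calls === 2) {
            throw new Error("backend down");
        }
        return "result " + calls;
    }
};

const resilient = serveStaleOnFailure(service);
console.log(resilient.load()); // result 1
console.log(resilient.load()); // result 1 (served from the earlier call)
console.log(resilient.load()); // result 3
```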


C# port of lucene-hunspell

April 7, 2011 20:13 by Simon Svensson

[2012-01-03: Updated version available]

[2012-01-07: Source moved to GitHub]

I've spent some time porting lucene-hunspell (v0.2) to C#.

Hunspell is a free spell checker used by OpenOffice, Firefox, Chrome, and other applications. Hunspell dictionaries usually consist of a .aff and a .dic file containing word stems and rules detailing how they are affixed. This project provides a HunspellStemFilter class which works with the [separately] provided dictionaries and stems words while indexing/querying.

It started as a line-by-line port, followed by some refactoring to approach .NET coding guidelines (capitalization, properties, etc.).

Source available at GitHub.


