Hand-written Parser vs. Parser Library


Hello Forum,

I reached a point in my rework of our build tool where I have to improve the parsing of different languages like C# and C++ to detect dependencies between files and projects. Maybe I will also add some kind of linting to the build tool. Unfortunately I'm tied to an older version of C# to stay backwards compatible with older Visual Studio versions. This is the reason I don't have the chance to use the Roslyn parser framework out of the box and had to write my own "analyzer" to solve the task.

Now that I'm ready for refactoring, I want to turn it into a real parser instead of a regex-driven analyzer. For this task I also had a look at parser utility libraries like Sprache and Pidgin, to get some more convenience in declaring tokens and rules. I have written parsers by hand many times before, and they were always optimized for the specific token/rule and use case. For example, instead of just reading the next string, comparing it to a keyword, and then conditionally resetting the stream or emitting the token, I wrote something like this:


char[] buffer = new char[<max keyword length>];

bool MyKeywordRule(Stream data, TokenStream result)
{
    // cheap early out: only continue if the next character can start the keyword
    if(data.Peek() != myKeywordFirstChar)
        return false;

    long cursor = data.Position;
    int count = data.Read(buffer, 0, (int)Math.Min(myKeywordLength, data.Length - cursor));

    if(count == myKeywordLength && buffer.Compare("keyword", myKeywordLength))
    {
        result.Add(myToken);
        return true;
    }
    else
    {
        // no match: rewind so other rules can try again from the same position
        data.Position = cursor;
        return false;
    }
}

My hand-written function is optimized for performance: it first tests the character, then fills a buffer with exactly as much data as the keyword is long, and finally tests that buffer for equality with the keyword. If I write this in Sprache or Pidgin, it will just read the stream, compare the result against the keyword and reset on failure; no early out or any other speed optimization is performed. If I want to test against a range of keywords or character sequences, for example operators, a generic solution will read every element of the sequence from the stream and test it, while my hand-written solution can fill the buffer once, test the result in a switch statement, and early-out on the number of characters read.
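
To illustrate the kind of hand-rolled fast path I mean, here is a minimal sketch of the early-out and dispatch idea, written over a plain string cursor instead of a Stream; the keyword set and the names (KeywordMatcher, TryExact, a plain token list) are made up for the example and are not my real code:

using System.Collections.Generic;

static class KeywordMatcher
{
    // Tries to match a keyword at 'pos'; advances 'pos' and records the token only on success.
    public static bool MatchKeyword(string source, ref int pos, List<string> tokens)
    {
        if (pos >= source.Length) return false;

        switch (source[pos]) // early out on the first character
        {
            case 'c':
                return TryExact(source, ref pos, "class", tokens)
                    || TryExact(source, ref pos, "const", tokens);
            case 'r':
                return TryExact(source, ref pos, "return", tokens);
            default:
                return false; // nothing consumed, nothing to reset
        }
    }

    static bool TryExact(string source, ref int pos, string keyword, List<string> tokens)
    {
        if (pos + keyword.Length > source.Length) return false;
        if (string.CompareOrdinal(source, pos, keyword, 0, keyword.Length) != 0) return false;

        tokens.Add(keyword);   // a real tokenizer would add a token object instead of the raw text
        pos += keyword.Length; // consume the input only on a successful match
        return true;
    }
}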

My question is whether there exists a solution that can do both: define rules in a convenient way like Sprache or Pidgin, and on the other hand somehow compete with a hand-written solution (it doesn't have to be as fast, but it should at least be faster than simply replaying the parts of a rule). I thought about a solution using C# expression trees, where an operation, for example my keyword rule, can be written as a sequence of pre-condition (testing the character), bootstrapping (reading the buffer to a certain size) and validation (testing whether the buffer matches the keyword), and these can be merged together, so that a rule "A or B" doesn't test for A first, reset the stream and then test for B, but merges the pre-conditions into something like
 


IF not('A') and not('B')
   FAIL

fills the buffer only once, and then tests for equality in a switch statement. Or is writing such code by hand for each individual rule still the best solution?
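
To make the idea more concrete, roughly this is what I imagine, as a sketch built on System.Linq.Expressions; the names FirstCharIs and MergeOr are made up for illustration, and this only merges the first-character pre-conditions, not the buffering and validation steps:

using System;
using System.Linq.Expressions;

static class RuleComposer
{
    // Builds a cheap pre-condition: does the next character equal c?
    public static Expression<Func<char, bool>> FirstCharIs(char c)
    {
        ParameterExpression ch = Expression.Parameter(typeof(char), "ch");
        return Expression.Lambda<Func<char, bool>>(
            Expression.Equal(ch, Expression.Constant(c)), ch);
    }

    // Folds several pre-conditions into one compiled "A or B or ..." check.
    public static Func<char, bool> MergeOr(params Expression<Func<char, bool>>[] preconditions)
    {
        ParameterExpression ch = Expression.Parameter(typeof(char), "ch");
        Expression body = Expression.Constant(false);
        foreach (Expression<Func<char, bool>> pre in preconditions)
            body = Expression.OrElse(body, Expression.Invoke(pre, ch));
        return Expression.Lambda<Func<char, bool>>(body, ch).Compile();
    }
}

// Usage, merging the pre-conditions of the alternatives 'A' and 'B':
// Func<char, bool> gate = RuleComposer.MergeOr(RuleComposer.FirstCharIs('A'), RuleComposer.FirstCharIs('B'));
// if (!gate(nextChar)) return false; // fail both rules without touching the stream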

Thank you in advance for your suggestions, experts!


Why not use a bottom-up parser instead?  Those don't ever need to backtrack the stream position or re-evaluate anything.

1 hour ago, Nypyren said:

Why not use a bottom-up parser instead?  Those don't ever need to backtrack the stream position or re-evaluate anything.

I fully agree with that; try to find a LALR(1) parser generator that generates C#. You write a nice clean elementary token list with the pieces of text that should be recognized and a (left-recursive) grammar in which you also build the resulting object tree, and the generator does the rest. Have a look at how lex/yacc (or the GNU equivalents flex/bison) work.

LALR(1) is the de facto standard for real programming languages.

What I have seen so far in HTML parsers and even Roslyn seems to be top-down, could that be? By the way, I also want to simplify tokenizing the data, not just testing the rules.

Edit: Has anyone ever tried to use a state machine for tokenization, and if not, why not?

(attached image: TokenTree.jpg)

Top-down is much simpler to implement and debug, which is why it is commonly used. On the other hand, bottom-up is really the stronger parsing algorithm. I have been using LALR(1) lex/yacc parser generators for the past 30-ish years, and while I have had to resolve a few shift/reduce and reduce/reduce conflicts (which are errors that you get at code generation time), I never ever have had to debug a generated parser or scanner. These tools have existed for longer than I have lived, and are used by lots of people.

Tokenizing state machines are standard technology in scanners. Lex generates one from your regular-expression token descriptions in less than a second, and it handles longest match and state switching to select tokens from different sets of tokens dynamically. Most scanner generators can also track line and column positions in the source file.
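
As a rough illustration (with made-up states and token kinds, not generated code), a hand-rolled version of such a tokenizing state machine boils down to a loop with a switch over the current state; lex emits a table-driven equivalent of this:

using System.Collections.Generic;

enum ScanState { Start, InIdentifier, InNumber }

static class TinyScanner
{
    public static IEnumerable<string> Tokenize(string source)
    {
        ScanState state = ScanState.Start;
        int tokenStart = 0;

        for (int i = 0; i <= source.Length; i++)
        {
            // '\0' sentinel at the end flushes a token that runs to the end of the input
            char c = i < source.Length ? source[i] : '\0';

            switch (state)
            {
                case ScanState.Start:
                    if (char.IsLetter(c)) { state = ScanState.InIdentifier; tokenStart = i; }
                    else if (char.IsDigit(c)) { state = ScanState.InNumber; tokenStart = i; }
                    break; // whitespace and anything else is skipped in this toy example

                case ScanState.InIdentifier:
                    if (!char.IsLetterOrDigit(c))
                    {
                        yield return "Identifier: " + source.Substring(tokenStart, i - tokenStart);
                        state = ScanState.Start;
                        i--; // re-examine the character that ended the token
                    }
                    break;

                case ScanState.InNumber:
                    if (!char.IsDigit(c))
                    {
                        yield return "Number: " + source.Substring(tokenStart, i - tokenStart);
                        state = ScanState.Start;
                        i--;
                    }
                    break;
            }
        }
    }
}

// Usage:
// foreach (string token in TinyScanner.Tokenize("foo 42 bar"))
//     System.Console.WriteLine(token); // Identifier: foo, Number: 42, Identifier: bar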

Ok, thank you for your statements. During my research (no, I'm not just sitting here waiting for people to solve my problems) I came across some interesting posts that I want to share with all of you, not just so I don't have to bookmark them myself, but because someone might come across this thread and find them useful.

It doesn't seem to be really clear whether, and which, parser generator to use. The biggest issue seems to be parsing recursive and specific constructs, for example operator precedence. So I finally decided to craft my parsers/a parser generator on my own, because I'll need a few more in the future and have some experience parsing HTML/CSS and JSON, and I've also already written a C++-compliant preprocessor in pure C#. It might take some work until anything runs properly, but I'm not happy with the available solutions, so I'll try to find my own way.
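
For the operator-precedence part, the usual hand-written technique seems to be precedence climbing; here is a rough sketch with a made-up token list and precedence table (a real parser would build AST nodes instead of computing a value):

using System.Collections.Generic;

static class PrecedenceClimbing
{
    static readonly Dictionary<string, int> Precedence = new Dictionary<string, int>
    {
        { "+", 1 }, { "-", 1 }, { "*", 2 }, { "/", 2 }
    };

    // tokens: numbers and binary operators, e.g. "1", "+", "2", "*", "3"
    public static double ParseExpression(IList<string> tokens, ref int pos, int minPrecedence)
    {
        double left = double.Parse(tokens[pos++]); // a real parser would parse a primary expression here

        int prec;
        while (pos < tokens.Count
               && Precedence.TryGetValue(tokens[pos], out prec)
               && prec >= minPrecedence)
        {
            string op = tokens[pos++];
            // operators with higher precedence bind tighter, so recurse with prec + 1
            double right = ParseExpression(tokens, ref pos, prec + 1);
            switch (op)
            {
                case "+": left += right; break;
                case "-": left -= right; break;
                case "*": left *= right; break;
                default:  left /= right; break;
            }
        }
        return left;
    }
}

// Usage:
// int pos = 0;
// double value = PrecedenceClimbing.ParseExpression(new[] { "1", "+", "2", "*", "3" }, ref pos, 0);
// value == 7, because '*' binds tighter than '+'.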

For example, I had a look at the ANTLR source code and its C# runtime and found it very bloated, considering that the code for parsing whatever language you target is still missing. I want something simple that doesn't just have a simple interface on top of a huge and sometimes ugly code base, but also a simple code base itself. Some System.IO.Stream extensions and a class or two for the AST should be enough, in my opinion. But yeah, this is my (hopefully) final decision.

Here are the links. I don't know how well known these people are or whether they really worked at Microsoft, but their statements sounded solid to me:

https://mortoray.com/2012/07/20/why-i-dont-use-a-parser-generator/

especially this one, which provides three good reasons:

https://news.ycombinator.com/item?id=13915150

And I finally found someone who is writing a new parser generator with a tiny code base, called Newt. The code seems very clear and also comes with some examples, so I thought it worth mentioning.

Thank you @Alberth for taking the time.

 

