python - How to start writing lexical analyzer from scratch?

I want to write a lexical analyzer for Python from scratch, but I do not know where or how to begin. For starters, I want to assume that the Python program is passed to the analyzer as a set of strings. The analyzer should figure out where each new line begins and which whitespace is significant. How do I detect new lines in Python source code? I've read Python's lexical specification, and I understand that a stack-based approach can resolve indentation, but I can't figure out how to apply it. Is it just a regular-expression check for '\n', or is there an algorithmic way to determine this?

I deliberately don't want to use tools like lex, yacc, or flex.

1 Answer

  1. Laurent


    I recently developed a parser for finding SQL statements. I know the rules are different, but the approach might be similar. I used Python to develop the parser. The first step was a regex that finds the beginnings of tokens. I then passed that list to a finite state machine (FSM) that determined what the tokens represented.

    The rules needed to account for things like line breaks, semicolons, and strings and comments that can span lines. In Python you will be more concerned with line feeds and indentation.
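
For the indentation part, the stack-based approach the question mentions can be sketched as follows. This is a minimal illustration, not a full implementation: it assumes the input is already split into lines, that indentation uses spaces only (no tabs), and that no line sits inside a multi-line string or bracket. The function name `indentation_tokens` and the token names `LINE`, `INDENT`, and `DEDENT` are made up for the example.

```python
def indentation_tokens(lines):
    """Turn leading whitespace into INDENT/DEDENT markers using a stack."""
    stack = [0]          # indentation widths of the currently open blocks
    tokens = []
    lineno = 0
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue     # blank and comment-only lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # deeper than the current block: open a new one
            stack.append(width)
            tokens.append(("INDENT", lineno))
        else:
            # shallower: close blocks until the widths match again
            while width < stack[-1]:
                stack.pop()
                tokens.append(("DEDENT", lineno))
            if width != stack[-1]:
                raise IndentationError(
                    f"line {lineno}: unindent does not match any outer level"
                )
        tokens.append(("LINE", lineno, stripped))
    # at end of input, close every block that is still open
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", lineno + 1))
    return tokens
```

Calling it on `["if x:", "    y = 1", "v = 3"]` yields a `LINE`, an `INDENT` before the second line, and a `DEDENT` before the third. A dedent to a width that was never pushed onto the stack is reported as an error, which is why a stack is used rather than a single counter.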

    Once you can split out individual statements, pass each one to a function that parses a single statement; it would probably take a similar approach.

    You can find the SQL parser at This illustrates the approach to find tokens, and use an FSM to process them. And yes, it handles strings that cross over line breaks.
