parsing - Very simple C++ lexical analysis in Java

I need to analyze some C++ source files in order to generate some very basic information. What I am trying to do is this:

  1. Assume that we have the C++ grammar and the C++ source file to be analyzed.
  2. The analyzer will read the source file like a lexical analyzer and identify the keywords etc. as defined by the C++ grammar.
  3. After reading each line, the analyzer will output information of the form "Line#: lexical information". For example, consider this:

    int main(int x, int y)
    {
       return x+y;
    }

The program will output:

    Line 1: function: main, params: x, y
    Line 2: paren "{"
    Line 3: keyword: "return"

or something similar.

Can somebody please tell me how to do this? I have looked at ANTLR and TXL, but I'm guessing that there should be a simpler way. I'd like to write a Java program that does this work. The first thing I would like to do is get the function definitions in a file, with their corresponding line numbers. Any help will be much appreciated.

Thanks, Anton

2 Answers

  1. Jacob- Reply


    Your best bet, as of today, is probably Clang.

    While Clang is known for being a C/C++/Objective-C/Objective-C++ frontend on top of LLVM, it has been designed as a set of libraries specifically so that individual components can be reused outside the compiler itself.

    Of particular interest is libclang, a C library that wraps the core preprocessing and semantic analysis and exposes the parsed tree through a C API, because C is the lingua franca. libclang notably serves as the basis for the Python bindings, so if you really want it in Java you should be able to use JNI (if I remember the term correctly) to interface with it. Also, the libclang interface is kept very stable (unlike the internal compiler representations), as it is meant to be used by external users.

    The Python bindings have already been used to create clang_complete, a vim plugin for auto-completion. You can read this blog article about it for example (there is a nifty video showing it in action).

    Insider note: the Python bindings are currently being significantly improved by Gregory Szorc under the guidance of Tobias Grosser; you can see Gregory's announcement here.

  2. James- Reply


    If you want the output that you've suggested, you are not doing lexical analysis; you're doing at least a bit of parsing, since you have to understand that what you're looking at is, say, a function declaration versus an object initialization. Lexical analysis would only give you back a flat sequence of tokens (keywords, identifiers, punctuation and so on), whereas you'd like to understand that those first tokens actually define a function.
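
    As a rough illustration of what a purely lexical pass yields, here is a minimal Java sketch of a tokenizer that tags each token with its line number. The class name `SimpleCppLexer`, the `Token` type, the token kinds, and the keyword list are all assumptions made for this example (the keyword set is a small subset of C++), and it ignores the preprocessor, character literals, raw strings, and many operators. It assumes Java 9 or later for `Set.of`.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative sketch only: all names here are hypothetical, and the
    // keyword list covers just a small subset of C++.
    class SimpleCppLexer {

        /** A token plus the 1-based line on which it starts. */
        static final class Token {
            final int line;
            final String kind;
            final String text;

            Token(int line, String kind, String text) {
                this.line = line;
                this.kind = kind;
                this.text = text;
            }
        }

        private static final Set<String> KEYWORDS = Set.of(
                "int", "char", "bool", "float", "double", "void", "return",
                "if", "else", "for", "while", "class", "struct", "const", "namespace");

        // Alternatives are tried in order at the current position, so comments
        // are recognized before the single '/' punctuator.
        private static final Pattern TOKEN = Pattern.compile(
                "(?<ws>\\s+)"
                + "|(?<comment>//[^\\n]*|/\\*.*?\\*/)"
                + "|(?<number>\\d+(?:\\.\\d+)?)"
                + "|(?<string>\"(?:\\\\.|[^\"\\\\])*\")"
                + "|(?<word>[A-Za-z_][A-Za-z0-9_]*)"
                + "|(?<punct>::|->|\\+\\+|--|<<|>>|&&|\\|\\||[<>=!+\\-*/]="
                + "|[{}()\\[\\];,.<>=+\\-*/%&|!~^?:#])",
                Pattern.DOTALL);

        static List<Token> tokenize(String source) {
            List<Token> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(source);
            int pos = 0;
            int line = 1;
            while (pos < source.length()) {
                m.region(pos, source.length());
                if (!m.lookingAt()) {
                    // Character not covered by the simplified rules above.
                    tokens.add(new Token(line, "UNKNOWN", source.substring(pos, pos + 1)));
                    pos++;
                    continue;
                }
                String text = m.group();
                String kind = null; // stays null for whitespace and comments
                if (m.group("word") != null) {
                    kind = KEYWORDS.contains(text) ? "KEYWORD" : "IDENTIFIER";
                } else if (m.group("number") != null) {
                    kind = "NUMBER";
                } else if (m.group("string") != null) {
                    kind = "STRING";
                } else if (m.group("punct") != null) {
                    kind = "PUNCTUATOR";
                }
                if (kind != null) {
                    tokens.add(new Token(line, kind, text));
                }
                // Advance the line counter past any newlines in the matched text.
                line += (int) text.chars().filter(c -> c == '\n').count();
                pos = m.end();
            }
            return tokens;
        }

        public static void main(String[] args) {
            String source = "int main(int x, int y)\n{\n   return x+y;\n}\n";
            for (Token t : tokenize(source)) {
                System.out.printf("Line %d: %s \"%s\"%n", t.line, t.kind, t.text);
            }
        }
    }
    ```

    Run on the question's snippet, this prints lines such as `Line 1: KEYWORD "int"` and `Line 1: IDENTIFIER "main"`. Nothing in that token stream says that `main` is a function; working that out is the part that requires parsing.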

    Unfortunately, parsing C++ is notoriously difficult, and most parser generators need substantial special-case tuning to produce a working C++ parser. I would suggest looking into an existing C++ parser framework, perhaps something like Elsa. This will give you the information you need in a format that you should be able to use. Another option would be to hook into the C++ compiler directly and process its underlying AST; the compiler then handles all the parsing and line number information for you. The Treehydra tool might be useful here, though you'd be writing JavaScript code rather than Java.

    Hope this helps!
