Python | Simple Lexical Analyzer


---------- SIMPLE LEXICAL ANALYZER ----------


Compilation is spread across several stages: a compiler does not immediately convert a high-level language into binary; the process takes multiple passes to complete. The first step undertaken during compilation is called lexical analysis.

During lexical analysis, the program typed by the user is broken into pieces, and every token that is part of it is extracted and stored separately (tokens are the smallest indivisible parts of a program). These tokens must be classified into types before the next phase of compilation can begin.


Lexical analysis is the first phase of a compiler. It takes the modified source code from language pre-processors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments in the source code.

In simple words, we can say that it is the process whereby the scanned characters are turned into lexemes.


A lexer is a program or tool that performs lexical analysis: it takes the modified source code, written as sentences following the rules of a specific programming language, and converts it into tokens.

A lexer contains a tokenizer or scanner.

If the lexical analyzer detects that the token is invalid, it generates an error.

It reads character streams from the source code, checks for legal tokens, and passes the data (tokens) to the syntax analyzer when it demands them.
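This on-demand hand-off can be sketched with a Python generator, where the consumer (a syntax analyzer) pulls tokens only as it needs them. The token pattern below is a simplified assumption, not a full lexical grammar:

```python
import re

# Simplified token pattern: identifiers, integers, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def token_stream(source):
    """Yield tokens one at a time; the caller pulls tokens on demand."""
    for match in TOKEN_PATTERN.finditer(source):
        yield match.group()

stream = token_stream("x = 42 ;")
print(next(stream))  # the parser requests the first token: 'x'
print(next(stream))  # then the next one: '='
```

Because the generator is lazy, no token is produced until the parser asks for it.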


A lexeme is a sequence of alphanumeric characters within a token. The term is used both in the study of language and in the lexical analysis of computer program compilation. In the context of computer programming, lexemes are the parts of the input stream from which tokens are identified; an invalid or illegal token produces an error. A lexeme is one of the building blocks of language.

So, in simple words, a lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example :

for(i; i<arr.length; i++){ }

lexeme 1: for

lexeme 2: (

lexeme 3: i

lexeme 4: ;

lexeme 5: i

lexeme 6: <

lexeme 7: arr

lexeme 8: .

lexeme 9: length

lexeme 10: ;

lexeme 11: i

lexeme 12: ++

lexeme 13: )

lexeme 14: {

lexeme 15: }
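The split above can be reproduced with a short regex-based scanner. The pattern here is a minimal sketch: it tries identifiers first, then `++` (so the increment operator stays a single lexeme), then any single symbol:

```python
import re

# Order matters: identifiers/keywords first, then '++', then any single symbol.
LEXEME_PATTERN = re.compile(r"[A-Za-z_]\w*|\+\+|\S")

code = "for(i; i<arr.length; i++){ }"
lexemes = LEXEME_PATTERN.findall(code)
for number, lexeme in enumerate(lexemes, start=1):
    print(f"lexeme {number}: {lexeme}")  # reproduces the 15 lexemes listed above
```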


A token is a sequence of characters that can be treated as a single logical entity.

In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.


There are several types of tokens associated with any language.

The names given by the user to various parts of the program, such as functions and variables, are called identifiers. They are called this because they identify a named storage location in memory.

Next come keywords: words reserved by the language for some of its functionality. Punctuators are used for the construction of expressions and statements; they are useful only in conjunction with identifiers or keywords in a statement. Operators perform actual operations on the data (such as arithmetic, logical, and shift operations). Literals are constant data that programs need to deal with, such as numbers or letters (or a combination of both).


The lexical analyzer performs the tasks given below:

  • Identifies tokens and enters them into the symbol table
  • Removes whitespace and comments from the source program
  • Correlates error messages with the source program
  • Expands macros if they are found in the source program
  • Reads input characters from the source program 
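The whitespace- and comment-removal task can be illustrated with a small sketch. The comment syntax handled here (`//` and `/* */`) is an assumption matching C-style languages such as Java:

```python
import re

def strip_comments_and_whitespace(source):
    """Remove /* ... */ and // comments, then collapse runs of whitespace."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    return " ".join(source.split())                              # collapse whitespace

print(strip_comments_and_whitespace("int x = 1; /* counter */ // init\nx++;"))
# prints: int x = 1; x++;
```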

Example : Python code to implement a lexical analyzer; in this example we produce tokens from a Java program. 

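The program below is a minimal sketch of such an analyzer; the keyword list and token categories are simplified assumptions, and the input is a small Java snippet:

```python
import re

# A few Java keywords are enough for this sketch; a real lexer would list them all.
KEYWORDS = {"int", "for", "if", "else", "while", "return", "class", "public", "static", "void"}

# Ordered token specification: the first alternative that matches wins.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"\+\+|--|[+\-*/<>=]"),
    ("PUNCTUATOR", r"[(){};,.]"),
    ("SKIP",       r"\s+"),
]
MASTER_PATTERN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Return (type, lexeme) pairs; identifiers that are keywords are reclassified."""
    tokens = []
    for match in MASTER_PATTERN.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace carries no token
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens

java_code = "int sum = 0; for (int i = 0; i < 10; i++) { sum = sum + i; }"
for kind, lexeme in lex(java_code):
    print(f"{kind:<10} {lexeme}")
```

Note the ordering in TOKEN_SPEC: `++` is listed before the single-character operators so the increment operator is emitted as one token rather than two.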


Advantages of lexical analysis :

  • The lexical analysis method is used by programs such as compilers, which can use the parsed data from a programmer's code to create compiled binary executable code
  • It is used by web browsers to format and display a web page with the help of data parsed from JavaScript, HTML, and CSS
  • A separate lexical analyzer helps you construct a specialized and potentially more efficient processor for the task


Disadvantages of lexical analysis :

  • You need to spend significant time reading the source program and partitioning it into tokens
  • Some regular expressions are quite difficult to understand compared to PEG or EBNF rules
  • More effort is needed to develop and debug the lexer and its token descriptions
  • Additional runtime overhead is incurred to generate the lexer tables and construct the tokens 


Our implementation of a Python lexical analyzer should be enough to demonstrate how it works as part of a compiler. We hope this helped you understand lexical analysis in Python programming.
