PHP Static Code Analysis 101

An introduction to what's behind PHP static code analysis

🐘 🐘 🐘

Hello! I'm beram 👋

🐷 Benjamin Rambaud
🐘 PHP Engineer at @ekino_France

🦄

What is Static Code Analysis❓

Code Analysis❓

Process of automatically analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness.

Thanks Wikipedia 😉

Static❓

                        Static                Dynamic
Execute the program?    No                    Yes
Examples                Bug Finder            Code Coverage
                        Coding Standards      Security
                        etc.                  etc.

PHP Static Analysis Tools ❓

PHP Static Analysis Tools

More at https://github.com/exakat/php-static-analysis-tools

How to do Static Analysis❓

  • Regular Expressions 😰
  • Tokens 👈 🐘
  • Abstract Syntax Tree (AST) 👈 🐘

Overview of How PHP Works

             -------   Tokens   --------    AST    ----------   Opcodes   -----------------
PHP Code -->| Lexer |--------->| Parser |-------->| Compiler |---------->| VM -> Execution |
             -------            --------           ----------             -----------------

Starting from:

<?php

// Chookity
echo "I love ponies";

To execute the opcodes:


000 ECHO string("I love ponies")
001 RETURN int(1)

Tokens.. Lexer..❓ 🤔

  • Lexical analysis, lexing or tokenization

    process of converting a sequence of characters into a sequence of tokens.

  • Tokens

    strings/symbols with an assigned and thus identified meaning.

Thanks Wikipedia 😉

Tokens.. Lexer..❓ 🤔

For instance this sentence in English is composed of tokens.

  • Word (instance)
  • Whitespace ( )
  • full stop (.)
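
To make the idea concrete, here is a toy lexer for that kind of English sentence. It is only a sketch: the token names (WORD, WHITESPACE, FULL_STOP) and the function name tokenizeSentence() are invented for this example and have nothing to do with PHP's own lexer.

```php
<?php

/**
 * Toy lexer: splits an English sentence into typed tokens.
 * Token names are made up for this illustration.
 *
 * @return array<int, array{0: string, 1: string}> list of [type, text] pairs
 */
function tokenizeSentence(string $sentence): array
{
    $patterns = [
        'WORD'       => '[A-Za-z]+',
        'WHITESPACE' => '\s+',
        'FULL_STOP'  => '\.',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($sentence);

    while ($offset < $length) {
        foreach ($patterns as $type => $pattern) {
            // \G anchors the match at the current offset.
            if (1 === preg_match('/\G' . $pattern . '/', $sentence, $match, 0, $offset)) {
                $tokens[] = [$type, $match[0]];
                $offset += strlen($match[0]);
                continue 2; // restart the while loop at the new offset
            }
        }

        throw new InvalidArgumentException(sprintf('Unexpected character at offset %d', $offset));
    }

    return $tokens;
}

foreach (tokenizeSentence('I love ponies.') as [$type, $text]) {
    printf('%s (%s)%s', $type, $text, PHP_EOL);
}
```

Each token keeps both its type and its original text, which is the same idea as the token arrays PHP returns.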

Tokens in PHP

  • T_COMMENT: // or #, and /* */
  • T_CONST: const
  • T_ELSE: else
  • T_FOREACH: foreach
  • T_FUNCTION: function
  • T_IS_EQUAL: ==
  • T_IS_IDENTICAL: ===
  • T_VARIABLE: $foo
  • etc..

See List of Tokens

Tokens in PHP:

token_get_all() returns an array of tokens. Each token is either:

  • a single character (e.g. ;, ., >, !)
  • a three-element array containing:
    • 0: the token identifier (an integer that token_name() turns into a name such as T_ECHO)
    • 1: the string content of the original token
    • 2: the line number
See: token_get_all() manual or token_name() manual or List of Tokens
<?php \var_dump(\token_get_all('<?php echo "I love ponies";', TOKEN_PARSE));
array(5) {
  [0]=>
  array(3) {
    [0]=>
    int(382)
    [1]=>
    string(6) "<?php "
    [2]=>
    int(1)
  }
  [1]=>
  array(3) {
    [0]=>
    int(324)
    [1]=>
    string(4) "echo"
    [2]=>
    int(1)
  }
  [2]=>
  array(3) {
    [0]=>
    int(385)
    [1]=>
    string(1) " "
    [2]=>
    int(1)
  }
  [3]=>
  array(3) {
    [0]=>
    int(315)
    [1]=>
    string(15) ""I love ponies""
    [2]=>
    int(1)
  }
  [4]=>
  string(1) ";"
}

Tokens in PHP

<?php

$src = '<?php echo "I love ponies";';
$tokens = \token_get_all($src, TOKEN_PARSE);

foreach ($tokens as $token) {
    if (\is_array($token)) {
        \printf('Line %d: %s (%s) (%s)%s', $token[2], \token_name($token[0]), $token[0], $token[1], PHP_EOL);
        continue;
    }
    
    \printf('String token: %s%s', $token, PHP_EOL);
}
Line 1: T_OPEN_TAG (382) (<?php )
Line 1: T_ECHO (324) (echo)
Line 1: T_WHITESPACE (385) ( )
Line 1: T_CONSTANT_ENCAPSED_STRING (315) ("I love ponies")
String token: ;

Tokens in PHP

But wait... the PHP RFC
"Object-based token_get_all() alternative"
was accepted and is available since PHP 8.0
🥰🥰🥰

Tokens in PHP

PhpToken::tokenize() returns an array of PhpToken objects 🎉

PhpToken {
    public int $id;
    public string $text;
    public int $line;
    public int $pos;
    
    final public __construct ( int $id , string $text , int $line = -1 , int $pos = -1 )
    public getTokenName ( ) : string|null
    public is ( int|string|array $kind ) : bool
    public isIgnorable ( ) : bool
    public __toString ( ) : string
    public static tokenize ( string $code , int $flags = 0 ) : array
}
See: PhpToken manual
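
As a small illustration of the helper methods in the synopsis above, the sketch below uses isIgnorable() to keep only the tokens the parser cares about, and is() to test a token against an id, a text or a list of both. It assumes PHP 8.0 or later; the function name significantTokenNames() is made up for the example.

```php
<?php

/**
 * Returns the names of the tokens that are not ignorable
 * (whitespace, comments, doc comments and the open tag are skipped).
 *
 * @return string[]
 */
function significantTokenNames(string $source): array
{
    $names = [];
    foreach (PhpToken::tokenize($source) as $token) {
        if ($token->isIgnorable()) {
            continue;
        }
        // For single-character tokens, getTokenName() returns the character itself.
        $names[] = $token->getTokenName();
    }

    return $names;
}

print_r(significantTokenNames('<?php echo "I love ponies"; // Chookity'));

// is() accepts a token id, the token text, or an array mixing both.
$echo = PhpToken::tokenize('<?php echo 1;')[1];
var_dump($echo->is(T_ECHO), $echo->is('echo'), $echo->is([T_PRINT, T_ECHO]));
```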

Tokens in PHP

<?php

$src = '<?php echo "I love ponies";';

$tokens = \PhpToken::tokenize($src, TOKEN_PARSE);
foreach ($tokens as $token) {
    \printf('Line %d: %s (%s) (%s)%s', $token->line, $token->getTokenName(), $token->id, $token->text, PHP_EOL);
}
Line 1: T_OPEN_TAG (390) (<?php )
Line 1: T_ECHO (326) (echo)
Line 1: T_WHITESPACE (393) ( )
Line 1: T_CONSTANT_ENCAPSED_STRING (318) ("I love ponies")
Line 1: ; (59) (;)

Tokens in PHP

$tokens = \token_get_all($src, TOKEN_PARSE);
foreach ($tokens as $token) {
    if (\is_array($token)) {
        \printf('Line %d: %s (%s) (%s)%s', $token[2], \token_name($token[0]), $token[0], $token[1], PHP_EOL);
        continue;
    }
    
    \printf('String token: %s%s', $token, PHP_EOL);
}

VS

$tokens = \PhpToken::tokenize($src, TOKEN_PARSE);
foreach ($tokens as $token) {
    \printf('Line %d: %s (%s) (%s)%s', $token->line, $token->getTokenName(), $token->id, $token->text, PHP_EOL);
}

For the rest of the talk we'll keep token_get_all() because it is still heavily used 😒

Tokens in PHP

Lorem ipsum

Plop plip

<?php echo "I love ponies";
Line 1: T_INLINE_HTML (313) (Lorem ipsum

Plop plip

)
Line 5: T_OPEN_TAG (382) (<?php )
Line 5: T_ECHO (324) (echo)
Line 5: T_WHITESPACE (385) ( )
Line 5: T_CONSTANT_ENCAPSED_STRING (315) ("I love ponies")
String token: ;
<?php

class Pony
{
    public const PUBLIC = 1;
}
Line 1: T_OPEN_TAG (382) (<?php
)
Line 2: T_WHITESPACE (385) (
)
Line 3: T_CLASS (364) (class)
Line 3: T_WHITESPACE (385) ( )
Line 3: T_STRING (311) (Pony)
Line 3: T_WHITESPACE (385) (
)
String token: {
Line 4: T_WHITESPACE (385) (
    )
Line 5: T_PUBLIC (358) (public) <---------------------------------- Difference between the keyword "public"
Line 5: T_WHITESPACE (385) ( )
Line 5: T_CONST (344) (const)
Line 5: T_WHITESPACE (385) ( )
Line 5: T_STRING (311) (PUBLIC) <---------------------------------- and the constant name "PUBLIC"
Line 5: T_WHITESPACE (385) ( )
String token: =
Line 5: T_WHITESPACE (385) ( )
Line 5: T_LNUMBER (309) (1)
String token: ;
Line 5: T_WHITESPACE (385) (
)
String token: }

Example

Using tokens, let's check if a class name respects upper camel case (MyClass).

foreach ($tokens as $i => $token) {
    if (FALSE === \is_array($token)) {
        continue;
    }

    if (T_CLASS !== $token[0]) {
        continue;
    }

    $classNameToken = $tokens[$i + 2];
    $className = $classNameToken[1];
    $line = $classNameToken[2];

    if (!isUpperCamelCase($className)) {
        printf('Class name "%s" should be in upper camel case at line "%d".%s', $className, $line, PHP_EOL);
    }
}
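
The loop above assumes that the class name is always exactly two tokens after T_CLASS. A slightly more defensive, self-contained variant could look like the sketch below. The isUpperCamelCase() helper is not shown in the slides, so the regular expression used here is an assumption, and findBadClassNames() is a name invented for the example.

```php
<?php

// Hypothetical helper: the slides call isUpperCamelCase() without showing it.
function isUpperCamelCase(string $name): bool
{
    return 1 === preg_match('/^[A-Z][A-Za-z0-9]*$/', $name);
}

/**
 * @return array<int, array{name: string, line: int}>
 */
function findBadClassNames(string $source): array
{
    $tokens = token_get_all($source);
    $count = count($tokens);
    $violations = [];

    foreach ($tokens as $i => $token) {
        if (!is_array($token) || T_CLASS !== $token[0]) {
            continue;
        }

        // Move to the next token that is not whitespace or a comment.
        $next = $i + 1;
        while ($next < $count
            && is_array($tokens[$next])
            && in_array($tokens[$next][0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)
        ) {
            $next++;
        }

        // Only a T_STRING right after "class" is a class name; this skips
        // anonymous classes ("new class {") and "Foo::class" constants.
        if ($next >= $count || !is_array($tokens[$next]) || T_STRING !== $tokens[$next][0]) {
            continue;
        }

        [, $className, $line] = $tokens[$next];
        if (!isUpperCamelCase($className)) {
            $violations[] = ['name' => $className, 'line' => $line];
        }
    }

    return $violations;
}

$source = "<?php\nclass my_pony {}\nclass Pony {}\n\$name = Pony::class;\n";
foreach (findBadClassNames($source) as $violation) {
    printf('Class name "%s" should be in upper camel case at line "%d".%s', $violation['name'], $violation['line'], PHP_EOL);
}
```

Returning the violations instead of printing them directly keeps the check easy to test and to plug into a reporter.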

So PHP Tokens..

  • need a lot of helpers to work with.
  • intuitive? 🤔 not really...
    But it is better with the PhpToken class 🥰🥰🥰
  • PHP_CodeSniffer (PHPCS) is mainly based on them. 👏👏
  • scope is limited to a single file: how would you check inheritance, etc.?

🤓 Practice to have fun 🤓

Try to detect and ban code like $a = $a + 1;
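
One possible starting point, as a deliberately naive sketch: drop whitespace and comments, then look for the exact six-token pattern "variable = same variable + 1 ;". The function name findSelfIncrements() is made up, and the pattern will miss variants such as $a = 1 + $a; or $a += 1;.

```php
<?php

/**
 * Returns the line numbers where "$x = $x + 1;" appears.
 *
 * @return int[]
 */
function findSelfIncrements(string $source): array
{
    // Keep only meaningful tokens so spacing and comments do not matter.
    $tokens = array_values(array_filter(
        token_get_all($source),
        static function ($token): bool {
            return !is_array($token)
                || !in_array($token[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true);
        }
    ));

    $lines = [];
    $count = count($tokens);

    for ($i = 0; $i + 5 < $count; $i++) {
        [$var, $assign, $same, $plus, $one, $end] = array_slice($tokens, $i, 6);

        if (is_array($var) && T_VARIABLE === $var[0]
            && '=' === $assign
            && is_array($same) && T_VARIABLE === $same[0] && $same[1] === $var[1]
            && '+' === $plus
            && is_array($one) && T_LNUMBER === $one[0] && '1' === $one[1]
            && ';' === $end
        ) {
            $lines[] = $var[2]; // line of the assigned variable
        }
    }

    return $lines;
}

$source = "<?php\n\$a = 1;\n\$a = \$a + 1;\n\$b = \$a + 1;\n";
foreach (findSelfIncrements($source) as $line) {
    printf('Use $a++ or $a += 1 instead of $a = $a + 1 at line %d.%s', $line, PHP_EOL);
}
```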

Abstract Syntax Tree (AST)

🧐

Abstract Syntax Tree (AST)❓ 🤔

TL;DR

a data structure
to represent the structure of code

AST❓ 🤔

(2 * 3) + 4
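
As a rough illustration, the expression above can be written by hand as a tree and evaluated recursively. The array shape (type, left, right, value) and the evaluate() function are invented for this sketch; they are not the format used by PHP or by nikic/php-parser. Note that the parentheses do not appear anywhere: the grouping is carried by the tree structure itself.

```php
<?php

// Hand-built AST for (2 * 3) + 4, using a made-up array shape.
$ast = [
    'type'  => 'Plus',
    'left'  => [
        'type'  => 'Mul',
        'left'  => ['type' => 'Number', 'value' => 2],
        'right' => ['type' => 'Number', 'value' => 3],
    ],
    'right' => ['type' => 'Number', 'value' => 4],
];

// Walks the tree depth-first: children are evaluated before their parent.
function evaluate(array $node): int
{
    switch ($node['type']) {
        case 'Number':
            return $node['value'];
        case 'Plus':
            return evaluate($node['left']) + evaluate($node['right']);
        case 'Mul':
            return evaluate($node['left']) * evaluate($node['right']);
    }

    throw new InvalidArgumentException(sprintf('Unknown node type "%s"', $node['type']));
}

var_dump(evaluate($ast)); // int(10)
```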

Parser to generate the AST 🤓

From userland:

Some Nodes used in the AST by nikic/php-parser

  • Const_: const
  • Stmt\Else_: else
  • Stmt\Foreach_: foreach
  • Stmt\Function_: function
  • Expr\BinaryOp\Equal: ==
  • Expr\BinaryOp\Identical: ===
  • Expr\Variable: $foo
  • Scalar\LNumber: 42 (long number => integer)
  • Scalar\DNumber: 42.0 (double number => float)
  • Scalar\String_: "I am string"
  • etc..
See: PHP 7 grammar written in a pseudo language

Let's generate some AST using nikic/php-parser

<?php

use PhpParser\Error;
use PhpParser\NodeDumper;
use PhpParser\ParserFactory;

$code = '<?php echo "I love ponies";';

$parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
try {
    $ast = $parser->parse($code);
} catch (Error $error) {
    echo "Parse error: {$error->getMessage()}" . PHP_EOL;
    return;
}

echo (new NodeDumper)->dump($ast) . PHP_EOL;

Generated AST

<?php

echo "I love ponies";
array(
    0: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: I love ponies
            )
        )
    )
)

Generated AST

Lorem ipsum

Plop plip

<?php echo "I love ponies";

array(
    0: Stmt_InlineHTML(
        value: Lorem ipsum

    Plop plip


    )
    1: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: I love ponies
            )
        )
    )
)

Generated AST

<?php

$a = (2 * 3) + 4;
array(
    0: Stmt_Expression(
        expr: Expr_Assign(
            var: Expr_Variable(
                name: a
            )
            expr: Expr_BinaryOp_Plus(
                left: Expr_BinaryOp_Mul(
                    left: Scalar_LNumber(
                        value: 2
                    )
                    right: Scalar_LNumber(
                        value: 3
                    )
                )
                right: Scalar_LNumber(
                    value: 4
                )
            )
        )
    )
)

Generated AST

<?php

final class Pony
{
   public const PUBLIC = 1;
}
array(
    0: Stmt_Class(
        attrGroups: array(
        )
        flags: MODIFIER_FINAL (32)
        name: Identifier(
            name: Pony
        )
        extends: null
        implements: array(
        )
        stmts: array(
            0: Stmt_ClassConst(
                attrGroups: array(
                )
                flags: MODIFIER_PUBLIC (1)
                consts: array(
                    0: Const(
                        name: Identifier(
                            name: PUBLIC
                        )
                        value: Scalar_LNumber(
                            value: 1
                        )
                    )
                )
            )
        )
    )
)

Generated AST: a more "complex" code 😅😅😅

<?php

declare(strict_types=1);

namespace App\Acme;

final class Item {

    private string $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

🦣🦣We skip the AST's text version because it is huge 🦣🦣

Example

Using the AST, let's check if a class name respects upper camel case (MyClass).

use PhpParser\Node;
use PhpParser\Node\Stmt\Class_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node) {
        // Anonymous classes have no name ($node->name is null): skip them.
        if (!$node instanceof Class_ || null === $node->name) {
            return;
        }

        // $node->name is an Identifier node; toString() gives the plain class name.
        $className = $node->name->toString();
        if (!isUpperCamelCase($className)) {
            printf('Class name "%s" should be in upper camel case at line "%d".%s', $className, $node->getStartLine(), PHP_EOL);
        }
    }
});
$traverser->traverse($ast);

Example

Let's change the AST and generate the modified PHP code to force upper camel case

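use PhpParser\Node;
use PhpParser\Node\Identifier;
use PhpParser\Node\Stmt\Class_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\PrettyPrinter;

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node) {
        // Anonymous classes have no name ($node->name is null): skip them.
        if (!$node instanceof Class_ || null === $node->name) {
            return;
        }

        // $node->name is an Identifier node; toString() gives the plain class name.
        $className = $node->name->toString();
        if (!isUpperCamelCase($className)) {
            // Swap in a new name node; the rest of the class is untouched.
            $node->name = new Identifier(strToCamelCase($className), $node->getAttributes());
        }
    }
});
$newAst = $traverser->traverse($ast);

echo (new PrettyPrinter\Standard)->prettyPrintFile($newAst);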

😍 Soooo simple! 😍
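
The strToCamelCase() helper is called in the example but never shown in the slides. A minimal sketch of what it might look like, assuming that words are separated by any non-alphanumeric character such as an underscore:

```php
<?php

// Hypothetical implementation: the original helper is not part of the slides.
function strToCamelCase(string $name): string
{
    // Split on anything that is not a letter or a digit, e.g. "my_pony" => ["my", "pony"].
    $words = preg_split('/[^A-Za-z0-9]+/', $name, -1, PREG_SPLIT_NO_EMPTY);

    // Upper-case the first letter of each word and glue the words together.
    return implode('', array_map('ucfirst', $words));
}

echo strToCamelCase('my_pony'), PHP_EOL; // MyPony
```

It leaves the rest of each word untouched, so an all-caps name like MY_PONY would need extra handling.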

Example/Exercise 😛

Let's add the type to the property!

What we have:

<?php

namespace App\Acme;

final class A {
    /** @var string $v */
    public $v = 'coucou';
}

What we want:

<?php

namespace App\Acme;

final class A {
    public string $v = 'coucou';
}

Example/Exercise 😛

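$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node) {
        if (!$node instanceof Node\Stmt\Property) {
            return;
        }

        // Nothing to do for properties without a docblock.
        $doc = $node->getDocComment();
        if (null === $doc) {
            return;
        }

        // DocComment is a helper that is not shown here: it extracts the @var type from the docblock text.
        $docComment = DocComment::createFromDocBlock($doc->getText());
        // Drop the docblock, which is now redundant with the native type.
        $node->setAttribute('comments', NULL);
        $node->type = new Node\Identifier($docComment->getType());
    }
});
$newAst = $traverser->traverse($ast);

echo (new PrettyPrinter\Standard)->prettyPrintFile($newAst);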
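
The DocComment helper used above is not shown in the slides. As an assumption about what its type extraction might boil down to, here is a minimal regex-based sketch; extractVarType() is a made-up name, and it only handles a single simple type (no unions, generics or array shapes).

```php
<?php

/**
 * Extracts the type of the first @var tag, or null when there is none.
 * Handles simple types only, e.g. "string", "?int" or "App\Acme\Item".
 */
function extractVarType(string $docBlock): ?string
{
    // Optional "?" followed by word characters and namespace separators.
    if (1 === preg_match('/@var\s+(\??[\\\\\w]+)/', $docBlock, $match)) {
        return $match[1];
    }

    return null;
}

var_dump(extractVarType('/** @var string $v */')); // string(6) "string"
```

A real tool would rather rely on a dedicated docblock parser than on a regular expression.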

🤓 Bonus 🤓

Did you know you could modify the AST during runtime? 😛

With a PHP Extension implementing the zend_ast_process hook!

See: PHP internal book

(⚠️ could lead to crashes or weird behavior without perfect understanding of the AST possibilities ⚠️)

So AST...

😍😍😍

❤️❤️❤️

What we saw 🪚 (ba dum tss 🥁)

  • We use the "same" techniques as PHP to do it: Tokens and AST 🌲🌴🌳
  • We are able to analyse and modify PHP code for simple purposes 🤓
  • We are able to contribute to tools like PHP_CodeSniffer (PHPCS), PHPStan, Psalm, Rector etc..
  • We are limited, like PHP, to analysing one file at a time.. 😥 for the moment 😛
  • Big up to nikic and everyone who enhanced the PHP static analysis world 👏👏👏

But wait... that's it? 🥺

Next time we'll go a little bit further:

  • how to analyse a whole project 🤓
  • how PHPStan, Psalm or Rector work internally 🤓
  • and who knows 😏

Readings

Thanks! Any Questions?

Just an intro, other parts later

Analysis: process as a method of studying the nature of something or of determining its essential features and their relations

Static analysis is a process for determining the relevant properties of a (PHP) program without actually executing the program. Dynamic analysis is a process for determining the relevant properties of a program by monitoring/observing the execution states of one or more runs/executions of the program. Computing the code coverage according to a test suite is a standard dynamic analysis technique. Why static? Quicker, no need to have the full app running (no DB needed etc..)

PHP processes one file at a time. We are going to do the same for this talk.

0000 ECHO string("I love ponies")
0001 RETURN int(1)

To get the opcodes you could use the Vulcan Logic Disassembler: http://pecl.php.net/package/vld

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. Some authors term this a "token", using "token" interchangeably to represent the string being tokenized, and the token data structure resulting from putting this string through the tokenization process. The word lexeme in computer science is defined differently than lexeme in linguistics. A lexeme in computer science roughly corresponds to a word in linguistics, although in some cases it may be more similar to a morpheme.

The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.

The parsing stage takes the token stream from the Lexer as input and outputs an Abstract Syntax Tree (AST). The parser has two jobs:

  • verifying the validity of the token order by attempting to match the tokens against any one of the grammar rules defined in its grammar file. This ensures that valid language constructs are being formed by the tokens in the token stream
  • generating the AST, which is a tree view of the source code that will be used during the next stage (compilation)

The parser is generated with Bison via the zend_language_parser.y (BNF) grammar file. PHP uses a LALR(1) (look ahead, left-to-right) context-free grammar. The look ahead part simply means that the parser is able to look n tokens ahead (1, in this case) to resolve any ambiguities it may encounter whilst parsing. The left-to-right part means that it parses the token stream from left to right.

We can view a form of the AST produced by the parser using the php-ast extension. This extension performs a few transformations upon the AST, preventing it from being directly exposed to PHP developers. This is done for a couple of reasons:

  • the AST is not particularly “clean” to work with (in terms of consistency and general usability)
  • the abstraction of the internal AST means that changes can be freely applied to it without risking breaking compatibility for PHP developers

When PHP 7 compiles PHP code it converts it into an abstract syntax tree (AST) before finally generating Opcodes that are persisted in Opcache. The zend_ast_process hook is called for every compiled script and allows you to modify the AST after it is parsed and created. This is one of the most complicated hooks to use, because it requires perfect understanding of the AST possibilities. Creating an invalid AST here can cause weird behavior or crashes. It is best to look at example extensions that use this hook: Google Stackdriver PHP Debugger Extension: https://github.com/GoogleCloudPlatform/stackdriver-debugger-php-extension/blob/master/stackdriver_debugger_ast.c Based on Stackdriver this Proof of Concept Tracer with AST: https://github.com/beberlei/php-ast-tracer-poc/blob/master/astracer.c