PhpToken::tokenize() returns an array of PhpToken objects
PhpToken {
public int $id;
public string $text;
public int $line;
public int $pos;
final public __construct ( int $id , string $text , int $line = -1 , int $pos = -1 )
public getTokenName ( ) : string|null
public is ( int|string|array $kind ) : bool
public isIgnorable ( ) : bool
public __toString ( ) : string
public static tokenize ( string $code , int $flags = 0 ) : array
}
<?php
$src = '<?php echo "I love ponies";';
$tokens = \PhpToken::tokenize($src, TOKEN_PARSE);
foreach ($tokens as $token) {
    \printf('Line %d: %s (%s) (%s)%s', $token->line, $token->getTokenName(), $token->id, $token->text, PHP_EOL);
}
Line 1: T_OPEN_TAG (390) (<?php )
Line 1: T_ECHO (326) (echo)
Line 1: T_WHITESPACE (393) ( )
Line 1: T_CONSTANT_ENCAPSED_STRING (318) ("I love ponies")
Line 1: ; (59) (;)
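The other methods of the class can be tried the same way. Below is a minimal sketch, assuming PHP 8.0 or later; the variable names are only illustrative. It filters out the tokens the parser would ignore with isIgnorable() and checks a token's kind with is(), which accepts a token id, a text, or an array of either.

```php
<?php
// Requires PHP 8.0 or later, where PhpToken was introduced.
$src = '<?php echo "I love ponies";';
$tokens = \PhpToken::tokenize($src);

// Keep only the tokens the parser cares about:
// isIgnorable() is true for whitespace, comments and the open tag.
$significant = [];
foreach ($tokens as $token) {
    if ($token->isIgnorable()) {
        continue;
    }
    $significant[] = $token->text;
}

// is() compares against a token id, a token text, or a list of those.
$echo = $tokens[1];
$isEchoById = $echo->is(T_ECHO);
$isEchoByText = $echo->is('echo');
$isOutputKeyword = $echo->is([T_ECHO, T_PRINT]);

// Single-character tokens use the character itself as their name.
$semicolonName = $tokens[4]->getTokenName();
```

With this source, $significant should hold the echo keyword, the string literal and the semicolon.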
For the rest of the talk we'll keep token_get_all() because it is still heavily used
Tokens in PHP
Lorem ipsum
Plop plip
<?php echo "I love ponies";
Line 1: T_INLINE_HTML (313) (Lorem ipsum
Plop plip
)
Line 5: T_OPEN_TAG (382) (<?php )
Line 5: T_ECHO (324) (echo)
Line 5: T_WHITESPACE (385) ( )
Line 5: T_CONSTANT_ENCAPSED_STRING (315) ("I love ponies")
String token: ;
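A dump like the one above can be produced with token_get_all(). This is a sketch, not necessarily the script used for the slides: the output format is copied from the dump, the variable names are made up. The point to notice is that token_get_all() returns either an array (id, text, line) or a bare one-character string, so every loop has to handle both shapes.

```php
<?php
$src = "Lorem ipsum\nPlop plip\n<?php echo \"I love ponies\";";

$lines = [];
foreach (token_get_all($src) as $token) {
    if (is_array($token)) {
        // Array token: [0] => id, [1] => text, [2] => line number.
        [$id, $text, $line] = $token;
        $lines[] = sprintf('Line %d: %s (%d) (%s)', $line, token_name($id), $id, $text);
    } else {
        // Single-character token such as ";" or "{": just a string.
        $lines[] = 'String token: ' . $token;
    }
}

echo implode(PHP_EOL, $lines), PHP_EOL;
```

The numeric ids depend on the PHP version, which is why code should compare against the T_* constants and never against the numbers.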
<?php

class Pony
{
    public const PUBLIC = 1;
}
Line 1: T_OPEN_TAG (382) (<?php
)
Line 2: T_WHITESPACE (385) (
)
Line 3: T_CLASS (364) (class)
Line 3: T_WHITESPACE (385) ( )
Line 3: T_STRING (311) (Pony)
Line 3: T_WHITESPACE (385) (
)
String token: {
Line 4: T_WHITESPACE (385) (
)
Line 5: T_PUBLIC (358) (public) <---------------------------------- Difference between the keyword "public"
Line 5: T_WHITESPACE (385) ( )
Line 5: T_CONST (344) (const)
Line 5: T_WHITESPACE (385) ( )
Line 5: T_STRING (311) (PUBLIC) <---------------------------------- and the constant name "PUBLIC"
Line 5: T_WHITESPACE (385) ( )
String token: =
Line 5: T_WHITESPACE (385) ( )
Line 5: T_LNUMBER (309) (1)
String token: ;
Line 5: T_WHITESPACE (385) (
)
String token: }
Example
Using tokens, let's check if a class name respects upper camel case (MyClass).
foreach ($tokens as $i => $token) {
    // Single-character tokens are plain strings, not arrays.
    if (FALSE === \is_array($token)) {
        continue;
    }
    if (T_CLASS !== $token[0]) {
        continue;
    }
    // Assumes "class", one whitespace token, then the class name.
    $classNameToken = $tokens[$i + 2];
    $className = $classNameToken[1];
    $line = $classNameToken[2];
    if (!isUpperCamelCase($className)) {
        printf('Class name "%s" should be in upper camel case at line "%d".%s', $className, $line, PHP_EOL);
    }
}
So PHP tokens...
need a lot of helpers to work with.
Intuitive? Not really...
But it is better with the PhpToken class.
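To illustrate, here is how the same class-name check could look with PhpToken. It is a sketch assuming PHP 8.0 or later. isUpperCamelCase() is used in the talk but never shown, so the version here is a hypothetical and deliberately loose one; findBadClassNames() is also a made-up name.

```php
<?php
// Hypothetical helper: a loose check (leading capital, then letters/digits).
function isUpperCamelCase(string $name): bool
{
    return 1 === preg_match('/^[A-Z][a-zA-Z0-9]*$/', $name);
}

// Hypothetical function: returns one message per badly named class.
function findBadClassNames(string $code): array
{
    $tokens = \PhpToken::tokenize($code);
    $messages = [];
    foreach ($tokens as $i => $token) {
        if (!$token->is(T_CLASS)) {
            continue;
        }
        // Skip whitespace and comments up to the next significant token.
        $j = $i + 1;
        while (isset($tokens[$j]) && $tokens[$j]->isIgnorable()) {
            $j++;
        }
        // Anonymous classes and Foo::class are not followed by a name.
        if (!isset($tokens[$j]) || !$tokens[$j]->is(T_STRING)) {
            continue;
        }
        $name = $tokens[$j];
        if (!isUpperCamelCase($name->text)) {
            $messages[] = sprintf(
                'Class name "%s" should be in upper camel case at line "%d".',
                $name->text,
                $name->line
            );
        }
    }
    return $messages;
}

$messages = findBadClassNames("<?php\nclass my_pony {}\nclass Pony {}\n");
```

No is_array() test and no numeric indexes are needed: every token is an object with named properties.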
🦣🦣 We skip the AST's text version because it is huge 🦣🦣
Example
Using the AST, let's check if a class name respects upper camel case (MyClass).
use PhpParser\Node;
use PhpParser\Node\Stmt\Class_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node)
    {
        // Anonymous classes have no name ($node->name is null).
        if (!$node instanceof Class_ || null === $node->name) {
            return null;
        }
        // $node->name is an Identifier node; get its string value.
        $className = $node->name->toString();
        if (!isUpperCamelCase($className)) {
            printf('Class name "%s" should be in upper camel case at line "%d".%s', $className, $node->getStartLine(), PHP_EOL);
        }
        return null;
    }
});
$traverser->traverse($ast);
Example
Let's change the AST and generate the modified PHP code to force upper camel case
use PhpParser\Node;
use PhpParser\Node\Identifier;
use PhpParser\Node\Stmt\Class_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\PrettyPrinter;

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node)
    {
        // Anonymous classes have no name ($node->name is null).
        if (!$node instanceof Class_ || null === $node->name) {
            return null;
        }
        $className = $node->name->toString();
        if (!isUpperCamelCase($className)) {
            // Swap the name node; the rest of the class is left untouched.
            $node->name = new Identifier(strToCamelCase($className), $node->name->getAttributes());
        }
        return null;
    }
});
$newAst = $traverser->traverse($ast);
echo (new PrettyPrinter\Standard)->prettyPrintFile($newAst);
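strToCamelCase() is called above but not shown in the talk. A possible version, purely as an assumption about what it does, splits the name on underscores, hyphens and whitespace and capitalises each part:

```php
<?php
// Hypothetical helper: the talk does not show its implementation.
function strToCamelCase(string $name): string
{
    // Split on runs of underscores, hyphens or whitespace, dropping empties.
    $parts = preg_split('/[_\-\s]+/', $name, -1, PREG_SPLIT_NO_EMPTY);

    // Capitalise the first letter of each part and glue them together.
    return implode('', array_map('ucfirst', $parts));
}

$renamed = strToCamelCase('my_little_pony');
```

Keep in mind that the visitor only renames the class declaration. Code that refers to the old name elsewhere would need its own pass.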
Analysis: the process of studying the nature of something or of determining its essential features and their relations.
Static analysis is a process for determining
the relevant properties of a (PHP) program
without actually executing the program.
Dynamic analysis is a process for
determining the relevant properties of a
program by monitoring/observing the
execution states of one or more
runs/executions of the program.
Computing the code coverage according to a test suite is a standard dynamic analysis technique.
Why static?
Quicker, no need to have the full app running (no DB needed, etc.)
PHP processes one file at a time. We are going to do the same for this talk.
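A tiny illustration of "without actually executing the program": the sketch below looks for eval() in a piece of source code by reading its tokens only. The function name findEvalCalls() and the sample source are made up for the example; the analysed code is never run.

```php
<?php
// Hypothetical checker: returns the line numbers where eval() is used.
function findEvalCalls(string $code): array
{
    $lines = [];
    foreach (token_get_all($code) as $token) {
        // eval is a language construct with its own token, T_EVAL.
        if (is_array($token) && T_EVAL === $token[0]) {
            $lines[] = $token[2];
        }
    }
    return $lines;
}

// The source under analysis is plain text for us: nothing here is executed.
$src = "<?php\n\$x = 1;\neval('echo \$x;');\n";
$evalLines = findEvalCalls($src);
```

It works on a single file, with no database and no application bootstrap, which is the whole appeal of static analysis.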
0000 ECHO string("I love ponies")
0001 RETURN int(1)
To get the opcodes, you could use the Vulcan Logic Disassembler (VLD): http://pecl.php.net/package/vld
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Some authors term this a "token", using "token" interchangeably to represent the string being tokenized, and the token data structure resulting from putting this string through the tokenization process.
The word lexeme in computer science is defined differently than lexeme in linguistics.
A lexeme in computer science roughly corresponds to a word in linguistics, although in some cases it may be more similar to a morpheme.
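In PHP terms, the token id is the category and the token text is the lexeme. The sketch below (variable names are illustrative) groups the lexemes of a small script by token name, to show that several different lexemes are instances of the same token.

```php
<?php
$src = '<?php $pony = 1; $unicorn = 22;';

$lexemesByToken = [];
foreach (token_get_all($src) as $token) {
    if (!is_array($token)) {
        continue; // single-character tokens such as "=" or ";"
    }
    // [0] is the token id (the category), [1] is the matched text (the lexeme).
    [$id, $lexeme] = $token;
    $lexemesByToken[token_name($id)][] = $lexeme;
}
```

Here $pony and $unicorn are two lexemes of the same token, T_VARIABLE, and 1 and 22 are two lexemes of T_LNUMBER.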
The syntax is "abstract" in the sense that
it does not represent every detail appearing in the real syntax,
but rather just the structural or content-related details.
For instance, grouping parentheses are implicit in the tree structure,
so these do not have to be represented as separate nodes. Likewise,
a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
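To make the parentheses point concrete, here is a hand-written toy tree for (1 + 2) * 3. It is not the format of PHP's internal AST or of any library; the array keys and the evaluate() function are invented for this example. The grouping is carried by the nesting alone, so no node represents the parentheses.

```php
<?php
// Toy AST for "(1 + 2) * 3": the addition is simply a child of the multiplication.
$grouped = ['op' => '*', 'left' => ['op' => '+', 'left' => 1, 'right' => 2], 'right' => 3];

// Toy AST for "1 + 2 * 3": same operators and operands, different shape.
$ungrouped = ['op' => '+', 'left' => 1, 'right' => ['op' => '*', 'left' => 2, 'right' => 3]];

// Hypothetical evaluator: walks the tree bottom-up.
function evaluate($node): int
{
    if (is_int($node)) {
        return $node; // leaf: a literal number
    }
    $left = evaluate($node['left']);
    $right = evaluate($node['right']);
    switch ($node['op']) {
        case '+':
            return $left + $right;
        case '*':
            return $left * $right;
    }
    throw new InvalidArgumentException('Unknown operator: ' . $node['op']);
}

$groupedResult = evaluate($grouped);
$ungroupedResult = evaluate($ungrouped);
```

The two trees evaluate to 9 and 7 respectively, even though neither contains a parenthesis node.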
The parsing stage takes the token stream from the Lexer as input and outputs an Abstract Syntax Tree (AST).
The parser has two jobs:
- verifying the validity of the token order by attempting to match them against any one of the grammar rules defined in its grammar file. This ensures that valid language constructs are being formed by the tokens in the token stream
- generating the AST, which is a tree view of the source code that will be used during the next stage (compilation)
The parser is generated with Bison via the zend_language_parser.y (BNF) grammar file. PHP uses a LALR(1) (look ahead, left-to-right) context-free grammar. The look ahead part simply means that the parser is able to look n tokens ahead (1, in this case) to resolve any ambiguities it may encounter whilst parsing. The left-to-right part means that it parses the token stream from left-to-right.
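The parser's first job, checking the token order, can be observed from userland. As far as I know, passing the TOKEN_PARSE flag to token_get_all() makes PHP run the parser over the tokens and throw a ParseError when they do not form valid constructs. The sketch below relies on that behaviour; isSyntacticallyValid() is a made-up name.

```php
<?php
// Hypothetical helper: true when the parser accepts the token stream.
function isSyntacticallyValid(string $code): bool
{
    try {
        // TOKEN_PARSE asks PHP to parse the tokens, not only to lex them.
        token_get_all($code, TOKEN_PARSE);
        return true;
    } catch (ParseError $e) {
        return false;
    }
}

$valid = '<?php echo "I love ponies";';
$invalid = '<?php echo echo;';

// The lexer alone is happy with both: each piece is a known token.
$lexedTokenCount = count(token_get_all($invalid));

$validResult = isSyntacticallyValid($valid);
$invalidResult = isSyntacticallyValid($invalid);
```

The invalid sample lexes fine because "echo" and ";" are legitimate tokens; only the grammar rules reject their order.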
We can view a form of the AST produced by the parser using the php-ast extension. This extension performs a few transformations upon the AST, preventing it from being directly exposed to PHP developers. This is done for a couple of reasons:
- the AST is not particularly "clean" to work with (in terms of consistency and general usability)
- the abstraction of the internal AST means that changes can be freely applied to it without the risk of breaking compatibility for PHP developers
When PHP 7 compiles PHP code it converts it into an abstract syntax tree (AST) before finally generating Opcodes that are persisted in Opcache.
The zend_ast_process hook is called for every compiled script and allows you to modify the AST after it is parsed and created.
This is one of the most complicated hooks to use, because it requires perfect understanding of the AST possibilities.
Creating an invalid AST here can cause weird behavior or crashes.
It is best to look at example extensions that use this hook:
Google Stackdriver PHP Debugger Extension: https://github.com/GoogleCloudPlatform/stackdriver-debugger-php-extension/blob/master/stackdriver_debugger_ast.c
A proof-of-concept tracer with AST, based on the Stackdriver extension: https://github.com/beberlei/php-ast-tracer-poc/blob/master/astracer.c