new mmir.grammar.GrammarConverter()
Example
var GrammarConverter = mmir.require('mmirf/grammarConverter');
var gc = new GrammarConverter();
Requires
- module:util/loadFile
- module:util/isArray
- module:positionUtils
Methods
-
addProc(proc, isPrepend)
-
add pre-/post-processing step for running before/after executeGrammar
Name Type Description
proc
ProcessingStep the processing step:
{
  //the name of the processing step
  name: string,
  //OPTIONAL pre-processing function: pre(input: string | Positions, isCalcPos: boolean)
  pre: Function,
  //OPTIONAL post-processing function: post(result: any, pos: Positions)
  post: Function
}
isPrepend
Boolean | Number optional OPTIONAL
if omitted (or FALSY): append proc to the processing steps
if number: insert proc at this index into the processing-steps list
if TRUE: prepend proc to the processing steps
- See:
-
removeProc
getProcIndex
procList
- mmir.grammar.stemmer
Example
//positionUtils:
var posUtil = mmir.require('mmirf/positionUtils');
//stemming function
var stemFunc = ...;
//add stemming function for pre-processing as first step
grammarConverter.addProc({
  name: 'stem',
  pre: posUtil.createWordPosPreProc(stemFunc, this)
}, true);
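The insertion rules for isPrepend can be sketched in isolation as follows. This is a minimal stand-alone illustration and not the library's implementation: insertStep and the plain array steps are hypothetical names, and treating the number 0 as an index (rather than as a FALSY value) is an assumption.

```javascript
// Hypothetical helper illustrating the documented isPrepend rules;
// it is NOT taken from the mmir source.
function insertStep(steps, proc, isPrepend) {
  if (typeof isPrepend === 'number') {
    // number: insert at this index (assumption: 0 counts as an index)
    steps.splice(isPrepend, 0, proc);
  } else if (isPrepend) {
    // TRUE: prepend
    steps.unshift(proc);
  } else {
    // omitted or FALSY: append
    steps.push(proc);
  }
  return steps;
}

var steps = [{name: 'escape'}, {name: 'stopwords'}];
insertStep(steps, {name: 'stem'}, true); // prepend
insertStep(steps, {name: 'log'});        // append
insertStep(steps, {name: 'trim'}, 1);    // insert at index 1
var names = steps.map(function(s) { return s.name; });
// names: ['stem', 'trim', 'escape', 'stopwords', 'log']
```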
-
executeGrammar(text, options, callback){Object}
-
Execute the grammar. NOTE: do not use directly, but use mmir.SemanticInterpreter#interpret instead, since that function applies some pre- and post-processing to the text (stopword removal, en-/decoding of special characters etc.).
Name Type Description
text
String the text String that should be parsed.
options
Object optional additional parsing options (some grammar engines may support further options)
options.debug: BOOLEAN enable printing debug information
options.trace: BOOLEAN | FUNCTION enable printing verbose/tracing information (may not be supported by the grammar engine)
callback
function optional if #isAsyncExec is TRUE, then executeGrammar will have no return value, but instead the result of the grammar execution is delivered by the callback:
function callback(result){ ... }
(see also the description of the return value below)
Returns:
Type Description
Object the result of the grammar execution: {phrase: STRING, phrases: OBJECT[], semantic: OBJECT}
The property phrase contains the text which was matched (with removed stopwords). The property phrases contains the matched TOKENS and UTTERANCES from the JSON definition of the grammar as properties as arrays (e.g. for 1 matched TOKEN "token": {token: ["the matched text"]}). The returned property semantic depends on the JSON definition of the grammar. NOTE: if #isAsyncExec is TRUE, then there will be no return value, but instead the callback is invoked with the return value.
-
HELPER creates a copy of the stopword list and encodes all non-ASCII chars to their unicode representation (e.g. for safe storage of the stringified stopword list, even if the file-encoding does not support non-ASCII letters).
Returns:
Type Description
Array.<String> a copy of the stopword list, from the current JSON grammar (or empty list, if no grammar is present)
-
getGrammarDef(){String}
-
Get grammar definition text. This is the "source code" input for the grammar compiler (i.e. syntax for jison, PEG.js or JS/CC). The grammar definition text is generated from the JSON grammar.
Returns:
Type Description
String the grammar definition in compiler-specific syntax
-
getGrammarSource(){String}
-
Get the compiled JavaScript grammar source code. This is the output of the grammar compiler (with additional JavaScript "framing" in mmir.SemanticInterpreter#createGrammar). This needs to be eval'ed before it can be executed (eval() will add the corresponding executable grammar to SemanticInterpreter).
Returns:
Type Description
String the compiled, JavaScript grammar source code
-
getProcIndex(proc, startIndex){Number}
-
Get the index of a processing step (within procList) by its name. NOTE: if multiple processing steps with the same name exist, the index of the first one is returned.
Name Type Description
proc
String the name of the processing step
startIndex
Number optional OPTIONAL start index for searching (DEFAULT: 0)
- See:
-
addProc
removeProc
procList
Returns:
Type Description
Number the index of the processing step, or -1, if there is no such processing step
-
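The lookup described for getProcIndex can be sketched as a linear search by name. The following is a hedged stand-alone illustration: findStepIndex and procSteps are hypothetical names, and the code is not taken from the library.

```javascript
// Hypothetical stand-alone sketch; findStepIndex is not a library function.
function findStepIndex(steps, name, startIndex) {
  // start searching at startIndex (DEFAULT: 0)
  for (var i = startIndex || 0; i < steps.length; ++i) {
    if (steps[i].name === name) {
      return i; // first match wins
    }
  }
  return -1; // no such processing step
}

var procSteps = [{name: 'stem'}, {name: 'escape'}, {name: 'stem'}];
var first = findStepIndex(procSteps, 'stem');      // 0
var second = findStepIndex(procSteps, 'stem', 1);  // 2
var missing = findStepIndex(procSteps, 'unknown'); // -1
```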
getStopWordsEncRegExpr()
-
FIX for stopwords that start or end with encoded chars (i.e. non-ASCII chars). This RegExp may be NULL/undefined if no stopwords exist that begin/end with encoded chars, i.e. you need to check for NULL before trying to use this RegExp. Usage:
Example
//remove normal stopwords:
var removedStopwordsStr = someStr.replace( gc.getStopWordsRegExpr(), '');
var removedStopwordsStr2 = removedStopwordsStr;
if(gc.getStopWordsEncRegExpr()){
  //NOTE replace stopwords with spaces (not with empty String as above, i.e. with "normal" stopwords)
  removedStopwordsStr2 = removedStopwordsStr.replace( gc.getStopWordsEncRegExpr(), ' ');
}
-
maskAsUnicode(str, computePositions){String|Object}
-
HELPER uses #maskString for encoding non-ASCII chars to their Unicode representation, i.e. \uXXXX where XXXX is the Unicode HEX number. SHORTCUT for calling maskString(str, '\\u', '').
Name Type Description
str
String the string for unicode masking
computePositions
Boolean optional OPTIONAL DEFAULT: false
Returns:
Type Description
String | Object the unicode-masked string, or if computePositions was true, a result object with
{
  text: STRING,   // the masked string
  pos: [POSITION] // array of masking-positions: {i: NUMBER, len: NUMBER, mlen: NUMBER}
}
where POSITION is an object with
{
  i: NUMBER,   // the index within the modified string
  len: NUMBER, // the length before the modification (i.e. of the sub-string that is to be masked)
  mlen: NUMBER // the length after the modification (i.e. of the sub-string that was masked)
}
Example
//for Japanese "下さい" ("please")
maskAsUnicode("下さい") // -> "\u4E0B\u3055\u3044"
//... and using default masking:
maskString("下さい") // -> "~~4E0B~~~~3055~~~~3044~~"
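The \uXXXX encoding shown in the example can be reproduced with plain JavaScript. The sketch below is an assumption about how such an encoding may be done, not the library's code; toUnicodeEscapes is a hypothetical name, and the sketch works per UTF-16 code unit.

```javascript
// Hypothetical sketch: encode each non-ASCII UTF-16 code unit as \uXXXX.
function toUnicodeEscapes(str) {
  return str.replace(/[^\x00-\x7F]/g, function(ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    // left-pad to four HEX digits
    return '\\u' + ('0000' + hex).slice(-4);
  });
}

// the input is the Japanese word from the example above, written with escapes
var escaped = toUnicodeEscapes('\u4E0B\u3055\u3044');
// escaped now holds the literal characters: \u4E0B\u3055\u3044
var untouched = toUnicodeEscapes('please'); // ASCII stays as-is
```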
-
maskString(str, computePositions, prefix, postfix){String|Object}
-
Masks Unicode characters in strings. Unicode characters are masked by replacing them with ~~XXXX~~ where XXXX is the four-digit Unicode HEX number.
NOTE that this function is stable with regard to multiple executions: If the function is invoked on the returned String again, the returned String will be the same / unchanged, i.e. maskings (i.e. "~~XXXX~~") will not be masked again.
NOTE: currently, the masking pattern cannot be escaped, i.e. if the original String contains a substring that matches the masking pattern, it cannot be escaped, so that the unmask-function will leave it untouched.
Name Type Description
str
String the String to process
computePositions
Boolean optional OPTIONAL DEFAULT: false
prefix
String optional OPTIONAL an alternative prefix used for masking, i.e. instead of ~~ (ignored, if argument has other type than string)
postfix
String optional OPTIONAL an alternative postfix used for masking, i.e. instead of ~~ (ignored, if argument has other type than string)
Returns:
Type Description
String | Object the masked string, or if computePositions was true, a result object with
{
  text: STRING,   // the masked string
  pos: [POSITION] // array of masking-positions: {i: NUMBER, len: NUMBER, mlen: NUMBER}
}
where POSITION is an object with
{
  i: NUMBER,   // the index within the modified string
  len: NUMBER, // the length before the modification (i.e. of the sub-string that is to be masked)
  mlen: NUMBER // the length after the modification (i.e. of the sub-string that was masked)
}
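To make the masking pattern and the POSITION fields concrete, here is a minimal sketch under stated assumptions: maskSketch is a hypothetical name, it only supports the default ~~ prefix/postfix, and it interprets i as the index in the resulting (masked) string, as described above. It is not the library's implementation.

```javascript
// Hypothetical sketch of ~~XXXX~~ masking with optional position tracking.
function maskSketch(str, computePositions) {
  var pos = [];
  var offset = 0; // how much earlier maskings have lengthened the string
  var text = str.replace(/[^\x00-\x7F]/g, function(ch, index) {
    var hex = ('0000' + ch.charCodeAt(0).toString(16).toUpperCase()).slice(-4);
    var masked = '~~' + hex + '~~';
    pos.push({i: index + offset, len: ch.length, mlen: masked.length});
    offset += masked.length - ch.length;
    return masked;
  });
  return computePositions ? {text: text, pos: pos} : text;
}

var maskedText = maskSketch('\u4E0B\u3055\u3044');
// maskedText: '~~4E0B~~~~3055~~~~3044~~'
var withPos = maskSketch('a\u4E0Bb', true);
// withPos: {text: 'a~~4E0B~~b', pos: [{i: 1, len: 1, mlen: 8}]}
var again = maskSketch(maskedText); // unchanged: the masked text is pure ASCII
```

The last call illustrates the stability property: since the masked output contains only ASCII characters, masking it again changes nothing.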
-
postproc(procResult, pos, processingSteps)
-
Post-processes the result from the applied grammar:
* un-masks non-ASCII characters
addProc can be used to add additional pre-/post-processing steps.
Name Type Description
procResult
SemanticResult
pos
Positions the position information (i.e. modifications) of the pre-processing steps
processingSteps
Array.<ProcessingStep> optional OPTIONAL if given, use processingSteps instead of (field) procList. NOTE positional argument (i.e. must specify pos too)
- See:
-
addProc
removeProc
getProcIndex
procList
-
preproc(thePhrase, pos, processingSteps){String}
-
Apply pre-processing to the string, before applying the grammar:
* escape (i.e. "mask") non-ASCII characters
* remove stopwords
addProc can be used to add additional pre-/post-processing steps.
Name Type Description
thePhrase
String
pos
PlainObject optional OPTIONAL in/out argument: if given, the pre-processor will add fields with information on how the input string thePhrase was modified. By default, the position information for escaped characters and removed stopwords will be added to
pos.escape (see maskString for more details)
pos.stopwords (see removeStopwords for more details)
And the field pos._order will contain the ordered list of pre-processing steps that were applied, i.e. the entries correspond to the field names, e.g. by default the list would contain ['escape', 'stopwords']
processingSteps
Array.<ProcessingStep> optional OPTIONAL if given, use processingSteps instead of (field) procList. NOTE positional argument (i.e. must specify pos too)
- See:
-
addProc
removeProc
getProcIndex
procList
Returns:
Type Description
String the pre-processed string
-
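How a list of processing steps might fill the pos argument (including pos._order) can be sketched as below. This is an illustration only: runPreSteps and demoSteps are hypothetical, and the convention that a pre function returns {text, pos} when isCalcPos is true is an assumption modelled on the return values documented for maskString.

```javascript
// Hypothetical sketch of a pre-processing pipeline; not the library's code.
function runPreSteps(phrase, steps, pos) {
  var text = phrase;
  if (pos) {
    pos._order = [];
  }
  steps.forEach(function(step) {
    if (typeof step.pre !== 'function') {
      return; // this step has no pre-processing function
    }
    var result = step.pre(text, !!pos);
    if (pos) {
      // assumption: with isCalcPos === true, pre() returns {text, pos}
      pos[step.name] = result.pos;
      pos._order.push(step.name);
      text = result.text;
    } else {
      text = result;
    }
  });
  return text;
}

var demoSteps = [{
  name: 'lower',
  pre: function(input, isCalcPos) {
    var out = input.toLowerCase();
    return isCalcPos ? {text: out, pos: []} : out;
  }
}, {
  name: 'trim',
  pre: function(input, isCalcPos) {
    var out = input.trim();
    return isCalcPos ? {text: out, pos: []} : out;
  }
}];

var info = {};
var processed = runPreSteps('  Hello World ', demoSteps, info);
// processed: 'hello world', info._order: ['lower', 'trim']
```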
recodeJSON(json, recodeFunc, isMaskValues, isMaskNames){Object}
-
Recodes Strings of a JSON-like object.
Name Type Description
json
Object the JSON-like object (i.e. PlainObject)
recodeFunc
function the "recoding" function for modifying String values: must accept a String argument and return a String: String recodeFunc(String). The recodeFunc function is invoked in context of the GrammarConverter object. Example: this.maskString(). See maskString.
isMaskValues
Boolean optional OPTIONAL if true, the object's property String values will be processed. NOTE: in case this parameter is specified, then recodeFunc must also be specified! DEFAULT: uses property maskValues
isMaskNames
Boolean optional OPTIONAL if true, the property names will be processed. NOTE: in case this parameter is specified, then recodeFunc and isMaskValues must also be specified! DEFAULT: uses property maskNames
Returns:
Type Description
Object the recoded JSON object
-
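The effect of isMaskValues and isMaskNames can be sketched with a small recursive function. This is a hedged illustration: recodeSketch and upper are hypothetical, and the sketch returns a new object, whereas the source does not state whether the library recodes in place.

```javascript
// Hypothetical sketch: recursively recode String values and/or property names.
function recodeSketch(json, recodeFunc, isMaskValues, isMaskNames) {
  if (typeof json === 'string') {
    return isMaskValues ? recodeFunc(json) : json;
  }
  if (Array.isArray(json)) {
    return json.map(function(item) {
      return recodeSketch(item, recodeFunc, isMaskValues, isMaskNames);
    });
  }
  if (json && typeof json === 'object') {
    var result = {};
    Object.keys(json).forEach(function(key) {
      var name = isMaskNames ? recodeFunc(key) : key;
      result[name] = recodeSketch(json[key], recodeFunc, isMaskValues, isMaskNames);
    });
    return result;
  }
  return json; // numbers, booleans and null are left unchanged
}

function upper(s) { return s.toUpperCase(); }
var recoded = recodeSketch({token: ['a', 'b'], n: 1}, upper, true, false);
// recoded: {token: ['A', 'B'], n: 1}
var renamed = recodeSketch({a: 'x'}, upper, false, true);
// renamed: {A: 'x'}
```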
removeProc(proc){ProcessingStep}
-
remove a processing step by its index (within procList) or its name. NOTE: if multiple processing steps with the same name exist, the last one is removed.
Name Type Description
proc
Number | String the name or index of the processing step that should be removed
- See:
-
addProc
getProcIndex
procList
Returns:
Type Description
ProcessingStep the removed processing step, or undefined, if there was no matching processing step
-
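The removal rules for removeProc (by index, or by name with the last match winning) can be sketched as follows. removeStep and list are hypothetical names, the id fields exist only to tell the entries apart, and the code is an illustration rather than the library's implementation.

```javascript
// Hypothetical sketch: remove a step by index, or by name (last match).
function removeStep(steps, proc) {
  var index = proc;
  if (typeof proc === 'string') {
    index = -1;
    // search backwards so that the LAST step with this name is found
    for (var i = steps.length - 1; i >= 0; --i) {
      if (steps[i].name === proc) {
        index = i;
        break;
      }
    }
  }
  if (typeof index !== 'number' || index < 0 || index >= steps.length) {
    return undefined; // no matching processing step
  }
  return steps.splice(index, 1)[0];
}

var list = [{name: 'stem', id: 1}, {name: 'escape', id: 2}, {name: 'stem', id: 3}];
var removedByName = removeStep(list, 'stem'); // {name: 'stem', id: 3}
var removedByIndex = removeStep(list, 0);     // {name: 'stem', id: 1}
var none = removeStep(list, 'unknown');       // undefined
```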
removeStopwords(thePhrase, computePositions){String|Object}
-
Name Type Description
thePhrase
String the string from which stopwords are removed (the string is also trim()'ed)
computePositions
Boolean optional OPTIONAL DEFAULT: false
Returns:
Type Description
String | Object the string where stopwords were removed, or if computePositions was true, a result object where the positions at which stopwords were removed will be available as an array:
{
  text: STRING,   // the string with removed stopwords
  pos: [POSITION] // array of positions for removed stopwords: {i: NUMBER, len: NUMBER, mlen: NUMBER}
}
where POSITION is an object with
{
  i: NUMBER,   // the index within the modified string
  len: NUMBER, // the length before the modification (i.e. of the sub-string that is to be masked)
  mlen: NUMBER // the length after the modification (i.e. of the sub-string that was masked)
}
-
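As a rough illustration of stopword removal followed by trimming, the sketch below builds a RegExp from a stopword list. It is an assumption-laden stand-in, not the library's code: stripStopwords is a hypothetical name, position computation is omitted, and the \b word boundary only works reliably for ASCII words.

```javascript
// Hypothetical sketch: remove stopwords via a RegExp, then trim.
function stripStopwords(phrase, stopwords) {
  if (!stopwords.length) {
    return phrase.trim(); // nothing to remove
  }
  var escaped = stopwords.map(function(word) {
    // escape RegExp meta-characters in the stopword
    return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  });
  // NOTE: \b is only reliable for ASCII words
  var re = new RegExp('\\b(?:' + escaped.join('|') + ')\\b\\s*', 'gi');
  return phrase.replace(re, '').trim();
}

var stripped = stripStopwords('please turn on the light', ['please', 'the']);
// stripped: 'turn on light'
```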
protected setGrammarDef(rawGrammarSyntax)
-
Sets the grammar definition text. This function should only be used during compilation of the JSON grammar to the executable grammar. NOTE: Setting this "manually" will have no effect on the executable grammar.
Name Type Description
rawGrammarSyntax
String the grammar definition in compiler-specific syntax
- See:
-
setGrammarFunction(func, isAsnc)
-
Set the executable grammar function. The grammar function takes
a String argument: the text that should be parsed,
a Function argument: the callback for the result:
func(inputText, resultCallback)
where the callback itself takes 1 argument for the result:
callback(result)
The returned result depends on the JSON definition of the grammar.
Name Type Description
func
function the executable grammar function: func(string, object, function(object)) : object
isAsnc
Boolean optional OPTIONAL set to TRUE if execution is done asynchronously. DEFAULT: FALSE
- See:
-
executeGrammar
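The difference between synchronous and asynchronous grammar functions (return value versus callback) can be sketched with a small wrapper. makeExecutor and the two demo grammar functions are hypothetical; the wrapper only mirrors the rule stated for executeGrammar and is not the library's implementation.

```javascript
// Hypothetical wrapper: with isAsync, the result is only delivered via callback.
function makeExecutor(grammarFunc, isAsync) {
  return function(text, options, callback) {
    var result = grammarFunc(text, options, callback);
    return isAsync ? undefined : result;
  };
}

// synchronous grammar function: the result is returned directly
var syncExec = makeExecutor(function(text) {
  return {phrase: text};
}, false);
var syncResult = syncExec('hello');

// "asynchronous" grammar function: the result goes to the callback
// (invoked immediately here, only to keep the demo short)
var asyncResult;
var asyncExec = makeExecutor(function(text, options, cb) {
  cb({phrase: text});
}, true);
var asyncReturn = asyncExec('hello', {}, function(res) {
  asyncResult = res;
});
// asyncReturn is undefined, asyncResult holds {phrase: 'hello'}
```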
-
Unmasks masked Unicode characters in a string. Masked Unicode characters are assumed to have the pattern ~~XXXX~~ where XXXX is the four-digit Unicode HEX number.
NOTE that this function is stable with regard to multiple executions, IF the original String str did not contain a sub-string that conforms to the encoding pattern (see remark for maskString): If the function is invoked on the returned String again, the returned String will be the same, i.e. unchanged.
Name Type Description
str
String
computePositions
Boolean optional OPTIONAL DEFAULT: false
detector
RegExp optional OPTIONAL an alternative detector-RegExp: the RegExp must contain at least one grouping which detects a unicode number (HEX), e.g. default detector is ~~([0-9|A-F|a-f]{4})~~ (note the grouping for detecting a 4-digit HEX number within the brackets).
Returns:
Type Description
String | Object the unmasked string, or if computePositions was true, a result object with
{
  text: STRING,   // the unmasked string
  pos: [POSITION] // array of unmasking-positions: {i: NUMBER, len: NUMBER, mlen: NUMBER}
}
where POSITION is an object with
{
  i: NUMBER,   // the index within the modified string
  len: NUMBER, // the length before the modification (i.e. of the sub-string that is to be unmasked)
  mlen: NUMBER // the length after the modification (i.e. of the sub-string that was unmasked)
}
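The reverse operation can be sketched with a detector RegExp whose first group captures the HEX number. unmaskSketch is a hypothetical name, position computation is omitted, and the sketch assumes that a custom detector carries the global flag; it is not the library's code.

```javascript
// Hypothetical sketch: replace each ~~XXXX~~ by the character with that code.
function unmaskSketch(str, detector) {
  // assumption: a custom detector must use the "g" flag to replace all matches
  var re = detector || /~~([0-9A-Fa-f]{4})~~/g;
  return str.replace(re, function(match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

var restored = unmaskSketch('~~4E0B~~~~3055~~~~3044~~');
// restored equals '\u4E0B\u3055\u3044', i.e. the Japanese word for "please"
var viaEscapes = unmaskSketch('\\u4E0B', /\\u([0-9A-Fa-f]{4})/g);
// viaEscapes equals '\u4E0B'
var unchanged = unmaskSketch(restored); // nothing left to unmask
```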