Rule-based search
Overview
Camphr provides three rule-based matching pipelines: PatternSearcher, RegexRuler, and MultipleRegexRuler.
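These pipelines are character-based: they match against the raw text rather than against tokens. This makes them more robust, since a match does not depend on where the tokenizer places token boundaries, but they can be more susceptible to false positives than spaCy's token-based pipelines, Matcher and PhraseMatcher.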
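The trade-off can be illustrated with the standard library alone. The following is a minimal sketch, not Camphr or spaCy code: the helper names `char_based` and `token_based` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
import re

PATTERN = re.compile(r"[\d-]+")


def char_based(text):
    """Search the raw string, ignoring token boundaries."""
    return [m.group() for m in PATTERN.finditer(text)]


def token_based(text):
    """Keep only whitespace-separated tokens that match the pattern entirely.

    Whitespace splitting is an assumption made for illustration;
    spaCy's tokenizer is more sophisticated.
    """
    return [tok for tok in text.split() if PATTERN.fullmatch(tok)]


# Robustness: the number is glued to the next word, so no token equals it.
glued = "Call 012-2345-6666now"
print(char_based(glued))   # ['012-2345-6666']
print(token_based(glued))  # []

# False positive: the hyphen inside a word also satisfies the pattern.
noisy = "a well-known number"
print(char_based(noisy))   # ['-']
print(token_based(noisy))  # []
```

In the first text the character-based search still finds the number, while in the second it reports a lone hyphen that a token-level match would have rejected.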
Usage: RegexRuler
Create a pipe
>>> import spacy
>>> from camphr.pipelines import RegexRuler
>>> nlp = spacy.blank("en")
>>> pattern = r"[\d-]+"
>>> pipe = RegexRuler(pattern, label="PHONE_NUMBER")
>>> nlp.add_pipe(pipe)
Parse a text
>>> text = "My phone number is 012-2345-6666"
>>> doc = nlp(text)
>>> print(doc.ents)
(012-2345-6666,)
>>> print(doc.ents[0].label_)
PHONE_NUMBER
Usage: MultipleRegexRuler
To use several patterns in one pipe, pass MultipleRegexRuler a dictionary that maps each entity label to its regex pattern.
Create a pipe
>>> import spacy
>>> from camphr.pipelines import MultipleRegexRuler
>>> nlp = spacy.blank("en")
>>> patterns = {"PHONE_NUMBER": r"[\d-]+", "EMAIL": r"[\w.]+@[\w.]+"}
>>> pipe = MultipleRegexRuler(patterns)
>>> nlp.add_pipe(pipe)
Parse a text
>>> text = "Phone: 012-2345-6666, email: bob@foomail.com"
>>> doc = nlp(text)
>>> print(doc.ents)
(012-2345-6666, bob@foomail.com)
>>> print([e.label_ for e in doc.ents])
['PHONE_NUMBER', 'EMAIL']
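To see what a label-to-pattern dictionary amounts to at the character level, here is a rough standard-library sketch. It is not Camphr's implementation: `find_labelled_spans` is a hypothetical helper, and the overlap rule (earliest start wins, then the longest match) is an assumption chosen only because spaCy does not allow overlapping spans in `doc.ents`.

```python
import re


def find_labelled_spans(text, patterns):
    """Return non-overlapping (start, end, label) character spans."""
    candidates = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            candidates.append((m.start(), m.end(), label))

    # Assumed tie-breaking: earliest start first, then the longest span.
    candidates.sort(key=lambda s: (s[0], -(s[1] - s[0])))

    spans, last_end = [], 0
    for start, end, label in candidates:
        if start >= last_end:  # skip anything overlapping a kept span
            spans.append((start, end, label))
            last_end = end
    return spans


patterns = {"PHONE_NUMBER": r"[\d-]+", "EMAIL": r"[\w.]+@[\w.]+"}
text = "Phone: 012-2345-6666, email: bob@foomail.com"
for start, end, label in find_labelled_spans(text, patterns):
    print(text[start:end], label)
# 012-2345-6666 PHONE_NUMBER
# bob@foomail.com EMAIL
```

Working with character offsets keeps the search independent of tokenization; mapping the offsets back onto tokens is left to the pipeline.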
Usage: PatternSearcher
PatternSearcher is useful when you want to look up words from a large dictionary; it relies on pyahocorasick for the search.
This pipeline matches on characters, whereas spaCy's similar pipeline, PhraseMatcher, matches on tokens.
Create a pipe
>>> import spacy
>>> from camphr.pipelines import PatternSearcher  # import path assumed to match the rulers above
>>> nlp = spacy.blank("en")
>>> pipe = PatternSearcher.from_words(["text", "pattern searcher"])  # add words
>>> nlp.add_pipe(pipe)
Parse a text
>>> text = "This is a test text for pattern searcher."
>>> doc = nlp(text)
>>> doc.ents
(text, pattern searcher)
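The character-based dictionary lookup can be imitated with plain string search. The sketch below is only an illustration: `find_words` is a hypothetical helper that rescans the text once per word, which is not the Aho-Corasick algorithm that pyahocorasick implements (that algorithm finds all words in a single pass, which is what makes large dictionaries practical).

```python
def find_words(text, words):
    """Return sorted (start, end, word) for every occurrence of every word."""
    hits = []
    for word in words:
        start = text.find(word)
        while start != -1:
            hits.append((start, start + len(word), word))
            start = text.find(word, start + 1)
    return sorted(hits)


words = ["text", "pattern searcher"]

hits = find_words("This is a test text for pattern searcher.", words)
print([w for _, _, w in hits])  # ['text', 'pattern searcher']

# Character-based matching also fires inside longer words.
hits = find_words("The context matters.", words)
print([w for _, _, w in hits])  # ['text']
```

The second call shows the kind of false positive that character-based search can produce: "text" is found inside "context", which an exact token-level match would not report.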