Rule-based search

Overview

Camphr provides three rule-based matching pipelines: PatternSearcher, RegexRuler, and MultipleRegexRuler. These pipelines are character-based: they match against the raw text rather than against tokens, so they do not depend on how the tokenizer splits the text. This makes them more robust to tokenization differences, but potentially more prone to false positives, than spaCy's token-based pipelines Matcher and PhraseMatcher.
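
The trade-off between character-based and token-based matching can be illustrated with the standard library alone. The following is a minimal sketch, not Camphr's or spaCy's implementation: it assumes whitespace splitting as a stand-in for a tokenizer (spaCy's real tokenizer is more sophisticated), and the example texts are made up for illustration.

```python
import re

# Case 1: robustness. The number is glued to surrounding punctuation.
text = "Call me (tel:012-2345-6666)."
phone = r"\d[\d-]*\d"

# Character-based: search the raw string, ignoring token boundaries.
char_matches = [m.group() for m in re.finditer(phone, text)]

# Token-based (simplified assumption): split on whitespace and keep only
# tokens that are, in their entirety, a phone number.
token_matches = [t for t in text.split() if re.fullmatch(phone, t)]

print(char_matches)   # ['012-2345-6666']
print(token_matches)  # []

# Case 2: false positives. "art" also occurs inside "started".
text2 = "I started an art class"
char_hits = len(re.findall("art", text2))   # matches inside words too
token_hits = text2.split().count("art")     # whole tokens only

print(char_hits, token_hits)  # 2 1
```

The first case shows why a character-based search can find a span that a tokenizer failed to isolate; the second shows the cost, since substrings of unrelated words also match.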

Usage: RegexRuler

Create a pipe

>>> import spacy
>>> from camphr.pipelines import RegexRuler
>>> nlp = spacy.blank("en")
>>> pattern = r"[\d-]+"
>>> pipe = RegexRuler(pattern, label="PHONE_NUMBER")
>>> nlp.add_pipe(pipe)

Parse a text

>>> text = "My phone number is 012-2345-6666"
>>> doc = nlp(text)
>>> print(doc.ents)
(012-2345-6666,)
>>> print(doc.ents[0].label_)
PHONE_NUMBER
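
Note that the pattern `[\d-]+` is deliberately simple and matches any run of digits and hyphens. The sketch below uses only Python's `re` module (not RegexRuler itself) to show what such a pattern picks up in a noisier text; the stricter pattern and the sample sentence are hypothetical and only one of many possible phone-number formats.

```python
import re

loose = re.compile(r"[\d-]+")
# Hypothetical stricter format: three hyphen-separated digit groups,
# the last one exactly four digits long.
strict = re.compile(r"\d{2,4}-\d{2,4}-\d{4}")

text = "Order 42 - call 012-2345-6666 before 2020-01-31"

print(loose.findall(text))   # ['42', '-', '012-2345-6666', '2020-01-31']
print(strict.findall(text))  # ['012-2345-6666']
```

If the loose pattern yields too many false positives on your data, tightening the regex before passing it to RegexRuler is probably the simplest remedy.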

Usage: MultipleRegexRuler

MultipleRegexRuler applies several patterns at once. Pass it a dictionary that maps each label to its regex pattern.

Create a pipe

>>> import spacy
>>> from camphr.pipelines import MultipleRegexRuler
>>> nlp = spacy.blank("en")
>>> patterns = {"PHONE_NUMBER": r"[\d-]+", "EMAIL": r"[\w.]+@[\w.]+"}
>>> pipe = MultipleRegexRuler(patterns)
>>> nlp.add_pipe(pipe)

Parse a text

>>> text = "Phone: 012-2345-6666, email: bob@foomail.com"
>>> doc = nlp(text)
>>> print(doc.ents)
(012-2345-6666, bob@foomail.com)
>>> print([e.label_ for e in doc.ents])
['PHONE_NUMBER', 'EMAIL']
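
To make the label-to-pattern mapping concrete, here is a rough sketch of the idea using only Python's `re` module. It is an illustration under stated assumptions, not MultipleRegexRuler's actual implementation: the helper name `find_labeled_spans` is hypothetical, and overlapping matches are not resolved.

```python
import re
from typing import Dict, List, Tuple


def find_labeled_spans(text: str, patterns: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) character spans for every pattern match."""
    spans = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append((m.start(), m.end(), label))
    # Sort by position so the result follows the order of the text.
    return sorted(spans)


patterns = {"PHONE_NUMBER": r"[\d-]+", "EMAIL": r"[\w.]+@[\w.]+"}
text = "Phone: 012-2345-6666, email: bob@foomail.com"

for start, end, label in find_labeled_spans(text, patterns):
    print(text[start:end], label)
# 012-2345-6666 PHONE_NUMBER
# bob@foomail.com EMAIL
```

Because spaCy does not allow overlapping spans in `doc.ents`, patterns that can match the same characters would need some conflict-resolution step that this sketch leaves out.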

Usage: PatternSearcher

PatternSearcher is useful when you want to look up words from a large dictionary; it is built on pyahocorasick. It searches at the character level, whereas spaCy's similar pipeline, PhraseMatcher, is a token-based searcher.

Create a pipe

>>> import spacy
>>> from camphr.pipelines import PatternSearcher
>>> nlp = spacy.blank("en")
>>> pipe = PatternSearcher.from_words(["text", "pattern searcher"]) # add words
>>> nlp.add_pipe(pipe)

Parse a text

>>> text = "This is a test text for pattern searcher."
>>> doc = nlp(text)
>>> doc.ents
(text, pattern searcher)
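
For readers curious why a large dictionary stays cheap to search, the following is a compact pure-Python sketch of the Aho-Corasick algorithm, the technique that pyahocorasick implements. It is not Camphr's or pyahocorasick's code; the function names `build_automaton` and `search` are hypothetical and chosen only for this illustration.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links over the given words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Scan text once and return (start, end, word) for every match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, i + 1, word))
    return matches


automaton = build_automaton(["text", "pattern searcher"])
text = "This is a test text for pattern searcher."
matches = search(text, automaton)
print([text[s:e] for s, e, _ in matches])  # ['text', 'pattern searcher']
```

The text is scanned once regardless of how many words the dictionary holds, which is what makes this approach attractive for large word lists.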
