public class DaitchMokotoffSoundex extends java.lang.Object implements StringEncoder
The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It matches surnames with similar pronunciation but different spellings more accurately, especially Slavic and Yiddish surnames.
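The main differences compared to the other soundex variants are:
- coded names are 6 digits long
- the initial character of the name is coded
- rules to encode multi-character n-grams
- multiple possible encodings for the same name (branching)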
This implementation supports branching, depending on the method used:
- encode(String): branching disabled, only the first code is returned
- soundex(String): branching enabled, all codes are returned, separated by '|'
Note: this implementation has additional branching rules compared to the original description of the algorithm. The rules can be customized by overriding the default rules contained in the resource file org/apache/commons/codec/language/dmrules.txt.
This class is thread-safe.
See also: Soundex, Wikipedia - Daitch-Mokotoff Soundex, Avotaynu - Soundexing and Genealogy
Modifier and Type | Class and Description |
---|---|
private static class | DaitchMokotoffSoundex.Branch: Inner class representing a branch during DM soundex encoding. |
private static class | DaitchMokotoffSoundex.Rule: Inner class for storing rules. |
Modifier and Type | Field and Description |
---|---|
private static java.lang.String | COMMENT |
private static java.lang.String | DOUBLE_QUOTE |
private boolean | folding: Whether to use ASCII folding prior to encoding. |
private static java.util.Map<java.lang.Character,java.lang.Character> | FOLDINGS: Folding rules. |
private static int | MAX_LENGTH: The code length of a DM soundex value. |
private static java.lang.String | MULTILINE_COMMENT_END |
private static java.lang.String | MULTILINE_COMMENT_START |
private static java.lang.String | RESOURCE_FILE: The resource file containing the replacement and folding rules. |
private static java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> | RULES: Transformation rules indexed by the first character of their pattern. |
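The RULES field is described as a map from the first character of a rule pattern to the rules that start with it. The following is a minimal sketch of that indexing idea only. The `Rule` class here is a hypothetical stand-in that models nothing but the pattern, the patterns in `main` are illustrative, and the ordering of rules inside a bucket is an assumption (insertion order), since this page does not specify it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: not the library's private Rule class or its loading code.
final class RuleIndexSketch {

    // Hypothetical rule holder; only the pattern is modeled.
    static final class Rule {
        final String pattern;

        Rule(String pattern) {
            this.pattern = pattern;
        }
    }

    // Groups rules by the first character of their pattern.
    static Map<Character, List<Rule>> index(List<String> patterns) {
        Map<Character, List<Rule>> rules = new HashMap<>();
        for (String pattern : patterns) {
            Character first = pattern.charAt(0);
            rules.computeIfAbsent(first, key -> new ArrayList<>()).add(new Rule(pattern));
        }
        return rules;
    }

    // Returns the rules whose pattern starts with ch, or an empty list.
    static List<Rule> candidates(Map<Character, List<Rule>> rules, char ch) {
        List<Rule> bucket = rules.get(ch);
        return bucket == null ? Collections.<Rule>emptyList() : bucket;
    }

    public static void main(String[] args) {
        // Illustrative patterns, not taken from the resource file.
        Map<Character, List<Rule>> rules = index(Arrays.asList("sch", "s", "a"));
        System.out.println(candidates(rules, 's').size());
    }
}
```

Indexing by first character means an encoder only has to test the rules in one bucket at each input position instead of scanning every rule.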
Constructor and Description |
---|
DaitchMokotoffSoundex(): Creates a new instance with ASCII-folding enabled. |
DaitchMokotoffSoundex(boolean folding): Creates a new instance. |
Modifier and Type | Method and Description |
---|---|
private java.lang.String | cleanup(java.lang.String input): Performs a cleanup of the input string before the actual soundex transformation. |
java.lang.Object | encode(java.lang.Object obj): Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching. |
java.lang.String | encode(java.lang.String source): Encodes a String using the Daitch-Mokotoff soundex algorithm without branching. |
private static void | parseRules(java.util.Scanner scanner, java.lang.String location, java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping, java.util.Map<java.lang.Character,java.lang.Character> asciiFoldings) |
java.lang.String | soundex(java.lang.String source): Encodes a String using the Daitch-Mokotoff soundex algorithm with branching. |
private java.lang.String[] | soundex(java.lang.String source, boolean branching): Perform the actual DM Soundex algorithm on the input string. |
private static java.lang.String | stripQuotes(java.lang.String str) |
private static final java.lang.String COMMENT
private static final java.lang.String DOUBLE_QUOTE
private static final java.lang.String MULTILINE_COMMENT_END
private static final java.lang.String MULTILINE_COMMENT_START
private static final java.lang.String RESOURCE_FILE
private static final int MAX_LENGTH
private static final java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> RULES
private static final java.util.Map<java.lang.Character,java.lang.Character> FOLDINGS
private final boolean folding
public DaitchMokotoffSoundex()
public DaitchMokotoffSoundex(boolean folding)
With ASCII-folding enabled, certain accented characters will be transformed to equivalent ASCII characters, e.g. è -> e.
folding
- if ASCII-folding shall be performed before encoding
private static void parseRules(java.util.Scanner scanner, java.lang.String location, java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping, java.util.Map<java.lang.Character,java.lang.Character> asciiFoldings)
private static java.lang.String stripQuotes(java.lang.String str)
private java.lang.String cleanup(java.lang.String input)
Removes all whitespace characters and performs ASCII folding if enabled.
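The behavior described for cleanup can be sketched as follows. This is an illustration under stated assumptions, not the library's private method: the folding table is a hypothetical two-entry map (the real entries come from the resource file), the `folding` flag is passed as a parameter rather than read from a field, and the sketch does nothing beyond the two documented steps.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: mirrors the documented steps, not the actual implementation.
final class CleanupSketch {

    // Illustrative folding table; the entries are assumptions for this example.
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();

    static {
        FOLDINGS.put('\u00E8', 'e'); // e with grave accent -> e
        FOLDINGS.put('\u00E9', 'e'); // e with acute accent -> e
    }

    // Drops whitespace and, if folding is enabled, replaces mapped characters.
    static String cleanup(String input, boolean folding) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char ch : input.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                continue; // whitespace never reaches the encoder
            }
            if (folding && FOLDINGS.containsKey(ch)) {
                ch = FOLDINGS.get(ch);
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanup(" caf\u00E9 cr\u00E8me ", true));
    }
}
```

With folding disabled, accented characters pass through unchanged, so whether they can be encoded then depends on the rule set.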
input
- the input string to clean up
public java.lang.Object encode(java.lang.Object obj) throws EncoderException
This method is provided in order to satisfy the requirements of the Encoder interface, and will throw an EncoderException if the supplied object is not of type java.lang.String.
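The contract described above (accept an Object, reject anything that is not a String, otherwise delegate to the String overload) can be sketched as below. To keep the example self-contained, `SketchEncoderException` is a hypothetical stand-in for the library's EncoderException, and the String overload is a placeholder that only upper-cases its input rather than producing a soundex code.

```java
import java.util.Locale;

// Sketch only: shows the type check and delegation, not the real encoder.
final class EncodeObjectSketch {

    // Hypothetical stand-in for EncoderException so the sketch compiles alone.
    static final class SketchEncoderException extends Exception {
        SketchEncoderException(String message) {
            super(message);
        }
    }

    // Placeholder for the String-based encoding (assumption: upper-casing only).
    static String encode(String source) {
        return source == null ? null : source.toUpperCase(Locale.ROOT);
    }

    // Rejects non-String input, then delegates to the String overload.
    static Object encode(Object obj) throws SketchEncoderException {
        if (!(obj instanceof String)) {
            throw new SketchEncoderException("Parameter must be of type java.lang.String");
        }
        return encode((String) obj);
    }

    public static void main(String[] args) throws SketchEncoderException {
        Object input = "auerbach";
        System.out.println(encode(input));
    }
}
```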
Specified by: encode in interface Encoder
obj
- Object to encode
EncoderException
- if the parameter supplied is not of type java.lang.String
java.lang.IllegalArgumentException
- if a character is not mapped
See also: soundex(String)
public java.lang.String encode(java.lang.String source)
Specified by: encode in interface StringEncoder
source
- A String object to encode
java.lang.IllegalArgumentException
- if a character is not mapped
See also: soundex(String)
public java.lang.String soundex(java.lang.String source)
In case a string is encoded into multiple codes (see branching rules), the result will contain all codes, separated by '|'.
Example: the name "AUERBACH" is encoded as both 097400 and 097500.
Thus the result will be "097400|097500".
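Callers that need the individual codes can split the '|'-separated result themselves. The helper below is a hypothetical sketch that is not part of this class; it operates only on the string format described above. Treating the first code as the one the non-branching encode(String) would return is an assumption based on the class description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: works on the '|'-separated format of soundex(String).
final class BranchCodes {

    // Splits a branching result such as "097400|097500" into its codes.
    static List<String> allCodes(String soundexResult) {
        // '|' is a regex metacharacter, so it has to be escaped for split().
        return Arrays.asList(soundexResult.split("\\|"));
    }

    // Returns only the first code of a branching result.
    static String firstCode(String soundexResult) {
        int separator = soundexResult.indexOf('|');
        return separator < 0 ? soundexResult : soundexResult.substring(0, separator);
    }

    public static void main(String[] args) {
        String result = "097400|097500"; // value from the AUERBACH example
        System.out.println(allCodes(result));
        System.out.println(firstCode(result));
    }
}
```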
source
- A String object to encode
java.lang.IllegalArgumentException
- if a character is not mapped
private java.lang.String[] soundex(java.lang.String source, boolean branching)
source
- A String object to encode
branching
- If branching shall be performed