Friday, July 29, 2011

Efficient Regex Pattern for Getting Hashtags

After digging around the Internet for a while and not finding a regex pattern that could extract all of the hashtags in a String, I finally created my own, based on information I gathered from a few other places.

\B#[a-zA-Z][a-zA-Z0-9]+
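
To make the pattern's behavior concrete, here is a minimal sketch in plain Java rather than Apex, on the assumption that Apex's Pattern and Matcher classes behave like java.util.regex for this pattern. The class name HashtagPatternDemo and the method findHashtags are made up for illustration. The \B requires that the '#' not sit directly after a word character, the first character after '#' must be a letter, and the trailing + means at least one more letter or digit must follow, so a tag needs at least two characters after the '#'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagPatternDemo {

    // Same pattern as above; the backslash is doubled because this is a
    // Java string literal.
    private static final Pattern HASHTAG =
            Pattern.compile("\\B#[a-zA-Z][a-zA-Z0-9]+");

    // Hypothetical helper: returns every match in order of appearance,
    // keeping the original case and any duplicates.
    public static List<String> findHashtags(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Ordinary hashtags, including one with trailing digits.
        System.out.println(findHashtags("Loving #Apex and #regex101 today"));
        // prints [#Apex, #regex101]

        // '#' glued to a word (page#section), a single-character tag (#a),
        // and a tag starting with a digit (#1abc) are all skipped.
        System.out.println(findHashtags("see page#section or #a or #1abc"));
        // prints []

        // Matching stops at the first character outside [a-zA-Z0-9].
        System.out.println(findHashtags("#hello_world"));
        // prints [#hello]
    }
}
```

Note that the last case shows the pattern truncating tags that contain an underscore; whether that is acceptable depends on which hashtag rules you need to follow.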

My sources include the following:

I took this information and created an Apex method in Salesforce that grabs all of the hashtags from a String and returns them in a Set, as shown below.

/**
 * Get the Set of hashtags (including
 * the '#' character) used within a String in
 * all lower case, for ease of comparison.
 *
 * @param  text The String text to analyze.
 * @return      The Set of hashtags
 *              used within the text.
 */
public static Set<String> getHashtagSet(
        String text) {
    
    // Instantiate the resulting set.
    
    Set<String> hashtagSet = new Set<String>();
    
    // Only look for hashtags if text is given.
    
    if (text != null) {
        Pattern hashtagPattern = Pattern.compile(
                '\\B#[a-zA-Z][a-zA-Z0-9]+');
        Matcher hashtagMatcher =
                hashtagPattern.matcher(text);
        
        while (hashtagMatcher.find()) {
            hashtagSet.add(
                hashtagMatcher.group().toLowerCase());
        }   // while (hashtagMatcher.find())
    }   // if (text != null)
    
    // Return the results.
    
    System.debug('hashtagSet = ' + hashtagSet);
    
    return hashtagSet;
}   // public static Set<String> getHashtagSet(String)