- Jsoup has many features, such as parsing HTML documents, searching inside the DOM, manipulating DOM elements, and cleaning untrusted HTML.
- Jsoup provides a whitelist feature for sanitizing (cleaning) HTML.
- A whitelist defines which tags and attributes are allowed to pass through cleaning; everything else is discarded.
- Download Link of jar (jsoup-1.7.1.jar) :-
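Methods in Whitelist:-
- addAttributes(String tag, String... keys) : Allows the listed attributes on the given tag. The tag itself is also added to the allowed tags if it is not already there.
- none() : Only text nodes are allowed. All HTML tags are removed.
- simpleText() : Allows only simple text formatting tags: b, em, i, strong, u.
- basic() : Allows the tags a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul. Links (a href) may only use the http, https, ftp and mailto protocols, and rel=nofollow is enforced on them. Images are not allowed.
- basicWithImages() : Same as basic, plus img tags whose src uses http or https.
- relaxed() : Allows a full range of text and structural body HTML elements (for example headings, tables, div and img). It does not enforce rel=nofollow on links.
- addProtocols(String tag, String key, String... protocols) : Restricts a URL-valued attribute (key) of the given tag to the listed protocols. Attribute values that use any other protocol are removed.
- preserveRelativeLinks(boolean preserve) : When true, relative links are kept as written instead of being converted to absolute URLs. A base URI must be supplied when cleaning so that the protocol of a relative link can be checked.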
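As a rough illustration of how the methods above combine, the sketch below builds a custom whitelist that allows only links. The class name CustomWhitelistSketch, the helper cleanLinks and the example.com URLs are made up for this example. It assumes a jsoup version that still ships org.jsoup.safety.Whitelist, such as the 1.7.1 jar mentioned above (newer jsoup releases rename the class to Safelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CustomWhitelistSketch {

    // Hypothetical helper: keeps only <a> tags with an http/https href.
    static String cleanLinks(String html) {
        Whitelist linksOnly = new Whitelist()
                .addTags("a")                               // allow the tag itself
                .addAttributes("a", "href")                 // allow only the href attribute
                .addProtocols("a", "href", "http", "https") // href must resolve to http or https
                .preserveRelativeLinks(true);               // keep relative hrefs as written

        // The base URI (an assumed value here) lets jsoup resolve relative
        // links so that their protocol can be checked.
        return Jsoup.clean(html, "http://example.com/", linksOnly);
    }

    public static void main(String[] args) {
        String html = "<a href='https://example.com' onclick='x()'>ok</a>"
                + "<a href='javascript:x()'>bad</a>"
                + "<a href='/about'>relative</a>";
        // The https and relative hrefs should survive; the onclick attribute
        // and the javascript: href should be removed.
        System.out.println(cleanLinks(html));
    }
}
```

Calling addTags together with addAttributes is slightly redundant, but it makes the intent explicit: addProtocols on its own only restricts attribute values and does not allow the tag.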
Project Structure:-
Testing WhiteList :-
- The JSoupWhiteListDemo.java file is as follows:
package com.sandeep.jsoup.whitelist;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class JSoupWhiteListDemo {
    public static void main(String[] args) {
        /* sample input: a title, emphasised text, an image with a relative src, and a link */
        String inputString = "<title>My Page</title><ul><li><em>Sandeep</em></li><li><em>Surabhi</em></li>"
                + "<li><img src='mySnamp.png'></li><li><a href='https://loremipsumdollar.com'>click me</a></li></ul>";

        /* for simpleText method */
        String outputString = Jsoup.clean(inputString, Whitelist.simpleText());
        System.out.println("SIMPLETEXT OUTPUT : " + outputString);

        /* for basic method */
        outputString = Jsoup.clean(inputString, Whitelist.basic());
        System.out.println("BASIC OUTPUT : " + outputString);

        /* for basicWithImages method */
        outputString = Jsoup.clean(inputString, Whitelist.basicWithImages());
        System.out.println("BASICWITHIMAGES OUTPUT : " + outputString);

        /* for none method */
        outputString = Jsoup.clean(inputString, Whitelist.none());
        System.out.println("NONE OUTPUT : " + outputString);

        /* for relaxed method */
        outputString = Jsoup.clean(inputString, Whitelist.relaxed());
        System.out.println("RELAXED OUTPUT : " + outputString);

        /* for addAttributes method: allows the img tag with its src attribute */
        Whitelist customwhitelist1 = new Whitelist();
        customwhitelist1.addAttributes("img", "src");
        outputString = Jsoup.clean(inputString, customwhitelist1);
        System.out.println("ADDATRIBUTE OUTPUT : " + outputString);

        /* for addProtocols method: this only restricts href values; the a tag
           itself is not allowed in this whitelist, so links are still removed */
        Whitelist customwhitelist2 = new Whitelist();
        customwhitelist2.addProtocols("a", "href", "ftp", "http");
        outputString = Jsoup.clean(inputString, customwhitelist2);
        System.out.println("addProtocols OUTPUT : " + outputString);
    }
}
Output:-
SIMPLETEXT OUTPUT : My Page
<em>Sandeep</em>
<em>Surabhi</em>click me
BASIC OUTPUT : My Page
<ul>
<li><em>Sandeep</em></li>
<li><em>Surabhi</em></li>
<li></li>
<li><a href="https://loremipsumdollar.com" rel="nofollow">click me</a></li>
</ul>
BASICWITHIMAGES OUTPUT : My Page
<ul>
<li><em>Sandeep</em></li>
<li><em>Surabhi</em></li>
<li><img /></li>
<li><a href="https://loremipsumdollar.com" rel="nofollow">click me</a></li>
</ul>
NONE OUTPUT : My PageSandeepSurabhiclick me
RELAXED OUTPUT : My Page
<ul>
<li><em>Sandeep</em></li>
<li><em>Surabhi</em></li>
<li><img /></li>
<li><a href="https://loremipsumdollar.com">click me</a></li>
</ul>
ADDATRIBUTE OUTPUT : My PageSandeepSurabhi
<img src="mySnamp.png" />click me
addProtocols OUTPUT : My PageSandeepSurabhiclick me
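In the BASICWITHIMAGES and RELAXED output above, the image comes out as <img /> without its src. The likely reason is that the relative value mySnamp.png cannot be matched against the allowed http/https protocols when no base URI is given. The plain-Java sketch below is only a simplified model of that kind of check, not jsoup's actual code; the class ProtocolCheckSketch, the helper hasAllowedProtocol and the example.com base URI are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

class ProtocolCheckSketch {

    // Hypothetical helper: resolves value against baseUri (if any) and
    // reports whether the resulting URL uses one of the allowed schemes.
    static boolean hasAllowedProtocol(String value, String baseUri, List<String> allowed) {
        try {
            URI resolved = baseUri.isEmpty()
                    ? new URI(value)
                    : new URI(baseUri).resolve(value);
            String scheme = resolved.getScheme();
            // A relative URL with no base has no scheme, so it is rejected.
            return scheme != null && allowed.contains(scheme.toLowerCase());
        } catch (URISyntaxException | IllegalArgumentException e) {
            return false; // unparsable values are treated as not allowed
        }
    }

    public static void main(String[] args) {
        List<String> allowed = Arrays.asList("http", "https");
        // Relative src without a base URI: no scheme to check
        System.out.println(hasAllowedProtocol("mySnamp.png", "", allowed));
        // Same src with an assumed base URI: resolves to an http URL
        System.out.println(hasAllowedProtocol("mySnamp.png", "http://example.com/", allowed));
        // Absolute https link: scheme is on the list
        System.out.println(hasAllowedProtocol("https://loremipsumdollar.com", "", allowed));
    }
}
```

In jsoup itself, the overload Jsoup.clean(String bodyHtml, String baseUri, Whitelist whitelist) accepts a base URI, which should let relative src and href values be resolved before the protocol check.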