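I recently noticed that when htmlparser parses certain web pages, Traditional Chinese characters come out garbled. Looking into it, I found that when you use StringBean, htmlparser picks the decoding charset itself from the page's meta tag. Some sites declare gb2312 as the charset in the meta tag but actually serve content that uses GBK. GB2312 cannot represent Traditional Chinese characters, so they turn into garbage. The fix is simple: GBK is backward compatible with GB2312, so add a check in getCharset() in htmlparser's Page.java that sets ret to GBK whenever it is gb2312, and the problem goes away.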
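The mismatch described above can be reproduced with the JDK alone. The following is a minimal sketch, not part of htmlparser: the class name Gb2312VsGbkDemo and the helper decode are made up for illustration, and it assumes a JDK that ships the GB2312 and GBK charsets (standard full JDKs do).

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of htmlparser.
class Gb2312VsGbkDemo {

    // Decode raw bytes with the named charset, roughly what a parser
    // does after it has trusted the charset declared in the meta tag.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+83EF is the Traditional character for "hua" (Simplified form: U+534E).
        String traditional = "\u83EF";

        // The server actually sends GBK bytes...
        byte[] gbkBytes = traditional.getBytes(Charset.forName("GBK"));

        // ...but the page claims gb2312, so decoding does not round-trip.
        System.out.println("decoded as GB2312 matches: "
            + traditional.equals(decode(gbkBytes, "GB2312")));

        // Decoding the same bytes as GBK recovers the character.
        System.out.println("decoded as GBK matches:    "
            + traditional.equals(decode(gbkBytes, "GBK")));
    }
}
```

Text that really is GB2312 still decodes correctly as GBK, which is why replacing one with the other is safe here.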

The modified code in Page.java (/lexer/Page.java) is as follows:

public String getCharset (String content)
{
    final String CHARSET_STRING = "charset";
    int index;
    String ret;

    if (null == mSource)
        ret = DEFAULT_CHARSET;
    else
        // use existing (possibly supplied) character set:
        // bug #1322686 when illegal charset specified
        ret = mSource.getEncoding ();
    if (null != content)
    {
        index = content.indexOf (CHARSET_STRING);
        if (index != -1)
        {
            content = content.substring (index +
                CHARSET_STRING.length ()).trim ();
            if (content.startsWith ("="))
            {
                content = content.substring (1).trim ();
                index = content.indexOf (";");
                if (index != -1)
                    content = content.substring (0, index);

                //remove any double quotes from around charset string
                if (content.startsWith ("\"") && content.endsWith ("\"")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                //remove any single quote from around charset string
                if (content.startsWith ("'") && content.endsWith ("'")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                ret = findCharset (content, ret);

                // Charset names are not case-sensitive;
                // that is, case is always ignored when comparing
                // charset names.
//                    if (!ret.equalsIgnoreCase (content))
//                    {
//                        System.out.println (
//                            "detected charset \""
//                            + content
//                            + "\", using \""
//                            + ret
//                            + "\"");
//                    }
            }
        }
    }

    if (ret.equalsIgnoreCase ("gb2312")) ret = "GBK"; //to avoid decode problem
    //edited by linyunfan

    return (ret);
}

The following line was added at the end:

if(ret.equalsIgnoreCase("gb2312"))ret="GBK";
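If the same rule is needed somewhere other than Page.java, the check can be pulled into a small helper. This is only a sketch under that assumption: CharsetFix and normalize are hypothetical names, not part of htmlparser, and the helper adds null and whitespace handling that the one-line patch does not have.

```java
// Hypothetical helper, not part of htmlparser.
class CharsetFix {

    // Map a declared charset of gb2312 (any case) to GBK, which is a
    // superset of it; every other name is returned unchanged.
    static String normalize(String charsetName) {
        if (charsetName != null && charsetName.trim().equalsIgnoreCase("gb2312")) {
            return "GBK";
        }
        return charsetName;
    }

    public static void main(String[] args) {
        System.out.println(normalize("gb2312")); // mapped to GBK
        System.out.println(normalize("UTF-8"));  // left as-is
    }
}
```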


Posted on 2008-10-09 13:33 by 华梦行
