Need help getting the HTML of a website in Java

Published: 2020-12-15 08:34:52 Category: Java


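I took some code from the question "java httpurlconnection cutting off html", and I am using nearly the same code to fetch HTML from websites in Java.
It works, except for one particular website.

I want to fetch the HTML of this page:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting garbage characters, even though the same code works fine with any other site, such as http://www.google.com.

Here is the code I'm using: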

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work for the URL mentioned above.

Any help would be greatly appreciated.

Solution

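That site incorrectly gzips the response regardless of what the client supports. Normally, a server should gzip the response only when the client has signalled support for it (by sending Accept-Encoding: gzip). You need to decompress the body with GZIPInputStream.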

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()),"UTF-8"));
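
Wrapping the stream unconditionally will fail on responses that are not compressed, so a slightly more defensive variant checks the Content-Encoding header first. The following is only a minimal sketch: the class name GzipAwareReader and the method readBody are hypothetical names introduced for illustration, and it assumes the server actually labels its compressed body with Content-Encoding: gzip.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the original question or answer.
class GzipAwareReader {

    /**
     * Reads the whole body as text. The stream is decompressed first
     * only when the Content-Encoding header value is "gzip".
     */
    static String readBody(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        // equalsIgnoreCase(null) is false, so a missing header means "not compressed".
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}
```

With the connection from the question, this could be called as GzipAwareReader.readBody(connection.getInputStream(), connection.getContentEncoding(), StandardCharsets.UTF_8). If a server compresses without setting the header at all, this check will not catch it and the unconditional GZIPInputStream wrapper shown above is still needed.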

Note that I also passed the correct charset to the InputStreamReader constructor. Normally, you would extract it from the response's Content-Type header.
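
Extracting the charset from a Content-Type value such as "text/html; charset=UTF-8" could look roughly like this. It is a sketch under assumptions: ContentTypeCharset and charsetOf are hypothetical names, the parsing is deliberately simple rather than a full header parser, and the fallback charset is whatever the caller considers a sensible default.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original answer.
class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value,
     * or the fallback when none is present or it is not supported.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

The header value itself is available from connection.getContentType(), so the result can be passed straight to the InputStreamReader constructor in place of the hard-coded "UTF-8".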

For more tips, see also "How to use URLConnection to fire and handle HTTP requests?". If all you want is to parse or extract information from the HTML, I strongly recommend using an HTML parser such as Jsoup.

(Editor: 李大同)
