CrawlScript/WebCollector issues and pull requests

#137 - refactor: design and implementation smells

Pull Request - State: open - Opened by bhavya844 over 1 year ago

#136 - A visited page returns a 502 error but still needs to be crawled; calling ExceptionUtils.fail(e) on the visit exception does not help. How can this be solved?

Issue - State: open - Opened by Amnesiabht almost 2 years ago - 1 comment

#134 - Create TestNews.java

Pull Request - State: closed - Opened by HiIamHiep over 2 years ago

#133 - Bump jsoup from 1.11.3 to 1.15.3

Pull Request - State: open - Opened by dependabot[bot] almost 3 years ago
Labels: dependencies

#132 - Inefficient code detected in RegexRule.java

Issue - State: open - Opened by linci8210 almost 3 years ago - 1 comment

#131 - Bump mysql-connector-java from 5.1.46 to 8.0.28

Pull Request - State: open - Opened by dependabot[bot] about 3 years ago
Labels: dependencies

#130 - Content returned by ContentExtractor.getContentByUrl has no blank lines or other formatting

Issue - State: open - Opened by AmberYang678 about 3 years ago - 1 comment

#129 - Bump gson from 2.8.5 to 2.8.9

Pull Request - State: open - Opened by dependabot[bot] about 3 years ago
Labels: dependencies

#128 - Bump jsoup from 1.11.3 to 1.14.2

Pull Request - State: closed - Opened by dependabot[bot] almost 4 years ago - 1 comment
Labels: dependencies

#127 - Bug in the automatic news time detection

Issue - State: closed - Opened by KTsama over 4 years ago - 1 comment

#126 - Can't join any of the official chat groups; they all say they are full

Issue - State: closed - Opened by jiangqiang1996 over 4 years ago - 1 comment

#125 - How is the accuracy in the paper calculated?

Issue - State: open - Opened by fubicheng208 over 4 years ago

#124 - How should a 307 response on a link be handled?

Issue - State: open - Opened by nikesb23 over 4 years ago

#123 - Bump junit from 4.12 to 4.13.1

Pull Request - State: open - Opened by dependabot[bot] almost 5 years ago
Labels: dependencies

#122 - Bump mysql-connector-java from 5.1.46 to 8.0.16

Pull Request - State: closed - Opened by dependabot[bot] about 5 years ago - 1 comment
Labels: dependencies

#121 - Remove logs

Pull Request - State: open - Opened by wangqifan over 5 years ago - 1 comment

#120 - Out of memory problem

Issue - State: open - Opened by wangqifan over 5 years ago

#118 - Should the hour part of the time-extraction regex be changed to [0-9]?

Issue - State: open - Opened by bigzhouj over 5 years ago

#117 - RocksDBException when running the CSDN crawling example code: Failed to create a directory: C:\code\weibocrawler\crawl\crawldb: The system cannot find the specified…

Issue - State: open - Opened by jack13163 over 5 years ago - 3 comments

#116 - The computeInfo function in ContentExtractor can throw StackOverflowError

Issue - State: open - Opened by yanpeng over 5 years ago - 3 comments

#115 - Error when running the CSDN blog crawling source code from the tutorial

Issue - State: open - Opened by dyn1721 over 5 years ago - 1 comment

#114 - Where can the distributed version be found?

Issue - State: open - Opened by xiaowenhuman over 5 years ago

#113 - How to ignore expired HTTPS certificates in version 2.73-alpha?

Issue - State: open - Opened by hj287678654 over 5 years ago - 2 comments

#112 - Bump c3p0 from 0.9.5.2 to 0.9.5.4

Pull Request - State: open - Opened by dependabot[bot] over 5 years ago
Labels: dependencies

#111 - How to resolve too many database connections inside the crawler?

Issue - State: open - Opened by linye271709915 almost 6 years ago

#110 - add unit tests for ContentExtractor

Pull Request - State: open - Opened by LordLRO almost 6 years ago

#109 - Can the log level for thrown exceptions be changed to warn or error?

Issue - State: open - Opened by xiejx618 almost 6 years ago

#108 - When extending BreadthCrawler, Chinese text from the page is output as garbled characters

Issue - State: open - Opened by linye271709915 almost 6 years ago - 2 comments

#107 - Add demo for selenium crawler with cookie

Pull Request - State: open - Opened by smallyunet almost 6 years ago - 3 comments

#106 - How to crawl data from front-end-rendered pages with webcollector?

Issue - State: open - Opened by qiuqiu0802 almost 6 years ago

#104 - The release package includes a log4j configuration file, which overrides users' own log4j configuration

Issue - State: closed - Opened by gaoxjin over 6 years ago - 3 comments

#103 - RocksDBException is always thrown after crawling for a while; cause unknown

Issue - State: open - Opened by tanwubo over 6 years ago - 2 comments

#102 - WebCollector discussion group

Issue - State: open - Opened by mdzz9527 over 6 years ago - 8 comments

#101 - Update README.md

Pull Request - State: open - Opened by x-otto-x almost 7 years ago

#100 - Update DemoCookieCrawler.java

Pull Request - State: closed - Opened by x-otto-x almost 7 years ago

#99 - Update README.md

Pull Request - State: open - Opened by x-otto-x almost 7 years ago

#98 - Update DemoCookieCrawler.java

Pull Request - State: closed - Opened by x-otto-x almost 7 years ago

#97 - Is the source code of the WebCollector-Hadoop version publicly available?

Issue - State: closed - Opened by coderf187 almost 7 years ago - 1 comment

#96 - Is there a related discussion group?

Issue - State: open - Opened by liushaofeng89 almost 7 years ago - 2 comments

#95 - OkHttp ConnectionPool and Okio Watchdog do not seem to be shut down properly

Issue - State: open - Opened by lewiswu1209 almost 7 years ago - 4 comments

#94 - Can the depth be set so that crawling continues as long as there are links?

Issue - State: closed - Opened by hxq201300 almost 7 years ago - 1 comment

#93 - Setting the UA has no effect in the new version

Issue - State: open - Opened by CNdarkmoon almost 7 years ago - 1 comment

#92 - Hello! LockTimeoutException

Issue - State: closed - Opened by simplecnst almost 7 years ago - 1 comment

#91 - How to tell when the crawler has finished

Issue - State: closed - Opened by djxhero almost 7 years ago - 1 comment

#90 - Redirects

Issue - State: closed - Opened by YYSpace almost 7 years ago - 4 comments

#89 - Hello, RamCrawler gives unstable results after adding about 70 seeds

Issue - State: closed - Opened by gaoda1234 almost 7 years ago - 3 comments

#88 - Can the stop method of the StrategyCrawler class stop the crawler immediately?

Issue - State: closed - Opened by BeQiang about 7 years ago - 1 comment

#87 - How to use this framework to crawl data from mobile apps?

Issue - State: closed - Opened by x-otto-x about 7 years ago - 1 comment

#86 - NewsCrawler.java from the official website's configuration tutorial reports an error

Issue - State: closed - Opened by MrKingHH about 7 years ago - 1 comment

#85 - Only some of the injected URLs are executed

Issue - State: closed - Opened by x-otto-x about 7 years ago - 4 comments

#84 - BerkeleyDBReader reading the Berkeley seed history file: seed information is missing and the execution count is one less than expected

Issue - State: closed - Opened by haixingmu about 7 years ago - 1 comment

#83 - Exception when updating db, java.lang.InterruptedException, org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.

Issue - State: closed - Opened by x-otto-x about 7 years ago - 3 comments

#82 - Bug with depth

Issue - State: closed - Opened by Aki1996 about 7 years ago - 1 comment

#81 - Question about IP proxies

Issue - State: closed - Opened by zhangzhengk about 7 years ago - 4 comments

#80 - Config.MAX_EXECUTE_COUNT is set, but seeds that failed due to timeout do not seem to be fetched again. Why?

Issue - State: closed - Opened by haixingmu over 7 years ago - 6 comments

#79 - A large depth may cause OOM

Pull Request - State: closed - Opened by carryxyh over 7 years ago - 1 comment

#78 - Excessive depth causes out-of-memory

Issue - State: closed - Opened by carryxyh over 7 years ago - 9 comments

#77 - 403 errors when crawling continuously. How to fix?

Issue - State: closed - Opened by df8305909 over 7 years ago - 1 comment

#76 - Is it possible to crawl the same URL repeatedly?

Issue - State: closed - Opened by weiyinfu over 7 years ago - 1 comment

#75 - Main content extraction problem

Issue - State: closed - Opened by Kaneki-x over 7 years ago - 1 comment

#74 - Running multiple crawlers at the same time

Issue - State: closed - Opened by ljc930611 over 7 years ago - 5 comments

#73 - Why can't image data be fetched with WebCollector 2.7.1 anymore?

Issue - State: closed - Opened by kongbb1 almost 8 years ago - 1 comment

#72 - The setUserAgent() method in HttpRequest has no effect in version 2.70

Issue - State: closed - Opened by yanzuo1992 almost 8 years ago - 6 comments

#71 - Remove unused variables & optimize code

Pull Request - State: closed - Opened by feifeiiiiiiiiiii almost 8 years ago

#70 - Can the number of threads be configured according to the type of seed?

Issue - State: closed - Opened by Janus-Xu almost 8 years ago - 1 comment

#69 - Source code comments in the package are garbled

Issue - State: closed - Opened by Janus-Xu almost 8 years ago - 1 comment

#68 - Is there a cluster solution and demo? Single-machine, single-thread crawl speed is unsatisfactory in testing

Issue - State: closed - Opened by Janus-Xu almost 8 years ago - 4 comments

#67 - Crawl all sub-pages and sub-sub-pages of a given URL

Issue - State: closed - Opened by yaoyuanyy almost 8 years ago - 3 comments

#66 - Enhance time extraction and add a gitignore file

Pull Request - State: closed - Opened by imalec-huang almost 8 years ago

#65 - Bug in CrawlDatumFormater.class

Issue - State: closed - Opened by ljc930611 about 8 years ago

#64 - Problem adding follow-up tasks to CrawlDatums next

Issue - State: closed - Opened by ljc930611 about 8 years ago - 5 comments

#63 - Missing jar packages

Issue - State: closed - Opened by ljc930611 about 8 years ago - 4 comments