grangier/python-goose issues and pull requests

#289 - docs: Fix a few typos

Pull Request - State: open - Opened by timgates42 about 3 years ago

#288 - Unable to execute the install script

Issue - State: closed - Opened by sudo-behappy about 3 years ago

#287 - Unable to use goose with Python 3

Issue - State: open - Opened by Ayokunle over 3 years ago - 1 comment

#286 - Create new_file.md

Pull Request - State: open - Opened by BrotherOrange about 5 years ago

#285 - Installation error

Issue - State: open - Opened by pol690 over 5 years ago - 2 comments

#284 - added support for HTTP and HTTPS proxy.

Pull Request - State: open - Opened by soundofsettling almost 6 years ago

#283 - Add support for HTTP and HTTPS proxies

Issue - State: open - Opened by soundofsettling almost 6 years ago

#282 - any paper or algorithm description about text extraction?

Issue - State: open - Opened by whqwill about 6 years ago

#281 - no result return and waiting

Issue - State: open - Opened by pigpeak about 6 years ago - 3 comments

#280 - what's python's version

Issue - State: open - Opened by charlotte-ling over 6 years ago - 1 comment

#279 - remove use-mirror as it is depriciated

Pull Request - State: closed - Opened by ravirnjn88 over 6 years ago

#278 - Goose is not extracting article whole text

Issue - State: open - Opened by AgoloAhmedElhady over 6 years ago

#277 - lots of temporary files in /tmp/goose

Issue - State: open - Opened by kingsaint almost 7 years ago - 1 comment

#276 - Japanease functionality

Issue - State: open - Opened by asafcombo almost 7 years ago

#275 - Not parsing following articles.

Issue - State: open - Opened by thekgt almost 7 years ago - 1 comment

#274 - Correct spelling mistakes.

Pull Request - State: open - Opened by EdwardBetts almost 7 years ago

#273 - PLEASE SUBMIT ISSUES TO GOOSE3

Issue - State: open - Opened by lababidi about 7 years ago

#272 - python-goose/goose/utils/encoding.py

Issue - State: open - Opened by marcelotournier about 7 years ago

#271 - ImportError: dynamic module does not define init function (init_imaging)

Issue - State: open - Opened by pratheepchowdhary over 7 years ago

#270 - Failed extraction from blogger post

Issue - State: open - Opened by piccolbo over 7 years ago - 12 comments

#269 - Fix #191: infinite recursion on some pages

Pull Request - State: closed - Opened by androm3da over 7 years ago - 1 comment

#268 - encoding error : input conversion failed due to input error, bytes 0xEC 0xD8 0xFD 0xFF

Issue - State: open - Opened by brookxs over 7 years ago

#267 - Allow custom search tags

Pull Request - State: closed - Opened by sproberts92 almost 8 years ago - 1 comment

#266 - [extractors/title.py] None value for `site_name` in line 40

Issue - State: open - Opened by kbandla almost 8 years ago

#265 - Not working with ABC News and The Hill articles

Issue - State: closed - Opened by sarakhedr almost 8 years ago

#264 - ModuleNotFoundError: No module named 'urlparse'

Issue - State: open - Opened by ghost almost 8 years ago - 2 comments

#263 - li tags in html not extracted

Issue - State: open - Opened by sparvind2000 about 8 years ago - 2 comments

#262 - Problems Parsing Titles

Issue - State: open - Opened by grantdelozier over 8 years ago - 1 comment

#261 - Travis Bugfix: No More `--use-mirrors` Option

Pull Request - State: open - Opened by mxamin over 8 years ago

#260 - Add Farsi (Persian) Language Support

Pull Request - State: open - Opened by mxamin over 8 years ago - 1 comment

#259 - #258 - Handle None from opengraph title

Pull Request - State: closed - Opened by aniruddha-adhikary over 8 years ago

#258 - title from opengraph can return None

Issue - State: open - Opened by aniruddha-adhikary over 8 years ago - 1 comment

#257 - Update content.py

Pull Request - State: open - Opened by abhigenie92 over 8 years ago

#256 - Link text is not included in cleaned text

Issue - State: open - Opened by smilledge over 8 years ago

#255 - Install should simply be 'pip install goose-extractor'?

Issue - State: open - Opened by andybak almost 9 years ago - 2 comments

#254 - EXSLT link seems to have changed

Pull Request - State: closed - Opened by andreis almost 9 years ago

#253 - extracting image from the content in my db

Issue - State: open - Opened by lip365 almost 9 years ago

#252 - Goose fails in extracting articles from The New York Times

Issue - State: closed - Opened by manalsali about 9 years ago - 5 comments

#251 - Title of project says "scrapping" but it's "scraping"

Issue - State: open - Opened by doda-zz about 9 years ago

#250 - og:image is not parsed correct if e.g. og:image:width exists on page

Issue - State: open - Opened by vonholst about 9 years ago - 1 comment

#249 - Bug: Infinite crawling recursion on some pages

Issue - State: open - Opened by simonwjackson about 9 years ago - 1 comment

#248 - Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features

Pull Request - State: open - Opened by Lol4t0 about 9 years ago - 4 comments

#247 - Fallback to 'http' as default url schema if needed

Pull Request - State: open - Opened by rastasheep about 9 years ago

#246 - Added Serbian stopwords

Pull Request - State: open - Opened by rastasheep about 9 years ago

#245 - Goose is not working on extracting data from Kissmetrics blog which have some meta tags present.

Issue - State: open - Opened by jijoy about 9 years ago - 1 comment

#244 - Handle gzipped pages gracefully

Pull Request - State: closed - Opened by daTokenizer over 9 years ago

#243 - fix(stopwords-id.txt): changed to Lucene stopwords

Pull Request - State: open - Opened by luthfianto over 9 years ago

#242 - h1,h2...h6 not returned

Issue - State: open - Opened by tamimibrahim over 9 years ago - 1 comment

#241 - Incompatible library version: _imaging.so requires version 13.0.0 or later, but libjpeg.8.dylib provides version 12.0.0

Issue - State: open - Opened by dbl001 over 9 years ago - 1 comment

#240 - Non-obvious failure grabbing top_image

Issue - State: open - Opened by Slater-Victoroff over 9 years ago - 2 comments

#239 - Not working on some urls

Pull Request - State: open - Opened by abhigenie92 over 9 years ago

#238 - HtmlFetcher does not handle gzip compression

Issue - State: open - Opened by kqr over 9 years ago - 2 comments

#237 - add gzip deflation to HtmlFetcher

Pull Request - State: open - Opened by kqr over 9 years ago - 1 comment

#236 - Forbes.com text extraction gives redundant date in some cases

Issue - State: open - Opened by ethan-hunt-007 over 9 years ago - 1 comment

#235 - Published_Date extraction

Issue - State: open - Opened by kmehl over 9 years ago

#234 - Can't extract content from huffington post (?)

Issue - State: open - Opened by jice-lavocat over 9 years ago - 2 comments

#233 - Why can't Goose extract these Chinese articles?

Issue - State: closed - Opened by motasay over 9 years ago - 8 comments

#232 - cleaned_text doesn't work everytime for the same website

Issue - State: closed - Opened by kmehl over 9 years ago - 1 comment

#231 - Read article content using goose retrieving nothing

Issue - State: open - Opened by abhigenie92 over 9 years ago - 2 comments

#230 - IOError

Issue - State: open - Opened by abhigenie92 over 9 years ago

#229 - NY Times doesn't work

Pull Request - State: open - Opened by abhigenie92 over 9 years ago - 1 comment

#228 - Hotfix for #219 - Missing real fix

Pull Request - State: open - Opened by jice-lavocat over 9 years ago

#227 - Can not get the image from a Chinese page even the text

Issue - State: open - Opened by SheldonWang3000 over 9 years ago - 2 comments

#226 - Can not install on mac

Issue - State: open - Opened by 1a1a11a over 9 years ago - 3 comments

#225 - fixing new york times content extraction failure

Pull Request - State: open - Opened by robmcdan over 9 years ago - 1 comment

#224 - Goose fails on nytimes articles

Issue - State: open - Opened by lsemel over 9 years ago - 2 comments

#223 - Russian articles are not extracted

Issue - State: open - Opened by szhem over 9 years ago

#222 - Turkish stopwords added

Pull Request - State: open - Opened by ufukk over 9 years ago

#221 - top_node algorithm? (test case included)

Issue - State: open - Opened by ThiemNguyen almost 10 years ago - 2 comments

#220 - Add python 3 support

Pull Request - State: open - Opened by vetal4444 almost 10 years ago - 11 comments

#219 - Link without domain.

Issue - State: open - Opened by warmspringwinds almost 10 years ago - 2 comments

#218 - Dateline in articles

Issue - State: open - Opened by cvelascorivera almost 10 years ago

#217 - Og site_name issue

Issue - State: open - Opened by grangier almost 10 years ago

#216 - Getting a No Such File or Directory error

Issue - State: open - Opened by lsemel almost 10 years ago - 1 comment

#215 - Algorithm used in goose ?

Issue - State: open - Opened by IndianShifu almost 10 years ago - 2 comments

#214 - Type fix: Issue #204

Pull Request - State: closed - Opened by amalfra almost 10 years ago

#213 - No Text Extracted for articles from domain http://www.clarin.com

Issue - State: open - Opened by sathappanspm almost 10 years ago - 1 comment

#212 - Clarification on how raw_html gets extracted

Issue - State: open - Opened by konradkonrad almost 10 years ago

#211 - Indonesian stopwords file contains too many other words than stopwords

Issue - State: open - Opened by luthfianto almost 10 years ago

#210 - Not getting any extracted text

Issue - State: open - Opened by peterswang almost 10 years ago - 1 comment

#209 - Tidy README.rst

Pull Request - State: closed - Opened by StevenMaude almost 10 years ago

#208 - meta charset options support

Issue - State: closed - Opened by kmmbvnr almost 10 years ago - 1 comment

#207 - More efficient title extraction and infinite recursion bug fix

Pull Request - State: open - Opened by slitayem almost 10 years ago

#206 - Maximum recursion depth exceeded

Issue - State: open - Opened by slitayem almost 10 years ago

#205 - More efficient title extraction and bugs fix

Pull Request - State: closed - Opened by slitayem almost 10 years ago

#204 - Spelling Error in documentation

Issue - State: open - Opened by ghost almost 10 years ago

#203 - Fix bug with site_name=None

Pull Request - State: closed - Opened by yprez almost 10 years ago - 3 comments

#202 - provide a facility to get all text in a webpage

Issue - State: open - Opened by aqp almost 10 years ago

#197 - Fix title extraction if title is same as site_name

Pull Request - State: open - Opened by vetal4444 about 10 years ago - 1 comment

#195 - Fix title cleaning

Pull Request - State: closed - Opened by slitayem about 10 years ago - 2 comments

#194 - Error in title extractor

Issue - State: open - Opened by nargiza-sarkulova about 10 years ago - 6 comments

#155 - Could not extract a Chinese html

Issue - State: closed - Opened by yetuweiba about 10 years ago - 3 comments

#148 - Goose is non-functional in Python 3

Issue - State: open - Opened by fake-name over 10 years ago - 13 comments

#138 - Timeout

Issue - State: closed - Opened by harikt over 10 years ago - 2 comments

#78 - WindowsError: [Error 32] The process cannot access the file because it is being used by another process

Issue - State: closed - Opened by idf almost 11 years ago - 17 comments

#64 - adding cookies support

Pull Request - State: closed - Opened by tgallant about 11 years ago - 2 comments
