mediacloud/metadata-lib issues and pull requests

#87 - MC metadata extraction investigation

Issue - State: closed - Opened by pgulley 2 months ago

#86 - Assess tweaks to content extraction to remove headlines at end of article

Issue - State: open - Opened by rahulbot 5 months ago - 2 comments
Labels: enhancement

#85 - Update htmldate requirement from ==1.7.* to >=1.7,<1.9

Pull Request - State: closed - Opened by dependabot[bot] 6 months ago - 1 comment
Labels: dependencies

#84 - Update trafilatura requirement from <1.7,>=1.4 to >=1.4,<1.9

Pull Request - State: closed - Opened by dependabot[bot] 6 months ago - 1 comment
Labels: dependencies

#83 - Further tweaking of User-Agent string?

Issue - State: closed - Opened by philbudne 7 months ago - 3 comments
Labels: question

#82 - central storage for User-Agent to use across MC projects

Pull Request - State: closed - Opened by rahulbot 8 months ago - 1 comment

#81 - store MC user-agent for use by our other libraries

Issue - State: closed - Opened by rahulbot 8 months ago
Labels: enhancement

#80 - Not capturing full article text

Issue - State: closed - Opened by jaypinho 8 months ago - 1 comment
Labels: wontfix

#79 - Update trafilatura requirement from <1.7,>=1.4 to >=1.4,<1.8

Pull Request - State: closed - Opened by dependabot[bot] 8 months ago - 1 comment
Labels: dependencies

#78 - Get automated release working

Issue - State: closed - Opened by rahulbot 8 months ago - 2 comments

#77 - ignore ports & handle IP domains in `normalize_url`

Pull Request - State: closed - Opened by rahulbot 8 months ago

#76 - Update requirements

Pull Request - State: closed - Opened by rahulbot 8 months ago - 1 comment

#75 - Update htmldate requirement from ==1.6.* to >=1.6,<1.8

Pull Request - State: closed - Opened by dependabot[bot] 8 months ago - 2 comments
Labels: dependencies

#74 - Fix title parsing failure (due to empty or whitespace title tag)

Pull Request - State: closed - Opened by rahulbot 9 months ago - 1 comment

#73 - mcmetadata.extract throwing AttributeErrors

Issue - State: closed - Opened by philbudne 9 months ago - 3 comments
Labels: bug

#72 - possible url normalization issues

Issue - State: open - Opened by philbudne 9 months ago - 1 comment
Labels: bug, question

#71 - Update static test fixtures

Pull Request - State: closed - Opened by rahulbot 10 months ago

#70 - centralize url unique hash generation with helper method in this package

Pull Request - State: closed - Opened by rahulbot 10 months ago - 1 comment

#69 - improve CI test run reliability by using cached fixtures?

Issue - State: closed - Opened by rahulbot 10 months ago
Labels: enhancement, question

#68 - allow capturing stats from individual extract calls

Pull Request - State: closed - Opened by rahulbot 10 months ago
Labels: enhancement

#67 - May want to remove story source related query parameters!

Issue - State: closed - Opened by philbudne 10 months ago - 1 comment

#66 - update requirements file to latest

Pull Request - State: closed - Opened by rahulbot 10 months ago

#65 - Small tweaks to handle whitespace in URLs

Pull Request - State: closed - Opened by rahulbot 10 months ago

#64 - Support defaults and overrides in `extract`

Pull Request - State: closed - Opened by rahulbot 10 months ago

#63 - support passing in a fallback publication date

Issue - State: closed - Opened by rahulbot 10 months ago - 2 comments
Labels: enhancement

#62 - Update htmldate requirement from ==1.5.* to >=1.5,<1.7

Pull Request - State: closed - Opened by dependabot[bot] 10 months ago - 2 comments
Labels: dependencies

#61 - Discuss possible enhancements to mcmetadata.extract

Issue - State: closed - Opened by philbudne 10 months ago - 2 comments
Labels: enhancement

#60 - Update dateparser requirement from ==1.1.* to >=1.1,<1.3

Pull Request - State: closed - Opened by dependabot[bot] 11 months ago - 2 comments
Labels: dependencies

#59 - Update tldextract requirement from ==3.6.* to >=3.6,<5.2

Pull Request - State: closed - Opened by dependabot[bot] 11 months ago - 2 comments
Labels: dependencies

#58 - Handling of URL parse failure

Issue - State: closed - Opened by philbudne 12 months ago
Labels: bug

#57 - Update tldextract requirement from ==3.6.* to >=3.6,<5.1

Pull Request - State: closed - Opened by dependabot[bot] 12 months ago - 1 comment
Labels: dependencies

#56 - Update tldextract requirement from ==3.4.* to >=3.4,<3.7

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 1 comment
Labels: dependencies

#55 - Update tldextract requirement from ==3.4.* to >=3.4,<3.6

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 1 comment
Labels: dependencies

#54 - Update htmldate requirement from ==1.4.* to >=1.4,<1.6

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 1 comment
Labels: dependencies

#53 - Switched from cchardet to faust-chardet, as the former is unmaintained…

Pull Request - State: closed - Opened by pgulley over 1 year ago

#52 - mcmetadata not type checked by mypy

Issue - State: closed - Opened by philbudne over 1 year ago - 2 comments

#51 - Update trafilatura requirement from ==1.4.* to >=1.4,<1.7

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago
Labels: dependencies

#50 - update to latest version of trafilatura

Issue - State: closed - Opened by rahulbot over 1 year ago - 1 comment
Labels: enhancement, dependencies

#49 - Update trafilatura requirement from ==1.4.* to >=1.4,<1.6

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#48 - Update beautifulsoup4 requirement from ==4.11.* to >=4.11,<4.13

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago
Labels: dependencies

#47 - fix bugs from PT integration

Pull Request - State: closed - Opened by rahulbot over 1 year ago

#46 - addressing no nk error

Pull Request - State: closed - Opened by pgulley over 1 year ago - 1 comment

#45 - Crash because uri.query.params['nk'] can be None

Issue - State: closed - Opened by vbanos over 1 year ago - 2 comments
Labels: bug

#44 - Feature feed normalization

Pull Request - State: closed - Opened by rahulbot almost 2 years ago

#43 - Add feed_url.py

Pull Request - State: closed - Opened by philbudne almost 2 years ago

#42 - handle IP addresses better

Issue - State: closed - Opened by rahulbot almost 2 years ago - 1 comment
Labels: bug

#41 - Add a check to avoid TypeError

Issue - State: closed - Opened by vbanos almost 2 years ago - 1 comment

#40 - Update htmldate requirement from ==1.3.* to >=1.3,<1.5

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago
Labels: dependencies

#39 - Update trafilatura requirement from ==1.3.* to >=1.3,<1.5

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago
Labels: dependencies

#38 - Update tldextract requirement from ==3.3.* to >=3.3,<3.5

Pull Request - State: closed - Opened by dependabot[bot] almost 2 years ago
Labels: dependencies

#37 - assess fasttext for language guessing speedup

Issue - State: closed - Opened by rahulbot almost 2 years ago - 1 comment
Labels: enhancement

#36 - upgrade dependencies

Issue - State: closed - Opened by rahulbot almost 2 years ago - 3 comments
Labels: enhancement

#35 - Fallback extractor

Pull Request - State: closed - Opened by pgulley about 2 years ago

#34 - handle empty content with no-encoding from HTML

Issue - State: closed - Opened by rahulbot about 2 years ago
Labels: bug

#33 - Unexpected AttributeError on extract

Issue - State: closed - Opened by vbanos about 2 years ago - 1 comment
Labels: bug

#32 - Improvement regarding content decoding/encoding

Issue - State: open - Opened by vbanos about 2 years ago
Labels: enhancement

#31 - Bug in extract method

Issue - State: closed - Opened by vbanos about 2 years ago - 1 comment
Labels: bug

#30 - Use latest htmldate and pass datetime max_date instead of string

Pull Request - State: closed - Opened by vbanos about 2 years ago

#29 - add in top image and other metadata

Pull Request - State: closed - Opened by rahulbot about 2 years ago - 2 comments

#28 - More efficient parameterized unit tests

Pull Request - State: closed - Opened by vbanos about 2 years ago - 1 comment

#27 - optimization on tag removal in readability-lxml extraction fallback

Issue - State: closed - Opened by rahulbot about 2 years ago
Labels: enhancement

#26 - improve trafilatura defaults

Issue - State: closed - Opened by rahulbot about 2 years ago
Labels: enhancement

#25 - create larger test set to compare results to main system data

Issue - State: closed - Opened by rahulbot about 2 years ago - 1 comment
Labels: enhancement

#24 - don't lowercase YouTube URLs for uniqueness hashing

Issue - State: closed - Opened by rahulbot about 2 years ago
Labels: bug

#23 - limit dates in future?

Issue - State: closed - Opened by rahulbot about 2 years ago - 2 comments
Labels: enhancement

#22 - Masking very frequent date parsing exceptions

Issue - State: closed - Opened by vbanos over 2 years ago - 1 comment

#21 - Unhandled exception we got in production

Issue - State: closed - Opened by vbanos over 2 years ago - 4 comments
Labels: bug

#20 - centralize dependencies in one place

Issue - State: closed - Opened by rahulbot over 2 years ago

#19 - You could also compile these regex in this method.

Issue - State: closed - Opened by vbanos over 2 years ago

#18 - Use set instead of list for improved performance

Issue - State: closed - Opened by vbanos over 2 years ago

#17 - You could compile this regex for better performance

Issue - State: closed - Opened by vbanos over 2 years ago

#16 - Use Beautifulsoup4 with lxml parser for faster performance

Issue - State: closed - Opened by vbanos over 2 years ago

#15 - Add cchardet dependency to speedup BeautifulSoup4

Issue - State: closed - Opened by vbanos over 2 years ago

#14 - investigate URLs failing extraction

Issue - State: closed - Opened by rahulbot over 2 years ago - 2 comments
Labels: wontfix

#13 - justify content extractor priorities with data and testing

Issue - State: closed - Opened by rahulbot over 2 years ago - 3 comments

#12 - Feature quick improvements

Pull Request - State: closed - Opened by rahulbot over 2 years ago

#11 - Stats for the success / failure of each extractor

Issue - State: closed - Opened by vbanos over 2 years ago

#10 - Improve exception handling

Issue - State: closed - Opened by vbanos over 2 years ago - 1 comment

#9 - Compile regular expressions to improve performance

Issue - State: closed - Opened by vbanos over 2 years ago

#8 - rename core branch from master to main

Issue - State: closed - Opened by rahulbot over 2 years ago - 1 comment

#7 - Prep for release to PyPI

Issue - State: closed - Opened by rahulbot over 2 years ago - 2 comments

#6 - Extract authors information when possible

Issue - State: closed - Opened by ibnesayeed over 2 years ago - 3 comments
Labels: enhancement

#5 - Building and installing cld2-cffi is failing

Issue - State: closed - Opened by ibnesayeed over 2 years ago - 2 comments

#4 - Extracting original domain from archived pages

Issue - State: closed - Opened by ibnesayeed over 2 years ago - 1 comment

#3 - Exception on non-news article pages

Issue - State: closed - Opened by ibnesayeed over 2 years ago

#2 - switch language detection for now

Pull Request - State: closed - Opened by rahulbot over 2 years ago

#1 - first pass at quickly integrating existing code

Pull Request - State: closed - Opened by rahulbot over 2 years ago - 1 comment
