[freamon] Currently working on: cross-posts

submitted by Andrew

I've been thinking about what to do about cross-posts (e.g. where the same link is uploaded to both fediverse@lemmy.world and fediverse@lemmy.ml).

In terms of them being annoying, I don't yet know what to do about that.

My progress so far, and what it requires:
The Community table has an extra field (xp_indicator) that names which Post field determines whether something is a cross-post. It defaults to URL, but it could be the title for communities like AskLemmy.
The Post table has an extra field (cross_posts), which is an array of other post ids. (Note: this would lock PieFed into using PostgreSQL.)
New posts, both local and ActivityPub, are checked to see if they are a cross-post, and the relevant posts are updated. This also happens for local edits and AP Update. In the DB, the posts in the screenshot look like:

-[ RECORD 1 ]----------------------------------------------------------
id          | 27
title       | Springtime Ministrone
url         | https://www.bbcgoodfood.com/recipes/springtime-minestrone
cross_posts | {28,29,30}
-[ RECORD 2 ]----------------------------------------------------------
id          | 28
title       | Springtime Ministrone
url         | https://www.bbcgoodfood.com/recipes/springtime-minestrone
cross_posts | {27,29,30}
-[ RECORD 3 ]----------------------------------------------------------
id          | 29
title       | Springtime Ministrone
url         | https://www.bbcgoodfood.com/recipes/springtime-minestrone
cross_posts | {27,28,30}
-[ RECORD 4 ]----------------------------------------------------------
id          | 30
title       | Springtime Ministrone
url         | https://www.bbcgoodfood.com/recipes/springtime-minestrone
cross_posts | {27,28,29}
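
As a rough illustration of the check described above, here is a minimal in-memory sketch that produces the same arrays as those records. The Post dataclass, the link_cross_posts helper and the plain dict standing in for the Post table are all hypothetical; the real code works on database rows and also has to handle edits.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    id: int
    url: str
    cross_posts: list[int] = field(default_factory=list)


def link_cross_posts(new_post: Post, posts: dict[int, Post]) -> None:
    """Link new_post with every stored post sharing its URL, in both directions."""
    matches = [p for p in posts.values() if p.url == new_post.url and p.id != new_post.id]
    for other in matches:
        # Each side records the other's id, skipping ids that are already present.
        if new_post.id not in other.cross_posts:
            other.cross_posts.append(new_post.id)
        if other.id not in new_post.cross_posts:
            new_post.cross_posts.append(other.id)
    posts[new_post.id] = new_post


# Hypothetical stand-in for the Post table, filled in arrival order.
posts: dict[int, Post] = {}
url = "https://www.bbcgoodfood.com/recipes/springtime-minestrone"
for post_id in (27, 28, 29, 30):
    link_cross_posts(Post(id=post_id, url=url), posts)

for post in posts.values():
    print(post.id, post.cross_posts)
```

Every new cross-post touches all of its siblings, so the write cost grows with the number of existing copies of the link.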

In the UI, posts with cross-posts get an extra icon, which when clicked brings you to another screen (similar to 'other discussions' on Reddit).

In terms of hiding duplicate posts from the feed, I don't yet know. If it was up to the back-end, it would require some extra DB activity that might be unacceptably slow. This update would mean, though, that a future API could provide a response similar to Lemmy's for posts, so apps/frontends could merge duplicates the same way some of them do for Lemmy. Likewise, if there was a 'Hide posts marked as read' feature, it could treat any post ids in the cross_posts field as also being read.
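
To make the 'Hide posts marked as read' idea concrete, a minimal sketch, assuming the feed is just a list of post ids and the cross_posts arrays are available as a dict. The function names are hypothetical, and no such feature exists yet.

```python
def expand_read_ids(read_ids, cross_posts_by_id):
    """Return read_ids plus every id listed as a cross-post of a read post."""
    expanded = set(read_ids)
    for post_id in read_ids:
        expanded.update(cross_posts_by_id.get(post_id, []))
    return expanded


def hide_read(feed_ids, read_ids, cross_posts_by_id):
    """Drop posts that are read, or that are cross-posts of a read post."""
    hidden = expand_read_ids(read_ids, cross_posts_by_id)
    return [post_id for post_id in feed_ids if post_id not in hidden]


# Same cross_posts arrays as the four records in the post; 31 and 32 are unrelated posts.
cross_posts_by_id = {
    27: [28, 29, 30],
    28: [27, 29, 30],
    29: [27, 28, 30],
    30: [27, 28, 29],
}
visible = hide_read([27, 28, 31, 29, 30, 32], read_ids={28}, cross_posts_by_id=cross_posts_by_id)
print(visible)
```

Reading post 28 hides all four copies, and only the unrelated posts stay in the feed.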

I have to wait a few days until the quota on my ngrok account resets (something in the Fediverse went crazy, I'd guess), so I thought I'd share here in the meantime. Also, it means the PR doesn't come out of the blue, and it can be discussed beforehand.

(also: it turns out I can't spell 'minestrone')

4 Comments

Rimu, edited

I'm glad you posted here first; a PR of this impact deserves some discussion. Also, anything involving non-trivial database changes can be quite difficult to reverse once live data gets involved, so we need to be a bit more careful.

Community table

URLs are guaranteed to be unique, whereas titles of posts are typed by people, so we run a decent risk of falsely detecting a cross-post. Also, a title like "Why are you interested in this?" means something different depending on the community it is posted in. We could work around this by limiting the search space for title-based detection to posts within the last few days, and only for titles that are fairly long - to increase the chance of being unique?

Also I wonder about the potential for abuse & trolling.

Actually, even for URLs, perhaps we need to only check for duplicates within the last few days. When someone links to the home page of a site it can be for a variety of different reasons, but if the posts are recent then they're probably for the same reason.
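
A minimal sketch of that recency limit, assuming existing posts are available as (id, url, created) tuples. The three-day window is an assumed value, since "the last few days" isn't pinned down here, and recent_url_matches is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(days=3)  # assumed value for "the last few days"


def recent_url_matches(new_url, posted_at, existing):
    """Return ids of existing posts with the same URL created within RECENT_WINDOW of posted_at."""
    cutoff = posted_at - RECENT_WINDOW
    return [
        post_id
        for post_id, url, created in existing
        if url == new_url and created >= cutoff
    ]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
existing = [
    (1, "https://example.com/", now - timedelta(days=30)),  # same URL, but too old
    (2, "https://example.com/", now - timedelta(days=1)),   # same URL, recent
    (3, "https://example.org/", now - timedelta(hours=2)),  # recent, different URL
]
matches = recent_url_matches("https://example.com/", now, existing)
print(matches)
```

Only post 2 counts as a cross-post; the month-old link to the same home page is ignored.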

If we only use the URL then we don't need xp_indicator.

Posts

I did not know PostgreSQL could do arrays, that's very interesting.

I'm not concerned about being locked in to PostgreSQL, as I'm making zero effort to test PieFed on other database systems, so we are probably already locked in, accidentally. I know the full text search package requires PostgreSQL, for example.

However, while I can see the appeal of array fields, I'd really prefer we use a normal DB table for the cross_posts data. It seems a lot easier to query and do joins on? I'd tend to use array fields for storing lists of data rather than IDs which act as foreign keys. https://stackoverflow.com/questions/58943211/am-i-breaking-2nf-rule-for-using-array-data-type-in-postgressql
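
For comparison, a rough sketch of the separate-table approach, using SQLite from the Python standard library as a stand-in for PostgreSQL. The cross_post table and its column names are hypothetical, not part of PieFed's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT, url TEXT);
    -- One row per (post, other post) pair, stored in both directions.
    CREATE TABLE cross_post (
        post_id INTEGER NOT NULL REFERENCES post(id),
        other_post_id INTEGER NOT NULL REFERENCES post(id),
        PRIMARY KEY (post_id, other_post_id)
    );
""")

url = "https://www.bbcgoodfood.com/recipes/springtime-minestrone"
ids = (27, 28, 29, 30)
conn.executemany(
    "INSERT INTO post VALUES (?, ?, ?)",
    [(i, "Springtime Ministrone", url) for i in ids],
)
conn.executemany(
    "INSERT INTO cross_post VALUES (?, ?)",
    [(a, b) for a in ids for b in ids if a != b],
)

# Fetch the cross-posts of post 27 with an ordinary join.
rows = conn.execute(
    """
    SELECT p.id
    FROM cross_post AS cp
    JOIN post AS p ON p.id = cp.other_post_id
    WHERE cp.post_id = ?
    ORDER BY p.id
    """,
    (27,),
).fetchall()
other_ids = [row[0] for row in rows]
print(other_ids)
```

The four posts need twelve link rows here, against four array values in the array version, but the foreign keys let the database reject ids that don't exist.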

Andrew [OP]

Oh, okay. I was only thinking of using 'title' for very few communities, like AskLemmy or ShowerThoughts, but I see how it could produce false positives even for those (I may also have been misled by the recent Issue into thinking title-based cross-posts happen more often than they do).

Speaking of that Issue, maybe the search for URL-based cross-posts could also happen in Redis - would be quicker, and would only be for recent stuff (depending on the expiry for how recent, of course).
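
Roughly what I have in mind, sketched with a plain dict instead of a real Redis client so it runs on its own. RecentUrlIndex is a hypothetical helper and the three-day expiry is an assumed value; in Redis the expiry would be handled by a TTL on the key.

```python
import time


class RecentUrlIndex:
    """In-memory stand-in for a cache with per-key expiry (hypothetical helper)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # url -> (expires_at, [post ids])

    def add(self, url, post_id, now=None):
        """Record post_id under url; return the unexpired ids already seen for that url."""
        now = time.time() if now is None else now
        expires_at, ids = self._entries.get(url, (0.0, []))
        if expires_at <= now:
            ids = []  # entry expired or never existed, so start afresh
        earlier = list(ids)
        # Each new post with the same URL refreshes the expiry.
        self._entries[url] = (now + self.ttl_seconds, ids + [post_id])
        return earlier


index = RecentUrlIndex(ttl_seconds=3 * 24 * 3600)  # assumed three-day expiry
first = index.add("https://example.com/a", 27, now=0)
second = index.add("https://example.com/a", 28, now=3600)
late = index.add("https://example.com/a", 29, now=3600 + 4 * 24 * 3600)
print(first, second, late)
```

The second post finds 27 as a cross-post, while the one arriving four days later finds nothing because the entry has expired.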

Anyway, I'll share here how I eventually got DB arrays to work, in case anyone considers it for anything else:

from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.mutable import MutableList
...
cross_posts = db.Column(MutableList.as_mutable(ARRAY(db.Integer)))

(they need to be mutable because, otherwise, SQLAlchemy doesn't notice in-place changes such as append, so the DB won't update when they're added to)

Fetching them is this code (called when the 'layers' icon is clicked):

@bp.route('/post/<int:post_id>/cross_posts', methods=['GET'])
def post_cross_posts(post_id: int):
    # Return a 404 if the post doesn't exist
    post = Post.query.get_or_404(post_id)
    # cross_posts can be NULL for posts with no cross-posts, so fall back to an empty list
    cross_posts = Post.query.filter(Post.id.in_(post.cross_posts or [])).all()
    return render_template('post/post_cross_posts.html', post=post, cross_posts=cross_posts)

This isn't as bad as that Stack Overflow post, because it's not Joining those values with another table. The values in the array are sort-of self-references, rather than foreign keys, I think, so I assumed it'd be quicker than using another table (which would then refer back to the Post table again)

Rimu

Oh, well, if we can use Post.id.in_(), that's quite elegant! That goes a long way to mollifying my concerns. Let's do it!

Andrew [OP]

Okay. I'll nix the xp_indicator idea (which'll also make the code clearer), and keep plodding on.