• polakkenak@feddit.dk
    2 days ago

    No, absolutely not. It is safe to assume that most, if not all, open source code (and plenty that isn’t open source) has been part of the training data. You need look no further than the fact that some models can recite Harry Potter from memory. There is no such thing as a “clean room” for AI.

    • Captain Beyond@linkage.ds8.zone
      18 hours ago

      Ironically, though, this makes the reverse a bit more defensible (i.e. using an LLM to reverse engineer a proprietary app), because that proprietary app’s source code is far less likely to be in any publicly scraped training data.

      But I imagine the corpos aren’t going to look fondly on that for obvious reasons.

    • StellarExtract@lemmy.zip
      2 days ago

      This really isn’t true, though, even if it currently holds in many cases. Case in point: if I wrote something and published it right now, it wouldn’t be part of any AI model yet. A party with a lot of money (say, a tech corporation) could easily create a bespoke coding model trained on everything except the desired libraries, thus achieving a “clean room”.

      • polakkenak@feddit.dk
        2 days ago

        In theory: Yes, future works are not yet part of the training data.

        In practice: It takes months or years for an open source project (or any new technology) to take off and be considered valuable, by which point it has most likely already been scraped into someone’s training data.

        The other argument relies on said tech organization doing the right thing and spending the resources to train its own model from scratch (years of work and $100+ million), instead of just folding the cost of an eventual lawsuit and fine into its cost/benefit analysis. I’m not aware of any tech organization (with the means) that actually does this.

        • StellarExtract@lemmy.zip
          18 hours ago

          Again, while this may be largely true today, it ignores how the technology will evolve. Models are only going to get cheaper to produce. Even if training from scratch is prohibitively expensive now (and I’m not convinced that’s universally the case), it won’t be in the future, since hardware advances are all but guaranteed to drive training costs down dramatically in the coming years. Burying our heads in the sand now isn’t going to help anything.